Information Geometry

Edited by
Geert Verdoolaege

Printed Edition of the Special Issue Published in Entropy

www.mdpi.com/journal/entropy



Special Issue Editor

Geert Verdoolaege

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade


Special Issue Editor
Geert Verdoolaege
Ghent University
Belgium

Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal Entropy
(ISSN 1099-4300) in 2014 (available at:
https://www.mdpi.com/journal/entropy/special_issues/information-geometry)

For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number,
Page Range.

ISBN 978-3-03897-632-5 (Pbk)

ISBN 978-3-03897-633-2 (PDF)

Cover image courtesy of Geert Verdoolaege.

© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license, which allows users to download, copy and build upon
published articles, as long as the author and publisher are properly credited, which ensures maximum
dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
license CC BY-NC-ND.


Contents

About the Special Issue Editor

Preface to “Information Geometry”

Shun-ichi Amari
Information Geometry of Positive Measures and Positive-Definite Matrices:
Decomposable Dually Flat Structure
Reprinted from: Entropy 2014, 16, 2131–2145, doi:10.3390/e16042131

Harsha K. V. and Subrahamanian Moosath K. S.
F-Geometry and Amari’s α-Geometry on a Statistical Manifold
Reprinted from: Entropy 2014, 16, 2472–2487, doi:10.3390/e16052472

Frank Critchley and Paul Marriott
Computational Information Geometry in Statistics: Theory and Practice
Reprinted from: Entropy 2014, 16, 2454–2471, doi:10.3390/e16052454

Paul Vos and Karim Anaya-Izquierdo
Using Geometry to Select One Dimensional Exponential Families That Are Monotone
Likelihood Ratio in the Sample Space, Are Weakly Unimodal and Can Be Parametrized by a
Measure of Central Tendency
Reprinted from: Entropy 2014, 16, 4088–4100, doi:10.3390/e16074088

Guido Montúfar, Johannes Rauh and Nihat Ay
On the Fisher Metric of Conditional Probability Polytopes
Reprinted from: Entropy 2014, 16, 3207–3233, doi:10.3390/e16063207

André Klein
Matrix Algebraic Properties of the Fisher Information Matrix of Stationary Processes
Reprinted from: Entropy 2014, 16, 2023–2055, doi:10.3390/e16042023

Keisuke Yano and Fumiyasu Komaki
Asymptotically Constant-Risk Predictive Densities When the Distributions of Data and Target
Variables Are Different
Reprinted from: Entropy 2014, 16, 3026–3048, doi:10.3390/e16063026

Samuel Livingstone and Mark Girolami
Information-Geometric Markov Chain Monte Carlo Methods Using Diffusions
Reprinted from: Entropy 2014, 16, 3074–3102, doi:10.3390/e16063074

Hui Zhao and Paul Marriott
Variational Bayes for Regime-Switching Log-Normal Models
Reprinted from: Entropy 2014, 16, 3832–3847, doi:10.3390/e16073832

Frank Nielsen, Richard Nock and Shun-ichi Amari
On Clustering Histograms with k-Means by Using Mixed α-Divergences
Reprinted from: Entropy 2014, 16, 3273–3301, doi:10.3390/e16063273

Salem Said, Lionel Bombrun and Yannick Berthoumieu
New Riemannian Priors on the Univariate Normal Model
Reprinted from: Entropy 2014, 16, 4015–4031, doi:10.3390/e16074015

Luigi Malagò and Giovanni Pistone
Combinatorial Optimization with Information Geometry: The Newton Method
Reprinted from: Entropy 2014, 16, 4260–4289, doi:10.3390/e16084260

Domenico Felice, Carlo Cafaro and Stefano Mancini
Information Geometric Complexity of a Trivariate Gaussian Statistical Model
Reprinted from: Entropy 2014, 16, 2944–2958, doi:10.3390/e16062944

Alexandre Levada
Learning from Complex Systems: On the Roles of Entropy and Fisher Information in Pairwise
Isotropic Gaussian Markov Random Fields
Reprinted from: Entropy 2014, 16, 1002–1036, doi:10.3390/e16021002

Masatoshi Funabashi
Network Decomposition and Complexity Measures: An Information Geometrical Approach
Reprinted from: Entropy 2014, 16, 4132–4167, doi:10.3390/e16074132

Roger Balian
The Entropy-Based Quantum Metric
Reprinted from: Entropy 2014, 16, 3878–3888, doi:10.3390/e16073878

Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li
Extending the Extreme Physical Information to Universal Cognitive Models via a Confident
Information First Principle
Reprinted from: Entropy 2014, 16, 3670–3688, doi:10.3390/e16073670


About the Special Issue Editor

Geert Verdoolaege obtained an M.Sc. degree in Theoretical Physics in 1999 and the Ph.D. in
Engineering Physics in 2006, both at Ghent University (UGent, Belgium). His Ph.D. work concerned
applications of Bayesian probability theory to plasma spectroscopy in fusion devices. He was a
postdoctoral researcher in the field of computer vision at the University of Antwerp (2007–2008),
working on probabilistic modeling of image textures using information geometry. From 2008 to 2010,
he was with the Department of Data Analysis at UGent, where he worked on modeling and
estimation of brain activity, based on functional magnetic resonance imaging. In 2010, he returned
to the Department of Applied Physics at UGent, first as a postdoctoral assistant and from 2014
onwards, as a part-time assistant professor. Since 2013, he has held a cross-appointment as a
researcher at the Laboratory for Plasma Physics of the Royal Military Academy (LPP-ERM/KMS)
in Brussels. His research activities comprise development of data analysis techniques using methods
from probability theory, machine learning and information geometry, and their application to nuclear
fusion experiments. He also teaches a Master course on Continuum Mechanics at Ghent University.
He serves on the editorial board of the multidisciplinary journal Entropy and is a member of the
scientific committees of several conferences (IAEA Technical Meeting on Fusion Data Processing,
Validation and Analysis; International Workshop on Bayesian Inference and Maximum Entropy
Methods in Science and Engineering; Conference on Geometric Science of Information). In addition,
he is a consulting expert in the International Tokamak Physics Activity (ITPA) Transport and
Confinement Topical Group and member of the General Assembly of the European Fusion Education
Network (FuseNet).



Preface to “Information Geometry”

The mathematical field of information geometry originated from the observation that the Fisher
information can be used to define a Riemannian metric on manifolds of probability distributions.
This led to a geometrical description of probability theory and statistics, which over the years has
developed into a rich mathematical field with a broad range of applications in the data sciences.
Moreover, similar to the concept of entropy, there are various connections to and applications of
information geometry in statistical mechanics, quantum mechanics, and neuroscience.

It has been a pleasure to act as a guest editor for this first Special Issue on information geometry
in the journal Entropy. For me, as a physicist working on the development and application of
data science techniques in the context of nuclear fusion experiments, the interdisciplinary character
of information geometry has always been one of the main reasons for its appeal. There are, of
course, many other domains in physics where geometrical notions play a key role, including classical
mechanics, continuum mechanics (which I have been teaching at Ghent University for several years
now), general relativity, and much of modern physics. This interplay between the beautiful and
elegant formalism of differential geometry on the one hand and physics and data science on the
other hand is both fascinating and inspiring. The variety of topics covered by this Special Issue is a
reflection of this cross-fertilization between disciplines.

“Information Geometry I” has been a great success, and although the papers were already
published several years ago, it was decided that it was worthwhile to reprint the collection of papers
in book form. Indeed, even though all papers present original research, many have a strong tutorial
character, and we were honored to receive multiple contributions by authorities in the field. The
papers have been structured according to their main subject area, or field of application, and we
briefly discuss each of them in the following.

We start with two papers related to the foundations of information geometry. We were very
pleased to receive a contribution by one of the founders of the field of information geometry, Prof.
Shun-ichi Amari. In his paper, the dually flat structure of the manifold of positive measures is
discussed, derived from a class of Bregman divergences. These so-called (ρ,τ)-divergences, originally
proposed by J. Zhang, are defined in terms of two monotone, scalar functions (ρ and τ) and form a
unique class of dually flat, decomposable divergences. This is extended to the set of positive-definite
matrices, additionally requiring invariance of the divergence under matrix transformations. It is well
known that such dually flat manifolds have computationally desirable properties in applications to
classification and information retrieval.

Harsha K. V. and Subrahamanian Moosath K. S. introduce F-geometry as a generalization
of α-geometry, based on a representation of a probability density function through a function F.
They then combine this with another function G to define a weighted expectation, from which
an (F,G)-metric and connection are deduced. A condition for two such structures to lead
to dual connections is also derived. However, it was shown by Zhang (J. Zhang, Entropy 17,
pp. 4485–4499, 2015) that this framework is equivalent to the (ρ,τ)-geometry introduced earlier by
him. Although the present paper is slightly different in perspective, it should be read with this
equivalence in mind.

The next four papers deal with applications of information geometry in statistics. The paper
by Frank Critchley and Paul Marriott presents an important research program aimed at rendering
some of the most useful results of information geometry more accessible to statisticians in
practical applications. Indeed, the formalism of differential geometry and tensor algebra can
appear daunting at first sight and may present an obstacle to adoption of many useful results
by practitioners. The paper describes a computational framework that facilitates implementation
of results from information geometry, based on an embedding of various important statistical
models in a (sufficiently large) simplex. Challenges related to extension of the framework to the
infinite-dimensional case are touched upon as well.

In the paper by Paul Vos and Karim Anaya-Izquierdo, the goal is to identify one-dimensional
exponential families enjoying a number of properties that are convenient for statistical modeling,
i.e., parametrization by a measure of central tendency, unimodality, and monotone likelihood ratio.
The basis for the framework is the multinomial distribution, modeled geometrically by the simplex.
The selection of exponential families with desirable properties is then based on a partitioning of the
natural parameter space of the family of multinomial distributions by means of convex cones.

Guido Montúfar and co-workers consider various possibilities to define natural Riemannian
metrics on polytopes of stochastic matrices, which describe the conditional probability distribution
of two categorical random variables. Inspired by the classical result regarding the uniqueness of the
Fisher metric by requiring invariance under Markov morphisms, they define metrics derived from
a natural class of stochastic maps between such polytopes, or, alternatively, through embeddings in
various possible model spaces. They provide recommendations as to which metric to use, depending
on the application.

André Klein, in his article, provides a survey of several matrix algebraic properties of the Fisher
information matrix corresponding to weakly stationary time series. The link with various structured
matrices arising from a number of time series models is demonstrated. A statistical distance measure
is built using the Fisher information matrix in the context of classical and quantum information.
Finally, conditions are obtained for the Fisher information of a stationary process to obey certain
forms of the Stein equation.

We continue with three papers concerning applications of information geometry in Bayesian
inference and simulation. Keisuke Yano and Fumiyasu Komaki, in their paper, construct constant-risk
Bayesian predictive densities using the Kullback-Leibler loss function when the distributions of the
data and the target variable to be predicted are different but have a common unknown parameter.
Specifically, the issue of prior selection is investigated, and several applications are given.

Samuel Livingstone and Mark Girolami provide an introduction to recent enhancements of
Markov chain Monte Carlo simulation techniques inspired by information geometry. They apply
this to the Metropolis–Hastings algorithm driven by Langevin diffusion, gradually transforming the
ingredients to the setting of a Riemannian manifold equipped with a metric similar to the Fisher
information metric. Pointers to various applications are given. The paper is written in a way that also
makes it accessible to practitioners with little background in stochastic processes and geometry.

The paper by Hui Zhao and Paul Marriott concerns Bayesian inference making use of variational
methods for approximating the posterior distribution. In the context of inference for time series
models that switch between different regimes, variational Bayes is shown to be a computationally
attractive alternative to Markov chain Monte Carlo simulations. The geometry related to the
projection of the posterior onto a computationally tractable family of distributions is elucidated by
means of a simple example. This is followed by an application wherein it is shown that variational
Bayes is successful in estimating the regime-switching model, including the number of regimes.

Applications of information geometry in machine learning are represented by the following
three papers. The article by Frank Nielsen and colleagues considers k-means histogram clustering,
with applications to, e.g., information retrieval. Based on the α-divergences as similarity measures,
clustering is performed using either the sided (asymmetric) or symmetrized divergence, or by means
of the interesting notion of a mixed divergence. An important computational advantage is that the
centroids based on the sided and mixed divergences have a closed-form expression. Next, the scheme
is extended to algorithms with optimized initialization of cluster centroids, as well as soft clustering.

Salem Said and co-workers present a class of distributions on the manifold of the univariate
normal model equipped with the Fisher information metric. Expressed in terms of the Fisher-Rao
distance, the distributions are used as priors for modeling the classes in Bayesian classification
problems of normal distributions. Characteristics of this “Gaussian” distribution on the manifold are
discussed, as well as estimation of its parameters and the posterior using the Laplace approximation.
In an application to classification of image textures, the improved performance of these priors over
conjugate priors is demonstrated.

Luigi Malagò and Giovanni Pistone address optimization on manifolds of exponential
distributions on a discrete state space using Newton’s method, which is based on second-order
calculus. In particular, the goal is to find maxima of the expectation of a function with respect to the
distribution (stochastic relaxation). Details of the computation are provided, including calculation
of the Riemannian Hessian. A nonparametric formalism is used, with a view to extension to the
infinite-dimensional case.

The next three papers are related to the role of information geometry in complex systems
research. Domenico Felice and colleagues consider the time-averaged volume explored by geodesics
on a statistical manifold as an indicator of complexity of the entropic dynamics of a system.
The parameters of the model play the role of macrovariables conveying information on the
system’s microstate. Examples are given for the case of univariate, bivariate, and trivariate normal
distributions, providing interesting results depending on correlations between the microvariables.

Alexandre Levada investigates the role of entropy and Fisher information in pairwise isotropic
Gaussian Markov random fields, acting as models for complex systems. Expressions for these
quantities are derived, and the evolution of the Fisher information and entropy is studied as
a function of the inverse temperature of the system. An interesting interpretation is given of
asymmetries between these curves in terms of the arrow of time.

Masatoshi Funabashi presents a framework for measuring statistical dependence between
subsystems of a stochastic model, based on the model’s graph representation. A description in
terms of the mixed coordinates of the system is used to quantify the complexity loss incurred by
cutting an edge of the graph. In addition, a complexity measure is defined as a geometric mean of
Kullback–Leibler divergences between decompositions of the system in terms of subsystems with
fewer statistical dependencies. This quantifies the degree to which the system can be decomposed.

The following paper concerns an application to physics, specifically quantum mechanics. Roger
Balian gives an overview of a geometrical framework for measuring information loss in quantum
systems resulting from the mixing of states. A Riemannian metric is defined, based on the von
Neumann entropy, generating a mapping between states and observables. The metric is compared
to other quantum metrics, as well as the Fisher–Rao metric, and various geometrical properties are
derived. Applications are given to quantum information, as well as equilibrium and non-equilibrium
quantum statistical mechanics.

The final paper in the Special Issue is situated at the interface between physics and neuroscience.
Xiaozhao Zhao and colleagues consider the principle of extreme physical information based on the
Fisher information, which has been used before in an attempt to establish an information-theoretical
basis for physical laws. They extend the idea to cognitive systems and aim at narrowing the gap
between the information bound and the data information for such complex systems, by transforming
the model to a simpler one. This is done by means of a dimensionality reduction technique, also based
on the Fisher information. The approach is applied to derive the model for single-layer Boltzmann
machines and interpret their learning algorithms.

We are convinced that the varied collection of papers in this Special Issue will be useful for
scientists who are new to the field, while providing an excellent reference for the more seasoned
researcher. Furthermore, it is worth mentioning that the second Entropy Special Issue in this series,
“Information Geometry II”, will also be published as a book, and that a third Special Issue is being
prepared. We hope that the reader will enjoy browsing and reading through this collection of papers
as much as we enjoyed guest editing this Special Issue “Information Geometry I”.

Finally, I would like to thank the Editor-in-Chief of Entropy, Prof. Dr. Kevin H. Knuth, for
suggesting the opportunity to guest-edit a Special Issue on information geometry. Furthermore,
I wish to thank the editorial staff at MDPI for their great help with contacting authors, organizing
paper reviews, and editing the original Special Issue in Entropy, as well as the reprinted version in the
present book.

Geert Verdoolaege
Special Issue Editor




entropy

Article
Information Geometry of Positive Measures and
Positive-Definite Matrices: Decomposable Dually
Flat Structure

Shun-ichi Amari

RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan;
E-Mail: amari@brain.riken.jp; Tel.: +81-48-467-9669; Fax: +81-48-467-9687

Received: 14 February 2014; in revised form: 9 April 2014 / Accepted: 10 April 2014 /
Published: 14 April 2014

Abstract: Information geometry studies the dually flat structure of a manifold, highlighted by
the generalized Pythagorean theorem. The present paper studies a class of Bregman divergences
called the (ρ, τ)-divergence. A (ρ, τ)-divergence generates a dually flat structure in the manifold of
positive measures, as well as in the manifold of positive-definite matrices. The class is composed of
decomposable divergences, which are written as a sum of componentwise divergences. Conversely,
a decomposable dually flat divergence is shown to be a (ρ, τ)-divergence. A (ρ, τ)-divergence is
determined from two monotone scalar functions, ρ and τ. The class includes the KL-divergence, α-,
β- and (α, β)-divergences as special cases. The transformation between an affine parameter and its
dual is easily calculated in the case of a decomposable divergence. Therefore, such a divergence
is useful for obtaining the center for a cluster of points, which will be applied to classification and
information retrieval in vision. For the manifold of positive-definite matrices, in addition to dual
flatness and decomposability, we require the invariance under linear transformations, in particular
under orthogonal transformations. This opens a way to define a new class of divergences, called the
(ρ, τ)-structure in the manifold of positive-definite matrices.

Keywords: information geometry; dually flat structure; decomposable divergence; (ρ, τ)-structure

1. Introduction

Information geometry, which originated from the invariant structure of a manifold of probability
distributions, consists of a Riemannian metric and dually coupled affine connections with respect to
the metric [1]. A manifold having a dually flat structure is particularly interesting and important. In
such a manifold, there are two dually coupled affine coordinate systems and a canonical divergence,
which is a Bregman divergence. The highlight is given by the generalized Pythagorean theorem and
projection theorem. Information geometry is useful not only for statistical inference, but also for
machine learning, pattern recognition, optimization and even for neural networks. It is also related to
the statistical physics of Tsallis q-entropy [2–4].
The present paper studies a general and unique class of decomposable divergence functions
in $\mathbb{R}^n_+$, the manifold of n-dimensional positive measures. These are the (ρ, τ)-divergences,
introduced by Zhang [5,6] from the point of view of “representation duality”. They are Bregman divergences
generating a dually flat structure. The class includes the well-known Kullback-Leibler divergence,
α-divergence, β-divergence and (α, β)-divergence [1,7–9] as special cases. The merit of a decomposable
Bregman divergence is that the θ-η Legendre transformation is computationally tractable, where θ
and η are two affine coordinate systems coupled by the Legendre transformation. When one uses
a dually flat divergence to define the center of a cluster of elements, the center is easily given by
the arithmetic mean of the dual coordinates of the elements [10,11]. However, we need to calculate
its primal coordinates. This is the θ-η transformation. Hence, our new type of divergence has an
advantage in calculating θ-coordinates for clustering and related pattern matching problems. The most
general class of dually flat divergences, not necessarily decomposable, is further given in $\mathbb{R}^n_+$. They are
the (ρ, τ) divergence.
Positive-definite (PD) matrices appear in many engineering problems, such as convex
programming, diffusion tensor analysis and multivariate statistical analysis [12–20]. The manifold,
$PD_n$, of n × n PD matrices forms a cone, and its geometry is by itself an important subject of research.
If we consider the submanifold consisting of only diagonal matrices, it is equivalent to the manifold
of positive measures. Hence, PD matrices can be regarded as a generalization of positive measures.
There are many studies on geometry and divergences of the manifold of positive-definite matrices. We
introduce a general class of dually flat divergences, the (ρ, τ)-divergence. We analyze the cases when
a (ρ, τ)-divergence is invariant under the general linear transformations, Gl(n), and invariant under
the orthogonal transformations, O(n). They not only include many well-known divergences of PD
matrices, but also give new important divergences.
The present paper is organized as follows. Section 2 is preliminary, giving a short introduction
to a dually flat manifold and the Bregman divergence. It also defines the cluster center due to a
divergence. Section 3 defines the (ρ, τ)-structure in $\mathbb{R}^n_+$. It gives dually flat decomposable affine
coordinates and a related canonical divergence (Bregman divergence). Section 4 is devoted to the
(ρ, τ)-structure of the manifold, $PD_n$, of PD matrices. We first study the class of divergences that are
invariant under O(n). We further study a decomposable divergence that is invariant under Gl(n).
It coincides with the invariant divergence derived from zero-mean Gaussian distributions with PD
covariance matrices. They include not only various known divergences, but also remarkable new ones.
Section 5 discusses a general class of non-decomposable flat divergences and miscellaneous topics.
Section 6 gives the conclusions.

2. Preliminaries to Information Geometry of Divergence

2.1. Dually Flat Manifold

A manifold is said to have the dually flat Riemannian structure when it has two affine coordinate
systems $\theta = (\theta^1, \cdots, \theta^n)$ and $\eta = (\eta_1, \cdots, \eta_n)$ (with respect to two flat affine
connections) together with two convex functions, ψ(θ) and ϕ(η), such that the two coordinates are
connected by the Legendre transformations:

$$\eta = \nabla\psi(\theta), \qquad \theta = \nabla\phi(\eta), \tag{1}$$

where ∇ is the gradient operator. The Riemannian metric is given by:

$$\left(g_{ij}(\theta)\right) = \nabla\nabla\psi(\theta), \qquad \left(g^{ij}(\eta)\right) = \nabla\nabla\phi(\eta) \tag{2}$$

in the respective coordinate systems. A curve that is linear in the θ-coordinates is called a θ-geodesic,
and a curve linear in the η-coordinates is called an η-geodesic.
A dually flat manifold has a unique canonical divergence, which is the Bregman divergence
defined by the convex functions,

$$D[P : Q] = \psi(\theta_P) + \phi(\eta_Q) - \theta_P \cdot \eta_Q, \tag{3}$$

where $\theta_P$ is the θ-coordinates of P, $\eta_Q$ is the η-coordinates of Q and
$\theta_P \cdot \eta_Q = \sum_i \theta_P^i \, \eta_{Qi}$, where $\theta_P^i$ and $\eta_{Qi}$ are components of
$\theta_P$ and $\eta_Q$, respectively. The Pythagorean and projection theorems hold in
a dually flat manifold:

Pythagorean Theorem. Given three points, P, Q, R, when the η-geodesic connecting P and Q is
orthogonal to the θ-geodesic connecting Q and R with respect to the Riemannian metric,

$$D[P : Q] + D[Q : R] = D[P : R]. \tag{4}$$

Projection Theorem. Given a smooth submanifold, S, let $P_S$ be the minimizer of divergence
from P to S,

$$P_S = \operatorname*{arg\,min}_{Q \in S} D[P : Q]. \tag{5}$$

Then, $P_S$ is the η-geodesic projection of P to S, that is, the η-geodesic connecting P and $P_S$ is
orthogonal to S.
We have the dual of the above theorems where θ- and η-geodesics are exchanged and D[P : Q] is
replaced by its dual D[Q : P].
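
As a small numerical illustration of the canonical divergence in Equation (3), the following sketch assumes the particular convex function ψ(θ) = Σ exp(θ^i), whose Legendre dual is ϕ(η) = Σ (η_i log η_i − η_i). This choice, the function names and the sample coordinates are assumptions made only for the example. Under this choice, the divergence should vanish for coinciding points and agree with a generalized Kullback-Leibler divergence between the η-coordinates.

```python
import math

def psi(theta):
    """Assumed convex function psi(theta) = sum_i exp(theta^i)."""
    return sum(math.exp(t) for t in theta)

def phi(eta):
    """Legendre dual of psi: sum_i (eta_i log eta_i - eta_i), for eta_i > 0."""
    return sum(e * math.log(e) - e for e in eta)

def grad_psi(theta):
    """Eq. (1): eta = grad psi(theta); here eta_i = exp(theta^i)."""
    return [math.exp(t) for t in theta]

def canonical_divergence(theta_p, eta_q):
    """Eq. (3): D[P:Q] = psi(theta_P) + phi(eta_Q) - theta_P . eta_Q."""
    inner = sum(t * e for t, e in zip(theta_p, eta_q))
    return psi(theta_p) + phi(eta_q) - inner

def generalized_kl(a, b):
    """sum_i (b_i - a_i + a_i log(a_i / b_i)) for positive vectors a and b."""
    return sum(bi - ai + ai * math.log(ai / bi) for ai, bi in zip(a, b))

# Hypothetical theta-coordinates of two points P and Q.
theta_p = [0.1, -0.4, 1.2]
theta_q = [0.5, 0.3, -0.7]
eta_p, eta_q = grad_psi(theta_p), grad_psi(theta_q)

d_pq = canonical_divergence(theta_p, eta_q)   # divergence between distinct points
d_pp = canonical_divergence(theta_p, eta_p)   # divergence of a point from itself
```

With these definitions, `d_pp` is zero up to rounding and `d_pq` coincides with `generalized_kl(eta_q, eta_p)`.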

2.2. Decomposable Divergence

A divergence, D[P : Q], is said to be decomposable when it is written as a sum of
component-wise divergences,

$$D[P : Q] = \sum_{i=1}^{n} d\left(\theta_P^i, \theta_Q^i\right), \tag{6}$$

where $\theta_P^i$ and $\theta_Q^i$ are the components of $\theta_P$ and $\theta_Q$ and
$d\left(\theta_P^i, \theta_Q^i\right)$ is a scalar divergence function.
An f-divergence:

$$D_f[P : Q] = \sum p_i \, f\!\left(\frac{q_i}{p_i}\right) \tag{7}$$

is a typical example of decomposable divergence in the manifold of probability distributions, where
P = (p) and Q = (q) are two probability vectors with $\sum p_i = \sum q_i = 1$. A convex function, ψ(θ), is
said to be decomposable when it is written as:

$$\psi(\theta) = \sum_{i=1}^{n} \tilde{\psi}\left(\theta^i\right) \tag{8}$$

by using a scalar convex function, $\tilde{\psi}(\theta)$. The Bregman divergence derived from a decomposable convex
function is decomposable.
When ψ(θ) is a decomposable convex function, its Legendre dual is also decomposable. The
Legendre transformation is given componentwise as:

$$\eta_i = \tilde{\psi}'\left(\theta^i\right), \tag{9}$$

where ′ is the differentiation of a function, so that it is computationally tractable. Its inverse
transformation is also componentwise,

$$\theta^i = \tilde{\phi}'\left(\eta_i\right), \tag{10}$$

where $\tilde{\phi}$ is the Legendre dual of $\tilde{\psi}$.
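
The componentwise character of Equations (6), (9) and (10) can be sketched as follows. The scalar convex function used here, the softplus ψ̃(θ) = log(1 + e^θ), is an assumption chosen for illustration. Its derivative is the logistic function, its Legendre dual is the negative binary entropy on 0 < η < 1, and the inverse transformation is the logit. All function names are hypothetical.

```python
import math

def psi_scalar(t):
    """Assumed scalar convex function: softplus, log(1 + e^t)."""
    return math.log1p(math.exp(t))

def eta_from_theta(t):
    """Eq. (9): eta_i = psi~'(theta^i); for softplus this is the logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def phi_scalar(e):
    """Legendre dual of softplus on 0 < e < 1: e log e + (1 - e) log(1 - e)."""
    return e * math.log(e) + (1.0 - e) * math.log(1.0 - e)

def theta_from_eta(e):
    """Eq. (10): theta^i = phi~'(eta_i); for this dual it is the logit."""
    return math.log(e / (1.0 - e))

def scalar_divergence(t_p, t_q):
    """Scalar Bregman term: psi~(theta_P^i) + phi~(eta_Qi) - theta_P^i * eta_Qi."""
    e_q = eta_from_theta(t_q)
    return psi_scalar(t_p) + phi_scalar(e_q) - t_p * e_q

def divergence(theta_p, theta_q):
    """Eq. (6): the divergence is a sum of component-wise scalar divergences."""
    return sum(scalar_divergence(tp, tq) for tp, tq in zip(theta_p, theta_q))

# Hypothetical coordinates of two points.
theta_p = [0.3, -1.0, 2.0]
theta_q = [-0.5, 0.8, 1.5]

eta_q = [eta_from_theta(t) for t in theta_q]       # theta -> eta, one component at a time
theta_back = [theta_from_eta(e) for e in eta_q]    # eta -> theta, one component at a time
d_pq = divergence(theta_p, theta_q)
d_pp = divergence(theta_p, theta_p)
```

Each coordinate is transformed independently of the others, so the cost of the θ-η transformation grows only linearly with the dimension n.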

2.3. Cluster Center

Consider a cluster of points $P_1, \cdots, P_m$, whose θ-coordinates are $\theta_1, \cdots, \theta_m$ and whose
η-coordinates are $\eta_1, \cdots, \eta_m$. The center, R, of the cluster with respect to the divergence, D[P : Q],
is defined by:

$$R = \operatorname*{arg\,min}_{Q} \sum_{i=1}^{m} D[Q : P_i]. \tag{11}$$

By differentiating $\sum D[Q : P_i]$ by θ (the θ-coordinates of Q), we have:

$$\nabla\psi(\theta_R) = \frac{1}{m} \sum_{i=1}^{m} \eta_i. \tag{12}$$

Hence, the cluster-center theorem due to Banerjee et al. [10] follows; see also [11]:

Cluster-Center Theorem. The η-coordinates $\eta_R$ of the cluster center are given by the arithmetic
average of the η-coordinates of the points in the cluster:

$$\eta_R = \frac{1}{m} \sum_{i=1}^{m} \eta_i. \tag{13}$$

When we need to obtain the θ-coordinates of the cluster center, they are given by the θ-η transformation
from $\eta_R$,

$$\theta_R = \nabla\phi(\eta_R). \tag{14}$$

However, in many cases, the transformation is computationally heavy and intractable when the
dimension of the manifold is large. The transformation is easy in the case of a decomposable divergence.
This is the motivation for considering a general class of decomposable Bregman divergences.
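
The cluster-center computation of Equations (11)–(14) can be sketched numerically. The example again assumes the decomposable function ψ(θ) = Σ exp(θ^i), so that η_i = exp(θ^i) and θ^i = log η_i. The cluster points, the perturbation size and the function names are hypothetical. The sketch averages the η-coordinates, maps the average back to θ-coordinates componentwise, and compares the objective of Equation (11) at the center with its values at a few nearby points.

```python
import math

def psi(theta):
    """Assumed decomposable convex function: sum_i exp(theta^i)."""
    return sum(math.exp(t) for t in theta)

def phi(eta):
    """Its Legendre dual: sum_i (eta_i log eta_i - eta_i)."""
    return sum(e * math.log(e) - e for e in eta)

def divergence(theta_q, eta_p):
    """D[Q:P] = psi(theta_Q) + phi(eta_P) - theta_Q . eta_P, as in Eq. (3)."""
    return psi(theta_q) + phi(eta_p) - sum(t * e for t, e in zip(theta_q, eta_p))

def cluster_center(etas):
    """Eqs. (13)-(14): average the eta-coordinates, then transform to theta."""
    m = len(etas)
    eta_r = [sum(column) / m for column in zip(*etas)]
    theta_r = [math.log(e) for e in eta_r]   # theta = grad phi(eta) = log(eta)
    return theta_r, eta_r

def objective(theta_q, etas):
    """The sum of divergences minimized in Eq. (11)."""
    return sum(divergence(theta_q, eta) for eta in etas)

# Hypothetical cluster of three points, given by their eta-coordinates.
etas = [[1.0, 2.0], [3.0, 0.5], [2.0, 1.5]]
theta_r, eta_r = cluster_center(etas)

best = objective(theta_r, etas)
t0, t1 = theta_r
nearby = [objective([t0 + dx, t1 + dy], etas)
          for dx in (-0.1, 0.1) for dy in (-0.1, 0.1)]
```

Because the objective is convex in θ, every value in `nearby` should exceed `best`, which is consistent with the center obtained from the averaged η-coordinates.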

3. (ρ, τ) Dually Flat Structure in $\mathbb{R}^n_+$

3.1. (ρ, τ)-Coordinates of $\mathbb{R}^n_+$

Let $\mathbb{R}^n_+$ be the manifold of positive measures over n elements $x_1, \cdots, x_n$. A measure (or a weight)
of $x_i$ is given by:

$$\xi_i = m(x_i) > 0 \tag{15}$$

and $\xi = (\xi_1, \cdots, \xi_n)$ is a distribution of measures. When $\sum \xi_i = 1$ is satisfied, it is a probability
measure. We write:

$$\mathbb{R}^n_+ = \{\xi \mid \xi_i > 0\} \tag{16}$$

and ξ forms a coordinate system of $\mathbb{R}^n_+$.
Let ρ(ξ) and τ(ξ) be two monotonically increasing differentiable functions. We call:

$$\theta = \rho(\xi), \qquad \eta = \tau(\xi) \tag{17}$$

the ρ- and τ-representations of positive measure ξ. This is a generalization of the ±α representations [1]
and was introduced in [5] for a manifold of probability distributions. See also [6].
By using these functions, we construct new coordinate systems θ and η of $\mathbb{R}^n_+$. They are given, for
$\theta = (\theta^i)$ and $\eta = (\eta_i)$, by componentwise relations,

$$\theta^i = \rho(\xi_i), \qquad \eta_i = \tau(\xi_i). \tag{18}$$

They are called the ρ- and τ-representations of $\xi \in \mathbb{R}^n_+$, respectively. We search for convex functions,
$\psi_{\rho,\tau}(\theta)$ and $\phi_{\rho,\tau}(\eta)$, which are Legendre duals to each other, such that θ and η are two dually coupled
affine coordinate systems.

3.2. Convex Functions

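We introduce two scalar functions of θ and η by:

$$\tilde\psi_{\rho,\tau}(\theta) = \int_0^{\rho^{-1}(\theta)} \tau(\xi)\rho'(\xi)\,d\xi, \tag{19}$$

$$\tilde\varphi_{\rho,\tau}(\eta) = \int_0^{\tau^{-1}(\eta)} \rho(\xi)\tau'(\xi)\,d\xi. \tag{20}$$

Then, the first and second derivatives of $\tilde\psi_{\rho,\tau}$ are:

$$\tilde\psi'_{\rho,\tau}(\theta) = \tau(\xi), \tag{21}$$

$$\tilde\psi''_{\rho,\tau}(\theta) = \frac{\tau'(\xi)}{\rho'(\xi)}. \tag{22}$$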


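Since ρ′(ξ) > 0, τ′(ξ) > 0, we see that $\tilde\psi_{\rho,\tau}(\theta)$ is a convex function. So is $\tilde\varphi_{\rho,\tau}(\eta)$. Moreover, they are
Legendre duals, because:

$$\tilde\psi_{\rho,\tau}(\theta) + \tilde\varphi_{\rho,\tau}(\eta) - \theta\eta = \int_0^{\xi} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi} \rho(\xi)\tau'(\xi)\,d\xi - \rho(\xi)\tau(\xi) \tag{23}$$

$$= 0. \tag{24}$$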

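We then define two decomposable convex functions of θ and η by:

$$\psi_{\rho,\tau}(\theta) = \sum_i \tilde\psi_{\rho,\tau}(\theta^i), \tag{25}$$

$$\varphi_{\rho,\tau}(\eta) = \sum_i \tilde\varphi_{\rho,\tau}(\eta_i). \tag{26}$$

They are Legendre duals to each other.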

3.3. (ρ, τ)-Divergence

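The (ρ, τ)-divergence between two points, $\xi, \xi' \in \mathbb{R}^n_+$, is defined by:

$$D_{\rho,\tau}[\xi : \xi'] = \psi_{\rho,\tau}(\theta) + \varphi_{\rho,\tau}(\eta') - \theta \cdot \eta' \tag{27}$$

$$= \sum_{i=1}^{n} \left\{ \int_0^{\xi_i} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi'_i} \rho(\xi)\tau'(\xi)\,d\xi - \rho(\xi_i)\,\tau(\xi'_i) \right\}, \tag{28}$$

where θ and η′ are ρ- and τ-representations of ξ and ξ′, respectively.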
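Equation (28) can be evaluated directly by numerical quadrature. The sketch below is illustrative only: it assumes the pair ρ(ξ) = ξ, τ(ξ) = ξ² (both increasing on ξ > 0, with ρ(0)τ(0) = 0), and uses a simple midpoint rule; the helper names are hypothetical.

```python
def integrate(f, a, b, steps=2000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

def rho_tau_divergence(xi, xi_prime, rho, tau, d_rho, d_tau):
    """Numerical version of Eq. (28), summed over components."""
    total = 0.0
    for x, y in zip(xi, xi_prime):
        psi_term = integrate(lambda s: tau(s) * d_rho(s), 0.0, x)
        phi_term = integrate(lambda s: rho(s) * d_tau(s), 0.0, y)
        total += psi_term + phi_term - rho(x) * tau(y)
    return total

# Assumed example pair (not from the paper): rho(xi) = xi, tau(xi) = xi**2.
rho = lambda s: s
tau = lambda s: s * s
d_rho = lambda s: 1.0
d_tau = lambda s: 2.0 * s

# For this pair the closed form per component is x**3/3 + 2*y**3/3 - x*y**2.
d_example = rho_tau_divergence([1.0], [2.0], rho, tau, d_rho, d_tau)
d_same = rho_tau_divergence([1.5, 0.7], [1.5, 0.7], rho, tau, d_rho, d_tau)
```

The divergence vanishes when both arguments coincide, which is the Legendre identity of Equations (23) and (24) applied componentwise.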
The (ρ, τ)-divergence gives a dually flat structure having θ and η as affine and dual affine
coordinate systems. This is originally due to Zhang [5] and a generalization of our previous results
concerning the q and deformed exponential families [4]. The transformation between θ and η is simple
in the (ρ, τ)-structure, because it can be done componentwise,

$$\theta^i = \rho\left\{\tau^{-1}(\eta_i)\right\}, \tag{29}$$

$$\eta_i = \tau\left\{\rho^{-1}(\theta^i)\right\}. \tag{30}$$

The Riemannian metric is:

$$g_{ij}(\xi) = \frac{\tau'(\xi_i)}{\rho'(\xi_i)}\,\delta_{ij}, \tag{31}$$

and hence Euclidean, because the Riemann-Christoffel curvature due to the Levi-Civita connection
vanishes, too.
The following theorem is new, characterizing the (ρ, τ)-divergence.

Theorem 1. The (ρ, τ)-divergences form a unique class of divergences in Rn+ that are dually flat and
decomposable.

3.4. Biduality: α-(ρ, τ) Divergence

We have dually flat connections, $(\nabla_{\rho,\tau}, \nabla^*_{\rho,\tau})$, represented in terms of covariant derivatives,
which are derived from $D_{\rho,\tau}$. This is called the representation duality by Zhang [5]. We further have
the α-(ρ, τ) connections defined by:

$$\nabla^{(\alpha)}_{\rho,\tau} = \frac{1+\alpha}{2}\,\nabla_{\rho,\tau} + \frac{1-\alpha}{2}\,\nabla^*_{\rho,\tau}. \tag{32}$$

The α-(−α) duality is called the reference duality [5]. Therefore, $\nabla^{(\alpha)}_{\rho,\tau}$ possesses the biduality, one
concerning α and (−α), and the other with respect to ρ and τ.


The Riemann-Christoffel curvature of $\nabla^{(\alpha)}_{\rho,\tau}$ is:

$$R^{(\alpha)}_{\rho,\tau} = \frac{1-\alpha^2}{4}\,R^{(0)}_{\rho,\tau} = 0 \tag{33}$$

for any α. Hence, there exist a unique canonical divergence $D^{(\alpha)}_{\rho,\tau}$ and α-(ρ, τ) affine coordinate systems.
It is an interesting future problem to obtain their explicit forms.

3.5. Various Examples

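As a special case of the (ρ, τ)-divergence, we have the (α, β)-divergence obtained from the
following power functions,

$$\rho(\xi) = \frac{1}{\alpha}\,\xi^{\alpha}, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{\beta}. \tag{34}$$

This was introduced by Cichocki, Cruces and Amari in [7,8].
The affine and dual affine coordinates are:

$$\theta^i = \frac{1}{\alpha}\,(\xi_i)^{\alpha}, \qquad \eta_i = \frac{1}{\beta}\,(\xi_i)^{\beta} \tag{35}$$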

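and the convex functions are:

$$\psi(\theta) = c_{\alpha,\beta} \sum_i \left(\theta^i\right)^{\frac{\alpha+\beta}{\alpha}}, \qquad \varphi(\eta) = c_{\beta,\alpha} \sum_i \eta_i^{\frac{\alpha+\beta}{\beta}}, \tag{36}$$

where:

$$c_{\alpha,\beta} = \frac{1}{\beta(\alpha+\beta)}\,\alpha^{\frac{\alpha+\beta}{\alpha}}. \tag{37}$$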

The induced (α, β)-divergence has a simple form,

$$D_{\alpha,\beta}[\xi : \xi'] = \frac{1}{\alpha\beta(\alpha+\beta)} \sum_i \left\{ \alpha\,\xi_i^{\alpha+\beta} + \beta\,\xi_i'^{\,\alpha+\beta} - (\alpha+\beta)\,\xi_i^{\alpha}\,\xi_i'^{\,\beta} \right\}, \tag{38}$$

for $\xi, \xi' \in \mathbb{R}^n_+$. It is defined similarly in the manifold, Sn, of probability distributions, but it is not
a Bregman divergence in Sn. This is because the total mass constraint ∑ ξi = 1 is not linear in θ- or
η-coordinates in general.
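As a minimal sketch, Equation (38) can be coded directly for positive measures stored as lists. The function name and the argument checks are illustrative assumptions; the formula itself is the one stated above and requires α, β and α + β to be nonzero.

```python
def alpha_beta_divergence(xi, xi_prime, alpha, beta):
    """(alpha, beta)-divergence of Eq. (38) between two positive measures."""
    s = alpha + beta
    if alpha == 0 or beta == 0 or s == 0:
        raise ValueError("alpha, beta and alpha + beta must be nonzero")
    if any(x <= 0 for x in xi) or any(y <= 0 for y in xi_prime):
        raise ValueError("measures must be strictly positive")
    total = sum(
        alpha * x ** s + beta * y ** s - s * (x ** alpha) * (y ** beta)
        for x, y in zip(xi, xi_prime)
    )
    return total / (alpha * beta * s)

# With alpha = beta = 1 the divergence reduces to half the squared Euclidean distance.
d_euclid = alpha_beta_divergence([1.0, 2.0], [2.0, 4.0], 1.0, 1.0)
d_general = alpha_beta_divergence([1.0, 2.0], [2.0, 4.0], 0.5, 1.5)
```

No θ-η inversion appears in the code, since the decomposable form expresses everything through the components ξi and ξ′i.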
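The α-divergence is a special case of the (α, β)-divergence, so that it is a (ρ, τ)-divergence. By
putting:

$$\rho(\xi) = \frac{2}{1-\alpha}\,\xi^{\frac{1-\alpha}{2}}, \qquad \tau(\xi) = \frac{2}{1+\alpha}\,\xi^{\frac{1+\alpha}{2}}, \tag{39}$$

we have:

$$D_{\alpha}[\xi : \xi'] = \frac{4}{1-\alpha^2} \sum_i \left\{ \frac{1-\alpha}{2}\,\xi_i + \frac{1+\alpha}{2}\,\xi'_i - \xi_i^{\frac{1-\alpha}{2}} \left(\xi'_i\right)^{\frac{1+\alpha}{2}} \right\}. \tag{40}$$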

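The β-divergence [19] is obtained from:

$$\rho(\xi) = \xi, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{1+\beta}. \tag{41}$$

It is written as:

$$D_{\beta}[\xi : \xi'] = \frac{1}{\beta(\beta+1)} \sum_i \left\{ \xi_i^{\beta+1} + (\beta+1)\,\xi'_i - \left(\xi'_i\right)^{\beta+1} - (\beta+1)\,\xi_i \left(\xi'_i\right)^{\beta} \right\}. \tag{42}$$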

The β-divergence is special in the sense that it gives a dually flat structure, even in Sn. This is because
ρ(ξ) is linear in ξ.


The classes of α-divergences and β-divergences intersect at the KL-divergence, and their duals are
different in general. They are the only intersecting points of the two classes.
When ρ(ξ) = ξ and τ(ξ) = U′(ξ) where U is a convex function, (ρ, τ)-divergence is Eguchi’s
U-divergence [21].
Zhang already introduced an (α, β)-divergence in [5], which is not a (ρ, τ)-divergence in Rn+ and is
different from ours. We regret the confusion caused by our naming of the (α, β)-divergence.

4. Invariant, Flat Decomposable Divergences in the Manifold of Positive-Definite Matrices

4.1. Invariant and Decomposable Convex Function

Let P be a positive-definite matrix and ψ(P) be a convex function. Then, a Bregman divergence is
defined between two positive-definite matrices, P and Q, by:

$$D[P : Q] = \psi(P) - \psi(Q) - \nabla\psi(Q) \cdot (P - Q) \tag{43}$$

where ∇ is the gradient operator with respect to matrix $P = (P_{ij})$, so that ∇ψ(P) is a matrix and the
inner product of two matrices is defined by:

$$\nabla\psi(Q) \cdot P = \mathrm{tr}\left\{\nabla\psi(Q)\,P\right\}, \tag{44}$$

where tr is the trace of a matrix.
It induces a dually flat structure on the manifold of positive-definite matrices, where the affine
coordinate system (θ-coordinates) is Θ = P and the dual affine coordinate system (η-coordinates) is:

$$H = \nabla\psi(P). \tag{45}$$

A convex function, ψ(P), is said to be invariant under the orthogonal group O(n), when:

$$\psi(P) = \psi\left(O^T P O\right) \tag{46}$$

holds for any orthogonal transformation O, where $O^T$ is the transpose of O. An invariant function is
written as a symmetric function of the n eigenvalues λ1, · · · , λn of P. See Dhillon and Tropp [12]. When
an invariant convex function of P is written, by using a convex function, f, of one variable, in the
additive form:

$$\psi(P) = \sum_i f(\lambda_i), \tag{47}$$

it is said to be decomposable. We have:

$$\psi(P) = \mathrm{tr}\, f(P). \tag{48}$$

4.2. Invariant, Flat and Decomposable Divergence

A divergence D[P : Q] is said to be invariant under O(n), when it satisfies:

$$D[P : Q] = D\left[O^T P O : O^T Q O\right]. \tag{49}$$

When it is derived from a decomposable convex function, ψ(P), it is invariant, flat and decomposable.
We give well-known examples of decomposable convex functions and the divergences derived
from them:


(1) For f(λ) = (1/2)λ², we have:

$$\psi(P) = \frac{1}{2} \sum_i \lambda_i^2, \tag{50}$$

$$D[P : Q] = \frac{1}{2}\,\|P - Q\|^2, \tag{51}$$

where ∥P∥² is the squared Frobenius norm:

$$\|P\|^2 = \sum_{i,j} P_{ij}^2. \tag{52}$$

(2) For f(λ) = − log λ,

$$\psi(P) = -\log\left(\det|P|\right), \tag{53}$$

$$D[P : Q] = \mathrm{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n. \tag{54}$$

The affine coordinate system is P, and the dual coordinate system is P⁻¹. The derived geometry is
the same as that of multivariate Gaussian probability distributions with mean zero and covariance
matrix P.
(3) For f(λ) = λ log λ − λ,

$$\psi(P) = \mathrm{tr}\left(P \log P - P\right), \tag{55}$$

$$D[P : Q] = \mathrm{tr}\left(P \log P - P \log Q - P + Q\right). \tag{56}$$

This divergence is used in quantum information theory. The affine coordinate system is P, and
the dual affine coordinate system is log P; ψ(P) is called the negative von Neumann entropy.

4.3. (ρ, τ)-Structure in Positive Definite Matrices

We extend the (ρ, τ)-structure in the previous section to the matrix case and introduce a general
dually flat invariant decomposable divergence in the manifold of positive-definite matrices. Let:

$$\Theta = \rho(P), \qquad H = \tau(P) \tag{57}$$

be ρ- and τ-representations of matrices. We use two functions, $\tilde\psi_{\rho,\tau}(\theta)$ and $\tilde\varphi_{\rho,\tau}(\eta)$, defined
in Equations (19) and (20), for defining a pair of dually coupled invariant and decomposable
convex functions,

$$\psi(\Theta) = \mathrm{tr}\,\tilde\psi_{\rho,\tau}\{\Theta\}, \tag{58}$$

$$\varphi(H) = \mathrm{tr}\,\tilde\varphi_{\rho,\tau}\{H\}. \tag{59}$$

They are not convex with respect to P, but are convex with respect to Θ and H, respectively. The
derived Bregman divergence is:

$$D[P : Q] = \psi\{\Theta(P)\} + \varphi\{H(Q)\} - \Theta(P) \cdot H(Q). \tag{60}$$

Theorem 2. The (ρ, τ)-divergences form a unique class of invariant, decomposable and dually flat
divergences in the manifold of positive matrices.


The Euclidean, Gaussian and von Neumann divergences given in Equations (51), (54) and (56) are
special examples of (ρ, τ)-divergences. They are given, respectively, by:

$$(1)\quad \rho(\xi) = \tau(\xi) = \xi, \tag{61}$$

$$(2)\quad \rho(\xi) = \xi, \qquad \tau(\xi) = -\frac{1}{\xi}, \tag{62}$$

$$(3)\quad \rho(\xi) = \xi, \qquad \tau(\xi) = \log \xi. \tag{63}$$

When ρ and τ are power functions, we have the (α, β)-structure in the manifold of positive-definite
matrices.

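(4) (α-β)-divergence.

By using the (α, β) power functions given by Equation (34), we have:

$$\psi(\Theta) = \frac{\alpha}{\alpha+\beta}\,\mathrm{tr}\,\Theta^{\frac{\alpha+\beta}{\alpha}} = \frac{\alpha}{\alpha+\beta}\,\mathrm{tr}\,P^{\alpha+\beta}, \tag{64}$$

$$\varphi(H) = \frac{\beta}{\alpha+\beta}\,\mathrm{tr}\,H^{\frac{\alpha+\beta}{\beta}} = \frac{\beta}{\alpha+\beta}\,\mathrm{tr}\,P^{\alpha+\beta} \tag{65}$$

so that the (α, β)-divergence of matrices is:

$$D[P : Q] = \mathrm{tr}\left\{ \frac{\alpha}{\alpha+\beta}\,P^{\alpha+\beta} + \frac{\beta}{\alpha+\beta}\,Q^{\alpha+\beta} - P^{\alpha}Q^{\beta} \right\}. \tag{66}$$

This is a Bregman divergence, where the affine coordinate system is $\Theta = P^{\alpha}$ and its dual is
$H = P^{\beta}$.
(5) The α-divergence is derived as:

$$\Theta(P) = \frac{2}{1-\alpha}\,P^{\frac{1-\alpha}{2}}, \tag{67}$$

$$\psi(\Theta) = \frac{2}{1+\alpha}\,\mathrm{tr}\,P, \tag{68}$$

$$D_{\alpha}[P : Q] = \frac{4}{1-\alpha^2}\,\mathrm{tr}\left\{ -P^{\frac{1-\alpha}{2}}\,Q^{\frac{1+\alpha}{2}} + \frac{1-\alpha}{2}\,P + \frac{1+\alpha}{2}\,Q \right\}. \tag{69}$$

The affine coordinate system is $\frac{2}{1-\alpha}P^{\frac{1-\alpha}{2}}$, and its dual is $\frac{2}{1+\alpha}P^{\frac{1+\alpha}{2}}$.
(6) The β-divergence is derived from Equation (41) as:

$$D_{\beta}[P : Q] = \frac{1}{\beta(\beta+1)}\,\mathrm{tr}\left\{ P^{\beta+1} + (\beta+1)\,Q - Q^{\beta+1} - (\beta+1)\,PQ^{\beta} \right\}. \tag{70}$$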

4.4. Invariance Under Gl(n)

We extend the concept of invariance under the orthogonal group to that under the general linear
group, Gl(n), that is, the set of invertible matrices L with det |L| ≠ 0. This is a stronger condition. A
divergence is said to be invariant under Gl(n), when:

$$D[P : Q] = D\left[L^T P L : L^T Q L\right] \tag{71}$$

holds for any L ∈ Gl(n).
We identify matrix P with the zero-mean Gaussian distribution:

$$p(x, P) = \exp\left\{ -\frac{1}{2}\,x^T P^{-1} x - \frac{1}{2}\log\det|P| - c \right\}, \tag{72}$$


where c is a constant. We know that an invariant divergence belongs to the class of f-divergences in
the case of a manifold of probability distributions, where the invariance means the geometry does
not change under a one-to-one mapping of x to y. Moreover, the only invariant flat divergence is the
KL-divergence [22]. These facts suggest the following conjecture.

Proposition. The invariant, flat and decomposable divergence under Gl(n) is the KL-divergence
given by:

$$D_{KL}[P : Q] = \mathrm{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n. \tag{73}$$
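The difference between O(n)- and Gl(n)-invariance can be illustrated numerically. The sketch below is restricted to 2 × 2 symmetric positive-definite matrices stored as nested lists (an assumption made to avoid external libraries); the matrices P, Q and L are arbitrary illustrative values. It evaluates Equation (73) and the Frobenius divergence of Equation (51) before and after the congruence P ↦ LᵀPL with a non-orthogonal L.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inverse(a):
    d = det(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def kl_divergence(p, q):
    """Eq. (73) with n = 2: tr(P Q^-1) - log det(P Q^-1) - 2."""
    m = matmul(p, inverse(q))
    return m[0][0] + m[1][1] - math.log(det(m)) - 2.0

def frobenius_divergence(p, q):
    """Eq. (51): half the squared Frobenius norm of P - Q."""
    return 0.5 * sum((p[i][j] - q[i][j]) ** 2 for i in range(2) for j in range(2))

def congruence(l, p):
    """The transformation P -> L^T P L appearing in Eq. (71)."""
    return matmul(transpose(l), matmul(p, l))

# Illustrative values only.
P = [[2.0, 0.5], [0.5, 1.0]]
Q = [[1.0, 0.2], [0.2, 3.0]]
L = [[1.0, 2.0], [0.0, 0.5]]  # invertible but not orthogonal

d_before = kl_divergence(P, Q)
d_after = kl_divergence(congruence(L, P), congruence(L, Q))
f_before = frobenius_divergence(P, Q)
f_after = frobenius_divergence(congruence(L, P), congruence(L, Q))
```

In this example the KL-divergence is unchanged by L up to rounding, whereas the Frobenius divergence, which is only O(n)-invariant, changes.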

5. Non-Decomposable Divergence

We have focused on flat and decomposable divergences. There are many interesting
non-decomposable divergences. We first discuss a general class of flat divergences in $\mathbb{R}^n_+$ and then
touch upon interesting flat and non-flat divergences in the manifold of positive-definite matrices.

5.1. General Class of Flat Divergences in $\mathbb{R}^n_+$

We can describe a general class of flat divergences in $\mathbb{R}^n_+$, which are not necessarily decomposable.
This is introduced in [23], which studies the conformal structure of general total Bregman divergences
([11,13]). When $\mathbb{R}^n_+$ is endowed with a dually flat structure, it has a θ-coordinate system given by:

$$\theta = \rho(\xi) \tag{74}$$

which is not necessarily a componentwise function. Any pair of an invertible map θ = ρ(ξ) and a convex
function ψ(θ) defines a dually flat structure and, hence, a Bregman divergence in $\mathbb{R}^n_+$.
The dual coordinates η = τ(ξ) are given by:

$$\eta = \nabla\psi(\theta) \tag{75}$$

so that we have:

$$\eta = \tau(\xi) = \nabla\psi\{\rho(\xi)\}. \tag{76}$$

This implies that a pair (ρ, τ) of coordinate systems can define dually coupled affine coordinates and,
hence, a dually flat structure, when and only when $\eta = \tau\{\rho^{-1}(\theta)\}$ is a gradient of a convex function.
This is different from the case of decomposable divergence, where any monotone pair of ρ(ξ) and
τ(ξ) gives a dually flat structure.

5.2. Non-Decomposable Flat Divergence in PDn

Ohara and Eguchi [15,16] introduced the following function:

$$\psi_V(P) = V(\det|P|), \tag{77}$$

where V(ξ) is a monotonically decreasing scalar function. $\psi_V$ is convex when and only when:

$$1 + \frac{V''(\xi)\,\xi^2}{V'(\xi)} < \frac{1}{n}. \tag{78}$$

In such a case, we can introduce a dually flat structure to PDn, where P is an affine coordinate system
with convex $\psi_V(P)$, and the dual affine coordinate system is:

$$H = V'(\det|P|)\,P^{-1}. \tag{79}$$


The derived divergence is:

$$D_V[P : Q] = V(\det|P|) - V(\det|Q|) \tag{80}$$

$$\qquad + V'(\det|Q|)\,\mathrm{tr}\left\{Q^{-1}(Q - P)\right\}. \tag{81}$$

When V(ξ) = − log ξ, it reduces to the case of Equation (54), which is invariant under Gl(n) and
decomposable. However, the divergence $D_V[P : Q]$ is not decomposable in general. It is invariant under O(n)
and more strongly so under SGl(n) ⊂ Gl(n), defined by det |L| = ±1.

5.3. Flat Structure Derived from q-Escort Distribution

A dually flat structure is introduced in the manifold of probability distributions [4] as:

$$\tilde D_{\alpha}[p : q] = \frac{1}{1-q}\,\frac{1}{H_q(p)} \left( 1 - \sum_i p_i^{1-q}\,q_i^{q} \right), \tag{82}$$

where:

$$H_q(p) = \sum_i p_i^{q}, \tag{83}$$

$$q = \frac{1+\alpha}{2}. \tag{84}$$

The dual affine coordinates are the q-escort distribution [4]:

$$\eta_i = \frac{1}{H_q(p)}\,p_i^{q}. \tag{85}$$

The divergence, $\tilde D_q$, is flat, but not decomposable.
We can generalize it to the case of PDn,

$$\tilde D_q[P : Q] = \frac{1}{1-q}\,\frac{1}{\mathrm{tr}\,P^{q}} \left\{ (1-q)\,\mathrm{tr}(P) + q\,\mathrm{tr}(Q) - \mathrm{tr}\left(P^{1-q}Q^{q}\right) \right\}. \tag{86}$$

This is flat, but not decomposable.

5.4. γ-Divergence in PDn

The γ-divergence was introduced by Fujisawa and Eguchi [24]. It gives a super-robust estimator. It
is interesting to generalize it to PDn,

$$D_{\gamma}[P : Q] = \frac{1}{\gamma(\gamma-1)} \left\{ \log \mathrm{tr}\,P^{\gamma} - (\gamma-1) \log \mathrm{tr}\,Q^{\gamma-1} - \gamma \log \mathrm{tr}\,PQ^{\gamma-1} \right\}. \tag{87}$$

This is neither flat nor decomposable. It is a projective divergence in the sense that, for any c, c′ > 0,

$$D_{\gamma}\left[cP : c'Q\right] = D_{\gamma}[P : Q]. \tag{88}$$

Therefore, it can be defined in the submanifold of tr P = 1.

6. Concluding Remarks

We have shown that the (ρ, τ)-divergence introduced by Zhang [5] is a general dually flat
decomposable structure of the manifold of positive measures. We then extended it to the manifold
of positive-definite matrices, where the criterion of invariance under linear transformations (in
particular, under orthogonal transformations) was added. The decomposability is useful from the


computational point of view, because the θ-η transformation is tractable. This is the motivation for
studying decomposable flat divergences.
When we treat the manifold of probability distributions, it is a submanifold of the manifold of
positive measures, where the total sum of measures is restricted to one. This is a nonlinear constraint
in the θ- or η-coordinates, so that the manifold is not flat, but curved in general. Hence, our arguments
hold in this case only when at least one of the ρ and τ functions is linear. The U-divergence [21] and
β-divergence [19] are such cases. However, for clustering, we can take the average of the η-coordinates
of member probability distributions in the larger manifold of positive measures and then project it
to the manifold of probability distributions. This is called the exterior average, and the projection is
simply a normalization of the result. Therefore, the (ρ, τ)-structure is useful in the case of probability
distributions. The same situation holds in the case of positive-definite matrices.
Quantum information theory deals with positive-definite Hermitian matrices of trace one [25,26].
We need to extend our discussions to the case of complex matrices. The trace one constraint is not
linear with respect to θ- or η-coordinates, just as in the case of probability distributions. Many
interesting divergence functions have been introduced in the manifold of positive-definite Hermitian
matrices. It is an interesting future problem to apply our theory to quantum information theory.

Conflicts of Interest: The author declares no conflicts of interest.

References

1.
Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society and Oxford
University Press: Rhode Island, RI, USA, 2000.
2.
Tsallis, C. Introduction to Nonextensive Statistical Mechanics:
Approaching a Complex World; Springer:
Berlin/Heidelberg, Germany, 2009.
3.
Naudts, J. Generalized Thermostatistics; Springer: Berlin/Heidelberg, Germany, 2011.
4.
Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant,
dually-flat and conformal geometries. Physica A 2012, 391, 4308–4319.
5.
Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
6.
Zhang, J. Nonparametric information geometry: From divergence function to referential-representational
biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
7.
Cichocki, A.; Amari, S. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of
similarities. Entropy 2010, 12, 1532–1568.
8.
Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust
nonnegative matrix factorization. Entropy 2011, 13, 134–170.
9.
Minami, M.; Eguchi, S. Robust blind source separation by beta-divergence. Neural Comput. 2002, 14,
1859–1886.
10.
Banerjee, A.; Merugu, S.; Dhillon, I.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res.
2005, 6, 1705–1749.
11.
Liu, M.; Vemuri, B.C.; Amari, S.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering.
IEEE Trans. Pattern Anal. Mach. Learn. 2012, 24, 3192–3212.
12.
Dhillon, I.S.; Tropp, J.A. Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl.
2007, 29, 1120–1146.
13.
Vemuri, B.C.; Liu, M.; Amari, S.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis.
IEEE Trans. Med. Imaging 2011, 30, 475–483.
14.
Ohara, A.; Suda, N.; Amari, S. Dualistic differential geometry of positive definite matrices and its applications
to related problems. Linear Algebra Appl. 1996, 247, 31–53.
15.
Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by
beta-divergence. Entropy 2013, 15, 4732–4747.
16.
Ohara, A.; Eguchi, S. Geometry on positive definite matrices induced from V-potential functions. In Geometric
Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2013;
pp. 621–629.


17.
Chebbi, Z.; Moakher, M. Means of Hermitian positive-definite matrices based on the
log-determinant alpha-divergence function. Linear Algebra Appl. 2012, 436, 1872–1889.
18.
Tsuda, K.; Ratsch, G.; Warmuth, M.K. Matrix exponentiated gradient updates for on-line learning and
Bregman projection. J. Mach. Learn. Res. 2005, 6, 995–1018.
19.
Nock, R.; Magdalou, B.; Briys, E.; Nielsen, F. Mining matrix data with Bregman matrix divergences for
portfolio selection. In Matrix Information Geometry; Nielsen, F., Bhatia, R., Eds.; Springer: Berlin/Heidelberg,
Germany, 2013; Chapter 15, pp. 373–402.
20.
Nielsen, F., Bhatia, R., Eds. Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013.
21.
Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expo. 2006, 19, 197–216.
22.
Amari, S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes.
IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
23.
Nock, R.; Nielsen, F.; Amari, S. On conformal divergences and their population minimizers. IEEE Trans. Inf.
Theory 2014, submitted for publication.
24.
Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination.
J. Multivar. Anal. 2008, 99, 2053–2081.
25.
Petz, D. Monotone metrics on matrix spaces. Linear Algebra Appl. 1996, 244, 81–96.
26.
Hasegawa, H. α-divergence of the non-commutative information geometry. Rep. Math. Phys. 1993, 33, 87–93.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
F-Geometry and Amari’s α−Geometry on a
Statistical Manifold

Harsha K. V. * and Subrahamanian Moosath K S *

Indian Institute of Space Science and Technology, Department of Space, Government of India, Valiamala P.O,
Thiruvananthapuram-695547, Kerala, India
*
E-Mails: harsha.11@iist.ac.in (K.V.H.); smoosath@iist.ac.in (K.S.S.M.); Tel.: +91-95-6736-0425 (K.V.H.);
+91-94-9574-3148 (K.S.S.M.).

Received: 13 December 2013; in revised form: 21 April 2014 / Accepted: 25 April 2014 /
Published: 6 May 2014

Abstract: In this paper, we introduce a geometry called F-geometry on a statistical manifold S using
an embedding F of S into the space RX of random variables. Amari’s α−geometry is a special
case of F−geometry. Then using the embedding F and a positive smooth function G, we introduce
(F, G)−metric and (F, G)−connections that enable one to consider weighted Fisher information
metric and weighted connections. The necessary and sufficient condition for two (F, G)−connections
to be dual with respect to the (F, G)−metric is obtained. Then we show that Amari’s 0−connection is
the only self dual F−connection with respect to the Fisher information metric. Invariance properties
of the geometric structures are discussed, and we prove that Amari’s α−connections are the only
F−connections that are invariant under smooth one-to-one transformations of the random variables.

Keywords:
embedding; Amari’s α−connections;
F−metric;
F−connections; (F, G)−metric;
(F, G)−connections; invariance

1. Introduction

The geometric study of statistical estimation has opened up an interesting new area called
information geometry. Information geometry has made remarkable progress through the works of
Amari [1,2] and his colleagues [3,4]. In the last few years, many authors have contributed considerably
to this area [5–9]. Information geometry has a wide variety of applications in other areas of engineering
and science, such as neural networks, machine learning, biology, mathematical finance, control system
theory, quantum systems, statistical mechanics, etc.
A statistical manifold of probability distributions is equipped with a Riemannian metric and a pair
of dual affine connections [2,4,9]. It was Rao [10] who introduced the idea of using Fisher information
as a Riemannian metric in the manifold of probability distributions. Chentsov [11] introduced a family
of affine connections on a statistical manifold defined on finite sets. Amari [2] introduced a family of
affine connections called α−connections using a one parameter family of functions, the α−embeddings.
These α−connections are equivalent to those defined by Chentsov. The Fisher information metric and
these affine connections are characterized by invariance with respect to the sufficient statistic [4,12] and
play a vital role in the theory of statistical estimation. Zhang [13] generalized Amari’s α−representation
and using this general representation together with a convex function he defined a family of divergence
functions from the point of view of representational and referential duality. The Riemannian metric
and dual connections are defined using these divergence functions.
In this paper, Amari’s idea of using α−embeddings to define geometric structures is extended to
a general embedding. This paper is organized as follows. In Section 2, we define an affine connection
called F−connection and a Riemannian metric called F−metric using a general embedding F of
a statistical manifold S into the space of random variables. We show that F−metric is the Fisher

Entropy 2014, 16, 2472–2487; doi:10.3390/e16052472
www.mdpi.com/journal/entropy

information metric and Amari’s α−geometry is a special case of F−geometry. Further, we introduce
(F, G)−metric and (F, G)−connections using the embedding F and a positive smooth function G.
In Section 3, a necessary and sufficient condition for two (F, G)−connections to be dual with
respect to the (F, G)−metric is derived and we prove that Amari’s 0−connection is the only self
dual F−connection with respect to the Fisher information metric. Then we prove that the set of all
positive finite measures on X, for a finite X, has an F−affine manifold structure for any embedding F.
In Section 4, invariance properties of the geometric structures are discussed. We prove that the
Fisher information metric and Amari’s α−connections are invariant under both the transformation
of the parameter and the transformation of the random variable. Further we show that Amari’s
α−connections are the only F−connections that are invariant under both the transformation of the
parameter and the transformation of the random variable.
Let (X, B) be a measurable space, where X is a non-empty subset of R and B is the σ-field of
subsets of X. Let RX be the space of all real valued measurable functions defined on (X, B). Consider
an n−dimensional statistical manifold S = {p(x; ξ) | ξ = [ξ1, ..., ξn] ∈ E ⊆ Rn}, with coordinates
ξ = [ξ1, ..., ξn], defined on X. S is a subset of P(X), the set of all probability measures on X given by

$$\mathcal{P}(X) := \left\{ p : X \to \mathbb{R} \;\middle|\; p(x) > 0\ (\forall\, x \in X);\ \int_X p(x)\,dx = 1 \right\}. \tag{1}$$

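The tangent space to S at a point pξ is given by

$$T_\xi(S) = \left\{ \sum_{i=1}^{n} \alpha^i \partial_i \;\middle|\; \alpha^i \in \mathbb{R} \right\}, \quad \text{where } \partial_i = \frac{\partial}{\partial \xi^i}. \tag{2}$$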

Define ℓ(x; ξ) = log p(x; ξ) and consider the partial derivatives $\{\partial\ell/\partial\xi^i = \partial_i\ell;\ i = 1, \dots, n\}$, which are
called scores. For the statistical manifold S, the scores ∂iℓ are linearly independent functions in x for a fixed ξ.
Let $T^1_\xi(S)$ be the n-dimensional vector space spanned by the n functions {∂iℓ ; i = 1, ..., n} in x. So

$$T^1_\xi(S) = \left\{ \sum_{i=1}^{n} A^i \partial_i\ell \;\middle|\; A^i \in \mathbb{R} \right\}. \tag{3}$$

Then there is a natural isomorphism between these two vector spaces $T_\xi(S)$ and $T^1_\xi(S)$ given by

$$\partial_i \in T_\xi(S) \longleftrightarrow \partial_i\ell(x;\xi) \in T^1_\xi(S). \tag{4}$$

Obviously, a tangent vector $A = \sum_{i=1}^{n} A^i\partial_i \in T_\xi(S)$ corresponds to a random variable
$A(x) = \sum_{i=1}^{n} A^i\partial_i\ell(x;\xi) \in T^1_\xi(S)$ having the same components $A^i$. Note that $T_\xi(S)$ is the differentiation
operator representation of the tangent space, while $T^1_\xi(S)$ is the random variable representation of the
same tangent space. The space $T^1_\xi(S)$ is called the 1-representation of the tangent space.
Let A and B be two tangent vectors in $T_\xi(S)$ and A(x) and B(x) be the 1−representations of A and B
respectively. We can define an inner product on each tangent space $T_\xi(S)$ by

$$g_\xi(A, B) = \langle A, B\rangle_\xi = E_\xi[A(x)B(x)] = \int A(x)B(x)\,p(x;\xi)\,dx. \tag{5}$$

In particular, the inner product of the basis vectors ∂i and ∂j is

$$g_{ij}(\xi) = \langle \partial_i, \partial_j\rangle_\xi = E_\xi[\partial_i\ell\,\partial_j\ell] = \int \partial_i\ell(x;\xi)\,\partial_j\ell(x;\xi)\,p(x;\xi)\,dx. \tag{6}$$


Note that g =<, > defines a Riemannian metric on S called the Fisher information metric.
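As a small numerical illustration of Equation (6), the sketch below computes the Fisher information of a one-parameter Bernoulli model. This model is an assumption chosen for simplicity: X = {0, 1} with the counting measure replaces the integral by a sum, and the score is approximated by a central difference. The function names are hypothetical.

```python
import math

def log_p(x, xi):
    """Log-probability of a Bernoulli model with success probability xi."""
    return math.log(xi) if x == 1 else math.log(1.0 - xi)

def score(x, xi, h=1e-6):
    """Central-difference approximation of the score d/dxi log p(x; xi)."""
    return (log_p(x, xi + h) - log_p(x, xi - h)) / (2.0 * h)

def fisher_information(xi):
    """g(xi) = E_xi[(d log p)^2], with the expectation taken as a sum over x in {0, 1}."""
    probabilities = {1: xi, 0: 1.0 - xi}
    return sum(probabilities[x] * score(x, xi) ** 2 for x in (0, 1))

g_numeric = fisher_information(0.3)
g_closed_form = 1.0 / (0.3 * 0.7)  # known closed form 1 / (xi (1 - xi)) for this model
```

The two values agree up to the finite-difference error, which gives a quick sanity check for any implementation of the metric.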
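On the Riemannian manifold (S, g = ⟨ , ⟩), define $n^3$ functions $\Gamma_{ijk}$ by

$$\Gamma_{ijk}(\xi) = E_\xi\left[(\partial_i\partial_j\ell(x;\xi))(\partial_k\ell(x;\xi))\right]. \tag{7}$$

These functions $\Gamma_{ijk}$ uniquely determine an affine connection ∇ on S by

$$\Gamma_{ijk}(\xi) = \langle \nabla_{\partial_i}\partial_j, \partial_k\rangle_\xi. \tag{8}$$

∇ is called the 1−connection or the exponential connection.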
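Amari [2] defined a one parameter family of functions called the α−embeddings given by

$$L_\alpha(p) = \begin{cases} \dfrac{2}{1-\alpha}\,p^{\frac{1-\alpha}{2}}, & \alpha \neq 1 \\[1ex] \log p, & \alpha = 1 \end{cases} \tag{9}$$

Using these, we can define $n^3$ functions $\Gamma^{\alpha}_{ijk}$ by

$$\Gamma^{\alpha}_{ijk} = \int \partial_i\partial_j L_\alpha(p(x;\xi))\,\partial_k L_{-\alpha}(p(x;\xi))\,dx. \tag{10}$$

These $\Gamma^{\alpha}_{ijk}$ uniquely determine affine connections $\nabla^{\alpha}$ on the statistical manifold S by

$$\Gamma^{\alpha}_{ijk} = \langle \nabla^{\alpha}_{\partial_i}\partial_j, \partial_k\rangle \tag{11}$$

which are called α−connections.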

2. F−Geometry of a Statistical Manifold

On a statistical manifold S, the Fisher information metric and exponential connection are defined
using the log embedding. In a similar way, α−connections are defined using a one parameter family of
functions, the α−embeddings. In general, we can give other geometric structures on S using different
embeddings of the manifold S into the space of random variables RX.
Let F : (0, ∞) → R be an injective function that is at least twice differentiable. Thus we have
F′(u) ≠ 0, ∀ u ∈ (0, ∞). F is an embedding of S into RX that takes each p(x; ξ) ↦ F(p(x; ξ)).
Denote F(p(x; ξ)) by F(x; ξ); then ∂iF can be written as

$$\partial_i F(x;\xi) = p(x;\xi)\,F'(p(x;\xi))\,\partial_i\ell(x;\xi). \tag{12}$$

It is clear that ∂iF(x; ξ); i = 1, ..., n are linearly independent functions in x for fixed ξ since
∂iℓ(x; ξ); i = 1, ..., n are linearly independent. Let $T_{F(p_\xi)}F(S)$ be the n-dimensional vector space
spanned by the n functions ∂iF; i = 1, ..., n in x for fixed ξ. So

$$T_{F(p_\xi)}F(S) = \left\{ \sum_{i=1}^{n} A^i \partial_i F \;\middle|\; A^i \in \mathbb{R} \right\}. \tag{13}$$

Let the tangent space $T_{F(p_\xi)}(F(S))$ to F(S) at the point F(pξ) be denoted by $T^F_\xi(S)$. There is a natural
isomorphism between the two vector spaces $T_\xi(S)$ and $T^F_\xi(S)$ given by

$$\partial_i \in T_\xi(S) \longleftrightarrow \partial_i F(x;\xi) \in T^F_\xi(S). \tag{14}$$

$T^F_\xi(S)$ is called the F−representation of the tangent space $T_\xi(S)$.


For any $A = \sum_{i=1}^{n} A^i\partial_i \in T_\xi(S)$, the corresponding $A(x) = \sum_{i=1}^{n} A^i\partial_i F \in T^F_\xi(S)$ is called the
F−representation of the tangent vector A and is denoted by $A^F(x)$. Note that $T^F_\xi(S) \subseteq T_{F(p_\xi)}(\mathbb{R}^X)$.
Since RX is a vector space, its tangent space $T_{F(p_\xi)}(\mathbb{R}^X)$ can be identified with RX. So $T^F_\xi(S) \subseteq \mathbb{R}^X$.

Definition 1. The F−expectation of a random variable f with respect to the distribution p(x; ξ) is
defined as

$$E^F_\xi(f) = \int f(x)\,\frac{1}{p\,(F'(p))^2}\,dx. \tag{15}$$

We can use this F−expectation to define an inner product in RX by

$$\langle f, g\rangle^F_\xi = E^F_\xi[f(x)g(x)], \tag{16}$$

which induces an inner product on $T_\xi(S)$ by

$$\langle A, B\rangle^F_\xi = E^F_\xi[A^F(x)B^F(x)]; \quad A, B \in T_\xi(S). \tag{17}$$

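Proposition 1. The induced metric $\langle\,,\rangle^F$ on S is the Fisher information metric g = ⟨ , ⟩ on S.

Proof. For any basis vectors $\partial_i, \partial_j \in T_\xi(S)$,

$$\begin{aligned}
\langle \partial_i, \partial_j\rangle^F_\xi &= E^F_\xi[\partial_i F\,\partial_j F] \\
&= \int \partial_i F\,\partial_j F\,\frac{1}{p\,(F'(p))^2}\,dx \\
&= \int (p\,F'(p)\,\partial_i\ell)\,(p\,F'(p)\,\partial_j\ell)\,\frac{1}{p\,(F'(p))^2}\,dx \\
&= \int \partial_i\ell\,\partial_j\ell\;p(x;\xi)\,dx \\
&= E_\xi[\partial_i\ell\,\partial_j\ell] \\
&= g_{ij}(\xi) \\
&= \langle \partial_i, \partial_j\rangle_\xi.
\end{aligned} \tag{18}$$

So the metric $\langle\,,\rangle^F$ on S induced by the embedding F of S into RX is the Fisher information metric
g = ⟨ , ⟩ on S.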
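The cancellation in Equation (18) can be checked numerically on a finite sample space. The sketch below is illustrative only: it assumes a one-parameter exponential family on three points (the statistic values T are hypothetical) and replaces integrals by sums. Each embedding is passed through its derivative F′, since only F′ enters the F−metric.

```python
import math

T = (0.0, 1.0, 3.0)  # hypothetical statistic values on a three-point sample space

def probs(xi):
    """p(x; xi) proportional to exp(xi * T[x])."""
    weights = [math.exp(xi * t) for t in T]
    z = sum(weights)
    return [w / z for w in weights]

def d_probs(xi):
    """Derivative of p(x; xi) with respect to xi: p * (T - E[T])."""
    p = probs(xi)
    mean_t = sum(pi * t for pi, t in zip(p, T))
    return [pi * (t - mean_t) for pi, t in zip(p, T)]

def fisher_metric(xi):
    """E[(d log p)^2] as in Eq. (6)."""
    return sum(pi * (dpi / pi) ** 2 for pi, dpi in zip(probs(xi), d_probs(xi)))

def f_metric(xi, d_f):
    """Sum of (dF)^2 / (p F'(p)^2), where dF = F'(p) dp (Eqs. 12 and 15-17)."""
    return sum(
        (d_f(pi) * dpi) ** 2 / (pi * d_f(pi) ** 2)
        for pi, dpi in zip(probs(xi), d_probs(xi))
    )

# Derivatives F' of three example embeddings: log p, sqrt(p), p**2.
embedding_derivatives = {
    "log": lambda u: 1.0 / u,
    "sqrt": lambda u: 0.5 / math.sqrt(u),
    "square": lambda u: 2.0 * u,
}
results = {name: f_metric(0.7, d_f) for name, d_f in embedding_derivatives.items()}
reference = fisher_metric(0.7)
```

All three embeddings return the same value as the Fisher metric, up to rounding, as Proposition 1 states.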

We can induce a connection on S using the embedding F.
Let $\pi^F|_{p_\xi} : \mathbb{R}^X \to T^F_\xi(S)$ be the projection map.

Definition 2. The connection induced by the embedding F on S, the F−connection, is defined as

$$\nabla^F_{\partial_i}\partial_j = \pi^F|_{p_\xi}(\partial_i\partial_j F) = \sum_n \sum_m g^{mn}\,\langle \partial_i\partial_j F, \partial_m F\rangle^F_\xi\;\partial_n, \tag{19}$$

where $[g^{mn}(\xi)]$ is the inverse of the Fisher information matrix $G(\xi) = [g_{mn}(\xi)]$. Note that the F−connections
are symmetric.

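Lemma 1. The F−connection and its components can be written in terms of scores as

$$\nabla^F_{\partial_i}\partial_j = \sum_n \sum_m g^{mn}\,E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_m\ell)\right]\partial_n \tag{20}$$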


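and

$$\Gamma^F_{ijk}(\xi) = E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right]. \tag{21}$$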

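Proof. From Equation (12), we have

$$\partial_i\partial_j F = pF'(p)\,\partial_i\partial_j\ell + \left[pF'(p) + p^2F''(p)\right]\partial_i\ell\,\partial_j\ell. \tag{22}$$

Therefore

$$\begin{aligned}
\langle \partial_i\partial_j F, \partial_m F\rangle^F_\xi &= \int \partial_i\partial_j F\;\partial_m F\,\frac{1}{p\,(F'(p))^2}\,dx \\
&= \int \left\{ pF'(p)\,\partial_i\partial_j\ell + \left[pF'(p) + p^2F''(p)\right]\partial_i\ell\,\partial_j\ell \right\} \frac{\partial_m\ell}{F'(p)}\,dx \\
&= \int \left\{ \partial_i\partial_j\ell\,\partial_m\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\,\partial_m\ell \right\} p\,dx \\
&= E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_m\ell)\right].
\end{aligned} \tag{23}$$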

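Hence we can write

$$\nabla^F_{\partial_i}\partial_j = \pi^F|_{p_\xi}(\partial_i\partial_j F) = \sum_n \sum_m g^{mn}\,E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_m\ell)\right]\partial_n. \tag{24}$$

Then we have the Christoffel symbols of the F−connection

$$\Gamma^{n}_{ij} = \sum_m g^{mn}\,E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_m\ell)\right] \tag{25}$$

and components of the F−connection are given by

$$\Gamma^F_{ijk}(\xi) = \langle \nabla^F_{\partial_i}\partial_j, \partial_k\rangle_\xi = E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right]. \tag{26}$$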

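Theorem 1. Amari’s α−geometry is a special case of the F−geometry.

Proof. Let F(p) = Lα(p), where Lα(p) is the α−embedding of Amari.
The components $\Gamma^{\alpha}_{ijk}$ of the α−connection are given by

$$\Gamma^{\alpha}_{ijk}(\xi) = \langle \nabla^{\alpha}_{\partial_i}\partial_j, \partial_k\rangle_\xi = E_\xi\left[\left(\partial_i\partial_j\ell + \frac{1-\alpha}{2}\,\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right]. \tag{27}$$

From Equation (26), when F(p) = Lα(p)
we have

$$F'(p) = L'_\alpha(p) = p^{-\left(\frac{1+\alpha}{2}\right)}, \tag{28}$$

$$F''(p) = L''_\alpha(p) = -\frac{1+\alpha}{2}\,p^{-\left(\frac{3+\alpha}{2}\right)}. \tag{29}$$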


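Then we get

$$1 + \frac{pF''(p)}{F'(p)} = 1 + \frac{pL''_\alpha(p)}{L'_\alpha(p)} = \frac{1-\alpha}{2}. \tag{30}$$

Hence

$$\begin{aligned}
\Gamma^F_{ijk}(\xi) = \langle \nabla^F_{\partial_i}\partial_j, \partial_k\rangle_\xi &= E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right] \\
&= E_\xi\left[\left(\partial_i\partial_j\ell + \frac{1-\alpha}{2}\,\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right] \\
&= \Gamma^{\alpha}_{ijk}(\xi),
\end{aligned} \tag{31}$$

which are the components of the α−connection. Hence the F−connection reduces to the α−connection.
Thus we obtain that α−geometry is a special case of F−geometry.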
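Equation (30) can be verified numerically. The sketch below approximates F′ and F″ by central differences for the α−embedding of Equation (9) and evaluates the coefficient 1 + pF″(p)/F′(p) that distinguishes one F−connection from another in Equation (26). The function names and the step size are illustrative assumptions.

```python
import math

def alpha_embedding(alpha):
    """Return L_alpha of Eq. (9) as a Python function of p > 0."""
    if alpha == 1:
        return math.log
    c = (1.0 - alpha) / 2.0
    return lambda p: p ** c / c  # equals 2 / (1 - alpha) * p ** ((1 - alpha) / 2)

def connection_coefficient(f, p, h=1e-4):
    """Approximate 1 + p F''(p) / F'(p) using central differences."""
    first = (f(p + h) - f(p - h)) / (2.0 * h)
    second = (f(p + h) - 2.0 * f(p) + f(p - h)) / (h * h)
    return 1.0 + p * second / first

# The coefficient should equal (1 - alpha) / 2 for every p in (0, 1).
coefficients = {a: connection_coefficient(alpha_embedding(a), 0.4) for a in (-1, 0, 0.5, 1, 3)}
```

For α = 1 the coefficient is 0, so the F−connection of the log embedding is the exponential connection of Equation (7).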

Remark 1. Burbea [14] introduced the concept of weighted Fisher information metric using a positive
continuous function. We use this idea to define weighted F−metric and weighted F−connections. Let
G : (0, ∞) → R be a positive smooth function and F be an embedding, and define the (F, G)−expectation of a
random variable with respect to the distribution pξ as

$$E^{F,G}_\xi(f) = \int f(x)\,\frac{G(p)}{p\,(F'(p))^2}\,dx. \tag{32}$$

Define the (F, G)−metric $\langle\,,\rangle^{F,G}_\xi$ in $T_{p_\xi}(S)$ by

$$\begin{aligned}
\langle \partial_i, \partial_j\rangle^{F,G}_\xi &= E^{F,G}_\xi[\partial_i F\,\partial_j F] \\
&= \int \partial_i F\,\partial_j F\,\frac{G(p)}{p\,(F'(p))^2}\,dx \\
&= \int \partial_i\ell\,\partial_j\ell\;G(p)\,p\,dx \\
&= E_\xi[G(p)\,\partial_i\ell\,\partial_j\ell].
\end{aligned} \tag{33}$$

Define the (F, G)−connection as

$$\Gamma^{F,G}_{ijk} = \langle \nabla^{F,G}_{\partial_i}\partial_j, \partial_k\rangle_\xi = E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\,G(p)\right]. \tag{34}$$

When G(p) = 1, the (F, G)−connection reduces to the F−connection and the metric $\langle\,,\rangle^{F,G}$ reduces to the Fisher
information metric. This is a more general way of defining Riemannian metrics and affine connections on a
statistical manifold.

3. Dual Affine Connections

Definition 3. Let M be a Riemannian manifold with a Riemannian metric g. Two affine connections, ∇ and
∇∗ on the tangent bundle are said to be dual connections with respect to the metric g if

$$
Z\,g(X, Y) = g(\nabla_Z X, Y) + g(X, \nabla^*_Z Y)
\tag{35}
$$

holds for any vector fields X, Y, Z on M.


Theorem 2. Let F, H be two embeddings of the statistical manifold S into the space $\mathbb{R}^X$ of random variables. Let G
be a positive smooth function on (0, ∞). Then the (F, G)−connection $\nabla^{F,G}$ and the (H, G)−connection $\nabla^{H,G}$
are dual connections with respect to the (F, G)−metric iff the functions F and H satisfy

$$
H'(p) = \frac{G(p)}{p\,F'(p)}.
\tag{36}
$$

We call such an embedding H a G−dual embedding of F.
The components of the dual connection $\nabla^{H,G}$ can be written as

$$
\Gamma^{H,G}_{ijk} = \int \Big\{ \partial_i\partial_j\ell + \Big(\frac{pG'(p)}{G(p)} - \frac{pF''(p)}{F'(p)}\Big)\partial_i\ell\,\partial_j\ell \Big\}\,\partial_k\ell\;G(p)\,p\,dx.
\tag{37}
$$

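Proof. That $\nabla^{F,G}$ and $\nabla^{H,G}$ are dual connections with respect to the (F, G)−metric means that

$$
\partial_k \langle \partial_i, \partial_j\rangle^{F,G}
= \langle \nabla^{F,G}_{\partial_k}\partial_i, \partial_j\rangle^{F,G} + \langle \partial_i, \nabla^{H,G}_{\partial_k}\partial_j\rangle^{F,G}
\tag{38}
$$

for any basis vectors ∂i, ∂j, ∂k ∈ Tξ(S). We have

$$
\begin{aligned}
\partial_k \langle \partial_i, \partial_j\rangle^{F,G}
&= \int \partial_k\partial_j\ell\,\partial_i\ell\;p\,G(p)\,dx + \int \partial_k\partial_i\ell\,\partial_j\ell\;p\,G(p)\,dx \\
&\quad + \int \Big(1 + \frac{pG'(p)}{G(p)}\Big)\partial_i\ell\,\partial_j\ell\,\partial_k\ell\;p\,G(p)\,dx.
\end{aligned}
\tag{39}
$$

$$
\begin{aligned}
\langle \nabla^{F,G}_{\partial_k}\partial_i, \partial_j\rangle^{F,G} + \langle \partial_i, \nabla^{H,G}_{\partial_k}\partial_j\rangle^{F,G}
&= \int \partial_k\partial_i\ell\,\partial_j\ell\;p\,G(p)\,dx \\
&\quad + \int \Big(1 + \frac{pF''(p)}{F'(p)}\Big)\partial_i\ell\,\partial_j\ell\,\partial_k\ell\;p\,G(p)\,dx \\
&\quad + \int \Big(1 + \frac{pH''(p)}{H'(p)}\Big)\partial_i\ell\,\partial_j\ell\,\partial_k\ell\;p\,G(p)\,dx \\
&\quad + \int \partial_k\partial_j\ell\,\partial_i\ell\;p\,G(p)\,dx.
\end{aligned}
\tag{40}
$$

Then the condition (38) holds iff

$$
\int \Big[2 + \frac{pF''(p)}{F'(p)} + \frac{pH''(p)}{H'(p)}\Big]\partial_i\ell\,\partial_j\ell\,\partial_k\ell\;p\,G(p)\,dx
= \int \Big[1 + \frac{pG'(p)}{G(p)}\Big]\partial_i\ell\,\partial_j\ell\,\partial_k\ell\;p\,G(p)\,dx
\tag{41}
$$

$$
\Longleftrightarrow\quad 2 + \frac{pF''(p)}{F'(p)} + \frac{pH''(p)}{H'(p)} = 1 + \frac{pG'(p)}{G(p)}
\tag{42}
$$

$$
\Longleftrightarrow\quad 1 + \frac{pH''(p)}{H'(p)} = \frac{pG'(p)}{G(p)} - \frac{pF''(p)}{F'(p)}
\tag{43}
$$

$$
\Longleftrightarrow\quad \frac{H''(p)}{H'(p)} = \frac{G'(p)}{G(p)} - \frac{F''(p)}{F'(p)} - \frac{1}{p}
\quad\Longleftrightarrow\quad H'(p) = \frac{G(p)}{p\,F'(p)}.
\tag{44}
$$

Hence $\nabla^{F,G}$ and $\nabla^{H,G}$ are dual connections with respect to the (F, G)−metric iff Equation (36) holds.
From Equation (43), we can rewrite the components of the dual connection $\nabla^{H,G}$ as

$$
\Gamma^{H,G}_{ijk} = \int \Big\{ \partial_i\partial_j\ell + \Big(\frac{pG'(p)}{G(p)} - \frac{pF''(p)}{F'(p)}\Big)\partial_i\ell\,\partial_j\ell \Big\}\,\partial_k\ell\;G(p)\,p\,dx.
\tag{45}
$$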


Corollary 1. Amari’s 0−connection is the only self dual F−connection with respect to the Fisher information
metric.

Proof. From Theorem 2, for G(p) = 1 the F−connection $\nabla^F$ and the H−connection $\nabla^H$ are dual
connections with respect to the Fisher information metric iff the functions F and H satisfy

$$
H'(p) = \frac{1}{p\,F'(p)}.
\tag{46}
$$

Thus the F−connection $\nabla^F$ is self dual iff the embedding F satisfies the condition

$$
F'(p) = \frac{1}{p\,F'(p)} \iff F'(p) = p^{-\frac{1}{2}} \iff F(p) = 2p^{\frac{1}{2}} = L_0(p).
\tag{47}
$$

That is, Amari’s 0−connection is the only self dual F−connection with respect to the Fisher
information metric.
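
The duality condition (36) and the self-duality argument of Corollary 1 can be illustrated numerically. The sketch below builds H′ from F′ and G via Equation (36), taking F = Lα and G ≡ 1 as an assumed example, and compares the result with the derivative of the (−α)−embedding. The helper names are hypothetical.

```python
def alpha_embedding_derivative(alpha):
    """Return the function p -> L'_alpha(p) = p^(-(1 + alpha)/2), Equation (28)."""
    return lambda p: p ** (-(1.0 + alpha) / 2.0)

def dual_embedding_derivative(F_prime, G=lambda p: 1.0):
    """Return the function p -> H'(p) = G(p) / (p F'(p)), Equation (36)."""
    return lambda p: G(p) / (p * F_prime(p))

for alpha in (-1.0, 0.0, 0.5, 3.0):
    H_prime = dual_embedding_derivative(alpha_embedding_derivative(alpha))
    minus_alpha = alpha_embedding_derivative(-alpha)
    # Largest discrepancy between H' and L'_{-alpha} on a few test points.
    gap = max(abs(H_prime(p) - minus_alpha(p)) for p in (0.1, 0.5, 2.0))
    print(alpha, gap)
```

In this example the G−dual of Lα is L−α, so within the α−family the dual embedding coincides with the original one only when α = 0, in line with Corollary 1.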

So far, we have considered the statistical manifold S as a subset of P(X), the set of all probability
measures on X. Now we relax the condition $\int p(x)\,dx = 1$, and consider S as a subset of $\tilde{P}(X)$, which
is defined by

$$
\tilde{P}(X) := \Big\{ p : X \longrightarrow \mathbb{R} \;\Big|\; p(x) > 0 \ (\forall\, x \in X);\ \int_X p(x)\,dx < \infty \Big\}.
\tag{48}
$$

Definition 4. Let M be a Riemannian manifold with a Riemannian metric g. Let ∇ be an affine connection on
M. If there exists a coordinate system [θi] of M such that ∇∂i∂j = 0 then we say that ∇ is flat, or alternatively
M is flat with respect to ∇, and we call such a coordinate system [θi] an affine coordinate system for ∇.

Definition 5. Let S = {p(x; ξ) / ξ = [ξ1, ..., ξn] ∈ E ⊆ Rn} be an n−dimensional statistical manifold. If
for some coordinate system [θi]; i = 1, ..., n

$$
\partial_i\partial_j F(p(x; \theta)) = 0
\tag{49}
$$

then we can see from Equation (19) that [θi] is an F−affine coordinate system and that S = {pθ} is F−flat. We
call such an S an F−affine manifold.
The condition (49) is equivalent to the existence of functions C, F1, ..., Fn on X such that

$$
F(p(x; \theta)) = C(x) + \sum_{i=1}^{n} \theta^i F_i(x).
\tag{50}
$$

Theorem 3. For any embedding F, $\tilde{P}(X)$ is an F−affine manifold for finite X.

Proof. Let X = {x1, ..., xn} be a finite set constituted by n elements. Let Fi : X −→ R be the functions
defined by Fi(xj) = δij for i, j = 1, ..., n. Let us define n coordinates [θi] by

$$
\theta^i = F(p(x_i)).
\tag{51}
$$

Then we get $F(p(x)) = \sum_{i=1}^{n} \theta^i F_i(x)$. Therefore $\tilde{P}(X)$ is an F−affine manifold for any
embedding F(p).
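
The construction in the proof of Theorem 3 can be sketched in a few lines. The example below assumes F = log and a three-point sample space purely for illustration; the function names are hypothetical.

```python
import math

def f_affine_coordinates(p, F):
    """theta_i = F(p(x_i)) for a positive measure p on a finite set, Equation (51)."""
    return [F(mass) for mass in p]

def embed(theta, j):
    """Evaluate sum_i theta_i F_i(x_j), where F_i(x_j) is the Kronecker delta."""
    return sum(t * (1.0 if i == j else 0.0) for i, t in enumerate(theta))

p = [0.2, 1.5, 0.7]   # a positive measure on three points; it need not sum to one
theta = f_affine_coordinates(p, math.log)
print([embed(theta, j) for j in range(len(p))])
print([math.log(mass) for mass in p])
```

The two printed lists agree, so F(p(x)) is linear in the coordinates θ, which is the form required by Equation (50) with C = 0.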

Remark 2. Zhang [13] introduced ρ-representation, which is a generalization of α-representation of Amari.
Zhang’s geometry is defined using this ρ-representation together with a convex function. Zhang also defined the
ρ-affine family of density functions and discussed its dually flat structure. The F−geometry defined using a
general F-representation is different from Zhang’s geometry. The metric defined in the F-embedding approach
is the Fisher information metric and the Riemannian metric defined using the ρ-representation is different from
the Fisher information metric. The F-connections defined are not in general dually flat and are different from the
dual connections defined by Zhang.

Remark 3. On a statistical manifold S, we introduced a dualistic structure (g, ∇F, ∇H), where g is the Fisher
information metric and ∇F, ∇H are the dual connections with respect to the Fisher information metric. Since
F-connections are symmetric, the manifold S is flat with respect to ∇F iff S is flat with respect to ∇H. Thus if
S is flat with respect to ∇F, then (S, g, ∇F, ∇H) is a dually flat space. The dually flat spaces are important in
statistical estimation [4].

4. Invariance of the Geometric Structures

For the statistical manifold S = {p(x; ξ) | ξ ∈ E ⊆ Rn}, the parameters are merely labels attached
to each point p ∈ S, hence the intrinsic geometric properties should be independent of these labels.
Consequently, it is natural to consider the invariance properties of the geometric structures under
suitable transformations of the variables in a statistical manifold. Here we consider two kinds of
invariance of the geometric structures: covariance under re-parametrization of the parameters of the
manifold, and invariance under transformations of the random variable [15]. Now let us investigate
the invariance properties of the F-geometric structures defined in Section 2.

4.1. Covariance under Re-Parametrization

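Let [θi] and [ηj] be two coordinate systems on S, which are related by an invertible transformation
η = η(θ). Let us denote $\partial_i = \frac{\partial}{\partial\theta^i}$ and $\tilde\partial_j = \frac{\partial}{\partial\eta^j}$. Let the coordinate expressions of the metric g be given
by $g_{ij} = \langle \partial_i, \partial_j\rangle$ and $\tilde g_{ij} = \langle \tilde\partial_i, \tilde\partial_j\rangle$. Let the components of the connection ∇ with respect to the
coordinates [θi] and [ηj] be given by $\Gamma_{ijk}$, $\tilde\Gamma_{ijk}$ respectively.
Then the covariance of the metric g and the connection ∇ under the re-parametrization means that

$$
\tilde g_{ij} = \sum_m \sum_n \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,g_{mn}
\tag{52}
$$

$$
\tilde\Gamma_{ijk} = \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial\theta^h}{\partial\eta^k}\,\Gamma_{mnh}
+ \sum_{m,h} \frac{\partial\theta^h}{\partial\eta^k}\,\frac{\partial^2\theta^m}{\partial\eta^i\,\partial\eta^j}\,g_{mh}
\tag{53}
$$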

Lemma 2. The Fisher information metric g is covariant under re-parametrization.

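Proof. The components of the Fisher information metric with respect to the coordinate system [θi] are
given by

$$
g_{ij}(\theta) = \langle \partial_i, \partial_j\rangle_\theta = \int \partial_i p(x; \theta)\,\partial_j p(x; \theta)\,\frac{1}{p(x; \theta)}\,dx.
\tag{54}
$$

Let $\tilde p(x; \eta) = p(x; \theta(\eta))$. Then the components of the Fisher information metric with respect to the
coordinate system [ηj] are given by

$$
\tilde g_{ij}(\eta) = \langle \tilde\partial_i, \tilde\partial_j\rangle_\eta = \int \tilde\partial_i \tilde p(x; \eta)\,\tilde\partial_j \tilde p(x; \eta)\,\frac{1}{\tilde p(x; \eta)}\,dx.
\tag{55}
$$

Since

$$
\tilde\partial_i \tilde p(x; \eta) = \sum_m \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial p(x; \theta(\eta))}{\partial\theta^m}
\tag{56}
$$

we can write

$$
\begin{aligned}
\tilde g_{ij}(\eta)
&= \int \tilde\partial_i \tilde p(x; \eta)\,\tilde\partial_j \tilde p(x; \eta)\,\frac{1}{\tilde p(x; \eta)}\,dx \\
&= \int \sum_m \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial p(x; \theta)}{\partial\theta^m}\;\sum_n \frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial p(x; \theta)}{\partial\theta^n}\;\frac{1}{p(x; \theta)}\,dx \\
&= \sum_m \sum_n \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j} \int \partial_m p(x; \theta)\,\partial_n p(x; \theta)\,\frac{1}{p(x; \theta)}\,dx \\
&= \Big[\sum_m \sum_n \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,g_{mn}(\theta)\Big]_{\theta=\theta(\eta)}
\end{aligned}
\tag{57}
$$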
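
For a one-parameter family, Equation (52) reduces to g̃(η) = (dθ/dη)² g(θ). The following sketch checks this numerically for a Bernoulli model, with θ the success probability and η its logit; the model, the parameter value and the finite-difference step are illustrative assumptions, and the integral in Equation (54) becomes a sum over the two outcomes.

```python
import math

def fisher_information(prob_of_one, param, h=1e-5):
    """Fisher information of a Bernoulli model in the form of Equation (54),
    with the parameter derivative approximated by a central difference."""
    total = 0.0
    for x in (0, 1):
        def mass(u, x=x):
            q = prob_of_one(u)
            return q if x == 1 else 1.0 - q
        d = (mass(param + h) - mass(param - h)) / (2.0 * h)
        total += d * d / mass(param)
    return total

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

theta = 0.3
eta = math.log(theta / (1.0 - theta))   # eta(theta): the logit
jacobian = theta * (1.0 - theta)        # d theta / d eta
g_theta = fisher_information(lambda t: t, theta)
g_eta = fisher_information(sigmoid, eta)
print(g_eta, jacobian ** 2 * g_theta)
```

The two printed numbers agree up to discretisation error, as the covariance rule predicts.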

Lemma 3. The F−connection ∇F is covariant under re-parametrization.

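Proof. Let the components of $\nabla^F$ with respect to the coordinates [θi] and [ηj] be given by $\Gamma_{ijk}$,
$\tilde\Gamma_{ijk}$ respectively.
Let $\tilde p(x; \eta) = p(x; \theta(\eta))$. Let us denote log p(x; θ) by ℓ(x; θ) and $\log \tilde p(x; \eta)$ by $\tilde\ell(x; \eta)$.
The components of the F−connection $\nabla^F$ with respect to the coordinate system [θi] are given by

$$
\Gamma_{ijk} = \int \Big\{ \partial_i\partial_j\ell(x; \theta) + \Big(1 + \frac{pF''(p)}{F'(p)}\Big)\partial_i\ell(x; \theta)\,\partial_j\ell(x; \theta) \Big\}\,\partial_k\ell(x; \theta)\,p(x; \theta)\,dx
\tag{58}
$$

The components of $\nabla^F$ with respect to the coordinate system [ηj] are given by

$$
\tilde\Gamma_{ijk} = \int \Big\{ \tilde\partial_i\tilde\partial_j\tilde\ell(x; \eta) + \Big(1 + \frac{\tilde pF''(\tilde p)}{F'(\tilde p)}\Big)\tilde\partial_i\tilde\ell(x; \eta)\,\tilde\partial_j\tilde\ell(x; \eta) \Big\}\,\tilde\partial_k\tilde\ell(x; \eta)\,\tilde p(x; \eta)\,dx
\tag{59}
$$

We can write

$$
\tilde\partial_i\tilde\ell(x; \eta) = \sum_m \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}
\tag{60}
$$

Then

$$
\tilde\partial_i\tilde\partial_j\tilde\ell(x; \eta) = \sum_{m,n} \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial^2\ell(x; \theta(\eta))}{\partial\theta^m\,\partial\theta^n}
+ \sum_m \frac{\partial^2\theta^m}{\partial\eta^i\,\partial\eta^j}\,\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}
\tag{61}
$$

$$
\tilde\partial_i\tilde\ell(x; \eta)\,\tilde\partial_j\tilde\ell(x; \eta) = \sum_{m,n} \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}\,\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^n}
\tag{62}
$$

$$
\tilde\partial_k\tilde\ell(x; \eta) = \sum_h \frac{\partial\theta^h}{\partial\eta^k}\,\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^h}
\tag{63}
$$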

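Hence we get, writing ℓ for ℓ(x; θ(η)) and p for p(x; θ(η)),

$$
\begin{aligned}
\tilde\Gamma_{ijk}
&= \int \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial\theta^h}{\partial\eta^k}\,\frac{\partial^2\ell}{\partial\theta^m\,\partial\theta^n}\,\frac{\partial\ell}{\partial\theta^h}\,p\,dx
+ \int \sum_{m,h} \frac{\partial^2\theta^m}{\partial\eta^i\,\partial\eta^j}\,\frac{\partial\theta^h}{\partial\eta^k}\,\frac{\partial\ell}{\partial\theta^m}\,\frac{\partial\ell}{\partial\theta^h}\,p\,dx \\
&\quad + \int \Big(1 + \frac{pF''(p)}{F'(p)}\Big) \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial\theta^h}{\partial\eta^k}\,\frac{\partial\ell}{\partial\theta^m}\,\frac{\partial\ell}{\partial\theta^n}\,\frac{\partial\ell}{\partial\theta^h}\,p\,dx \\
&= \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial\theta^h}{\partial\eta^k} \int \frac{\partial^2\ell}{\partial\theta^m\,\partial\theta^n}\,\frac{\partial\ell}{\partial\theta^h}\,p\,dx
+ \sum_{m,h} \frac{\partial^2\theta^m}{\partial\eta^i\,\partial\eta^j}\,\frac{\partial\theta^h}{\partial\eta^k} \int \frac{\partial\ell}{\partial\theta^m}\,\frac{\partial\ell}{\partial\theta^h}\,p\,dx \\
&\quad + \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial\theta^h}{\partial\eta^k} \int \Big(1 + \frac{pF''(p)}{F'(p)}\Big)\frac{\partial\ell}{\partial\theta^m}\,\frac{\partial\ell}{\partial\theta^n}\,\frac{\partial\ell}{\partial\theta^h}\,p\,dx \\
&= \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\,\frac{\partial\theta^n}{\partial\eta^j}\,\frac{\partial\theta^h}{\partial\eta^k}\,\Gamma_{mnh}
+ \sum_{m,h} \frac{\partial\theta^h}{\partial\eta^k}\,\frac{\partial^2\theta^m}{\partial\eta^i\,\partial\eta^j}\,g_{mh}
\end{aligned}
\tag{64}
$$

Hence we have shown that F−connections are covariant under re-parametrization of the parameter.
Covariance under re-parametrization means that the metric and connections are coordinate
independent. Hence the F−geometry is coordinate independent.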

4.2. Invariance Under the Transformation of the Random Variable

Amari and Nagaoka [4] defined the invariance of Riemannian metric and connections on a
statistical manifold under a transformation of the random variable as follows,

Definition 6. Let S = {p(x; ξ) | ξ ∈ E ⊆ Rn} be a statistical manifold defined on a sample space
X.
Let x, y be random variables defined on sample spaces X, Y respectively and φ be a transformation
of x to y.
Assume that this transformation induces a model S′ = {q(y; ξ) | ξ ∈ E ⊆ Rn} on Y.
Let λ : S −→ S′ be a diffeomorphism defined as

$$
\lambda(p_\xi) = q_\xi
\tag{65}
$$

Let $g = \langle\,,\rangle$ and $g' = \langle\,,\rangle'$ be two Riemannian metrics defined on S and S′ respectively. Let ∇, ∇′ be two affine
connections on S and S′ respectively. Then the invariance properties are given by

$$
\langle X, Y\rangle_p = \langle \lambda_*(X), \lambda_*(Y)\rangle'_{\lambda(p)} \quad \forall\, X, Y \in T_p(S)
\tag{66}
$$

$$
\lambda_*(\nabla_X Y) = \nabla'_{\lambda_*(X)}\lambda_*(Y)
\tag{67}
$$

where λ∗ is the push forward map associated with the map λ, which is defined by

$$
\lambda_*(X)_{\lambda(p)} = (d\lambda)_p(X)
\tag{68}
$$

Now we discuss the invariance properties of the F−geometry under suitable transformations
of the random variable. Let us restrict ourselves to the case of smooth one-to-one transformations
of the random variable, which are statistically interesting. Amari and Nagaoka [4] mentioned a
transformation, the sufficient statistic of the parameter of the statistical model, which is widely used in
statistical estimation. In fact the one-to-one transformations of the random variable are trivial examples
of sufficient statistics.
Consider a statistical manifold S = {p(x; ξ) | ξ ∈ E ⊆ Rn} defined on a sample space X. Let φ be
a smooth one-to-one transformation of the random variable x to y. Then the density function q(y; ξ) of
the induced model S′ takes the form

$$
q(y; \xi) = p(w(y); \xi)\,w'(y)
\tag{69}
$$

where w is a function such that x = w(y) and $\varphi'(x) = \frac{1}{w'(\varphi(x))}$.
Let us denote log q(y; ξ) by $\ell(q_y)$ and log p(x; ξ) by $\ell(p_x)$.


Lemma 4. The Fisher information metric and Amari’s α-connections are invariant under smooth one-to-one
transformations of the random variable.

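Proof. Let φ be a smooth one-to-one transformation of the random variable x to y.
From Equation (69)

$$
p(x; \xi) = q(\varphi(x); \xi)\,\varphi'(x)
\tag{70}
$$

$$
\partial_i\ell(q_y) = \partial_i\ell(p_{w(y)})
\tag{71}
$$

$$
\partial_i\ell(q_{\varphi(x)}) = \partial_i\ell(p_x)
\tag{72}
$$

The Fisher information metric g′ on the induced manifold S′ is given by

$$
\begin{aligned}
g'_{ij}(q_\xi)
&= \int_Y \partial_i\ell(q_y)\,\partial_j\ell(q_y)\,q(y; \xi)\,dy \\
&= \int_X \partial_i\ell(q_{\varphi(x)})\,\partial_j\ell(q_{\varphi(x)})\,q(\varphi(x); \xi)\,\varphi'(x)\,dx \\
&= \int_X \partial_i\ell(p_x)\,\partial_j\ell(p_x)\,p(x; \xi)\,dx \\
&= g_{ij}(p_\xi)
\end{aligned}
\tag{73}
$$

which is the Fisher information metric on S.
The components of Amari’s α−connections on the induced manifold S′ are given by

$$
\begin{aligned}
\Gamma'^{\alpha}_{ijk}(q_\xi)
&= \int_Y \partial_i\partial_j\ell(q_y)\,\partial_k\ell(q_y)\,q(y; \xi)\,dy + \int_Y \frac{1-\alpha}{2}\,\partial_i\ell(q_y)\,\partial_j\ell(q_y)\,\partial_k\ell(q_y)\,q(y; \xi)\,dy \\
&= \int_X \partial_i\partial_j\ell(q_{\varphi(x)})\,\partial_k\ell(q_{\varphi(x)})\,q(\varphi(x); \xi)\,\varphi'(x)\,dx \\
&\quad + \int_X \frac{1-\alpha}{2}\,\partial_i\ell(q_{\varphi(x)})\,\partial_j\ell(q_{\varphi(x)})\,\partial_k\ell(q_{\varphi(x)})\,q(\varphi(x); \xi)\,\varphi'(x)\,dx \\
&= \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx + \int_X \frac{1-\alpha}{2}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx \\
&= \Gamma^{\alpha}_{ijk}(p_\xi)
\end{aligned}
\tag{74}
$$

which are the components of Amari’s α−connections on the manifold S. Thus we obtained that
the Fisher information metric and Amari’s α-connections are invariant under smooth one-to-one
transformations of the random variable.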

Now we prove that α-connections are the only F−connections that are invariant under smooth
one-to-one transformations of the random variable.

Theorem 4. Amari’s α-connections are the only F−connections that are invariant under smooth one-to-one
transformations of the random variable.


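Proof. Let φ be a smooth one-to-one transformation of the random variable x to y.
The components of the F−connection of the induced manifold S′ are

$$
\begin{aligned}
\Gamma'^{F}_{ijk}(q_\xi)
&= \int_Y \Big\{ \partial_i\partial_j\ell(q_y) + \Big(1 + \frac{qF''(q)}{F'(q)}\Big)\partial_i\ell(q_y)\,\partial_j\ell(q_y) \Big\}\,\partial_k\ell(q_y)\,q(y; \xi)\,dy \\
&= \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx \\
&\quad + \int_X \Big(1 + \frac{q(\varphi(x); \xi)\,F''(q(\varphi(x); \xi))}{F'(q(\varphi(x); \xi))}\Big)\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx
\end{aligned}
\tag{75}
$$

and the components of the F−connection of the manifold S are

$$
\begin{aligned}
\Gamma^{F}_{ijk}(p_\xi)
&= \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx \\
&\quad + \int_X \Big(1 + \frac{p(x; \xi)\,F''(p(x; \xi))}{F'(p(x; \xi))}\Big)\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx.
\end{aligned}
\tag{76}
$$

Then by equating the components $\Gamma'^{F}_{ijk}(q_\xi)$ and $\Gamma^{F}_{ijk}(p_\xi)$ of the F−connection, we get

$$
\int \frac{q(\varphi(x); \xi)\,F''(q(\varphi(x); \xi))}{F'(q(\varphi(x); \xi))}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx
= \int \frac{p(x; \xi)\,F''(p(x; \xi))}{F'(p(x; \xi))}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx
\tag{77}
$$

Then it follows that the condition for the F−connection to be invariant under the transformation φ is
given by

$$
\frac{pF''(p)}{F'(p)} = k,
\tag{78}
$$

where k is a real constant.
Hence it follows from Euler’s homogeneous function theorem that the function F′ is a positive
homogeneous function in p of degree k. So

$$
F'(\lambda p) = \lambda^k F'(p) \quad \text{for } \lambda > 0.
\tag{79}
$$

Since F′ is a positive homogeneous function in the single variable p, without loss of generality
we can take

$$
F'(p) = p^k.
\tag{80}
$$

Therefore

$$
F(p) =
\begin{cases}
\dfrac{p^{k+1}}{k+1} & k \neq -1 \\[1ex]
\log p & k = -1
\end{cases}
\tag{81}
$$

Letting

$$
k = -\frac{1+\alpha}{2}, \quad \alpha \in \mathbb{R},
\tag{82}
$$

we get

$$
F(p) =
\begin{cases}
\dfrac{2}{1-\alpha}\,p^{\frac{1-\alpha}{2}} & \alpha \neq 1 \\[1ex]
\log p & \alpha = 1
\end{cases}
\tag{83}
$$

which is nothing but Amari’s α−embeddings Lα(p). Hence we obtain that Amari’s α−connections
are the only F−connections that are invariant under smooth one-to-one transformations of the
random variable.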
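
The role of condition (78) can be illustrated numerically. The sketch below evaluates the two sides of Equation (77) for an exponential density p(x; ξ) = ξ e^{−ξx} under the scaling y = cx, once with a constant weight pF″(p)/F′(p) = k and once with the weight arising from F(p) = e^p. The choice of density, of transformation, of the non-α embedding and of the quadrature rule are all illustrative assumptions.

```python
import math

def cubic_term(weight, xi, c=1.0, upper=40.0, n=20000):
    """Midpoint-rule approximation of
        int w(q) (d/dxi log q)^3 q dy,   q(y; xi) = p(y / c; xi) / c,
    for p(x; xi) = xi * exp(-xi * x) on (0, infinity) truncated at x = upper,
    where w(p) stands for p F''(p) / F'(p) as in Equation (77)."""
    step = upper * c / n
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * step
        x = y / c
        q = xi * math.exp(-xi * x) / c
        score = 1.0 / xi - x          # derivative of log q with respect to xi
        total += weight(q) * score ** 3 * q * step
    return total

power_weight = lambda p: -0.5         # F'(p) = p^k with k = -1/2, i.e. alpha = 0
other_weight = lambda p: p            # F(p) = exp(p): a hypothetical non-alpha embedding

for name, w in (("power", power_weight), ("exp", other_weight)):
    print(name, cubic_term(w, xi=1.0), cubic_term(w, xi=1.0, c=2.0))
```

With the constant weight the two values coincide, whereas with the weight from F(p) = e^p the rescaling changes the value, so that F−connection is not invariant under this transformation.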


Remark 4. In Section 2, we defined (F, G)-connections using a general embedding function F and a positive
smooth function G. We can show that (F, G)-connection is invariant under smooth one-to-one transformation
of the random variable when G(p) = c, where c is a real constant and F(p) = Lα(p) (proof is similar to that of
Theorem 4). The notion of (F, G)−metric and (F, G)−connection provides a more general way of introducing
geometric structures on a manifold. We were able to show that the Fisher information metric (up to a constant)
and Amari’s α−connections are the only metric and connections belonging to this class that are invariant under
both the transformation of the parameter and the one-to-one transformation of the random variable.

5. Conclusions

The Fisher information metric and Amari’s α−connections are widely used in the theory
of information geometry and have an important role in the theory of statistical estimation.
Amari’s α−connections are defined using a one parameter family of functions, the α−embeddings.
We generalized this idea to introduce geometric structures on a statistical manifold S. We considered
a general embedding function F of S into RX and obtained a geometric structure on S called the
F−geometry. Amari’s α−geometry is a special case of F−geometry. A more general way of defining
Riemannian metrics and affine connections on a statistical manifold S is given using a positive
continuous function G and the embedding F.
Amari’s α−geometry is the only F−geometry that is invariant under both the transformation of
the parameter and the random variable or equivalently under the sufficient statistic. We can relax the
condition of invariance under the sufficient statistic and can consider other statistically significant
transformations as well, which then gives an F−geometry other than α−geometry that is invariant
under these statistically significant transformations. We believe that the idea of F−geometry can be
used in the further development of the geometric theory of q-exponential families. We look forward to
studying these problems in detail later.

Acknowledgments: We are extremely thankful to Shun-ichi Amari for reading this article and encouraging our
learning process. We would like to thank the reviewer who mentioned the references [13,16] that are of great
importance in our future work.

Author Contributions: The authors contributed equally to the presented mathematical framework and the writing
of the paper.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Amari, S. Differential geometry of curved exponential families-curvature and information loss. Ann. Statist. 1982, 10, 357–385.
2. Amari, S. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics, Volume 28; Springer-Verlag: New York, NY, USA, 1985.
3. Amari, S.; Kumon, M. Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Math. 1983, 35, 1–24.
4. Amari, S.; Nagaoka, H. Methods of Information Geometry, Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 2000.
5. Barndorff-Nielsen, O.E.; Cox, D.R.; Reid, N. The role of differential geometry in statistical theory. Internat. Statist. Rev. 1986, 54, 83–96.
6. Dawid, A.P. A Discussion to Efron’s paper. Ann. Statist. 1975, 3, 1231–1234.
7. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Statist. 1975, 3, 1189–1242.
8. Efron, B. The geometry of exponential families. Ann. Statist. 1978, 6, 362–376.
9. Murray, M.K.; Rice, R.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1995.
10. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta. Math. Soc. 1945, 37, 81–91.
11. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; Translated in English, Translation of the Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 1982.
12. Corcuera, J.M.; Giummole, F. A characterization of monotone and regular divergences. Ann. Inst. Statist. Math. 1998, 50, 433–450.
13. Zhang, J. Divergence function, duality and convex analysis. Neur. Comput. 2004, 16, 159–195.
14. Burbea, J. Informative geometry of probability spaces. Expo Math. 1986, 4, 347–378.
15. Wagenaar, D.A. Information Geometry for Neural Networks. Available online: http://www.danielwagenaar.net/res/papers/98-Wage2.pdf (accessed on 13 December 2013).
16. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually flat and conformal geometries. Physica A 2012, 391, 4308–4319.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
Computational Information Geometry in Statistics:
Theory and Practice

Frank Critchley 1 and Paul Marriott 2,*

1 Department of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes,
Buckinghamshire MK7 6AA, UK; E-Mail: f.critchley@open.ac.uk
2 Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo,
ON N2L 3G1, Canada
*
E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.

Received: 27 March 2014; in revised form: 25 April 2014 / Accepted: 29 April 2014 /
Published: 2 May 2014

Abstract: A broad view of the nature and potential of computational information geometry in
statistics is offered.
This new area suitably extends the manifold-based approach of classical
information geometry to a simplicial setting, in order to obtain an operational universal model space.
Additional underlying theory and illustrative real examples are presented. In the infinite-dimensional
case, challenges inherent in this ambitious overall agenda are highlighted and promising new
methodologies indicated.

Keywords: information geometry; computational geometry; statistical foundations

1. Introduction

The application of geometry to statistical theory and practice has seen a number of different
approaches developed. One of the most important can be defined as starting with Efron’s seminal
paper [? ] on statistical curvature and subsequent landmark references, including the book by Kass
and Vos [? ]. This approach, a major part of which has been called information geometry, continues
today, a primary focus being invariant higher-order asymptotic expansions obtained through the use
of differential geometry. A somewhat representative example of the type of result it generates is taken
from [? ], where the notation is defined:

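Example 1. The bias correction of a first-order efficient estimator, $\hat\beta$, is defined by:

$$
b^a(\beta) = -\frac{1}{2n}\,g^{aa'}\Big\{ g^{bc}\,\Gamma^{(-1)}_{a'bc} + g^{\kappa\lambda}\,h^{(-1)}_{\kappa\lambda a'} \Big\},
$$

and has the property that if $\hat\beta^* := \hat\beta - b(\beta)$ then:

$$
E_\beta(\hat\beta^* - \beta) = O(n^{-3/2}).
$$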

The strengths usually claimed of such a result are that, for a worker fluent in the language of
information geometry, it is explicit, insightful as to the underlying structure and of clear utility in
statistical practice. We agree entirely. However, the overwhelming evidence of the literature is that,
while the benefits of such inferential improvements are widely acknowledged in principle, in practice,
the overhead of first becoming fluent in information geometry prevents their routine use. As a result, a
great number of powerful results of practical importance lie severely underused, locked away behind
notational and conceptual bars.
This paper proposes that this problem can be addressed computationally by the development
of what we call computational information geometry. This gives a mathematical and numerical
computational framework in which the results of information geometry can be encoded as “black-box”
numerical algorithms, allowing direct access to their power. Essentially, this works by exploiting the
structural properties of information geometry, which are such that all formulae can be expressed in
terms of four fundamental building blocks: defined and detailed in Amari [? ], these are the +1 and −1
geometries, the way that these are connected via the Fisher information and the foundational duality
theorem. Additionally, computational information geometry enables a range of methodologies and
insights impossible without it; notably, those deriving from the operational, universal model space,
which it affords; see, for example, [? ? ? ].

Entropy 2014, 16, 2454–2471; doi:10.3390/e16052454
www.mdpi.com/journal/entropy

The paper is structured as follows. Section 2 looks at the case of distributions on a finite number
of categories where the extended multinomial family provides an exhaustive model underlying the
corresponding information geometry. Since the aim is to produce a computational theory, a finite
representation is the ultimate aim, making the results of this section of central importance. The
paper also emphasises how the simplicial structures introduced here are foundational to a theory of
computational information geometry. Being intrinsically constructive, a simplicial approach is useful
both theoretically and computationally. Section 3 looks at how simplicial structures, defined for finite
dimensions, can be extended to the infinite dimensional case.

2. Finite Discrete Case

2.1. Introduction

This section shows how the results of classical information geometry can be applied in a purely
computational way. We emphasise that the framework developed here can be implemented in
a purely algorithmic way, allowing direct access to a powerful information geometric theory of
practical importance.
The key tool, as explained in [? ], is the simplex:

$$
\Delta^k := \Big\{ \pi = (\pi_0, \pi_1, \dots, \pi_k)^\top : \pi_i \ge 0,\ \sum_{i=0}^{k} \pi_i = 1 \Big\},
\tag{1}
$$

with a label associated with each vertex. Here, k is chosen to be sufficiently large, so that any statistical
model—by which we mean a sample space, a set of probability distributions and selected inference
problem—can be embedded. The embedding is done in such a way that all the building blocks
of information geometry (i.e., manifold, affine connections and metric tensor) can be numerically
computed explicitly. Within such a simplex, we can embed a large class of regular exponential families;
see [? ] for details. This class includes exponential family random graph models, logistic regression,
log-linear and other models for categorical data analysis. Furthermore, the multinomial family on k + 1
categories is naturally identified with the relative interior of this space, int(Δk), while the extended
family, Equation (??), is a union of distributions with different support sets.
This paper builds on the theory of information geometry following that introduced by [? ] via
the affine space construction introduced by [? ] and extended by [? ]. Since this paper concentrates
on categorical random variables, the following definitions are appropriate. Consider a finite set of
disjoint categories or bins B = {Bi}i∈A. Any distribution over this finite set of categories is defined
by a set, {πi}i∈A, which defines the corresponding probabilities. With “mix” connoting mixtures of
distributions, we have:

Definition 1. The −1-affine space structure over distributions on B := {Bi}i∈A is (Xmix, Vmix, +) where:

$$
X_{mix} = \Big\{ \{x_i\}_{i\in A} \;\Big|\; \sum_{i\in A} x_i = 1 \Big\}, \qquad
V_{mix} = \Big\{ \{v_i\}_{i\in A} \;\Big|\; \sum_{i\in A} v_i = 0 \Big\}
$$

and the addition operator, +, is the usual addition of sequences.


In Definition ??, the space of (discretised) distributions is a −1-convex subspace of the affine
space, (Xmix, Vmix, +). A similar affine structure for the +1-geometry, once the support has been fixed,
can be derived from the definitions in [? ].

2.2. Examples

Examples ?? and ?? are used for illustration. The second of these is a moderately high dimensional
family, where the way that the boundaries of the simplex are attached to the model is of great
importance for the behaviour of the likelihood and of the maximum likelihood estimate. In general,
working in a simplex, boundary effects mean that standard first order asymptotic results can fail,
while the much more flexible higher order methods can be very effective. The other example is a
continuous curved exponential family, where both higher order asymptotic sampling theory results
and geometrically-based dimension reduction are described.

Example 2. The paper [? ] models survival times for leukaemia patients. These times, recorded in days, start
at the time of diagnosis, and there are 43 observations; see [? ] for details. We further assume that the data is
censored at a fixed value. It was observed that a censored exponential distribution gives a reasonable, but not
exact, fit. As discussed in [? ], this gives a one-dimensional curved exponential family inside a two-dimensional
regular exponential family of the form:

$$
\exp\Big\{ \lambda_1 x + \lambda_2 y - \log\Big( \frac{1}{\lambda_2}\big(e^{\lambda_2 t} - 1\big) + e^{\lambda_1 + \lambda_2 t} \Big) \Big\},
\tag{2}
$$

where y = min(z, t) and x = I(z ≥ t), and the embedding map is given by (λ1(θ), λ2(θ)) = (− log θ, −θ).
As shown in [? ], the loss due to discretisation can be made arbitrarily small for all information geometry
objects. Thus, for example, using this computational approach, it is straightforward to compute the bias
correction described in Example ??. Each of the terms in the asymptotic bias, i.e., the metric, $g_{ij}$, its inverse, $g^{ij}$,
the Christoffel symbols, $\Gamma^{(-1)}_{ijk}$, and curvature term, $h^{(-1)}$, can be directly numerically coded as appropriate finite
the asymptotic bias, and this numerical value can then be used by those who are not familiar with information
geometry. For example this calculation establishes the fact that, with this particular data set, the sample size is
such that the bias is inferentially unimportant.
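
As a small check on the form of Equation (2), the following sketch evaluates the log-normalising term along the embedded curve (λ1(θ), λ2(θ)) = (− log θ, −θ) and compares it with − log θ, the value expected for a censored exponential distribution with rate θ. The censoring time t and the rates used are hypothetical values chosen for illustration; this is not the bias-correction code discussed in the text.

```python
import math

def log_normaliser(lam1, lam2, t):
    """log{ (1/lam2)(exp(lam2 t) - 1) + exp(lam1 + lam2 t) }, as in Equation (2)."""
    return math.log((math.exp(lam2 * t) - 1.0) / lam2 + math.exp(lam1 + lam2 * t))

def embedding(theta):
    """Curved sub-family (lam1(theta), lam2(theta)) = (-log theta, -theta)."""
    return -math.log(theta), -theta

t = 3.0                                  # hypothetical censoring time
for theta in (0.2, 1.0, 2.5):
    lam1, lam2 = embedding(theta)
    # On the curve the normaliser should collapse to -log(theta).
    print(theta, log_normaliser(lam1, lam2, t), -math.log(theta))
```

The agreement of the last two columns reflects the fact that the uncensored part contributes (1 − e^{−θt})/θ and the censored atom contributes e^{−θt}/θ, which sum to 1/θ.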

Figure 1. Undirected graphical model showing the cyclic graph of order four (nodes labelled 1 to 4).

Example 3. The paper [? ] discusses an undirected graphical model based on the cyclic graph of order four,
shown in Figure ??, with binary random variables at each node. Without any constraints, there are 16 possible
values for the graph, so model space can be thought of as a 15-dimensional simplex, including the relative
boundary. However, the conditional independence relations encoded by the graph impose linear constraints in the
natural parameters of the exponential family. Thus, the resultant model is a lower dimensional full exponential
family and its closure.
As described in [? ], the four cycle model is a seven dimensional exponential family, which is a +1-affine
subspace of the +1-affine structure of the 15-dimensional simplex. The model can be written in the form:

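$$
\left( \frac{\pi_i \exp\Big\{ \sum_{h=1}^{8} \eta_h v_{hi} \Big\}}{\sum_{j=0}^{15} \pi_j \exp\Big\{ \sum_{h=1}^{8} \eta_h v_{hj} \Big\}} \right)_{i=0}^{15}
\tag{3}
$$

for a given set of linearly independent vectors $\{v_h\}_{h=1}^{8}$. The existence of the maximum likelihood estimate for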
η = (ηh) will depend on how the limit points of Model (??) meet the observed face of Δ15; that is, the span of the
vertices (bins) having positive counts. Thus, a key computational task is to learn how a full exponential family,
defined by a representation of the form of (??), is attached to boundary sub-simplices of the high-dimensional
embedding simplex.
In order to visualise the geometric aspects of this problem, consider a lower dimensional version. Define
a two-dimensional full exponential family by the vectors v1 = (1, 2, 3, 4), v2 = (1, 4, 9, −1) and the uniform
distribution base point, πi, embedded in the three-dimensional simplex. The two-dimensional family is defined
by the +1-affine space through (0.25, 0.25, 0.25, 0.25) spanned by the space of vectors of the form:

α(1, 2, 3, 4) + β(1, 4, 9, −1) = (α + β, 2α + 4β, 3α + 9β, 4α − β).

Consider directions from the origin obtained by writing α = θβ, giving, for each θ, a one-dimensional, full
exponential family parameterized by β in the direction β(θ + 1, 2θ + 4, 3θ + 9, 4θ − 1). The aspect of this vector,
which determines the connection to the boundary, is the rank order of its elements. For example, suppose the first
component was the maximum and the last the minimum. Then, as β → ±∞, this one-dimensional family will
be connected to the first and fourth vertices of the embedding three-dimensional simplex, respectively. Note that changing the
value of θ changes the rank structure, as illustrated in Figure 2. This plot shows the four element-wise linear
functions of θ (dashed lines) and the salient overall feature of their rank order; that is, their upper and lower
envelopes (solid lines). From this analysis of the envelopes of a set of linear functions, it can be seen that the
function 2θ + 4 is redundant. The consequence of this is shown in Figure 3, which shows a direct computation
of the two-dimensional family. It is clear that, indeed, only three of the four vertices have been connected by
the model.
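The envelope argument can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the four linear functions on a grid of θ values (the grid range and resolution are assumptions chosen to contain every pairwise crossing point) and records which components ever attain the maximum or the minimum.

```python
import numpy as np

# Spanning vectors of the two-dimensional family from the text.
v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([1.0, 4.0, 9.0, -1.0])

def envelope_indices(thetas):
    """Return the components attaining the upper and lower envelopes.

    Each row of `values` holds the four linear functions
    (theta + 1, 2*theta + 4, 3*theta + 9, 4*theta - 1) at one theta.
    """
    values = np.outer(thetas, v1) + v2
    upper = set(np.argmax(values, axis=1).tolist())
    lower = set(np.argmin(values, axis=1).tolist())
    return upper, lower

# Hypothetical grid; wide enough to contain every crossing point of the lines.
thetas = np.linspace(-50.0, 50.0, 20001)
upper, lower = envelope_indices(thetas)
attached = sorted(upper | lower)

print("upper envelope components (0-based):", sorted(upper))
print("lower envelope components (0-based):", sorted(lower))
print("vertices reached by the family (0-based):", attached)
```

The component with index 1, that is 2θ + 4, never appears in either envelope, which is the redundancy noted above.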
In general, the problem of finding the limit points in full exponential families inside simplex models is a
problem of finding redundant linear constraints. As shown in [? ], this can be converted, via convex duality, into
the problem of finding extremal points in a finite dimensional affine space. In the four-cycle model, this technique
can construct all sub-simplices containing limit points of the four-cycle model. For example, it can be shown
that all of the 16 vertices are part of the boundary. Once the boundary points have been identified, necessary
and sufficient conditions for the existence of the maximum likelihood estimate in the +1-parameters can easily be found
computationally [? ].

Entropy 2014, 16, 2454–2471

[Plot: envelope of linear functions]

Figure 2. The envelope of a set of linear functions. Functions, dashed lines; envelope, solid lines.

Figure 3. Attaching a two-dimensional example to the boundary of the simplex.

2.3. Tensor Analysis and Numerical Stability

One of the most powerful sets of results from classical information geometry concerns the way that
geometrically-based tensor analysis is well suited to multi-dimensional higher order asymptotic
analysis; see [? ] or [? ]. The tensorial formulation does, however, present a couple of problems in
practice. For many, its very tight and efficient notational aspects can obscure rather than enlighten,
while the resulting formulae tend to have a very large number of terms, making them rather
cumbersome to work with explicitly. These are not problems at all for the computational approach
described in this paper. Rather, the clarity of the tensorial approach is ideal for coding, where large
numbers of additive terms are, of course, easy to deal with.
Two more fundamental issues, which the global geometric approach of this paper highlights,
concern numerical stability. The ability to invert the Fisher information matrix is vital in most tensorial
formulae, and so understanding its spectrum, discussed in Section 2.4, is essential. Secondly, numerical
underflow and overflow near boundaries require careful analysis, and so understanding the way
that models are attached to the boundaries of the extended multinomial models is equally important.
The four-cycle model, to which we now return, illustrates computational information geometry doing
this effectively.

Example 4. The multivariate Edgeworth approximation to the sampling distribution of part of the sufficient
statistic for the four-cycle model is shown in Figure 4. Using the techniques described above, a point near the
boundary of the 15-simplex has been selected as the data generation process. For illustration, we focus on the
marginal distribution of two components of the sufficient statistic, though any number could have been chosen.
The boundary forces constraints on the range of the sufficient statistics, shown by the dashed line in the plot.
The points, jittered for clarity, show the distribution computed by simulation. It is typical that such boundary
constraints prevent standard first order methods from performing well, but the greater flexibility of higher
order methods can be seen to work well here. As discussed above, methods, such as the multivariate Edgeworth
expansion, can be strongly exploited in a computational framework, such as ours. Note, the discretization that
can be observed in the figure is extensively discussed in [? ].


Figure 4. Using the Edgeworth expansion near the boundary of four-cycle model.

2.4. Spectrum of Fisher Information

We focus now on the second numerical issue identified above. In any multinomial, the Fisher
information matrix and its inverse are explicit. Indeed, the 0-geodesics and the corresponding geodesic
distance are also explicit; see [? ] or [? ]. However, since the simplex glues together multinomial
structures with different supports and the computational theory is in high dimensions, it is a fact
that the Fisher information matrix can be arbitrarily close to being singular. It is therefore of central
interest that the spectral decomposition of the Fisher information itself has a very nice structure, as
shown below.

Example 5. Consider a multinomial distribution based on 81 equal width categories on [−5, 5], where the
probability associated to a bin is proportional to that of the standard normal distribution for that bin. The Fisher
information for this model is an 80 × 80 matrix, whose spectrum is shown in Figure 5. By inspection, it can
be seen that there are exponentially small eigenvalues, so that while the matrix is positive definite, it is also
arbitrarily close to being singular. Furthermore, it can be seen that the spectrum has the shape of a half-normal
density function and that the eigenvalues seem to come in pairs. These facts are direct consequences of the general
results below.
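The spectrum in this example can be reproduced approximately with a few lines of NumPy. This is a sketch under stated assumptions rather than the authors' code: the text does not say which bin plays the role of π0, so the first bin is dropped here, and the function name `discretised_normal` is hypothetical.

```python
import math
import numpy as np

def discretised_normal(n_bins=81, lo=-5.0, hi=5.0):
    """Bin probabilities proportional to standard normal mass on equal-width bins."""
    edges = np.linspace(lo, hi, n_bins + 1)
    cdf = np.array([0.5 * (1.0 + math.erf(e / math.sqrt(2.0))) for e in edges])
    mass = np.diff(cdf)
    return mass / mass.sum()

pi = discretised_normal()
# Assumption: the first bin plays the role of pi_0 and is dropped.
pi_rest = pi[1:]
fisher = np.diag(pi_rest) - np.outer(pi_rest, pi_rest)   # 80 x 80 matrix
eigenvalues = np.linalg.eigvalsh(fisher)[::-1]           # descending order
sorted_probs = np.sort(pi_rest)[::-1]                    # descending order

print("largest eigenvalue :", eigenvalues[0])
print("smallest eigenvalue:", eigenvalues[-1])
```

Comparing `eigenvalues` with `sorted_probs` entry by entry shows the interlacing pattern that the general results below describe: each eigenvalue lies between two consecutive sorted bin probabilities.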


With $\pi_{-0}$ denoting the vector of all bin probabilities except $\pi_0$, we can write the Fisher
information matrix (in the +1 form) as N times:

$$I(\pi) := \mathrm{diag}(\pi_{-0}) - \pi_{-0}\pi_{-0}^{T}.$$

This has an explicit spectral decomposition, which can be computed by using interlacing
eigenvalue results (see, for example, [? ], Chapter 4). In particular, if the diagonal matrices
$\mathrm{diag}(\pi_1, \ldots, \pi_k)$ and $\mathrm{diag}(\lambda_1 I_{m_1} | \cdots | \lambda_g I_{m_g})$ agree up to a row-and-column permutation, where $g > 1$
and $\lambda_1 > \cdots > \lambda_g > 0$, then $I(\pi)$ has ordered spectrum:

$$\lambda_1 > \tilde\lambda_1 > \cdots > \lambda_g > \tilde\lambda_g \ge 0, \qquad (4)$$

with $\tilde\lambda_g > 0 \iff \pi_0 > 0$, each $\lambda_i$ having multiplicity $m_i - 1$, while each $\tilde\lambda_i$ is simple.

[Plot: eigenvalues (vertical axis, 0.00 to 0.06) against rank (horizontal axis, 0 to 80)]

Figure 5. Spectrum of the Fisher information matrix of a discretised normal distribution.

We give a complete account of the spectral decomposition (SpD) of I(π). There are four cases to
consider, the last having the generic spectrum of (4). Without loss of generality, after permutation, assume now
π1 ≥ · · · ≥ πk. The four cases are:

Case 1 For some l < k, the last k − l elements of $\pi_{-0}$ vanish: the sub-case l = 0 ⇐⇒ π0 = 1 ⇐⇒
I(π) = 0 is trivial. Otherwise, writing $\pi_+ = (\pi_1, \ldots, \pi_l)^{T}$ and $\Pi_+ = \mathrm{diag}(\pi_+)$, the SpD of:

$$I(\pi) = \begin{pmatrix} \Pi_+ - \pi_+\pi_+^{T} & 0 \\ 0 & 0 \end{pmatrix}$$

follows at once from that of $\Pi_+ - \pi_+\pi_+^{T}$, given below.
Case 2 k = 1: this case is trivial.
Case 3 k > 1, $\pi_{-0} = \lambda 1_k$, λ > 0: the SpD of I(π) is:

$$\lambda C_k + \lambda(1 - k\lambda) J_k$$

where $C_k = I_k - J_k$ and $J_k = k^{-1} 1_k 1_k^{T}$. Here, λ has multiplicity k − 1 and eigenspace $[\mathrm{Span}(1_k)]^{\perp}$,
while $\tilde\lambda := \lambda(1 - k\lambda)$ has multiplicity one and eigenspace $\mathrm{Span}(1_k)$. In particular, since
$1 - \pi_0 = k\lambda$, it follows that:

$$I(\pi) \text{ is singular} \iff \pi_0 = 0.$$


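Case 4 $\pi_{-0} = (\lambda_1 1_{m_1}^{T} | \ldots | \lambda_g 1_{m_g}^{T})^{T}$, $g > 1$ and $\lambda_1 > \cdots > \lambda_g > 0$:

This is the generic case, having the spectrum of (4) above. Denoting by $O_m$ the zero matrix of
order $m \times m$ and by $P(\nu)$ the rank one orthogonal projector onto $\mathrm{Span}(\nu)$ ($\nu \neq 0$), the SpD is:

$$\sum_{i=1,\, m_i>1}^{g} \lambda_i\, \mathrm{diag}(O_{m_{i-}}, C_{m_i}, O_{m_{i+}}) + \sum_{i=1}^{g} \tilde\lambda_i\, P\!\left( \left( \frac{\lambda_1}{\tilde\lambda_i - \lambda_1} 1_{m_1}^{T}, \ldots, \frac{\lambda_g}{\tilde\lambda_i - \lambda_g} 1_{m_g}^{T} \right)^{T} \right),$$

where $m_{i-} = \sum\{m_j \mid j < i\}$, $m_{i+} = \sum\{m_j \mid j > i\}$ and the $\tilde\lambda_i$ are the zeros of:

$$h(\tilde\lambda) := 1 + \sum_{i=1}^{g} \frac{m_i\lambda_i^{2}}{\tilde\lambda - \lambda_i} = \Big(1 - \sum_{i=1}^{g} m_i\lambda_i\Big) + \tilde\lambda \Big( \sum_{i=1}^{g} \frac{m_i\lambda_i}{\tilde\lambda - \lambda_i} \Big).$$

In particular, $\{\tilde\lambda_i : i = 1, \ldots, g\}$ are simple eigenvalues satisfying (4) while, whenever $m_i > 1$,
$\lambda_i$ is also an eigenvalue having multiplicity $m_i - 1$. Further, expanding det(I(π)), we again find:

$$I(\pi) \text{ is singular} \iff \pi_0 = 0,$$

so that $\tilde\lambda_g > 0 \iff \pi_0 > 0$, as claimed. Finally, we note that each $\tilde\lambda_i$ ($i < g$) is
typically (much) closer to $\lambda_i$ than to $\lambda_{i+1}$. For, considering the graph of $x \to 1/x$,
$h\big((\lambda_i + \lambda_{i+1})/2 + \delta(\lambda_i - \lambda_{i+1})/2\big)$ ($-1 < \delta < +1$) is well-approximated by:

$$1 - \frac{2 m_i \lambda_i^{2}}{(\lambda_i - \lambda_{i+1})(1 - \delta)} + \frac{2 m_{i+1} \lambda_{i+1}^{2}}{(\lambda_i - \lambda_{i+1})(1 + \delta)}$$

whose unique zero $\delta^{*}$ over (−1, 1) is positive whenever, as will typically be the case, $m_i = m_{i+1}$
(both will usually be one), while $(m_i\lambda_i + m_{i+1}\lambda_{i+1}) < 1/2$. Indeed, a straightforward analysis
shows that, for any $m_i$ and $m_{i+1}$, $\delta^{*} = 1 + O(\lambda_i)$ as $\lambda_i \to 0$.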
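The generic case can be verified numerically on a small instance. The sketch below is illustrative only: the values of λ and the multiplicities are hypothetical, and the bisection routine `simple_eigenvalues` is a plain root finder chosen for transparency, not a method taken from the paper. It locates the zeros of h between consecutive values of λ and compares the predicted spectrum with a direct eigenvalue computation.

```python
import numpy as np

def h(x, lam, mult):
    """Secular function h(x) = 1 + sum_i m_i * lam_i**2 / (x - lam_i)."""
    return 1.0 + sum(m * l * l / (x - l) for l, m in zip(lam, mult))

def simple_eigenvalues(lam, mult, iterations=200):
    """Bisect for the zero of h in each interval (lam_{i+1}, lam_i) and in (0, lam_g).

    Assumes lam is strictly decreasing and positive with sum(m_i * lam_i) < 1,
    so that h is decreasing with exactly one zero in each interval.
    """
    brackets = list(zip(list(lam[1:]) + [0.0], lam))   # (lower end, upper end)
    roots = []
    for lo, hi in brackets:
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if mid == lo or mid == hi:   # interval exhausted at float precision
                break
            if h(mid, lam, mult) > 0.0:
                lo = mid
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    return roots

# Hypothetical values: pi_{-0} = (0.3, 0.3, 0.1, 0.1, 0.1), so pi_0 = 0.1.
lam, mult = [0.3, 0.1], [2, 3]
pi_rest = np.repeat(lam, mult)
fisher = np.diag(pi_rest) - np.outer(pi_rest, pi_rest)
numeric = np.sort(np.linalg.eigvalsh(fisher))

# Each lam_i contributes m_i - 1 repeated eigenvalues; the zeros of h are simple.
repeated = [l for l, m in zip(lam, mult) for _ in range(m - 1)]
roots = simple_eigenvalues(lam, mult)
predicted = np.sort(np.array(roots + repeated))

print("predicted spectrum:", predicted)
print("numerical spectrum:", numeric)
```

Bisection is used here because h is monotone between its poles, so a sign change brackets each simple eigenvalue without any need for derivative information.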

2.5. Total Positivity and Local Mixing

Mixture modelling is an exemplar of a major area of statistics in which computational information
geometry enables distinctive methodological progress. The −1-convex hull of an exponential family is
of great interest, mixture models being widely used in many areas of statistical science. In particular,
they are explored further in [? ]. Here, we simply state the main result, a simple consequence of the
total positivity of exponential families [? ], that, generically, convex hulls are of maximal dimension. In
this result, “generic” means that the +1 tangent vector which defines the exponential family has
components that are all distinct.

Theorem 1. The −1-convex hull of an open subset of a generic one-dimensional exponential family is of full
dimension.

Proof. For any $(\pi_i) \in \Delta_k$ with each $\pi_i > 0$, $\theta_0 < \cdots < \theta_k$ and $s_0 < \cdots < s_k$, let $B = (\pi(\theta_0), \ldots, \pi(\theta_k))$
have general element:

$$\pi_i(\theta_j) := \pi_i \exp[s_i\theta_j - \psi(\theta_j)].$$

Further, let $\tilde B = B - \pi(\theta_0) 1_{k+1}^{T}$, whose general column is $\pi(\theta_j) - \pi(\theta_0)$. Then, it suffices to show that
$\tilde B$ has rank k. However, using [? ] (p. 33), $\mathrm{Rank}(\tilde B) = \mathrm{Rank}(B) - 1$, so that:

$$\mathrm{Rank}(\tilde B) = k \iff B \text{ is nonsingular} \iff B^{*} \text{ is nonsingular},$$

where $B^{*} = (\exp[s_i\theta_j])$. It suffices, then, to recall [? ] that K(x, y) = exp(xy) is strictly totally positive (of
order ∞), so that $\det B^{*} > 0$.
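The objects in the proof can be built explicitly for a small case. The following sketch uses hypothetical values (k = 3, a uniform base point, and arbitrarily chosen increasing s and θ); it is a numerical illustration of the statement, not a substitute for the total positivity argument.

```python
import numpy as np

k = 3
base = np.full(k + 1, 1.0 / (k + 1))        # hypothetical base point, all pi_i > 0
s = np.array([0.0, 1.0, 2.0, 3.0])          # distinct components s_0 < ... < s_k
thetas = np.array([-1.0, 0.0, 0.5, 2.0])    # theta_0 < ... < theta_k

def family_point(theta):
    """pi_i(theta) = pi_i * exp(s_i * theta - psi(theta)), normalised to sum to one."""
    weights = base * np.exp(s * theta)
    return weights / weights.sum()

B = np.column_stack([family_point(t) for t in thetas])
B_tilde = B - np.outer(B[:, 0], np.ones(k + 1))   # columns pi(theta_j) - pi(theta_0)
B_star = np.exp(np.outer(s, thetas))              # kernel exp(s_i * theta_j)

det_B_star = np.linalg.det(B_star)
rank_B_tilde = np.linalg.matrix_rank(B_tilde)

print("det B*      :", det_B_star)
print("rank B-tilde:", rank_B_tilde)
```

With integer s the matrix B∗ is of Vandermonde type in exp(θj), which is why its determinant is positive for increasing θ; the rank of the difference matrix equals k, so the k + 1 points span a full-dimensional simplex.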

3. Infinite Dimensional Structure

This section will start to explore the question of whether the simplex structure, which describes
the finite dimensional space of distributions, can extend to the infinite dimensional case. We examine
some of the differences with the finite dimensional case, illustrating them with clear, commonly
occurring examples.

3.1. Infinite Dimensional Information Geometry: A Review

In the previous sections, the underlying computational space is always finite dimensional. This
section looks at issues related to an infinite dimensional extension of the theory in that paper. There
is a great deal of literature concerning infinite dimensional statistical models. The discussion here
concentrates on information geometric, parametrisation and boundary issues.
The information geometry theory of Amari [? ] has a geometric foundation, where statistical
models (typically full and curved exponential families) have a finite dimensional manifold structure.
When considering the extension to infinite dimensional cases, Amari notes the problem of finding an
“adequate topology” [? ] (p. 93). There has been very interesting work following up this topological
challenge. By concentrating on distributions with a common support, the paper [? ] uses the geometry
of a Banach manifold, where local patches on the manifold are modelled by Banach spaces, via
the concept of an Orlicz space. This gives a structure that is analogous to an infinite dimensional
exponential family, with mean and natural parameters and including the ability to define mixed
parametrisations. One drawback of this Banach structure, as pointed out in [? ], is that the likelihood
function with finite samples is not continuous on the manifold. Fukumizu uses a reproducing kernel
Hilbert space structure rather than a Banach manifold, which gives a stronger topology. There are strong
connections between the approach taken in [? ] and the material in Section ??; we note two here:
(1) a focus on the finite nature of the data; and (2) the use of a Hilbert structure defined by a cumulant
generating function. The approaches differ in that [? ] uses a manifold approach rather than the
simplicial complex as the fundamental geometric object. There is also other work that explicitly uses
infinite dimensional Hilbert spaces in statistics, a good reference being [? ].
In this paper, in contrast to previous authors, a simplicial, rather than a manifold-based, approach
is taken. This allows distributions with varying support, as well as closures of statistical families to be
included in the geometry. Another difference in approach is the way in which geometric structures
are induced by infinite dimensional affine spaces rather than by using an intrinsic geometry. This
approach was introduced by [? ] and extended by [? ]. Spaces of distributions are convex subsets of
the affine spaces, and their closure within the affine space is key to the geometry.
In exponential families, the −1-affine structure is often called the mean parametrisation, and
using moments as parameters is one very important part of modelling. In the infinite dimensional
case, the use of moments as a parameter system is related to the classical moment problem—when
does there exist a (unique) distribution whose moments agree with a given sequence?—which has
generated a vast literature in its own right; see [? ? ? ]. In general terms, the existence of a solution
to the moment problem is connected to positivity conditions on moment matrices. Such conditions
have been used in connection to the infinite dimensional geometry of mixture models [? ]. Uniqueness,
however, is a much more subtle problem: sufficient conditions can be formulated in terms of the rate
of growth of the moments [? ]. Counter examples to general uniqueness results include the log-normal
distribution [? ].
The geometry of the Fisher information is also much more complex in general spaces of
distributions than in exponential families. Simple mixture models, including two-component mixtures
of exponential distributions [? ], can have “infinite” expected Fisher information, which gives rise to
non-standard inference issues. Similar results on infinitely small (and large) eigenvalues of covariance
operators are also noted in [? ]. Since the Fisher information is a covariance, the fact that it does not
exist for certain distributions or that its spectrum can be unbounded above or arbitrarily close to zero
is not a surprise. However, these observations do need to be taken into account when considering the
information geometry of infinite dimensional spaces.
The rest of this section looks at the topology and geometry of the infinite dimensional simplex
and gives some illustrative examples, which, in particular, show the need for specific Hilbert space
structures, discussed in the final section.

3.2. Topology

For simplicity and concreteness, in this section, we will be looking at models for real valued
random variables. In this paper, we restrict attention to the cases where the sample space is R+ or R
and has been discretised to a countably infinite set of bins, Bi, with i ∈ N or Z, respectively. In the
finite case, the basic object is the standard simplex, Δk, with k + 1 bins. We generalise this to countable
unions of such objects. Of these, one is of central importance, denoted by Δemp or simply Δ, because it
is the smallest object that contains all possible empirical distributions.

Definition 2. For any finite subset of bins, indexed by I ⊂ N or Z, denote

$$\Delta_I = \Big\{ x = (x_i)_{i\in I} : x_i \ge 0,\ \sum_{i\in I} x_i = 1 \Big\}.$$

We take the union of all such sets, $\bigcup_{|I|<\infty} \Delta_I$, where |I| denotes the number of elements of the index set. This
can always be written as:

$$\Delta = \Big\{ x = (x_i)_{i\in\mathbb{Z}} : \sum_{i\in\mathbb{Z}} x_i = 1,\ x_i \ge 0 \text{ and only finitely many } x_i > 0 \Big\}.$$

In what follows, it is important to note that for any given statistical inference problem, the sample
size, n, is always finite, even if we frequently use asymptotic approximations, where n → ∞. Thus, the
data, as represented by the empirical distribution, naturally lie in the space, Δ. However, many models,
used in the given inference problem, will have support over all bins, so the models most naturally
lie in the “boundary” constructed using the closures of the set. These objects are subsets of sequence
spaces, and the corresponding topologies can be constructed from the Banach spaces, ℓp, p ∈ [1, ∞].
The following results follow directly from explicit calculations, where we note that in this section, since
all terms are non-negative, convergence always means absolute convergence. In particular, arbitrary
rearrangements of series do not affect the existence of limits or their values.

Example 6. Consider the sequence of “uniform distributions” $x^{(n)} = (\tfrac{1}{n}, \ldots, \tfrac{1}{n}, 0, \ldots)$ as elements of Δ. This
has an ℓp limit of the zero sequence for p ∈ (1, ∞].
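The convergence in this example follows from a closed form for the norm: the distance from x(n) to the zero sequence is n^(1/p − 1) for finite p and 1/n for p = ∞. A minimal sketch (the function name is hypothetical) makes the contrast between p = 1 and p > 1 visible:

```python
def lp_norm_uniform(n, p):
    """l^p norm of x^(n) = (1/n, ..., 1/n, 0, ...), with n non-zero entries."""
    if p == float("inf"):
        return 1.0 / n
    return (n * (1.0 / n) ** p) ** (1.0 / p)

for n in (10, 1000, 100000):
    print(n,
          lp_norm_uniform(n, 1),
          lp_norm_uniform(n, 2),
          lp_norm_uniform(n, float("inf")))
```

The ℓ1 norm stays at one for every n, so the sequence does not approach the zero sequence in ℓ1, whereas for every p > 1 the norm decreases to zero.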

Proposition 1. The ℓp extreme points of Δ, for p ∈ (1, ∞], are the zero sequence and the sequences $\delta_i$ (i ∈ Z),
with one as the i-th element and zero elsewhere.

For p ∈ [1, ∞], let Δp ⊂ ℓp denote the ℓp closure of Δ.

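Theorem 2. (a) $\Delta_1 = \{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0,\ \sum_{i\in\mathbb{Z}} x_i = 1\}$.
(b) $\Delta_\infty = \{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0,\ \sum_{i\in\mathbb{Z}} x_i \le 1\}$.
(c) For p ∈ (1, ∞), $\Delta_p = \Delta_\infty = \{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0,\ \sum_{i\in\mathbb{Z}} x_i \le 1\}$.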


Proof. (a) It is immediate that $\{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0,\ \sum_{i\in\mathbb{Z}} x_i = 1\} \subseteq \Delta_1$. Conversely, if $\bar x$ is a limit
point, then all its elements must be non-negative. Finally, if $\sum_{i=1}^{\infty} \bar x_i$ is not bounded above by one,
then there exists N, such that $\sum_{i=1}^{N} \bar x_i > 1 + \epsilon$ for some ε > 0. Hence,

$$\sum_{i=1}^{\infty} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{N} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{N} \bar x_i - \sum_{i=1}^{N} x_i^{(n)} > \epsilon$$

for all n, which contradicts convergence. If $\sum_{i=1}^{\infty} \bar x_i < 1 - \epsilon$, then

$$\sum_{i=1}^{\infty} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{\infty} x_i^{(n)} - \sum_{i=1}^{\infty} \bar x_i > \epsilon,$$

which again contradicts convergence.
(b) It is again immediate that $\{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0,\ \sum_{i\in\mathbb{Z}} x_i = 1\} \subseteq \Delta_\infty$. However, by Example 6,
the zero sequence is also in $\Delta_\infty$, so that $\{x = (x_i)_{i\in\mathbb{Z}} : x_i \ge 0,\ \sum_{i\in\mathbb{Z}} x_i \le 1\} \subseteq \Delta_\infty$.
Conversely, by contradiction, it is easy to see that all elements of the closure must have
non-negative elements. Finally, for any $\bar x \in \Delta_\infty$, if $\sum_{i=1}^{\infty} \bar x_i$ is not bounded above by one, there exists N, such
that $\sum_{i=1}^{N} \bar x_i > 1 + \epsilon$ for some ε > 0. For any sequence of points, $x^{(n)}$ in Δ, we have that $\sum_{i=1}^{N} x_i^{(n)} \le 1$,
so that, for i = 1, . . . , N, the maximum value of $|x_i^{(n)} - \bar x_i| > \epsilon/N$. Hence, for all sequences, $x^{(n)}$, we
have $\|x^{(n)} - \bar x\|_\infty > \epsilon/N$, which contradicts $\bar x$ being in the closure.
(c) This follows essentially the same argument as (b) by noting that, in the case where $\sum_{i=1}^{\infty} \bar x_i$ is not
bounded above by one, Hölder's inequality gives:

$$\|x^{(n)} - \bar x\|_p^p \ge \sum_{i=1}^{N} |\bar x_i - x_i^{(n)}|^p \ge N^{1-p} \Big( \sum_{i=1}^{N} |\bar x_i - x_i^{(n)}| \Big)^{p} > N^{1-p}\epsilon^{p}$$

for any sequences, $x^{(n)}$, which contradicts $\bar x$ being in the closure.

It is immediate that the spaces, Δ and Δ1, are convex subsets of ℓ1 and that Δ∞ is a convex set in
ℓ∞.

3.3. Geometry

In the same way as for the finite case, the −1-geometry can be defined using an affine space
structure using the following definition.

Definition 3. Let I be a countable index set which is a subset of Z. The −1-affine space structure over
distributions is (Xmix, Vmix, +), where:

$$X_{mix} = \Big\{ x = (x_i) \,\Big|\, \sum_I x_i = 1,\ \sum_I |x_i| < \infty \Big\}, \quad V_{mix} = \Big\{ v = (v_i) \,\Big|\, \sum_I v_i = 0,\ \sum_I |v_i| < \infty \Big\},$$

and x + v = (xi + vi).

In order to define the +1-geometric structure, we also follow the approach used in the finite case.
Initially, to understand the +1-structure, consider the case where all distributions have a common
support, i.e., assume πi > 0 for all i. We follow here the approach of [? ].

Definition 4. Consider the set of non-negative measures on N or Z and the equivalence relation defined by:

{ai} ∼ {bi} ⇐⇒ ∃λ > 0 s.t. ∀i ai = λbi.

The equivalence classes of this relation are the points in the +1 geometry.
These points can be further partitioned into sets with the same support, where supp(⟨a⟩) = {i : ai > 0}
is clearly well-defined.

On sets of +1-points with the same support, we can define the +1-geometry in the same way as
in the finite case. With “exp” connoting an exponential family distribution, we have:


Definition 5. For a given index set, I, define Xexp to be the set of all +1-points whose support equals I, and define the
vector space Vexp = {(vi), i ∈ I}. With the operation, ⊕, defined by:

$$\langle x_i \rangle \oplus (v_i) = \langle x_i \exp(v_i) \rangle,$$

Xexp is an affine space over Vexp. The +1-affine structure is then defined by (Xexp, Vexp, ⊕).

Theorem 3. If a and b lie in Δ (or Δ1) and have the same support, then $C(\rho) = \sum_i a_i^{\rho} b_i^{(1-\rho)} < \infty$ for
ρ ∈ [0, 1]. Hence, $a_i^{\rho} b_i^{(1-\rho)} / C(\rho) \in \Delta$ (or Δ1).

Proof. Since a, b are absolutely convergent, the sequence, max(ai, bi), is also. Since we have:

$$0 \le \min(a_i, b_i) \le a_i^{\rho} b_i^{1-\rho} \le \max(a_i, b_i)$$

it follows that C(ρ) < ∞, and we have the result.

This result shows that sets in Δ1 with the same support are +1-convex, just as the faces in the
finite case are.

3.4. Examples

In order to get a sense of how the +1-geometry works, let us consider a few illustrative examples.

Example 7. If we denote the discretised standard normal density by a and the discretised Cauchy density by b
and consider the path:

$$\frac{a_i^{\rho}\, b_i^{(1-\rho)}}{C(\rho)},$$

the normalising constant is shown in Figure 6. We see that at ρ = 0 (the Cauchy distribution), the
derivative of the normalising constant (i.e., the mean of the sufficient statistic) is tending to infinity. At the
other end (ρ = 1), the model can be extended in the sense that the distribution exists for values of ρ greater than one.


Figure 6. Normalising constant for normal-Cauchy exponential mixing example.

Thus, in this example, the path joining the two distributions is an extended, rather than natural,
exponential family, since we have to include the boundary point where the mean is unbounded.
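The behaviour of the normalising constant can be explored with a short computation. The sketch below rests on assumptions that are not in the paper: the two densities are discretised by evaluating them at integer bin centres, the bins are truncated to −M, ..., M, and everything is done in log space to avoid underflow in the normal tails. The derivative of C at ρ = 0 is the sum of b_i log(a_i / b_i), whose magnitude grows without bound as the truncation M increases.

```python
import math

def log_weights(M):
    """Normalised log-probabilities on bins i = -M..M.

    Hypothetical discretisation: densities evaluated at integer bin centres
    and renormalised over the truncated range.
    """
    idx = range(-M, M + 1)
    log_a = [-0.5 * i * i for i in idx]            # standard normal, unnormalised
    log_b = [-math.log1p(i * i) for i in idx]      # standard Cauchy, unnormalised

    def normalise(logs):
        top = max(logs)
        log_total = top + math.log(sum(math.exp(l - top) for l in logs))
        return [l - log_total for l in logs]

    return normalise(log_a), normalise(log_b)

def C(rho, log_a, log_b):
    """Normalising constant C(rho) = sum_i a_i**rho * b_i**(1 - rho)."""
    return sum(math.exp(rho * la + (1.0 - rho) * lb) for la, lb in zip(log_a, log_b))

def dC_at_zero(log_a, log_b):
    """Derivative of C at rho = 0: sum_i b_i * (log a_i - log b_i)."""
    return sum(math.exp(lb) * (la - lb) for la, lb in zip(log_a, log_b))

for M in (100, 1000, 10000):
    la, lb = log_weights(M)
    print(M, C(0.5, la, lb), C(1.5, la, lb), dC_at_zero(la, lb))
```

C(1.5) stays finite, in line with the remark that the path extends beyond ρ = 1, while the derivative at ρ = 0 becomes steeper as more of the Cauchy tail is included.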


Example 8. Let us return to Example ??, but now without the censoring. Thus, now, there is a countably
infinite set of bins, and so, we can investigate its embedding in the infinite simplex. As discussed in [? ], we shall
discretise the continuous distribution by computing the probabilities associated to bins [ci, ci+1], i = 1, 2, · · · .
For the exponential model, Exp(θ), the bin probabilities are simply:

πi(θ) = exp(−θci) − exp(−θci+1).

Using this, the model will lie in the infinite simplex on the positive half line with the index set I = N.
First, consider the case where we have a uniform choice of discretisation, where $c_n = n \times \epsilon$ for some fixed
ε > 0. In this case, the bin probabilities can be written as an exponential family:

$$\pi_n(\theta) = \exp\big\{ -\theta\epsilon n + \log(1 - e^{-\theta\epsilon}) \big\}$$

for θ > 0. This gives a +1-geodesic through {πi(θ0)} in the direction {ϵ × n} of the form:

$$\pi_n(\theta_0) \exp\Big\{ -\lambda\epsilon n + \log\Big( \frac{1 - e^{-(\lambda+\theta_0)\epsilon}}{1 - e^{-\theta_0\epsilon}} \Big) \Big\} \qquad (5)$$

for λ > −θ0. In the case where λ → −θ0, the limiting distribution is the zero measure in Δ∞, and at the
other extreme, where λ → ∞, the limiting distribution is the atomic distribution in the first bin, a distribution
with a different support than πi(θ0). However, unlike the finite case, there is no guarantee that, for a given
“direction”, {ti}, there exists a +1-geodesic starting at {πi(θ0)}, since we require the convergence of the
normalising constant:

$$\sum_{i=0}^{\infty} \pi_i(\theta_0) \exp(\lambda t_i) < \infty.$$

From this example, we see that the limit points of exponential families can lie in the space, Δ∞,
but not in Δ1. The next example shows that limits do not have to exist at all.
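The two limits in Example 8 can be seen numerically. The following sketch assumes ϵ = 1 and bins indexed from n = 0 (both choices are assumptions made for illustration), checks that the geodesic (5) reproduces the bin probabilities at θ0 + λ, and prints the largest bin probability, which is the ℓ∞ norm of the distribution.

```python
import math

def bin_prob(n, theta, eps=1.0):
    """pi_n(theta) = exp(-theta*eps*n) - exp(-theta*eps*(n+1)), with c_n = n*eps."""
    return math.exp(-theta * eps * n) - math.exp(-theta * eps * (n + 1))

def geodesic_point(n, lam, theta0, eps=1.0):
    """Point on the +1-geodesic (5) through pi(theta0) in the direction {eps*n}."""
    log_norm = math.log((1.0 - math.exp(-(lam + theta0) * eps))
                        / (1.0 - math.exp(-theta0 * eps)))
    return bin_prob(n, theta0, eps) * math.exp(-lam * eps * n + log_norm)

theta0 = 1.0
for lam in (-0.999, 0.0, 20.0):
    # The bin probabilities decrease in n, so the first bin carries the sup norm.
    first = geodesic_point(0, lam, theta0)
    print(f"lambda={lam:7.3f}  largest bin probability={first:.6f}")
```

As λ approaches −θ0 the largest bin probability shrinks towards zero, even though the probabilities still sum to one, which is why the limit is the zero sequence in ℓ∞ but does not exist in ℓ1. For large λ almost all mass sits in the first bin.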

Example 9. Consider the family whose bin probabilities, πi ∈ Δ∞, are proportional to a discretised standard
normal with bins of constant width. The exponential family, which is proportional to πi exp(θi), does not have
an ℓ∞ limit, as it is discretised normal with mean θ. The natural parameter space here is (−∞, ∞).

The last illustrative example is from [? ] and shows that even for simple models, the Fisher
information for the parameters of interest need not be finite.

Example 10. Let us consider a simple example of a two-component mixture of (discretised) exponential distributions:

$$(1 - \rho)\pi_n(\theta_0 + \lambda) + \rho\,\pi_n(\theta_0). \qquad (6)$$

The tangent vector in the ρ-direction is:

$$\pi_n(\theta_0) - \pi_n(\theta_0 + \lambda) = \pi_n(\theta_0)\big( 1 - C e^{-\lambda\epsilon n} \big)$$

for a positive constant, C. The corresponding squared length, with respect to the Fisher information, is:

$$\sum_{n=0}^{\infty} \big( 1 - C e^{-\lambda\epsilon n} \big)^{2} \pi_n(\theta_0).$$

As an example, consider θ0 = 1; then, this term will be infinite for λ ≤ −0.5.
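The threshold λ = −0.5 can be illustrated with partial sums. This sketch assumes ϵ = 1 and the geometric bin probabilities of Example 8, with C = (1 − e^{−(θ0+λ)ϵ}) / (1 − e^{−θ0ϵ}); the square is expanded so that every exponent stays non-positive, which restricts the sketch to −θ0/2 ≤ λ.

```python
import math

def squared_length_partial(lam, theta0=1.0, eps=1.0, n_terms=1000):
    """Partial sum of sum_n pi_n(theta0) * (1 - C*exp(-lam*eps*n))**2.

    The square is expanded term by term so that every exponent is
    non-positive when -theta0/2 <= lam; only that range is supported here.
    """
    w = 1.0 - math.exp(-theta0 * eps)
    C = (1.0 - math.exp(-(theta0 + lam) * eps)) / w
    total = 0.0
    for n in range(n_terms):
        total += w * (math.exp(-theta0 * eps * n)
                      - 2.0 * C * math.exp(-(theta0 + lam) * eps * n)
                      + C * C * math.exp(-(theta0 + 2.0 * lam) * eps * n))
    return total

for lam in (-0.25, -0.5):
    print(lam,
          squared_length_partial(lam, n_terms=1000),
          squared_length_partial(lam, n_terms=2000))
```

At λ = −0.25 the partial sums have already stabilised, whereas at λ = −0.5 the summands tend to a positive constant and the partial sums grow linearly with the number of terms.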


3.5. Hilbert Space Structures

Following these examples, we can consider the Hilbert space structure of exponential families
inside the infinite simplex with the following results.

Definition 6. Define the function, S(·), by $S(\{v_i\}, \pi) = \sup_\theta \{\theta \mid \sum_I \pi_i \exp(\theta v_i) < \infty\}$, the function being
set to ∞ when the set is unbounded. Furthermore, define for a given $\{\pi_i\} \in \bar\Delta_\infty$, the sets:

$$V(\pi) = \{\{v_i\} \mid S(\{v_i\}, \pi) > 0\}, \quad \text{and} \quad V_c(\pi) = \{\{v_i\} \mid \pm\{v_i\} \in V(\pi)\}.$$

The spaces, $V_c(\pi)$, correspond to the directions in which the +1-geodesic and, so, the
corresponding exponential families are well-defined and have particularly “nice” geometric structures.
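
As a worked illustration of Definition 6 (our own example, assuming the geometric bin probabilities $\pi_n = (1 - e^{-\theta_0\epsilon}) e^{-\theta_0\epsilon n}$, $n = 0, 1, \ldots$, of Example 8), the function S can be computed in closed form for two simple directions:

```latex
% Direction v_n = n: a geometric series.
\sum_{n \ge 0} \pi_n e^{\theta n}
  = \frac{1 - e^{-\theta_0\epsilon}}{1 - e^{\theta - \theta_0\epsilon}} < \infty
  \iff \theta < \theta_0\epsilon,
\qquad \text{so } S(\{n\}, \pi) = \theta_0\epsilon > 0 .

% Direction v_n = -n: the sum is finite for every \theta \ge 0.
S(\{-n\}, \pi) = \infty, \qquad \text{hence } \{n\} \in V_c(\pi) .

% Direction v_n = n^2: the terms e^{\theta n^2 - \theta_0\epsilon n} are unbounded for any \theta > 0.
\sum_{n \ge 0} \pi_n e^{\theta n^2} = \infty \ \text{ for all } \theta > 0,
\qquad \text{so } S(\{n^2\}, \pi) = 0 \ \text{ and } \{n^2\} \notin V(\pi) .
```

In this illustration the direction {n²} has a finite second moment under π, yet it does not generate a +1-geodesic, while {−n²} does lie in V(π); this is consistent with V(π) being a cone rather than a subspace.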

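Theorem 4. For π, define a Hilbert space by:

$$H(\pi) := \Big\{ \{v_i\} \,\Big|\, \sum v_i^{2} \pi_i < \infty \Big\}$$

with inner product:

$$\langle \{v_i\}, \{w_i\} \rangle_{\pi} = \sum v_i w_i \pi_i,$$

and corresponding norm $\|\cdot\|_{\pi}$. Under these conditions:
(i) $V_c(\pi)$ is a subspace of $H(\pi)$, and
(ii) the set $V(\pi)$ is a convex cone.

Proof. (i) First, if $\{v_i\} \in V_c(\pi)$, then by definition, the moment generating function:

$$\sum \exp(\theta v_i) \pi_i,$$

is finite for θ in an open set containing θ = 0. Hence, we have both:

$$\sum v_i \pi_i < \infty, \quad \text{and} \quad \sum v_i^{2} \pi_i < \infty.$$

Thus, $\{v_i\} \in H(\pi)$. The fact that it is a subspace follows from (ii) below.
(ii) It is immediate that $V(\pi)$ is a cone.
Convexity follows from the Cauchy–Schwarz inequality, since for all $\{v_i\}, \{v_i^{*}\} \in V(\pi)$ and
λ ∈ [0, 1], it follows that:

$$\Big( \sum \pi_i e^{\frac{\theta}{2}(\lambda v_i + (1-\lambda) v_i^{*})} \Big)^{2}
= \Big( \sum \big( \sqrt{\pi_i}\, e^{\frac{\theta}{2}\lambda v_i} \big) \big( \sqrt{\pi_i}\, e^{\frac{\theta}{2}(1-\lambda) v_i^{*}} \big) \Big)^{2}
\le \Big( \sum \pi_i e^{\theta\lambda v_i} \Big) \Big( \sum \pi_i e^{\theta(1-\lambda) v_i^{*}} \Big),$$

and, so, is finite for a strictly positive value of θ; hence $\{\lambda v_i + (1-\lambda) v_i^{*}\} \in V(\pi)$.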

Hence, this result illustrates the point above regarding the existence of “nice” geometric structure
in the sense of Amari’s information geometry developed for finite dimensional exponential families.
Infinite dimensional families have a richer structure; for example, they include the possibility of having
an infinite Fisher information; see Examples ?? and ??.

Acknowledgments: The authors would like to thank Karim Anaya-Izquierdo and Paul Vos for many helpful
discussions and the UK’s Engineering and Physical Sciences Research Council (EPSRC) for the support of grant
number EP/E017878/.

Author Contributions: All authors contributed to the conception and design of the study, the collection and
analysis of the data and the discussion of the results. All authors read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.


References

1. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Stat. 1975, 3, 1189–1242.
2. Kass, R.E.; Vos, P.W. Geometrical Foundations of Asymptotic Inference; John Wiley & Sons: London, UK, 1997.
3. Amari, S.-I. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics; Springer-Verlag Inc.: New York, NY, USA, 1985; Volume 28.
4. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P.; Vos, P. Computational Information Geometry: Foundations. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 311–318.
5. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P.; Vos, P. Computational Information Geometry: Mixture Modelling. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 319–326.
6. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P. When are first order asymptotics adequate? A diagnostic. Stat 2014, 3, 17–22.
7. Murray, M.K.; Rice, J.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1993.
8. Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 95–97.
9. Hand, D.J.; Daly, F.; Lunn, A.D.; McConway, K.J.; Ostrowski, E. A Handbook of Small Data Sets; Chapman and Hall: London, UK, 1994.
10. Bryson, M.C.; Siddiqui, M.M. Survival times: Some criteria for aging. J. Am. Stat. Assoc. 1969, 64, 1472–1483.
11. Marriott, P.; West, S. On the geometry of censored models. Calcutta Stat. Assoc. Bull. 2002, 52, 567–576.
12. Geiger, D.; Heckerman, D.; King, H.; Meek, C. Stratified exponential families: Graphical models and model selection. Ann. Stat. 2001, 29, 505–529.
13. Edelsbrunner, H. Algorithms in Combinatorial Geometry; Springer-Verlag: New York, NY, USA, 1987.
14. Barndorff-Nielsen, O.E.; Cox, D.R. Asymptotic Techniques for Use in Statistics; Chapman & Hall: London, UK, 1989.
15. McCullagh, P. Tensor Methods in Statistics; Chapman & Hall: London, UK, 1987.
16. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
17. Karlin, S. Total Positivity; Stanford University Press: Stanford, CA, USA, 1968; Volume I.
18. Householder, A.S. The Theory of Matrices in Numerical Analysis; Dover Publications: Dover, DE, USA, 1975.
19. Pistone, G.; Rogantin, M.P. The exponential statistical manifold: Mean parameters, orthogonality and space transformations. Bernoulli 1999, 5, 571–760.
20. Fukumizu, K. Infinite dimensional exponential families by reproducing kernel Hilbert spaces. In Proceedings of the 2nd International Symposium on Information Geometry and its Applications, Tokyo, Japan, 12–16 December 2005.
21. Small, C.G.; McLeish, D.L. Hilbert Space Methods in Probability and Statistical Inference; John Wiley & Sons: London, UK, 1994.
22. Akhiezer, N.I. The Classical Moment Problem; Hafner: New York, NY, USA, 1965.
23. Stoyanov, J.M. Counter Examples in Probability; John Wiley & Sons: London, UK, 1987.
24. Gut, A. On the moment problem. Bernoulli 2002, 8, 407–421.
25. Lindsay, B.G. Moment matrices: Applications in mixtures. Ann. Stat. 1989, 17, 722–740.
26. Li, P.; Chen, J.; Marriott, P. Non-finite Fisher information and homogeneity: An EM approach. Biometrika 2009, 96, 411–426.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
Using Geometry to Select One Dimensional
Exponential Families That Are Monotone Likelihood
Ratio in the Sample Space, Are Weakly Unimodal and
Can Be Parametrized by a Measure of
Central Tendency

Paul Vos 1 and Karim Anaya-Izquierdo 2,*

1 Department of Biostatistics, East Carolina University, Greenville, NC 27858, USA; E-Mail: vosp@ecu.edu
2 Department of Mathematical Sciences, University of Bath, Bath BA27AY, UK
*
E-Mail: kai21@bath.ac.uk; Tel: +44-1225-384644

Received: 30 April 2014; in revised form: 30 June 2014 / Accepted: 14 July 2014 /
Published: 18 July 2014

Abstract: One dimensional exponential families on finite sample spaces are studied using the
geometry of the simplex $\Delta^{\circ}_{n-1}$ and that of a transformation Vn−1 of its interior. This transformation is
the natural parameter space associated with the family of multinomial distributions. The space Vn−1
is partitioned into cones that are used to find one dimensional families with desirable properties for
modeling and inference. These properties include the availability of uniformly most powerful tests
and estimators that exhibit optimal properties in terms of variability and unbiasedness.

Keywords: simplex; cone; exponential family; monotone likelihood ratio; unimodal; duality

1. Introduction

The motivation for the constructions in this paper begins with a sample from a one dimensional
space that is discrete. We allow for a continuous sample space but assume that this has been suitably
discretized into n bins. The simplest underlying structure for the probability assigned to these bins is
given by the multinomial distribution. The collection of all multinomial distributions can be identified
with the n − 1 simplex Δn−1. We use the geometry of the simplex along with a transformation of its
interior $\Delta^{\circ}_{n-1}$ to search for one dimensional subspaces that have good properties for modeling and for
inference. In particular, we want families that can be parameterized by the mean, have only unimodal
distributions, have desirable test characteristics (such as providing uniformly most powerful unbiased
tests) and estimation properties (such as unbiasedness and small variability).
The boundary of the (n − 1) dimensional simplex Δn−1 can be written as the union of simplexes
of dimension (n − 2). This process can be repeated on the simplexes of lower dimension until the
boundary consists of the vertices of the original simplex. This construction has statistical relevance
to the possible supports for the probability distributions considered on the n bins. We obtain a dual
decomposition for a transformation $V_{n-1}$ (defined in Equation (5) in Section 5) of $\Delta^\circ_{n-1}$; it is dual in
that the result can be obtained by replacing simplexes with cones. The statistical relevance of the
conical decomposition is to the possible modes for all the distributions on the n bins. Since $V_{n-1}$ is
the natural parameter space for the distributions in $\Delta^\circ_{n-1}$, one dimensional exponential families are
lines in $V_{n-1}$ and these can be related to the cones that partition $V_{n-1}$. One result is that the limiting
distribution for any one dimensional exponential family in $\Delta^\circ_{n-1}$ is the uniform distribution whose
support is determined by the cone that contains the limiting values of the line corresponding to the
exponential family.

Entropy 2014, 16, 4088–4100; doi:10.3390/e16074088
www.mdpi.com/journal/entropy

While one parameter exponential families can be defined quite generally by choosing a sufficient
statistic, it can be useful to start with the sufficient statistics from well-known families such as the
binomial, Poisson, negative binomial, normal, inverse Gaussian, and Gamma distribution. These
exponential families have good modeling and inferential properties that we try to maintain by limiting
the extent to which the sufficient statistic is modified. These restrictions lead to considering vectors in
Vn−1 that lie in a cone. Examples of how to construct these cones are given.

2. Motivating Examples

One dimensional exponential families such as the binomial or Poisson are the workhorses of
parametric inference because of their excellent statistical properties. However, being one dimensional
means they do not always fit data well, so an extension to a two (or higher) dimensional
exponential family can be pursued in order to preserve the nice inferential structure. An issue
with such an extension is that, for each extra natural parameter added, we need to choose a new sufficient
statistic and this choice can substantially change the shape of the corresponding density functions. For
example, densities can pass from being unimodal to having multiple modes for some parameter values.
To see this, consider the following examples.

Example 1. Altham [1] considered the so-called multiplicative generalization of the binomial distribution with
corresponding density

$$f(x; p, \phi) = \binom{n}{x} p^{x} (1-p)^{n-x} \phi^{x(x-n)} / C(p, \phi) \qquad (1)$$

where C is the normalizing constant and where clearly the binomial is recovered when φ = 1.
By reparametrizing using θ1 = log(p/(1 − p)) and θ2 = log(φ) this density can be expressed in
exponential form as

$$f(x; \theta_1, \theta_2) = h(x) \exp\left(\theta_1 x + \theta_2 T(x) - K(\theta_1, \theta_2)\right) \qquad (2)$$

where $T(x) = x(x-n)$ is the added sufficient statistic and $h(x) = \binom{n}{x}$, where dependence on n has been ignored.
Note that the same family is obtained if $T(x) = x^2$ is added as a sufficient statistic instead of $x(x-n)$.
If n = 127 and (θ1, θ2) = (−0.0122, 0.018) then density (2) is bimodal as shown in the left panel of Figure 1.
The mean μ of this distribution is 50. Also plotted is the corresponding binomial density with the same mean
or equivalently with θ1 = log(50/(127 − 50)) = −0.4318 and θ2 = 0.

Figure 1. Binomial density (thick in both panels). Multiplicative binomial density (left panel and thin)
and double binomial density (right panel and thin). All densities have the same mean μ = 50 and
n = 127. Variance of the multiplicative and double binomial densities is equal.


As explained by Lovison [2], this distribution has the feature of being under- or over-dispersed with
respect to the binomial depending on θ2 being negative or positive, respectively. Furthermore, using the mixed
parametrization (μ, θ2) (see [3] for details) it is easy to see that this distribution can be parametrized so that one
parameter controls dispersion independently of the mean. In fact, for a fixed mean μ, as θ2 → ∞, f(x; θ1, θ2)
tends to a two point distribution (with support points at the extremes x = 0 and x = n), and as θ2 → −∞ it
tends to a degenerate distribution on x = μ.
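The bimodality claimed in Example 1 can be checked numerically. The following is a minimal sketch, assuming the exponential form (2) with $h(x) = \binom{n}{x}$ and $T(x) = x(x-n)$; the helper names `exp_family_density`, `local_modes` and `mean` are hypothetical and introduced only for illustration.

```python
import math

def exp_family_density(n, theta1, theta2, T):
    """Density h(x) exp(theta1*x + theta2*T(x) - K) on x = 0..n with h(x) = C(n, x)."""
    log_w = [math.log(math.comb(n, x)) + theta1 * x + theta2 * T(x)
             for x in range(n + 1)]
    shift = max(log_w)                       # stabilise the exponentials
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)                           # normalising constant up to the shift
    return [wi / total for wi in w]

def local_modes(p):
    """Values x at which p is strictly larger than both neighbours."""
    padded = [-1.0] + list(p) + [-1.0]
    return [x for x in range(len(p))
            if padded[x + 1] > padded[x] and padded[x + 1] > padded[x + 2]]

def mean(p):
    return sum(x * px for x, px in enumerate(p))

n = 127
T = lambda x: x * (x - n)                    # added sufficient statistic of Example 1

mult_binom = exp_family_density(n, -0.0122, 0.018, T)
binom = exp_family_density(n, math.log(50 / (127 - 50)), 0.0, T)

print("multiplicative binomial modes:", local_modes(mult_binom))
print("binomial modes:", local_modes(binom))
print("means:", mean(mult_binom), mean(binom))
```

Because the parameter values quoted in the example are rounded, the mean of the multiplicative binomial computed this way should only be expected to be close to 50, whereas the binomial mean is 50 up to floating point error.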

Example 2. Double exponential families [4] are two parameter exponential families that extend standard
unidimensional exponential families such as the binomial and the Poisson. Similar to the multiplicative binomial
in Example 1, the extra parameter involved in double exponential families controls the variance independently of
the mean. The density for the so-called double binomial family can be written in the form (2) with

$$T(x) = x \log\left(\frac{x}{n}\right) + (n - x) \log\left(1 - \frac{x}{n}\right),$$

$h(x) = \binom{n}{x}$ and with the particular restriction that θ2 < 1 (see [4] for details). The range θ2 < 0 generates
underdispersion and θ2 ∈ [0, 1) generates overdispersion with respect to the binomial. As shown on the right
panel of Figure 1, the double binomial density can also be multimodal where the double binomial density shown
has the same mean and variance as the multiplicative binomial shown in the left panel.

These examples show that while extending exponential families can lead to useful modeling
properties such as overdispersion, the extension can also result in distributions that are not suitable
for modeling. We are interested in the relationship between geometric properties of one dimensional
families and the modeling properties of their distributions.

3. Sample Space and Distribution-valued Random Variables

We consider first the general case where the sample space for a single observation X1 consists of
n bins
Sn = {B1, B2, . . . , Bn−1, Bn} .

We consider the space of all probability distributions P on this sample space Sn. Each probability
distribution in P is defined by the n-tuple p whose ith component is

pi = Pr(Bi)

so that P can be identified with the n − 1 simplex

Δn−1 = {p ∈ Rn : pi ≥ 0 ∀i, 1′p = 1}

where 1 in 1′p is the vector 1 ∈ Rn each of whose components is 1. We will slightly abuse the notation
by using p to name a point in Δn−1, and hence in Rn, as well as the corresponding distribution in P.
The sample space for a random sample of size N from a distribution p0 ∈ Δn−1 is

$$\mathcal{X}^N_n = \{x : x \text{ is an } n \text{ vector of nonnegative integers that sum to } N\}.$$

There is a simple relationship between $\mathcal{X}^N_n$ and the simplex that we obtain by dividing each component
of x by N. Although the sample space $\mathcal{X}^N_n$ can be viewed as formed by compositional data, we will
follow a different approach to handle this kind of data compared with the classical approach described
by Aitchison [5] because the data we consider have additional structure.
In Figure 2 the sample space for the sample of size N = 10 is displayed using open circles. The
vertices correspond to the case where all 10 values fall in a single bin. The other points correspond
to the less extreme cases. Let p0 be any point in Δn−1. By mapping the multinomial random variable
of counts X to Δn−1, we obtain the random distribution $\hat{P} = X/N$ whose values are multinomial
distributions each having number of cases N and probability vector X/N. Identifying $\mathcal{X}^N_n$-valued
random variables with distribution-valued random variables provides a natural means for comparing
data with probability models using the Kullback–Leibler (KL) divergence.
We can compare distributions in Δn−1 using the KL divergence $D : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$

$$D(p_1, p_2) = \sum p_1 \log(p_1/p_2) = H(p_1, p_2) - H(p_1)$$

where $H(p_1, p_2) = -\sum p_1 \log(p_2)$ and $H(p_1) = H(p_1, p_1)$ is the entropy of $p_1$. Note that the arguments
to D and H are distributions while the logarithm and ratios are defined on points in $\mathbb{R}^n$. Following Wu
and Vos [6], the variance of the random distribution $\hat{P}$ is defined to be

$$\mathrm{Var}_{p_0}(\hat{P}) = \min_{p \in \Delta_{n-1}} E_{p_0} D(\hat{P}, p)$$

and its mean is defined to be

$$E_{p_0}(\hat{P}) = \arg\min_{p \in \Delta_{n-1}} E_{p_0} D(\hat{P}, p).$$

Note that the expectations on the right hand side of the equations above are for real-valued random
variables while the expectation on the left hand side of the second equation is for a distribution-valued
random variable.

Figure 2. Simplex for n = 3 bins and sample space for N = 10 observations.

It is not difficult to show that $E_{p_0} \hat{P} = p_0$ so that $\hat{P}$ can be considered an unbiased estimator for
p0. Details are in [6], which also shows that the KL risk can be decomposed into bias-squared and
variance terms:

$$E_{p_0} D(\hat{P}, q) = D(p_0, q) + \mathrm{Var}_{p_0}(\hat{P}).$$

The distributional variance is related to the entropy

$$\mathrm{Var}_{p_0}(\hat{P}) = E_{p_0} D(\hat{P}, p_0) = H(p_0) - E_{p_0} H(\hat{P}).$$

Note that for N = 1, $H(\hat{P}) = 0$ so that for a single observation the random distribution $\hat{P}$ taking values
on the vertices of Δn−1 has variance equal to the entropy of p0.
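The risk decomposition and the entropy identity above can be verified by brute-force enumeration of the multinomial sample space. The sketch below assumes the convention 0 log 0 = 0; the probability vectors `p0` and `q`, the sample size and all function names are illustrative choices rather than quantities taken from [6].

```python
import math
from itertools import product

def kl(p, q):
    """KL divergence D(p, q) with the convention 0 * log(0 / q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def sample_space(n, N):
    """All n-vectors of nonnegative integers that sum to N."""
    return [x for x in product(range(N + 1), repeat=n) if sum(x) == N]

def multinomial_pmf(x, p):
    coef = math.factorial(sum(x))
    for xi in x:
        coef //= math.factorial(xi)          # exact integer division at each step
    return coef * math.prod(pi ** xi for pi, xi in zip(p, x))

def expect(g, p0, N):
    """E_{p0} g(P_hat), where P_hat = X / N and X is multinomial(N, p0)."""
    return sum(multinomial_pmf(x, p0) * g([xi / N for xi in x])
               for x in sample_space(len(p0), N))

p0 = [0.5, 0.3, 0.2]     # assumed true distribution
q = [0.2, 0.2, 0.6]      # an arbitrary comparison distribution
N = 10

variance = expect(lambda ph: kl(ph, p0), p0, N)   # Var_{p0}(P_hat)
risk = expect(lambda ph: kl(ph, q), p0, N)        # E_{p0} D(P_hat, q)

print(risk, kl(p0, q) + variance)
print(variance, entropy(p0) - expect(entropy, p0, N))
```

Each printed pair should agree up to floating point error; repeating the computation with N = 1 gives a variance equal to the entropy of `p0`.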
For inference, p0 is unknown but we specify a subspace M ⊂ Δn−1 that contains p0, or at
least has distributions that are not too different from p0. Estimates can be obtained by choosing a
parameterization for M, say θ, and then considering real-valued functions $\hat{\theta}$ and evaluating these in
terms of bias and variance. Bias and variance are useful descriptions when θ describes a feature of
the distribution that is of inherent interest. However, if θ is simply a parameterization, or if there are
other features that are also of interest, then these quantities are less useful. For inference regarding the
distribution p0 we can use a distribution-valued estimator $\hat{P}_M$ where the subscript indicates that the
estimator is defined to account for the fact that p0 ∈ M.
We will not pursue the details of distribution-valued estimators here; we mention these only
because all the subspaces we consider will be exponential families and in this case the maximum
likelihood estimator has important properties in terms of distribution variance and distribution bias:
when M is an exponential family, the maximum likelihood estimator is distribution unbiased, and it
uniquely minimizes the distribution variance among the class of all distribution unbiased estimators.
Furthermore, when p0 ∉ M then the maximum likelihood estimator is the unique unbiased minimum
distribution variance estimator of the distribution in M that is closest (in terms of KL) to p0. Extensions
of one dimensional exponential families that do not result in exponential families will not enjoy these
properties of maximum likelihood estimation. Details of these results that hold for sample spaces more
general than Sn are in [7].

4. Simplices Δs

One dimensional exponential families on Sn are curves in Δn−1 whose properties will depend on
their location within various subspaces of Δn−1. An important collection of subspaces will be indexed
by the subsets of Sn. For notational convenience we take Bi to be the integer i. Using integers is suggestive
of an ordering and a scale structure but at this point these are only being used to indicate distinct bins.
For each s ⊂ Sn,

$$\Delta_s = \{p \in \mathbb{R}^n : p_i \ge 0 \ \forall i \in s, \ p_i = 0 \ \forall i \in s^c, \ 1'p = 1\}$$

where $s^c = \{i \in S_n : i \notin s\}$. Note that $\Delta_{S_n} = \Delta_{n-1}$. The interior of $\Delta_s$ is

$$\Delta^\circ_s = \{p \in \Delta_s : p_i > 0 \ \forall i \in s\}.$$

As probability distributions in P, $\Delta^\circ_s$ corresponds to the set of all distributions having support s. There
is a simple and obvious relationship between the dimension of $\Delta_s$, $|\Delta_s|$, and the cardinality of s, |s|,
which holds for all nonempty s ⊂ Sn

$$|\Delta_s| + 1 = |s|.$$

The boundary of $\Delta_s$ is defined as

$$\partial\Delta_s = \{p \in \Delta_s : p \notin \Delta^\circ_s\}$$

so that

$$\Delta_s = \Delta^\circ_s \uplus \partial\Delta_s$$

where $\uplus$ indicates the sets in the union are disjoint. The boundary $\partial\Delta_s$ can be written as the union of
all simplices of dimension one less than that of $\Delta_s$

$$\partial\Delta_s = \bigcup_{s' : s' \subset s, \ |s'| = |s| - 1} \Delta_{s'} \qquad (3)$$

This boundary property for $\Delta_s$ holds because the simplex Sn consists of all possible subsets. Each
nonempty s ⊂ Sn specifies one of the possible supports for a distribution in P

$$\Delta_s = \bigcup_{s' : s' \subset s} \Delta^\circ_{s'} \qquad (4)$$

where we set $\Delta_\emptyset = \emptyset$.

5. Cones Λs

The set of all nonempty subsets of the sample space provides a partition of Δn−1 based on the
support of the distributions in P. The elements in the partition are simplices whose dimension is one
less than the cardinality of the indexing set. In most cases we will consider models having support
Sn, that is, models corresponding to $\Delta^\circ_{n-1}$. If we use subsets s to define the mode rather than support,
we obtain a partition of $\mathcal{P}^\circ$, the distributions in P having support Sn. This partition can be expressed
using convex cones in an n − 1 dimensional plane $V_{n-1}$. The dimension of each cone is n minus the
cardinality of the indexing set and the relationship between interiors of cones and their boundaries is
analogous to that for simplices expressed in Equations (3) and (4).
Let

$$V_{n-1} = \{v \in \mathbb{R}^n : 1'v = 0\} \qquad (5)$$

be the subspace of $\mathbb{R}^n$ of dimension n − 1 of all vectors that sum to zero. For each nonempty s ⊂ Sn define

$$\Lambda_s = \{v \in V_{n-1} : v_i \ge v_j \ \forall i \in s, \ \forall j \in S_n\}.$$

It is easily checked that $\Lambda_s$ is a convex cone

$$v_1, v_2 \in \Lambda_s \implies a_1 v_1 + a_2 v_2 \in \Lambda_s \quad \forall a_1, a_2 \in [0, \infty).$$

The dimension of $\Lambda_s$ is $|\Lambda_s| = n - |s|$ since each point $j \in s^c$ provides a basis vector $b_j$ whose ith
component is 1 if i ∈ s or i = j and is zero otherwise and $|s^c| = n - |s|$. The interior of $\Lambda_s$ is

$$\Lambda^\circ_s = \{v \in \Lambda_s : v_i > v_j \ \forall i \in s, \ \forall j \in s^c\},$$

the boundary is

$$\partial\Lambda_s = \{v \in \Lambda_s : v \notin \Lambda^\circ_s\},$$

so that

$$\Lambda_s = \Lambda^\circ_s \uplus \partial\Lambda_s$$

by definition. Note $\Lambda_{S_n} = \Lambda^\circ_{S_n} = 0 \in V_{n-1} \subset \mathbb{R}^n$ where the first equality holds because the conditions
in the definition of $\Lambda^\circ_s$ hold vacuously since $i \in S^c_n = \emptyset$ adds no restriction. Likewise, we can extend
the definition of $\Lambda_s$ to include s = ∅ and since i ∈ ∅ adds no restriction

$$\Lambda_\emptyset = \Lambda^\circ_\emptyset = V_{n-1}.$$

Note that Λ∅ depends on the cardinality of the set Sn. Since we are considering n fixed, we will not
show this dependence in the notation.
Corresponding to Equation (3) we have for all nonempty s that the boundary of the cone Λs is the
union of all cones having dimension one less than the dimension of Λs

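$$\partial\Lambda_s = \bigcup_{s' : s \subset s', \ |s'| = |s| + 1} \Lambda_{s'}. \qquad (6)$$

Corresponding to Equation (4) we have

$$\Lambda_s = \bigcup_{s' : s \subset s'} \Lambda^\circ_{s'} \qquad (7)$$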

The relationship between the simplices Δ and cones Λ is more easily seen if we suppress the
sets that index these objects. Let Δ and Δ∗ be any two simplices and let Λ and Λ∗ be any two convex
cones. We only consider cones and simplices that correspond to a nonempty subset of Sn. Then the
Equations (6) and (7) for the convex cones are obtained by simply replacing Δ in Equations (3) and (4)
with Λ:
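$$\partial\Delta = \bigcup_{\Delta_* : |\Delta_*| = |\Delta| - 1} \Delta_*, \qquad \partial\Lambda = \bigcup_{\Lambda_* : |\Lambda_*| = |\Lambda| - 1} \Lambda_* \qquad (8)$$

$$\Delta = \bigcup_{\Delta_* \subset \Delta} \Delta^\circ_*, \qquad \Lambda = \bigcup_{\Lambda_* \subset \Lambda} \Lambda^\circ_* \qquad (9)$$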

Equation (9) also holds for the empty set since Δ∅ = ∅ and Λ∅ = Vn−1.

6. Vn−1 and P◦

There is a natural bijection φ between $V_{n-1}$ and $\Delta^\circ_{n-1}$ defined by

$$\varphi(p) = \log(p) - m(p) 1$$

where log(p) is the vector with ith component log(pi) and m(p) is defined so that 1′φ(p) = 0. The
inverse is

$$\phi(v) = k^{-1}(v) \exp(v)$$

where exp(v) is the vector with ith component exp(vi) and k(v) is defined so that 1′ϕ(v) = 1.
Each cone $\Lambda^\circ_s$ in the partition

$$V_{n-1} = \bigcup_{s} \Lambda^\circ_s$$

corresponds to one of the $2^n - 1$ possible modes for any distribution having support Sn since $v_i > v_j$ if
and only if $\phi_i(v) > \phi_j(v)$.
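The bijection and the correspondence between cones and modes can be illustrated with a short sketch. It assumes that m(p) is the average of the components of log(p), which is what the constraint 1′φ(p) = 0 requires; the function names and the tie tolerance `tol` are hypothetical choices made for the illustration.

```python
import math

def to_natural(p):
    """phi(p) = log(p) - m(p) 1, with m(p) the average of log(p)."""
    logs = [math.log(pi) for pi in p]
    m = sum(logs) / len(logs)
    return [l - m for l in logs]

def to_simplex(v):
    """Inverse map exp(v) / (1' exp(v)), computed with a stabilising shift."""
    shift = max(v)
    w = [math.exp(vi - shift) for vi in v]
    k = sum(w)
    return [wi / k for wi in w]

def mode_set(v, tol=1e-12):
    """Bins (labelled 1..n) at which v is maximal, i.e. the s whose open cone contains v."""
    top = max(v)
    return {i + 1 for i, vi in enumerate(v) if top - vi <= tol}

p = [0.4, 0.4, 0.2]
v = to_natural(p)

print(sum(v))            # approximately zero, so v lies in V_{n-1}
print(to_simplex(v))     # recovers p up to rounding
print(mode_set(v))       # bins 1 and 2 share the mode
```

Since the inverse map is increasing in each component, the set of maximal components of v coincides with the set of modal bins of the corresponding distribution.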

7. Vn−1 and Exponential Families in P◦

We define a line by a pair of vectors v0, v1 ∈ Vn−1 with v1 ≠ 0

ℓ = ℓ(t) = {v ∈ Vn−1 : v = v0 + tv1, t ∈ R}

Note that v0 and v1 are not unique. Applying the inverse transformation ϕ to points in ℓ gives
probability densities

$$\phi(v_0 + t v_1) = \frac{\exp(v_0 + t v_1)}{1' \exp(v_0 + t v_1)} \qquad (10)$$

which have the exponential family form with t playing the role of the natural parameter. Therefore,
the space $V_{n-1}$ is easily recognized as the natural parameter space for the distributions $\Delta^\circ_{n-1}$ so that
each line ℓ in $V_{n-1}$ corresponds to a one dimensional exponential family.
For each line ℓ(t) there is a value $t_{\max}$ such that $\{\ell(t) : t \ge t_{\max}\}$ is contained in one of the cones
$\Lambda^\circ_s$ where s is the subset of Sn with the property that $v_1^i \ge v_1^j$ for all i ∈ s for vectors $v_1 \in \Lambda^\circ_x$. For each
line ℓ(t) there is a value $t_{\min}$ such that $\{\ell(t) : t \le t_{\min}\}$ is contained in one of the cones $\Lambda^\circ_{s'}$ where s′ is
the subset of Sn with the property that $v_1^i \le v_1^j$ for all i ∈ s′ for vectors $v_1 \in \Lambda^\circ_x$. The cones $\Lambda^\circ_s$ and $\Lambda^\circ_{s'}$
are disjoint and will be called the extremal cones for ℓ. There is at least one other cone $\Lambda^\circ_{s''}$ such that
$\ell \cap \Lambda^\circ_{s''} \ne \emptyset$.
Any one dimensional exponential family ℓ(t) can be described by an ordered sequence of
disjoint cones

$$\left(\Lambda^\circ_{s_1}, \Lambda^\circ_{s_2}, \ldots, \Lambda^\circ_{s_k}\right)$$

where k = k(ℓ) will depend on the family. These are simply the cones that are traversed by ℓ(t)
between its extremal cones. We take $\Lambda^\circ_{s_k}$ to be the cone that contains ℓ(t) for all sufficiently large t.
Equation (6) for cones means that

$$\partial\Lambda_{s_i} \subset \Lambda_{s_j} \text{ for } j = i + 1 \text{ or } j = i - 1$$

The ordered sequence of cones provides an ordered sequence of unique subsets of Sn

(s1, s2, . . . , sk)

that we call the modal profile for ℓ as these are the modes realized by the exponential family ℓ(t) between
its extremal cones that have modes s1 and sk.
Each point on a line ℓ(t) in Vn−1 corresponds to a distribution having support Sn. As t goes to −∞
(+∞), ϕ(ℓ(t)) goes to a distribution having support s1 (sk). In fact, these are the uniform distributions
on these supports. For every s ⊂ Sn other than ∅ and Sn, the uniform distribution on s is a limiting
distribution for some one dimensional exponential family in P◦.
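The modal profile of a given line can be computed exactly by locating the values of t at which two components of v0 + tv1 tie. The following is a minimal sketch under the assumption that v0 and v1 have rational (or exactly representable) entries, so that exact arithmetic with `fractions.Fraction` detects ties reliably; the function names and the example line are hypothetical.

```python
from fractions import Fraction

def mode_set(v):
    """Bins (labelled 1..n) at which v attains its maximum."""
    top = max(v)
    return frozenset(i + 1 for i, vi in enumerate(v) if vi == top)

def modal_profile(v0, v1):
    """Ordered mode sets met by the line v0 + t*v1 as t runs from -inf to +inf."""
    v0 = [Fraction(a) for a in v0]
    v1 = [Fraction(a) for a in v1]
    n = len(v0)
    # Values of t at which two components of v0 + t*v1 are equal.
    ties = sorted({(v0[i] - v0[j]) / (v1[j] - v1[i])
                   for i in range(n) for j in range(i + 1, n)
                   if v1[i] != v1[j]})
    if not ties:
        times = [Fraction(0)]
    else:
        # Probe before the first tie, at every tie, between consecutive
        # ties, and after the last tie.
        times = [ties[0] - 1]
        for a, b in zip(ties, ties[1:]):
            times += [a, (a + b) / 2]
        times += [ties[-1], ties[-1] + 1]
    profile = []
    for t in times:
        s = mode_set([a + t * b for a, b in zip(v0, v1)])
        if not profile or profile[-1] != s:
            profile.append(s)
    return [sorted(s) for s in profile]

print(modal_profile([-1, 2, -1], [-1, 0, 1]))
# prints [[1], [1, 2], [2], [2, 3], [3]]
```

For this line the extremal mode sets are {1} and {3}, so the limiting distributions as t → −∞ and t → +∞ are the point masses on bins 1 and 3, that is, the uniform distributions on those one-element supports.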
Figure 3 shows Vn−1 for the two dimensional simplex shown in Figure 2. The three rays are the
one dimensional cones and the spaces between these cones are the two dimensional cones. The origin
is the zero dimensional cone. The sample values on the boundary of Δ2 are not in V2. Note that the
one dimensional cones are line segments in Δ2.

Figure 3. V2 for n = 3 bins and sample space for N = 10 observations that are in the interior of Δ2.

8. Ordered Bins and the Monotone Likelihood Ratio Property

Let the bins be ordered and assign the first n integers to the bins to reflect this ordering. We seek
to define exponential families that have a modal profile of the form

$$(\{1\}, \{1, 2\}, \{2\}, \{2, 3\}, \ldots, \{n-1, n\}, \{n\}) \qquad (11)$$

or a contiguous sub-collection of this profile. Extensions to three or more contiguous modes are clearly
possible but not discussed here.
From the definition of modal profile, it follows that a family with modal profile (11) will have the
property that the mode is a non-decreasing function of t. In addition to this property for the mode, we
want the likelihood ratio for any two members of the family to provide the same ordering structure

51


Entropy 2014, 16, 4088–4100

as that of the bins. A family that satisfies this condition is said to have the monotone likelihood ratio
property with respect to x where x takes the values of the bin labels: 1, 2, . . . , n. Let pθ1 and pθ2 be
two distributions in a one dimensional family parameterized by θ and let pθ2/pθ1 be the n-vector with

components pj
θ2/pj
θ1 for 1 ≤ j ≤ n. This family has monotone likelihood ratio if for all θ1 < θ2 and
j < j′

pj
θ2

pj
θ1
<
pj′

θ2

pj′
θ1

.

A family with this property avoids the problem situation where in general the data in the higher
numbered bins are evidence for pθ2 but in going from a particular bin, say j0 to j0 + 1, the likelihood
ratio actually decreases. Exponential families such as the binomial and Poisson have this monotone
likelihood ratio property for the bin labels. The monotone likelihood ratio property can be extended
to allow for likelihood ratios that are monotone in some function of x. An important advantage of
families with the monotone likelihood ratio property is the existence of uniformly most powerful tests.
To ensure that our exponential families have the monotone likelihood ratio property we consider
vectors in the cone $\Lambda_\uparrow \subset \Lambda_n$

$$\Lambda_\uparrow = \{v : v_i < v_j, \ i < j\}.$$

From Equation (10), the exponential family indexed by θ is $k(\theta) \exp(v_0 + \theta v_1)$, and hence

$$\frac{p^j_{\theta_2}}{p^j_{\theta_1}} = \frac{k(\theta_2)}{k(\theta_1)} \exp\left((\theta_2 - \theta_1) v_1^j\right)$$

so that the likelihood ratio is monotone in j if $v_1 \in \Lambda_\uparrow$.
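The role of the direction v1 can be illustrated numerically. The sketch below builds two members of a family from Equation (10) and checks whether their likelihood ratio is increasing in the bin index; the base point `v0`, the V-shaped direction `v1_bad`, the parameter values and the function names are all hypothetical choices made for the illustration.

```python
import math

def family_member(v0, v1, theta):
    """Distribution exp(v0 + theta*v1) / (1' exp(v0 + theta*v1)) from Equation (10)."""
    z = [a + theta * b for a, b in zip(v0, v1)]
    shift = max(z)
    w = [math.exp(zi - shift) for zi in z]
    total = sum(w)
    return [wi / total for wi in w]

def has_increasing_ratio(p_low, p_high):
    """True if p_high[j] / p_low[j] is strictly increasing in the bin index j."""
    ratios = [h / l for l, h in zip(p_low, p_high)]
    return all(r0 < r1 for r0, r1 in zip(ratios, ratios[1:]))

def centre(values):
    """Subtract the average so that the components sum to zero."""
    avg = sum(values) / len(values)
    return [x - avg for x in values]

n = 8
bins = range(1, n + 1)
v1_up = centre([float(j) for j in bins])               # strictly increasing direction
v1_bad = centre([abs(j - (n + 1) / 2) for j in bins])  # V-shaped direction, not increasing
v0 = centre([-math.lgamma(j) for j in bins])           # base point that is itself not increasing

theta_low, theta_high = 0.3, 1.1
mlr_up = has_increasing_ratio(family_member(v0, v1_up, theta_low),
                              family_member(v0, v1_up, theta_high))
mlr_bad = has_increasing_ratio(family_member(v0, v1_bad, theta_low),
                               family_member(v0, v1_bad, theta_high))
print(mlr_up, mlr_bad)
```

The base point cancels in the ratio, so only the ordering of the components of the direction vector matters for the check.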

9. Selecting Vectors in Λ↑

In order to choose n-dimensional vectors v ∈ Λ↑ we will consider a set of infinite dimensional
vectors f. Let $\bar{f} : \mathbb{R} \to \mathbb{R}$ and consider $f = \bar{f}|_{\mathbb{Z}}$ where $\mathbb{Z}$ is the set of integers. The function f is
represented by a doubly infinite sequence

$$f = \ldots, f_{j-1}, f_j, f_{j+1}, \ldots$$

and we denote the set of all such functions as

$$\mathcal{F} = \{f : f_j \in \mathbb{R} \ \forall j \in \mathbb{Z}\}.$$

While it is not necessary to consider functions $\bar{f}$ to define f, these functions are useful to describe
properties of f, which can be thought of as a discretized version of $\bar{f}$.
Define the gradient of f as the function ∇f whose jth component is

$$(\nabla f)_j = f_j - f_{j-1}$$

The simplest functions in $\mathcal{F}$ are the constant functions

$$\mathcal{F}_0 = \{f \in \mathcal{F} : f_j = f_{j'} \ \forall j, j' \in \mathbb{Z}\}.$$

The next simplest functions are those whose gradient is constant. We call these first order functions
and denote the set of these as

$$\mathcal{F}_1 = \{f \in \mathcal{F} : \nabla f \in \mathcal{F}_0\}.$$


Functions in F1 are such that the change from one bin to the next is the same for all bins. That is,
these functions describe constant change. We can write the functions in F1 explicitly as

$$\mathcal{F}_1 = \{f \in \mathcal{F} : f_j = aj + b, \ a, b \in \mathbb{R}\}$$

which shows that each f ∈ F1 is the discretized version of a function $\bar{f}$ whose graph is a line in R × R.
We obtain a vector v from f by defining the jth component of v as

$$v_j = f_j - \frac{1}{n} \sum_{i=1}^{n} f_i.$$

From this definition we see that the intercept b of f does not affect v and that the slope is a scaling
factor so that the restriction to first order functions results in a single direction in Λ↑. This direction
defines the one dimensional cone defined by the vector with vj = j − (n + 1)/2.
Additional directions can be obtained from the second order functions

$$\mathcal{F}_2 = \{f \in \mathcal{F} : \nabla f \in \mathcal{F}_1\}.$$

If f ∈ F2 then $(\nabla^2 f)_j = a$ for some a ∈ R and for all j ∈ Z. Using the fact that

$$(\nabla^2 f)_j = (\nabla(\nabla f))_j = (f_j - f_{j-1}) - (f_{j-1} - f_{j-2}) = f_j + f_{j-2} - 2 f_{j-1}$$

the second order functions can be written explicitly as

$$\mathcal{F}_2 = \left\{f \in \mathcal{F} : f_j = \frac{a}{2} j(j+1) + bj + c, \ a, b, c \in \mathbb{R}\right\}.$$

In order for the vector v obtained from f ∈ F2 to be in Λ↑ we need $(\nabla f)_j \ge 0$ for j = 1, 2, . . . , n.
With $f_j = (a/2) j(j+1) + bj + c$ we have $(\nabla f)_j = aj + b$ so that for a > 0 we require b ≥ −a and for
a < 0 we require b ≥ −an. Since we are concerned with the direction rather than the magnitude we
can take a = ±1 and the value of c is chosen so the sum of the components is zero.
The second order vectors in Λ↑ consist of the cone defined by the vectors v20 and v21 having
components defined by

$$(n-1)(v_{20})_j = \frac{1}{2} j(j+1) - j - c_{20}$$

$$(n-1)(v_{21})_j = -\frac{1}{2} j(j+1) + nj - c_{21}$$

Notice that this cone contains v1 since v1 is proportional to v20 + v21. Many discrete one dimensional
exponential families (e.g., binomial, negative binomial, and Poisson) use the vector v1. Furthermore,
many continuous one dimensional exponential families use the continuous function f used to define
v1: normal with σ known, and the gamma and inverse Gaussian distributions with known shape
parameter (the shape parameter is the non-scale parameter). The cone defined by v20 and v21 allows us
to perturb the v1 direction to obtain related exponential families that we would expect to have similar
properties. Figure 4 shows v20 and v21 as well as v1 = 0.5v20 + 0.5v21.
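The construction of v1, v20 and v21 can be reproduced with exact rational arithmetic. This is a minimal sketch assuming bins labelled j = 1, . . . , n, the (n − 1) scaling of the displayed definitions, and constants c20 and c21 chosen so that the components sum to zero; the function names are hypothetical. Monotonicity is checked in the weak sense, matching the requirement (∇f)j ≥ 0.

```python
from fractions import Fraction

def centred(f_values):
    """Subtract the average so that the components sum to zero (a vector in V_{n-1})."""
    n = len(f_values)
    avg = sum(f_values, Fraction(0)) / n
    return [Fraction(x) - avg for x in f_values]

def second_order_vectors(n):
    """v1, v20 and v21 for bins j = 1..n, with the (n - 1) scaling of the text."""
    js = range(1, n + 1)
    v1 = centred([Fraction(j) for j in js])
    v20 = [x / (n - 1)
           for x in centred([Fraction(j * (j + 1), 2) - j for j in js])]
    v21 = [x / (n - 1)
           for x in centred([-Fraction(j * (j + 1), 2) + n * j for j in js])]
    return v1, v20, v21

def is_nondecreasing(v):
    return all(a <= b for a, b in zip(v, v[1:]))

n = 128
v1, v20, v21 = second_order_vectors(n)
combo = [a + b for a, b in zip(v20, v21)]

print(is_nondecreasing(v20), is_nondecreasing(v21))
print(combo == v1)   # the sum of v20 and v21 points in the v1 direction
```

With this scaling the quadratic terms cancel in the sum, which is why v20 + v21 is a first order vector and hence proportional to v1.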
Other vectors can be used to define cones around v1. Looking at common exponential families we
see that log(x) and $x^{-1}$ are sufficient statistics so that these suggest taking $\bar{f}(x) = \log(x)$ or $\bar{f}(x) = 1/x$.
These can be further generalized to $\bar{f}(x; \lambda)$, which can be the power family or some other family of
transformations. The vectors $v_{f0}$ and $v_{f1}$ are defined using the discretized f with the constraints that
$v_{f0}, v_{f1} \in \Lambda_\uparrow$ and $0.5 v_{f0} + 0.5 v_{f1} = v_1$.


An exponential family with sufficient statistic x can be modified by choosing a function $\bar{f}(x)$ and
0 ≤ α ≤ 1 where α = 0.5 corresponds to the original exponential family and other values perturb this
direction. We denote this vector as $v_{f\alpha}$ so that $v_0 + t v_{f\alpha}$ is the natural parameter of the modified family.
Figure 4 shows the components of the vectors v20 and v21.

Figure 4. Components of the vectors v20 and v21 for n = 128 bins.

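Since v0 is common to each exponential family with natural parameter $\ell(t) = v_0 + t v_{f\alpha}$, the
monotone likelihood ratio property will hold even if v0 ∉ Λ↑. Initial choices for v0 are suggested by
the Poisson, binomial, and negative binomial distributions:

$$(v_{\mathrm{Poisson}})_j = -\log\Gamma(j) + c, \qquad v_{\mathrm{Poisson}} \notin \Lambda_\uparrow$$

$$(v_{\mathrm{binomial}})_j = \log\Gamma(n) - \log\Gamma(j) - \log\Gamma(n-j) + c, \qquad v_{\mathrm{binomial}} \notin \Lambda_\uparrow$$

$$(v_{\mathrm{neg.bin.}})_j = \log\Gamma(j+r) - \log\Gamma(j) + c, \qquad v_{\mathrm{neg.bin.}} \in \Lambda_\uparrow$$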

where c is a constant chosen so that the components sum to zero, n is the number of bins, and r is a
positive real constant.
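The membership statements for the Poisson and negative binomial choices of v0 can be checked with a short sketch. It assumes bins labelled j = 1, . . . , n and centres each vector so that its components sum to zero, as required of vectors in Vn−1 by Equation (5); the values of `n` and `r` and the function names are illustrative. The binomial vector is omitted because log Γ(n − j) is not finite at j = n under this labelling.

```python
import math

def centre(values):
    """Subtract the average so that the components sum to zero."""
    avg = sum(values) / len(values)
    return [x - avg for x in values]

def is_strictly_increasing(v):
    """Membership test for the cone of vectors with v_i < v_j whenever i < j."""
    return all(a < b for a, b in zip(v, v[1:]))

n, r = 20, 2.5   # illustrative number of bins and negative binomial constant
bins = range(1, n + 1)

v_poisson = centre([-math.lgamma(j) for j in bins])
v_negbin = centre([math.lgamma(j + r) - math.lgamma(j) for j in bins])

print(is_strictly_increasing(v_poisson))   # Poisson base point is not increasing
print(is_strictly_increasing(v_negbin))    # negative binomial base point is increasing
```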

Author Contributions: This paper was initiated by the first author but all sections reflect a collaborative effort.
Both authors have read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Altham, P.M.E. Two Generalizations of the Binomial Distribution. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1978, 27, 162–167.
2. Lovison, G. An alternative representation of Altham’s multiplicative-binomial distribution. Stat. Probab. Lett. 1998, 36, 415–420.
3. Brown, L. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory; IMS Lecture Notes; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
4. Efron, B. Double Exponential Families and Their Use in Generalized Linear Regression. J. Am. Stat. Assoc. 1986, 81, 709–721.
5. Aitchison, J. The Statistical Analysis of Compositional Data; Chapman and Hall: London, UK, 1986.
6. Wu, Q.; Vos, P. Decomposition of Kullback–Leibler risk and unbiasedness for parameter-free estimators. J. Stat. Plan. Inference 2012, 142, 1525–1536.
7. Vos, P.; Wu, Q. Maximum Likelihood Estimators Uniformly Minimize Distribution Variance among Distribution Unbiased Estimators in Exponential Families. Bernoulli 2014, submitted.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
On the Fisher Metric of Conditional Probability
Polytopes

Guido Montúfar 1,*, Johannes Rauh 1 and Nihat Ay 1,2,3

1 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig 04103, Germany; E-Mails:
jrauh@mis.mpg.de (J.R.); nay@mis.mpg.de (N.A.)
2 Department of Mathematics and Computer Science, Leipzig University, PF 10 09 20, Leipzig 04009, Germany
3 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
* E-Mail: montufar@mis.mpg.de; Tel.: +49-341-9959-521.

Received: 31 March 2014; in revised form: 18 May 2014 / Accepted: 29 May 2014 /
Published: 6 June 2014

Abstract: We consider three different approaches to define natural Riemannian metrics on polytopes
of stochastic matrices. First, we define a natural class of stochastic maps between these polytopes and
give a metric characterization of Chentsov type in terms of invariance with respect to these maps.
Second, we consider the Fisher metric defined on arbitrary polytopes through their embeddings as
exponential families in the probability simplex. We show that these metrics can also be characterized
by an invariance principle with respect to morphisms of exponential families. Third, we consider
the Fisher metric resulting from embedding the polytope of stochastic matrices in a simplex of joint
distributions by specifying a marginal distribution. All three approaches result in slight variations
of products of Fisher metrics. This is consistent with the nature of polytopes of stochastic matrices,
which are Cartesian products of probability simplices. The first approach yields a scaled product of
Fisher metrics; the second, a product of Fisher metrics; and the third, a product of Fisher metrics
scaled by the marginal distribution.

Keywords: Fisher information metric; information geometry; convex support polytope; conditional
model; Markov morphism; isometric embedding; natural gradient

1. Introduction

The Riemannian structure of a function’s domain has a crucial impact on the performance of
gradient optimization methods, especially in the presence of plateaus and local maxima. The natural
gradient [1] gives the steepest increase direction of functions on a Riemannian space. For example,
artificial neural networks can often be trained by following some function’s gradient on a space of
probabilities. In this context, it has been observed that following the natural gradient with respect to
the Fisher information metric, instead of the Euclidean metric, can significantly alleviate the plateau
problem [1,2]. The Fisher information metric, which is also called the Shahshahani metric [3] in biological
contexts, is broadly recognized as the natural metric of probability spaces. An important argument
was given by Chentsov [4], who showed that the Fisher information metric is the only metric on
probability spaces for which certain natural statistical embeddings, called Markov morphisms, are
isometries. More generally, Chentsov’s Theorem characterizes the Fisher metric and α-connections of
statistical manifolds uniquely (up to a multiplicative constant) by requiring invariance with respect
to Markov morphisms. Campbell [5] gave another proof that characterizes invariant metrics on the
set of non-normalized positive measures, which restrict to the Fisher metric in the case of probability
measures (up to a multiplicative constant). In this paper, we explore ways of defining distinguished
Riemannian metrics on spaces of stochastic matrices.

Entropy 2014, 16, 3207–3233; doi:10.3390/e16063207
www.mdpi.com/journal/entropy

In learning theory, when modeling the policy of a system, it is often preferred to consider
stochastic matrices instead of joint probability distributions. For example, in robotics applications,
policies are optimized over a parametric set of stochastic matrices by following the gradient of a
reward function [6,7]. The set of stochastic matrices can be parametrized in many ways, e.g., in terms
of feedforward neural networks, Boltzmann machines [8] or projections of exponential families [9].
The information geometry of policy models plays an important role in these applications and has
been studied by Kakade [2], Peters and co-workers [10–12], and Bagnell and Schneider [13], among
others. A stochastic matrix is a tuple of probability distributions, and therefore, the space of stochastic
matrices is a Cartesian product of probability simplices. Accordingly, in applications, usually a product
metric is considered, with the usual Fisher metric on each factor. On the other hand, Lebanon [14]
takes an axiomatic approach, following the ideas of Chentsov and Campbell, and characterizes a class
of invariant metrics of positive matrices that restricts to the product of Fisher metrics in the case of
stochastic matrices. We will consider three different approaches discussed in the following.
In the first part, we take another look at Lebanon’s approach for characterizing a distinguished
metric on polytopes of stochastic matrices. However, since the maps considered by Lebanon do not
map stochastic matrices to stochastic matrices, we will use different maps. We show that the product
of Fisher metrics can be characterized by an invariance principle with respect to natural maps between
stochastic matrices.
In the second part, we consider an approach that allows us to define Riemannian structures
on arbitrary polytopes. Any polytope can be identified with an exponential family by using the
coordinates of the polytope vertices as observables. The inverse of the moment map then defines
an embedding of the polytope in a probability simplex. This embedding can be used to pull back
geometric structures from the probability simplex to the polytope, including Riemannian metrics,
affine connections, divergences, etc. This approach has been considered in [9] as a way to define
low-dimensional families of conditional probability distributions. More general embeddings can be
defined by identifying each exponential family with a point configuration, B, together with a weight
function, ν. Given B and ν, the corresponding exponential family defines geometric structures on the
set (conv B)◦, which is the relative interior of the convex support of the exponential family. Moreover,
we can define natural morphisms between weighted point configurations as surjective maps between
the point sets, which are compatible with the weight functions. As it turns out, the Fisher metric on
(conv B)◦ can be characterized by invariance under these maps.
In the third part, we return to stochastic matrices. We study natural embeddings of conditional
distributions in probability simplices as joint distributions with a fixed marginal. These embeddings
define a Fisher metric equal to a weighted product of Fisher metrics. This result corresponds to the
definitions commonly used in robotics applications.
All three approaches give very similar results. In all cases, the identified metric is a product
metric. This is a sensible result, since the set of k × m stochastic matrices is a Cartesian product of
probability simplices, $\Delta_{m-1} \times \cdots \times \Delta_{m-1} = \Delta^k_{m-1}$, which suggests using the product metric of the
Fisher metrics defined on the factor simplices, $\Delta_{m-1}$. Indeed, this is the result obtained from our second
approach. The first approach yields the same result with an additional scaling factor of 1/k. The two
approaches differ only when stochastic matrices of different sizes are compared. The third approach
yields a product of Fisher metrics scaled by the marginal distribution that defines the embedding.
Which metric to use depends on the concrete problem and whether a natural marginal distribution
is defined and known. In Section 7, we do a case study using a reward function that is given as an
expectation value over a joint distribution. In this simple example, the weighted product metric gives
the best asymptotic rate of convergence, under the assumption that the weights are optimally chosen.
In Section 8, we sum up our findings.
The paper is organized as follows. Section 2 contains basic definitions around the
Fisher metric and concepts of differential geometry. In Section 3, we discuss the theorems of Chentsov,
Campbell and Lebanon, which characterize natural geometric structures on the probability simplex,
on the set of positive measures and on the cone of positive matrices, respectively. In Section 4, we
study metrics on polytopes of stochastic matrices, which are invariant under natural embeddings. In
Section 5, we define a Riemannian structure for polytopes, which generalizes the Fisher information
metric of probability simplices and conditional models in a natural way. In Section 6, we study a class of
weighted product metrics. In Section 7, we study the gradient flow with respect to an expectation value.
Section 8 contains concluding remarks. In Appendix A, we investigate restrictions on the parameters
of the metrics characterized in Sections 3 and 4 that make them positive definite. Appendix B contains
the proofs of the results from Section 4.

2. Preliminaries

We will consider the simplex of probability distributions on $[m] := \{1, \dots, m\}$, $m \ge 2$, which is
given by $\Delta_{m-1} := \{(p_i)_i \in \mathbb{R}^m : p_i \ge 0, \sum_i p_i = 1\}$. The relative interior of $\Delta_{m-1}$ consists of all strictly
positive probability distributions on $[m]$ and will be denoted $\Delta^\circ_{m-1}$. This is a subset of $\mathbb{R}^m_+$, the cone
of strictly positive vectors. The set of k × m row-stochastic matrices is given by
$\Delta^k_{m-1} := \{(K_{ij})_{ij} \in \mathbb{R}^{k\times m} : (K_{ij})_j \in \Delta_{m-1} \text{ for all } i \in [k]\}$ and is equal to the Cartesian product $\times_{i\in[k]} \Delta_{m-1}$. The relative
interior $(\Delta^k_{m-1})^\circ$ is a subset of $\mathbb{R}^{k\times m}_+$, the cone of strictly positive matrices.
Given two random variables X and Y taking values in the finite sets [k] and [m], respectively, the
conditional probability distribution of Y given X is the stochastic matrix $K = (P(y|x))_{x\in[k],\, y\in[m]}$ with
rows $(P(y|x))_{y\in[m]} \in \Delta_{m-1}$ for all x ∈ [k]. Therefore, the polytope of stochastic matrices $\Delta^k_{m-1}$ is called
a conditional polytope.
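The tangent space of $\mathbb{R}^n_+$ at a point $p \in \mathbb{R}^n_+$, denoted by $T_p\mathbb{R}^n_+$, is the real vector space spanned
by the vectors $\partial_1, \dots, \partial_n$ of partial derivatives with respect to the n components. The tangent space of
$\Delta^\circ_{n-1}$ at a point $p \in \Delta^\circ_{n-1} \subset \mathbb{R}^n_+$ is the subspace $T_p\Delta^\circ_{n-1} \subset T_p\mathbb{R}^n_+$ consisting of the vectors:

$$u = \sum_i u_i \partial_i \in T_p\mathbb{R}^n_+ \quad \text{with} \quad \sum_i u_i = 0. \tag{1}$$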

The Fisher metric on the positive probability simplex $\Delta^\circ_{n-1}$ is the Riemannian metric given by:

$$g^{(n)}_p(u, v) = \sum_{i=1}^{n} \frac{u_i v_i}{p_i}, \quad \text{for all } u, v \in T_p\Delta^\circ_{n-1}. \tag{2}$$

The same formula (2) also defines a Riemannian metric on $\mathbb{R}^n_+$, which we will denote by the same
symbol. This, however, is not the only way in which the Fisher metric can be extended from $\Delta^\circ_{n-1}$
to $\mathbb{R}^n_+$. We will discuss other extensions in the next section (see Campbell’s Theorem, Theorem 2).
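Consider a smoothly parametrized family of probability distributions $M = \{(p(x; \theta))_{x\in[n]} : \theta \in \Omega\} \subseteq \Delta^\circ_{n-1}$,
where $\Omega \subseteq \mathbb{R}^d$ is open. Then, $g^{(n)}$ induces a Riemannian metric on M. Denote by
$\partial_{\theta_i} = \frac{\partial}{\partial\theta_i}$ the tangent vector corresponding to the partial derivative with respect to $\theta_i$, for all i ∈ [d].
Then, the Fisher matrix has coordinates:

$$g^{M}_\theta(\partial_{\theta_i}, \partial_{\theta_j}) = \sum_{x\in[n]} p(x;\theta)\, \frac{\partial \log p(x;\theta)}{\partial\theta_i}\, \frac{\partial \log p(x;\theta)}{\partial\theta_j}, \quad \text{for all } i, j \in [d], \text{ for all } \theta \in \Omega. \tag{3}$$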

Here, it is not necessary to assume that the parameters $\theta_i$ are independent. In particular, the dimension
of M may be smaller than d, in which case the matrix is not positive definite. If the map $\Omega \to M$, $\theta \mapsto p(\cdot\,; \theta)$
is an embedding (i.e., a smooth injective map that is a diffeomorphism onto its image), then $g^M_\theta$
defines a Riemannian metric on Ω, which corresponds to the pull-back of $g^{(n)}$.
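The remark on dependent parameters can be illustrated numerically. The following is a minimal sketch, assuming a softmax family on three states with three parameters; the family, the function names and the parameter values are our own choices for illustration. Adding a constant to all parameters leaves the distribution unchanged, so the Fisher matrix of Equation (3) is degenerate in the direction (1, 1, 1).

```python
import math

def softmax(theta):
    """Distribution p(x; theta) proportional to exp(theta_x) on len(theta) states."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [value / s for value in z]

def fisher_matrix(theta):
    """Fisher matrix of Equation (3) for the softmax family.

    For this family, d log p(x; theta) / d theta_i = [x == i] - p(i; theta).
    """
    p = softmax(theta)
    d = len(theta)
    score = [[(1.0 if x == i else 0.0) - p[i] for i in range(d)] for x in range(d)]
    return [[sum(p[x] * score[x][i] * score[x][j] for x in range(d))
             for j in range(d)] for i in range(d)]

theta = [0.3, -1.2, 0.5]          # arbitrary illustrative parameter values
G = fisher_matrix(theta)

# Quadratic form in the direction (1, 1, 1), along which p(.; theta) is constant.
ones = [1.0, 1.0, 1.0]
degenerate = sum(ones[i] * G[i][j] * ones[j] for i in range(3) for j in range(3))
```

Up to rounding, `degenerate` vanishes, so the matrix is positive semi-definite but not positive definite, in line with the fact that the family has dimension two while d = 3.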
Consider an embedding f : E → E′. The pull-back of a metric g′ on E′ through f is defined as:

$$(f^*g')_p(u, v) := g'_{f(p)}(f_*u, f_*v), \quad \text{for all } u, v \in T_pE, \tag{4}$$


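where $f_*$ denotes the push-forward of $T_pE$ through f, which in coordinates is given by:

$$f_* : T_pE \to T_{f(p)}E'; \quad \sum_i u_i \partial_{\theta_i} \mapsto \sum_j \sum_i u_i \frac{\partial f_j(p)}{\partial\theta_i}\, \partial_{\theta'_j}, \tag{5}$$

where $\{\partial_{\theta_i}\}_i$ spans $T_pE$ and $\{\partial_{\theta'_j}\}_j$ spans $T_{f(p)}E'$.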

An embedding f : E → E′ of two Riemannian manifolds (E, g) and (E′, g′) is an isometry iff:

$$g_p(u, v) = (f^*g')_p(u, v), \quad \text{for all } p \in E \text{ and } u, v \in T_pE. \tag{6}$$

In this case, we say that the metric g is invariant with respect to f (and g′).

3. The Results of Campbell and Lebanon

One of the theoretical motivations for using the Fisher metric is provided by Chentsov’s
characterization [4], which states that the Fisher metric is uniquely specified, up to a multiplicative
constant, by an invariance principle under a class of stochastic maps, called Markov morphisms. Later,
Campbell [5] considered the characterization problem on the space $\mathbb{R}^n_+$ instead of $\Delta^\circ_{n-1}$. This simplifies
the computations, since $\mathbb{R}^n_+$ has a more symmetric parametrization.

Definition 1. Let 2 ≤ m ≤ n. A (row) stochastic partition matrix (or just row-partition matrix) is a matrix
$Q \in \mathbb{R}^{m\times n}$ of non-negative entries, which satisfies $\sum_{j\in A_{i'}} Q_{ij} = \delta_{ii'}$ for an m block partition $\{A_1, \dots, A_m\}$ of
[n]. The linear map defined by:

$$\mathbb{R}^m_+ \to \mathbb{R}^n_+; \quad p \mapsto p \cdot Q \tag{7}$$

is called a congruent embedding by a Markov mapping of $\mathbb{R}^m_+$ to $\mathbb{R}^n_+$ or just a Markov map, for short.

An example of a 3 × 5 row-partition matrix is:

$$Q = \begin{pmatrix} 1/2 & 0 & 1/2 & 0 & 0 \\ 0 & 1/3 & 0 & 2/3 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{8}$$

Markov maps preserve the 1-norm and restrict to embeddings $\Delta^\circ_{m-1} \to \Delta^\circ_{n-1}$.
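As a quick check of these properties, the following sketch applies the Markov map defined by the matrix Q of Equation (8) to a point of the simplex and to two tangent vectors, and compares the Fisher metric of Equation (2) before and after. The point p, the vectors u and v and the function names are arbitrary illustrative choices; exact rational arithmetic is used so that the comparison is exact.

```python
from fractions import Fraction as F

# Row-partition matrix of Equation (8), with blocks {1, 3}, {2, 4}, {5}.
Q = [[F(1, 2), F(0), F(1, 2), F(0), F(0)],
     [F(0), F(1, 3), F(0), F(2, 3), F(0)],
     [F(0), F(0), F(0), F(0), F(1)]]

def markov_map(p, Q):
    """Row vector times matrix, p -> p . Q, as in Equation (7)."""
    return [sum(p[i] * Q[i][j] for i in range(len(p))) for j in range(len(Q[0]))]

def fisher(p, u, v):
    """Fisher metric of Equation (2)."""
    return sum(ui * vi / pi for ui, vi, pi in zip(u, v, p))

p = [F(1, 5), F(3, 10), F(1, 2)]   # a point of the open simplex
u = [F(1), F(-2), F(1)]            # tangent vectors: components sum to zero
v = [F(0), F(1), F(-1)]

q = markov_map(p, Q)
# The map is linear, so tangent vectors are pushed forward by the same matrix.
fu, fv = markov_map(u, Q), markov_map(v, Q)

same_norm = sum(q) == sum(p)
same_metric = fisher(q, fu, fv) == fisher(p, u, v)
```

Both `same_norm` and `same_metric` evaluate to `True` for this example, which is the isometry property that Chentsov's characterization is built on.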

Theorem 1 (Chentsov’s Theorem.).

• Let $g^{(m)}$ be a Riemannian metric on $\Delta^\circ_{m-1}$ for m ∈ {2, 3, . . .}. Let this sequence of metrics have the
property that every congruent embedding by a Markov mapping is an isometry. Then, there is a constant
C > 0 that satisfies:

$$g^{(m)}_p(u, v) = C \sum_i \frac{u_i v_i}{p_i}. \tag{9}$$

• Conversely, for any C > 0, the metrics given by Equation (9) define a sequence of Riemannian metrics
under which every congruent embedding by a Markov mapping is an isometry.

The main result in Campbell’s work [5] is the following variant of Chentsov’s Theorem.

Theorem 2 (Campbell’s Theorem.).

• Let $g^{(m)}$ be a Riemannian metric on $\mathbb{R}^m_+$ for m ∈ {2, 3, . . .}. Let this sequence of metrics have the
property that every embedding by a Markov mapping is an isometry. Then:

$$g^{(m)}_p(\partial_i, \partial_j) = A(|p|) + \delta_{ij}\, C(|p|)\, \frac{|p|}{p_i}, \tag{10}$$

where $|p| = \sum_{i=1}^{m} p_i$, $\delta_{ij}$ is the Kronecker delta, and A and C are $C^\infty$ functions on $\mathbb{R}_+$ satisfying
C(α) > 0 and A(α) + C(α) > 0 for all α > 0.
• Conversely, if A and C are $C^\infty$ functions on $\mathbb{R}_+$ satisfying C(α) > 0, A(α) + C(α) > 0 for all α > 0,
then Equation (10) defines a sequence of Riemannian metrics under which every embedding by a Markov
mapping is an isometry.

The metrics from Campbell’s Theorem also define metrics on the probability simplices $\Delta^\circ_{m-1}$ for
m = 2, 3, . . .. Since the tangent vectors $v = \sum_i v_i \partial_i \in T_p\Delta^\circ_{m-1}$ satisfy $\sum_i v_i = 0$, for any two vectors
$u, v \in T_p\Delta^\circ_{m-1}$, we also have $\sum_i \sum_j A\, u_i v_j = 0$ for any A. In this case, the choice of A is immaterial, and the
metric becomes Chentsov’s metric.
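The role of A can be made concrete with a small computation. The sketch below evaluates the bilinear form obtained from Equation (10) for two hypothetical choices of A and one hypothetical choice of C (none of them taken from [5]). On vectors tangent to the simplex, the two choices of A give the same value, while on a vector that changes the total mass they do not.

```python
from fractions import Fraction as F

def campbell(p, u, v, A, C):
    """Bilinear form from Equation (10): sum_ij u_i v_j g_p(d_i, d_j)."""
    s = sum(p)  # |p|
    return A(s) * sum(u) * sum(v) + C(s) * s * sum(
        ui * vi / pi for ui, vi, pi in zip(u, v, p))

def C(s):
    """Hypothetical positive function C."""
    return 1 + s

def A1(s):
    """First hypothetical choice of A."""
    return s * s

def A2(s):
    """Second hypothetical choice of A."""
    return 5 - s

p = [F(1, 5), F(3, 10), F(1, 2)]   # |p| = 1, a point of the open simplex
u = [F(1), F(-2), F(1)]            # sums to zero: tangent to the simplex
v = [F(0), F(1), F(-1)]            # sums to zero: tangent to the simplex
w = [F(1), F(1), F(0)]             # does not sum to zero: changes |p|

tangent_values = (campbell(p, u, v, A1, C), campbell(p, u, v, A2, C))
transverse_values = (campbell(p, w, w, A1, C), campbell(p, w, w, A2, C))
```

Here the two entries of `tangent_values` coincide and equal C(1) times the Fisher metric of Equation (2), whereas the two entries of `transverse_values` differ.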

Remark 1. Observe that Chentsov’s Theorem is not a direct implication of Campbell’s Theorem. However, it
can be deduced from it by the following arguments. Suppose that we have a family of Riemannian simplices
$(\Delta^\circ_{m-1}, g^{(m)})$ for m ∈ {2, 3, . . .}, and suppose that they are isometric with respect to Markov maps. If we can
extend every $g^{(m)}$ to a Riemannian metric $\tilde{g}^{(m)}$ on $\mathbb{R}^m_+$ in such a way that the resulting spaces $(\mathbb{R}^m_+, \tilde{g}^{(m)})$ are
still isometric with respect to Markov maps, then Campbell’s Theorem implies that $g^{(m)}$ is a multiple of the
Fisher metric. Such metric extensions can be defined as follows. Consider the diffeomorphism:

$$\Delta^\circ_{m-1} \times \mathbb{R}_+ \cong \mathbb{R}^m_+, \quad (p, r) \mapsto r \cdot p. \tag{11}$$

Any tangent vector $u \in T_{(p,r)}\mathbb{R}^m_+$ can be written uniquely as $u = u_p + u_r \partial_r$, where $u_p$ is tangent to $r\Delta^\circ_{m-1}$.
Since each Markov map f preserves the one-norm | · |, its push-forward $f_*$ maps the tangent vector $\partial_r \in T_{(p,r)}\mathbb{R}^m_+$
to the corresponding tangent vector $\partial_r \in T_{f(p,r)}\mathbb{R}^n_+$; that is, $f_*u = f_*u_p + u_r \partial_r$. Therefore,

$$\tilde{g}^{(m)}_{(p,r)}(u, v) := g^{(m)}_p(u_p, v_p) + u_r v_r \tag{12}$$

is a metric on $\mathbb{R}^m_+$ that is invariant under f.

In what follows, we will focus on positive matrices. In order to define a natural Riemannian
metric, we can use the identification $\mathbb{R}^{k\times m}_+ \cong \mathbb{R}^{km}_+$ and apply Campbell’s Theorem. This leads to
metrics of the form:

$$g^{(k,m)}_M(\partial_{ij}, \partial_{kl}) = A(|M|) + \delta_{ik}\delta_{jl}\, \frac{C(|M|)}{M_{ij}}, \tag{13}$$

where $\partial_{ij} = \frac{\partial}{\partial M_{ij}}$ and $|M| = \sum_{ij} M_{ij}$. However, a disadvantage of this approach is that the action of
general Markov maps on $\mathbb{R}^{km}_+$ has no natural interpretation in terms of the matrix structure. Therefore,
Lebanon [14] considered a special class of Markov maps defined as follows.

Definition 2. Consider a k × l row-partition matrix R and a collection of m × n row-partition matrices
$Q = \{Q^{(1)}, \dots, Q^{(k)}\}$. The map:

$$\mathbb{R}^{k\times m}_+ \to \mathbb{R}^{l\times n}_+; \quad M \mapsto R^\top (M \otimes Q) \tag{14}$$

is called a congruent embedding by a Markov morphism of $\mathbb{R}^{k\times m}_+$ to $\mathbb{R}^{l\times n}_+$ in [15]. We will refer to such an
embedding as a Lebanon map. Here, the row product M ⊗ Q is defined by:

$$(M \otimes Q)_{ab} = (M \cdot Q^{(a)})_{ab}, \quad \text{for all } a \in [k],\ b \in [n]; \tag{15}$$

that is, the a-th row of M is multiplied by the matrix $Q^{(a)}$.

In a Lebanon map, each row of the input matrix M is mapped by an individual Markov mapping
$Q^{(i)}$, and each resulting row is copied and scaled by an entry of R. This kind of map preserves the
sum of all matrix entries. Therefore, with the identification $\mathbb{R}^{k\times m}_+ \cong \mathbb{R}^{km}_+$, each Lebanon map restricts
to a map $\Delta^\circ_{mk-1} \to \Delta^\circ_{nl-1}$. The set $\Delta^\circ_{mk-1}$ can be identified with the set of joint distributions of two
random variables. Lebanon maps can be regarded as special Markov maps that incorporate the product
structure present in the set of joint probability distributions of a pair of random variables. In Section 4,
we will give an interpretation of these maps.
Contrary to what is stated in [15], a Lebanon map does not map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$, unless k = l.
Therefore, later, we will provide a characterization for the metrics on $(\Delta^k_{m-1})^\circ$ in terms of invariance
under other maps (which are neither Markov nor Lebanon maps).
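The following sketch implements the row product of Equation (15) and the Lebanon map of Equation (14) for one small example with k = 2, m = 2, l = 3, n = 3. The matrices M, R and Q⁽¹⁾, Q⁽²⁾ and all function names are hypothetical choices made for illustration. The example shows that the sum of all entries is preserved, while the rows of the image of a stochastic matrix no longer sum to one.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(column) for column in zip(*X)]

def row_product(M, Qs):
    """Row product of Equation (15): row a of M is multiplied by Qs[a]."""
    return [matmul([M[a]], Qs[a])[0] for a in range(len(M))]

def lebanon_map(M, R, Qs):
    """M -> R^T (M (x) Q), as in Equation (14)."""
    return matmul(transpose(R), row_product(M, Qs))

# A 2 x 2 stochastic matrix (hypothetical example).
M = [[F(1, 3), F(2, 3)],
     [F(1, 4), F(3, 4)]]
# A 2 x 3 row-partition matrix with blocks {1, 2} and {3}.
R = [[F(1, 2), F(1, 2), F(0)],
     [F(0), F(0), F(1)]]
# One 2 x 3 row-partition matrix per row of M.
Qs = [[[F(1, 2), F(1, 2), F(0)], [F(0), F(0), F(1)]],
      [[F(1), F(0), F(0)], [F(0), F(1, 4), F(3, 4)]]]

image = lebanon_map(M, R, Qs)
total_before = sum(sum(row) for row in M)
total_after = sum(sum(row) for row in image)
row_sums = [sum(row) for row in image]
```

In this example `total_before` and `total_after` agree, but `row_sums` is (1/2, 1/2, 1), so the image is not a stochastic matrix.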
The main result in Lebanon’s work [15, Theorems 1 and 2] is the following.

Theorem 3 (Lebanon’s Theorem.).

• For each k ≥ 1, m ≥ 2, let $g^{(k,m)}$ be a Riemannian metric on $\mathbb{R}^{k\times m}_+$ in such a way that every Lebanon
map is an isometry. Then:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac}\left( \frac{B(|M|)}{|M_a|} + \delta_{bd}\, \frac{C(|M|)}{M_{ab}} \right) \tag{16}$$

for some differentiable functions $A, B, C \in C^\infty(\mathbb{R}_+)$.
• Conversely, let $\{(\mathbb{R}^{k\times m}_+, g^{(k,m)})\}$ be a sequence of Riemannian manifolds, with metrics $g^{(k,m)}$ of the
form (16) for some $A, B, C \in C^\infty(\mathbb{R}_+)$. Then, every Lebanon map is an isometry.

Lebanon does not study the question under which assumptions on A, B, C ∈ C∞(R+) the
formula (16) does indeed define a Riemannian metric. This question has the following simple answer,
which we will prove in Appendix A:

Proposition 1. The matrix (16) is positive definite if and only if C(|M|) > 0, B(|M|) + C(|M|) > 0 and
A(|M|) + B(|M|) + C(|M|) > 0.

The class of metrics (16) is larger than the class of metrics (13) derived in Campbell’s Theorem.
The reason is that Campbell’s metrics are invariant with respect to a larger class of embeddings.
The special case with A(|M|) = 0, B(|M|) = 0 and C(|M|) = 1 is called the product Fisher metric,

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac}\delta_{bd}\, \frac{1}{M_{ab}}. \tag{17}$$

Furthermore, if we restrict to $(\Delta^k_{m-1})^\circ$, the functions A and B do not play any role. In this case |M| = k,
and we obtain the scaled product Fisher metric:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac}\delta_{bd}\, \frac{C(k)}{M_{ab}}, \tag{18}$$

where $C : \mathbb{N} \to \mathbb{R}_+$ is a positive function. As mentioned before, Lebanon’s Theorem does not give a
characterization of invariant metrics of stochastic matrices, since Lebanon maps do not preserve the
stochasticity of the matrices. However, Lebanon maps are natural maps on the set $\Delta^\circ_{mk-1}$ of positive
joint distributions. In the same way as Chentsov’s Theorem can be derived from Campbell’s Theorem
(see Remark 1), we obtain the following corollary:


Corollary 1.

• Let $\{(\Delta^\circ_{km-1}, g^{(k,m)}) : k \ge 1, m \ge 2\}$ be a double sequence of Riemannian manifolds with the property
that every Lebanon map is an isometry. Then:

$$g^{(k,m)}_P(u, v) = B \sum_a \sum_{b,c} \frac{u_{ab} v_{ac}}{|P_a|} + C \sum_a \sum_b \frac{u_{ab} v_{ab}}{P_{ab}}, \quad \text{for each } P \in \Delta^\circ_{km-1}, \tag{19}$$

for some constants B, C ∈ R with C > 0 and B + C > 0, where $|P_a| = \sum_b P_{ab}$.
• Conversely, let $\{(\Delta^\circ_{km-1}, g^{(k,m)})\}$ be a sequence of Riemannian manifolds with metrics $g^{(k,m)}$ of the form
of Equation (19) for some B, C ∈ R. Then, every Lebanon map is an isometry.

Observe that these metrics agree with (a multiple of) the Fisher metric only if B = 0. The case B = 0
can also be characterized. Note that Lebanon maps do not treat the two random variables symmetrically.
Switching the two random variables corresponds to transposing the joint distribution matrix P. When
exchanging the role of the two random variables, the Lebanon map becomes $P \mapsto (P^\top \otimes Q)^\top R$. We
call such a map a dual Lebanon map. If we require invariance under both Lebanon maps and their
duals in Theorem 3 or Corollary 1, the statements remain true with the additional restriction that B = 0
(as a function or constant, respectively).

4. Invariance Metric Characterizations for Conditional Polytopes

According to Chentsov’s Theorem (Theorem 1), a natural metric on the probability simplex can
be characterized by requiring the isometry of natural embeddings. Lebanon follows this axiomatic
approach to characterize metrics on products of positive measures (Theorem 3). However, the maps
considered by Lebanon dissolve the row-normalization of conditional distributions. In general, they
do not map conditional polytopes to conditional polytopes. Therefore, we will consider a slight
modification of Lebanon maps, in order to obtain maps between conditional polytopes.

4.1. Stochastic Embeddings of Conditional Polytopes

A matrix of conditional distributions P(Y|X) in $\Delta^k_{m-1}$ can be regarded as the equivalence class
of all joint probability distributions $P(X, Y) \in \Delta_{km-1}$ with conditional distribution P(Y|X). Which
Markov maps of probability simplices are compatible with this equivalence relation? The most obvious
examples are permutations (relabelings) of the state spaces of X and Y.
In information theory, stochastic matrices are also viewed as channels. For any distribution of X,
the stochastic matrix gives us a joint distribution of the pair (X, Y) and, hence, a marginal distribution
of Y. If we input a distribution of X into the channel, the stochastic matrix determines what the
distribution of the output Y will be.
Channels can be combined, provided the cardinalities of the state spaces fit together. If we
take the output Y of the first channel P(Y|X) and feed it into another channel P(Y′|Y) then we
obtain a combined channel P(Y′|X). The composition of channels corresponds to ordinary matrix
multiplication. If the first channel is described by the stochastic matrix K and the second channel by Q,
then the combined channel is described by K · Q. Observe that in this case, the joint distribution P
(considered as a normalized matrix P ∈ Δkm−1) is transformed similarly; that is, the joint distribution
of the pair (X, Y′) is given by P · Q.
More general maps result from compositions where the choice of the second channel depends on
the input of the first channel. In other words, we have a first channel that takes as input X and gives
as output Y, and we have another channel that takes as input (X, Y) and gives as output Y′; we are
interested in the resulting channel from X to Y′. The second channel can be described by a collection
of stochastic matrices Q = {Q(i)}i. If K describes the first channel, then the combined channel is
described by the row product K ⊗ Q (see Definition 2). Again, the joint distribution of (X, Y′) arises in
a similar way as P ⊗ Q.
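The two kinds of channel composition described above can be checked on a small example. In the sketch below, the input distribution, the channels K and Q, the collection {Q⁽ⁱ⁾} and the function names are hypothetical. The code verifies that transforming the joint distribution by P · Q (or P ⊗ Q) and then conditioning on X gives the same stochastic matrix as transforming the conditional distribution directly by K · Q (or K ⊗ Q).

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def row_product(M, Qs):
    """Row product of Definition 2: row a of M is multiplied by Qs[a]."""
    return [matmul([M[a]], Qs[a])[0] for a in range(len(M))]

def joint(p_x, K):
    """Joint distribution of (X, Y) from the input distribution and the channel K."""
    return [[p_x[a] * K[a][b] for b in range(len(K[0]))] for a in range(len(K))]

def conditional(P):
    """Conditional distribution of the second variable given the first."""
    return [[entry / sum(row) for entry in row] for row in P]

p_x = [F(2, 5), F(3, 5)]                       # distribution of X
K = [[F(1, 3), F(2, 3)],                       # channel from X to Y
     [F(1, 4), F(3, 4)]]
Q = [[F(1, 2), F(1, 4), F(1, 4)],              # channel from Y to Y'
     [F(0), F(1, 3), F(2, 3)]]
Qs = [Q,                                       # second channel used when X = 1
      [[F(1), F(0), F(0)],                     # second channel used when X = 2
       [F(1, 5), F(2, 5), F(2, 5)]]]

P = joint(p_x, K)
plain_ok = conditional(matmul(P, Q)) == matmul(K, Q)
row_product_ok = conditional(row_product(P, Qs)) == row_product(K, Qs)
```

Both flags are `True` here: each row of P is a positive multiple of the corresponding row of K, and this scaling is unaffected by multiplying the row with a stochastic matrix.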


We can also consider transformations of the first random variable X. Suppose that we use X as the
input to a channel described by a stochastic matrix R. In this case, the joint distribution of the output
X′ of the channel and Y is described by $R^\top P$. However, in general, there is not much that we can say
about the conditional distribution of Y given X′. The result depends in an essential way on the original
distribution of X. However, this is not true in the special case that the channel is “not mixing”, that is,
in the case that R is a stochastic partition matrix. In this case, the conditional distribution P(Y|X′) is
described by $\bar{R}^\top K$, where $\bar{R}$ is the corresponding partition indicator matrix, in which all non-zero entries
of R are replaced by one. In other words, each state of X corresponds to several states of X′, and the
corresponding row of K is copied a corresponding number of times.
To sum up, if we combine the transformations due to Q and R, then the joint probability
distribution transforms as $P \mapsto R^\top (P \otimes Q)$ and the conditional transforms as $K \mapsto \bar{R}^\top (K \otimes Q)$.
In particular, for the joint distribution, we obtain the definition of a Lebanon map. Figure 1 illustrates
the situation.

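[Figure 1: diagram relating the variables X, Y, X′ and Y′, with the channels K (from X to Y) and K′ (from X′ to Y′) and the maps R and Q. Joint distributions: $P' = R^\top (P \otimes Q)$. Conditional distributions: $K' = \bar{R}^\top (K \otimes Q)$.]

Figure 1. An interpretation for Lebanon maps and conditional embeddings. The variable X′ is
computed from X by R, and Y′ is computed from X and Y by Q.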

Finally, we will also consider the special case where the partition of R (and $\bar{R}$) is homogeneous,
i.e., such that all blocks have the same size. For example, this describes the case where there is a third
random variable Z that is independent of Y given X. In this case, the conditional distribution satisfies
P(Y|X) = P(Y|X, Z), and R describes the conditional distribution of (X, Z) given X.

Definition 3. A (row) partition indicator matrix is a matrix $\bar{R} \in \{0, 1\}^{k\times l}$ that satisfies:

$$\bar{R}_{ij} = \begin{cases} 1, & \text{if } j \in A_i, \\ 0, & \text{else}, \end{cases} \tag{20}$$

for a k block partition $\{A_1, \dots, A_k\}$ of [l].

For example, the 3 × 5 partition indicator matrix corresponding to Equation (8) is:

$$\bar{R} = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{21}$$

Definition 4. Consider a k × l partition indicator matrix $\bar{R}$ and a collection of m × n stochastic partition
matrices $Q = \{Q^{(i)}\}_{i=1}^{k}$. We call the map:

$$f : \mathbb{R}^{k\times m}_+ \to \mathbb{R}^{l\times n}_+; \quad M \mapsto \bar{R}^\top (M \otimes Q) \tag{22}$$

a conditional embedding of $\mathbb{R}^{k\times m}_+$ in $\mathbb{R}^{l\times n}_+$. We denote the set of all such maps by $\hat{\mathcal{F}}^{l,n}_{k,m}$. If $\bar{R}$ is the partition
indicator matrix of a homogeneous partition (with partition blocks of equal cardinality), then we call f a
homogeneous conditional embedding. We denote the set of all such homogeneous conditional embeddings by
$\mathcal{F}^{l,n}_{k,m}$ and assume that l is a multiple of k.


Conditional embeddings preserve the 1-norm of the matrix rows; that is, the elements of $\hat{\mathcal{F}}^{l,n}_{k,m}$
map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$. On the other hand, they do not preserve the 1-norm of the entire matrix.
Conditional embeddings are Markov maps only when k = l, in which case they are also Lebanon
maps.
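For comparison with Lebanon maps, the sketch below applies a conditional embedding in the sense of Definition 4 to a 2 × 2 stochastic matrix. The matrices K, the partition indicator matrix and the collection {Q⁽ⁱ⁾}, as well as the function names, are hypothetical examples. Every row of the image sums to one, while the sum of all entries grows from k = 2 to l = 3.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(column) for column in zip(*X)]

def row_product(M, Qs):
    """Row product of Definition 2: row a of M is multiplied by Qs[a]."""
    return [matmul([M[a]], Qs[a])[0] for a in range(len(M))]

def conditional_embedding(M, R_bar, Qs):
    """M -> R_bar^T (M (x) Q), with R_bar a partition indicator matrix."""
    return matmul(transpose(R_bar), row_product(M, Qs))

K = [[F(1, 3), F(2, 3)],
     [F(1, 4), F(3, 4)]]
# 2 x 3 partition indicator matrix with blocks {1, 2} and {3}.
R_bar = [[1, 1, 0],
         [0, 0, 1]]
Qs = [[[F(1, 2), F(1, 2), F(0)], [F(0), F(0), F(1)]],
      [[F(1), F(0), F(0)], [F(0), F(1, 4), F(3, 4)]]]

image = conditional_embedding(K, R_bar, Qs)
row_sums = [sum(row) for row in image]
total = sum(row_sums)
```

The first row of K, transformed by Q⁽¹⁾, appears twice in `image`, and the second row, transformed by Q⁽²⁾, appears once.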

4.2. Invariance Characterization

Considering the conditional embeddings discussed in the previous section, we obtain the
following metric characterization.

Theorem 4.

• Let $g^{(k,m)}$ denote a metric on $\mathbb{R}^{k\times m}_+$ for each k ≥ 1 and m ≥ 2. If every homogeneous conditional
embedding $f \in \mathcal{F}^{l,n}_{k,m}$ is an isometry with respect to these metrics, then:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac}\left( k\, \frac{B}{k^2} + \delta_{bd}\, \frac{|M|}{M_{ab}}\, \frac{C}{k^2} \right), \quad \text{for all } M \in \mathbb{R}^{k\times m}_+, \tag{23}$$

for some constants A, B, C ∈ R, where $\partial_{ab} = \frac{\partial}{\partial M_{ab}}$ and $|M| = \sum_{ab} M_{ab}$.
• Conversely, given the metrics defined by Equation (23) for any non-degenerate choice of constants
A, B, C ∈ R, each homogeneous conditional embedding $f \in \mathcal{F}^{l,n}_{k,m}$, k ≤ l, m ≤ n, is an isometry.
• Moreover, the tensors $g^{(k,m)}$ from Equation (23) are positive-definite for all k ≥ 1 and m ≥ 2 if and only
if C > 0, B + C > 0 and A + B + C > 0.

The proof of Theorem 4 is similar to the proof of the Theorems of Chentsov, Campbell and
Lebanon. Due to its technical nature, we defer it to Appendix B.
Now, for the restriction of the metric $g^{(k,m)}$ to $(\Delta^k_{m-1})^\circ$, we have the following. In this case,
|M| = k. Since tangent vectors $v = \sum_{ab} v_{ab} \partial_{ab} \in T_M(\Delta^k_{m-1})^\circ$ satisfy $\sum_b v_{ab} = 0$ for all a, the constants
A and B become immaterial, and the metric can be written as:

$$g^{(k,m)}_M(u, v) = \sum_{ab} \frac{|M|\, u_{ab} v_{ab}}{M_{ab}}\, \frac{C}{k^2} = \sum_{ab} \frac{u_{ab} v_{ab}}{M_{ab}}\, \frac{C}{k}, \quad \text{for all } u, v \in T_M(\Delta^k_{m-1})^\circ. \tag{24}$$

This metric is a specialization of the metric (18) derived by Lebanon (Theorem 3).
The statement of Theorem 4 becomes false if we consider general conditional embeddings instead
of homogeneous ones:

Theorem 5. There is no family of metrics $g^{(k,m)}$ on $\mathbb{R}^{k\times m}_+$ (or on $(\Delta^k_{m-1})^\circ$) for each k ≥ 1 and m ≥ 2, for
which every conditional embedding $f \in \hat{\mathcal{F}}^{l,n}_{k,m}$ is an isometry.

This negative result will become clearer from the perspective of Section 6: as we will show in
Theorem 7, although there are no metrics that are invariant under all conditional embeddings, there are
families of metrics (depending on a parameter, ρ) that transform covariantly (that is, in a well-defined
manner) with respect to the conditional embeddings. We defer the proof of Theorem 5 to Appendix B.
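The difference between homogeneous and general conditional embeddings can be seen in a small computation with the metric of Equation (24). The sketch below uses hypothetical matrices and function names and takes C = 1. It compares the squared length of a tangent vector before and after a homogeneous embedding (blocks of sizes 2 and 2) and a non-homogeneous one (blocks of sizes 3 and 1). It only shows that this particular metric is not preserved in the second case; it does not replace the proof of Theorem 5.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(column) for column in zip(*X)]

def row_product(M, Qs):
    """Row product of Definition 2: row a of M is multiplied by Qs[a]."""
    return [matmul([M[a]], Qs[a])[0] for a in range(len(M))]

def embed(M, R_bar, Qs):
    """Conditional embedding M -> R_bar^T (M (x) Q) of Definition 4."""
    return matmul(transpose(R_bar), row_product(M, Qs))

def metric(K, u, v, C=1):
    """Metric of Equation (24) at a stochastic matrix K with k = len(K) rows."""
    k = len(K)
    return F(C, k) * sum(u[a][b] * v[a][b] / K[a][b]
                         for a in range(k) for b in range(len(K[0])))

K = [[F(1, 3), F(2, 3)], [F(1, 4), F(3, 4)]]
u = [[F(1), F(-1)], [F(2), F(-2)]]             # rows sum to zero: tangent vector
Qs = [[[F(1, 2), F(1, 2), F(0)], [F(0), F(0), F(1)]],
      [[F(1), F(0), F(0)], [F(0), F(1, 4), F(3, 4)]]]
R_hom = [[1, 1, 0, 0], [0, 0, 1, 1]]           # homogeneous partition of [4]
R_inh = [[1, 1, 1, 0], [0, 0, 0, 1]]           # non-homogeneous partition of [4]

# The embeddings are linear, so tangent vectors are pushed forward by the same map.
g_before = metric(K, u, u)
g_hom = metric(embed(K, R_hom, Qs), embed(u, R_hom, Qs), embed(u, R_hom, Qs))
g_inh = metric(embed(K, R_inh, Qs), embed(u, R_inh, Qs), embed(u, R_inh, Qs))
```

In this example `g_hom` equals `g_before`, whereas `g_inh` does not: copying the first row three times and the second row once weights the two rows differently, which a single factor C/l cannot compensate.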

5. The Fisher Metric on Polytopes and Point Configurations

In the previous section, we obtained distinguished Riemannian metrics on $\mathbb{R}^{k\times m}_+$ and $(\Delta^k_{m-1})^\circ$ by
postulating invariance under natural maps. In this section, we take another viewpoint based on general
considerations about Riemannian metrics on arbitrary polytopes. This is achieved by embedding each
polytope in a probability simplex as an exponential family. We first recall the necessary background.
In Section 5.2, we then present our general results, and in Section 5.3, we discuss the special case of
conditional polytopes.


5.1. Exponential Families and Polytopes

Let $\mathcal{X}$ be a finite set and $A \in \mathbb{R}^{d\times\mathcal{X}}$ a matrix with columns $a_x$ indexed by $x \in \mathcal{X}$. It will be
convenient to consider the rows $A_i$, i ∈ [d], of A as functions $A_i : \mathcal{X} \to \mathbb{R}$. Finally, let $\nu : \mathcal{X} \to \mathbb{R}_+$. The
exponential family $\mathcal{E}_{A,\nu}$ is the set of probability distributions on $\mathcal{X}$ given by:

$$p(x; \theta) = \exp\big(\theta^\top a_x + \log(\nu(x)) - \log(Z(\theta))\big), \quad \text{for all } x \in \mathcal{X}, \text{ for all } \theta \in \mathbb{R}^d, \tag{25}$$

with the normalization function $Z(\theta) = \sum_{x'\in\mathcal{X}} \exp(\theta^\top a_{x'} + \log(\nu(x')))$. The functions $A_i$ are called
the observables and ν the reference measure of the exponential family. When the reference measure ν
is constant, ν(x) = 1 for all $x \in \mathcal{X}$, we omit the subscript and write $\mathcal{E}_A$.
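A direct calculation shows that the Fisher information matrix of $\mathcal{E}_{A,\nu}$ at a point $\theta \in \mathbb{R}^d$ has
coordinates:

$$g^{\mathcal{E}_{A,\nu}}_\theta(\partial_{\theta_i}, \partial_{\theta_j}) = \operatorname{cov}_\theta(A_i, A_j), \quad \text{for all } i, j \in [d]. \tag{26}$$

Here, $\operatorname{cov}_\theta$ denotes the covariance computed with respect to the probability distribution p(·; θ).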
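Equation (26) can be checked numerically by comparing the score-based expression of Equation (3) with the covariance of the observables. The sketch below does this for a hypothetical exponential family with four states, two observables and a non-constant reference measure. The scores are approximated by central finite differences, so the two matrices agree only up to a small numerical error.

```python
import math

# Hypothetical example: 4 states, 2 observables (rows of A), reference measure nu.
A = [[0.0, 1.0, 2.0, 3.0],
     [0.0, 1.0, 4.0, 9.0]]
nu = [1.0, 3.0, 3.0, 1.0]

def log_p(theta):
    """log p(x; theta) for all states x, following Equation (25)."""
    e = [sum(t * row[x] for t, row in zip(theta, A)) + math.log(nu[x])
         for x in range(len(nu))]
    log_z = math.log(sum(math.exp(value) for value in e))
    return [value - log_z for value in e]

def fisher_from_scores(theta, h=1e-5):
    """Equation (3), with scores approximated by central differences of step h."""
    d, n = len(theta), len(nu)
    p = [math.exp(value) for value in log_p(theta)]
    scores = []
    for i in range(d):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        lu, ld = log_p(up), log_p(down)
        scores.append([(lu[x] - ld[x]) / (2 * h) for x in range(n)])
    return [[sum(p[x] * scores[i][x] * scores[j][x] for x in range(n))
             for j in range(d)] for i in range(d)]

def covariance(theta):
    """cov_theta(A_i, A_j), the right-hand side of Equation (26)."""
    p = [math.exp(value) for value in log_p(theta)]
    mean = [sum(px * ax for px, ax in zip(p, row)) for row in A]
    return [[sum(p[x] * (A[i][x] - mean[i]) * (A[j][x] - mean[j])
                 for x in range(len(nu)))
             for j in range(len(A))] for i in range(len(A))]

theta = [0.2, -0.1]               # arbitrary illustrative parameter values
G = fisher_from_scores(theta)
S = covariance(theta)
max_gap = max(abs(G[i][j] - S[i][j]) for i in range(2) for j in range(2))
```

The value of `max_gap` is at the level of the finite-difference error, far below the size of the entries of the two matrices.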
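The convex support of $\mathcal{E}_{A,\nu}$ is defined as:

$$\operatorname{conv} A := \operatorname{conv}\{a_x : x \in \mathcal{X}\} = \big\{ \mathbb{E}_p[A] : p \in \Delta_{|\mathcal{X}|-1} \big\} = \big\{ \mathbb{E}_p[A] : p \in \overline{\mathcal{E}_{A,\nu}} \big\}, \tag{27}$$

where conv S is the set of all convex combinations of points in S. The moment map $\mu : p \in \Delta_{n-1} \mapsto A \cdot p \in \mathbb{R}^d$,
with $n = |\mathcal{X}|$, restricts to a homeomorphism $\overline{\mathcal{E}_{A,\nu}} \to \operatorname{conv} A$; see [16]. Here, $\overline{\mathcal{E}_{A,\nu}}$ denotes the Euclidean
closure of $\mathcal{E}_{A,\nu}$. The inverse of μ will be denoted by $\mu^{-1} : \operatorname{conv} A \to \overline{\mathcal{E}_{A,\nu}} \subseteq \Delta_{n-1}$. This gives a natural
embedding of the polytope conv A in the probability simplex $\Delta_{|\mathcal{X}|-1}$. Note that the convex support is
independent of the reference measure ν. See [17] for more details.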

5.2. Invariance Fisher Metric Characterizations for Polytopes

Let $P \subseteq \mathbb{R}^d$ be a polytope with n vertices $a_1, \dots, a_n$. Let $A = (a_1, \dots, a_n)$ be the matrix with
columns $a_i \in \mathbb{R}^d$ for all i ∈ [n]. Then, $\mathcal{E}_A \subseteq \Delta^\circ_{n-1}$ is an exponential family with convex support P. We
will also denote this exponential family by $\mathcal{E}_P$. We can use the inverse of the moment map, $\mu^{-1}$, to pull
back geometric structures on $\Delta^\circ_{n-1}$ to the relative interior $P^\circ$ of P.

Definition 5. The Fisher metric on $P^\circ$ is the pull-back of the Fisher metric on $\mathcal{E}_A \subseteq \Delta^\circ_{n-1}$ by $\mu^{-1}$.

Some obvious questions are: Why is this a natural construction? Which maps between polytopes
are isometries between their Fisher metrics? Can we find a characterization of Chentsov type for
this metric?
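Affine maps are natural maps between polytopes. However, in order to obtain isometries, we
need to impose some additional constraints. Consider two polytopes $P \subseteq \mathbb{R}^d$, $P' \subseteq \mathbb{R}^{d'}$ and an affine
map $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ that satisfies $\phi(P) \subseteq P'$. A natural condition in the context of exponential families is
that φ restricts to a bijection between the set vert(P) of vertices of P and the set vert(P′) of vertices of P′.
In this case, $\mathcal{E}_{P'} \subseteq \mathcal{E}_P \subseteq \Delta^\circ_{n-1}$. Moreover, the moment map μ′ of P′ factorizes through the moment
map μ of P: $\mu' = \phi \circ \mu$. Let $\phi^{-1} = \mu \circ \mu'^{-1}$. Then, the following diagram commutes:

$$\begin{array}{ccccc}
P^\circ & \xrightarrow{\ \mu^{-1}\ } & \mathcal{E}_P & \subseteq & \Delta^\circ_{n-1} \\
\big\uparrow{\scriptstyle \phi^{-1}} & & \cup & & \\
P'^\circ & \xrightarrow{\ \mu'^{-1}\ } & \mathcal{E}_{P'} & &
\end{array} \tag{28}$$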


It follows that $\phi^{-1}$ is an isometry from $P'^\circ$ to its image in $P^\circ$. Observe that the inverse moment map
itself arises in this way: In the diagram (28), if P is equal to $\Delta_{n-1}$, then the upper moment map $\mu^{-1}$ is
the identity map, and $\phi^{-1}$ equals the inverse moment map $\mu'^{-1}$ of P′.
The constraint of mapping vertices to vertices bijectively is very restrictive. In order to consider
a larger class of affine maps, we need to generalize our construction from polytopes to weighted
point configurations.

Definition 6. A weighted point configuration is a pair (A, ν) consisting of a matrix $A \in \mathbb{R}^{d\times n}$ with columns
$a_1, \dots, a_n$ and a positive weight function $\nu : \{1, \dots, n\} \to \mathbb{R}_+$ assigning a weight to each column $a_i$. The pair
(A, ν) defines the exponential family $\mathcal{E}_{A,\nu}$.
The (A, ν)-Fisher metric on $(\operatorname{conv} A)^\circ$ is the pull-back of the Fisher metric on $\Delta^\circ_{n-1}$ through the inverse
of the moment map.

We recover Definition 5 as follows. For a polytope P, let A be the point configuration consisting
of the vertices of P. Moreover, let ν be a constant function. Then, $\mathcal{E}_P = \mathcal{E}_{A,\nu}$, and the two definitions of
the Fisher metric on $P^\circ$ coincide.
The following are natural maps between weighted point configurations:

Definition 7. Let (A, ν), (A′, ν′) be two weighted point configurations with $A = (a_i)_i \in \mathbb{R}^{d\times n}$ and
$A' = (a'_j)_j \in \mathbb{R}^{d'\times n'}$. A morphism (A, ν) → (A′, ν′) is a pair (φ, σ) consisting of an affine map $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ and a
surjective map $\sigma : \{1, \dots, n\} \to \{1, \dots, n'\}$ with $\phi(a_i) = a'_{\sigma(i)}$ and $\nu'(a'_j) = \alpha \sum_{i : \sigma(i) = j} \nu(a_i)$, where α > 0 is
a constant that does not depend on j.

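Consider a morphism (φ, σ) : (A, ν) → (A′, ν′). For each j ∈ [n′], let $A_j = \{i : \phi(a_i) = a'_j\}$. Then,
$(A_1, \dots, A_{n'})$ is a partition of [n]. Define a matrix $Q \in \mathbb{R}^{n'\times n}$ by:

$$Q_{ji} = \begin{cases} \dfrac{\nu(i)}{\sum_{i'\in A_j} \nu(i')}, & \text{if } i \in A_j, \\[2ex] 0, & \text{else}. \end{cases} \tag{29}$$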

Then, Q is a Markov mapping, and the following diagram commutes:

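$$\begin{array}{ccccc}
(\operatorname{conv} A)^\circ & \xrightarrow{\ \mu^{-1}\ } & \mathcal{E}_{A,\nu} & \subseteq & \Delta^\circ_{n-1} \\
\big\uparrow{\scriptstyle \phi^{-1}} & & & & \big\uparrow{\scriptstyle Q} \\
(\operatorname{conv} A')^\circ & \xrightarrow{\ \mu'^{-1}\ } & \mathcal{E}_{A',\nu'} & \subseteq & \Delta^\circ_{n'-1}
\end{array} \tag{30}$$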

By Chentsov’s Theorem (Theorem 1), Q is an isometric embedding. It follows that φ−1 also induces an
isometric embedding. This shows the first part of the following Theorem:

Theorem 6.

• Let (φ, σ) : (A, ν) → (A′, ν′) be a morphism of weighted point configurations. Then,
$\phi^{-1} : (\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$ is an isometric embedding with respect to the Fisher metrics on $(\operatorname{conv} A)^\circ$
and $(\operatorname{conv} A')^\circ$.
• Let $g_{A,\nu}$ be a Riemannian metric on $(\operatorname{conv} A)^\circ$ for each weighted point configuration (A, ν). If every
morphism (φ, σ) : (A, ν) → (A′, ν′) of weighted point configurations induces an isometric embedding
$\phi^{-1} : (\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$, then there exists a constant $\alpha \in \mathbb{R}_+$ such that $g_{A,\nu}$ is equal to α times
the (A, ν)-Fisher metric.


Proof. The first statement follows from the discussion before the Theorem. For the second statement,
we show that under the given assumptions, all Markov maps are isometric embeddings. By Chentsov’s
Theorem (Theorem 1), this implies that the metrics $g_P$ agree with the Fisher metric whenever P is
a simplex. The statement then follows from the two facts that the metric on $P^\circ$ or $(\operatorname{conv} A)^\circ$ is the
pull-back of the Fisher metric through the inverse of the moment map and that $\mu^{-1}$ is itself a morphism.
Observe that $\Delta_{n-1} = \operatorname{conv} I_n = \operatorname{conv}\{e_1, \dots, e_n\}$ is a polytope, and $\Delta^\circ_{n-1}$ is the corresponding
exponential family. Consider a Markov embedding $Q : \Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$, $p \mapsto p \cdot Q$. Let $\nu(i) = \sum_j Q_{ji}$
be the value of the unique non-zero entry of Q in the i-th column. This defines a morphism and an
embedding as follows:
Let A be the matrix that arises from Q by replacing each non-zero entry by one. We define φ
as the linear map represented by the matrix A, and define σ : [n] → [n′] by σ(j) = i if and only
if $a_j = e_i$, that is, σ(j) indicates the row i in which the j-th column of A is non-zero. Then, (φ, σ)
is a morphism $(I_n, \nu) \to (I_{n'}, 1)$, and by assumption, the inverse $\phi^{-1}$ is an isometric embedding
$\Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$. However, $\phi^{-1}$ is equal to the Markov map Q. This shows that all Markov maps are
isometric embeddings, and so, by Chentsov’s Theorem, the statement holds true on the simplices.
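The matrix Q of Equation (29) is easy to build from the data of a morphism. The sketch below assumes that the blocks $A_j$ are the fibres of σ (which holds when the points $a'_j$ are pairwise distinct) and uses hypothetical values of σ and ν; the function name is ours. With these values, the construction reproduces the row-partition matrix of Equation (8).

```python
from fractions import Fraction as F

def markov_matrix(sigma, nu, n_prime):
    """Matrix Q of Equation (29).

    sigma[i] is the 0-based index j of the block A_j containing column i,
    and nu[i] is the (integer) weight of column i.
    """
    block_weight = [sum(w for s, w in zip(sigma, nu) if s == j)
                    for j in range(n_prime)]
    return [[F(nu[i], block_weight[j]) if sigma[i] == j else F(0)
             for i in range(len(sigma))] for j in range(n_prime)]

sigma = [0, 1, 0, 1, 2]    # hypothetical surjection from 5 points onto 3 points
nu = [1, 1, 1, 2, 1]       # hypothetical weights of the 5 points
Q = markov_matrix(sigma, nu, 3)
row_sums = [sum(row) for row in Q]
```

Each row of `Q` is supported on one block of the partition and sums to one, as required of a row-partition matrix in Definition 1.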

Theorem 6 defines a natural metric on $(\Delta^k_{m-1})^\circ$ that we want to discuss in more detail next.

5.3. Independence Models and Conditional Polytopes

Consider $k$ random variables with finite state spaces $[n_1], \dots, [n_k]$. The independence model consists of all joint distributions $p \in \Delta_{\prod_{i\in[k]} n_i - 1}$ of these variables that factorize as:

$$p(x_1, \dots, x_k) = \prod_{i\in[k]} p_i(x_i), \quad \text{for all } x_1 \in [n_1], \dots, x_k \in [n_k], \tag{31}$$

where $p_i \in \Delta_{n_i-1}$ for all $i \in [k]$. Assuming fixed $n_1, \dots, n_k$, we denote the independence model by $E_k$. It is the Euclidean closure of an exponential family (with observables of the form $\delta_{i y_i}$). The convex support of $E_k$ is equal to the product of simplices $P_k := \Delta_{n_1-1} \times \cdots \times \Delta_{n_k-1}$. The parametrization (31) corresponds to the inverse of the moment map.
We can write any tangent vector $u \in T_{(p_1,\dots,p_k)}P_k^\circ$ of this open product of simplices as a linear combination $u = \sum_{i\in[k]} \sum_{x_i\in[n_i]} u_{ix_i}\partial_{i,x_i}$, where $\sum_{x_i\in[n_i]} u_{ix_i} = 0$ for all $i \in [k]$. Given two such tangent vectors $u$ and $v$, the Fisher metric is given by:

$$g^{P_k}_{(p_1,\dots,p_k)}(u, v) = \sum_{i\in[k]} \sum_{x_i\in[n_i]} \frac{u_{ix_i} v_{ix_i}}{p_i(x_i)}. \tag{32}$$

Just as the convex support of the independence model is the Cartesian product of probability simplices, the Fisher metric on the independence model is the product metric of the Fisher metrics on the probability simplices of the individual variables. If $n_1 = \cdots = n_k =: n$, then $P_k = \Delta^k_{n-1}$ can be identified with the set of $k \times n$ stochastic matrices.
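
The product structure of the metric on the independence model can be illustrated for two variables. The sketch below pulls the Fisher metric of the joint simplex back through the factorization p(x, y) = p1(x) p2(y) and compares it with the sum of the Fisher metrics of the factors. The numerical values and the function names are assumptions chosen only for illustration.

```python
def fisher(p, u, v):
    """Fisher metric on the open simplex: sum_i u_i v_i / p_i."""
    return sum(ui * vi / pi for pi, ui, vi in zip(p, u, v))

def product_distribution(p1, p2):
    """Joint distribution p(x, y) = p1(x) p2(y), flattened row by row."""
    return [a * b for a in p1 for b in p2]

def push_forward(p1, p2, u1, u2):
    """Derivative of (p1, p2) -> p1 (x) p2 in the direction (u1, u2)."""
    return [ux * py + px * uy
            for px, ux in zip(p1, u1)
            for py, uy in zip(p2, u2)]

p1, p2 = [0.2, 0.8], [0.5, 0.3, 0.2]
u1, u2 = [0.1, -0.1], [0.2, -0.05, -0.15]   # each component sums to zero
v1, v2 = [-0.3, 0.3], [0.0, 0.1, -0.1]

joint = product_distribution(p1, p2)
pulled_back = fisher(joint,
                     push_forward(p1, p2, u1, u2),
                     push_forward(p1, p2, v1, v2))
product_metric = fisher(p1, u1, v1) + fisher(p2, u2, v2)
print(pulled_back, product_metric)  # the two values agree up to rounding
```

The cross terms vanish because the components of tangent vectors sum to zero in each factor, which is the reason the pulled-back metric splits into the two summands of Equation (32).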
The Fisher metric on the product of simplices is equal to the product of the Fisher metrics on the factors. More generally, if $P = Q_1 \times Q_2$ is a Cartesian product, then the Fisher metric on $P^\circ$ is equal to the product of the Fisher metrics on $Q_1^\circ$ and $Q_2^\circ$. In fact, in this case, the inverse of the moment map of $P$ can be expressed in terms of the two moment map inverses $\mu_1^{-1} : Q_1 \to E_{Q_1} \subseteq \Delta_{m_1-1}$ and $\mu_2^{-1} : Q_2 \to E_{Q_2} \subseteq \Delta_{m_2-1}$ and the moment map $\tilde\mu$ of the independence model $\Delta_{m_1-1} \times \Delta_{m_2-1}$, by:

$$\mu^{-1}(q_1, q_2) = \tilde\mu^{-1}\big(\mu_1^{-1}(q_1), \mu_2^{-1}(q_2)\big). \tag{33}$$

Therefore, the pull-back by $\mu^{-1}$ factorizes through the pull-back by $\tilde\mu^{-1}$, and since the independence model carries a product metric, the product of polytopes also carries a product metric.


Let us compare the metric $g^{(k,m)}_K$ from Equation (24) with the Fisher metric $g^{P_k}_{(K_1,\dots,K_k)}$ from Equation (32) on the product of simplices $P^\circ = (\Delta^k_{m-1})^\circ$. In both cases, the metric is a product metric; that is, it has the form:

$$g = g_1 + \cdots + g_k, \tag{34}$$

where $g_i$ is a metric on the $i$-th factor $\Delta^\circ_{m-1}$. For $g^{\Delta^k_{m-1}}_K$, $g_i$ is equal to the Fisher metric on $\Delta^\circ_{m-1}$. However, for $g^{(k,m)}_K$, $g_i$ is equal to $1/k$ times the Fisher metric on $\Delta^\circ_{m-1}$. Since this factor only depends on $k$, it only plays a role if stochastic matrices of different sizes are compared. The additional factor of $1/k$ can be interpreted as the uniform distribution on $k$ elements. This is related to another more general class of Riemannian metrics that are used in applications; namely, given a function $K \in \Delta^k_{m-1} \mapsto \rho_K \in \mathbb{R}^k_+$, it is common to use product metrics with $g_i$ equal to $\rho_K(i)$ times the Fisher metric on $\Delta^\circ_{m-1}$. When $K$ has the interpretation of a channel or when $K$ describes the policy by which a system reacts to some sensor values, a natural possibility is to let $\rho_K$ be the stationary distribution of the channel input or of the sensor values, respectively. We will discuss this approach in Section 6.

6. Weighted Product Metrics for Conditional Models

In this section, we consider metrics on spaces of stochastic matrices defined as weighted sums
of the Fisher metrics on the spaces of the matrix rows, similar to Equation (34). This kind of metric
was used initially by Amari [1] in order to define a natural gradient in the supervised learning context.
Later, in the context of reinforcement learning, Kakade [2] defined a natural policy gradient based on
this kind of metric, which has been further developed by Peters et al. [10]. Related applications within
unsupervised learning have been pursued by Zahedi et al. [18].
Consider the following weighted product Fisher metric:

$$g^{\rho,m}_K = \sum_a \rho_K(a)\, g^{(m),a}_{K_a}, \quad \text{for all } K \in (\Delta^k_{m-1})^\circ, \tag{35}$$

where $g^{(m),a}_{K_a}$ denotes the Fisher metric of $\Delta^\circ_{m-1}$ at the $a$-th row of $K$ and $\rho_K \in \Delta^\circ_{k-1}$ is a probability distribution over $a$ associated with each $K \in (\Delta^k_{m-1})^\circ$. For example, the distribution $\rho_K$ could be the stationary distribution of sensor values observed by an agent when operating under a policy described by $K$.
In the following, we will try to illuminate the properties of polytope embeddings that yield the
metric (35) as the pull-back of the Fisher information metric on a probability simplex. We will focus on
the case that ρK = ρ is independent of K.
There are two direct ways of embedding $\Delta^k_{m-1}$ in a probability simplex. In Section 5, we used the inverse of the moment map of an exponential family, possibly with some reference measure. This embedding is illustrated in the left panel of Figure 2. Given a fixed probability distribution $\rho \in \Delta^\circ_{k-1}$, there is a second natural embedding $\psi_\rho : \Delta^k_{m-1} \to \Delta_{k\cdot m-1}$ defined as follows:

$$\psi_\rho(K)(x, y) = \rho(x) K_{x,y} \quad \text{for all } x \in [k],\ y \in [m]. \tag{36}$$

If $\rho$ is the distribution of a random variable $X$ and $K \in \Delta^k_{m-1}$ is the stochastic matrix describing the conditional distribution of another variable $Y$ given $X$, then $\psi_\rho(K)$ is the joint distribution of $X$ and $Y$. Note that $\psi_\rho$ is an affine embedding. See the right panel of Figure 2 for an illustration.
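The pull-back of the Fisher metric on $\Delta^\circ_{km-1}$ through $\psi_\rho$ is given by:

$$\begin{aligned}
g^{(km)}_{\psi_\rho(K)}(\psi_{\rho*}u, \psi_{\rho*}v) &= \sum_{a,b} \sum_{c,d} \sum_{i,j} \rho(i)K_{ij}\, u_{ab}\, \frac{\partial \log \rho(i)K_{ij}}{\partial K_{ab}}\, v_{cd}\, \frac{\partial \log \rho(i)K_{ij}}{\partial K_{cd}} \\
&= \sum_i \rho(i) \sum_j \frac{u_{ij}v_{ij}}{K_{ij}} = \sum_i \rho(i)\, g^{i}_{K_i}(u_i, v_i) = g^{\rho,m}_K(u, v).
\end{aligned} \tag{37}$$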


This recovers the weighted sum of Fisher metrics from Equation (35).
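
A small numerical check of Equation (37) may help to fix the index conventions. The sketch below assumes a hypothetical 2 × 3 stochastic matrix `K`, a fixed distribution `rho`, and two tangent vectors whose rows sum to zero; the function names are illustrative only.

```python
def fisher(p, u, v):
    """Fisher metric on the open simplex: sum_i u_i v_i / p_i."""
    return sum(ui * vi / pi for pi, ui, vi in zip(p, u, v))

def psi(rho, K):
    """psi_rho(K)(x, y) = rho(x) K[x][y], flattened row by row."""
    return [r * kxy for r, row in zip(rho, K) for kxy in row]

def weighted_metric(rho, K, U, V):
    """Weighted product Fisher metric: sum_a rho(a) g_{K_a}(U_a, V_a)."""
    return sum(r * fisher(Ka, Ua, Va) for r, Ka, Ua, Va in zip(rho, K, U, V))

rho = [0.25, 0.75]
K = [[0.1, 0.6, 0.3], [0.5, 0.25, 0.25]]          # rows sum to one
U = [[0.05, -0.02, -0.03], [0.1, -0.1, 0.0]]      # rows sum to zero
V = [[-0.2, 0.1, 0.1], [0.0, 0.3, -0.3]]

# psi_rho is linear in K, so its push-forward acts on tangent
# vectors by the same formula as psi_rho itself.
lhs = fisher(psi(rho, K), psi(rho, U), psi(rho, V))
rhs = weighted_metric(rho, K, U, V)
print(lhs, rhs)  # the two values agree up to rounding
```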

Figure 2. An illustration of different embeddings of the conditional polytope $\Delta^k_{m-1}$ in a probability simplex. The left panel shows an embedding in $\Delta_{m^k-1}$ by the inverse of the moment map $\mu$ of the independence model $E_k$. The right panel shows an affine embedding in $\Delta_{k\cdot m-1}$ as a set of joint probability distributions for two different specifications of marginals, labeled $\psi_{(1/4,\,3/4)}\Delta^k_{m-1}$ and $\psi_{(1/2,\,1/2)}\Delta^k_{m-1}$.

Are there natural maps that leave the metrics $g^{\rho,m}$ invariant? Let us reconsider the stochastic embeddings from Definition 4. Let $\bar R$ be a $k \times l$ indicator partition matrix and $R$ a stochastic partition matrix with the same block structure as $\bar R$. Observe that to each indicator partition matrix $\bar R$ there are many compatible stochastic partition matrices $R$, but the indicator partition matrix $\bar R$ for any stochastic partition matrix $R$ is unique. Furthermore, let $Q = \{Q^{(a)}\}_{a\in[k]}$ be a collection of stochastic partition matrices. The corresponding conditional embedding $\bar f$ maps $K \in \Delta^k_{m-1}$ to $\bar f(K) := \bar R^\top(K \otimes Q) \in \Delta^l_{n-1}$.
Let $\rho \in \Delta^\circ_{k-1}$. Suppose that $K$ describes the conditional distribution of $Y$ given $X$ and that $P = \psi_\rho(K)$ describes the joint distribution of $Y$ and $X$. As explained in Section 4.1, the matrix $f(P) := R^\top(P \otimes Q)$ describes the joint distribution of a pair of random variables $(X', Y')$, and the conditional distribution of $Y'$ given $X'$ is given by $\bar f(K)$. In this situation, the marginal distribution of $X'$ is given by $\rho' = \rho R$. Therefore, the following diagram commutes:

$$\begin{array}{ccc}
(\Delta^k_{m-1})^\circ & \xrightarrow{\ \psi_\rho\ } & \Delta^\circ_{mk-1} \\
\downarrow{\scriptstyle \bar f} & & \downarrow{\scriptstyle f} \\
(\Delta^l_{n-1})^\circ & \xrightarrow{\ \psi_{\rho'}\ } & \Delta^\circ_{nl-1}
\end{array} \tag{38}$$

The preceding discussion implies the first statement of the following result:

Theorem 7.

• For any $k \ge 1$ and $m \ge 2$ and any $\rho \in \Delta^\circ_{k-1}$, the Riemannian metric $g^{\rho,m}$ on $(\Delta^k_{m-1})^\circ$ satisfies:

$$g^{\rho,m} = \bar f^{*}(g^{\rho',n}), \quad \text{for } \rho' = \rho R, \tag{39}$$

for any conditional embedding $\bar f : K \mapsto \bar R^\top(K \otimes Q)$.
• Conversely, suppose that for any $k \ge 1$ and $m \ge 2$ and any $\rho \in \Delta^\circ_{k-1}$, there is a Riemannian metric $g^{(\rho,m)}$ on $(\Delta^k_{m-1})^\circ$, such that Equation (39) holds for all conditional embeddings, and suppose that $g^{(\rho,m)}$ depends continuously on $\rho$. Then, there is a constant $A > 0$ that satisfies $g^{(\rho,m)} = A g^{\rho,m}$.

Proof. The first statement follows from the commutative diagram (38). For the second statement, denote by $\rho_k$ the uniform distribution on a set of $k$ elements. If $\bar f : K \mapsto \bar R^\top(K \otimes Q)$ is a homogeneous conditional embedding of $\Delta^k_{m-1}$ in $\Delta^l_{n-1}$, then $R = \frac{k}{l}\bar R$ is a stochastic partition matrix corresponding to the partition indicator matrix $\bar R$. Observe that $\rho_l = \rho_k R$. Therefore, the family of Riemannian metrics $g^{(\rho_k,m)}$ on $\Delta^k_{m-1}$ satisfies the assumptions of Theorem 4. Therefore, there is a constant $A > 0$ for which $g^{(\rho_k,m)}$ equals $A/k$ times the product Fisher metric. This proves the statement for uniform distributions $\rho$.
A general distribution $\rho \in \Delta^\circ_{k-1}$ can be approximated by a distribution with rational probabilities. Since $g^{(\rho,m)}$ is assumed to be continuous, it suffices to prove the statement for rational $\rho$. In this case, there exists a stochastic partition matrix $R$ for which $\rho' := \rho R$ is a uniform distribution, and so, $g^{(\rho',n)}$ is of the desired form. Equation (39) shows that $g^{(\rho,m)}$ is also of the desired form.
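
The first statement of Theorem 7 can be verified on a small example. The following sketch assumes a hypothetical conditional embedding with k = 2, l = 3, m = 2, n = 3: an indicator partition matrix (called `R_bar` here), a compatible stochastic partition matrix `R`, and two stochastic partition matrices `Q[0]`, `Q[1]`. All names and numbers are illustrative assumptions, not taken from the paper.

```python
def fisher(p, u, v):
    """Fisher metric on the open simplex: sum_i u_i v_i / p_i."""
    return sum(ui * vi / pi for pi, ui, vi in zip(p, u, v))

def vec_mat(x, M):
    """Row vector times matrix."""
    return [sum(x[b] * M[b][j] for b in range(len(x))) for j in range(len(M[0]))]

def embed(R_bar, Q, X):
    """Row i of the image is X_a Q^(a), where a is the block containing i."""
    out = []
    for i in range(len(R_bar[0])):
        a = next(a for a in range(len(R_bar)) if R_bar[a][i] == 1)
        out.append(vec_mat(X[a], Q[a]))
    return out

def weighted_metric(rho, K, U, V):
    return sum(r * fisher(Ka, Ua, Va) for r, Ka, Ua, Va in zip(rho, K, U, V))

R_bar = [[1, 1, 0], [0, 0, 1]]              # indicator partition matrix (k x l)
R = [[0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]      # stochastic, same block structure
Q = [[[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],    # Q^(1), an m x n partition matrix
     [[1.0, 0.0, 0.0], [0.0, 0.3, 0.7]]]    # Q^(2)

rho = [0.3, 0.7]
K = [[0.2, 0.8], [0.6, 0.4]]
U = [[0.1, -0.1], [-0.05, 0.05]]            # tangent vectors: rows sum to zero
V = [[0.3, -0.3], [0.2, -0.2]]

rho_prime = vec_mat(rho, R)                 # marginal of X' is rho R
lhs = weighted_metric(rho, K, U, V)
# The embedding is linear in K, so tangent vectors are mapped by the same rule.
rhs = weighted_metric(rho_prime, embed(R_bar, Q, K),
                      embed(R_bar, Q, U), embed(R_bar, Q, V))
print(lhs, rhs)  # the two values agree up to rounding
```

The check covers only one embedding and one pair of tangent vectors, so it illustrates Equation (39) rather than proving it.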

7. Gradient Fields and Replicator Equations

In this section, we use gradient fields in order to compare Riemannian metrics on the space $(\Delta^k_{n-1})^\circ$.

7.1. Replicator Equations

We start with gradient fields on the simplex $\Delta^\circ_{n-1}$. A Riemannian metric $g$ on $\Delta^\circ_{n-1}$ allows us to consider gradient fields of differentiable functions $F : \Delta^\circ_{n-1} \to \mathbb{R}$. To be more precise, consider the differential $d_pF : T_p\Delta^\circ_{n-1} \to \mathbb{R}$ of $F$ in $p$. It is a linear form on $T_p\Delta^\circ_{n-1}$, which maps each tangent vector $u$ to $d_pF(u) = \frac{\partial F}{\partial u}(p) \in \mathbb{R}$. Using the map $u \mapsto g_p(u, \cdot)$, this linear form can be identified with a tangent vector in $T_p\Delta^\circ_{n-1}$, which we denote by $\operatorname{grad}_pF$. If we choose the Fisher metric $g^{(n)}$ as the Riemannian metric, we obtain the gradient in the following way. First consider a differentiable extension of $F$ to the positive cone $\mathbb{R}^n_+$, which we will denote by the same symbol $F$. With the partial derivatives $\partial_iF$ of $F$, the Fisher gradient of $F$ on the simplex $\Delta^\circ_{n-1}$ is given as:

$$(\operatorname{grad}_pF)_i = p_i\left(\partial_iF(p) - \sum_{j=1}^n p_j\, \partial_jF(p)\right), \quad i \in [n]. \tag{40}$$

Note that the expression on the right-hand side of Equation (40) does not depend on the particular differentiable extension of $F$ to $\mathbb{R}^n_+$. The corresponding differential equation is well known in theoretical biology as the replicator equation; see [19,20]:

$$\dot p_i = p_i\left(\partial_iF(p) - \sum_{j=1}^n p_j\, \partial_jF(p)\right), \quad i \in [n]. \tag{41}$$
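
The defining property of the Fisher gradient can be checked directly: for every tangent vector u, the Fisher inner product of the gradient with u should equal the differential applied to u. The sketch below uses the hypothetical test function F(p) = Σ p_i², chosen here only for illustration, with partial derivatives ∂_iF = 2p_i.

```python
def fisher_gradient(p, dF):
    """Equation (40): grad_i = p_i * (dF_i - sum_j p_j dF_j)."""
    mean = sum(pj * dj for pj, dj in zip(p, dF))
    return [pi * (di - mean) for pi, di in zip(p, dF)]

p = [0.2, 0.3, 0.5]
dF = [2 * pi for pi in p]          # partial derivatives of F(p) = sum_i p_i^2
grad = fisher_gradient(p, dF)

u = [0.1, 0.2, -0.3]               # tangent vector: entries sum to zero
inner = sum(gi * ui / pi for gi, ui, pi in zip(grad, u, p))   # g_p(grad, u)
differential = sum(di * ui for di, ui in zip(dF, u))          # d_pF(u)
print(inner, differential)         # the two values agree up to rounding
```

The entries of `grad` sum to zero, so the gradient is itself a tangent vector of the simplex.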


We now apply this gradient formula to functions that have the structure of an expectation value. Given real numbers $F_i$, $i \in [n]$, referred to as fitness values, we consider the mean fitness:

$$\bar F(p) := \sum_{i=1}^n p_i F_i. \tag{42}$$

Replacing the $p_i$ by any positive real numbers leads to a differentiable extension of $\bar F$, also denoted by $\bar F$. Obviously, we have $\partial_i\bar F = F_i$, which leads to the following replicator equation:

$$\dot p_i = p_i\left(F_i - \bar F(p)\right), \quad i \in [n]. \tag{43}$$

This equation has the solution:

$$p_i(t) = \frac{p_i(0)\, e^{tF_i}}{\sum_{j=1}^n p_j(0)\, e^{tF_j}}, \quad i \in [n]. \tag{44}$$

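Clearly, the mean fitness will increase along this solution of the gradient field. The rate of increase can be easily calculated:

$$\frac{d}{dt}\bar F\big(p(t)\big) = \sum_{i=1}^n \dot p_i(t)\, F_i = \sum_{i=1}^n p_i\left(F_i - \bar F(p)\right)F_i = \sum_{i=1}^n p_i\left(F_i - \bar F(p)\right)^2 = \operatorname{var}_p(F) \ge 0, \tag{45}$$

with strict inequality unless all fitness values $F_i$ coincide.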
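
As a sanity check, the closed-form solution can be compared with the replicator equation numerically. The sketch below evaluates the solution with the denominator summed as Σ_j p_j(0) e^{tF_j}, differentiates it by central differences, and compares the result with the right-hand side of the replicator equation. The initial condition and fitness values are hypothetical.

```python
import math

def solution(p0, F, t):
    """Closed-form solution: p_i(t) proportional to p_i(0) * exp(t * F_i)."""
    w = [pi * math.exp(t * Fi) for pi, Fi in zip(p0, F)]
    z = sum(w)
    return [wi / z for wi in w]

def mean_fitness(p, F):
    return sum(pi * Fi for pi, Fi in zip(p, F))

def replicator_rhs(p, F):
    m = mean_fitness(p, F)
    return [pi * (Fi - m) for pi, Fi in zip(p, F)]

p0 = [0.5, 0.3, 0.2]
F = [1.0, 2.0, 0.5]
t, h = 0.7, 1e-6

# Central-difference approximation of dp/dt at time t.
plus, minus = solution(p0, F, t + h), solution(p0, F, t - h)
numeric = [(a - b) / (2 * h) for a, b in zip(plus, minus)]
exact = replicator_rhs(solution(p0, F, t), F)
max_err = max(abs(a - b) for a, b in zip(numeric, exact))

# Mean fitness along the trajectory at a few sample times.
fitness_path = [mean_fitness(solution(p0, F, s), F) for s in (0.0, 1.0, 2.0, 3.0)]
print(max_err, fitness_path)
```

In this example, the mean fitness increases along the sampled times and approaches the largest fitness value, in line with Equation (45).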

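As limit points of this solution, we obtain:

$$\lim_{t\to-\infty} p_i(t) = \begin{cases} \dfrac{p_i(0)}{\sum_{j\in\operatorname{argmin} F} p_j(0)}, & \text{if } i \in \operatorname{argmin} F \\ 0, & \text{otherwise} \end{cases}, \quad i \in [n], \tag{46}$$

and:

$$\lim_{t\to+\infty} p_i(t) = \begin{cases} \dfrac{p_i(0)}{\sum_{j\in\operatorname{argmax} F} p_j(0)}, & \text{if } i \in \operatorname{argmax} F \\ 0, & \text{otherwise} \end{cases}, \quad i \in [n]. \tag{47}$$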

7.2. Extension of the Replicator Equations to Stochastic Matrices

Now, we come to the corresponding considerations of gradient fields in the context of stochastic matrices $K \in (\Delta^k_{n-1})^\circ$. We consider a function:

$$K \mapsto F(K) = F(K_{11}, \dots, K_{1n}; K_{21}, \dots, K_{2n}; \dots; K_{k1}, \dots, K_{kn}). \tag{48}$$

One way to deal with this is to consider for each $i \in [k]$ the corresponding replicator equation:

$$\dot K_{ij} = K_{ij}\left(\partial_{ij}F(K) - \sum_{j'=1}^n K_{ij'}\, \partial_{ij'}F(K)\right), \quad j \in [n]. \tag{49}$$

Obviously, this is the gradient field that one obtains by using the product Fisher metric on $(\Delta^k_{n-1})^\circ$ (Equation (17)):

$$g^{(k,m)}_K(u, v) = \sum_{ij} \frac{1}{K_{ij}}\, u_{ij}v_{ij}. \tag{50}$$

If we replace the metric by the weighted product Fisher metric considered by Kakade (Equation (35)),

$$g^{\rho,m}_K(u, v) = \sum_{ij} \frac{\rho_i}{K_{ij}}\, u_{ij}v_{ij}, \tag{51}$$

then we obtain

$$\dot K_{ij} = \frac{K_{ij}}{\rho_i}\left(\partial_{ij}F(K) - \sum_{j'=1}^n K_{ij'}\, \partial_{ij'}F(K)\right), \quad j \in [n]. \tag{52}$$

7.3. The Example of Mean Fitness

Next, we want to study how the gradient flows with respect to different metrics compare. We restrict to the class of metrics $g^{\rho,m}$ (Equation (35)), where $\rho \in \Delta^\circ_{k-1}$ is a probability distribution. In principle, one could drop the normalization condition $\sum_i \rho_i = 1$ and allow arbitrary coefficients $\rho_i$. However, it is clear that the rate of convergence can always be increased by scaling all values $\rho_i$ with a common positive factor. Therefore, some normalization condition is needed for $\rho$.
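With a probability distribution $p \in \Delta^\circ_{k-1}$ and fitness values $F_{ij}$, let us consider again the example of an expectation value function:

$$\bar F(K) = \sum_{i=1}^k p_i \sum_{j=1}^n K_{ij} F_{ij}. \tag{53}$$

With $\partial_{ij}\bar F(K) = p_i F_{ij}$, this leads to:

$$\dot K_{ij} = \frac{p_i}{\rho_i} K_{ij}\left(F_{ij} - \sum_{j'=1}^n K_{ij'} F_{ij'}\right), \quad j \in [n]. \tag{54}$$

The corresponding solutions are given by:

$$K_{ij}(t) = \frac{K_{ij}(0)\, e^{t\frac{p_i}{\rho_i}F_{ij}}}{\sum_{j'=1}^n K_{ij'}(0)\, e^{t\frac{p_i}{\rho_i}F_{ij'}}}, \quad i \in [k],\ j \in [n]. \tag{55}$$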

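Since $\operatorname{argmax}\big(\tfrac{p_i}{\rho_i}F_{i\cdot}\big)$ and $\operatorname{argmin}\big(\tfrac{p_i}{\rho_i}F_{i\cdot}\big)$ are independent of $\rho_i > 0$, the limit points are given independently of the chosen $\rho$ as:

$$\lim_{t\to-\infty} K_{ij}(t) = \begin{cases} \dfrac{K_{ij}(0)}{\sum_{j'\in\operatorname{argmin} F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmin} F_{i\cdot} \\ 0, & \text{otherwise} \end{cases}, \quad i \in [k], \tag{56}$$

and:

$$\lim_{t\to+\infty} K_{ij}(t) = \begin{cases} \dfrac{K_{ij}(0)}{\sum_{j'\in\operatorname{argmax} F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmax} F_{i\cdot} \\ 0, & \text{otherwise} \end{cases}, \quad i \in [k]. \tag{57}$$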

This is consistent with the fact that the critical points of gradient fields are independent of the chosen Riemannian metric. However, the speed of convergence does depend on the metric:
For each $i$, let $G_i = \max_j F_{ij}$ and $g_i = \max_{j \notin \operatorname{argmax} F_{i\cdot}} F_{ij}$ be the largest and second-largest values in the $i$-th row of $F_{ij}$, respectively. Then, as $t \to \infty$,

$$K_{ij}(t) = \begin{cases} 1 - O\big(\exp(-\tfrac{p_i}{\rho_i}(G_i - g_i)t)\big), & \text{if } j \in \operatorname{argmax} F_{i\cdot} \\ O\big(\exp(-\tfrac{p_i}{\rho_i}(G_i - F_{ij})t)\big), & \text{otherwise.} \end{cases} \tag{58}$$

Therefore,

$$\begin{aligned}
\bar F(K(t)) &= \sum_i p_i \sum_{j\in\operatorname{argmax} F_{i\cdot}} F_{ij} + O\Big(\exp\big(-\tfrac{p_i}{\rho_i}(G_i - g_i)t\big)\Big) \\
&= \sum_i p_i \sum_{j\in\operatorname{argmax} F_{i\cdot}} F_{ij} + O\Big(\exp\big(-\inf_i\big\{\tfrac{p_i}{\rho_i}(G_i - g_i)\big\}\,t\big)\Big).
\end{aligned} \tag{59}$$

Thus, in the long run, the rate of convergence is given by $\inf_i\{\tfrac{p_i}{\rho_i}(G_i - g_i)\}$, which depends on the parameter $\rho$ of the metric. As a result, in this case study, the optimal choice of $\rho_i$, i.e., the one with the largest convergence rate, can be computed if the numbers $G_i$ and $g_i$ are known.
Consider, for example, the case that the differences $G_i - g_i$ are of comparable sizes for all $i$. Then, we need to find the choice of $\rho$ that maximizes $\inf_i\{\tfrac{p_i}{\rho_i}\}$. Clearly, $\inf_i\{\tfrac{p_i}{\rho_i}\} \le 1$ (since there is always an index $i$ with $p_i \le \rho_i$). Equality is attained for the choice $\rho_i = p_i$. Thus, we recover the choice of Kakade.
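
The dependence of the asymptotic rate on ρ can be made concrete with a short computation. The sketch below evaluates inf_i (p_i/ρ_i)(G_i − g_i) for a hypothetical fitness table in which both rows have the same gap, and compares the choice ρ = p with a uniform ρ. It assumes that every row contains at least two distinct fitness values; the function name is illustrative.

```python
def convergence_rate(p, rho, F):
    """Asymptotic rate inf_i (p_i / rho_i) * (G_i - g_i)."""
    rates = []
    for pi, ri, row in zip(p, rho, F):
        top = max(row)                            # G_i, largest value in the row
        second = max(x for x in row if x < top)   # g_i, second-largest value
        rates.append(pi / ri * (top - second))
    return min(rates)

p = [0.2, 0.8]
F = [[1.0, 0.0, 0.5],     # gap G_1 - g_1 = 0.5
     [2.0, 1.5, 0.0]]     # gap G_2 - g_2 = 0.5

rate_kakade = convergence_rate(p, p, F)             # rho = p
rate_uniform = convergence_rate(p, [0.5, 0.5], F)   # rho uniform
print(rate_kakade, rate_uniform)
```

With equal gaps, any ρ different from p has some index with p_i < ρ_i, so the minimum over i drops below the common gap, as happens for the uniform ρ in this example.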

8. Conclusions

So, which Riemannian metric should one use in practice on the set of stochastic matrices, $(\Delta^k_{m-1})^\circ$? The results provided in this manuscript give different answers, depending on the approach. In all cases, the characterized Riemannian metrics are products of Fisher metrics with suitable factor weights. Theorem 4 suggests using a factor weight proportional to $1/k$, and Theorem 6 suggests using a constant weight independent of $k$. In many cases, it is possible to work within a single conditional polytope $(\Delta^k_{m-1})^\circ$ and a fixed $k$, and then, these two results are basically equivalent. On the other hand, Theorem 7 gives an answer that allows arbitrary factor weights $\rho$.
Which metric performs best obviously depends on the concrete application. The first observation
is that in order to use the metric gρ,m of Theorem 7, it is necessary to know ρ. If the problem at
hand suggests a natural marginal distribution ρ, then it is natural to make use of it and choose the
metric gρ,m. Even if ρ is not known at the beginning, a learning system might try to learn it to improve
its performance.
On the other hand, there may be situations where there is no natural choice of the weights ρ.
Observe that ρ breaks the symmetry of permuting the rows of a stochastic matrix. This is also expressed
by the structural difference between Theorems 4 and 6 on the one side and Theorem 7 on the other.
While the first two Theorems provide an invariance metric characterization, Theorem 7 provides a
“covariance” classification; that is, the metrics gρ,m are not invariant under conditional embeddings,
but they transform in a controlled manner. This again illustrates that the choice of a metric should
depend on which mappings are natural to consider, e.g., which mappings describe the symmetries of a
given problem.
For example, consider a utility function of the form $F = \sum_i \rho_i \sum_j K_{ij}F_{ij}$. Row permutations do not leave $g^{\rho,m}$ invariant (for a general $\rho$), but they are not symmetries of the utility function $F$, either, and hence, they are not very natural mappings to consider. However, row permutations transform the metric $g^{\rho,m}$ and the utility function in a controlled manner, in such a way that the two transformations match. Therefore, in this case, it is natural to use $g^{\rho,m}$. On the other hand, when studying problems that are symmetric under all row permutations, it is more natural to use the invariant metric $g^{(k,m)}$.

Appendix A Conditions for Positive Definiteness

Equation (16) in Lebanon's Theorem 3 defines a Riemannian metric whenever it defines a positive-definite quadratic form. The next Proposition gives necessary and sufficient conditions for this to be the case.


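Proposition A1. For each pair $k \ge 1$ and $m \ge 2$, consider the tensor on $\mathbb{R}^{k\times m}_+$ defined by:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac}\left(\frac{B(|M|)}{|M_a|} + \delta_{bd}\frac{C(|M|)}{M_{ab}}\right) \tag{A1}$$

for some differentiable functions $A, B, C \in C^\infty(\mathbb{R}_+)$. The tensor $g^{(k,m)}$ defines a Riemannian metric for all $k$ and $m$ if and only if $C(\alpha) > 0$, $B(\alpha) + C(\alpha) > 0$ and $A(\alpha) + B(\alpha) + C(\alpha) > 0$ for all $\alpha \in \mathbb{R}_+$.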

Proof. The tensors are Riemannian metrics when:

$$g^{(k,m)}_M(V) = A(|M|)\Big(\sum_{ab} V_{ab}\Big)^2 + B(|M|)\sum_a \frac{|M|}{|M_a|}\Big(\sum_b V_{ab}\Big)^2 + C(|M|)\sum_{ab}\frac{|M|}{M_{ab}} V_{ab}^2 \tag{A2}$$

is strictly positive for all non-zero $V \in \mathbb{R}^{k\times m}$, for all $M \in \mathbb{R}^{k\times m}_+$.
We can derive necessary conditions on the functions $A$, $B$, $C$ from some basic observations. Choosing $V = \partial_{ab}$ in Equation (A2) shows that $A(|M|) + \frac{|M|}{|M_a|}B(|M|) + \frac{|M|}{M_{ab}}C(|M|)$ has to be positive for all $a \in [k]$, $b \in [m]$, for all $M \in \mathbb{R}^{k\times m}_+$. Since $M_{ab}$ can be arbitrarily small for fixed $|M|$ and $|M_a|$, we see that $C$ has to be non-negative. Since we can choose $|M_a| \approx M_{ab} \ll |M|$ for a fixed $|M|$, we find that $B + C$ has to be non-negative. Further, since we can choose $M_{ab} \approx |M_a| \approx |M|$ for a given $|M|$, we find that $A + B + C$ has to be non-negative. This shows that the quadratic form is positive definite only if $C \ge 0$, $B + C \ge 0$, $A + B + C \ge 0$. Since the cone of positive definite matrices is open, these inequalities have to be strictly satisfied. In the following, we study sufficient conditions.
For any given $M \in \mathbb{R}^{k\times m}_+$, we can write Equation (A2) as a product $V^\top G V$, for all $V \in \mathbb{R}^{km}$, where $G = G_A + G_B + G_C \in \mathbb{R}^{km\times km}$ is the sum of a matrix $G_A$ with all entries equal to $A(|M|)$, a block diagonal matrix $G_B$ whose $a$-th block has all entries equal to $\frac{|M|}{|M_a|}B(|M|)$, and a diagonal matrix $G_C$ with diagonal entries equal to $\frac{|M|}{M_{ab}}C(|M|)$. The matrix $G$ is obviously symmetric, and by Sylvester's criterion, it is positive definite iff all its leading principal minors are positive. We can evaluate the minors using Sylvester's determinant Theorem. That Theorem states that for any invertible $m \times m$ matrix $X$, an $m \times n$ matrix $Y$ and an $n \times m$ matrix $Z$, one has the equality $\det(X + YZ) = \det(X)\det(I_n + ZX^{-1}Y)$.
Let us consider a leading square block $G'$, consisting of all entries $G_{ab,cd}$ of $G$ with row-index pairs $(a, b)$ satisfying $b \in [m]$ for all $a < a'$ and $b \le b'$ for $a = a'$ for some $a' \le k$ and $b' \le m$; and the same restriction for the column index pairs. The corresponding block $G'_A + G'_B$ can be written as the rank-$a'$ matrix $YZ$, with $Y$ consisting of columns $\mathbb{1}_a$ for all $a \le a'$ and $Z$ consisting of rows $A + \mathbb{1}_a\frac{|M|}{|M_a|}B$ for all $a \le a'$. Hence, the determinant of $G'$ is equal to:

$$\det(G') = \det(G'_C) \cdot \det(I_{a'} + ZG'^{-1}_C Y). \tag{A3}$$

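Since $G'_C$ is diagonal, the first term is just:

$$\det(G'_C) = \left(\prod_{a<a'}\prod_b \frac{|M|}{M_{ab}}C\right)\left(\prod_{b\le b'}\frac{|M|}{M_{a'b}}C\right). \tag{A4}$$

The matrix in the second term of Equation (A3) is given by:

$$I_{a'} + ZG'^{-1}_C Y = \frac{1}{C}\begin{pmatrix} C + B & & & \\ & \ddots & & \\ & & C + B & \\ & & & C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|}B \end{pmatrix} + \frac{1}{C}\begin{pmatrix} \frac{|M_1|}{|M|}A & \cdots & \frac{|M_{a'-1}|}{|M|}A & \frac{\sum_{b\le b'} M_{a'b}}{|M|}A \\ \vdots & & \vdots & \vdots \\ \frac{|M_1|}{|M|}A & \cdots & \frac{|M_{a'-1}|}{|M|}A & \frac{\sum_{b\le b'} M_{a'b}}{|M|}A \end{pmatrix}. \tag{A5}$$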


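By Sylvester's determinant Theorem, we have:

$$\begin{aligned}
\det(I_{a'} + ZG'^{-1}_C Y) &= C^{-a'}(C + B)^{a'-1}\left(C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|}B\right)\left(1 + \sum_{a<a'}\frac{\frac{|M_a|}{|M|}A}{C + B} + \frac{\frac{\sum_{b\le b'} M_{a'b}}{|M|}A}{C + \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|}B}\right) \\
&= \left(\prod_a \frac{C + B_a}{C}\right)\left(1 + \sum_a \frac{A_a}{C + B_a}\right),
\end{aligned} \tag{A6}$$

where $A_a = \frac{|M_a|}{|M|}A$ for $a < a'$ and $A_{a'} = \frac{\sum_{b\le b'} M_{a'b}}{|M|}A$, and $B_a = B$ for $a < a'$ and $B_{a'} = \frac{\sum_{b\le b'} M_{a'b}}{|M_{a'}|}B$.
This shows that the matrix $G$ is positive definite for all $M$ if and only if $C > 0$, $C + B > 0$ and $\left(1 + \sum_{a\le a'}\frac{A_a}{C + B_a}\right) > 0$ for all $a'$ and $b'$. The latter inequality is satisfied whenever $A + B + C > 0$. This completes the proof.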
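
The criterion of Proposition A1 can be tried out numerically. The sketch below builds the matrix G of the quadratic form as written in Equation (A2) for a hypothetical positive matrix M and constant values of A, B, C, and tests positive definiteness with a plain Cholesky factorization. The helper names and the chosen numbers are assumptions made for illustration.

```python
import math

def gram_matrix(M, A, B, C):
    """Matrix G with V^T G V equal to the quadratic form in Equation (A2)."""
    total = sum(map(sum, M))
    idx = [(a, b) for a in range(len(M)) for b in range(len(M[0]))]
    G = []
    for (a, b) in idx:
        row = []
        for (c, d) in idx:
            val = A
            if a == c:
                val += total / sum(M[a]) * B
                if b == d:
                    val += total / M[a][b] * C
            row.append(val)
        G.append(row)
    return G

def is_positive_definite(G):
    """Attempt a Cholesky factorization; it succeeds iff G is positive definite."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][t] * L[j][t] for t in range(j))
            if i == j:
                if s <= 0:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

M = [[1.0, 2.0], [0.5, 1.5]]

# C > 0, B + C > 0, A + B + C > 0: all three conditions hold.
ok = is_positive_definite(gram_matrix(M, -0.5, -0.5, 1.5))
# A + B + C < 0: the choice V = M makes the quadratic form negative.
bad = is_positive_definite(gram_matrix(M, -2.0, 0.0, 1.0))
print(ok, bad)
```

A single matrix M cannot confirm the "for all M" part of the statement; the example only shows the conditions at work in one instance.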

Appendix B Proofs of the Invariance Characterization

The following Lemma follows directly from the Definition and contains all the technical details
we need for the proofs.

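Lemma A1. The push-forward $f_* : T_M\mathbb{R}^{k\times m}_+ \to T_{f(M)}\mathbb{R}^{l\times n}_+$ of a map $f \in \hat F^{l,n}_{k,m}$ is given by:

$$f_*(\partial_{ab}) = \sum_{i=1}^l \sum_{j=1}^n R_{ai}\, Q^{(a)}_{bj}\, \partial'_{ij}, \tag{A7}$$

and the pull-back of a metric $g^{(l,n)}$ on $\mathbb{R}^{l\times n}_+$ through $f$ is given by:

$$(f^*g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(l,n)}_{f(M)}(f_*\partial_{ab}, f_*\partial_{cd}) = \sum_{i=1}^l \sum_{j=1}^n \sum_{s=1}^l \sum_{t=1}^n R_{ai}R_{cs}\, Q^{(a)}_{bj} Q^{(c)}_{dt}\; g^{(l,n)}_{f(M)}(\partial'_{ij}, \partial'_{st}). \tag{A8}$$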

Proof of Theorem 4. We follow the strategy of [5,14]. The idea is to consider subclasses of maps from the class $F^{l,n}_{k,m}$ and to evaluate their push-forward and pull-back maps together with the isometry requirement. This yields restrictions on the possible metrics, eventually fully characterizing them.

First. Consider the maps $h_{\pi,\sigma} \in F^{k,m}_{k,m}$, resulting from permutation matrices $Q^{(a)} = P_{\pi_a}$, $\pi_a : [m] \to [m]$ for all $a \in [k]$, and $R = P_\sigma$, $\sigma : [k] \to [k]$. Requiring isometry yields:

$$(h_{\pi,\sigma})_*(\partial_{ab}) = \partial'_{\sigma(a)\,\pi_a(b)} \tag{A9}$$

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_{h_{\pi,\sigma}(M)}(\partial_{\sigma(a)\,\pi_a(b)}, \partial_{\sigma(c)\,\pi_c(d)}). \tag{A10}$$

Second. Consider the maps $r_{zw} \in F^{kz,mw}_{k,m}$ defined by $Q^{(1)} = \cdots = Q^{(k)} \in \mathbb{R}^{m\times mw}$ and $R \in \mathbb{R}^{k\times kz}$ being uniform. In this case, for some permutations $\pi$ and $\sigma$,

$$(r_{zw})_*(\partial_{ab}) = \frac{1}{w}\sum_{i=1}^z \sum_{j=1}^w \partial'_{\sigma^{(a)}(i)\,\pi^{(b)}(j)} \tag{A11}$$

$$(r_{zw}^{\,*}g^{(kz,mw)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{w^2}\sum_{i=1}^z \sum_{j=1}^w \sum_{s=1}^z \sum_{t=1}^w g^{(kz,mw)}_{r_{zw}(M)}(\partial'_{\sigma^{(a)}(i)\,\pi^{(b)}(j)}, \partial'_{\sigma^{(c)}(s)\,\pi^{(d)}(t)}). \tag{A12}$$


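Third. For a rational matrix $M = \frac{1}{Z}\tilde M$ with $\tilde M \in \mathbb{N}^{k\times m}$ and row-sum $|\tilde M_a| = N \in \mathbb{N}$ for all $a \in [k]$, consider the map $v_M \in F^{zk,N}_{k,m}$ that maps $M$ to a constant matrix. In this case, $R \in \mathbb{R}^{k\times kz}$ and $Q^{(a)}$ has the $b$-th row with $|\tilde M_{ab}|$ entries with value $\frac{1}{|\tilde M_{ab}|}$ at positions $\pi^{(ab)}([\tilde M_{ab}]) \subseteq [N]$, and:

$$(v_M)_*(\partial_{ab}) = \frac{1}{\tilde M_{ab}}\sum_{i=1}^z \sum_{j=1}^{\tilde M_{ab}} \partial'_{\sigma^{(a)}(i)\,\pi^{(ab)}(j)} \tag{A13}$$

$$(v_M^{\,*}g^{(kz,N)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{\tilde M_{ab}}\frac{1}{\tilde M_{cd}}\sum_{i=1}^z \sum_{j=1}^{\tilde M_{ab}} \sum_{s=1}^z \sum_{t=1}^{\tilde M_{cd}} g^{(kz,N)}_{v_M(M)}(\partial'_{\sigma^{(a)}(i)\,\pi^{(ab)}(j)}, \partial'_{\sigma^{(c)}(s)\,\pi^{(cd)}(t)}). \tag{A14}$$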

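Step 1: $a \ne c$. Consider a constant matrix $M = U$. Then:

$$g^{(k,m)}_U(\partial_{a_1b_1}, \partial_{c_1d_1}) = g^{(k,m)}_{h_{\pi,\sigma}(U)}(\partial_{a_2b_2}, \partial_{c_2d_2}) = g^{(k,m)}_U(\partial_{a_2b_2}, \partial_{c_2d_2}). \tag{A15}$$

This implies that $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \hat A(k, m)$ when $a \ne c$.
Using the second type of map, we get:

$$\hat A(k, m) = \frac{z^2w^2}{w^2}\hat A(kz, mw), \tag{A16}$$

which implies $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}$, when $a \ne c$. Considering a rational matrix $M$ and the map $v_M$ yields:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}. \tag{A17}$$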

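Step 2: $b \ne d$. By similar arguments as in Step 1, $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \hat B(k, m)$. Evaluating the map $r_{zw}$ yields:

$$\hat B(k, m) = \frac{zw^2}{w^2}\hat B(kz, mw) + \frac{z(z-1)w^2}{w^2}\frac{A}{(kz)^2} = z\hat B(kz, mw) + \frac{z-1}{z}\frac{A}{k^2}, \tag{A18}$$

and therefore,

$$\frac{1}{z}\left(\hat B(k, m) - \frac{A}{k^2}\right) = \hat B(kz, mw) - \frac{A}{(kz)^2}, \tag{A19}$$

which implies that $\left(\hat B(k, m) - \frac{A}{k^2}\right)$ is independent of $m$ and scales with the inverse of $k$, such that it can be written as $\frac{B}{k}$. Rearranging the terms yields $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \frac{A}{k^2} + \frac{B}{k}$, for $b \ne d$.
For a rational matrix $M$, the pull-back through $v_M$ then shows:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = z\,\frac{\tilde M_{ab}\tilde M_{ad}}{\tilde M_{ab}\tilde M_{ad}}\left(\frac{A}{(kz)^2} + \frac{B}{kz}\right) + z(z-1)\frac{\tilde M_{ab}\tilde M_{ad}}{\tilde M_{ab}\tilde M_{ad}}\frac{A}{(kz)^2} = \frac{A}{k^2} + \frac{B}{k}. \tag{A20}$$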


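Step 3: $a = c$ and $b = d$. In this case, $g^{(k,m)}_U(\partial_{a_1b_1}, \partial_{a_1b_1}) = g^{(k,m)}_U(\partial_{a_2b_2}, \partial_{a_2b_2}) = \hat C(k, m)$, and:

$$\begin{aligned}
\hat C(k, m) &= \frac{1}{w^2}\,zw\,\hat C(kz, mw) + \frac{1}{w^2}\,zw(w-1)\left(\frac{A}{(kz)^2} + \frac{B}{kz}\right) + \frac{1}{w^2}\,w^2 z(z-1)\frac{A}{(kz)^2} \\
&= \frac{z}{w}\hat C(kz, mw) + \Big(1 - \frac{1}{zw}\Big)\frac{A}{k^2} + \Big(1 - \frac{z}{zw}\Big)\frac{B}{k},
\end{aligned} \tag{A21}$$

which implies:

$$\frac{k}{m}\left(\hat C(k, m) - \frac{A}{k^2} - \frac{B}{k}\right) = \frac{kz}{mw}\left(\hat C(kz, mw) - \frac{A}{(kz)^2} - \frac{B}{kz}\right), \tag{A22}$$

such that the left-hand side is a constant $C$, and $g^{(k,m)}_U(\partial_{ab}, \partial_{ab}) = \frac{A}{k^2} + \frac{B}{k} + \frac{m}{k}C$. Now, for a rational matrix $M$, pulling back through $v_M$ gives:

$$\begin{aligned}
g^{(k,m)}_M(\partial_{ab}, \partial_{ab}) &= \frac{1}{\tilde M_{ab}^2}\tilde M_{ab}\left(\frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde M_a|}{k}C\right) + \frac{1}{\tilde M_{ab}^2}\tilde M_{ab}(\tilde M_{ab} - 1)\left(\frac{A}{k^2} + \frac{B}{k}\right) + 0 \\
&= \frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde M_a|}{\tilde M_{ab}\,k}C \\
&= \frac{A}{k^2} + k\frac{B}{k^2} + \frac{|M|}{M_{ab}}\frac{C}{k^2}.
\end{aligned} \tag{A23}$$

Summarizing, we found:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac}\left(k\frac{B}{k^2} + \delta_{bd}\frac{|M|}{M_{ab}}\frac{C}{k^2}\right), \tag{A24}$$

which proves the first statement. The second statement follows by plugging Equation (23) into Equation (A8). Finally, the statement about the positive-definiteness is a direct consequence of Proposition A1.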

Proof of Theorem 5. Suppose, contrary to the claim, that a family of metrics $g^{(k,m)}_M$ exists, which is invariant with respect to any conditional embedding. By Theorem 4, these metrics are of the form of Equation (23). To prove the claim, we only need to show that $A$, $B$ and $C$ vanish. In the following, we study conditional embeddings where $Q$ consists of identity matrices and evaluate the isometry requirement $(f^*g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_M(\partial_{ab}, \partial_{cd})$.
Step 1: In the case $a \ne c$, we obtain from the invariance requirement and Equation (A8), that:

$$\frac{A}{k^2} = |R_a||R_c|\frac{A}{l^2}. \tag{A25}$$

Observe that:

$$\frac{1}{k}\sum_{i=1}^k |R_i| = \frac{1}{k}|R| = \frac{l}{k}. \tag{A26}$$

In fact, $|R_i|$ is the cardinality of the $i$-th block of the partition belonging to $R$. Therefore, if we choose $R$ to be the partition indicator matrix of a partition that is not homogeneous and in which $|R_a| > l/k$ and $|R_c| > l/k$, then Equation (A25) implies that $A = 0$.
Step 2: In the case $a = c$ and $b \ne d$, we obtain from invariance and Equation (A8), that:

$$\frac{B}{k} = \sum_{i=1}^l \sum_{s=1}^l R_{ai}R_{as}\delta_{is}\frac{B}{l} = |R_a|\frac{B}{l}. \tag{A27}$$

Again, we may choose $R_a$ in such a way that $|R_a| \ne \frac{l}{k}$ and find that $B = 0$.


Step 3: Finally, in the case $a = c$ and $b = d$, we obtain from invariance and Equation (A8), that:

$$\frac{C|M|}{k^2M_{ab}} = \sum_{i=1}^l \sum_{s=1}^l R_{ai}R_{as}\delta_{i,s}\frac{C|R^\top M|}{l^2(R^\top M)_{ib}} = |R_a|\frac{C|R^\top M|}{l^2M_{ab}}. \tag{A28}$$

If we choose $R_a$, such that $|R_a| \ne \frac{|M|}{|R^\top M|}$, then we see that $C = 0$. Therefore, $g^{(k,m)}$ is the zero-tensor, which is not a metric.

Acknowledgments

The authors are grateful to Keyan Zahedi for discussions related to policy gradient methods in
robotics applications. Guido Montúfar thanks the Santa Fe Institute for hosting him during the initial
work on this article. Johannes Rauh acknowledges support by the VW Foundation. This work was
supported in part by the DFG Priority Program, Autonomous Learning (DFG-SPP 1527).

Author Contributions

All authors contributed to the design of the research. The research was carried out by all authors,
with main contributions by Guido Montúfar and Johannes Rauh. The manuscript was written by
Guido Montúfar, Johannes Rauh and Nihat Ay. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Amari, S. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
2. Kakade, S. A Natural Policy Gradient. In Advances in Neural Information Processing Systems 14; MIT Press: Cambridge, MA, USA, 2001; pp. 1531–1538.
3. Shahshahani, S. A New Mathematical Framework for the Study of Linkage and Selection; American Mathematical Society: Providence, RI, USA, 1979.
4. Chentsov, N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Providence, RI, USA, 1982.
5. Campbell, L. An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 1986, 98, 135–141.
6. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12; MIT Press: Cambridge, MA, USA, 2000; pp. 1057–1063.
7. Marbach, P.; Tsitsiklis, J. Simulation-based optimization of Markov reward processes. IEEE Trans. Autom. Control 2001, 46, 191–209.
8. Montúfar, G.; Ay, N.; Zahedi, K. Expressive power of conditional restricted Boltzmann machines for sensorimotor control. 2014, arXiv:1402.3346.
9. Ay, N.; Montúfar, G.; Rauh, J. Selection Criteria for Neuromanifolds of Stochastic Dynamics. In Advances in Cognitive Neurodynamics (III); Yamaguchi, Y., Ed.; Springer-Verlag: Dordrecht, The Netherlands, 2013; pp. 147–154.
10. Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190.
11. Peters, J.; Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the IEEE International Conference on Intelligent Robotics Systems (IROS 2006), Beijing, China, 9–15 October 2006.
12. Peters, J.; Vijayakumar, S.; Schaal, S. Reinforcement learning for humanoid robotics. In Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots, Karlsruhe, Germany, 29–30 September 2003; pp. 1–20.
13. Bagnell, J.A.; Schneider, J. Covariant policy search. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 9–15 August 2003; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2003; pp. 1019–1024.
14. Lebanon, G. Axiomatic geometry of conditional models. IEEE Trans. Inform. Theor. 2005, 51, 1283–1294.
15. Lebanon, G. An Extended Čencov-Campbell Characterization of Conditional Information Geometry. In Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence (UAI 04), Banff, AL, Canada, 7–11 July 2004; Chickering, D.M., Halpern, J.Y., Eds.; AUAI Press: Arlington, VA, USA, 2004; pp. 341–345.
16. Barndorff-Nielsen, O. Information and Exponential Families in Statistical Theory; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1978.
17. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
18. Zahedi, K.; Ay, N.; Der, R. Higher coordination with less control—A result of information maximization in the sensorimotor loop. Adapt. Behav. 2010, 18.
19. Hofbauer, J.; Sigmund, K. Evolutionary Games and Population Dynamics; Cambridge University Press: Cambridge, United Kingdom, 1998.
20. Ay, N.; Erb, I. On a notion of linear replicator equations. J. Dyn. Differ. Equ. 2005, 17, 427–451.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
Matrix Algebraic Properties of the Fisher Information
Matrix of Stationary Processes

André Klein

Rothschild Blv. 123 Apt.7, 65271 Tel Aviv, Israel; A.A.B.Klein@uva.nl or klein@contact.uva.nl; Tel.: 972.5.25594723

Received: 12 February 2014; in revised form: 11 March 2014 / Accepted: 24 March 2014 /
Published: 8 April 2014

Abstract: This survey paper summarizes results from a series of papers on matrix algebraic properties of the Fisher information matrix (FIM) of stationary processes. The FIM is an ingredient of the Cramér-Rao inequality and belongs to the basics of asymptotic estimation theory in mathematical statistics. The FIM is interconnected with the Sylvester, Bezout and tensor Sylvester matrices. Through these interconnections, it is shown that the FIM of scalar and multiple stationary processes fulfills the resultant matrix property. A statistical distance measure involving entries of the FIM is presented. In quantum information, a different statistical distance measure is set forth. It is related to the Fisher information, but considers the information about one parameter in a particular measurement procedure. The FIM of scalar stationary processes is also interconnected with the solutions of appropriate Stein equations; conditions for the FIM to verify certain Stein equations are formulated. The presence of Vandermonde matrices is also emphasized.

MSC Classification: 15A23, 15A24, 15B99, 60G10, 62B10, 62M20.

Keywords: Bezout matrix; Sylvester matrix; tensor Sylvester matrix; Stein equation; Vandermonde
matrix; stationary process; matrix resultant; Fisher information matrix

1. Introduction

This survey paper summarizes results derived and described in a series of papers.
It concerns some matrix algebraic properties of the Fisher information matrix (abbreviated as FIM) of
stationary processes. An essential property emphasized in this paper concerns the matrix resultant
property of the FIM of stationary processes. To be more explicit, consider the coefficients of two monic
polynomials p(z) and q(z) of finite degree, as the entries of a matrix such that the matrix becomes
singular if and only if the polynomials p(z) and q(z) have at least one common root. Such a matrix is
called a resultant matrix and its determinant is called the resultant. The Sylvester, Bezout and tensor
Sylvester matrices have such a property and are extensively studied in the literature, see e.g., [1,3]. The
FIM associated with various stationary processes will be expressed by these matrices. The derived
interconnections are obtained by developing the necessary factorizations of the FIM in terms of the
Sylvester, Bezout and tensor Sylvester matrices. These factored forms of the FIM enable us to show that
the FIM of scalar and multiple stationary processes fulfills the resultant matrix property. Consequently,
the singularity conditions of the appropriate Fisher information matrices and of the Sylvester, Bezout and
tensor Sylvester matrices coincide; these results are described in [4,6].
A statistical distance measure involving entries of the FIM is presented, based on [7]. In
quantum information, a statistical distance measure is set forth, see [8,10]; it is related to the Fisher
information, but there the information about one parameter in a particular measurement procedure is
considered. This leads to a challenging question: can the existing distance
measure in quantum information be developed at the matrix level?

Entropy 2014, 16, 2023–2055; doi:10.3390/e16042023
www.mdpi.com/journal/entropy

The matrix Stein equation, see e.g., [11], is associated with the Fisher information matrices of
scalar stationary processes through the solutions of the appropriate Stein equations. Conditions for the
Fisher information matrices or associated matrices to verify certain Stein equations are formulated
and proved in this paper. The presence of Vandermonde matrices is also emphasized. The general
and more detailed results are set forth in [12] and [13]. In this survey paper it is shown that the FIMs of
linear stationary processes form a class of structured matrices. Note that in [14], the authors emphasize
that statistical problems related to stationary processes have been treated successfully with the aid
of Toeplitz forms. This paper is organized as follows. The various stationary processes considered
in this paper are presented in Section 2, and the Fisher information matrices of the stationary processes
are displayed in Section 3. Section 3 also sets forth the interconnections between the Fisher information
matrices and the Sylvester, Bezout and tensor Sylvester matrices, and solutions to Stein equations. A
statistical distance measure is expressed in terms of entries of a FIM.

2. The Linear Stationary Processes

In this section we display the class of linear stationary processes whose corresponding Fisher
information matrix shall be investigated in a matrix algebraic context. But first some basic definitions
are set forth, see e.g., [15].

If a random variable X is indexed by time, usually denoted by t, the collection of observations {Xt, t ∈ T} is
called a time series, where T is a time index set (for example, T = Z, the set of integers).

2.1. Definition 2.1

A stochastic process is a family of random variables {Xt, t ∈ T } defined on a probability space {Ω, F, ℘}.

2.2. Definition 2.2

The autocovariance function. If {Xt, t ∈ T} is a process such that Var(Xt) < ∞ (variance) for each t, then
the autocovariance function γX(·, ·) of {Xt} is defined by γX(r, s) = Cov(Xr, Xs) = E[(Xr − E Xr)(Xs − E Xs)],
r, s ∈ Z, where E represents the expected value.

2.3. Definition 2.3

Stationarity. The time series {Xt, t ∈ Z}, with the index set Z = {0, ±1, ±2, . . .}, is said to be stationary if

(i) E |Xt|² < ∞ for all t ∈ Z,
(ii) E(Xt) = m for all t ∈ Z, where m is the constant average or mean,
(iii) γX(r, s) = γX(r + t, s + t) for all r, s, t ∈ Z.

From Definition 2.3 it can be concluded that the joint probability distributions of the random variables
{X_{t1}, X_{t2}, . . . , X_{tn}} and {X_{t1+k}, X_{t2+k}, . . . , X_{tn+k}} are the same for arbitrary times t1, t2, . . . , tn, for all n and
all lags or leads k = 0, ±1, ±2, . . . . The probability distribution of observations of a stationary process
is invariant with respect to shifts in time. In the next section the linear stationary processes that will be
considered throughout this paper are presented.

2.4. The Vector ARMAX or VARMAX Process

We display one of the most general linear stationary processes, the multivariate autoregressive,
moving average and exogenous process, called the VARMAX process. To be more specific, consider the
vector difference equation representation of a linear system {y(t), t ∈ Z}, of order (p, r, q),

$$\sum_{j=0}^{p} A_j\, y(t-j) = \sum_{j=0}^{r} C_j\, x(t-j) + \sum_{j=0}^{q} B_j\, \epsilon(t-j), \qquad t \in \mathbb{Z} \tag{1}$$


where y(t) are the observable outputs, x(t) the observable inputs and ϵ(t) the unobservable errors; all
are n-dimensional. The acronym VARMAX stands for vector autoregressive-moving average with
exogenous variables. The left side of (1) is the autoregressive part, the second term on the right
is the moving average part, and x(t) is exogenous. If x(t) does not occur, the system is said to be
(V)ARMA. Besides exogenous, the input x(t) is also called the control variable, depending on the field
of application: in econometrics and time series analysis, see e.g., [15], and in signal processing and control,
see e.g., [16,17]. The matrix coefficients Aj ∈ R^{n×n}, Cj ∈ R^{n×n} and Bj ∈ R^{n×n} are the associated parameter
matrices. We have the property A0 ≡ B0 ≡ C0 ≡ In.
Equation (1) can compactly be written as

A(z) y(t) = C(z) x(t) + B(z) ϵ(t)
(2)

where

$$A(z) = \sum_{j=0}^{p} A_j z^j; \qquad C(z) = \sum_{j=0}^{r} C_j z^j; \qquad B(z) = \sum_{j=0}^{q} B_j z^j$$

We use z to denote the backward shift operator, for example z x_t = x_{t−1}. The matrix polynomials
A(z), B(z) and C(z) are the autoregressive, moving average and exogenous matrix polynomials,
of order p, q and r respectively. Hence the process described
by (2) is denoted as a VARMAX(p, r, q) process. Here z ∈ C, with a dual use of z as an operator and
as a complex variable, which is usual in the signal processing and time series literature, e.g., [15,16,18].
The assumptions Det(A(z)) ≠ 0 for |z| ≤ 1 and Det(B(z)) ≠ 0 for |z| < 1, z ∈ C,
are imposed so that the VARMAX(p, r, q) process (2) has exactly one stationary solution; the
condition Det(B(z)) ≠ 0 is the invertibility condition, see e.g., [15] for more details. Under these
assumptions, the eigenvalues of the matrix polynomials A(z) and B(z) lie outside the unit circle. The
eigenvalues of a matrix polynomial Y(z) are the roots of the equation Det(Y(z)) = 0, where Det(X) is the
determinant of X. The VARMAX(p, r, q) stationary process (2) is thoroughly discussed in [15,18,19].
The error {ϵ(t), t ∈ Z} is a collection of uncorrelated zero mean n-dimensional random variables
each having positive definite covariance matrix Σ, and we assume, for all s, t, Eϑ{x(s) ϵ^T(t)} = 0, where
X^T denotes the transpose of the matrix X and Eϑ represents the expected value under the parameter
ϑ. The matrix ϑ represents all the VARMAX(p, r, q) parameters, with the total number of parameters
being n²(p + q + r). For different purposes, which will be specified in the next sections, two choices of
the parameter structure are considered. First, the parameter vector ϑ ∈ R^{n²(p+q+r)×1} is defined by

ϑ = vec {A1, A2, . . . , Ap, C1, C2, . . . , Cr, B1, B2, . . . , Bq}
(3)

The vec operator transforms a matrix into a vector by stacking the columns of the matrix one
underneath the other according to vec X = col(col(X_{ij})_{i=1}^{n})_{j=1}^{n}, see e.g., [2,20]. A different choice
is set forth when the parameter matrix ϑ ∈ R^{n×n(p+q+r)} is of the form

ϑ = (ϑ1 ϑ2 . . . ϑp ϑp+1 ϑp+2 . . . ϑp+r ϑp+r+1 ϑp+r+2 . . . ϑp+r+q)
(4)

= (A1 A2 . . . Ap C1 C2 . . . Cr B1 B2 . . . Bq)
(5)

Representation (5) of the parameter matrix has been used in [21]. The estimation of the matrices A1,
A2,. . ., Ap, C1, C2,. . ., Cr, B1, B2, . . ., Bq and ∑ has received considerable attention in the time series
and statistical signal processing literature, see e.g., [15,17,19]. In [19], the authors study the asymptotic
properties of maximum likelihood estimates of the coefficients of VARMAX(p, r, q) processes, stored in
a (ℓ × 1) vector ϑ, where ℓ = n2(p + q + r).
Before describing the control-exogenous variable x(t) used in this survey paper, we shall present
the different special cases of the model described in (1) and (2).
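The vec stacking behind the parameter vector (3) can be illustrated with a minimal sketch. The helper names `vec` and `varmax_parameter_vector` and the numerical matrices are hypothetical and chosen for illustration only; Equation (3) is read here as the concatenation of the vec of each coefficient matrix, which is an assumption of this sketch.

```python
def vec(matrix):
    """Stack the columns of a matrix (given as a list of rows) into one flat list."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[i][j] for j in range(n_cols) for i in range(n_rows)]


def varmax_parameter_vector(A, C, B):
    """Concatenate vec(A1..Ap), vec(C1..Cr), vec(B1..Bq), following the order of (3)."""
    theta = []
    for block in list(A) + list(C) + list(B):
        theta.extend(vec(block))
    return theta


# Hypothetical example with n = 2 and p = r = q = 1.
A1 = [[0.5, 0.1], [0.0, 0.3]]
C1 = [[1.0, 0.0], [0.0, 1.0]]
B1 = [[0.2, 0.0], [0.4, 0.1]]
theta = varmax_parameter_vector([A1], [C1], [B1])

print(len(theta))  # n^2 (p + q + r) = 4 * 3 = 12 entries
print(theta[:4])   # the two columns of A1, one underneath the other
```

The length of the resulting list matches the parameter count n²(p + q + r) stated above.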


2.5. The Vector ARMA or VARMA Process

When the process (2) does not contain the control process x(t) it yields

A(z)y(t) = B(z)ϵ(t)
(6)

which is a vector autoregressive and moving average process, VARMA(p, q) process, see e.g., [15].
The matrix ϑ represents now all the VARMA parameters, with the total number of parameters being
n2(p+q). The VARMA(p, q) version of the parameter vector ϑ defined in (3) is then given by

ϑ = vec {A1, A2, . . . , Ap, B1, B2, . . . , Bq}
(7)

The VARMA equivalent of the parameter matrix (4) is then the n × n(p + q) parameter matrix

ϑ = (ϑ1 ϑ2 . . . ϑp ϑp+1 ϑp+2 . . . ϑp+q) = (A1 A2 . . . Ap B1 B2 . . . Bq)
(8)

A description of the input variable x(t) in (2) follows. Generally, one can assume either that x(t) is
non-stochastic or that x(t) is stochastic. In the latter case, we assume Eϑ{x(s) ϵ^T(t)} = 0, for all s, t, and
that statistical inference is performed conditionally on the values taken by x(t). In this case it can
be interpreted as constant, see [22] for a detailed exposition. However, in the papers referred to in this
survey, like [21] and [23], the observed input variable x(t) is assumed to be a stationary VARMA
process, of the form

α(z)x(t) = β(z)η(t)
(9)

where α(z) and β(z) are the autoregressive and moving average polynomials of appropriate degree
and {η(t), t ∈ Z} is a collection of uncorrelated zero mean n-dimensional random variables each having
positive definite covariance matrix Ω. The spectral density of the VARMA process x(t) is Rx(·)/2π, where,
for a definition see e.g., [15,16],

$$R_x(e^{i\omega}) = \alpha^{-1}(e^{i\omega})\,\beta(e^{i\omega})\,\Omega\,\beta^{*}(e^{i\omega})\,\alpha^{-*}(e^{i\omega}), \qquad \omega \in [-\pi, \pi] \tag{10}$$

Here i is the imaginary unit with the property i² = −1, ω is the frequency, and X* is the complex conjugate
transpose of the matrix X. The spectral density Rx(e^{iω}) is Hermitian, and we further have Rx(e^{iω}) ≥ 0 and
$\int_{-\pi}^{\pi} R_x(e^{i\omega})\,d\omega < \infty$. As mentioned above, the basic assumption that x(t) and ϵ(t) are
independent or at least uncorrelated processes, which corresponds geometrically to orthogonal processes, holds.
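For the scalar case, Equation (10) reduces to |β(e^{iω})|² Ω / |α(e^{iω})|². The following sketch evaluates this expression and approximates its integral over [−π, π] with an equally spaced sum. The function names and the AR(1) example values are hypothetical; the numerical integration rule is a choice made for this illustration and is not taken from the cited references.

```python
import cmath
import math


def poly_eval(coeffs, z):
    """Evaluate coeffs[0] + coeffs[1] z + coeffs[2] z^2 + ... at the complex point z."""
    return sum(c * z ** k for k, c in enumerate(coeffs))


def spectral_R(alpha, beta, omega_var, w):
    """Scalar version of Equation (10) at frequency w."""
    z = cmath.exp(1j * w)
    return abs(poly_eval(beta, z)) ** 2 * omega_var / abs(poly_eval(alpha, z)) ** 2


def integrate_R(alpha, beta, omega_var, n_points=4096):
    """Equally spaced sum approximating the integral of R_x over [-pi, pi]."""
    h = 2 * math.pi / n_points
    return h * sum(spectral_R(alpha, beta, omega_var, -math.pi + k * h)
                   for k in range(n_points))


# Hypothetical input process alpha(z) x(t) = eta(t) with alpha(z) = 1 - 0.5 z and Omega = 1.
alpha, beta, omega_var = [1.0, -0.5], [1.0], 1.0
print(spectral_R(alpha, beta, omega_var, 0.0))      # value at frequency zero
print(integrate_R(alpha, beta, omega_var))          # finite, as required
print(2 * math.pi / (1 - 0.25))                     # 2*pi times the variance of x(t)
```

For this AR(1) example the integral equals 2π times the variance 1/(1 − 0.25) of x(t), which gives a simple consistency check.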

2.6. The ARMAX and ARMA Processes

The scalar equivalents of the VARMAX(p, r, q) and VARMA(p, q) processes, given by (2) and (6)
respectively, shall now be displayed. For the ARMAX(p, r, q) process we obtain

a(z)y(t) = c(z)x(t) + b(z)ϵ(t)
(11)

and for the ARMA(p, q) process

a(z)y(t) = b(z)ϵ(t)
(12)

popularized in, among others, the Box-Jenkins type of time series analysis, see e.g., [15]. Here a(z),
b(z) and c(z) are respectively the scalar autoregressive, moving average and exogenous
polynomials, with corresponding scalar coefficients aj, bj and cj,

$$a(z) = \sum_{j=0}^{p} a_j z^j; \qquad c(z) = \sum_{j=0}^{r} c_j z^j; \qquad b(z) = \sum_{j=0}^{q} b_j z^j \tag{13}$$


Note that, as in the multiple case, a0 = b0 = 1. The parameter vector ϑ for the processes (11) and (12)
is then

ϑ = {a1, a2, . . . , ap, c1, c2, . . . , cr, b1, b2, . . . , bq}
(14)

and

ϑ = {a1, a2, . . . , ap, b1, b2, . . . , bq}
(15)

respectively.
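Since z acts as the backward shift, the ARMAX equation (11) can be read as a recursion for y(t). The sketch below makes this explicit. The function name `simulate_armax` and the convention that all values before t = 0 are zero are assumptions of this illustration, not part of the source.

```python
def simulate_armax(a, c, b, x, eps):
    """Generate y(t) from a(z) y(t) = c(z) x(t) + b(z) eps(t), with z the backward shift.

    a = [1, a1, ..., ap], c = [c0, c1, ..., cr], b = [1, b1, ..., bq].
    Values of y, x and eps before t = 0 are taken to be zero (assumption of this sketch).
    """
    y = []
    for t in range(len(x)):
        value = 0.0
        for j in range(len(c)):            # exogenous part c(z) x(t)
            if t - j >= 0:
                value += c[j] * x[t - j]
        for j in range(len(b)):            # moving average part b(z) eps(t)
            if t - j >= 0:
                value += b[j] * eps[t - j]
        for j in range(1, len(a)):         # autoregressive part moved to the right side
            if t - j >= 0:
                value -= a[j] * y[t - j]
        y.append(value)
    return y


# Hypothetical example: a(z) = 1 - 0.5 z, c(z) = 1, b(z) = 1, unit impulse on the input.
print(simulate_armax([1.0, -0.5], [1.0], [1.0], [1.0, 0.0, 0.0, 0.0], [0.0] * 4))
```

Setting the input x(t) to zero turns the same recursion into the ARMA process (12).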
In the next section the matrix algebraic properties of the Fisher information matrix of the stationary
processes (2), (6), (11) and (12) will be verified. Interconnections with various known structured
matrices like the Sylvester resultant matrix, the Bezout matrix and Vandermonde matrix are set forth.
The Fisher information matrix of the various stationary processes is also expressed in terms of the
unique solutions to the appropriate Stein equations.

3. Structured Matrix Properties of the Asymptotic Fisher Information Matrix of
Stationary Processes

The Fisher information is an ingredient of the Cramér-Rao inequality, also called by some
the Cauchy-Schwarz inequality in mathematical statistics, and belongs to the basics of asymptotic
estimation theory in mathematical statistics. The Cramér-Rao theorem [24] is therefore considered.
When assuming that the estimators of ϑ, defined in the previous sections, are asymptotically unbiased,
the inverse of the asymptotic information matrix yields the Cramér-Rao bound, and provided that the
estimators are asymptotically efficient, the asymptotic covariance matrix then verifies the inequality

$$\operatorname{Cov}(\hat{\vartheta}) \succeq I^{-1}(\hat{\vartheta})$$

Here I(ϑ̂) is the FIM and Cov(ϑ̂) is the covariance of ϑ̂, the unbiased estimator of ϑ; for a detailed
fundamental statistical analysis, see [25,26]. The inverse of the FIM equals the Cramér-Rao lower bound, and the
subject of the FIM is also of interest in the control theory and signal processing literature, see e.g., [27].
Its quantum analog was introduced immediately after the foundation of mathematical quantum
estimation theory in the 1960s, see [28,29] for a rigorous exposition of the subject. More specifically, the
Fisher information is also emphasized in the context of quantum information theory, see e.g., [30,31].
The Cramér-Rao inequality receives a lot of attention because it is located on the
boundary of statistics, information and quantum theory and, more recently, matrix theory. In
the next sections, the Fisher information matrices of linear stationary processes will be presented and
their role as a new class of structured matrices will be the subject of study.
When time series models are the subject, using (2) for all t ∈ Z to determine the residual ϵ(t), written
ϵt(ϑ) to emphasize the dependency on the parameter vector ϑ, and assuming that x(t) is stochastic and
that (y(t), x(t)) is a Gaussian stationary process, the asymptotic FIM F(ϑ) is defined by the following
(ℓ × ℓ) matrix, which does not depend on t

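$$F(\vartheta) = E\left\{ \left( \frac{\partial \epsilon_t(\vartheta)}{\partial \vartheta^{\top}} \right)^{\top} \Sigma^{-1} \left( \frac{\partial \epsilon_t(\vartheta)}{\partial \vartheta^{\top}} \right) \right\} \tag{16}$$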

where the (v × ℓ) matrix ∂(·)/∂ϑ^T is the derivative with respect to ϑ^T of any (v × 1) column vector
(·), and ℓ is the total number of parameters. The derivative with respect to ϑ^T is used for obtaining
the appropriate dimensions. Equality (16) is used for computing the FIM of the various time series
processes presented in the previous sections, and appropriate definitions of the derivatives are used,
especially for the multivariate processes (2) and (6), see [21,22].


3.1. The Fisher Information Matrix of an ARMA(p, q) Process

In this section, the focus is on the FIM of the ARMA process (12). When ϑ is given in (15), the
derivatives in (16) are at the scalar level

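$$\frac{\partial \epsilon_t(\vartheta)}{\partial a_j} = \frac{1}{a(z)}\,\epsilon_{t-j} \quad \text{for } j = 1, \ldots, p \qquad \text{and} \qquad \frac{\partial \epsilon_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\,\epsilon_{t-k} \quad \text{for } k = 1, \ldots, q$$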

when combined for all j and k, the FIM of the ARMA process (12) with the variance of the noise process
ϵt(ϑ) equal to one, yields the block decomposition, see [32]

$$F(\vartheta) = \begin{pmatrix} F_{aa}(\vartheta) & F_{ab}(\vartheta) \\ F_{ba}(\vartheta) & F_{bb}(\vartheta) \end{pmatrix} \tag{17}$$

The expressions of the different blocks of the matrix F(ϑ) are

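$$F_{aa}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\,u_p^{\top}(z^{-1})}{a(z)\,a(z^{-1})}\,\frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\,v_p^{\top}(z)}{a(z)\,\hat{a}(z)}\,dz \tag{18}$$

$$F_{ab}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\,u_q^{\top}(z^{-1})}{a(z)\,b(z^{-1})}\,\frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\,v_q^{\top}(z)}{a(z)\,\hat{b}(z)}\,dz \tag{19}$$

$$F_{ba}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\,u_p^{\top}(z^{-1})}{a(z^{-1})\,b(z)}\,\frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\,v_p^{\top}(z)}{\hat{a}(z)\,b(z)}\,dz \tag{20}$$

$$F_{bb}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\,u_q^{\top}(z^{-1})}{b(z)\,b(z^{-1})}\,\frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\,v_q^{\top}(z)}{b(z)\,\hat{b}(z)}\,dz \tag{21}$$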

where the integration above and everywhere below is counterclockwise around the unit circle. The
reciprocal monic polynomials â(z) and b̂(z) are defined as â(z) = z^p a(z^{−1}) and b̂(z) = z^q b(z^{−1}), and
ϑ = (a1, . . . , ap, b1, . . . , bq)^T was introduced in (15). For each positive integer k we have uk(z) = (1, z, z²,
. . . , z^{k−1})^T and vk(z) = z^{k−1} uk(z^{−1}). The stability condition of the ARMA(p, q) process
implies that all the roots of the monic polynomials a(z) and b(z) lie outside the unit circle. Consequently,
the roots of the polynomials â(z) and b̂(z) lie within the unit circle and will be used as the poles for
computing the integrals (18)–(21) when Cauchy's residue theorem is applied. Notice that the FIM F(ϑ)
is symmetric block Toeplitz so that Fab(ϑ) = F^T_ba(ϑ) and the integrands in (18)–(21) are Hermitian.
The integral expressions (18)–(21) can be computed with the standard
residue theorem. The algorithms displayed in [33] and [22] are suited for numerical computations of,
among others, the FIM of an ARMA(p, q) process.
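As a numerical cross-check of the blocks (18)–(21), the following sketch approximates the FIM of an ARMA(p, q) process with unit noise variance. It does not use the residue theorem or the algorithms of [33] and [22]; instead it expands 1/a(z) and 1/b(z) into truncated power series and collects the constant term of each integrand, which is a swapped-in technique chosen for brevity. All function names and the truncation length are assumptions of this illustration.

```python
def inverse_series(poly, n_terms):
    """First n_terms power-series coefficients of 1 / poly(z), with poly = [1, c1, ..., cd]."""
    psi = [1.0]
    for k in range(1, n_terms):
        acc = 0.0
        for j in range(1, min(k, len(poly) - 1) + 1):
            acc -= poly[j] * psi[k - j]
        psi.append(acc)
    return psi


def lagged_sum(psi, phi, shift):
    """Sum over m of psi[m] * phi[m + shift], ignoring out-of-range indices."""
    total = 0.0
    for m, value in enumerate(psi):
        n = m + shift
        if 0 <= n < len(phi):
            total += value * phi[n]
    return total


def arma_fim(a, b, n_terms=500):
    """Approximate FIM (17) of an ARMA(p, q) process with unit noise variance."""
    p, q = len(a) - 1, len(b) - 1
    psi = inverse_series(a, n_terms)   # series of 1 / a(z)
    phi = inverse_series(b, n_terms)   # series of 1 / b(z)
    F = [[0.0] * (p + q) for _ in range(p + q)]
    for j in range(p):
        for k in range(p):
            F[j][k] = lagged_sum(psi, psi, j - k)              # block F_aa
    for j in range(q):
        for k in range(q):
            F[p + j][p + k] = lagged_sum(phi, phi, j - k)      # block F_bb
    for j in range(p):
        for k in range(q):
            F[j][p + k] = -lagged_sum(psi, phi, j - k)         # block F_ab
            F[p + k][j] = F[j][p + k]                          # block F_ba = F_ab^T
    return F


# Hypothetical ARMA(1, 1) examples: a(z) = 1 + 0.5 z with b(z) = 1 - 0.3 z, then b(z) = a(z).
F = arma_fim([1.0, 0.5], [1.0, -0.3])
G = arma_fim([1.0, 0.5], [1.0, 0.5])
print(F)
print(G[0][0] * G[1][1] - G[0][1] * G[1][0])   # determinant when a and b share their root
```

For ARMA(1, 1) the entries can be compared with the closed forms 1/(1 − a1²), −1/(1 − a1 b1) and 1/(1 − b1²), and the determinant vanishes when a1 = b1, in line with the resultant property discussed in the next section.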

3.2. The Sylvester Resultant Matrix - The Fisher Information Matrix

To show that the FIM F(ϑ) has the matrix resultant property means to show that the matrix F(ϑ)
becomes singular if and only if the appropriate scalar monic polynomials â(z) and b̂(z) have at least
one common zero. To illustrate the subject, the following known property of two polynomials is set
forth. The greatest common divisor (frequently abbreviated as GCD) of two polynomials is a polynomial,
of the highest possible degree, that is a factor of both original polynomials; the roots of the GCD of two
polynomials are the common roots of the two polynomials. Consider the coefficients of two monic
polynomials p(z) and q(z) of finite degree as the entries of a matrix such that the matrix becomes
singular if and only if the polynomials p(z) and q(z) have at least one common root. Such a matrix is
called a resultant matrix and its determinant is called the resultant. Therefore we present the known
(p + q) × (p + q) Sylvester resultant matrix of the polynomials a and b, see e.g., [2], to obtain

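$$S(a,b) = \begin{pmatrix}
1 & a_1 & \cdots & a_p & 0 & \cdots & 0 \\
0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
0 & \cdots & 0 & 1 & a_1 & \cdots & a_p \\
1 & b_1 & \cdots & b_q & 0 & \cdots & 0 \\
0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
0 & \cdots & 0 & 1 & b_1 & \cdots & b_q
\end{pmatrix} \tag{22}$$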

Consider the p × (p + q) and q × (p + q) upper and lower submatrices Sp(b) and Sq(−a) = −Sq(a) of the Sylvester
resultant matrix S(b, −a) such that

$$S(b,-a) = \begin{pmatrix} S_p(b) \\ -S_q(a) \end{pmatrix} \tag{23}$$

The matrix S(a, b) becomes singular in the presence of one or more common zeros of the monic polynomials â(z)
and b̂(z); this property is assessed by the following equalities

$$R(a,b) = \prod_{\substack{i=1,\ldots,p \\ j=1,\ldots,q}} (\alpha_i - \beta_j), \qquad R(b,a) = (-1)^{pq} \prod_{\substack{i=1,\ldots,p \\ j=1,\ldots,q}} (\alpha_i - \beta_j) \tag{24}$$

and

$$R(b,-a) = (-1)^{q} \prod_{\substack{i=1,\ldots,p \\ j=1,\ldots,q}} (\beta_j - \alpha_i), \qquad R(-b,a) = (-1)^{p} \prod_{\substack{i=1,\ldots,p \\ j=1,\ldots,q}} (\beta_j - \alpha_i) \tag{25}$$

where R(a, b) is the resultant of â(z) and b̂(z), and is equal to Det S(a, b). The string of equalities in (24) and (25)
holds since R(b, a) = (−1)^{pq} R(a, b), R(b, −a) = (−1)^q R(b, a), and R(−b, a) = (−1)^p R(b, a), see [34]. The
zeros of the scalar monic polynomials â(z) and b̂(z) are αi and βj respectively and are assumed to be
distinct. By this is meant that, when we have factors (z − αi)^{n_{αi}} and (z − βj)^{n_{βj}} with the powers n_{αi} and n_{βj} both
greater than one, only the distinct roots will be considered, free from the corresponding powers.

The key property of the classical Sylvester resultant matrix S(a, b) is that its null space provides a
complete description of the common zeros of the polynomials involved. In particular, in the scalar
case the polynomials â(z) and b̂(z) are coprime if and only if S(a, b) is non-singular. This key property
is made precise by the well known theorem on resultants,

dim Ker S(a, b) = ν(a, b)
(26)
(26)


where ν(a, b) is the number of common roots of the polynomials â(z) and b̂(z), counting
multiplicities, see e.g., [3]. The dimension of a subspace V is represented by dim(V), and Ker(X)
is the null space or kernel of the matrix X, also denoted by Null. The null space of an n × n matrix A
with coefficients in a field K (typically the field of the real numbers or of the complex numbers) is the
set Ker A = {x ∈ K^n : Ax = 0}, see e.g., [1,2,20].
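The resultant property and Equation (26) can be checked on a small example. The sketch below builds the Sylvester matrix (22) from two coefficient lists and computes its rank and determinant by exact Gaussian elimination over the rationals. The function names and the example polynomials, with roots chosen inside the unit circle, are hypothetical.

```python
from fractions import Fraction


def sylvester(a, b):
    """Sylvester matrix S(a, b) of Equation (22) for a = [1, a1, ..., ap], b = [1, b1, ..., bq]."""
    p, q = len(a) - 1, len(b) - 1
    rows = [[0] * i + list(a) + [0] * (q - 1 - i) for i in range(q)]    # q shifted copies of a
    rows += [[0] * i + list(b) + [0] * (p - 1 - i) for i in range(p)]   # p shifted copies of b
    return rows


def rank_and_det(matrix):
    """Return (rank, determinant) of a square matrix by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    rank, det = 0, Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(rank, n) if m[r][col] != 0), None)
        if pivot is None:
            det = Fraction(0)          # a column without pivot means a singular matrix
            continue
        if pivot != rank:
            m[rank], m[pivot] = m[pivot], m[rank]
            det = -det                 # a row swap changes the sign of the determinant
        det *= m[rank][col]
        for r in range(rank + 1, n):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank, det


half, third, quarter = Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)
a_hat = [1, -(half + third), half * third]            # (z - 1/2)(z - 1/3)
b_common = [1, -(half - quarter), -half * quarter]    # (z - 1/2)(z + 1/4), shares the root 1/2
b_coprime = [1, 0, -quarter * quarter]                # (z - 1/4)(z + 1/4), no common root

rank_c, det_c = rank_and_det(sylvester(a_hat, b_common))
rank_n, det_n = rank_and_det(sylvester(a_hat, b_coprime))
print(4 - rank_c, det_c)   # nullity and determinant with one common root
print(4 - rank_n, det_n)   # nullity and determinant for coprime polynomials
```

In the coprime case the determinant can be compared with the product of the differences αi − βj in (24).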

In order to prove that the FIM F(ϑ) fulfills the resultant matrix property, the following
factorization is derived, see Lemma 2.1 in [5],

F(ϑ) = S(b, −a) ℘(ϑ) S⊤(b, −a)
(27)

where the matrix ℘(ϑ) ∈ R^{(p+q)×(p+q)} admits the form

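$$\wp(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\,u_{p+q}^{\top}(z^{-1})}{a(z)\,b(z)\,a(z^{-1})\,b(z^{-1})}\,\frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\,v_{p+q}^{\top}(z)}{a(z)\,b(z)\,\hat{a}(z)\,\hat{b}(z)}\,dz \tag{28}$$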

It is proved in [5] that the symmetric matrix ℘(ϑ) fulfills the property ℘(ϑ) ≻ O. The factorization (27)
allows us to show the matrix resultant property of the FIM, as Corollary 2.2 in [5] states:
the FIM of an ARMA(p, q) process with polynomials a(z) and b(z) of order p, q respectively
becomes singular if and only if the polynomials â(z) and b̂(z) have at least one common root.
From Corollary 2.2 in [5] it can be concluded that the FIM of an ARMA(p, q) process and the Sylvester
resultant matrix S(−b, a) have the same singularity property. By virtue of (26) and (27) we will specify
the dimension of the null space of the FIM F(ϑ); this is set forth in the following lemma.

3.2.1. Lemma 3.1

Assume that the polynomials â(z) and b̂(z) have ν(a, b) common roots, counting multiplicities.
The factorization (27) of the FIM and the property (26) enable us to prove the equality

dim (Ker F(ϑ)) = dim (Ker S(b, −a)) = ν(a, b)
(29)

Proof

The matrix ℘(ϑ) ∈ R^{(p+q)×(p+q)}, given in (28), fulfills the property of positive definiteness, as
proved in [5]. This implies that a Cholesky decomposition can be applied to ℘(ϑ), see [35] for more
details, to obtain ℘(ϑ) = L^T(ϑ)L(ϑ), where L(ϑ) is a (p+q)×(p+q) upper triangular matrix that is unique if
its diagonal elements are all positive. Consequently, all its eigenvalues are then positive so that the
matrix L(ϑ) is nonsingular. Factorization (27) now admits the representation

F(ϑ) = S(b, −a)L⊤(ϑ)L(ϑ)S⊤(b, −a)
(30)

and taking into account the property that, if A is an m × n matrix, then Ker(A) = Ker(A^T A), yields when
applied to (30)

Ker F(ϑ) = Ker S(b, −a)L⊤(ϑ)L(ϑ)S⊤(b, −a) = Ker L(ϑ)S⊤(b, −a)

Assume the vector u ∈ Ker L(ϑ)S⊤(b, −a), such that L(ϑ)S⊤(b, −a)u = 0, and set S⊤(b, −a)u = v, so that
L(ϑ)v = 0. Since the matrix L(ϑ) is nonsingular, v = 0; this implies S⊤(b, −a)u = 0, that is, u ∈ Ker S⊤(b, −a).
Conversely, every u ∈ Ker S⊤(b, −a) satisfies L(ϑ)S⊤(b, −a)u = 0.
Consequently,

Ker F(ϑ) = Ker S⊤(b, −a)
(31)

We will now consider the Rank-Nullity Theorem, see e.g., [1], if A is an m × n matrix, then

dim (Ker A) + dim (Im A) = n

and the property dim (Im A) = dim (Im AT). When applied to the (p + q) × (p + q) matrix S (b, −a),
it yields

dim (Ker S(b, −a)) = dim (Ker S⊤(b, −a)) ⇒ dim (Ker F(ϑ)) = dim (Ker S(b, −a))

which completes the proof.
Notice that the dimension of the null space of matrix A is called the nullity of A and the dimension
of the image of matrix A, dim (Im A), is termed the rank of matrix A. An alternative proof to the one
developed in Corollary 2.2 in [5], is given in a corollary to Lemma 3.1, reconfirming the resultant

matrix property of the FIM F (ϑ).
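The mechanism of Lemma 3.1 can be illustrated without evaluating the integral (28). In the sketch below the matrix ℘(ϑ) is replaced by a hypothetical positive definite stand-in (a tridiagonal matrix with 2 on the diagonal and −1 next to it), so the resulting product is not the actual FIM. The example only shows that a factorization of the form (27) with a positive definite middle factor has the same rank, and hence the same nullity, as S(b, −a). All names are assumptions of this illustration.

```python
from fractions import Fraction


def shifted_rows(coeffs, copies, width):
    """Rows holding coeffs shifted right by 0, 1, ..., copies - 1 positions."""
    return [[0] * i + list(coeffs) + [0] * (width - len(coeffs) - i) for i in range(copies)]


def sylvester_b_minus_a(a, b):
    """S(b, -a) as in (23): p shifted rows of b on top of q shifted rows of -a."""
    p, q = len(a) - 1, len(b) - 1
    return shifted_rows(b, p, p + q) + shifted_rows([-x for x in a], q, p + q)


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def rank(matrix):
    """Rank by exact Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][col] / m[r][col]
            m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def stand_in_P(n):
    """Hypothetical positive definite stand-in for the matrix in (28), not the true one."""
    return [[2 if i == j else (-1 if abs(i - j) == 1 else 0) for j in range(n)] for i in range(n)]


a_hat = [1, Fraction(-5, 6), Fraction(1, 6)]      # roots 1/2 and 1/3
b_hat = [1, Fraction(-1, 4), Fraction(-1, 8)]     # roots 1/2 and -1/4, one common root
S = sylvester_b_minus_a(a_hat, b_hat)
F_like = matmul(matmul(S, stand_in_P(4)), transpose(S))
print(rank(S), rank(F_like))                      # equal ranks, hence equal nullities
```

With one common root both ranks drop by one, matching dim Ker F(ϑ) = dim Ker S(b, −a) = ν(a, b) in (29).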

3.2.2. Corollary 3.2

The FIM F(ϑ) of an ARMA(p, q) process becomes singular if and only if the autoregressive and moving
average polynomials â(z) and b̂(z) have at least one common root.

Proof

By virtue of the equality (31), combined with the property Det S⊤(b, −a) = Det S(b, −a) and
the matrix resultant property of the Sylvester matrix S(b, −a), we have Det S⊤(b, −a) = 0 ⇔ Ker S⊤(b, −a) ≠ {0}
if and only if the ARMA(p, q) polynomials â(z) and b̂(z) have at least one common root.
Equivalently, Det S⊤(b, −a) ≠ 0 ⇔ Ker S⊤(b, −a) = {0} if and only if the ARMA(p, q) polynomials
â(z) and b̂(z) have no common roots. Consequently, by virtue of the equality Ker F(ϑ) = Ker S⊤(b, −a),
it can be concluded that the FIM F(ϑ) becomes singular if and only if the ARMA(p, q) polynomials â(z)
and b̂(z) have at least one common root. This completes the proof.

3.3. The Statistical Distance Measure and the Fisher Information Matrix

In [7] statistical distance measures are studied. Most multivariate statistical techniques are based
upon the concept of distance. For that purpose a statistical distance measure is considered that is
a normalized Euclidean distance measure with entries of the FIM as weighting coefficients. The
measurements x1, x2, . . . , xn are subject to random fluctuations of different magnitudes and
therefore have different variabilities. It is then important to consider a distance that takes the variability
of these variables or measurements into account when determining the distance from a fixed point. A
rotation of the coordinate system through a chosen angle, while keeping the scatter of points given
by the data fixed, is also applied, see [7] for more details. It is shown that when the FIM is positive
definite, the appropriate statistical distance measure is a metric. In the case of a singular FIM of an ARMA
stationary process, the metric property depends on the rotation angle. The statistical distance measure
is based on m parameters, unlike a statistical distance measure introduced in quantum information, see
e.g., [8,9], that is also related to the Fisher information but where the information about one parameter
in a particular measurement procedure is considered.


The straight-line or Euclidean distance between the stochastic vector x = (x1 x2 . . . xn)⊤
and the fixed vector y = (y1 y2 . . . yn)⊤, where x, y ∈ R^n, is given by

$$d(x,y) = \|x-y\| = \left( \sum_{j=1}^{n} (x_j - y_j)^2 \right)^{1/2} \tag{32}$$

where the metric d(x, y) := ||x − y|| is induced by the standard Euclidean norm || · || on R^n, see
e.g., [2] for the metric conditions.
The observations x1, x2, . . . , xn are used to compute maximum likelihood estimates of the
parameters ϑ1, ϑ2, . . . , ϑm, where m < n. These estimated parameters are random variables, see
e.g., [15]. The distance of the estimated vector ϑ ∈ R^m, given in (15), is studied. Entries of the FIM are
inserted in the distance measure as weighting coefficients. The linear transformation

ϑ̃ = Li(ϕ)ϑ
(33)

is applied, where Li(ϕ) ∈ R^{m×m} is the Givens rotation matrix with rotation angle ϕ, with 0 ≤ ϕ ≤ 2π
and i ∈ {1, . . . , m − 1}, see e.g., [36], and is given by

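$$L_i(\phi) = \begin{pmatrix}
I_{i-1} & 0 & 0 & 0 \\
0 & (\cos(\phi))_{i,i} & (-\sin(\phi))_{i,i+1} & 0 \\
0 & (\sin(\phi))_{i+1,i} & (\cos(\phi))_{i+1,i+1} & 0 \\
0 & 0 & 0 & I_{m-i-1}
\end{pmatrix}, \qquad 0 \le \phi \le 2\pi \tag{34}$$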

The following matrix decomposition is applied in order to obtain a transformed FIM

$$F_{\phi}(\vartheta) = L_i(\phi)\,F(\vartheta)\,L_i^{\top}(\phi) \tag{35}$$

where Fϕ(ϑ) and F(ϑ) are respectively the transformed and untransformed Fisher information
matrices. It is straightforward to conclude that, by virtue of (35), the transformed and untransformed
Fisher information matrices Fϕ(ϑ) and F(ϑ) are similar, since the rotation matrix Li(ϕ) is orthogonal.
Two matrices A and B are similar if there exists an invertible matrix X such that the equality AX = XB
holds. As can be seen, the Givens matrix Li(ϕ) involves only two coordinates that are affected by the
rotation angle ϕ, whereas the other directions, which correspond to eigenvalues of one, are unaffected
by the rotation matrix.

By virtue of (35) it can be concluded that a positive definite FIM, F(ϑ) ≻ 0, implies a positive
definite transformed FIM, Fϕ(ϑ) ≻ 0. Consequently, the elements on the main diagonal of F(ϑ), f1,1,
f2,2, . . . , fm,m, as well as the elements on the main diagonal of Fϕ(ϑ), f̃1,1, f̃2,2, . . . , f̃m,m, are all
positive. However, the elements on the main diagonal of a singular FIM of a stationary ARMA process
are also positive.
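As developed in [7], combining (33) and (35) yields the distance measure of the estimated
parameters ϑ1, ϑ2, . . . , ϑm,

$$d^2_{F_\phi}(\vartheta) = \sum_{\substack{j=1 \\ j \neq i,\, i+1}}^{m} \frac{\vartheta_j^2}{f_{j,j}} + \frac{\{\vartheta_i \cos(\phi) - \vartheta_{i+1} \sin(\phi)\}^2}{\tilde{f}_{i,i}(\phi)} + \frac{\{\vartheta_{i+1} \cos(\phi) + \vartheta_i \sin(\phi)\}^2}{\tilde{f}_{i+1,i+1}(\phi)} \tag{36}$$

where

$$\tilde{f}_{i,i}(\phi) = f_{i,i} \cos^2(\phi) - f_{i,i+1} \sin(2\phi) + f_{i+1,i+1} \sin^2(\phi) \tag{37}$$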

$$\tilde{f}_{i+1,i+1}(\phi) = f_{i+1,i+1} \cos^2(\phi) + f_{i,i+1} \sin(2\phi) + f_{i,i} \sin^2(\phi) \tag{38}$$

and fj,l are entries of the FIM F(ϑ), whereas f̃i,i(ϕ) and f̃i+1,i+1(ϕ) are the transformed components,
since the rotation affects only the entries i and i+1, as can be seen in the matrix Li(ϕ). In [7], the
following inequalities are proved

f̃i,i(ϕ) > 0 and f̃i+1,i+1(ϕ) > 0

This guarantees the metric property of (36).
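The relations (34)–(38) can be verified numerically on a small example. The sketch below builds the Givens matrix, forms the transformed matrix of (35), and evaluates the squared distance (36). The 3 × 3 positive definite matrix, the parameter values and all function names are hypothetical and only serve as an illustration.

```python
import math


def givens(m, i, phi):
    """m x m Givens rotation L_i(phi) of (34), acting on coordinates i and i+1 (1-based i)."""
    L = [[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
    k = i - 1
    L[k][k], L[k][k + 1] = math.cos(phi), -math.sin(phi)
    L[k + 1][k], L[k + 1][k + 1] = math.sin(phi), math.cos(phi)
    return L


def rotate_fim(F, i, phi):
    """Transformed matrix L_i(phi) F L_i(phi)^T of Equation (35)."""
    m = len(F)
    L = givens(m, i, phi)
    LF = [[sum(L[r][k] * F[k][c] for k in range(m)) for c in range(m)] for r in range(m)]
    return [[sum(LF[r][k] * L[c][k] for k in range(m)) for c in range(m)] for r in range(m)]


def rotated_diagonal(F, i, phi):
    """Transformed diagonal entries according to Equations (37) and (38)."""
    k = i - 1
    c2, s2, s2phi = math.cos(phi) ** 2, math.sin(phi) ** 2, math.sin(2 * phi)
    f_i = F[k][k] * c2 - F[k][k + 1] * s2phi + F[k + 1][k + 1] * s2
    f_next = F[k + 1][k + 1] * c2 + F[k][k + 1] * s2phi + F[k][k] * s2
    return f_i, f_next


def distance_squared(theta, F, i, phi):
    """Squared statistical distance of Equation (36)."""
    k = i - 1
    c, s = math.cos(phi), math.sin(phi)
    f_i, f_next = rotated_diagonal(F, i, phi)
    total = sum(theta[j] ** 2 / F[j][j] for j in range(len(theta)) if j not in (k, k + 1))
    total += (theta[k] * c - theta[k + 1] * s) ** 2 / f_i
    total += (theta[k + 1] * c + theta[k] * s) ** 2 / f_next
    return total


# Hypothetical symmetric positive definite matrix and parameter vector with m = 3.
F = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 3.0]]
theta = [0.3, -0.4, 0.1]
F_phi = rotate_fim(F, 1, 0.7)
print(F_phi[0][0], F_phi[1][1])          # diagonal entries from the matrix product (35)
print(rotated_diagonal(F, 1, 0.7))       # the same entries from (37) and (38)
print(distance_squared(theta, F, 1, 0.7))
```

For ϕ = 0 the distance reduces to the sum of ϑj² / fj,j, which gives a quick sanity check of the implementation.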
When the FIM of an ARMA(p, q) process is considered, a combination of (27) and (35) for the
ARMA(p, q) parameters given in (15) yields for the transformed FIM

$$F_{\phi}(\vartheta) = S_{\phi}(-b,a)\,\wp(\vartheta)\,S_{\phi}^{\top}(-b,a) \tag{39}$$

where ℘(ϑ) is given by (28) and the transformed Sylvester resultant matrix is of the form

Sϕ(−b, a) = Li(ϕ)S(−b, a)
(40)

Proposition 3.5 in [7] proves that the transformed FIM Fϕ(ϑ) and the transformed Sylvester matrix
Sϕ(−b, a) fulfill the resultant matrix property by using the equalities (40) and (39). The following
property is then set forth.

3.3.1. Proposition 3.3

The properties

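$$\operatorname{Ker} F_{\phi}(\vartheta) = \operatorname{Ker} S_{\phi}^{\top}(-b,a) \quad \text{and} \quad \operatorname{Ker} S_{\phi}(-b,a) = \operatorname{Ker} S(-b,a)$$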

hold true.

Proof

By virtue of the equalities (39), (40) and the orthogonality property of the rotation matrix Li(ϕ)
which implies that Ker Li(ϕ) = {0} combined with the same approach as in Lemma 3.1 completes
the proof.
A straightforward conclusion from Proposition 3.3 is then

dim Ker Fϕ(ϑ) = dim Ker Sϕ(−b, a), dim Ker Sϕ(−b, a) = dim Ker S(−b, a)

In the next section a distance measure introduced in quantum information is discussed.
Statistical Distance Measure - Fisher Information and Quantum Information
In quantum information, the Fisher information, the information about a parameter θ in a
particular measurement procedure, is expressed in terms of the statistical distance s, see [8,10]. The
statistical distance used is defined as a measure to distinguish two probability distributions on the basis
of measurement outcomes, see [37]. The Fisher information and the statistical distance are statistical
quantities, and generally refer to many measurements, as is the case in this survey. However, in
the quantum information theory and quantum statistics context, the problem setup is presented as
follows. There may or may not be a small phase change θ, and the question is whether it is there. In
that case one can design quantum experiments that give the answer unambiguously in a single
measurement. The equality derived is of the form

$$F(\phi) = \left( \frac{ds}{d\theta} \right)^{2} \tag{41}$$


that is, the Fisher information is the square of the derivative of the statistical distance s with respect to θ.
This contrasts with (36), where the square of the statistical distance measure is expressed in terms of the entries
of a FIM F(ϑ) that is based on information about m parameters estimated from n measurements,
with m < n. A challenging question can therefore be formulated as follows: can a generalization of
equality (41) be developed in a quantum information context, but at the matrix level? More
specifically, the setting is that of many observations or measurements leading to more than one parameter, such that the
corresponding Fisher information matrix is connected to an appropriate statistical distance matrix,
that is, a matrix whose entries are scalar distance measures. This question could equally be a challenge to
algebraic matrix theory and to quantum information.

3.4. The Bezoutian - The Fisher Information Matrix

In this section an additional resultant matrix is presented, namely the Bezout matrix or
Bezoutian. The notation of Lancaster and Tismenetsky [2] is used and the results presented are
taken from [38]. Consider the polynomials a and b given by $a(z) = \sum_{j=0}^{n} a_j z^j$ and $b(z) = \sum_{j=0}^{n} b_j z^j$,
cf. (13) but with p = q = n, and further assume a0 = b0 = 1. The Bezout matrix B(a, b) of the
polynomials a and b is defined by the relation

$$a(z)b(w) - a(w)b(z) = (z - w)\,u_n^{\top}(z)\,B(a, b)\,u_n(w)$$

This matrix is often referred to as the Bezoutian. We will display a decomposition of the Bezout matrix
B(a, b) developed in [38]. For that purpose the matrix Uϕ and its inverse Tϕ are presented, where ϕ is a
given complex number, to obtain

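$$U_{\phi} = \begin{pmatrix} 1 & 0 & \cdots & \cdots & 0 \\ -\phi & 1 & \cdots & \cdots & 0 \\ 0 & \ddots & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -\phi & 1 \end{pmatrix}, \qquad T_{\phi} = \begin{pmatrix} 1 & 0 & \cdots & \cdots & 0 \\ \phi & 1 & \cdots & \cdots & 0 \\ \phi^{2} & \ddots & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ \phi^{n-1} & \cdots & \phi^{2} & \phi & 1 \end{pmatrix}$$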
Let (1 − α1z) and (1 − β1z) be factors of a(z) and b(z) respectively, where α1 and β1 are zeros of â(z)
and b̂(z). Consider the factored form of the nth order polynomials a(z) and b(z), namely
$a(z) = (1 - \alpha_1 z)\,a_{-1}(z)$ and $b(z) = (1 - \beta_1 z)\,b_{-1}(z)$ respectively. Proceeding in this way for α2, . . . , αn yields the
recursion $a_{-(k-1)}(z) = (1 - \alpha_k z)\,a_{-k}(z)$, and equivalently for the polynomials $b_{-k}(z)$, with $a_0(z) = a(z)$ and $b_0(z) = b(z)$.
Proposition 3.1 in [38] is now presented.
The following non-symmetric decomposition of the Bezoutian is derived, using the
notation above

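$$B(a, b) = U_{\alpha_1}\begin{pmatrix} B(a_{-1}, b_{-1}) & 0 \\ 0 & 0 \end{pmatrix} U_{\beta_1}^{\top} + (\beta_1 - \alpha_1)\,b_{\beta_1} a_{\alpha_1}^{\top} \tag{42}$$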

with $a_{\alpha_1}$ such that $a_{\alpha_1}^{\top} u_n(z) = a_{-1}(z)$, and similarly for $b_{\beta_1}$. Iteration gives the following expansion of the
Bezout matrix

$$B(a, b) = \sum_{k=1}^{n} (\beta_k - \alpha_k)\, U_{\alpha_1} \cdots U_{\alpha_{k-1}} U_{\beta_{k+1}} \cdots U_{\beta_n}\, e_1^{n} (e_1^{n})^{\top}\, U_{\beta_1}^{\top} \cdots U_{\beta_{k-1}}^{\top} U_{\alpha_{k+1}}^{\top} \cdots U_{\alpha_n}^{\top}$$

where $e_1^{n}$ is the first standard basis column vector in $\mathbb{R}^n$; by $e_j$ we denote the jth coordinate vector,
$e_j = (0, \ldots, 1, \ldots, 0)^{\top}$, with all its components equal to 0 except the jth component, which equals 1.
The following corollaries to Proposition 3.1 in [38] are now presented.

Corollary 3.2 in [38] states the following. Let ϕ be a common zero of the polynomials â(z) and b̂(z). Then
$a(z) = (1 - \phi z)\,a_{-1}(z)$ and $b(z) = (1 - \phi z)\,b_{-1}(z)$ and

$$B(a, b) = U_{\phi}\begin{pmatrix} B(a_{-1}, b_{-1}) & 0 \\ 0 & 0 \end{pmatrix} U_{\phi}^{\top}$$
This is a direct consequence of (42), from which it can be concluded that the Bezoutian B(a, b) is
non-singular if and only if the polynomials a(z) and b(z) have no common factors. A similar conclusion
is drawn for the FIM in (27), so that the matrices F(ϑ) and B(a, b) have the same singularity property.
Related to Corollary 3.2 in [38], a description of the kernel or nullspace of
the Bezout matrix is now given.
Corollary 3.3 in [38] is now presented. Let ϕ1, . . . , ϕm be all the common zeros of the polynomials
â(z) and b̂(z), with multiplicities n1, . . . , nm. Let ℓ be the last standard basis column vector in $\mathbb{R}^n$
and put

$$w_k^{j} = \left(T_{\phi_k}^{j}\, J^{\,j-1}\right)^{\top} \ell$$

for k = 1, . . . , m and j = 1, . . . , nk, where J denotes the forward n × n shift matrix, Jij = 1 if i = j + 1 and Jij = 0 otherwise.
Consequently, the subspace Ker B(a, b) is the linear span of the vectors $w_k^{j}$.
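The singularity property of the Bezoutian can be illustrated numerically. The sketch below assumes the definition a(z)b(w) − a(w)b(z) = (z − w) uₙᵀ(z) B(a, b) uₙ(w) with uₙ(z) = (1, z, . . . , zⁿ⁻¹)ᵀ, and it uses the standard entrywise formula for the Bezout matrix rather than the decomposition (42). The function name `bezout_matrix` and the example polynomials are illustrative choices.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def bezout_matrix(a, b):
    """Bezout matrix of a(z) = sum a[j] z**j and b(z) = sum b[j] z**j, both of length n + 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a) - 1
    bez = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(min(i, n - 1 - j) + 1):
                bez[i, j] += a[j + k + 1] * b[i - k] - a[i - k] * b[j + k + 1]
    return bez

# a(z) and b(z) share the factor (1 - 0.5 z); coefficients are stored lowest degree first.
a_common = npoly.polymul([1, -0.5], [1, -0.2])
b_common = npoly.polymul([1, -0.5], [1, 0.3])
# A second pair without a common factor.
a_coprime = npoly.polymul([1, -0.5], [1, -0.2])
b_coprime = npoly.polymul([1, 0.4], [1, 0.3])

rank_common = np.linalg.matrix_rank(bezout_matrix(a_common, b_common), tol=1e-10)
rank_coprime = np.linalg.matrix_rank(bezout_matrix(a_coprime, b_coprime), tol=1e-10)

# Check the defining identity at one pair of points (z, w).
z, w = 0.7, -0.3
bez = bezout_matrix(a_coprime, b_coprime)
u = lambda x: x ** np.arange(bez.shape[0])
lhs = npoly.polyval(z, a_coprime) * npoly.polyval(w, b_coprime) \
    - npoly.polyval(w, a_coprime) * npoly.polyval(z, b_coprime)
rhs = (z - w) * (u(z) @ bez @ u(w))
print(rank_common, rank_coprime, np.isclose(lhs, rhs))
```

With one common zero and n = 2 the kernel is one-dimensional, so the rank drops by one, in line with the kernel description above.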
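An alternative representation to (27), involving the Bezoutian B(b, a) and derived in
Proposition 5.1 in [38], is of the form

$$\mathcal{F}(\vartheta) = M^{-1}(b, a)\,H(\vartheta)\,M^{-\top}(b, a) \tag{43}$$

where

$$H(\vartheta) = \begin{pmatrix} I & 0 \\ 0 & B(b, a) \end{pmatrix} Q(\vartheta) \begin{pmatrix} I & 0 \\ 0 & B(b, a) \end{pmatrix} \quad\text{and}\quad M(b, a) = \begin{pmatrix} P & 0 \\ P S(\hat a) P & P S(\hat b) P \end{pmatrix} \tag{44}$$

and

$$P = \begin{pmatrix} 0 & \cdots & 0 & 1 \\ \vdots & & 1 & 0 \\ 0 & \iddots & & \vdots \\ 1 & 0 & \cdots & 0 \end{pmatrix}, \qquad S(\hat a) = \begin{pmatrix} a_{n-1} & a_{n-2} & \cdots & a_0 \\ a_{n-2} & & a_0 & 0 \\ \vdots & \iddots & & \vdots \\ a_0 & 0 & \cdots & 0 \end{pmatrix} \quad\text{and}\quad Q(\vartheta) \succ 0$$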

The matrix S(â) is the symmetrizer of the polynomial â(z), where in this paper a0 = 1, see [2], and P is a
permutation matrix. In [38] it is shown that the matrix Q(ϑ) is the unique solution to an appropriate
Stein equation and is strictly positive definite. In the next section an explicit form of the Stein
solution Q(ϑ) is developed. Some comments concerning the property summarized in Corollary 5.2
in [38] follow.

The matrix H(ϑ) is non-singular if and only if the polynomials a(z) and b(z) have no common
factors. The proof is straightforward: since the matrix Q(ϑ) is non-singular, the
matrix H(ϑ) is non-singular exactly when the Bezoutian B(b, a) is non-singular, and this is fulfilled if and
only if the polynomials a(z) and b(z) have no common factors.

The matrix M(b, a) is non-singular if a0 ≠ 0 and b0 ≠ 0, which is the case since a0 = b0 = 1.
From (43) it can be concluded that the FIM F(ϑ) is non-singular only when the matrix H(ϑ)
is non-singular or, by virtue of (44), when the Bezoutian B(b, a) is non-singular. Consequently, the
singularity conditions of the Bezoutian B(b, a), the FIM F(ϑ) and the Sylvester resultant matrix
$\mathcal{S}(b, -a)$ are equivalent. By virtue of (29), proved in Lemma 3.1, and the
equality dim(Ker S(a, b)) = dim(Ker B(a, b)), proved in Theorem 21.11 in [1], it can be concluded that

$$\dim\left(\operatorname{Ker}\mathcal{S}(b, -a)\right) = \dim\left(\operatorname{Ker}\mathcal{F}(\vartheta)\right) = \dim\left(\operatorname{Ker}B(b, a)\right) = \nu(a, b)$$

3.5. The Stein Equation - The Fisher Information Matrix of an ARMA(p, q) Process

In [12], a link between the FIM of an ARMA process and an appropriate solution of a Stein
equation is set forth. In this survey paper we present some of these results and compare them with
results displayed in the previous sections. Alternative proofs are given for some results
obtained in [12,38].
The Stein matrix equation is now set forth. Let $A \in \mathbb{C}^{m\times m}$, $B \in \mathbb{C}^{n\times n}$ and $\Gamma \in \mathbb{C}^{n\times m}$ and consider
the Stein equation

$$S - BSA^{\top} = \Gamma \tag{45}$$

It has a unique solution if and only if λμ ≠ 1 for any λ ∈ σ(A) and μ ∈ σ(B), where the spectrum of an m × m matrix D is
σ(D) = {λ ∈ C : det(λIm − D) = 0}, the set of eigenvalues of D. The unique solution is given in the next
theorem [11].

3.5.1. Theorem 3.4

Let A and B be such that there is a single closed contour C with σ(B) inside C and, for each non-zero w ∈
σ(A), w−1 outside C. Then for an arbitrary Γ the Stein equation (45) has a unique solution S

$$S = \frac{1}{2\pi i}\oint_{C} (\lambda I_n - B)^{-1}\,\Gamma\,(I_m - \lambda A)^{-\top}\,d\lambda \tag{46}$$

In this section a connection between the representation (27) of the FIM F(ϑ) and an appropriate
solution to a Stein equation of the form (45), as developed in [12], is set forth. The roots of
the polynomials â(z) and b̂(z), denoted by α1, α2, . . . , αp and β1, β2, . . . , βq respectively, are assumed to be distinct, such
that the non-singularity of the FIM F(ϑ) is guaranteed. Applying Cauchy's residue theorem to the integral
expression (28) gives the following representation, equation (4.8) in [12]

$$\wp(\vartheta) = U(\vartheta)\,D(\vartheta)\,\hat U(\vartheta) \tag{47}$$

where

$$U(\vartheta) = \{u_{p+q}(\alpha_1), u_{p+q}(\alpha_2), \ldots, u_{p+q}(\alpha_p), u_{p+q}(\beta_1), u_{p+q}(\beta_2), \ldots, u_{p+q}(\beta_q)\}$$

$$D(\vartheta) = \operatorname{diag}\left\{ \frac{1}{\hat a(z;\alpha_i)\,\hat b(\alpha_i)\,a(\alpha_i)\,b(\alpha_i)},\ \frac{1}{\hat a(\beta_j)\,\hat b(z;\beta_j)\,a(\beta_j)\,b(\beta_j)} \right\}, \quad i = 1, \ldots, p \ \text{and}\ j = 1, \ldots, q$$

and

$$\hat U(\vartheta) = \{v_{p+q}(\alpha_1), v_{p+q}(\alpha_2), \ldots, v_{p+q}(\alpha_p), v_{p+q}(\beta_1), v_{p+q}(\beta_2), \ldots, v_{p+q}(\beta_q)\}^{\top}$$

Here the polynomial p(·; β) is defined by $p(z; \beta) = \frac{p(z)}{z - \beta}$ and D(ϑ) is a (p + q) × (p + q) diagonal
matrix. The matrices U(ϑ) and Û(ϑ) in (47) are the (p + q) × (p + q) Vandermonde matrices Vαβ and
V̂αβ respectively, given by

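$$V_{\alpha\beta} = \begin{pmatrix}
1 & \alpha_1 & \alpha_1^{2} & \cdots & \alpha_1^{p+q-1} \\
1 & \alpha_2 & \alpha_2^{2} & \cdots & \alpha_2^{p+q-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \alpha_p & \alpha_p^{2} & \cdots & \alpha_p^{p+q-1} \\
1 & \beta_1 & \beta_1^{2} & \cdots & \beta_1^{p+q-1} \\
1 & \beta_2 & \beta_2^{2} & \cdots & \beta_2^{p+q-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \beta_q & \beta_q^{2} & \cdots & \beta_q^{p+q-1}
\end{pmatrix}
\quad\text{and}\quad
\hat V_{\alpha\beta} = \begin{pmatrix}
\alpha_1^{p+q-1} & \alpha_1^{p+q-2} & \cdots & \alpha_1 & 1 \\
\alpha_2^{p+q-1} & \alpha_2^{p+q-2} & \cdots & \alpha_2 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\alpha_p^{p+q-1} & \alpha_p^{p+q-2} & \cdots & \alpha_p & 1 \\
\beta_1^{p+q-1} & \beta_1^{p+q-2} & \cdots & \beta_1 & 1 \\
\beta_2^{p+q-1} & \beta_2^{p+q-2} & \cdots & \beta_2 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\beta_q^{p+q-1} & \beta_q^{p+q-2} & \cdots & \beta_q & 1
\end{pmatrix}$$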

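It is clear that the (p + q) × (p + q) Vandermonde matrices Vαβ and V̂αβ are nonsingular when αi ≠ αj,
βk ≠ βh and αi ≠ βk for all i, j = 1, . . . , p with i ≠ j and k, h = 1, . . . , q with k ≠ h. A systematic evaluation of the
Vandermonde determinants Det Vαβ and Det V̂αβ yields

$$\operatorname{Det} V_{\alpha\beta} = (-1)^{(p+q)(p+q-1)/2}\,\Phi(\alpha_i, \beta_k)$$

where

$$\Phi(\alpha_i, \beta_k) = \prod_{1 \le i < j \le p} (\alpha_i - \alpha_j) \prod_{1 \le k < h \le q} (\beta_k - \beta_h) \prod_{\substack{m = 1, \ldots, p \\ n = 1, \ldots, q}} (\alpha_m - \beta_n)$$

Since $V_{\alpha\beta} = P \hat V_{\alpha\beta}^{\top}$ and given the configuration of the permutation matrix P, this leads to the equalities
$\operatorname{Det} \hat V_{\alpha\beta}^{\top} = \operatorname{Det} P \,\operatorname{Det} V_{\alpha\beta}$ and $\operatorname{Det} P = (-1)^{(p+q)(p+q-1)/2}$, so that

$$\operatorname{Det} \hat V_{\alpha\beta} = (-1)^{(p+q)(p+q-1)}\,\Phi(\alpha_i, \beta_k) \ \Rightarrow\ \left|\operatorname{Det} V_{\alpha\beta}\right| = \left|\operatorname{Det} \hat V_{\alpha\beta}\right|$$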
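The two determinant formulas can be checked on a small example. This is a sketch under the assumption that Vαβ has rows (1, x, x², . . .) and V̂αβ has rows with decreasing powers, as displayed above; the numerical values of α and β are arbitrary distinct numbers chosen for illustration.

```python
import numpy as np

def phi_product(alpha, beta):
    """Product of the differences that defines Phi(alpha_i, beta_k)."""
    total = 1.0
    for i in range(len(alpha)):
        for j in range(i + 1, len(alpha)):
            total *= alpha[i] - alpha[j]
    for k in range(len(beta)):
        for h in range(k + 1, len(beta)):
            total *= beta[k] - beta[h]
    for a_m in alpha:
        for b_n in beta:
            total *= a_m - b_n
    return total

alpha = np.array([0.5, -0.2])          # p = 2
beta = np.array([0.3, 0.7, -0.6])      # q = 3
x = np.concatenate([alpha, beta])
size = len(x)

V = np.vander(x, increasing=True)      # rows (1, x, x^2, ...)
V_hat = np.vander(x)                   # rows (x^(N-1), ..., x, 1)

sign = (-1) ** (size * (size - 1) // 2)
phi_value = phi_product(alpha, beta)
det_V = np.linalg.det(V)
det_V_hat = np.linalg.det(V_hat)
print(det_V, sign * phi_value, det_V_hat, phi_value)
```

Reversing the column order is what relates the two matrices, so the determinants can differ only in sign, which is the statement |Det Vαβ| = |Det V̂αβ|.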

We now introduce an appropriate Stein equation of the form (45) such that a connection with
℘(ϑ) in (47) can be verified. To this end the following (p + q) × (p + q) companion matrix is introduced,

$$C_g = \begin{pmatrix} 0 & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \\ -g_{p+q} & -g_{p+q-1} & \cdots & -g_1 \end{pmatrix} \tag{48}$$

where the entries gi are given by $z^{p+q} + \sum_{i=1}^{p+q} g_i(\vartheta) z^{p+q-i} = \hat a(z)\hat b(z) = \hat g(z, \vartheta)$ and ĝ(ϑ) is the vector
$\hat g(\vartheta) = (g_{p+q}(\vartheta), g_{p+q-1}(\vartheta), \ldots, g_1(\vartheta))^{\top}$. Likewise, g(z, ϑ) = a(z)b(z) and $g(\vartheta) = (g_1(\vartheta), g_2(\vartheta), \ldots, g_{p+q}(\vartheta))^{\top}$;
for the properties of a companion matrix see e.g., [36], [2]. Since all
the roots of the polynomials â(z) and b̂(z) are distinct and lie within the unit circle, the
products αiβj ≠ 1, αiαj ≠ 1 and βiβj ≠ 1 hold for all i = 1, 2, . . . , p and j = 1, 2, . . . , q. Consequently,
the uniqueness condition for the solution of an appropriate Stein equation is verified. The following
Stein equation and its solution, according to (45) and (46), are now presented

$$S - C_g S C_g^{\top} = \Gamma \quad\text{and}\quad S = \frac{1}{2\pi i}\oint_{|z|=1} (zI_{p+q} - C_g)^{-1}\,\Gamma\,(I_{p+q} - zC_g)^{-\top}\,dz$$

where the closed contour is now the unit circle |z| = 1 and the matrix Γ is of size (p + q) × (p + q). A
more explicit expression of the solution S is of the form

$$S = \frac{1}{2\pi i}\oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+q} - C_g)\,\Gamma\,\operatorname{adj}(I_{p+q} - zC_g)^{\top}}{a(z)\,b(z)\,\hat a(z)\,\hat b(z)}\,dz \tag{49}$$

where adj(X) = X−1 Det(X) is the adjugate (classical adjoint) of the matrix X. When Cauchy's residue theorem is applied to the
solution S in (49), the following factored form of S is derived, equation (4.9) in [12]

$$S = (C_1, C_2)\,(I_{p+q} \otimes \Gamma)\,(D(\vartheta) \otimes I_{p+q})\,(C_3, C_4)^{\top} \tag{50}$$

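where

$$\begin{aligned}
C_1 &= \operatorname{adj}(\alpha_1 I_{p+q} - C_g),\ \operatorname{adj}(\alpha_2 I_{p+q} - C_g),\ \ldots,\ \operatorname{adj}(\alpha_p I_{p+q} - C_g) \\
C_2 &= \operatorname{adj}(\beta_1 I_{p+q} - C_g),\ \operatorname{adj}(\beta_2 I_{p+q} - C_g),\ \ldots,\ \operatorname{adj}(\beta_q I_{p+q} - C_g) \\
C_3 &= \operatorname{adj}(I_{p+q} - \alpha_1 C_g),\ \operatorname{adj}(I_{p+q} - \alpha_2 C_g),\ \ldots,\ \operatorname{adj}(I_{p+q} - \alpha_p C_g) \\
C_4 &= \operatorname{adj}(I_{p+q} - \beta_1 C_g),\ \operatorname{adj}(I_{p+q} - \beta_2 C_g),\ \ldots,\ \operatorname{adj}(I_{p+q} - \beta_q C_g)
\end{aligned}$$

and D(ϑ) is given in (47). The following matrix rule is applied

$$(A \otimes B)(C \otimes D) = AC \otimes BD$$

where the operator ⊗ is the tensor (Kronecker) product of two matrices, see e.g., [2], [20].
Combining (47) and (50) and taking the assumption αi ≠ αj, βk ≠ βh and αi ≠ βk into account
implies that the inverses of the (p + q) × (p + q) Vandermonde matrices Vαβ and V̂αβ exist, as Lemma
4.2 in [12] states.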
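The following equality holds true

$$S = (C_1, C_2)\left(V_{\alpha\beta}^{-1}\,\wp(\vartheta)\,\hat V_{\alpha\beta}^{-1} \otimes \Gamma\right)(C_3, C_4)^{\top}$$

or

$$S = (C_1, C_2)\left(V_{\alpha\beta}^{-1}\,\mathcal{S}^{-1}(b, -a)\,\mathcal{F}(\vartheta)\,\mathcal{S}^{-\top}(b, -a)\,\hat V_{\alpha\beta}^{-1} \otimes \Gamma\right)(C_3, C_4)^{\top} \tag{51}$$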

Consequently, under the condition αi ≠ αj, βk ≠ βh and αi ≠ βk, and by virtue of (27) and (51),
a connection involving the FIM F(ϑ), a solution S to an appropriate Stein equation, the
Sylvester matrix $\mathcal{S}(b, -a)$ and the Vandermonde matrices Vαβ and V̂αβ is established. It is clear that, by using the
expression (43), the Bezoutian B(a, b) can be inserted in equality (51).
We now formulate a Stein equation with the matrix $\Gamma = e_{p+q} e_{p+q}^{\top}$,

$$S - C_g S C_g^{\top} = e_{p+q} e_{p+q}^{\top} \tag{52}$$

where $e_{p+q}$ is the last standard basis column vector in $\mathbb{R}^{p+q}$; in general, $e_i^{m}$ is the i-th standard basis
column vector in $\mathbb{R}^{m}$, with all its components equal to 0 except the i-th component, which equals 1. The
next lemma is formulated.

3.5.2. Lemma 3.5

The symmetric matrix ℘(ϑ) defined in (28) fulfills the Stein Equation (52).

Proof

The unique solution of (52) is, according to (46),

$$S = \frac{1}{2\pi i}\oint_{|z|=1} (zI_{p+q} - C_g)^{-1}\,e_{p+q} e_{p+q}^{\top}\,(I_{p+q} - zC_g)^{-\top}\,dz$$

or, written more explicitly,

$$S = \frac{1}{2\pi i}\oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+q} - C_g)\,e_{p+q} e_{p+q}^{\top}\,\operatorname{adj}(I_{p+q} - zC_g)^{\top}}{a(z)\,b(z)\,\hat a(z)\,\hat b(z)}\,dz$$

Using the structure of the companion matrix Cg, a standard computation shows that the last column
of adj(zIp+q − Cg) is the basic vector up+q(z) and, consequently, the last column of adj(Ip+q − zCg)
is the basic vector $v_{p+q}(z) = z^{p+q-1} u_{p+q}(z^{-1})$. This implies that $\operatorname{adj}(zI_{p+q} - C_g)\,e_{p+q} = u_{p+q}(z)$ and
$e_{p+q}^{\top}\operatorname{adj}(I_{p+q} - zC_g)^{\top} = v_{p+q}^{\top}(z)$, or

$$S = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_{p+q}(z)\,v_{p+q}^{\top}(z)}{a(z)\,b(z)\,\hat a(z)\,\hat b(z)}\,dz = \wp(\vartheta)$$

Consequently, the solution S to the Stein equation (52) coincides with the matrix ℘(ϑ) defined in (28).
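The companion matrix property used in this proof can be verified numerically for a fixed z. The sketch below assumes a companion matrix of the form (48), with last row (−g_N, . . . , −g_1), and computes the adjugate as Det(X) X⁻¹, which is valid only for non-singular X. The roots and the evaluation point z are arbitrary illustrative values.

```python
import numpy as np

def companion(g):
    """Companion matrix of the form (48) for z^N + g[0] z^(N-1) + ... + g[N-1]."""
    size = len(g)
    comp = np.zeros((size, size))
    comp[:-1, 1:] = np.eye(size - 1)                     # ones on the superdiagonal
    comp[-1, :] = -np.asarray(g, dtype=float)[::-1]      # last row: -g_N, ..., -g_1
    return comp

def adjugate(X):
    """Adjugate of a non-singular matrix, adj(X) = det(X) X^{-1}."""
    return np.linalg.det(X) * np.linalg.inv(X)

roots = np.array([0.5, -0.3, 0.2, 0.7])    # stand-ins for the zeros of a-hat and b-hat
g_hat = np.poly(roots)                     # monic coefficients, highest degree first
C = companion(g_hat[1:])
size = len(roots)

z = 0.9 + 0.4j
u = z ** np.arange(size)                   # u(z) = (1, z, ..., z^(N-1))
v = z ** np.arange(size - 1, -1, -1)       # v(z) = z^(N-1) u(1/z)

last_col_first = adjugate(z * np.eye(size) - C)[:, -1]
last_col_second = adjugate(np.eye(size) - z * C)[:, -1]
eigenvalues = np.sort(np.linalg.eigvals(C).real)
print(np.allclose(last_col_first, u), np.allclose(last_col_second, v))
```

The eigenvalues of the companion matrix are the prescribed roots, so they all lie inside the unit circle in this example.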

The Stein equation that is verified by the FIM F(ϑ) is now considered. For that purpose we
display the following p × p and q × q companion matrices Ca and Cb,

$$C_a = \begin{pmatrix} -a_1 & -a_2 & \cdots & \cdots & -a_p \\ 1 & 0 & \cdots & \cdots & 0 \\ 0 & \ddots & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{pmatrix}, \qquad C_b = \begin{pmatrix} -b_1 & -b_2 & \cdots & \cdots & -b_q \\ 1 & 0 & \cdots & \cdots & 0 \\ 0 & \ddots & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{pmatrix}$$

respectively. Introduce the (p + q) × (p + q) matrix $K(\vartheta) = \begin{pmatrix} C_a & O \\ O & C_b \end{pmatrix}$ and the (p + q) × 1 vector
$B = \begin{pmatrix} e_p^{1} \\ -e_q^{1} \end{pmatrix}$, where $e_p^{1}$ and $e_q^{1}$ are the first standard basis column vectors in $\mathbb{R}^p$ and $\mathbb{R}^q$ respectively.

Consider the Stein equation

$$S - K(\vartheta)\,S\,K^{\top}(\vartheta) = BB^{\top} \tag{53}$$

followed by the theorem.

3.5.3. Theorem 3.6

The Fisher information matrix F(ϑ) in (17) coincides with the solution to the Stein equation (53).

Proof

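The eigenvalues of the companion matrices Ca and Cb are the zeros of the
polynomials â(z) and b̂(z) respectively, which are in absolute value smaller than one. This implies that the unique
solution of the Stein equation (53) exists and is given by

$$S = \frac{1}{2\pi i}\oint_{|z|=1} (zI_{p+q} - K(\vartheta))^{-1}\,BB^{\top}\,(I_{p+q} - zK(\vartheta))^{-\top}\,dz$$

Developing this integral expression in a more explicit form yields

$$S = \frac{1}{2\pi i}\oint_{|z|=1} \begin{pmatrix} \dfrac{\operatorname{adj}(zI_p - C_a)}{\hat a(z)} & O \\ O & \dfrac{\operatorname{adj}(zI_q - C_b)}{\hat b(z)} \end{pmatrix} \begin{pmatrix} e_p^{1} \\ -e_q^{1} \end{pmatrix} \left\{ \begin{pmatrix} \dfrac{\operatorname{adj}(I_p - zC_a)}{a(z)} & O \\ O & \dfrac{\operatorname{adj}(I_q - zC_b)}{b(z)} \end{pmatrix} \begin{pmatrix} e_p^{1} \\ -e_q^{1} \end{pmatrix} \right\}^{\top} dz$$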

The form of the companion matrices Ca and Cb leads, through straightforward
computation, to the conclusion that the first column of adj(zIp − Ca) is the basic vector vp(z) and,
consequently, the first column of adj(Ip − zCa) is the basic vector up(z). The same holds for the companion
matrix Cb, which yields

$$S = \frac{1}{2\pi i}\oint_{|z|=1} \begin{pmatrix} \dfrac{v_p(z)}{\hat a(z)} \\ -\dfrac{v_q(z)}{\hat b(z)} \end{pmatrix} \begin{pmatrix} \dfrac{u_p^{\top}(z)}{a(z)} & -\dfrac{u_q^{\top}(z)}{b(z)} \end{pmatrix} dz \tag{54}$$

To obtain a representation equivalent to the FIM F(ϑ) in (17),
the transpose of the solution (54) to the Stein equation (53) is required,

$$S^{\top} = \frac{1}{2\pi i}\oint_{|z|=1} \begin{pmatrix} \dfrac{u_p(z)\,v_p^{\top}(z)}{a(z)\,\hat a(z)} & -\dfrac{u_p(z)\,v_q^{\top}(z)}{a(z)\,\hat b(z)} \\ -\dfrac{u_q(z)\,v_p^{\top}(z)}{\hat a(z)\,b(z)} & \dfrac{u_q(z)\,v_q^{\top}(z)}{b(z)\,\hat b(z)} \end{pmatrix} dz = \mathcal{F}(\vartheta) \tag{55}$$

or

$$S^{\top} = \frac{1}{2\pi i}\oint_{|z|=1} (I_{p+q} - zK(\vartheta))^{-1}\,BB^{\top}\,(zI_{p+q} - K(\vartheta))^{-\top}\,dz = \mathcal{F}(\vartheta)$$

The symmetry of the FIM F(ϑ) leads to S = F(ϑ). From the representation (55) it can be
concluded that the solution S of the Stein equation (53) coincides with the symmetric block Toeplitz FIM F(ϑ)
given in (17). This completes the proof.
It is straightforward to verify that the submatrix (1,2) in (55) is the complex conjugate transpose
of the submatrix (2,1), whereas each submatrix on the main diagonal is Hermitian; consequently,
the integrand is Hermitian. This implies that, when the standard residue theorem is applied, it yields
F(ϑ) = F⊤(ϑ).
An Illustrative Example of Theorem 3.6
To illustrate Theorem 3.6, the case of an ARMA(2, 2) process is considered. We use the
representation (17) for computing the FIM F(ϑ) of an ARMA(2, 2) process. The autoregressive and
moving average polynomials are of degree two, that is p = q = 2, and the ARMA(2, 2) process is described by

$$a(z)\,y(t) = b(z)\,\epsilon(t) \tag{56}$$

where y(t) is the stationary process driven by white noise ϵ(t), $a(z) = 1 + a_1 z + a_2 z^2$ and $b(z) = 1 + b_1 z + b_2 z^2$,
and the parameter vector is $\vartheta = (a_1, a_2, b_1, b_2)^{\top}$. The condition that the zeros of the polynomials

$$\hat a(z) = z^2 a(z^{-1}) = z^2 + a_1 z + a_2 \quad\text{and}\quad \hat b(z) = z^2 b(z^{-1}) = z^2 + b_1 z + b_2$$

are in absolute value smaller than one is imposed. The FIM F(ϑ) of the ARMA(2, 2) process (56) is of
the form

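$$\mathcal{F}(\vartheta) = \begin{pmatrix} \mathcal{F}_{aa}(\vartheta) & \mathcal{F}_{ab}(\vartheta) \\ \mathcal{F}_{ab}^{\top}(\vartheta) & \mathcal{F}_{bb}(\vartheta) \end{pmatrix} \tag{57}$$

where

$$\mathcal{F}_{aa}(\vartheta) = \frac{1}{(1 - a_2)\left[(1 + a_2)^2 - a_1^2\right]} \begin{pmatrix} 1 + a_2 & -a_1 \\ -a_1 & 1 + a_2 \end{pmatrix}$$

$$\mathcal{F}_{bb}(\vartheta) = \frac{1}{(1 - b_2)\left[(1 + b_2)^2 - b_1^2\right]} \begin{pmatrix} 1 + b_2 & -b_1 \\ -b_1 & 1 + b_2 \end{pmatrix}$$

$$\mathcal{F}_{ab}(\vartheta) = \frac{1}{(a_2 b_2 - 1)^2 + (a_2 b_1 - a_1)(b_1 - a_1 b_2)} \begin{pmatrix} a_2 b_2 - 1 & a_1 - a_2 b_1 \\ b_1 - a_1 b_2 & a_2 b_2 - 1 \end{pmatrix}$$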

The submatrices Faa(ϑ) and Fbb(ϑ) are symmetric and Toeplitz, whereas Fab(ϑ) is Toeplitz. The
symmetric block Toeplitz property holds in general for the class
of Fisher information matrices of stationary ARMA(p, q) processes, where p and q are arbitrary finite
integers that represent the degrees of the autoregressive and moving average polynomials, respectively.

The appropriate companion matrices Ca, Cb and the 4 × 4 matrices K(ϑ) and BB⊤ are

$$C_a = \begin{pmatrix} -a_1 & -a_2 \\ 1 & 0 \end{pmatrix}, \quad C_b = \begin{pmatrix} -b_1 & -b_2 \\ 1 & 0 \end{pmatrix}, \quad K(\vartheta) = \begin{pmatrix} -a_1 & -a_2 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & -b_1 & -b_2 \\ 0 & 0 & 1 & 0 \end{pmatrix} \quad\text{and}\quad BB^{\top} = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \tag{58}$$

where $B = (1, 0, -1, 0)^{\top}$. It can be verified that the Stein equation

$$\mathcal{F}(\vartheta) - K(\vartheta)\,\mathcal{F}(\vartheta)\,K^{\top}(\vartheta) = BB^{\top}$$

holds true when F(ϑ) is of the form (57) and the matrices K(ϑ) and BB⊤ are given in (58).
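The verification mentioned for this example can be carried out numerically. The following sketch builds F(ϑ) from the closed forms in (57) and K(ϑ), B from (58), and evaluates the residual of the Stein equation; the parameter values are an arbitrary choice for which the zeros of â(z) and b̂(z) lie inside the unit circle, and the function names are hypothetical.

```python
import numpy as np

def arma22_fim(a1, a2, b1, b2):
    """Closed-form FIM (57) of an ARMA(2, 2) process."""
    faa = np.array([[1 + a2, -a1], [-a1, 1 + a2]]) / ((1 - a2) * ((1 + a2) ** 2 - a1 ** 2))
    fbb = np.array([[1 + b2, -b1], [-b1, 1 + b2]]) / ((1 - b2) * ((1 + b2) ** 2 - b1 ** 2))
    den = (a2 * b2 - 1) ** 2 + (a2 * b1 - a1) * (b1 - a1 * b2)
    fab = np.array([[a2 * b2 - 1, a1 - a2 * b1], [b1 - a1 * b2, a2 * b2 - 1]]) / den
    return np.block([[faa, fab], [fab.T, fbb]])

def stein_data(a1, a2, b1, b2):
    """Matrices K(theta) and B as displayed in (58)."""
    K = np.array([[-a1, -a2, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, -b1, -b2],
                  [0.0, 0.0, 1.0, 0.0]])
    B = np.array([[1.0], [0.0], [-1.0], [0.0]])
    return K, B

# a-hat(z) = z^2 - 0.5 z + 0.06 has zeros 0.2 and 0.3;
# b-hat(z) = z^2 + 0.1 z - 0.2 has zeros 0.4 and -0.5.
theta = (-0.5, 0.06, 0.1, -0.2)
F = arma22_fim(*theta)
K, B = stein_data(*theta)

residual = F - K @ F @ K.T - B @ B.T
min_eigenvalue = np.linalg.eigvalsh(F).min()
print(np.abs(residual).max(), min_eigenvalue)
```

Because the two polynomials have no common zero in this example, F(ϑ) is also expected to be positive definite, which the smallest eigenvalue confirms.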

3.5.4. Some Additional Results

In Proposition 5.1 in [38] it is proved that the matrix Q(ϑ) in (44) fulfills the Stein equation (59) and that Q(ϑ) ≻ 0.
It states that, with $e_P = (e_1^{\top} P, 0)^{\top} = (e_n, 0_n)^{\top} \in \mathbb{R}^{2n}$, where e1 is the first standard basis
column vector in $\mathbb{R}^n$ and en is the last, or n-th, standard basis column vector in $\mathbb{R}^n$, the
Stein equation takes the form

$$Q(\vartheta) = F_N(\vartheta)\,Q(\vartheta)\,F_N^{\top}(\vartheta) + e_P e_P^{\top} \tag{59}$$

where

$$F_N(\vartheta) = \begin{pmatrix} \hat C_a & 0 \\ e_1 e_1^{\top} & C_b \end{pmatrix}, \qquad \hat C_a = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & \cdots & 0 & 1 \\ -a_p & -a_{p-1} & \cdots & \cdots & -a_1 \end{pmatrix}$$

A corollary to Proposition 5.1 in [38] will be set forth, in which the involvement of various Vandermonde matrices
in the explicit solution to (59) is confirmed. For that purpose the following Vandermonde matrices
are displayed,

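$$V_{\alpha} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \alpha_1 & \alpha_2 & \cdots & \alpha_n \\ \alpha_1^{2} & \alpha_2^{2} & \cdots & \alpha_n^{2} \\ \vdots & \vdots & & \vdots \\ \alpha_1^{n-1} & \alpha_2^{n-1} & \cdots & \alpha_n^{n-1} \end{pmatrix}, \quad \hat V_{\alpha} = \begin{pmatrix} \alpha_1^{n-1} & \alpha_1^{n-2} & \cdots & 1 \\ \alpha_2^{n-1} & \alpha_2^{n-2} & \cdots & 1 \\ \alpha_3^{n-1} & \alpha_3^{n-2} & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ \alpha_n^{n-1} & \alpha_n^{n-2} & \cdots & 1 \end{pmatrix}, \quad \hat V_{\alpha\beta} = \begin{pmatrix} \hat V_{\alpha} \\ \hat V_{\beta} \end{pmatrix}, \quad\text{and}\quad V_{\alpha\beta} = \begin{pmatrix} V_{\alpha} & V_{\beta} \end{pmatrix} \tag{60}$$

where V̂β and Vβ have the same configuration as V̂α and Vα respectively. A corollary to Proposition
5.1 in [38] is now formulated.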

3.5.5. Corollary 3.7

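An explicit expression of the solution to the Stein equation (59) is of the form

$$Q(\vartheta) = \begin{pmatrix} V_{\alpha} D_{11}(\vartheta) \hat V_{\alpha} & V_{\alpha} D_{12}(\vartheta) V_{\alpha}^{\top} \\ \hat V_{\alpha\beta}^{\top} D_{21}(\vartheta) \hat V_{\alpha\beta} & \hat V_{\alpha\beta}^{\top} D_{22}(\vartheta) V_{\alpha\beta}^{\top} \end{pmatrix} \tag{61}$$

where the n × n and 2n × 2n diagonal matrices Dkl(ϑ) are specified in the proof.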

Proof

A unique solution of the Stein equation (59) is guaranteed since the eigenvalues of the
companion matrices Ĉa and Cb, given respectively by the zeros of the polynomials â(z) and b̂(z),
are in absolute value smaller than one. Consequently, the unique solution to the Stein equation (59) exists and is
given by

$$Q(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} (zI_{2n} - F_N(\vartheta))^{-1}\,e_P e_P^{\top}\,(I_{2n} - zF_N(\vartheta))^{-\top}\,dz \tag{62}$$

In order to proceed, the following matrix property is used,

$$\begin{pmatrix} A & O \\ B & C \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & O \\ -C^{-1}BA^{-1} & C^{-1} \end{pmatrix}$$
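The block inversion rule can be confirmed on a small numerical example. This is a minimal sketch with arbitrarily chosen invertible diagonal blocks A and C; the matrices are illustrative only and unrelated to the process parameters.

```python
import numpy as np

# Illustrative blocks: A and C must be invertible, B is arbitrary.
A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 2.0], [3.0, 4.0]])
C = np.array([[1.0, 0.0], [4.0, 5.0]])
O = np.zeros((2, 2))

block = np.block([[A, O], [B, C]])

A_inv = np.linalg.inv(A)
C_inv = np.linalg.inv(C)
# Inverse assembled from the block formula for a lower block triangular matrix.
block_inv_formula = np.block([[A_inv, O], [-C_inv @ B @ A_inv, C_inv]])

formula_ok = np.allclose(block_inv_formula, np.linalg.inv(block))
product_ok = np.allclose(block @ block_inv_formula, np.eye(4))
print(formula_ok, product_ok)
```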

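When applied to (62), with p = q = n, it yields

$$Q(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \begin{pmatrix} \dfrac{\operatorname{adj}(zI_n - \hat C_a)}{\hat a(z)} & O \\ \dfrac{\operatorname{adj}(zI_n - C_b)\,e_1 e_1^{\top}\operatorname{adj}(zI_n - \hat C_a)}{\hat a(z)\,\hat b(z)} & \dfrac{\operatorname{adj}(zI_n - C_b)}{\hat b(z)} \end{pmatrix} \begin{pmatrix} e_n \\ 0 \end{pmatrix} \times \left\{ \begin{pmatrix} \dfrac{\operatorname{adj}(I_n - z\hat C_a)}{a(z)} & O \\ \dfrac{\operatorname{adj}(I_n - zC_b)\,e_1 e_1^{\top}\operatorname{adj}(I_n - z\hat C_a)}{a(z)\,b(z)} & \dfrac{\operatorname{adj}(I_n - zC_b)}{b(z)} \end{pmatrix} \begin{pmatrix} e_n \\ 0 \end{pmatrix} \right\}^{\top} dz$$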

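Considering that the last column vectors of the matrices adj(zIn − Ĉa) and adj(In − zĈa) are the
vectors un(z) and vn(z) respectively, it then yields

$$Q(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \begin{pmatrix} \dfrac{u_n(z)}{\hat a(z)} \\ \dfrac{v_n(z)}{\hat a(z)\,\hat b(z)} \end{pmatrix} \begin{pmatrix} \dfrac{v_n^{\top}(z)}{a(z)} & \dfrac{z^{n-1} u_n^{\top}(z)}{a(z)\,b(z)} \end{pmatrix} dz$$

$$= \frac{1}{2\pi i}\oint_{|z|=1} \begin{pmatrix} \dfrac{u_n(z)\,v_n^{\top}(z)}{a(z)\,\hat a(z)} & \dfrac{z^{n-1} u_n(z)\,u_n^{\top}(z)}{\hat a(z)\,a(z)\,b(z)} \\ \dfrac{v_n(z)\,v_n^{\top}(z)}{\hat a(z)\,\hat b(z)\,a(z)} & \dfrac{z^{n-1} v_n(z)\,u_n^{\top}(z)}{\hat a(z)\,\hat b(z)\,a(z)\,b(z)} \end{pmatrix} dz = \begin{pmatrix} Q_{11}(\vartheta) & Q_{12}(\vartheta) \\ Q_{21}(\vartheta) & Q_{22}(\vartheta) \end{pmatrix}$$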

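Applying the standard residue theorem leads, for the respective submatrices, to

$$Q_{11}(\vartheta) = \{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}\,D_{11}(\vartheta)\,\{v_n(\alpha_1), \ldots, v_n(\alpha_n)\}^{\top}$$

$$Q_{12}(\vartheta) = \{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}\,D_{12}(\vartheta)\,\{u_n(\alpha_1), \ldots, u_n(\alpha_n)\}^{\top}$$

$$Q_{21}(\vartheta) = \{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}\,D_{21}(\vartheta)\,\{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}^{\top}$$

$$Q_{22}(\vartheta) = \{v_n(\alpha_1), \ldots, v_n(\alpha_n), v_n(\beta_1), \ldots, v_n(\beta_n)\}\,D_{22}(\vartheta)\,\{u_n(\alpha_1), \ldots, u_n(\alpha_n), u_n(\beta_1), \ldots, u_n(\beta_n)\}^{\top}$$

where the n × n diagonal matrices are

$$D_{11}(\vartheta) = \operatorname{diag}\left\{ \frac{1}{a(\alpha_i)\,\hat a(z; \alpha_i)} \right\}, \quad D_{12}(\vartheta) = \operatorname{diag}\left\{ \frac{\alpha_i^{n-1}}{a(\alpha_i)\,b(\alpha_i)\,\hat a(z; \alpha_i)} \right\} \quad\text{for } i = 1, \ldots, n$$

and the 2n × 2n diagonal matrices are

$$D_{21}(\vartheta) = \operatorname{diag}\left\{ \frac{1}{a(\alpha_i)\,\hat b(\alpha_i)\,\hat a(z; \alpha_i)},\ \frac{1}{\hat a(\beta_j)\,a(\beta_j)\,\hat b(z; \beta_j)} \right\}, \quad\text{for } i, j = 1, \ldots, n$$

$$D_{22}(\vartheta) = \operatorname{diag}\left\{ \frac{\alpha_i^{n-1}}{a(\alpha_i)\,b(\alpha_i)\,\hat b(\alpha_i)\,\hat a(z; \alpha_i)},\ \frac{\beta_j^{n-1}}{\hat a(\beta_j)\,a(\beta_j)\,b(\beta_j)\,\hat b(z; \beta_j)} \right\}, \quad\text{for } i, j = 1, \ldots, n$$

It is clear that the first and third matrices in Q11(ϑ), Q12(ϑ), Q21(ϑ) and Q22(ϑ) are the appropriate
Vandermonde matrices displayed in (60), so that the representation (61) is verified.
This completes the proof.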
In this section an explicit form of the solution Q(ϑ), expressed in terms of various Vandermonde
matrices, is displayed. Also, a connection between the Fisher information matrix F(ϑ) and appropriate
solutions to Stein equations and related matrices is presented. Proofs are given that the Stein
equations are verified by the FIM F(ϑ) and the associated matrix ℘(ϑ). These are alternatives to the
proofs developed in [38]. The presence of various forms of Vandermonde matrices is also emphasized.

In the next section some matrix properties of the FIM of an ARMAX process are presented.

3.6. The Fisher Information Matrix of an ARMAX(p, r, q) Process

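The FIM of the ARMAX process (11) is set forth according to [4]. The derivatives in the
corresponding representation (16) are

$$\frac{\partial \epsilon_t(\vartheta)}{\partial a_j} = \frac{c(z)}{a(z)\,b(z)}\,x(t - j) + \frac{1}{a(z)}\,\epsilon(t - j), \quad \frac{\partial \epsilon_t(\vartheta)}{\partial c_l} = -\frac{1}{b(z)}\,x(t - l) \quad\text{and}\quad \frac{\partial \epsilon_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\,\epsilon(t - k)$$

where j = 1, . . . , p, l = 1, . . . , r and k = 1, . . . , q. Combining all j, l and k yields the (p + r + q) × (p + r +
q) FIM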

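$$\mathcal{G}(\vartheta) = \begin{pmatrix} \mathcal{G}_{aa}(\vartheta) & \mathcal{G}_{ac}(\vartheta) & \mathcal{G}_{ab}(\vartheta) \\ \mathcal{G}_{ac}^{\top}(\vartheta) & \mathcal{G}_{cc}(\vartheta) & \mathcal{G}_{cb}(\vartheta) \\ \mathcal{G}_{ab}^{\top}(\vartheta) & \mathcal{G}_{cb}^{\top}(\vartheta) & \mathcal{G}_{bb}(\vartheta) \end{pmatrix} \tag{63}$$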

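where the submatrices of G(ϑ) are given by

$$\mathcal{G}_{aa}(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_p(z)\,u_p^{\top}(z^{-1})\,c(z)\,c(z^{-1})}{a(z)\,a(z^{-1})\,b(z)\,b(z^{-1})}\,\frac{dz}{z} + \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_p(z)\,u_p^{\top}(z^{-1})}{a(z)\,a(z^{-1})}\,\frac{dz}{z}$$

$$= \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_p(z)\,v_p^{\top}(z)\,c(z)\,\hat c(z)}{a(z)\,\hat a(z)\,b(z)\,\hat b(z)\,z^{r-q}}\,dz + \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_p(z)\,v_p^{\top}(z)}{a(z)\,\hat a(z)}\,dz$$

$$\mathcal{G}_{ab}(\vartheta) = -\frac{1}{2\pi i}\oint_{|z|=1} \frac{u_p(z)\,u_q^{\top}(z^{-1})}{a(z)\,b(z^{-1})}\,\frac{dz}{z} = -\frac{1}{2\pi i}\oint_{|z|=1} \frac{u_p(z)\,v_q^{\top}(z)}{a(z)\,\hat b(z)}\,dz$$

$$\mathcal{G}_{ac}(\vartheta) = -\frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_p(z)\,u_r^{\top}(z^{-1})\,c(z)}{a(z)\,b(z)\,b(z^{-1})}\,\frac{dz}{z} = -\frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_p(z)\,v_r^{\top}(z)\,c(z)}{a(z)\,b(z)\,\hat b(z)\,z^{r-q}}\,dz$$

$$\mathcal{G}_{cc}(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_r(z)\,u_r^{\top}(z^{-1})}{b(z)\,b(z^{-1})}\,\frac{dz}{z} = \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_r(z)\,v_r^{\top}(z)}{b(z)\,\hat b(z)\,z^{r-q}}\,dz$$

$$\mathcal{G}_{bb}(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_q(z)\,u_q^{\top}(z^{-1})}{b(z)\,b(z^{-1})}\,\frac{dz}{z} = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_q(z)\,v_q^{\top}(z)}{b(z)\,\hat b(z)}\,dz, \quad\text{and}\quad \mathcal{G}_{cb}(\vartheta) = O$$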

where Rx(z) is the spectral density of the process x(t), defined in (10).
Let $K(z) = a(z)\,a(z^{-1})\,b(z)\,b(z^{-1})$. Combining all the expressions in (63) leads to the following representation of
G(ϑ) as the sum of two matrices

$$\frac{1}{2\pi i}\oint_{|z|=1} \frac{R_x(z)}{K(z)} \begin{pmatrix} c(z)\,u_p(z) \\ -a(z)\,u_r(z) \\ O \end{pmatrix} \begin{pmatrix} c(z)\,u_p(z) \\ -a(z)\,u_r(z) \\ O \end{pmatrix}^{*} \frac{dz}{z} + \frac{1}{2\pi i}\oint_{|z|=1} \frac{1}{K(z)} \begin{pmatrix} b(z)\,u_p(z) \\ O \\ -a(z)\,u_q(z) \end{pmatrix} \begin{pmatrix} b(z)\,u_p(z) \\ O \\ -a(z)\,u_q(z) \end{pmatrix}^{*} \frac{dz}{z} \tag{64}$$

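where (X)* is the complex conjugate transpose of the matrix $X \in \mathbb{C}^{m\times n}$. As in (23) we set forth

$$\mathcal{S}(-c, a) = \begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \end{pmatrix}$$

where $\mathcal{S}_p(c)$ is formed by the top p rows of $\mathcal{S}(-c, a)$. In a similar way we decompose

$$\mathcal{S}(-b, a) = \begin{pmatrix} -\mathcal{S}_p(b) \\ \mathcal{S}_q(a) \end{pmatrix}$$

The representation (64) can be expressed by the appropriate block representations of the Sylvester
resultant matrices, to obtain

$$\mathcal{G}(\vartheta) = \begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix} W(\vartheta) \begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix}^{\top} + \begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix} \wp(\vartheta) \begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix}^{\top} \tag{65}$$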

where the matrix ℘(ϑ) is given in (28) and the matrix $W(\vartheta) \in \mathbb{R}^{(p+r)\times(p+r)}$ is of the form

$$W(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_{p+r}(z)\,u_{p+r}^{\top}(z^{-1})}{a(z)\,a(z^{-1})\,b(z)\,b(z^{-1})}\,\frac{dz}{z} = \frac{1}{2\pi i}\oint_{|z|=1} R_x(z)\,\frac{u_{p+r}(z)\,v_{p+r}^{\top}(z)}{a(z)\,b(z)\,\hat a(z)\,\hat b(z)}\,dz \tag{66}$$

It is shown in [4] that W(ϑ) ≻ O. As can be seen in (65), the ARMAX part is explained by the first
term, whereas the ARMA part is described by the second term; the combination of both terms
summarizes the Fisher information of an ARMAX(p, r, q) process. The FIM G(ϑ) in the form (65) allows
us to prove the following property, Theorem 3.1 in [4]. The FIM G(ϑ) of the ARMAX(p, r, q) process
with polynomials a(z), c(z) and b(z) of order p, r, q respectively becomes singular if and only if these
polynomials have at least one common root. Consequently, the class of resultant matrices is extended
by the FIM G(ϑ).

3.7. The Stein Equation - The Fisher Information Matrix of an ARMAX(p, r, q) Process

In Lemma 3.5 it is proved that the matrix ℘(ϑ) in (28) fulfills the Stein equation (52). We now consider
the conditions under which the matrix W(ϑ) in (66) verifies an appropriate Stein equation. For that
purpose we consider the spectral density to be of the form $R_x(z) = 1/(h(z)\,h(z^{-1}))$. The degree of the
polynomial h(z) is ℓ and we assume the distinct roots of the polynomial h(z) to lie outside the unit
circle; consequently, the roots of the polynomial ĥ(z) lie within the unit circle. We therefore rewrite
W(ϑ) accordingly

$$W(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_{p+r}(z)\,u_{p+r}^{\top}(z^{-1})}{h(z)\,h(z^{-1})\,a(z)\,a(z^{-1})\,b(z)\,b(z^{-1})}\,\frac{dz}{z}$$

We consider a companion matrix of the form (48) with size p + q + ℓ; it is denoted by Cf and the
entries fi are given by $z^{p+q+\ell} + \sum_{i=1}^{p+q+\ell} f_i(\vartheta)\,z^{p+q+\ell-i} = \hat a(z)\,\hat b(z)\,\hat h(z) = \hat f(z, \vartheta)$, and f̂(ϑ) is the vector
$\hat f(\vartheta) = (f_{p+q+\ell}(\vartheta), f_{p+q+\ell-1}(\vartheta), \ldots, f_1(\vartheta))^{\top}$. Likewise, f(z, ϑ) = a(z)b(z)h(z) and
$f(\vartheta) = (f_1(\vartheta), f_2(\vartheta), \ldots, f_{p+q+\ell}(\vartheta))^{\top}$. The properties $\operatorname{Det}(zI_{p+q+\ell} - C_f) = \hat a(z)\,\hat b(z)\,\hat h(z)$ and
$\operatorname{Det}(I_{p+q+\ell} - zC_f) = a(z)\,b(z)\,h(z)$ hold. Assume

$$r = q + \ell \quad\text{or}\quad p + q + \ell = p + r \quad\text{and}\quad r > q \tag{67}$$

W(ϑ) is then of the form

$$W(\vartheta) = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_{p+r}(z)\,v_{p+r}^{\top}(z)}{h(z)\,\hat h(z)\,a(z)\,\hat a(z)\,b(z)\,\hat b(z)}\,dz \tag{68}$$

We formulate a Stein equation with the matrix $\Gamma = e_{p+r} e_{p+r}^{\top}$, which is of the form

$$S - C_f S C_f^{\top} = e_{p+r} e_{p+r}^{\top} \tag{69}$$

where $e_{p+r}$ is the last standard basis column vector in $\mathbb{R}^{p+r}$. The next lemma is formulated.

3.7.1. Lemma 3.8

The matrix W(ϑ) given in (68) fulfills the Stein equation (69).

Proof

The unique solution of (69) is assured since no product of two eigenvalues of Cf equals
one; the solution is of the form

$$S = \frac{1}{2\pi i}\oint_{|z|=1} (zI_{p+r} - C_f)^{-1}\,e_{p+r} e_{p+r}^{\top}\,(I_{p+r} - zC_f)^{-\top}\,dz$$

or

$$S = \frac{1}{2\pi i}\oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+r} - C_f)\,e_{p+r} e_{p+r}^{\top}\,\operatorname{adj}(I_{p+r} - zC_f)^{\top}}{\hat a(z)\,\hat b(z)\,\hat h(z)\,a(z)\,b(z)\,h(z)}\,dz$$

Taking the structure of the companion matrix Cf into account implies that the last column vector of
adj(zIp+r − Cf) is the basic vector up+r(z) and, consequently, the last column of adj(Ip+r − zCf) is the basic
vector vp+r(z). This yields

$$S = \frac{1}{2\pi i}\oint_{|z|=1} \frac{u_{p+r}(z)\,v_{p+r}^{\top}(z)}{\hat a(z)\,\hat b(z)\,\hat h(z)\,a(z)\,b(z)\,h(z)}\,dz = W(\vartheta)$$

Consequently, the matrix W(ϑ) defined in (68) verifies the Stein equation (69). This completes the proof.

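The matrices ℘(ϑ) and W(ϑ) in (65) verify, under specific conditions, appropriate Stein equations,
as has been shown in Lemma 3.5 and Lemma 3.8, respectively. We now confirm the presence of
Vandermonde matrices by applying the standard residue theorem to W(ϑ) in (68), to obtain

$$W(\vartheta) = V_{\alpha\beta\xi}\,R(\vartheta)\,\hat V_{\alpha\beta\xi} \tag{70}$$

The (p + r) × (p + r) diagonal matrix R(ϑ) is of the form

$$R(\vartheta) = \operatorname{diag}\left\{ \frac{1}{\hat a(z; \alpha_i)\,\hat b(\alpha_i)\,\hat h(\alpha_i)\,\phi(\alpha_i)},\ \frac{1}{\hat a(\beta_j)\,\hat b(z; \beta_j)\,\hat h(\beta_j)\,\phi(\beta_j)},\ \frac{1}{\hat a(\xi_k)\,\hat b(\xi_k)\,\hat h(z; \xi_k)\,\phi(\xi_k)} \right\}$$

where ϕ(z) = a(z)b(z)h(z) and i = 1, . . . , p, j = 1, . . . , q and k = 1, . . . , ℓ, whereas the (p + r) × (p + r)
matrices Vαβξ and V̂αβξ are of the form

$$V_{\alpha\beta\xi} = \begin{pmatrix}
1 & \alpha_1 & \alpha_1^{2} & \cdots & \alpha_1^{p+r-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \alpha_p & \alpha_p^{2} & \cdots & \alpha_p^{p+r-1} \\
1 & \beta_1 & \beta_1^{2} & \cdots & \beta_1^{p+r-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \beta_q & \beta_q^{2} & \cdots & \beta_q^{p+r-1} \\
1 & \xi_1 & \xi_1^{2} & \cdots & \xi_1^{p+r-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \xi_\ell & \xi_\ell^{2} & \cdots & \xi_\ell^{p+r-1}
\end{pmatrix}^{\top}, \qquad
\hat V_{\alpha\beta\xi} = \begin{pmatrix}
\alpha_1^{p+r-1} & \alpha_1^{p+r-2} & \cdots & \alpha_1 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\alpha_p^{p+r-1} & \alpha_p^{p+r-2} & \cdots & \alpha_p & 1 \\
\beta_1^{p+r-1} & \beta_1^{p+r-2} & \cdots & \beta_1 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\beta_q^{p+r-1} & \beta_q^{p+r-2} & \cdots & \beta_q & 1 \\
\xi_1^{p+r-1} & \xi_1^{p+r-2} & \cdots & \xi_1 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\xi_\ell^{p+r-1} & \xi_\ell^{p+r-2} & \cdots & \xi_\ell & 1
\end{pmatrix}$$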
The (p + r) × (p + r) Vandermonde matrices Vαβξ and V̂αβξ are nonsingular when αi ≠ αj, βk ≠ βh,
ξm ≠ ξn, αi ≠ βk, αi ≠ ξm and βk ≠ ξm for all i, j = 1, . . . , p with i ≠ j, k, h = 1, . . . , q with k ≠ h and m, n = 1, . . . , ℓ with m ≠ n. The
Vandermonde determinants Det Vαβξ and Det V̂αβξ are

$$\operatorname{Det} V_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)/2}\,\Psi(\alpha_i, \beta_k, \xi_m)$$

where

$$\Psi(\alpha_i, \beta_k, \xi_m) = \prod_{1 \le i < j \le p} (\alpha_i - \alpha_j) \prod_{1 \le k < h \le q} (\beta_k - \beta_h) \prod_{1 \le m < n \le \ell} (\xi_m - \xi_n) \prod_{\substack{r = 1, \ldots, p \\ s = 1, \ldots, q}} (\alpha_r - \beta_s) \prod_{\substack{r = 1, \ldots, p \\ w = 1, \ldots, \ell}} (\alpha_r - \xi_w) \prod_{\substack{s = 1, \ldots, q \\ w = 1, \ldots, \ell}} (\beta_s - \xi_w)$$

As for the Vandermonde matrices Vαβ and $\hat V_{\alpha\beta}^{\top}$,

$$\operatorname{Det} \hat V_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)}\,\Psi(\alpha_i, \beta_k, \xi_m) \ \Rightarrow\ \left|\operatorname{Det} V_{\alpha\beta\xi}\right| = \left|\operatorname{Det} \hat V_{\alpha\beta\xi}\right|$$

Equation (70) is the ARMAX equivalent of (47). A combination of both equations generates a new representation
of the FIM G(ϑ), which is set forth in the following lemma.

3.7.2. Lemma 3.9

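Assume that the conditions (67) hold and consider the representations of ℘(ϑ) and W(ϑ) in (47) and (70)
respectively. This leads to an alternative form of (65), given by

$$\mathcal{G}(\vartheta) = \begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix} V_{\alpha\beta\xi}\,R(\vartheta)\,\hat V_{\alpha\beta\xi} \begin{pmatrix} -\mathcal{S}_p(c) \\ \mathcal{S}_r(a) \\ O \end{pmatrix}^{\top} + \begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix} V_{\alpha\beta}\,D(\vartheta)\,\hat V_{\alpha\beta} \begin{pmatrix} -\mathcal{S}_p(b) \\ O \\ \mathcal{S}_q(a) \end{pmatrix}^{\top}$$

In Lemma 3.9, the FIM G(ϑ) is expressed by submatrices of two Sylvester matrices and various
Vandermonde matrices; both types of matrices become singular if and only if the appropriate
polynomials have at least one common root.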

3.8. The Fisher Information Matrix of a Vector ARMA(p, q) Process

The process (5) is summarized as

$$A(z)\,y(t) = B(z)\,\epsilon(t)$$

and we assume that {y(t), t ∈ N} is a zero mean Gaussian time series and {ϵ(t), t ∈ N} is an n-dimensional
vector random variable such that Eϑ{ϵ(t)} = 0 and Eϑ{ϵ(t)ϵ⊤(t)} = Σ, and the parameter vector ϑ is of the form (7).
In [6] it is shown that
representation (16) for the n²(p + q) × n²(p + q) asymptotic FIM of the VARMA process (6) is

$$F(\vartheta) = E_{\vartheta}\left\{ \left(\frac{\partial \epsilon}{\partial \vartheta^{\top}}\right)^{\top} \Sigma^{-1} \left(\frac{\partial \epsilon}{\partial \vartheta^{\top}}\right) \right\} \tag{71}$$

where ∂ϵ/∂ϑ⊤ is of size n × n²(p + q) and, for convenience, t is omitted from ϵ(t). Using the differential
rules outlined in [6] yields

$$\frac{\partial \epsilon}{\partial \vartheta^{\top}} = \left\{ \left(A^{-1}(z)\,B(z)\,\epsilon\right)^{\top} \otimes B^{-1}(z) \right\} \frac{\partial \operatorname{vec} A(z)}{\partial \vartheta^{\top}} - \left(\epsilon^{\top} \otimes B^{-1}(z)\right) \frac{\partial \operatorname{vec} B(z)}{\partial \vartheta^{\top}} \tag{72}$$

The substitution of representation (72) of ∂ϵ/∂ϑ⊤ in (71) yields the FIM of a VARMA process. The
purpose is to construct a factorization of the FIM F(ϑ) that is a multiple variant of the
factorization (27), so that a multiple resultant matrix property can be proved for F(ϑ). As illustrated
in [6], the multiple version of the Sylvester resultant matrix (22) does not fulfill the multiple resultant
matrix property: even when the matrix polynomials A(z) and B(z) have a common zero or
a common eigenvalue, the multiple Sylvester matrix is not necessarily singular. This has also been
illustrated in [3]. In order to consider a multiple equivalent to the resultant matrix S(−b, a), Gohberg
and Lerer set forth the n²(p + q) × n²(p + q) tensor Sylvester matrix


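$$
S^{\otimes}(-B, A) :=
\begin{pmatrix}
(-I_n)\otimes I_n & (-B_1)\otimes I_n & \cdots & (-B_q)\otimes I_n & O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} \\
O_{n^2\times n^2} & \ddots & \ddots & & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & & \ddots & O_{n^2\times n^2} \\
O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} & (-I_n)\otimes I_n & (-B_1)\otimes I_n & \cdots & (-B_q)\otimes I_n \\
I_n\otimes I_n & I_n\otimes A_1 & \cdots & I_n\otimes A_p & O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} \\
O_{n^2\times n^2} & \ddots & \ddots & & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & & \ddots & O_{n^2\times n^2} \\
O_{n^2\times n^2} & \cdots & O_{n^2\times n^2} & I_n\otimes I_n & I_n\otimes A_1 & \cdots & I_n\otimes A_p
\end{pmatrix}
\tag{73}
$$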

In [3], the authors prove that the tensor Sylvester matrix S⊗(−B, A) fulfills the multiple resultant property: it becomes singular if and only if the matrix polynomials A(z) and B(z) have at least one common zero. In Proposition 2.2 in [6], the following factorized form of the Fisher information F(ϑ) is developed

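$$
F(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \Phi(z)\,\Theta(z)\,\Phi^{*}(z)\, \frac{dz}{z}
\tag{74}
$$

where

$$
\Phi(z) =
\begin{pmatrix}
I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2\times qn^2} \\
O_{qn^2\times pn^2} & I_q \otimes I_n \otimes A^{-1}(z)
\end{pmatrix}
S^{\otimes}(-B, A)\,\left(u_{p+q}(z) \otimes I_{n^2}\right)
$$

and

$$
\Theta(z) = \Sigma \otimes \sigma(z), \qquad \sigma(z) = B^{-\top}(z)\,\Sigma^{-1}\,B^{-1}(z^{-1})
\tag{75}
$$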

In order to obtain a multiple variant of (27), the following matrix is introduced,

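$$
M(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \Lambda(z)\,J(z)\,\Lambda^{*}(z)\, \frac{dz}{z}
= S^{\otimes}(-B, A)\,P(\vartheta)\,\left(S^{\otimes}(-B, A)\right)^{\top}
\tag{76}
$$

where

$$
J(z) = \Phi(z)\,\Theta(z)\,\Phi^{*}(z) \quad \text{and} \quad
\Lambda(z) =
\begin{pmatrix}
I_p \otimes A(z) \otimes I_n & O_{pn^2\times qn^2} \\
O_{qn^2\times pn^2} & I_q \otimes I_n \otimes A(z)
\end{pmatrix}
$$

and the matrix P(ϑ) is a multiple variant of the matrix ℘(ϑ) in (28). It is of the form

$$
P(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \left(u_{p+q}(z) \otimes I_{n^2}\right) \Theta(z) \left(u_{p+q}(z) \otimes I_{n^2}\right)^{*} \frac{dz}{z}
\tag{77}
$$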

In Lemma 2.3 in [6], it is proved that the matrix M(ϑ) in (76) becomes singular if and only if the matrix polynomials A(z) and B(z) have at least one common eigenvalue-zero. The proof is a multiple equivalent of the proof of Corollary 2.2 in [5], since the equality (76) is a multiple version of (27). Consequently, the matrix M(ϑ), like the tensor Sylvester matrix S⊗(−B, A), fulfills the multiple resultant matrix property. Since the matrix M(ϑ) is derived from the FIM F(ϑ), this enables us to prove that the matrix F(ϑ) fulfills the multiple resultant matrix property by showing that it becomes singular if and only if the matrix M(ϑ) is singular; this is done in Proposition 2.4 in [6]. It can therefore be concluded from [6] that the FIM of a VARMA process F(ϑ) and the tensor Sylvester matrix S⊗(−B, A) have the same singularity conditions. The FIM of a VARMA process F(ϑ) can thus be added to the class of multiple resultant matrices.


A brief summary of the contribution of [6] follows. In order to show that the FIM of a VARMA process F(ϑ) is a multiple resultant matrix, two new representations of the FIM are derived. To construct such representations, appropriate matrix differential rules are applied. The newly obtained representations are expressed in terms of the multiple Sylvester matrix and the tensor Sylvester matrix. The representation of the FIM expressed by the tensor Sylvester matrix is used to prove that the FIM becomes singular if and only if the autoregressive and moving average matrix polynomials have at least one common eigenvalue. It then follows that the FIM and the tensor Sylvester matrix have equivalent singularity conditions. In a numerical example it is shown, however, that the FIM fails to detect common eigenvalues due to some kind of numerical instability. The tensor Sylvester matrix reveals them clearly, proving the usefulness of the results derived in this paper.

3.9. The Fisher Information Matrix of a Vector ARMAX(p, r, q) Process

The n²(p + q + r) × n²(p + q + r) asymptotic FIM of the VARMAX(p, r, q) process (2)

A(z)y(t) = C(z)x(t) + B(z)ϵ(t)

is displayed according to [23] and is an extension of the FIM of the VARMA(p, q) process (6). Representation (16) of the FIM of the VARMAX(p, r, q) process is then

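$$
G(\vartheta) = E_{\vartheta}\left\{ \left(\frac{\partial \epsilon}{\partial \vartheta^{\top}}\right)^{\top} \Sigma^{-1} \left(\frac{\partial \epsilon}{\partial \vartheta^{\top}}\right) \right\}
$$

where

$$
\begin{aligned}
\frac{\partial \epsilon}{\partial \vartheta^{\top}} ={}&
\left\{ \left(A^{-1}(z)C(z)x\right)^{\top} \otimes B^{-1}(z) \right\} \frac{\partial\, \mathrm{vec}\, A(z)}{\partial \vartheta^{\top}}
+ \left\{ \left(A^{-1}(z)B(z)\epsilon\right)^{\top} \otimes B^{-1}(z) \right\} \frac{\partial\, \mathrm{vec}\, A(z)}{\partial \vartheta^{\top}} \\
& - \left\{ x^{\top} \otimes B^{-1}(z) \right\} \frac{\partial\, \mathrm{vec}\, C(z)}{\partial \vartheta^{\top}}
- \left( \epsilon^{\top} \otimes B^{-1}(z) \right) \frac{\partial\, \mathrm{vec}\, B(z)}{\partial \vartheta^{\top}}
\end{aligned}
\tag{78}
$$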

To obtain the term ∂ϵ/∂ϑ⊤, of size n × n²(p + q + r), the same differential rules are applied as for the VARMA(p, q) process. In Proposition 2.3 in [23], the FIM of a VARMAX process is expressed in terms of tensor Sylvester matrices; this representation is obtained by substituting ∂ϵ/∂ϑ⊤ from (78) in (16), which gives

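$$
G(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \Phi_x(z)\,\Theta(z)\,\Phi_x^{*}(z)\, \frac{dz}{z}
+ \frac{1}{2\pi i} \oint_{|z|=1} \Lambda_x(z)\,\Psi(z)\,\Lambda_x^{*}(z)\, \frac{dz}{z}
\tag{79}
$$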

The matrices in (79) are of the form

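$$
\Phi_x(z) =
\begin{pmatrix}
I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
O_{rn^2\times pn^2} & O_{rn^2\times rn^2} & O_{rn^2\times qn^2} \\
O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & I_q \otimes I_n \otimes A^{-1}(z)
\end{pmatrix}
\begin{pmatrix}
-S^{\otimes}_{p}(B) \\ O_{rn^2\times n^2(p+q)} \\ S^{\otimes}_{q}(A)
\end{pmatrix}
\left(u_{p+q}(z) \otimes I_{n^2}\right)
$$

$$
\Lambda_x(z) =
\begin{pmatrix}
I_p \otimes A^{-1}(z) \otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
O_{rn^2\times pn^2} & I_r \otimes I_n \otimes A^{-1}(z) & O_{rn^2\times qn^2} \\
O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & O_{qn^2\times qn^2}
\end{pmatrix}
\begin{pmatrix}
-S^{\otimes}_{p}(C) \\ S^{\otimes}_{r}(A) \\ O_{qn^2\times n^2(p+r)}
\end{pmatrix}
\left(u_{p+r}(z) \otimes I_{n^2}\right)
$$

$$
S^{\otimes}_{p,q}(-B, A) =
\begin{pmatrix} -S^{\otimes}_{p}(B) \\ S^{\otimes}_{q}(A) \end{pmatrix},
\qquad
S^{\otimes}_{p,r}(-C, A) =
\begin{pmatrix} -S^{\otimes}_{p}(C) \\ S^{\otimes}_{r}(A) \end{pmatrix}
\tag{80}
$$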

Additionally, we have Ψ(z) = Rx(z) ⊗ σ(z), where the Hermitian spectral density matrix Rx(z) is defined in (10), whereas the matrix polynomials Θ(z) and σ(z) are presented in (75). In (80), we have the pn² × (p + q)n² and qn² × (p + q)n² submatrices $S^{\otimes}_{p}(-B)$ and $S^{\otimes}_{q}(A)$ of the tensor Sylvester resultant matrix $S^{\otimes}_{p,q}(-B, A)$, whereas the matrices $S^{\otimes}_{p}(-C)$ and $S^{\otimes}_{r}(A)$ are the upper and lower blocks of the (p + r)n² × (p + r)n² tensor Sylvester resultant matrix $S^{\otimes}_{p,r}(-C, A)$. As for the FIM of the VARMA(p, q) process, the objective is to construct a multiple version of (65); this is done in [23], to obtain


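$$
\begin{aligned}
M_x(\vartheta) ={}& \frac{1}{2\pi i} \oint_{|z|=1} L(z)\,\mathcal{A}(z)\,L^{*}(z)\, \frac{dz}{z}
+ \frac{1}{2\pi i} \oint_{|z|=1} W(z)\,\mathcal{B}(z)\,W^{*}(z)\, \frac{dz}{z} \\
={}&
\begin{pmatrix} -S^{\otimes}_{p}(B) \\ O_{rn^2\times n^2(p+q)} \\ S^{\otimes}_{q}(A) \end{pmatrix}
P(\vartheta)
\begin{pmatrix} -S^{\otimes}_{p}(B) \\ O_{rn^2\times n^2(p+q)} \\ S^{\otimes}_{q}(A) \end{pmatrix}^{\top}
+
\begin{pmatrix} -S^{\otimes}_{p}(C) \\ S^{\otimes}_{r}(A) \\ O_{qn^2\times n^2(p+r)} \end{pmatrix}
T(\vartheta)
\begin{pmatrix} -S^{\otimes}_{p}(C) \\ S^{\otimes}_{r}(A) \\ O_{qn^2\times n^2(p+r)} \end{pmatrix}^{\top}
\end{aligned}
\tag{81}
$$

The matrices involved are of the form

$$
L(z) =
\begin{pmatrix}
I_p \otimes A(z) \otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
O_{rn^2\times pn^2} & O_{rn^2\times rn^2} & O_{rn^2\times qn^2} \\
O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & I_q \otimes I_n \otimes A(z)
\end{pmatrix}
\quad \text{and} \quad
\mathcal{A}(z) := \Phi_x(z)\,\Theta(z)\,\Phi_x^{*}(z)
$$

$$
W(z) =
\begin{pmatrix}
I_p \otimes A(z) \otimes I_n & O_{pn^2\times rn^2} & O_{pn^2\times qn^2} \\
O_{rn^2\times pn^2} & I_r \otimes I_n \otimes A(z) & O_{rn^2\times qn^2} \\
O_{qn^2\times pn^2} & O_{qn^2\times rn^2} & O_{qn^2\times qn^2}
\end{pmatrix}
\quad \text{and} \quad
\mathcal{B}(z) := \Lambda_x(z)\,\Psi(z)\,\Lambda_x^{*}(z)
$$

$$
T(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \left(u_{p+r}(z) \otimes I_{n^2}\right) \Psi(z) \left(u_{p+r}(z) \otimes I_{n^2}\right)^{*} \frac{dz}{z}
$$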

and P(ϑ) is given in (77). Note, the matrices Φx(z), Λx(z), L(z) and P (z) are the corrected versions of
the corresponding matrices in [23].
A parallel between the scalar and multiple structures is straightforward. This is best illustrated by comparing the representations (27) and (28) with (76) and (77), respectively, which confronts the FIM of scalar and vector ARMA(p, q) processes. The FIM of the scalar ARMAX(p, r, q) process contains an ARMA(p, q) part; this is confirmed by (65), through the presence of the matrix ℘(ϑ), which is originally displayed in (28). The multiple resultant matrices M(ϑ) and Mx(ϑ), derived from the FIM of the VARMA(p, q) and VARMAX(p, r, q) processes, respectively, both contain P(ϑ), whereas the first matrix terms of the matrices Φ(z) and Φx(z), which are of different size, consist of the same nonzero submatrices. To summarize, in [23] compact forms of the FIM of a VARMAX process, expressed in terms of multiple and tensor Sylvester matrices, are developed. The tensor Sylvester matrices allow us to investigate the multiple resultant matrix property of the FIM of VARMAX(p, r, q) processes. However, since the multiple resultant matrix property of the FIM G(ϑ) has not been proved yet, it is formulated as a conjecture: the FIM G(ϑ) of a VARMAX(p, r, q) process becomes singular if and only if the matrix polynomials A(z), B(z) and C(z) have at least one common eigenvalue. A proof can be envisaged as a multiple equivalent of Theorem 3.1 in [4], combined with Proposition 2.4 in [6] but based on the representations (79) and (81); this will be a subject for future study.

4. Conclusions

In this survey paper, matrix algebraic properties of the FIM of stationary processes are discussed. The presented material is a summary of papers in which several matrix structural aspects of the FIM are investigated. The FIM of scalar and multiple processes like the (V)ARMA(X) is set forth with appropriate factorized forms involving (tensor) Sylvester matrices. These representations enable us to prove the resultant matrix property of the corresponding FIM. This has been done for (V)ARMA(p, q) and ARMAX(p, r, q) processes in the papers [4,6]. The development of the stages that lead to the appropriate factorized form (79) of the FIM G(ϑ) is set forth in [23]. However, no proof yet confirms the multiple resultant matrix property of the FIM G(ϑ) of a VARMAX(p, r, q) process. This justifies the conjecture formulated in the previous section, which can be a subject for future study.
The statistical distance measure derived in [7] involves entries of the FIM. This distance measure can be a challenge to its quantum information counterpart (41), because (36) involves information about m parameters estimated from n measurements, whereas in quantum information, as in e.g. [8,10], the information about one parameter in a particular measurement procedure is considered for establishing an interconnection with the appropriate statistical distance measure. A possible approach for developing a statistical distance measure in quantum information or quantum statistics at the matrix level, by combining matrix algebra and quantum information, can be a subject of future research. Some results concerning interconnections between the FIM of ARMA(X) models and appropriate solutions to Stein matrix equations are discussed; the material is extracted from the papers [12] and [13]. However, in this paper, some alternative and new proofs that emphasize the conditions under which the FIM fulfills appropriate Stein equations are set forth. The presence of various types of Vandermonde matrices is also emphasized when an explicit expansion of the FIM is computed. These Vandermonde matrices are inserted in interconnections with appropriate solutions to Stein equations. This explains why, when the matrix algebraic structures of the FIM of stationary processes are investigated, the involvement of structured matrices like the (tensor) Sylvester, Bezoutian and Vandermonde matrices is essential.

Acknowledgments: The author thanks a perceptive reviewer for his comments which significantly improved the
quality and presentation of the paper.

Conflicts of Interest: The authors have declared no conflict of interest.

References

1. Dym, H. Linear Algebra in Action; American Mathematical Society: Providence, RI, USA, 2006; Volume 78.
2. Lancaster, P.; Tismenetsky, M. The Theory of Matrices with Applications, 2nd ed.; Academic Press: Orlando, FL, USA, 1985.
3. Gohberg, I.; Lerer, L. Resultants of matrix polynomials. Bull. Am. Math. Soc. 1976, 82, 565–567.
4. Klein, A.; Spreij, P. On Fisher’s information matrix of an ARMAX process and Sylvester’s resultant matrices. Linear Algebra Appl. 1996, 237/238, 579–590.
5. Klein, A.; Spreij, P. On Fisher’s information matrix of an ARMA process. In Stochastic Differential and Difference Equations; Csiszar, I., Michaletzky, Gy., Eds.; Birkhäuser: Boston, MA, USA, 1997; Progress in Systems and Control Theory; Volume 23, pp. 273–284.
6. Klein, A.; Mélard, G.; Spreij, P. On the Resultant Property of the Fisher Information Matrix of a Vector ARMA process. Linear Algebra Appl. 2005, 403, 291–313.
7. Klein, A.; Spreij, P. Transformed Statistical Distance Measures and the Fisher Information Matrix. Linear Algebra Appl. 2012, 437, 692–712.
8. Braunstein, S.L.; Caves, C.M. Statistical Distance and the Geometry of Quantum States. Phys. Rev. Lett. 1994, 72, 3439–3443.
9. Jones, P.J.; Kok, P. Geometric derivation of the quantum speed limit. Phys. Rev. A 2010, 82, 022107.
10. Kok, P. Tutorial: Statistical distance and Fisher information; Oxford: UK, 2006.
11. Lancaster, P.; Rodman, L. Algebraic Riccati Equations; Clarendon Press: Oxford, UK, 1995.
12. Klein, A.; Spreij, P. On Stein’s equation, Vandermonde matrices and Fisher’s information matrix of time series processes. Part I: The autoregressive moving average process. Linear Algebra Appl. 2001, 329, 9–47.
13. Klein, A.; Spreij, P. On the solution of Stein’s equation and Fisher’s information matrix of an ARMAX process. Linear Algebra Appl. 2005, 396, 1–34.
14. Grenander, U.; Szegő, G.P. Toeplitz Forms and Their Applications; University of California Press: New York, NY, USA, 1958.
15. Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods, 2nd ed.; Springer Verlag: Berlin, Germany; New York, NY, USA, 1991.
16. Caines, P. Linear Stochastic Systems; John Wiley and Sons: New York, NY, USA, 1988.
17. Ljung, L.; Söderström, T. Theory and Practice of Recursive Identification; M.I.T. Press: Cambridge, MA, USA, 1983.
18. Hannan, E.J.; Deistler, M. The Statistical Theory of Linear Systems; John Wiley and Sons: New York, NY, USA, 1988.
19. Hannan, E.J.; Dunsmuir, W.T.M.; Deistler, M. Estimation of vector Armax models. J. Multivar. Anal. 1980, 10, 275–295.
20. Horn, R.A.; Johnson, C.R. Topics in Matrix Analysis; Cambridge University Press: New York, NY, USA, 1995.
21. Klein, A.; Spreij, P. Matrix differential calculus applied to multiple stationary time series and an extended Whittle formula for information matrices. Linear Algebra Appl. 2009, 430, 674–691.
22. Klein, A.; Mélard, G. An algorithm for the exact Fisher information matrix of vector ARMAX time series. Linear Algebra Appl. 2014, 446, 1–24.
23. Klein, A.; Spreij, P. Tensor Sylvester matrices and the Fisher information matrix of VARMAX processes. Linear Algebra Appl. 2010, 432, 1975–1989.
24. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
25. Ibragimov, I.A.; Has’minskiĭ, R.Z. Statistical Estimation. In Asymptotic Theory; Springer-Verlag: New York, NY, USA, 1981.
26. Lehmann, E.L. Theory of Point Estimation; Wiley: New York, NY, USA, 1983.
27. Friedlander, B. On the computation of the Cramér-Rao bound for ARMA parameter estimation. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 721–727.
28. Holevo, A.S. Probabilistic and Statistical Aspects of Quantum Theory, 2nd ed.; Edizioni Della Normale, SNS Pisa: Pisa, Italy, 2011.
29. Petz, D. Quantum Information Theory and Quantum Statistics; Springer-Verlag: Berlin Heidelberg, Germany, 2008.
30. Barndorff-Nielsen, O.E.; Gill, R.D. Fisher Information in quantum statistics. J. Phys. A 2000, 30, 4481–4490.
31. Luo, S. Wigner-Yanase skew information vs. quantum Fisher information. Proc. Amer. Math. Soc. 2004, 132, 885–890.
32. Klein, A.; Mélard, G. On algorithms for computing the covariance matrix of estimates in autoregressive moving average processes. Comput. Stat. Q. 1989, 5, 1–9.
33. Klein, A.; Mélard, G. An algorithm for computing the asymptotic Fisher information matrix for seasonal SISO models. J. Time Ser. Anal. 2004, 25, 627–648.
34. Bistritz, Y.; Lifshitz, A. Bounds for resultants of univariate and bivariate polynomials. Linear Algebra Appl. 2010, 432, 1995–2005.
35. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: New York, NY, USA, 1996.
36. Golub, G.H.; van Loan, C.F. Matrix Computations, 3rd ed.; Johns Hopkins University Press: Baltimore, MD, USA, 1996.
37. Kullback, S. Information Theory and Statistics; John Wiley and Sons: New York, NY, USA, 1959.
38. Klein, A.; Spreij, P. The Bezoutian, state space realizations and Fisher’s information matrix of an ARMA process. Linear Algebra Appl. 2006, 416, 160–174.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
Asymptotically Constant-Risk Predictive Densities
When the Distributions of Data and Target Variables
Are Different

Keisuke Yano 1,* and Fumiyasu Komaki 1,2

1 Department of Mathematical Informatics, Graduate School of Information Science and Technology,
The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan; E-Mail:
komaki@mist.i.u-tokyo.ac.jp
2 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
*
E-Mail: Keisuke_Yano@mist.i.u-tokyo.ac.jp; Tel.: +81-3-5841-6909.

Received: 28 March 2014; in revised form: 9 May 2014 / Accepted: 22 May 2014 /
Published: 28 May 2014

Abstract: We investigate the asymptotic construction of constant-risk Bayesian predictive densities
under the Kullback–Leibler risk when the distributions of data and target variables are different and
have a common unknown parameter. It is known that the Kullback–Leibler risk is asymptotically
equal to a trace of the product of two matrices: the inverse of the Fisher information matrix for the
data and the Fisher information matrix for the target variables. We assume that the trace has a unique
maximum point with respect to the parameter. We construct asymptotically constant-risk Bayesian
predictive densities using a prior depending on the sample size. Further, we apply the theory to the
subminimax estimator problem and the prediction based on the binary regression model.

Keywords:
Bayesian prediction; Fisher information; Kullback–Leibler divergence; minimax;
predictive metric; subminimax estimator

1. Introduction

Let x(N) = (x1, · · · , xN) be N independent observations distributed according to a probability density, p(x|θ), that belongs to a d-dimensional parametric model, {p(x|θ) : θ ∈ Θ}, where θ = (θ1, · · · , θd) is an unknown d-dimensional parameter and Θ is the parameter space. Let y be a target variable distributed according to a probability density, q(y|θ), that belongs to a d-dimensional parametric model, {q(y|θ) : θ ∈ Θ}, with the same parameter, θ. Here, we assume that the distributions of the data and the target variables, p(x|θ) and q(y|θ), are different. For simplicity, we assume that the data and the target variables are independent given θ.
We construct predictive densities for target variables based on the data. We measure the performance of the predictive density, q̂(y; x(N)), by the Kullback–Leibler divergence, D(q(·|θ), q̂(·; x(N))), from the true density, q(y|θ), to the predictive density, q̂(y; x(N)):

$$
D(q(\cdot|\theta), \hat q(\cdot; x^{(N)})) = \int q(y|\theta) \log \frac{q(y|\theta)}{\hat q(y; x^{(N)})}\, dy.
$$

Then, the risk function, R(θ, ˆq(y; x(N))), of the predictive density, ˆq(y; x(N)), is given by:

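$$
\begin{aligned}
R(\theta, \hat q(y; x^{(N)}))
&= \int p(x^{(N)}|\theta)\, D(q(\cdot|\theta), \hat q(\cdot; x^{(N)}))\, dx^{(N)} \\
&= \int p(x^{(N)}|\theta) \int q(y|\theta) \log \frac{q(y|\theta)}{\hat q(y; x^{(N)})}\, dy\, dx^{(N)}.
\end{aligned}
$$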
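As a small illustration of these two definitions, the sketch below evaluates the Kullback–Leibler risk exactly in a toy case that is not taken from the paper: the data are N Bernoulli(θ) trials, the target is one further Bernoulli(θ) trial, and the predictive density is the rule (x + 1)/(N + 2). Data and target share one distribution here only to keep the example short. The function names are assumptions made for illustration.

```python
import math


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence from Bernoulli(p) to Bernoulli(q)."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:  # the term 0 * log(0 / b) is taken as 0
            total += a * math.log(a / b)
    return total


def laplace_predictive(successes, n):
    """Predictive probability of success after `successes` out of n trials."""
    return (successes + 1) / (n + 2)


def kl_risk(theta, n, predictive):
    """Average the divergence over the distribution of the data.

    The data enter only through the number of successes, which is
    Binomial(n, theta), so the integral over x^(N) becomes a finite sum.
    """
    risk = 0.0
    for successes in range(n + 1):
        prob = (math.comb(n, successes)
                * theta ** successes * (1.0 - theta) ** (n - successes))
        risk += prob * kl_bernoulli(theta, predictive(successes, n))
    return risk


for n in (10, 100, 1000):
    print(n, kl_risk(0.3, n, laplace_predictive))
```

The printed risks shrink as N grows, in line with the leading 1/N behaviour of the risk discussed in the text.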

Entropy 2014, 16, 3026–3048; doi:10.3390/e16063026
www.mdpi.com/journal/entropy

For the construction of predictive densities, we consider the Bayesian predictive density defined by:

$$
\hat q_{\pi}(y|x^{(N)}) = \frac{\int q(y|\theta)\, p(x^{(N)}|\theta)\, \pi(\theta; N)\, d\theta}{\int p(x^{(N)}|\theta)\, \pi(\theta; N)\, d\theta},
$$

where π(θ; N) is a prior density for θ, possibly depending on the sample size, N. Aitchison [1]
showed that, for a given prior density, π(θ; N), the Bayesian predictive density, ˆqπ(y|x(N)), is a Bayes
solution under the Kullback–Leibler risk. Based on the asymptotics as the sample size goes to infinity,
Komaki [2] and Hartigan [3] showed its superiority over any plug-in predictive density, q(y| ˆθ), with
any estimator, ˆθ. However, there remains a problem of prior selection for constructing better Bayesian
predictive densities. Thus, a prior, π(θ; N), must be chosen based on an optimality criterion for actual
applications.
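The ratio of integrals defining the Bayesian predictive density can be evaluated numerically in simple cases. The sketch below is an assumption-laden toy, not an example from the paper: the data are N Bernoulli(θ) trials with x successes, the target y is Bernoulli(θ²) (so the data and target distributions differ but share θ), and the prior is uniform on (0, 1). The function `bayes_predictive` and the midpoint-rule grid size are hypothetical choices.

```python
def bayes_predictive(successes, n, target_success_prob, prior, grid=20000):
    """Predictive probability of y = 1 given the data, by the midpoint rule.

    Numerator:   integral of q(1|theta) p(x|theta) prior(theta) d theta
    Denominator: integral of p(x|theta) prior(theta) d theta
    The binomial coefficient cancels in the ratio and is omitted.
    """
    numerator = 0.0
    denominator = 0.0
    for k in range(grid):
        theta = (k + 0.5) / grid
        weight = theta ** successes * (1.0 - theta) ** (n - successes) * prior(theta)
        numerator += target_success_prob(theta) * weight
        denominator += weight
    return numerator / denominator


x, n = 3, 10
numeric = bayes_predictive(x, n, lambda t: t * t, lambda t: 1.0)

# With a uniform prior the posterior is Beta(x + 1, n - x + 1), whose second
# moment gives the closed form (x + 1)(x + 2) / ((n + 2)(n + 3)).
closed_form = (x + 1) * (x + 2) / ((n + 2) * (n + 3))
print(numeric, closed_form)
```

The closed form is available only because the uniform prior is conjugate here; for a sample-size-dependent prior such as the one constructed later in the paper, the numerical route would be the natural fallback.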
Among various criteria, we focus on a criterion of constructing minimax predictive densities
under the Kullback–Leibler risk. For simplicity, we refer to the priors generating minimax predictive
densities as minimax priors. Minimax priors have been previously studied in various predictive
settings; see [4–8]. When the simultaneous distributions of the target variables and the data belong to
the submodel of the multinomial distributions, Komaki [7] shows that minimax priors are given as
latent information priors maximizing the conditional mutual information between target variables and
the parameter given the data. However, the explicit forms of latent information priors are difficult to
obtain, and we need asymptotic methods, because they require the maximization on the space of the
probability measures on Θ.
Except for [7], these studies on minimax priors are based on the assumption that the distributions,
p(x|θ) and q(y|θ), are identical. Let us consider the prediction based on the logistic regression model
where the covariates of the data and the target variables are not identical. In this predictive setting, the
assumption that the distributions, p(x|θ) and q(y|θ), are identical is no longer valid.
We focus on the minimax priors in predictions where the distributions, p(x|θ) and q(y|θ), are
different and have a common unknown parameter. Such a predictive setting has traditionally been
considered in statistical prediction and experiment design. It has recently been studied in statistical
learning theory; for example, see [9]. Predictive densities where the distributions, p(x|θ) and q(y|θ),
are different and have a common unknown parameter are studied by [10–13].
Let $g^X_{ij}(\theta)$ be the (i, j)-component of the Fisher information matrix of the distribution, p(x|θ), and let $g^Y_{ij}(\theta)$ be the (i, j)-component of the Fisher information matrix of the distribution, q(y|θ). Let $g^{X,ij}(\theta)$ and $g^{Y,ij}(\theta)$ denote the (i, j)-components of their inverse matrices. We adopt Einstein’s summation convention: if the same index appears twice in any one term, it implies summation over that index from one to d. For the asymptotics below, we assume that the prior densities, π(θ; N), are smooth.
In the asymptotics as the sample size N goes to infinity, we construct the asymptotically constant-risk prior, π(θ; N), in the sense that the asymptotic risk:

$$
R(\theta, \hat q_{\pi}(y|x^{(N)})) = \frac{1}{N} R_1(\theta, \hat q_{\pi}(y|x^{(N)})) + \frac{1}{N\sqrt{N}} R_2(\theta, \hat q_{\pi}(y|x^{(N)})) + O(N^{-2})
$$

is constant up to O(N−2). Since the proper prior with the constant risk is a minimax prior for any finite
sample size, the asymptotically constant-risk prior relates to the minimax prior; in Section 4, we verify
that the asymptotically constant-risk prior agrees with the exact minimax prior in binomial examples.
When we use the prior, π(θ), independent of the sample size, N, it is known that the $N^{-1}$-order term, $R_1(\theta, \hat q_{\pi}(y|x^{(N)}))$, of the Kullback–Leibler risk is equal to the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$. If the trace does not depend on the parameter, θ, the construction of the asymptotically constant-risk prior is parallel to [6]; see also [13].
However, we consider the settings where there exists a unique maximum point of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$; for example, these settings appear in predictions based on the binary regression model, where the covariates of the data and the target variables are not identical. In these settings, there do not exist asymptotically constant-risk priors among the priors independent of the sample size, N. The reason is as follows: consider a prior, π(θ), independent of the sample size, N. Then, the Kullback–Leibler risk of the Bayesian predictive density is expanded as:

$$
R(\theta, \hat q_{\pi}(y|x^{(N)})) = \frac{1}{2N}\, g^Y_{ij}(\theta)\, g^{X,ij}(\theta) + O(N^{-2}).
$$

Since, in our settings, the first-order term, $g^Y_{ij}(\theta) g^{X,ij}(\theta)$, is not constant, a prior independent of the sample size, N, is not an asymptotically constant-risk prior.
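To see how a non-constant trace with a unique maximum can arise, consider a one-parameter logistic model in which a data point has success probability σ(aθ) and the target has success probability σ(bθ), with σ the logistic function. This is a minimal sketch under assumed covariate values a = 1 and b = 2, which are not taken from the paper. In one dimension the trace reduces to the ratio of the two Fisher informations.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fisher_logistic(theta, covariate):
    """Fisher information of one Bernoulli trial with success probability
    sigmoid(covariate * theta)."""
    s = sigmoid(covariate * theta)
    return covariate ** 2 * s * (1.0 - s)


def trace_term(theta, data_covariate, target_covariate):
    """One-dimensional version of g^{X,ij} g^Y_{ij}: target information
    divided by data information."""
    return (fisher_logistic(theta, target_covariate)
            / fisher_logistic(theta, data_covariate))


a, b = 1.0, 2.0  # hypothetical covariates of the data and of the target
grid = [k / 100.0 for k in range(-500, 501)]
theta_max = max(grid, key=lambda t: trace_term(t, a, b))
print(theta_max, trace_term(theta_max, a, b))  # 0.0 4.0
```

For b > a > 0 the ratio equals (b²/a²) cosh²(aθ/2)/cosh²(bθ/2), which is maximized only at θ = 0, so this toy model falls into the setting with a unique maximum point.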
When there exists a unique maximum point of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, we construct the asymptotically constant-risk prior, π(θ; N), up to $O(N^{-2})$, by making the prior dependent on the sample size, N, as:

$$
\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \{f(\theta)\}^{\sqrt{N}}\, h(\theta),
$$

where f(θ) and h(θ) are scalar functions of θ independent of N and $|g^X(\theta)|$ denotes the determinant of the Fisher information matrix, $g^X(\theta)$.
The key idea is that, if a specific parameter point has a larger risk than the other parameter points, then more prior weight should be concentrated on that point.
Further, we clarify the subminimax estimator problem based on the mean squared error from the viewpoint of the prediction where the distributions of data and target variables are different and have a common unknown parameter. We obtain the improvement achieved by the minimax estimator over the subminimax estimators up to $O(N^{-2})$. The subminimax estimator problem [14,15] is the problem that, at first glance, there seem to exist estimators asymptotically dominating the minimax estimator. However, the relationship between such subminimax estimator problems and prediction has not been investigated, and, in general, the improvement by the minimax estimator over the subminimax estimators has not been investigated either.

2. Information Geometrical Notations

In this section, we prepare the information geometrical notations; see [16] for details. We abbreviate ∂/∂θi to ∂i, where the indices, i, j, k, . . ., run from one to d. Similarly, we abbreviate ∂²/∂θi∂θj, ∂³/∂θi∂θj∂θk and ∂⁴/∂θi∂θj∂θk∂θl to ∂ij, ∂ijk and ∂ijkl, respectively. We denote the expectations of the random variables, X, Y and X(N), by EX[·], EY[·] and EX(N)[·], respectively. We denote their probability densities by p(x|θ), q(y|θ) and p(x(N)|θ), respectively.
We define the predictive metric proposed by Komaki [13] as:

$$
\mathring{g}_{ij}(\theta) = g^X_{ik}(\theta)\, g^{Y,kl}(\theta)\, g^X_{lj}(\theta).
$$
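In matrix form the predictive metric is the product of the data Fisher information, the inverse target Fisher information and the data Fisher information again. The sketch below evaluates it for two made-up symmetric positive-definite 2 × 2 matrices (the numbers are illustrative assumptions) and checks that contracting the inverse predictive metric with the data information reproduces the trace $g^{X,ij} g^Y_{ij}$.

```python
def matmul(a, b):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse(m):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


def trace(m):
    return m[0][0] + m[1][1]


def predictive_metric(g_x, g_y):
    """Matrix form of the predictive metric: g_x * inverse(g_y) * g_x."""
    return matmul(matmul(g_x, inverse(g_y)), g_x)


g_x = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical Fisher information of the data
g_y = [[1.0, 0.2], [0.2, 3.0]]  # hypothetical Fisher information of the target

g_ring = predictive_metric(g_x, g_y)
lhs = trace(matmul(inverse(g_x), g_y))     # g^{X,ij} g^Y_{ij}
rhs = trace(matmul(inverse(g_ring), g_x))  # inverse predictive metric with g^X
print(lhs, rhs)
```

The two printed values agree up to rounding, because the inverse of the predictive metric is the inverse data information times the target information times the inverse data information.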

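When the parameter is one-dimensional, $g_{\theta\theta}(\theta)$ denotes the Fisher information and $g^{\theta\theta}(\theta)$ denotes its inverse. Let $\overset{e}{\Gamma}{}^{X}_{ij,k}(\theta)$ and $\overset{m}{\Gamma}{}^{X}_{ij,k}(\theta)$ be the quantities given by:

$$
\overset{e}{\Gamma}{}^{X}_{ij,k}(\theta) := E_X[\partial_{ij} \log p(x|\theta)\, \partial_k \log p(x|\theta)]
$$

and:

$$
\overset{m}{\Gamma}{}^{X}_{ij,k}(\theta) := \int \frac{1}{p(x|\theta)} [\partial_{ij} p(x|\theta)\, \partial_k p(x|\theta)]\, dx.
$$

Using these quantities, the e-connection and m-connection coefficients with respect to the parameter, θ, for the model, {p(x|θ) : θ ∈ Θ}, are given by:

$$
\overset{e}{\Gamma}{}^{X,k}_{ij}(\theta) := g^{X,lk}(\theta)\, \overset{e}{\Gamma}{}^{X}_{ij,l}(\theta)
$$

and:

$$
\overset{m}{\Gamma}{}^{X,k}_{ij}(\theta) := g^{X,kl}(\theta)\, \overset{m}{\Gamma}{}^{X}_{ij,l}(\theta),
$$

respectively.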
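The (0, 3)-tensor, $T^X_{ijk}(\theta)$, is defined by:

$$
T^X_{ijk}(\theta) := E_X[\partial_i \log p(x|\theta)\, \partial_j \log p(x|\theta)\, \partial_k \log p(x|\theta)].
$$

The tensor, $T^X_{ijk}(\theta)$, also produces a (0, 1)-tensor:

$$
T^X_i(\theta) := T^X_{ijk}(\theta)\, g^{X,jk}(\theta).
$$

In the same manner, the information geometrical quantities, $\overset{e}{\Gamma}{}^{Y}_{ij,l}(\theta)$, $\overset{m}{\Gamma}{}^{Y}_{ij,l}(\theta)$ and $T^Y_{ijk}(\theta)$, are defined for the model, {q(y|θ) : θ ∈ Θ}.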
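Let $M^k_{ij}(\theta)$ be a (1, 2)-tensor defined by:

$$
M^k_{ij}(\theta) := \overset{m}{\Gamma}{}^{Y,k}_{ij}(\theta) - \overset{m}{\Gamma}{}^{X,k}_{ij}(\theta).
$$

For a derivative, $(\partial_1 v(\theta), \cdots, \partial_d v(\theta))$, of the scalar function, v(θ), the e-covariant derivative is given by:

$$
\overset{e}{\nabla}_i \partial_j v(\theta) := \partial_{ij} v(\theta) - \overset{e}{\Gamma}{}^{X,k}_{ij}(\theta)\, \partial_k v(\theta).
$$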

3. Asymptotically Constant-Risk Priors When the Distributions of Data and Target Variables
Are Different

In this section, we consider the settings where the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$, has a unique maximum point. We construct the asymptotically constant-risk prior under the Kullback–Leibler risk in the sense that the asymptotic risk up to $O(N^{-2})$ is constant. We find asymptotically constant-risk priors up to $O(N^{-2})$ in two steps: first, expand the Kullback–Leibler risks of Bayesian predictive densities; second, find the prior having an asymptotically constant risk using this expansion.
From now on, we assume the following two conditions for the prior, π(θ; N):

(C1) The prior, π(θ; N), has the form:

$$
\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log f(\theta) + \log h(\theta)\},
$$

where f(θ) and h(θ) are smooth scalar functions of θ independent of N.
(C2) The unique maximum point of the scalar function, f(θ), is equal to the unique maximum point of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.


Based on Conditions (C1) and (C2), we expand the Kullback–Leibler risk of a Bayesian predictive
density up to O(N−2).

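Theorem 1. The Kullback–Leibler risk of a Bayesian predictive density based on the prior, π(θ; N), satisfying Condition (C1), is expanded as:

$$
\begin{aligned}
R(\theta, \hat q_{\pi}(y|x^{(N)}))
={}& \frac{1}{2N}\, g^Y_{ij}(\theta) g^{X,ij}(\theta)
+ \frac{1}{2N}\, \mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log f(\theta)
- \frac{1}{N\sqrt{N}}\, T^Y_{ijk}(\theta) g^{X,ij}(\theta) g^{X,kl}(\theta)\, \partial_l \log f(\theta) \\
& + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta)\, \overset{e}{\nabla}_i \partial_j \log f(\theta)
+ \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta) g^{X,kl}(\theta) \left( \overset{e}{\nabla}_i \partial_k \log f(\theta) \right) \partial_j \log f(\theta)\, \partial_l \log f(\theta) \\
& - \frac{1}{3N\sqrt{N}}\, T^Y_{ijk}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log f(\theta)\, \partial_t \log f(\theta)\, \partial_u \log f(\theta) \\
& + \frac{1}{2N\sqrt{N}}\, g^Y_{kl}(\theta) M^l_{ij}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log f(\theta)\, \partial_t \log f(\theta)\, \partial_u \log f(\theta) \\
& + \frac{1}{2N\sqrt{N}}\, g^{X,ij}(\theta) g^Y_{kl}(\theta) g^{X,kl}(\theta) M^m_{ij}(\theta)\, \partial_m \log f(\theta)
+ \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta) M^k_{ij}(\theta)\, \partial_k \log f(\theta) \\
& + \frac{1}{2N\sqrt{N}}\, \mathring{g}^{ij}(\theta) T^X_i(\theta)\, \partial_j \log f(\theta)
+ \frac{1}{2N\sqrt{N}}\, g^{X,im}(\theta) g^Y_{ij}(\theta) g^{X,kl}(\theta) M^j_{kl}(\theta)\, \partial_m \log f(\theta) \\
& + \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta)\, \partial_i \log f(\theta)\, \partial_j \log h(\theta) + O(N^{-2}).
\end{aligned}
\tag{1}
$$

The proof is given in the Appendix. The first term in (1) represents that the precision of the estimation is determined by the geometric quantity of the data, $g^{X,ij}(\theta)$, and the metric of the parameter is determined by the geometric quantity of the target variables, $g^Y_{ij}(\theta)$. Note that each term in (1) is invariant under reparametrization.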

Remark 1. For the subsequent theorem, it is important that at the point, $\theta_f$, maximizing the scalar function, log f(θ), the risk $R(\theta_f, \hat q_{\pi}(y|x^{(N)}))$ is given by:

$$
R(\theta_f, \hat q_{\pi}(y|x^{(N)}))
= \frac{1}{2N} \sup_{\theta \in \Theta} \{ g^{X,ij}(\theta) g^Y_{ij}(\theta) \}
+ \frac{1}{N\sqrt{N}}\, \mathring{g}^{ij}(\theta_f)\, \partial_{ij} \log f(\theta_f) + O(N^{-2}).
\tag{2}
$$

The $N^{-3/2}$-order term of this risk is common whenever we use the same scalar function, log f(θ). This term is negative because of the definition of the point, $\theta_f$. Under Condition (C2), $\theta_f$ is equal to the unique maximum point, $\theta_{\max}$, of the trace, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.

Based on (1) and (2), we construct asymptotically constant-risk priors using the solutions of the
partial differential equations.

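Theorem 2. Suppose that the scalar functions, $\log \tilde f(\theta)$ and $\log \tilde h(\theta)$, satisfy the following conditions:

(A1) $\log \tilde f(\theta)$ is the solution of the Eikonal equation given by:

$$
\mathring{g}^{ij}(\theta)\, \partial_i \log \tilde f(\theta)\, \partial_j \log \tilde f(\theta)
= g^{X,ij}(\theta_{\max}) g^Y_{ij}(\theta_{\max}) - g^{X,ij}(\theta) g^Y_{ij}(\theta),
\tag{3}
$$

where $\theta_{\max}$ is the unique maximum point of the scalar function, $g^{X,ij}(\theta) g^Y_{ij}(\theta)$.

(A2) $\log \tilde h(\theta)$ is the solution of the first-order linear partial differential equation given by:

$$
\begin{aligned}
\mathring{g}^{ij}(\theta)\, \partial_i \log \tilde f(\theta)\, \partial_j \log \tilde h(\theta)
={}& - \mathring{g}^{ij}(\theta)\, \overset{e}{\nabla}_i \partial_j \log \tilde f(\theta)
- \mathring{g}^{ij}(\theta) g^{X,kl}(\theta) \left( \overset{e}{\nabla}_i \partial_k \log \tilde f(\theta) \right) \partial_j \log \tilde f(\theta)\, \partial_l \log \tilde f(\theta) \\
& + T^Y_{ijk}(\theta) g^{X,ij}(\theta) g^{X,kl}(\theta)\, \partial_l \log \tilde f(\theta) \\
& + \frac{1}{3}\, T^Y_{ijk}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log \tilde f(\theta)\, \partial_t \log \tilde f(\theta)\, \partial_u \log \tilde f(\theta) \\
& - \frac{1}{2}\, g^Y_{kl}(\theta) M^l_{ij}(\theta) g^{X,is}(\theta) g^{X,jt}(\theta) g^{X,ku}(\theta)\, \partial_s \log \tilde f(\theta)\, \partial_t \log \tilde f(\theta)\, \partial_u \log \tilde f(\theta) \\
& - \frac{1}{2}\, g^{X,ij}(\theta) g^Y_{kl}(\theta) g^{X,kl}(\theta) M^m_{ij}(\theta)\, \partial_m \log \tilde f(\theta)
- \mathring{g}^{ij}(\theta) M^k_{ij}(\theta)\, \partial_k \log \tilde f(\theta) \\
& - \frac{1}{2}\, \mathring{g}^{ij}(\theta) T^X_i(\theta)\, \partial_j \log \tilde f(\theta)
- \frac{1}{2}\, g^{X,im}(\theta) g^Y_{ij}(\theta) g^{X,kl}(\theta) M^j_{kl}(\theta)\, \partial_m \log \tilde f(\theta) \\
& + \mathring{g}^{ij}(\theta_{\max})\, \partial_{ij} \log \tilde f(\theta_{\max}).
\end{aligned}
\tag{4}
$$

Let π(θ; N) be the prior that is constructed as:

$$
\frac{\pi(\theta; N)}{|g^X(\theta)|^{1/2}} \propto \exp\{\sqrt{N} \log \tilde f(\theta) + \log \tilde h(\theta)\}.
$$

Further, suppose that $\log \tilde f(\theta)$ satisfies Condition (C2).
Then, the Bayesian predictive density based on the prior, π(θ; N), has the asymptotically smallest constant risk up to $O(N^{-2})$ among all priors with the form (C1).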

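Proof. First, we consider the prior, $\varphi(\theta;N)$, constructed as:

$$\frac{\varphi(\theta;N)}{|g^X(\theta)|^{1/2}} \propto \exp\bigl\{\sqrt{N}\log\tilde f(\theta)\bigr\}.$$

From Theorem 1, the Kullback–Leibler risk, $R(\theta,\hat q_\varphi(y|x^{(N)}))$, based on the prior, $\varphi(\theta;N)$, is given by:

$$R(\theta,\hat q_\varphi(y|x^{(N)})) = \frac{1}{2N}\,g^{X,ij}(\theta_{\max})\,g^Y_{ij}(\theta_{\max}) + o(N^{-1}). \tag{5}$$

This is constant up to $o(N^{-1})$.
Suppose that there exists another prior, $\phi(\theta;N)$, constructed as:

$$\frac{\phi(\theta;N)}{|g^X(\theta)|^{1/2}} \propto \exp\bigl\{\sqrt{N}\log f(\theta)\bigr\},$$

and the Bayesian predictive density based on the prior, $\phi(\theta;N)$, has the asymptotically constant risk:

$$R(\theta,\hat q_\phi(y|x^{(N)})) = \frac{k}{2N} + o(N^{-1}).$$

From Theorem 1, the prior $\phi(\theta;N)$ must satisfy the equation:

$$\mathring{g}^{ij}(\theta)\,\partial_i\log f(\theta)\,\partial_j\log f(\theta) = k - g^{X,ij}(\theta)\,g^Y_{ij}(\theta).$$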

The left-hand side of the above equation is non-negative, because the matrix, $\mathring{g}^{ij}(\theta)$, is positive-definite.
Hence, the infimum of the constant, $k$, is equal to $g^{X,ij}(\theta_{\max})\,g^Y_{ij}(\theta_{\max})$. From (5), the $N^{-1}$-order term
of the risk based on the prior, $\varphi(\theta;N)$, achieves the infimum, $g^{X,ij}(\theta_{\max})\,g^Y_{ij}(\theta_{\max})$. Thus, the Bayesian
predictive density based on the prior, $\varphi(\theta;N)$, has the asymptotically smallest constant risk up to
$o(N^{-1})$.
Second, we consider the prior, $\pi(\theta;N)$, constructed as:

$$\frac{\pi(\theta;N)}{|g^X(\theta)|^{1/2}} \propto \exp\bigl\{\sqrt{N}\log\tilde f(\theta) + \log\tilde h(\theta)\bigr\}.$$

The above argument ensures that the prior, $\pi(\theta;N)$, has the asymptotically smallest constant risk up
to $o(N^{-1})$. Thus, we only have to check if the $N^{-3/2}$-order term of the risk is the smallest constant.
From (2), the $N^{-3/2}$-order term of the risk at the point, $\theta_{\max}$, is unchanged by the choice of the
scalar function, $\log h(\theta)$. In other words, the constant $N^{-3/2}$-order term must agree with the quantity,
$\mathring{g}^{ij}(\theta_{\max})\,\partial_{ij}\log\tilde f(\theta_{\max})$. From Theorem 1, if we choose the prior, $\pi(\theta;N)$, the $N^{-3/2}$-order term of the
risk is the smallest constant, and it agrees with the quantity, $\mathring{g}^{ij}(\theta_{\max})\,\partial_{ij}\log\tilde f(\theta_{\max})$. Thus, the prior,
$\pi(\theta;N)$, has the asymptotically smallest constant risk up to $O(N^{-2})$.

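Remark 2. In Theorem 2, we choose $\log\tilde f(\theta)$, satisfying Condition (C2) among the solutions of (A1). We
consider the model with a one-dimensional parameter, $\theta$. There are four possibilities for the solutions of (A1):

$$\sqrt{\mathring{g}^{\theta\theta}(\theta)}\;\partial_\theta\log\tilde f(\theta) =
\begin{cases}
\pm\sqrt{g^{X,\theta\theta}(\theta_{\max})\,g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta)\,g^Y_{\theta\theta}(\theta)} & \text{if } \theta\le\theta_{\max}, \\
\pm\sqrt{g^{X,\theta\theta}(\theta_{\max})\,g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta)\,g^Y_{\theta\theta}(\theta)} & \text{if } \theta\ge\theta_{\max},
\end{cases}$$

where the double-sign corresponds. From the concavity around $\theta_{\max}$ as suggested by (C2), we choose $\log\tilde f(\theta)$
as the solution of the following equation:

$$\sqrt{\mathring{g}^{\theta\theta}(\theta)}\;\partial_\theta\log\tilde f(\theta) =
\begin{cases}
\phantom{-}\sqrt{g^{X,\theta\theta}(\theta_{\max})\,g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta)\,g^Y_{\theta\theta}(\theta)} & \text{if } \theta\le\theta_{\max}, \\
-\sqrt{g^{X,\theta\theta}(\theta_{\max})\,g^Y_{\theta\theta}(\theta_{\max}) - g^{X,\theta\theta}(\theta)\,g^Y_{\theta\theta}(\theta)} & \text{if } \theta\ge\theta_{\max}.
\end{cases} \tag{6}$$

Integrating both sides of Equation (6), the unique function, $\log\tilde f(\theta)$, is obtained.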
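The integration of Equation (6) can be sketched numerically for a one-dimensional model. The following is a minimal illustration and not the authors' code: the function name `log_f_tilde`, the midpoint rule and the normalisation $\log\tilde f(\theta_{\max}) = 0$ are assumptions made here. It uses the one-dimensional identity $\mathring{g}^{\theta\theta} = (g^{X,\theta\theta})^2 g^Y_{\theta\theta}$.

```python
import math

def log_f_tilde(theta, g_x_inv, g_y, theta_max, steps=2000):
    """Integrate Equation (6) from theta_max to theta with the midpoint rule.

    g_x_inv(t): inverse Fisher information of the data model (scalar case).
    g_y(t):     Fisher information of the target model (scalar case).
    The result is normalised so that log f~(theta_max) = 0 (an assumption;
    Equation (6) fixes log f~ only up to an additive constant).
    """
    def trace(t):
        return g_x_inv(t) * g_y(t)

    trace_max = trace(theta_max)

    def slope(t):
        # |d/dt log f~| = sqrt(trace_max - trace(t)) / sqrt(g_ring(t)),
        # with g_ring = g_x_inv^2 * g_y in one dimension.
        gap = max(trace_max - trace(t), 0.0)
        magnitude = math.sqrt(gap) / (g_x_inv(t) * math.sqrt(g_y(t)))
        # Sign choice of Equation (6): increasing before theta_max, decreasing after.
        return magnitude if t <= theta_max else -magnitude

    h = (theta - theta_max) / steps
    return sum(slope(theta_max + (k + 0.5) * h) for k in range(steps)) * h

# Binomial data with a unit-variance Gaussian target: g^{X} inverse is t(1 - t), g^Y = 1.
value = log_f_tilde(0.2, lambda t: t * (1 - t), lambda t: 1.0, 0.5)
print(value, 0.5 * math.log(4 * 0.2 * 0.8))
```

For this binomial case the numerical integral can be compared with the closed form $\tfrac12\log\{4\theta(1-\theta)\}$, which differs from $\tfrac12\log\{\theta(1-\theta)\}$ only by the additive constant fixed by the normalisation.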

Remark 3. Compare the Kullback–Leibler risk based on the asymptotically constant-risk prior, $\pi(\theta;N)$, with
that based on the prior, $\lambda(\theta)$, independent of the sample size, $N$. From Theorem 1 and Theorem 2, the
Kullback–Leibler risk based on the asymptotically constant-risk prior, $\pi(\theta;N)$, is given as:

$$R(\theta,\hat q_\pi(y|x^{(N)})) = \frac{1}{2N}\,g^{X,ij}(\theta_{\max})\,g^Y_{ij}(\theta_{\max}) + \frac{1}{N\sqrt{N}}\,\mathring{g}^{ij}(\theta_{\max})\,\partial_{ij}\log\tilde f(\theta_{\max}) + O(N^{-2}). \tag{7}$$

In contrast, the Kullback–Leibler risk based on the prior, $\lambda(\theta)$, is given as:

$$R(\theta,\hat q_\lambda(y|x^{(N)})) = \frac{1}{2N}\,g^{X,ij}(\theta)\,g^Y_{ij}(\theta) + O(N^{-2}). \tag{8}$$

The $N^{-1}$-order term in (8) is not greater than the $N^{-1}$-order term in (7), with equality at $\theta_{\max}$. However, the $N^{-3/2}$-order term in (8) does not
exist, whereas the $N^{-3/2}$-order term in (7) is negative. Thus, the maximum of the risk based on the asymptotically
constant-risk prior, $\pi(\theta;N)$, is smaller than that of the risk based on the prior, $\lambda(\theta)$. This result is consistent
with the minimaxity of selecting the prior that constructs the predictive density with the smallest maximum of
the risk.

4. Subminimax Estimator Problem Based on the Mean Squared Error

In this section, we discuss the subminimax estimator problem based on the mean squared error,
from the viewpoint of the prediction where the distributions of data and target variables are different
and have a common unknown parameter. First, we give a brief review of the subminimax estimator
problem through the binomial example.

Example 1. Let us consider the binomial estimation based on the mean squared error, $R_{\mathrm{MSE}}(\theta,\hat\theta)$. For any
finite sample size, $N$, the Bayes estimator, $\hat\theta_\pi$, based on the Beta prior, $\pi(\theta;N) \propto \theta^{\sqrt{N}/2-1}(1-\theta)^{\sqrt{N}/2-1}$, is
minimax under the mean squared error. The mean squared error of the minimax Bayes estimator, $\hat\theta_\pi$, is given by:

$$R_{\mathrm{MSE}}(\theta,\hat\theta_\pi) = \frac{N}{4(\sqrt{N}+N)^2} = \frac{1}{4N} - \frac{1}{2N\sqrt{N}} + O(N^{-2}). \tag{9}$$

In contrast, the mean squared error of the maximum likelihood estimator, $\hat\theta_{\mathrm{MLE}}$, is given by:

$$R_{\mathrm{MSE}}(\theta,\hat\theta_{\mathrm{MLE}}) = \frac{\theta(1-\theta)}{N}.$$

We compare the two estimators, $\hat\theta_\pi$ and $\hat\theta_{\mathrm{MLE}}$. In the comparison of the $N^{-1}$-order terms of the mean
squared errors, it seems that the maximum likelihood estimator, $\hat\theta_{\mathrm{MLE}}$, dominates the minimax Bayes estimator,
$\hat\theta_\pi$. In other words, the $N^{-1}$-order term of $R_{\mathrm{MSE}}(\theta,\hat\theta_{\mathrm{MLE}})$ is not greater than that of $R_{\mathrm{MSE}}(\theta,\hat\theta_\pi)$ for every
$\theta\in\Theta$, and the equality holds when $\theta = 1/2$. This seeming paradox is known as the subminimax estimator
problem; see [14,17,18] for details. See also [15] for the conditions under which such problems do not occur in estimation.
However, this paradox does not mean the inferiority of the minimax Bayes estimator. This is because,
although the mean squared error of the minimax Bayes estimator, $\hat\theta_\pi$, has the negative $N^{-3/2}$-order term, the
mean squared error of the maximum likelihood estimator, $\hat\theta_{\mathrm{MLE}}$, does not have the $N^{-3/2}$-order term. Hence, when
the mean squared errors are compared up to $O(N^{-2})$, the maximum of the mean squared error, $R_{\mathrm{MSE}}(\theta,\hat\theta_\pi)$, is
below the maximum of the mean squared error, $R_{\mathrm{MSE}}(\theta,\hat\theta_{\mathrm{MLE}})$.
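The comparison in Example 1 can be checked by summing over the binomial distribution exactly. The sketch below is only illustrative; the helper names (`mse_exact`, `bayes_minimax`, `mle`) and the choice $N = 100$ are assumptions made here. It takes the Bayes estimator under squared error to be the posterior mean under the Beta($\sqrt{N}/2$, $\sqrt{N}/2$) prior.

```python
import math

def mse_exact(estimator, n, theta):
    """Exact mean squared error of estimator(x, n) when x ~ Bin(n, theta)."""
    return sum(
        math.comb(n, x) * theta**x * (1 - theta) ** (n - x)
        * (estimator(x, n) - theta) ** 2
        for x in range(n + 1)
    )

def bayes_minimax(x, n):
    # Posterior mean under the Beta(sqrt(n)/2, sqrt(n)/2) prior.
    return (x + math.sqrt(n) / 2) / (n + math.sqrt(n))

def mle(x, n):
    return x / n

n = 100
grid = [k / 20 for k in range(1, 20)]          # theta = 0.05, ..., 0.95
risk_bayes = [mse_exact(bayes_minimax, n, t) for t in grid]
risk_mle = [mse_exact(mle, n, t) for t in grid]

constant = n / (4 * (math.sqrt(n) + n) ** 2)   # exact constant risk in (9)
two_term = 1 / (4 * n) - 1 / (2 * n * math.sqrt(n))  # first two terms of (9)

print("Bayes risk range:", min(risk_bayes), max(risk_bayes), "constant:", constant)
print("max MLE risk:", max(risk_mle), "two-term expansion:", two_term)
```

The Bayes risk is flat in $\theta$, while the risk of the maximum likelihood estimator peaks at $\theta = 1/2$ above that constant, which is the behaviour described in the example.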

Next, we construct the asymptotically constant-risk prior in the estimation based on the mean
squared error when the subminimax estimator problem occurs, from the viewpoint of the prediction.
We consider the priors, $\pi(\theta;N)$, satisfying (C1). From Lemma A3 in the Appendix, the mean squared
error of the Bayes estimator, $\hat\theta_\pi$, is equal to the Kullback–Leibler risk of the $\hat\theta_\pi$-plugin predictive density,
$q(y|\hat\theta_\pi)$, by assuming that the target variable, $y$, is a $d$-dimensional Gaussian random variable with the
mean vector, $\theta$, and unit variance. Note that $g^Y_{ij}(\theta) = \delta_{ij}$, $\overset{m}{\Gamma}{}^Y_{ij,k} = 0$ and $\overset{e}{\Gamma}{}^Y_{ij,k} = 0$ for $i,j,k = 1,\cdots,d$.
Thus, if $g^Y_{ij}(\theta)\,g^{X,ij}(\theta) = \sum_{i=1}^{d} g^{X,ii}(\theta)$ has a unique maximum point, we obtain the asymptotically
constant-risk prior, $\pi(\theta;N)$, up to $O(N^{-2})$ from Lemma A2 in the Appendix and Theorem 2.
Finally, we compare the mean squared error of the asymptotically constant-risk Bayes estimator,
$\hat\theta_\pi$, with that of the maximum likelihood estimator, $\hat\theta_{\mathrm{MLE}}$. The mean squared error of the asymptotically
constant-risk Bayes estimator, $\hat\theta_\pi$, is given as:

$$R_{\mathrm{MSE}}(\theta,\hat\theta_\pi) = \frac{1}{N}\sum_{i=1}^{d} g^{X,ii}(\theta_{\max}) + \frac{2}{N\sqrt{N}}\sum_{k=1}^{d} g^{X,ik}(\theta_{\max})\,g^{X,jk}(\theta_{\max})\,\partial_{ij}\log\tilde f(\theta_{\max}) + O(N^{-2}).$$

In contrast, the mean squared error of the maximum likelihood estimator, $\hat\theta_{\mathrm{MLE}}$, is given as:

$$R_{\mathrm{MSE}}(\theta,\hat\theta_{\mathrm{MLE}}) = \frac{1}{N}\sum_{i=1}^{d} g^{X,ii}(\theta) + O(N^{-2}).$$

See [16,19].
Thus, the maximum of the mean squared error of the asymptotically constant-risk Bayes estimator
is smaller than that of the maximum likelihood estimator, with an improvement of order $N^{-3/2}$ proportional to the Hessian
of the scalar function, $\log\tilde f(\theta)$, at $\theta_{\max}$. In the prediction where the trace, $g^{X,ij}(\theta)\,g^Y_{ij}(\theta)$, has a unique
maximum point, the same improvement holds (Remark 3).


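Example 2. Using the above results, we consider the binomial estimation based on the mean squared error from
the viewpoint of the prediction. The geometrical quantities to be used are given by:

$$g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)}, \quad g^Y_{\theta\theta}(\theta) = 1, \quad \overset{m}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = 0, \quad \overset{m}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = 0,$$

$$\overset{e}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = -\frac{1-2\theta}{\theta^2(1-\theta)^2}, \quad \overset{e}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = 0, \quad T^X_{\theta\theta\theta}(\theta) = \frac{1-2\theta}{\theta^2(1-\theta)^2}, \quad \text{and} \quad T^Y_{\theta\theta\theta}(\theta) = 0,$$

respectively. Since $\overset{m}{\Gamma}{}^{X,\theta}_{\theta\theta}$, $\overset{m}{\Gamma}{}^{Y,\theta}_{\theta\theta}$ and $T^Y_{\theta\theta\theta}$ vanish, the asymptotically constant-risk prior in the estimation is
identical to the asymptotically constant-risk prior in the prediction; compare Theorem 1 with the expansion of
$g^Y_{ij}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)]$ in Lemma A2 in the Appendix.
In this example, Equation (3) is given by:

$$\theta^2(1-\theta)^2\{\partial_\theta\log\tilde f(\theta)\}^2 = \frac{1}{4} - \theta(1-\theta),$$

and the solution, $\log\tilde f(\theta)$, is $(1/2)\log\{\theta(1-\theta)\}$. Here, the second-order derivative of the function, $\log\tilde f(\theta)$,
is given by:

$$\partial_{\theta\theta}\log\tilde f(\theta) = \frac{-1 + 2\theta - 2\theta^2}{2\theta^2(1-\theta)^2}.$$

From this, Equation (4) is given by:

$$\frac{1}{2}\,\theta(1-\theta)(1-2\theta)\,\partial_\theta\log\tilde h(\theta) + \theta^2 - \theta = -\frac{1}{4},$$

and the solution, $\log\tilde h(\theta)$, is $-(1/2)\log\{\theta(1-\theta)\}$. Hence, the asymptotically constant-risk prior, $\pi(\theta;N)$, is a
Beta prior with the parameters, $\alpha = \sqrt{N}/2$ and $\beta = \sqrt{N}/2$. Note that the asymptotically constant-risk prior
coincides with the exact minimax prior. Since $g^{X,\theta\theta}(\theta_{\max}) = 1/4$ and $g^{X,\theta\theta}(\theta_{\max})\,\partial_{\theta\theta}\log\tilde f(\theta_{\max}) = -1$, the
mean squared error of the asymptotically constant-risk Bayes estimator, $\hat\theta_\pi$, agrees with (9) up to $O(N^{-2})$.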
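The claims of Example 2 about $\log\tilde f$ can be checked numerically with finite differences. The sketch below is an illustration under assumptions made here: the helper names `d1`, `d2`, `eikonal_residual` and `mse_two_term` are hypothetical, and the step sizes are arbitrary. It verifies that $(1/2)\log\{\theta(1-\theta)\}$ solves the Eikonal equation of the example, and evaluates the first two terms of the mean squared error expansion of Section 4 at $\theta_{\max} = 1/2$.

```python
import math

def log_f(theta):
    # Candidate solution of Equation (3) in the binomial example.
    return 0.5 * math.log(theta * (1 - theta))

def d1(fn, t, h=1e-5):
    """Central finite-difference first derivative."""
    return (fn(t + h) - fn(t - h)) / (2 * h)

def d2(fn, t, h=1e-4):
    """Central finite-difference second derivative."""
    return (fn(t + h) - 2 * fn(t) + fn(t - h)) / h**2

def eikonal_residual(theta):
    # Left-hand side minus right-hand side of Equation (3) for this example.
    lhs = theta**2 * (1 - theta) ** 2 * d1(log_f, theta) ** 2
    rhs = 0.25 - theta * (1 - theta)
    return lhs - rhs

theta_max = 0.5
g_x_inv_max = theta_max * (1 - theta_max)  # inverse Fisher information at theta_max
hessian_max = d2(log_f, theta_max)         # second derivative of log f~ at theta_max

def mse_two_term(n):
    # (1/N) g^{X,tt} + (2 / (N sqrt N)) (g^{X,tt})^2 d^2 log f~, evaluated at theta_max.
    return g_x_inv_max / n + 2 * g_x_inv_max**2 * hessian_max / (n * math.sqrt(n))

print([eikonal_residual(t) for t in (0.1, 0.3, 0.7)])
print(g_x_inv_max * hessian_max)                        # close to -1
print(mse_two_term(100), 1 / 400 - 1 / 2000)            # compare with (9)
```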

5. Application to the Prediction of the Binary Regression Model under the Covariate Shift

In this section, we construct asymptotically constant-risk priors in the prediction based on the
binary regression model under the covariate shift; see [10].
We consider predicting a binary response variable, $y$, based on the binary response variables,
$x^{(N)}$. We assume that the target variable, $y$, and the data, $x^{(N)}$, follow the logistic regression models
with the same parameter, $\beta$, given by:

$$\log\frac{\Pi_x}{1-\Pi_x} = \alpha + z\beta$$

and:

$$\log\frac{\Pi_y}{1-\Pi_y} = \tilde\alpha + \tilde z\beta,$$


where Πx is the success probability of the data and Πy is the success probability of the target variable.
Let α and ˜α denote known constant terms, and let β denote the common unknown parameter. Further,
we assume that the covariates, z and ˜z, are different.
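Using the parameter $\theta = \Pi_x$, we convert this predictive setting to binomial prediction where the
data, $x$, and the target variable, $y$, are distributed according to:

$$p(x|\theta) := \begin{cases} \theta & \text{if } x = 1, \\ 1-\theta & \text{if } x = 0, \end{cases}$$

and:

$$q(y|\theta) := \begin{cases}
e^{\tilde\alpha - \tilde z z^{-1}\alpha}\,\theta^{\tilde z z^{-1}} \big/ \bigl\{(1-\theta)^{\tilde z z^{-1}} + e^{\tilde\alpha - \tilde z z^{-1}\alpha}\,\theta^{\tilde z z^{-1}}\bigr\} & \text{if } y = 1, \\
(1-\theta)^{\tilde z z^{-1}} \big/ \bigl\{(1-\theta)^{\tilde z z^{-1}} + e^{\tilde\alpha - \tilde z z^{-1}\alpha}\,\theta^{\tilde z z^{-1}}\bigr\} & \text{if } y = 0,
\end{cases}$$

respectively. The Fisher information for $x$ and for $y$ is given by:

$$g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)}$$

and:

$$g^Y_{\theta\theta}(\theta) = \left(\frac{\tilde z}{z}\right)^2 e^{-\tilde\alpha + \tilde z z^{-1}\alpha}\,\frac{(1-\theta)^{\tilde z z^{-1}-2}\,\theta^{\tilde z z^{-1}-2}}{\bigl\{\theta^{\tilde z z^{-1}} + e^{-\tilde\alpha + \tilde z z^{-1}\alpha}(1-\theta)^{\tilde z z^{-1}}\bigr\}^2},$$

respectively.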
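The closed form for the Fisher information of the target variable can be compared with the generic Bernoulli expression $(\mathrm{d}\Pi_y/\mathrm{d}\theta)^2/\{\Pi_y(1-\Pi_y)\}$. The following sketch is only an illustration; the function names, the keyword names (`z_t` and `alpha_t` stand for $\tilde z$ and $\tilde\alpha$) and the finite-difference step are assumptions made here.

```python
import math

def success_prob_y(theta, z=1.0, z_t=2.0, alpha=0.0, alpha_t=0.0):
    """Success probability of the target variable y as a function of theta = Pi_x."""
    r = z_t / z
    c = math.exp(alpha_t - r * alpha)
    num = c * theta**r
    return num / ((1 - theta) ** r + num)

def fisher_y_numeric(theta, h=1e-6, **kw):
    # Bernoulli Fisher information under reparametrisation:
    # (d Pi_y / d theta)^2 / (Pi_y (1 - Pi_y)), derivative by central differences.
    dp = (success_prob_y(theta + h, **kw) - success_prob_y(theta - h, **kw)) / (2 * h)
    p = success_prob_y(theta, **kw)
    return dp**2 / (p * (1 - p))

def fisher_y_closed(theta, z=1.0, z_t=2.0, alpha=0.0, alpha_t=0.0):
    # Closed form for g^Y as stated in the text.
    r = z_t / z
    c = math.exp(-alpha_t + r * alpha)
    num = r**2 * c * (1 - theta) ** (r - 2) * theta ** (r - 2)
    return num / (theta**r + c * (1 - theta) ** r) ** 2

settings = dict(z=1.5, z_t=2.5, alpha=0.3, alpha_t=-0.2)
print(fisher_y_numeric(0.3, **settings), fisher_y_closed(0.3, **settings))
print(fisher_y_closed(0.4), 4 / (0.6**2 + 0.4**2) ** 2)  # z = 1, z~ = 2, alpha = alpha~ = 0
```

With the default arguments ($z = 1$, $\tilde z = 2$, $\alpha = \tilde\alpha = 0$) the closed form reduces to $4/\{(1-\theta)^2+\theta^2\}^2$.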
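For simplicity, we consider the setting where $z = 1$, $\tilde z = 2$ and $\alpha = \tilde\alpha = 0$. The geometrical
quantities for the model, $\{p(x|\theta) : \theta\in\Theta\}$, are given by:

$$g^X_{\theta\theta}(\theta) = \frac{1}{\theta(1-\theta)}, \quad \overset{m}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = 0, \quad \overset{e}{\Gamma}{}^X_{\theta\theta,\theta}(\theta) = -\frac{1-2\theta}{\theta^2(1-\theta)^2}, \quad \text{and} \quad T^X_{\theta\theta\theta}(\theta) = \frac{1-2\theta}{\theta^2(1-\theta)^2},$$

respectively. In the same manner, the geometrical quantities for the model, $\{q(y|\theta) : \theta\in\Theta\}$, are
given by:

$$g^Y_{\theta\theta}(\theta) = \frac{4}{\{(1-\theta)^2+\theta^2\}^2}, \quad \overset{m}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = 4\,\frac{(1-2\theta)(1+2\theta-2\theta^2)}{\theta(1-\theta)\{(1-\theta)^2+\theta^2\}^3},$$

$$\overset{e}{\Gamma}{}^Y_{\theta\theta,\theta}(\theta) = -4\,\frac{1-2\theta}{\theta(1-\theta)\{(1-\theta)^2+\theta^2\}^2}, \quad \text{and} \quad T^Y_{\theta\theta\theta}(\theta) = 8\,\frac{1-2\theta}{\theta(1-\theta)\{(1-\theta)^2+\theta^2\}^3},$$

respectively.
Using these quantities, Equation (3) is given by:

$$4\,\frac{\theta^2(1-\theta)^2}{\{\theta^2+(1-\theta)^2\}^2}\,(\partial_\theta\log\tilde f(\theta))^2 = 4 - 4\,\frac{\theta(1-\theta)}{\{\theta^2+(1-\theta)^2\}^2}.$$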


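By noting that the maximum point of $g^{X,\theta\theta}(\theta)\,g^Y_{\theta\theta}(\theta)$ is $1/2$, the solution, $\log\tilde f(\theta)$, of this equation is
given by:

$$\log\tilde f(\theta) = 2\sqrt{1-\theta+\theta^2} + \log\{\theta(1-\theta)\} - \log\bigl(2-\theta+2\sqrt{1-\theta+\theta^2}\bigr) - \log\bigl(1+\theta+2\sqrt{1-\theta+\theta^2}\bigr).$$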
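As a sanity check, the stated solution can be substituted back into Equation (3) for this setting using finite differences. The sketch below is illustrative only; the helper names and step sizes are assumptions made here. It also evaluates the second derivative of $\log\tilde f$ at $\theta_{\max} = 1/2$, where $\mathring{g}^{\theta\theta}(1/2) = 1$, so that this value is the $N^{-3/2}$-order coefficient of the asymptotic risk in this example.

```python
import math

def log_f(theta):
    # Stated solution of Equation (3) for z = 1, z~ = 2, alpha = alpha~ = 0.
    s = math.sqrt(1 - theta + theta**2)
    return (2 * s + math.log(theta * (1 - theta))
            - math.log(2 - theta + 2 * s)
            - math.log(1 + theta + 2 * s))

def d1(fn, t, h=1e-5):
    """Central finite-difference first derivative."""
    return (fn(t + h) - fn(t - h)) / (2 * h)

def d2(fn, t, h=1e-4):
    """Central finite-difference second derivative."""
    return (fn(t + h) - 2 * fn(t) + fn(t - h)) / h**2

def eikonal_residual(theta):
    # Left-hand side minus right-hand side of Equation (3) in this setting.
    d = theta**2 + (1 - theta) ** 2
    lhs = 4 * theta**2 * (1 - theta) ** 2 / d**2 * d1(log_f, theta) ** 2
    rhs = 4 - 4 * theta * (1 - theta) / d**2
    return lhs - rhs

hessian_at_half = d2(log_f, 0.5)

print([eikonal_residual(t) for t in (0.1, 0.25, 0.8)])
print(hessian_at_half, -4 * math.sqrt(3))
```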

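Using this solution, we obtain the solution of Equation (4) given by:

$$\begin{aligned}
\log\tilde h(\theta) = \frac{1}{6}\Bigl[ & -\frac{1}{1-\theta} - \frac{1}{\theta} - 12\,\theta(1-\theta) - 12\sqrt{3}\,\sqrt{1-\theta+\theta^2} \\
& + (3-6\sqrt{3})\{\log\theta + \log(1-\theta)\} - 3\log(1-\theta+\theta^2) + 10\log\{(1-\theta)^2+\theta^2\} \\
& - 6\log\bigl(\sqrt{3}+2\sqrt{1-\theta+\theta^2}\bigr) + 6\sqrt{3}\,\log\bigl\{1+(1-\theta)+2\sqrt{1-\theta+\theta^2}\bigr\} \\
& + 6\sqrt{3}\,\log\bigl\{1+\theta+2\sqrt{1-\theta+\theta^2}\bigr\} \Bigr].
\end{aligned}$$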

The asymptotically constant-risk priors for the different sample sizes are shown in Figure 1. The prior
weight is found to be more concentrated to 1/2 as the sample size, N, grows.
In this example, we obtain the Kullback–Leibler risk of the Bayesian predictive density based on
the asymptotically constant-risk prior, $\pi(\theta;N)$, as:

$$R(\theta,\hat q_\pi(y|x^{(N)})) = \frac{2}{N} - \frac{4\sqrt{3}}{N\sqrt{N}} + O(N^{-2}).$$

We compare this value with the Bayes risk calculated using the Monte Carlo simulation; see Figure 2.
As the sample size, N, grows, the difference appears negligible. Further, we compare this value with
the risk itself calculated by the Monte Carlo simulation; see Figure 3. As the sample size, N, grows, the
risk becomes more constant.

Figure 1. Asymptotically constant-risk prior in the prediction where the data are distributed according
to the binomial distribution, $\mathrm{Bin}(N,\theta)$, and the target variable is distributed according to the binomial
distribution, $\mathrm{Bin}(1,\theta^2/(\theta^2+(1-\theta)^2))$.


Figure 2. Bayes risk based on the asymptotically constant-risk prior in the prediction where the data
are distributed according to the binomial distribution, $\mathrm{Bin}(N,\theta)$, and the target variable is distributed
according to the binomial distribution, $\mathrm{Bin}(1,\theta^2/(\theta^2+(1-\theta)^2))$.

Figure 3. Comparison of the Kullback–Leibler risk calculated using the Monte Carlo simulations and
the asymptotic risk, $2/N - (4\sqrt{3})/(N\sqrt{N})$, in the prediction where the data are distributed according
to the binomial distribution, $\mathrm{Bin}(N,\theta)$, and the target variable is distributed according to the binomial
distribution, $\mathrm{Bin}(1,\theta^2/(\theta^2+(1-\theta)^2))$.

6. Discussion and Conclusions

We have considered the setting where the quantity, $g^{X,ij}(\theta)\,g^Y_{ij}(\theta)$ (the trace of the product of the
inverse Fisher information matrix, $g^{X,ij}(\theta)$, and the Fisher information matrix, $g^Y_{ij}(\theta)$), has a unique
maximum point, and we have investigated the asymptotically constant-risk prior in the sense that the
asymptotic risk is constant up to $O(N^{-2})$.
In Section 3, we have considered the prior depending on the sample size, N, and constructed
the asymptotically constant-risk prior using Equations (3) and (4). In Section 4, we have clarified the
relationship between the subminimax estimator problem based on the mean squared error and the
prediction where the distributions of data and target variables are different. In Section 5, we have
constructed the asymptotically constant-risk prior in the prediction based on the logistic regression
model under the covariate shift.
We have assumed that the trace, $g^{X,ij}(\theta)\,g^Y_{ij}(\theta)$, is finite. However, the trace may diverge in the
non-compact parameter space; for example, it diverges under the predictive setting, where the
distribution, $q(y|\theta)$, of the target variable is the Poisson distribution and the data distribution, $p(x|\theta)$,
is the exponential distribution, with $\Theta$ equivalent to $\mathbb{R}$. Therefore, for our future work, in such a
setting, we should adopt criteria other than minimaxity.

Acknowledgments: The authors thank the referees for their helpful comments. This research was partially
supported by a Grant-in-Aid for Scientific Research (23650144, 26280005).

Author Contributions: Both authors contributed to the research and writing of this paper. Both authors read and
approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A

We prove Theorem 1. First, we introduce some lemmas for the proof. For the expansion, we
follow six steps (the first five steps are arranged in the form of lemmas): the first is to
expand the MAP estimator; the second is to calculate its bias and mean squared error; the third
is to expand the Kullback–Leibler risk using the $\hat\theta_\pi$-plugin predictive density, $q(y|\hat\theta_\pi)$; the fourth is to
expand the Bayesian predictive density based on the prior $\pi(\theta;N)$; the fifth is to expand the Bayesian
estimator minimizing the Bayes risk; and the last is to prove Theorem 1 using these lemmas.
We use some additional notations for the expansion. Let $\hat\theta_\pi$ be the maximum point of the
scalar function $\log p(x^{(N)}|\theta) + \log\{\pi(\theta;N)/|g^X(\theta)|^{1/2}\}$. Let $l(\theta|x^{(N)})$ denote the log likelihood of
the data, $x^{(N)}$. Let $l_{ij}(\theta|x^{(N)})$, $l_{ijk}(\theta|x^{(N)})$ and $l_{ijkl}(\theta|x^{(N)})$ be the derivatives of order 2, 3 and 4 of the
log likelihood, $l(\theta|x^{(N)})$. Let $H_{ij}(\theta|x^{(N)})$ denote the quantity, $l_{ij}(\theta|x^{(N)}) + N g^X_{ij}(\theta)$. Let $\tilde l_i(\theta|x^{(N)})$ and
$\tilde H_{ij}(\theta|x^{(N)})$ denote $(1/\sqrt{N})\,l_i(\theta|x^{(N)})$ and $(1/\sqrt{N})\,H_{ij}(\theta|x^{(N)})$, respectively. In addition, the brackets, $(\ )$,
denote the symmetrization: for any two tensors, $a_{ij}$ and $b_{ij}$, $a_{i(j}b_{k)l}$ denotes $(a_{ij}b_{kl} + a_{ik}b_{jl})/2$.

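Lemma A1. Let $\hat\theta_\pi$ be the maximum point of $\log p(x^{(N)}|\theta) + \log\{\pi(\theta;N)/|g^X(\theta)|^{1/2}\}$. Then, the $i$-th
component of this estimator $\hat\theta_\pi$ is expanded as follows:

$$\begin{aligned}
\hat\theta^i_\pi = {} & \theta^i + \frac{1}{\sqrt{N}}\,g^{X,ik}(\theta)\,\tilde l_k(\theta|x^{(N)}) + \frac{1}{\sqrt{N}}\,g^{X,ik}(\theta)\,\partial_k\log f(\theta) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,\tilde H_{km}(\theta|x^{(N)})\,g^{X,mr}(\theta)\,\tilde l_r(\theta|x^{(N)}) \\
& + \frac{1}{2N}\,g^{X,ik}(\theta)\,L^X_{kmr}(\theta)\,g^{X,mq}(\theta)\,g^{X,rs}(\theta)\,\tilde l_q(\theta|x^{(N)})\,\tilde l_s(\theta|x^{(N)}) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,\tilde H_{km}(\theta|x^{(N)})\,g^{X,mr}(\theta)\,\partial_r\log f(\theta) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,L^X_{kmr}(\theta)\,g^{X,mq}(\theta)\,g^{X,rs}(\theta)\,\tilde l_q(\theta|x^{(N)})\,\partial_s\log f(\theta) \\
& + \frac{1}{2N}\,g^{X,ik}(\theta)\,L^X_{kmr}(\theta)\,g^{X,mq}(\theta)\,g^{X,rs}(\theta)\,\partial_q\log f(\theta)\,\partial_s\log f(\theta) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,g^{X,mq}(\theta)\,\partial_{km}\log f(\theta)\,\tilde l_q(\theta|x^{(N)}) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,g^{X,mq}(\theta)\,\partial_{km}\log f(\theta)\,\partial_q\log f(\theta) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,\partial_k\log h(\theta) + O_P(N^{-3/2}).
\end{aligned} \tag{A1}$$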

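Proof. By the definition of $\hat\theta_\pi$, we get the equation given by:

$$\partial_i\log p(x^{(N)}|\hat\theta_\pi) + \partial_i\log\frac{\pi(\hat\theta_\pi;N)}{|g^X(\hat\theta_\pi)|^{1/2}} = 0.$$

From our assumption that prior $\pi(\theta;N)$ has the form given by:

$$\frac{\pi(\theta;N)}{|g^X(\theta)|^{1/2}} \propto \exp\bigl\{\sqrt{N}\log f(\theta) + \log h(\theta)\bigr\},$$

we rewrite this equation as:

$$\partial_i\log p(x^{(N)}|\hat\theta_\pi) + \sqrt{N}\,\partial_i\log f(\hat\theta_\pi) + \partial_i\log h(\hat\theta_\pi) = 0.$$

By applying Taylor expansion around $\theta$ to this new equation, we derive the following expansion:

$$\begin{aligned}
& \partial_i\log p(x^{(N)}|\theta) + \{\partial_{ij}\log p(x^{(N)}|\theta)\}(\hat\theta^j_\pi - \theta^j) \\
& + \frac{1}{2}\{\partial_{ijk}\log p(x^{(N)}|\theta)\}(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k) + \sqrt{N}\,\partial_i\log f(\theta) \\
& + \sqrt{N}\{\partial_{ij}\log f(\theta)\}(\hat\theta^j_\pi - \theta^j) + \partial_i\log h(\theta) + o_P(1) = 0.
\end{aligned}$$

From the law of large numbers and the central limit theorem, we rewrite the above expansion as:

$$\begin{aligned}
N g^X_{ij}(\theta)(\hat\theta^j_\pi - \theta^j) = {} & \partial_i\log p(x^{(N)}|\theta) + \sqrt{N}\,\partial_i\log f(\theta) + H_{ij}(\theta|x^{(N)})(\hat\theta^j_\pi - \theta^j) \\
& + \frac{N}{2}\,L^X_{ijk}(\theta)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k) + \sqrt{N}\,\partial_{ij}\log f(\theta)(\hat\theta^j_\pi - \theta^j) \\
& + \partial_i\log h(\theta) + o_P(1).
\end{aligned} \tag{A2}$$

By substituting the deviation, $\hat\theta_\pi - \theta$, recursively into Expansion (A2), we obtain Expansion (A1).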

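Lemma A2. Let $\hat\theta_\pi$ be the maximum point of $\log p(x^{(N)}|\theta) + \log\{\pi(\theta;N)/|g^X(\theta)|^{1/2}\}$. Then, the $i$-th
component of the bias of the estimator, $\hat\theta_\pi$, is given by:

$$\begin{aligned}
\mathrm{E}_{X^{(N)}}[\hat\theta^i_\pi] = {} & \theta^i + \frac{1}{\sqrt{N}}\,g^{X,ik}\,\partial_k\log f(\theta) \\
& - \frac{1}{2N}\,\overset{m}{\Gamma}{}^{X,i}(\theta) + \frac{1}{2N}\,g^{X,ik}(\theta)\,g^{X,mq}(\theta)\,g^{X,rs}(\theta)\,L^X_{kmr}(\theta)\,\partial_q\log f(\theta)\,\partial_s\log f(\theta) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,g^{X,mq}(\theta)\,\partial_{km}\log f(\theta)\,\partial_q\log f(\theta) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,\partial_k\log h(\theta) + O(N^{-3/2}).
\end{aligned} \tag{A3}$$

The $(i,j)$-component of the mean squared error of $\hat\theta_\pi$ is given by:

$$\begin{aligned}
& \mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] \\
& = \frac{1}{N}\,g^{X,ij}(\theta) + \frac{1}{N}\,g^{X,ik}(\theta)\,g^{X,jl}(\theta)\,\partial_k\log f(\theta)\,\partial_l\log f(\theta) \\
& \quad - \frac{1}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,\overset{m}{\Gamma}{}^{X,j)}(\theta)\,\partial_k\log f(\theta) + \frac{2}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,g^{X,j)l}(\theta)\,\partial_{kl}\log f(\theta) \\
& \quad + \frac{2}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,\partial_k g^{X,j)l}(\theta)\,\partial_l\log f(\theta) \\
& \quad + \frac{1}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,g^{X,j)l}(\theta)\,g^{X,nr}(\theta)\,g^{X,pt}(\theta)\,L^X_{lrt}(\theta)\,\partial_k\log f(\theta)\,\partial_n\log f(\theta)\,\partial_p\log f(\theta) \\
& \quad + \frac{2}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,g^{X,j)l}(\theta)\,g^{X,nr}(\theta)\,\partial_{ln}\log f(\theta)\,\partial_r\log f(\theta)\,\partial_k\log f(\theta) \\
& \quad + \frac{2}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,g^{X,j)l}(\theta)\,\partial_k\log f(\theta)\,\partial_l\log h(\theta) + O(N^{-2}),
\end{aligned} \tag{A4}$$

where $g^{X,k(i}(\theta)\,\overset{m}{\Gamma}{}^{X,j)}(\theta)$ denotes $(1/2)\{g^{X,ki}(\theta)\,\overset{m}{\Gamma}{}^{X,j}(\theta) + g^{X,kj}(\theta)\,\overset{m}{\Gamma}{}^{X,i}(\theta)\}$ and $g^{X,k(i}(\theta)\,\partial_k g^{X,j)l}(\theta)$
denotes $(1/2)\{g^{X,ki}(\theta)\,\partial_k g^{X,jl}(\theta) + g^{X,kj}(\theta)\,\partial_k g^{X,il}(\theta)\}$. The $(i,j,k)$-component of the mean of the third
power of the deviation, $\hat\theta_\pi - \theta$, is given by:

$$\begin{aligned}
& \mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
& = \frac{1}{N\sqrt{N}}\,g^{X,is}(\theta)\,g^{X,jt}(\theta)\,g^{X,ku}(\theta)\,\partial_s\log f(\theta)\,\partial_t\log f(\theta)\,\partial_u\log f(\theta) \\
& \quad + \frac{3}{N\sqrt{N}}\,g^{X,(ij}(\theta)\,g^{X,k)l}(\theta)\,\partial_l\log f(\theta) + O(N^{-2}).
\end{aligned} \tag{A5}$$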

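Proof. First, using Lemma A1, we determine the $i$-th component of the bias of $\hat\theta_\pi$ given by:

$$\begin{aligned}
\mathrm{E}_{X^{(N)}}[\hat\theta^i_\pi - \theta^i] = {} & \frac{1}{\sqrt{N}}\,g^{X,ik}\,\partial_k\log f(\theta) \\
& - \frac{1}{2N}\,\overset{m}{\Gamma}{}^{X,i}(\theta) + \frac{1}{2N}\,g^{X,ik}(\theta)\,g^{X,mq}(\theta)\,g^{X,rs}(\theta)\,L^X_{kmr}(\theta)\,\partial_q\log f(\theta)\,\partial_s\log f(\theta) \\
& + \frac{1}{N}\,g^{X,ik}(\theta)\,g^{X,mq}(\theta)\,\partial_{km}\log f(\theta)\,\partial_q\log f(\theta) + \frac{1}{N}\,g^{X,ik}(\theta)\,\partial_k\log h(\theta) + O(N^{-3/2}).
\end{aligned}$$

Second, consider the following relationship:

$$\begin{aligned}
& \mathrm{E}_{X^{(N)}}\Bigl[\Bigl\{\hat\theta^i_\pi - \theta^i - \frac{1}{\sqrt{N}}\,g^{X,ik}(\theta)\,\tilde l_k(\theta|x^{(N)}) - \frac{1}{\sqrt{N}}\,g^{X,ik}(\theta)\,\partial_k\log f(\theta)\Bigr\} \\
& \qquad \times \Bigl\{\hat\theta^j_\pi - \theta^j - \frac{1}{\sqrt{N}}\,g^{X,jl}(\theta)\,\tilde l_l(\theta|x^{(N)}) - \frac{1}{\sqrt{N}}\,g^{X,jl}(\theta)\,\partial_l\log f(\theta)\Bigr\}\Bigr] \\
& = \mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] + \frac{1}{N}\,g^{X,ij}(\theta) + \frac{1}{N}\,g^{X,ik}(\theta)\,g^{X,jl}(\theta)\,\partial_k\log f(\theta)\,\partial_l\log f(\theta) \\
& \quad - \frac{1}{\sqrt{N}}\,g^{X,ki}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^j_\pi - \theta^j)\,\tilde l_k(\theta|x^{(N)})] - \frac{1}{\sqrt{N}}\,g^{X,kj}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)\,\tilde l_k(\theta|x^{(N)})] \\
& \quad - \frac{1}{\sqrt{N}}\,g^{X,ki}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^j_\pi - \theta^j)\,\partial_k\log f(\theta)] - \frac{1}{\sqrt{N}}\,g^{X,kj}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)\,\partial_k\log f(\theta)].
\end{aligned} \tag{A6}$$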


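By differentiating the $j$-th component of the bias, $\mathrm{E}_{X^{(N)}}[\hat\theta^j_\pi - \theta^j]$, we obtain the equation given by:

$$\frac{1}{N}\,\partial_k\,\mathrm{E}_{X^{(N)}}[\hat\theta^j_\pi - \theta^j] = -\frac{1}{N}\,\delta^j_k + \frac{1}{\sqrt{N}}\,\mathrm{E}_{X^{(N)}}[(\hat\theta^j_\pi - \theta^j)\,\tilde l_k(\theta|x^{(N)})], \tag{A7}$$

where $\delta^i_j$ denotes the Kronecker delta: if the upper and the lower indices agree, then its value
is one and otherwise zero. Equation (A7) has been used by [2,16,19]. By substituting
Equations (A7) and (A3) into Relationship (A6), we obtain the $(i,j)$-component of the mean squared
error of $\hat\theta_\pi$ given by:

$$\begin{aligned}
& \mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] \\
& = \frac{1}{N}\,g^{X,ij}(\theta) + \frac{1}{N}\,g^{X,ik}(\theta)\,g^{X,jl}(\theta)\,\partial_k\log f(\theta)\,\partial_l\log f(\theta) \\
& \quad - \frac{1}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,\overset{m}{\Gamma}{}^{X,j)}(\theta)\,\partial_k\log f(\theta) + \frac{2}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,g^{X,j)l}(\theta)\,\partial_{kl}\log f(\theta) \\
& \quad + \frac{2}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,\partial_k g^{X,j)l}(\theta)\,\partial_l\log f(\theta) \\
& \quad + \frac{1}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,g^{X,j)l}(\theta)\,g^{X,nr}(\theta)\,g^{X,pt}(\theta)\,L^X_{lrt}(\theta)\,\partial_k\log f(\theta)\,\partial_n\log f(\theta)\,\partial_p\log f(\theta) \\
& \quad + \frac{2}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,g^{X,j)l}(\theta)\,g^{X,nr}(\theta)\,\partial_{ln}\log f(\theta)\,\partial_r\log f(\theta)\,\partial_k\log f(\theta) \\
& \quad + \frac{2}{N\sqrt{N}}\,g^{X,k(i}(\theta)\,g^{X,j)l}(\theta)\,\partial_k\log f(\theta)\,\partial_l\log h(\theta) + O(N^{-2}).
\end{aligned}$$

Finally, by taking the expectation of the third power of the deviation, $\hat\theta^i_\pi - \theta^i$, we obtain the
following expansion:

$$\begin{aligned}
& \mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
& = \frac{1}{N\sqrt{N}}\,g^{X,is}(\theta)\,g^{X,jt}(\theta)\,g^{X,ku}(\theta)\,\partial_s\log f(\theta)\,\partial_t\log f(\theta)\,\partial_u\log f(\theta) \\
& \quad + \frac{3}{N\sqrt{N}}\,g^{X,(ij}(\theta)\,g^{X,k)l}(\theta)\,\partial_l\log f(\theta) + O(N^{-2}).
\end{aligned}$$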


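Lemma A3. Let $\hat\theta_\pi$ be the maximum point of $\log p(x^{(N)}|\theta) + \log\{\pi(\theta;N)/|g^X(\theta)|^{1/2}\}$.
The Kullback–Leibler risk of the plug-in predictive density, $q(y|\hat\theta_\pi)$, with the estimator, $\hat\theta_\pi$, is
expanded as follows:

$$\begin{aligned}
R(\theta, q(y|\hat\theta_\pi)) = {} & \frac{1}{2N}\,g^Y_{ij}(\theta)\,g^{X,ij}(\theta) + \frac{1}{2N}\,\mathring{g}^{ij}(\theta)\,\partial_i\log f(\theta)\,\partial_j\log f(\theta) + \frac{1}{N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\bigl(\overset{e}{\nabla}_i\partial_j\log f(\theta)\bigr) \\
& + \frac{1}{N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\,g^{X,kl}(\theta)\bigl(\overset{e}{\nabla}_i\partial_k\log f(\theta)\bigr)\,\partial_j\log f(\theta)\,\partial_l\log f(\theta) \\
& - \frac{1}{3N\sqrt{N}}\,T^Y_{ijk}(\theta)\,g^{X,is}(\theta)\,g^{X,jt}(\theta)\,g^{X,ku}(\theta)\,\partial_s\log f(\theta)\,\partial_t\log f(\theta)\,\partial_u\log f(\theta) \\
& + \frac{1}{2N\sqrt{N}}\,g^Y_{kl}(\theta)\,M^l_{ij}\,g^{X,is}(\theta)\,g^{X,jt}(\theta)\,g^{X,ku}(\theta)\,\partial_s\log f(\theta)\,\partial_t\log f(\theta)\,\partial_u\log f(\theta) \\
& + \frac{1}{2N\sqrt{N}}\,g^{X,ij}(\theta)\,g^Y_{kl}(\theta)\,g^{X,kl}(\theta)\,M^m_{ij}\,\partial_m\log f(\theta) - \frac{1}{N\sqrt{N}}\,T^Y_{ijk}(\theta)\,g^{X,ij}(\theta)\,g^{X,kl}(\theta)\,\partial_l\log f(\theta) \\
& + \frac{1}{N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\,M^k_{ij}\,\partial_k\log f(\theta) + \frac{1}{N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\,\partial_i\log f(\theta)\,\partial_j\log h(\theta) + O(N^{-2}).
\end{aligned} \tag{A8}$$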

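Proof. By applying the Taylor expansion, the Kullback–Leibler risk, $R(\theta, q(y|\hat\theta_\pi))$, is expanded as:

$$\begin{aligned}
& \mathrm{E}_{X^{(N)}}[D(q(\cdot|\theta), q(\cdot|\hat\theta_\pi))] \\
& = \mathrm{E}_{X^{(N)}}\Bigl[\int q(y|\theta)\Bigl\{-l_i(\theta|y)(\hat\theta^i_\pi - \theta^i) - \frac{1}{2}\,l_{ij}(\theta|y)(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j) \\
& \qquad\qquad - \frac{1}{6}\,l_{ijk}(\theta|y)(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k) + O_P(N^{-2})\Bigr\}\,\mathrm{d}y\Bigr] \\
& = \frac{1}{2}\,g^Y_{ij}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] - \frac{1}{6}\,L^Y_{ijk}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] + O(N^{-2}) \\
& = \frac{1}{2}\,g^Y_{ij}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] \\
& \quad + \Bigl\{\frac{3}{2}\,\overset{m}{\Gamma}{}^Y_{(ij,k)}(\theta) - \frac{1}{3}\,T^Y_{ijk}(\theta)\Bigr\}\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] + O(N^{-2}) \\
& = \frac{1}{2}\,g^Y_{ij}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] - \frac{1}{3}\,T^Y_{ijk}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
& \quad + \frac{1}{2}\Bigl\{g^Y_{kl}(\theta)\,\overset{m}{\Gamma}{}^{Y,l}_{ij}(\theta) - g^Y_{kl}(\theta)\,\overset{m}{\Gamma}{}^{X,l}_{ij}(\theta)\Bigr\}\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
& \quad + \frac{1}{2}\,g^Y_{kl}(\theta)\,\overset{m}{\Gamma}{}^{X,l}_{ij}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] + O(N^{-2}),
\end{aligned} \tag{A9}$$

where $\overset{e}{\Gamma}{}^Y_{(ij,k)}$ denotes $(1/3)\{\overset{e}{\Gamma}{}^Y_{ij,k} + \overset{e}{\Gamma}{}^Y_{jk,i} + \overset{e}{\Gamma}{}^Y_{ki,j}\}$.

By the definition of the predictive metric, $\mathring{g}_{ij}(\theta) = g^X_{ik}(\theta)\,g^{Y,kl}(\theta)\,g^X_{lj}(\theta)$, by Expansions (A4) and
(A5) and by the relationship $L^X_{ijk}(\theta) = -\overset{e}{\Gamma}{}^X_{ij,k}(\theta) - \overset{e}{\Gamma}{}^X_{jk,i}(\theta) - \overset{e}{\Gamma}{}^X_{ki,j}(\theta) - T^X_{ijk}(\theta)$, the last two terms of the
above expansion (A9) are expanded as:

$$\begin{aligned}
& \frac{1}{2}\,g^Y_{ij}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)] + \frac{1}{2}\,g^Y_{kl}(\theta)\,\overset{m}{\Gamma}{}^{X,l}_{ij}(\theta)\,\mathrm{E}_{X^{(N)}}[(\hat\theta^i_\pi - \theta^i)(\hat\theta^j_\pi - \theta^j)(\hat\theta^k_\pi - \theta^k)] \\
& = \frac{1}{2N}\,g^Y_{ij}(\theta)\,g^{X,ij}(\theta) + \frac{1}{2N}\,\mathring{g}^{ij}(\theta)\,\partial_i\log f(\theta)\,\partial_j\log f(\theta) \\
& \quad + \frac{1}{N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\Bigl\{\partial_{ij}\log f(\theta) - \overset{e}{\Gamma}{}^{X,k}_{ij}(\theta)\,\partial_k\log f(\theta)\Bigr\} \\
& \quad + \frac{1}{N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\,g^{X,kl}(\theta)\Bigl\{\partial_{ik}\log f(\theta) - \overset{e}{\Gamma}{}^{X,m}_{ik}\,\partial_m\log f(\theta)\Bigr\}\,\partial_j\log f(\theta)\,\partial_l\log f(\theta) \\
& \quad + \frac{1}{N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\,\partial_i\log f(\theta)\,\partial_j\log h(\theta) + O(N^{-2}).
\end{aligned} \tag{A10}$$

By substituting Expansion (A10) into Expansion (A9), Expansion (A8) is obtained.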

Note that Expansion (A8) is invariant up to O(N−2) under the reparametrization, so that each
term of this expansion is a scalar function of θ.

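Lemma A4. Let $\hat\theta_\pi$ be the maximum point of $\log p(x^{(N)}|\theta) + \log\{\pi(\theta;N)/|g^X(\theta)|^{1/2}\}$. The Bayesian
predictive density based on the prior, $\pi(\theta;N)$, is expanded as:

$$\begin{aligned}
\hat q_\pi(y|x^{(N)}) = {} & q(y|\hat\theta_\pi) + \frac{1}{N}\,g^{X,ij}(\hat\theta_\pi)\Bigl\{\partial_i\log|g^X(\hat\theta_\pi)|^{1/2} - \overset{e}{\Gamma}{}^{X,k}_{ik}(\hat\theta_\pi)\Bigr\}\,\partial_j q(y|\hat\theta_\pi) \\
& + \frac{1}{2N}\,g^{X,ij}(\hat\theta_\pi)\Bigl\{\partial_{ij} q(y|\hat\theta_\pi) - \overset{m}{\Gamma}{}^{X,k}_{ij}(\hat\theta_\pi)\,\partial_k q(y|\hat\theta_\pi)\Bigr\} + O_P(N^{-3/2}).
\end{aligned} \tag{A11}$$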

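Proof. Let $\tilde\theta_\pi$ denote $\hat\theta_\pi - \theta$. First, using a Taylor expansion twice, we expand the posterior density,
$\pi(\theta|x^{(N)})$, as:

$$\begin{aligned}
\pi(\theta|x^{(N)}) = {} & |g^X(\hat\theta_\pi)|^{1/2}\,\frac{\pi(\hat\theta_\pi)}{|g^X(\hat\theta_\pi)|^{1/2}}\,p(x^{(N)}|\hat\theta_\pi)\,\exp\Bigl\{-\frac{1}{2}\{-l_{ij}(\hat\theta_\pi|x^{(N)})\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi\Bigr\} \\
& \times \Bigl[1 - \{\partial_i\log|g^X(\hat\theta_\pi)|^{1/2}\}\,\tilde\theta^i_\pi + \frac{1}{2}\,\frac{\partial_{ij}|g^X(\hat\theta_\pi)|^{1/2}}{|g^X(\hat\theta_\pi)|^{1/2}}\,\tilde\theta^i_\pi\tilde\theta^j_\pi + O_P(N^{-3/2})\Bigr] \\
& \times \Bigl[1 + \frac{1}{2}\{\sqrt{N}\,\partial_{ij}\log f(\hat\theta_\pi)\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi - \frac{1}{6}\{l_{ijk}(\hat\theta_\pi|x^{(N)})\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi + \frac{1}{2}\{\log h(\hat\theta_\pi)\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi \\
& \qquad - \frac{1}{6}\{\sqrt{N}\,\partial_{ijk}\log f(\hat\theta_\pi)\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi + \frac{1}{24}\,l_{ijkl}(\hat\theta_\pi|x^{(N)})\,\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi\tilde\theta^l_\pi \\
& \qquad + \frac{1}{2}\Bigl\{\frac{1}{2}\{\sqrt{N}\,\partial_{ij}\log f(\hat\theta_\pi)\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi - \frac{1}{6}\,l_{ijk}(\hat\theta_\pi|x^{(N)})\,\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi\Bigr\} \\
& \qquad\quad \times \Bigl\{\frac{1}{2}\{\sqrt{N}\,\partial_{ij}\log f(\hat\theta_\pi)\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi - \frac{1}{6}\,l_{ijk}(\hat\theta_\pi|x^{(N)})\,\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi\Bigr\} + O_P(N^{-3/2})\Bigr] \\
& \times \Bigl[\int p(x^{(N)}|\theta)\,\frac{\pi(\theta;N)}{|g^X(\theta)|^{1/2}}\,|g^X(\theta)|^{1/2}\,\mathrm{d}\theta\Bigr]^{-1}.
\end{aligned}$$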


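We denote the $N^{-1/2}$-order, $N^{-1}$-order and $N^{-3/2}$-order terms by $(N^{-1/2})\,a_0(\tilde\theta_\pi;\hat\theta_\pi)$,
$(N^{-1})\,a_1(\tilde\theta_\pi;\hat\theta_\pi)$ and $(N^{-3/2})\,a_2(\tilde\theta_\pi;\hat\theta_\pi)$, respectively. Then, this expansion is rewritten as:

$$\begin{aligned}
\pi(\theta|x^{(N)}) = {} & |g^X(\hat\theta_\pi)|^{1/2}\,\frac{\pi(\hat\theta_\pi)}{|g^X(\hat\theta_\pi)|^{1/2}}\,p(x^{(N)}|\hat\theta_\pi)\,\exp\Bigl\{-\frac{1}{2}\{-l_{ij}(\hat\theta_\pi|x^{(N)})\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi\Bigr\} \\
& \times \Bigl[1 + \frac{1}{\sqrt{N}}\,a_0(\tilde\theta_\pi;\hat\theta_\pi) + \frac{1}{N}\,a_1(\tilde\theta_\pi;\hat\theta_\pi) + \frac{1}{N\sqrt{N}}\,a_2(\tilde\theta_\pi;\hat\theta_\pi) + O_P(N^{-2})\Bigr] \\
& \times \Bigl[\int p(x^{(N)}|\theta)\,\frac{\pi(\theta;N)}{|g^X(\theta)|^{1/2}}\,|g^X(\theta)|^{1/2}\,\mathrm{d}\theta\Bigr]^{-1}.
\end{aligned}$$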

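To make the expansion easier to see, the following notations are used. Let $\phi(\eta; -l_{ij}(\hat\theta_\pi|x^{(N)}))$ be
the probability density function of the $d$-dimensional normal distribution with the precision matrix
whose $(i,j)$-component is $-l_{ij}(\hat\theta_\pi|x^{(N)})$. Let $\eta = (\eta^1,\cdots,\eta^d)$ be a $d$-dimensional random vector
distributed according to the normal density, $\phi(\eta; -l_{ij}(\hat\theta_\pi|x^{(N)}))$. The notations, $\bar a_0(\hat\theta_\pi)$, $\bar a_1(\hat\theta_\pi)$, $\bar a_2(\hat\theta_\pi)$
and $\hat\omega^{ij}(\hat\theta_\pi)$, denote the expectations of $a_0(\eta;\hat\theta_\pi)$, $a_1(\eta;\hat\theta_\pi)$, $a_2(\eta;\hat\theta_\pi)$ and $\eta^i\eta^j$, respectively.
Using the above notations, we get the following posterior expansion:

$$\begin{aligned}
\pi(\theta|x^{(N)}) = {} & \phi(\hat\theta_\pi; -l_{ij}(\hat\theta_\pi|x^{(N)})) \\
& \times \Bigl[1 + \frac{1}{\sqrt{N}}\{a_0(\tilde\theta_\pi;\hat\theta_\pi) - \bar a_0(\hat\theta_\pi)\} + \frac{1}{N}\{a_1(\tilde\theta_\pi;\hat\theta_\pi) - \bar a_1(\hat\theta_\pi)\} \\
& \qquad - \frac{1}{N}\,\bar a_0(\hat\theta_\pi)\{a_0(\tilde\theta_\pi;\hat\theta_\pi) - \bar a_0(\hat\theta_\pi)\} + \frac{1}{N\sqrt{N}}\{a_2(\tilde\theta_\pi;\hat\theta_\pi) - \bar a_2(\hat\theta_\pi)\} \\
& \qquad - \frac{1}{N\sqrt{N}}\,\bar a_0(\hat\theta_\pi)\{a_1(\tilde\theta_\pi;\hat\theta_\pi) - \bar a_1(\hat\theta_\pi)\} - \frac{1}{N\sqrt{N}}\,\bar a_1(\hat\theta_\pi)\{a_0(\tilde\theta_\pi;\hat\theta_\pi) - \bar a_0(\hat\theta_\pi)\} \\
& \qquad + \frac{1}{N\sqrt{N}}\,\bar a_0^2(\hat\theta_\pi)\{a_1(\tilde\theta_\pi;\hat\theta_\pi) - \bar a_1(\hat\theta_\pi)\} + O_P(N^{-2})\Bigr].
\end{aligned} \tag{A12}$$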

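Second, using (A12), the Bayesian predictive density, $\hat q_\pi(y|x^{(N)})$, based on the prior, $\pi(\theta;N)$, is
expanded as:

$$\begin{aligned}
\hat q_\pi(y|x^{(N)}) = {} & \int q(y|\hat\theta_\pi)\Bigl[1 - \{\partial_i\log q(y|\hat\theta_\pi)\}\,\tilde\theta^i_\pi + \frac{1}{2}\,\frac{\partial_{ij}q(y|\hat\theta_\pi)}{q(y|\hat\theta_\pi)}\,\tilde\theta^i_\pi\tilde\theta^j_\pi + o_P(N^{-1})\Bigr]\,\pi(\theta|x^{(N)})\,\mathrm{d}\theta \\
= {} & \int q(y|\hat\theta_\pi)\Bigl[1 + \{\partial_i\log|g^X(\hat\theta_\pi)|^{1/2}\}\{\partial_j\log q(y|\hat\theta_\pi)\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi \\
& \qquad + \frac{1}{6}\{\partial_{ijk}\log p(x^{(N)}|\hat\theta_\pi) + \sqrt{N}\,\partial_{ijk}\log f(\hat\theta_\pi)\}\{\partial_l\log q(y|\hat\theta_\pi)\}\,\tilde\theta^i_\pi\tilde\theta^j_\pi\tilde\theta^k_\pi\tilde\theta^l_\pi \\
& \qquad + \frac{1}{2}\,\frac{\partial_{ij}q(y|\hat\theta_\pi)}{q(y|\hat\theta_\pi)}\,\tilde\theta^i_\pi\tilde\theta^j_\pi + o_P(N^{-1})\Bigr]\,\phi(\tilde\theta_\pi; -l_{ij}(\hat\theta_\pi|x^{(N)}))\,\mathrm{d}\tilde\theta_\pi \\
= {} & q(y|\hat\theta_\pi) + \hat\omega^{ij}(\hat\theta_\pi)\{\partial_i\log|g^X(\hat\theta_\pi)|^{1/2}\}\,\partial_j q(y|\hat\theta_\pi) + \frac{1}{2}\,\hat\omega^{ik}(\hat\theta_\pi)\,\hat\omega^{jl}(\hat\theta_\pi)\,l_{ijk}(\hat\theta_\pi|x^{(N)})\,\partial_l q(y|\hat\theta_\pi) \\
& + \frac{1}{2}\,\hat\omega^{ij}(\hat\theta_\pi)\,\partial_{ij}q(y|\hat\theta_\pi) + O_P(N^{-3/2}).
\end{aligned} \tag{A13}$$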

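Here, the following two equations hold:

$$-l_{ij}(\hat\theta_\pi|x^{(N)}) = N g^X_{ij}(\hat\theta_\pi) - \sqrt{N}\,\tilde H_{ij}(\hat\theta_\pi|x^{(N)}) + O_P(1), \tag{A14}$$

$$l_{ijk}(\hat\theta_\pi|x^{(N)}) = -2N\,\overset{e}{\Gamma}{}^X_{ij,k}(\hat\theta_\pi) - N\,\overset{m}{\Gamma}{}^X_{ik,j}(\hat\theta_\pi) + \sqrt{N}\,\tilde H_{ijk}(\hat\theta|x^{(N)}). \tag{A15}$$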

By combining Equation (A14) with the Sherman–Morrison–Woodbury formula, the following
expansion is obtained:

$$\hat\omega^{ij}(\hat\theta_\pi) = \frac{1}{N}\,g^{X,ij}(\hat\theta_\pi) + \frac{1}{N\sqrt{N}}\,g^{X,ik}(\hat\theta_\pi)\,g^{X,jl}(\hat\theta_\pi)\,\tilde H_{kl}(\hat\theta_\pi|x^{(N)}) + O_P(N^{-2}). \tag{A16}$$

By substituting Equations (A14), (A15) and (A16) into Expansion (A13), Expansion (A11) is
obtained.
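The step from (A14) to (A16) is a first-order expansion of a matrix inverse: if the precision matrix is $N g - \sqrt{N}\tilde H$, its inverse is $g^{-1}/N + g^{-1}\tilde H g^{-1}/(N\sqrt{N})$ up to terms of order $N^{-2}$. The sketch below illustrates this on a $2\times 2$ example; the matrices and helper names are arbitrary choices made here for illustration, not quantities from the paper.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def first_order_inverse(g, h_tilde, n):
    """Approximate (n*g - sqrt(n)*h_tilde)^{-1} by
    g^{-1}/n + g^{-1} h_tilde g^{-1} / (n*sqrt(n))."""
    g_inv = inv2(g)
    corr = matmul2(matmul2(g_inv, h_tilde), g_inv)
    return [[g_inv[i][j] / n + corr[i][j] / (n * n**0.5) for j in range(2)]
            for i in range(2)]

# Arbitrary symmetric test matrices (hypothetical values for illustration).
g = [[2.0, 0.5], [0.5, 1.0]]
h_tilde = [[0.3, -0.1], [-0.1, 0.2]]
n = 10_000

precision = [[n * g[i][j] - n**0.5 * h_tilde[i][j] for j in range(2)]
             for i in range(2)]
exact = inv2(precision)
approx = first_order_inverse(g, h_tilde, n)
max_err = max(abs(exact[i][j] - approx[i][j]) for i in range(2) for j in range(2))
print(max_err, 1 / n**2)
```

The remaining discrepancy is of order $N^{-2}$, which is the order of the remainder term stated in (A16).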

Note that the integration of Expansion (A11) is one up to OP(N−2). Further, Expansion (A11) is
similar to the expansion in [2]. However, the estimator that is the center of the expansion is different,
because of the dependence of the prior on the sample size.

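Lemma A5. The Bayesian estimator, $\hat\theta_{\mathrm{opt}}$, minimizing the Bayes risk,
$\int R(\theta, q(y|\hat\theta))\,\mathrm{d}\pi(\theta;N)$, among plug-in predictive densities is given by:

$$\hat\theta^i_{\mathrm{opt}} = \hat\theta^i_\pi + \frac{1}{2N}\,g^{X,ij}(\hat\theta_\pi)\,T^X_j(\hat\theta_\pi) + \frac{1}{2N}\,g^{X,jk}(\hat\theta_\pi)\Bigl\{\overset{m}{\Gamma}{}^{Y,i}_{jk}(\hat\theta_\pi) - \overset{m}{\Gamma}{}^{X,i}_{jk}(\hat\theta_\pi)\Bigr\} + O_P(N^{-3/2}). \tag{A17}$$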

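Proof. The Bayes risk, $\int R(\theta, q(y|\hat\theta))\,\mathrm{d}\pi(\theta;N)$, is decomposed as:

$$\begin{aligned}
\int R(\theta, q(y|\hat\theta))\,\mathrm{d}\pi(\theta;N) = {} & \int \pi(\theta;N)\int p(x^{(N)}|\theta)\int q(y|\theta)\,\log\frac{q(y|\theta)}{\hat q_\pi(y|x^{(N)})}\,\mathrm{d}y\,\mathrm{d}x^{(N)}\,\mathrm{d}\theta \\
& + \int \pi(\theta;N)\int p(x^{(N)}|\theta)\int q(y|\theta)\,\log\frac{\hat q_\pi(y|x^{(N)})}{q(y|\hat\theta)}\,\mathrm{d}y\,\mathrm{d}x^{(N)}\,\mathrm{d}\theta.
\end{aligned}$$

The first term of this decomposition is not dependent on $\hat\theta$. From Fubini’s theorem and Lemma A4, the
proof is completed.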

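Using these lemmas, we prove Theorem 1. First, we find that the Kullback–Leibler risk of the
plug-in predictive density with the estimator, $\hat\theta_{\mathrm{opt}}$, defined in Lemma A5, is given by:

$$\begin{aligned}
R(\theta, q(y|\hat\theta_{\mathrm{opt}})) = {} & R(\theta, q(y|\hat\theta_\pi)) + \frac{1}{2N\sqrt{N}}\,\mathring{g}^{ij}(\theta)\,T^X_i(\theta)\,\partial_j\log f(\theta) \\
& + \frac{1}{2N\sqrt{N}}\,g^{X,im}(\theta)\,g^Y_{ij}(\theta)\,g^{X,kl}(\theta)\Bigl\{\overset{m}{\Gamma}{}^{Y,j}_{kl}(\theta) - \overset{m}{\Gamma}{}^{X,j}_{kl}(\theta)\Bigr\}\,\partial_m\log f(\theta).
\end{aligned} \tag{A18}$$

Using Expansion (A18) and Lemma A3, we expand the Kullback–Leibler risk, $R(\theta,\hat q_\pi(y|x^{(N)}))$.
Here, the risk, $R(\theta,\hat q_\pi(y|x^{(N)}))$, is equal to the risk, $R(\theta, q(y|\hat\theta_{\mathrm{opt}}))$, up to $O(N^{-2})$, because we expand
the Bayesian predictive density, $\hat q_\pi(y|x^{(N)})$, as:

$$\hat q_\pi(y|x^{(N)}) = q(y|\hat\theta_{\mathrm{opt}}) + \frac{1}{2N}\,g^{X,ij}(\hat\theta_\pi)\Bigl\{\partial_{ij}q(y|\hat\theta_\pi) - \overset{m}{\Gamma}{}^{Y,k}_{ij}(\hat\theta_\pi)\,\partial_k q(y|\hat\theta_\pi)\Bigr\} + O_P(N^{-3/2}). \tag{A19}$$

Thus, we obtain Expansion (1).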

References

1. Aitchison, J. Goodness of prediction fit. Biometrika 1975, 62, 547–554.
2. Komaki, F. On asymptotic properties of predictive distributions. Biometrika 1996, 83, 299–313.
3. Hartigan, J. The maximum likelihood prior. Ann. Stat. 1998, 26, 2083–2103.
4. Bernardo, J. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 1979, 41, 113–147.
5. Clarke, B.; Barron, A. Jeffreys prior is asymptotically least favorable under entropy risk. J. Stat. Plan.
Inference 1994, 41, 37–60.
6. Aslan, M. Asymptotically minimax Bayes predictive densities. Ann. Stat. 2006, 34, 2921–2938.
7. Komaki, F. Bayesian predictive densities based on latent information priors. J. Stat. Plan. Inference 2011, 141,
3705–3715.
8. Komaki, F. Asymptotically minimax Bayesian predictive densities for multinomial models. Electron. J. Stat.
2012, 6, 934–957.
9. Kanamori, T.; Shimodaira, H. Active learning algorithm using the maximum weighted log-likelihood
estimator. J. Stat. Plan. Inference 2003, 116, 149–162.
10. Shimodaira, H. Improving predictive inference under covariate shift by weighting the
log-likelihood function. J. Stat. Plan. Inference 2000, 90, 227–244.
11. Fushiki, T.; Komaki, F.; Aihara, K. On parametric bootstrapping and Bayesian prediction. Scand. J. Stat. 2004,
31, 403–416.
12. Suzuki, T.; Komaki, F. On prior selection and covariate shift of β-Bayesian prediction under α-divergence
risk. Commun. Stat. Theory 2010, 39, 1655–1673.
13. Komaki, F. Asymptotic properties of Bayesian predictive densities when the distributions of data and target
variables are different. Bayesian Anal. 2014, submitted for publication.
14. Hodges, J.L.; Lehmann, E.L. Some problems in minimax point estimation. Ann. Math. Stat. 1950, 21, 182–197.
15. Ghosh, M.N. Uniform approximation of minimax point estimates. Ann. Math. Stat. 1964, 35, 1031–1047.
16. Amari, S. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1985.
17. Robbins, H. Asymptotically subminimax solutions of compound statistical decision problems.
In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA,
USA, 31 July–12 August 1950; University of California Press: Oakland, CA, USA, 1950; pp. 131–148.
18. Frank, P.; Kiefer, J. Almost subminimax and biased minimax procedures. Ann. Math. Stat. 1951, 22, 465–468.
19. Efron, B. Defining curvature of a statistical problem (with applications to second order efficiency). Ann. Stat.
1975, 3, 189–1372.

c⃝ 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).

entropy

Article
Information-Geometric Markov Chain Monte Carlo
Methods Using Diffusions

Samuel Livingstone 1,* and Mark Girolami 2

1 Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK
2 Department of Statistics, University of Warwick, Coventry CV4 7AL, UK; E-Mail: m.girolami@warwick.ac.uk
* E-Mail: samuel.livingstone@ucl.ac.uk; Tel.: +44-20-7679-1872.

Received: 29 March 2014; in revised form: 23 May 2014 / Accepted: 28 May 2014 /
Published: 3 June 2014

Abstract: Recent work incorporating geometric ideas in Markov chain Monte Carlo is reviewed
in order to highlight these advances and their possible application in a range of domains beyond
statistics. A full exposition of Markov chains and their use in Monte Carlo simulation for statistical
inference and molecular dynamics is provided, with particular emphasis on methods based on
Langevin diffusions. After this, geometric concepts in Markov chain Monte Carlo are introduced.
A full derivation of the Langevin diffusion on a Riemannian manifold is given, together with
a discussion of the appropriate Riemannian metric choice for different problems. A survey of
applications is provided, and some open questions are discussed.

Keywords: information geometry; Markov chain Monte Carlo; Bayesian inference; computational
statistics; machine learning; statistical mechanics; diffusions

1. Introduction

There are three objectives to this article. The first is to introduce geometric concepts that have
recently been employed in Monte Carlo methods based on Markov chains [1] to a wider audience.
The second is to clarify what a “diffusion on a manifold” is, and how this relates to a diffusion
defined on Euclidean space. Finally, we review the state-of-the-art in the field and suggest avenues for
further research.
The connections between some Monte Carlo methods commonly used in statistics, physics
and application domains, such as econometrics, and ideas from both Riemannian and information
geometry [2,3] were highlighted by Girolami and Calderhead [1] and the potential benefits
demonstrated empirically. Two Markov chain Monte Carlo methods were introduced, the manifold
Metropolis-adjusted Langevin algorithm and Riemannian manifold Hamiltonian Monte Carlo. Here,
we focus on the former for two reasons. First, the intuition for why geometric ideas can improve
standard algorithms is the same in both cases. Second, the foundations of the methods are quite
different, and since the focus of the article is on using geometric ideas to improve performance,
we considered a detailed description of both to be unnecessary. It should be noted, however, that
impressive empirical evidence exists for using Hamiltonian methods in some scenarios (e.g., [4]). We
refer interested readers to [5,6].
We take an expository approach, providing a review of some necessary preliminaries from
Markov chain Monte Carlo, diffusion processes and Riemannian geometry. We assume only a minimal
familiarity with measure-theoretic probability. More informed readers may prefer to skip these
sections. We then provide a full derivation of the Langevin diffusion on a Riemannian manifold and
offer some intuition for how to think about such a process. We conclude Section 4 by presenting the
Metropolis-adjusted Langevin algorithm on a Riemannian manifold.
A key challenge in the geometric approach is which manifold to choose. We discuss this in
Section 4.4 and review some candidates that have been suggested in the literature, along with the
reasoning for each. Rather than provide a simulation study here, we instead reference in Section 5
studies where the methods we describe have been applied. In Section 6, we discuss several open
questions, which we feel could be interesting areas of further research and of interest to both theorists
and practitioners.

Entropy 2014, 16, 3074–3102; doi:10.3390/e16063074
www.mdpi.com/journal/entropy

Throughout, π(·) will refer to an n-dimensional probability distribution and π(x) its density with
respect to the Lebesgue measure.

2. Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) is a set of methods for drawing samples from a distribution,
π(·), defined on a measurable space (X , B), whose density is only known up to some proportionality
constant. Although the i-th sample is dependent on the (i − 1)-th, the Ergodic Theorem ensures that
for an appropriately constructed Markov chain with invariant distribution π(·), long-run averages are
consistent estimators for expectations under π(·). As a result, MCMC methods have proven useful
in Bayesian statistical inference, where often, the posterior density π(x|y) ∝ f (y|x)π0(x) for some
parameter, x (where f (y|x) denotes the likelihood for data y and π0(x) the prior density), is only
known up to a constant [7]. Here, we briefly introduce some concepts from general state space Markov
chain theory together with a short overview of MCMC methods. The exposition follows [8].

2.1. Markov Chain Preliminaries

A time-homogeneous Markov chain, {Xm}m∈N, is a collection of random variables, Xm, each of
which is defined on a measurable space (X , B), such that:
P[Xm ∈ A|X0 = x0, ..., Xm−1 = xm−1] = P[Xm ∈ A|Xm−1 = xm−1],
(1)

for any A ∈ B. We define the transition kernel P(xm−1, A) = P[Xm ∈ A|Xm−1 = xm−1] for the chain to
be a map for which P(x, ·) defines a distribution over (X , B) for any x ∈ X , and P(·, A) is measurable
for any A ∈ B. Intuitively, P defines a map from points to distributions in X . Similarly, we define the
m-step transition kernel to be:

Pm(x0, A) = P[Xm ∈ A|X0 = x0].
(2)

We call a distribution π(·) invariant for {Xm}m∈N if:

$$\pi(A) = \int_{\mathcal{X}} P(x, A)\,\pi(dx) \tag{3}$$

for all A ∈ B. If P(x, ·) admits a density, p(x′|x), this can be equivalently written:

$$\pi(x') = \int_{\mathcal{X}} \pi(x)\,p(x'|x)\,dx. \tag{4}$$

The implication of Equations (3) and (4) is that if Xm ∼ π(·), then Xm+s ∼ π(·) for any s ∈ N. In this
instance, we say the chain is “at stationarity”. Of interest to us will be Markov chains for which there
is a unique invariant distribution, which is also the limiting distribution for the chain, meaning that for
any x0 ∈ X for which π(x0) > 0:
$$\lim_{m \to \infty} P^m(x_0, A) = \pi(A) \tag{5}$$

for any A ∈ B. Certain conditions are required for Equation (5) to hold, but for all Markov chains
presented here, these are satisfied (though, see [8]).
A useful condition, which is sufficient (though not necessary) for π(·) to be an invariant
distribution, is reversibility, which can be shown by the relation:

π(x)p(x′|x) = π(x′)p(x|x′).
(6)


Integrating both sides with respect to x, we recover Equation (4). In other words, a chain is
reversible if, at stationarity, the probability that xi ∈ A and xi+1 ∈ B is equal to the probability that
xi+1 ∈ A and xi ∈ B. The relation (6) will be the primary tool used to construct Markov chains with a
desired invariant distribution in the next section.
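
As a small illustration of why reversibility implies invariance, the following sketch checks the discrete analogues of Equations (4) and (6) on a three-state chain. The state space, the target probabilities and the "flow" matrix `F` are hypothetical values chosen for this example only; they do not come from the article.

```python
# Hypothetical 3-state example: all numbers are illustrative assumptions.
pi = [0.2, 0.3, 0.5]

# Symmetric "probability flows" F[i][j] = pi[i] * P[i][j] for i != j.
# Symmetry of F is exactly the discrete form of Equation (6).
F = [[0.00, 0.05, 0.05],
     [0.05, 0.00, 0.10],
     [0.05, 0.10, 0.00]]

n = len(pi)
P = [[F[i][j] / pi[i] for j in range(n)] for i in range(n)]
for i in range(n):
    P[i][i] = 1.0 - sum(P[i])  # remaining mass: the chain stays at state i


def is_reversible(pi, P, tol=1e-12):
    """Discrete form of Equation (6): pi_i P_ij == pi_j P_ji for all i, j."""
    k = len(pi)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(k) for j in range(k))


def one_step(pi, P):
    """Discrete form of Equation (4): the distribution after one transition."""
    k = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]


print(is_reversible(pi, P))
print(one_step(pi, P))  # agrees with pi up to rounding
```

Summing the detailed balance relation over the first index is the discrete counterpart of integrating Equation (6) with respect to x.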

2.1.1. Monte Carlo Estimates from Markov Chains

Of most interest here are estimators constructed from a Markov chain. The Ergodic Theorem
states that for any chain, {Xm}m∈N, satisfying Equation (5) and any g ∈ L1(π), we have that:

$$\lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} g(X_i) = E_\pi[g(X)] \tag{7}$$

with probability one [7]. This is a Markov chain analogue to the law of large numbers.
The efficiency of estimators of the form $\hat{t}_m = \sum_i g(X_i)/m$ can be assessed through the
autocorrelation between elements in the chain. We will assess the efficiency of $\hat{t}_m$ relative to estimators
$\bar{t}_m = \sum_i g(Z_i)/m$, where $\{Z_i\}_{i \in \mathbb{N}}$ is a sequence of independent random variables, each having
distribution π(·). Provided Varπ[g(Zi)] < ∞, then $\mathrm{Var}[\bar{t}_m] = \mathrm{Var}_\pi[g(Z_i)]/m$. We now seek a similar
result for estimators of the form $\hat{t}_m$.
It follows directly from the Kipnis–Varadhan Theorem [9] that an estimator, $\hat{t}_m$, from a reversible
Markov chain for which X0 ∼ π(·) satisfies:

$$\lim_{m \to \infty} \frac{\mathrm{Var}[\hat{t}_m]}{\mathrm{Var}[\bar{t}_m]} = 1 + 2 \sum_{i=1}^{\infty} \rho^{(0,i)} = \tau, \tag{8}$$

provided that $\sum_{i=1}^{\infty} i\,|\rho^{(0,i)}| < \infty$, where $\rho^{(0,i)} = \mathrm{Corr}_\pi[g(X_0), g(X_i)]$. We will refer to the constant, τ, as
the autocorrelation time for the chain.
Equation (8) implies that for large enough m, $\mathrm{Var}[\hat{t}_m] \approx \tau \mathrm{Var}[\bar{t}_m]$. In practical applications, the
sum in Equation (8) is truncated to the first p − 1 realisations of the chain, where p is the first instance
at which $|\rho^{(0,p)}| < \epsilon$ for some ϵ > 0. For example, in the Convergence Diagnosis and Output Analysis
for MCMC (CODA) package within the R statistical software, ϵ = 0.05 [10,11].
Another commonly used measure of efficiency is the effective sample size $m_{eff} = m/\tau$, which
gives the number of independent samples from π(·) needed to give an equally efficient estimate for
Eπ[g(X)]. Clearly, minimising τ is equivalent to maximising $m_{eff}$.
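
The truncated sum and the effective sample size can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CODA implementation: the function names are hypothetical, g is taken to be the identity, and the two test sequences (independent Gaussian draws and a first-order autoregressive chain with coefficient 0.9) are chosen only to show the contrast.

```python
import random


def autocorrelation(xs, lag):
    """Empirical lag-`lag` autocorrelation of the sequence xs."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(m - lag)) / m
    return cov / var


def autocorrelation_time(xs, eps=0.05):
    """Estimate tau = 1 + 2 * sum_i rho_i, truncating at the first lag p
    with |rho_p| < eps (the rule described in the text, with eps = 0.05)."""
    tau = 1.0
    for lag in range(1, len(xs)):
        rho = autocorrelation(xs, lag)
        if abs(rho) < eps:
            break
        tau += 2.0 * rho
    return tau


def effective_sample_size(xs, eps=0.05):
    """m_eff = m / tau."""
    return len(xs) / autocorrelation_time(xs, eps)


# Assumed test data: i.i.d. draws versus a strongly autocorrelated AR(1) chain.
rng = random.Random(0)
iid = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ar1 = [0.0]
for _ in range(19999):
    ar1.append(0.9 * ar1[-1] + rng.gauss(0.0, 1.0))

print(autocorrelation_time(iid), effective_sample_size(iid))
print(autocorrelation_time(ar1), effective_sample_size(ar1))
```

For the autoregressive chain the untruncated value of τ is (1 + 0.9)/(1 − 0.9) = 19, so the estimate should be of that order, whereas for independent draws it should be close to one.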
The measures arising from Equation (8) give some intuition for what sort of Markov chain gives
rise to efficient estimators. However, in practice, the chain will never be at stationarity. Therefore, we
also assess Markov chains according to how far away they are from this point. For this, we need to
measure how close Pm(x0, ·) is to π(·), which requires a notion of distance between probability
distributions.
Although there are several appropriate choices [12], a common option in the Markov chain
literature is the total variation distance:

$$\|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|, \tag{9}$$

which informally gives the largest possible difference between the probabilities of a single event in
B according to μ(·) and ν(·). If both distributions admit densities, Equation (9) can be written (see
Appendix A):

$$\|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)|\,dx, \tag{10}$$

which is proportional to the L1 distance between μ(x) and ν(x). The total variation distance takes values
in [0, 1]: ∥μ(·) − ν(·)∥TV = 1 for distributions with disjoint supports, and ∥μ(·) − ν(·)∥TV = 0 implies μ(·) ≡ ν(·).
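
On a finite state space, the agreement between Equations (9) and (10) can be checked by brute force. The sketch below is a discrete analogue only (sums replace the integral), and the two distributions are hypothetical.

```python
from itertools import combinations


def tv_sup(mu, nu):
    """Equation (9) on a finite space: maximise |mu(A) - nu(A)| over all events A."""
    states = range(len(mu))
    best = 0.0
    for k in range(len(mu) + 1):
        for event in combinations(states, k):
            diff = abs(sum(mu[i] for i in event) - sum(nu[i] for i in event))
            best = max(best, diff)
    return best


def tv_half_l1(mu, nu):
    """Discrete analogue of Equation (10): half the L1 distance."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))


# Hypothetical distributions on four states.
mu = [0.5, 0.3, 0.2, 0.0]
nu = [0.1, 0.3, 0.2, 0.4]

print(tv_sup(mu, nu), tv_half_l1(mu, nu))
print(tv_half_l1([1.0, 0.0], [0.0, 1.0]))  # disjoint supports
```

The brute-force search is exponential in the number of states, so it is useful only as a check; the half-L1 form is the one to compute in practice.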


Typically, for an unbounded X , the distance ∥Pm(x0, ·) − π(·)∥TV will depend on x0 for any finite
m. Therefore, bounds on the distance are often sought via some inequality of the form:

∥Pm(x0, ·) − π(·)∥TV ≤ MV(x0) f (m),
(11)

for some M < ∞, where V : X → [1, ∞) depends on x0 and is called a drift function, and f : N → [0, ∞)
depends on the number of iterations, m (and is often defined, such that f (0) = 1).
A Markov chain is called geometrically ergodic if $f(m) = r^m$ in Equation (11) for some 0 < r < 1.
If in addition to this, V is bounded above, the chain is called uniformly ergodic. Intuitively, if either
condition holds, then the distribution of Xm will converge to π(·) geometrically quickly as m grows,
and in the uniform case, this rate is independent of x0. As well as providing some (often qualitative
if M and r are unknown) bounds on the convergence rate of a Markov chain, geometric ergodicity
implies that a central limit theorem exists for estimators of the form, ˆtm. For more detail on this,
see [13,14].
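
A two-state chain gives a case in which the bound in Equation (11) can be verified exactly. In the sketch below, the transition probabilities a and b are hypothetical; for this chain the total variation distance from x0 = 0 equals (a/(a + b))|1 − a − b|^m, so r = |1 − a − b|. Since the state space is finite, V is bounded and the chain is in fact uniformly ergodic.

```python
# Hypothetical two-state chain: P(0 -> 1) = a, P(1 -> 0) = b.
a, b = 0.3, 0.1
P = [[1.0 - a, a], [b, 1.0 - b]]
pi = [b / (a + b), a / (a + b)]  # invariant distribution


def tv(mu, nu):
    """Total variation distance on a finite space (half the L1 distance)."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))


def tv_path(P, pi, x0, n_steps):
    """Return ||P^m(x0, .) - pi||_TV for m = 0, ..., n_steps - 1."""
    k = len(pi)
    dist = [1.0 if i == x0 else 0.0 for i in range(k)]
    out = []
    for _ in range(n_steps):
        out.append(tv(dist, pi))
        dist = [sum(dist[i] * P[i][j] for i in range(k)) for j in range(k)]
    return out


distances = tv_path(P, pi, x0=0, n_steps=10)
r = abs(1.0 - a - b)
for m, d in enumerate(distances):
    print(m, d, pi[1] * r ** m)  # the two columns agree up to rounding
```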
In practice, several approximate methods also exist to assess whether a chain is close enough to
stationarity for long-run averages to provide suitable estimators (e.g., [15]). The MCMC practitioner
also uses a variety of visual aids to judge whether an estimate from the chain will be appropriate for
his or her needs.

2.2. Markov Chain Monte Carlo

Now that we have introduced Markov chains, we turn to simulating them. The objective here
is to devise a method for generating a Markov chain, which has a desired limiting distribution, π(·).
In addition, we would strive for the convergence rate to be as fast as possible and the effective sample
size to be suitably large relative to the number of iterations. Of course, the computational cost of
performing an iteration is also an important practical consideration. Ideally, any method would
also require limited problem-specific alterations, so that practitioners are able to use it with as little
knowledge of the inner workings as is practical.
Although other methods exist for constructing chains with a desired limiting distribution, a
popular choice is the Metropolis–Hastings algorithm [7]. At iteration i, a sample is drawn from some
candidate transition kernel, Q(xi−1, ·), and then either accepted or rejected (in which case, the state of
the chain remains xi−1). We focus here on the case where Q(xi−1, ·) admits a density, q(x′|xi−1), for all
xi−1 ∈ X (though, see [8]). In this case, a single step is shown below (the wedge notation a ∧ b denotes
the minimum of a and b). The “acceptance rate”, α(xi−1, x′), governs the behaviour of the chain, so
that, when it is close to one, then many proposed moves are accepted, and the current value in the
chain is constantly changing. If it is on average close to zero, then many proposals are rejected, so
that the chain will remain in the same place for many iterations. However, α ≈ 1 is typically not ideal,
often resulting in a large autocorrelation time (see below). The challenge in practice is to find the right
acceptance rate to balance these two extremes.

Algorithm 1 Metropolis–Hastings, single iteration.

Require: xi−1
Draw X′ ∼ Q(xi−1, ·)
Draw Z ∼ U[0, 1]
Set α(xi−1, x′) ← 1 ∧ [π(x′)q(xi−1|x′)] / [π(xi−1)q(x′|xi−1)]
if z < α(xi−1, x′) then
Set xi ← x′
else
Set xi ← xi−1
end if

Combining the “proposal” and “acceptance” steps, the transition kernel for the resulting Markov
chain is:


$$P(x, A) = r(x)\,\delta_x(A) + \int_A \alpha(x, x')\,q(x'|x)\,dx', \tag{12}$$

for any A ∈ B, where:

$$r(x) = 1 - \int_{\mathcal{X}} \alpha(x, x')\,q(x'|x)\,dx'$$

is the average probability that a draw from Q(x, ·) will be rejected, and δx(A) = 1 if x ∈ A and zero,
otherwise. A Markov chain defined in this way will have π(·) as an invariant distribution, since the
chain is reversible for π(·). We note here that:

π(xi−1)q(xi|xi−1)α(xi−1, xi) = π(xi−1)q(xi|xi−1) ∧ π(xi)q(xi−1|xi)

= α(xi, xi−1)q(xi−1|xi)π(xi)

in the case that the proposed move is accepted and that if the proposed move is rejected, then xi = xi−1;
so the chain is reversible for π(·). It can be shown that π(·) is also the limiting distribution for the
chain [7].
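
A single Metropolis–Hastings iteration, as in Algorithm 1, might be sketched as follows. The function names, the Gamma(3, 1) target and the log-normal proposal are assumptions made for this example; the asymmetric proposal is chosen so that the ratio q(xi−1|x′)/q(x′|xi−1) does not cancel. Computations are done on the log scale, which is a common numerical convenience rather than part of the algorithm.

```python
import math
import random


def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings iteration. log_q(a, b) is log q(a | b)."""
    x_new = propose(x, rng)
    log_ratio = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
    alpha = math.exp(min(0.0, log_ratio))  # alpha = 1 ^ ratio
    if rng.random() < alpha:
        return x_new, True
    return x, False


# Assumed target: Gamma(shape=3, rate=1) on x > 0, known up to a constant.
def log_pi(x):
    return 2.0 * math.log(x) - x


S = 0.8  # assumed proposal scale on the log axis


def propose(x, rng):
    """Multiplicative (log-normal) random walk: asymmetric in x."""
    return x * math.exp(S * rng.gauss(0.0, 1.0))


def log_q(x_to, x_from):
    """log q(x_to | x_from), up to a constant that cancels in the ratio."""
    return -math.log(x_to) - (math.log(x_to) - math.log(x_from)) ** 2 / (2.0 * S ** 2)


rng = random.Random(1)
x, chain, n_accept = 1.0, [], 0
for _ in range(20000):
    x, accepted = mh_step(x, log_pi, propose, log_q, rng)
    chain.append(x)
    n_accept += accepted

kept = chain[2000:]  # discard an (arbitrary) burn-in period
mean_estimate = sum(kept) / len(kept)
acceptance = n_accept / len(chain)
print(mean_estimate, acceptance)  # the Gamma(3, 1) mean is 3
```

Only ratios of π appear, so the normalising constant of the target is never needed.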
The convergence rate and autocorrelation time of a chain produced by the algorithm are dependent
on both the choice of proposal, Q(xi−1, ·), and the target distribution, π(·). For simple forms of the
latter, less consideration is required when choosing the former. A broad objective among researchers
in the field is to find classes of proposal kernels that produce chains that converge and mix quickly for
a large class of target distributions. We first review a simple choice before discussing one that is more
sophisticated and that will be the focus of the rest of the article.

2.3. Random Walk Proposals

An extremely simple choice for Q(x, ·) is one for which:

q(x′|x) = q(∥x′ − x∥)
(13)

where ∥ · ∥ denotes some appropriate norm on X , meaning the proposal is symmetric. In this case, the
acceptance rate reduces to:

$$\alpha(x, x') = 1 \wedge \frac{\pi(x')}{\pi(x)}. \tag{14}$$

In addition to simplifying calculations, Equation (14) strengthens the intuition for the method, since
proposed moves with higher density under π(·) will always be accepted. A typical choice for Q(x, ·)
is $\mathcal{N}(x, \lambda^2 \Sigma)$, where the matrix, Σ, is often chosen in an attempt to match the correlation structure
of π(·) or simply taken as the identity [16]. The tuning parameter, λ, is the only other user-specific
input required.
Much research has been conducted into properties of the random walk Metropolis algorithm
(RWM). It has been shown that the optimal acceptance rate for proposals tends to 0.234 as the
dimension, n, of the state space, X , tends to ∞ for a wide class of targets (e.g., [17,18]). The intuition
for an optimal acceptance rate is to find the right balance between the distance of proposed moves
and the chances of acceptance. Increasing the former will reduce the autocorrelation in the chain if the
proposal is accepted, but if it is rejected, the chain will not move at all, so autocorrelation will be high.
Random walk proposals are sometimes referred to as blind (e.g., [19]), as no information about π(·)
is used when generating proposals, so typically, very large moves will result in a very low chance of
acceptance, while small moves will be accepted, but result in very high autocorrelation for the chain.
Figure 1 demonstrates this in the simple case where π(·) is a one-dimensional $\mathcal{N}(0, 1^2)$ distribution.

Figure 1. These traceplots show the evolution of three RWM Markov chains for which π(·) is a $\mathcal{N}(0, 1^2)$
distribution, with different choices for λ.
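
The trade-off between step size and acceptance can be reproduced with a short simulation. The sketch below runs the RWM algorithm on a one-dimensional standard normal target; the three values of λ are assumptions chosen to be too small, moderate and too large, and are not necessarily those used for Figure 1.

```python
import math
import random


def rwm_chain(log_pi, x0, lam, n_iter, rng):
    """Random walk Metropolis with N(x, lam^2) proposals; returns chain and acceptance rate."""
    x, xs, n_accept = x0, [], 0
    for _ in range(n_iter):
        x_new = x + lam * rng.gauss(0.0, 1.0)
        # Symmetric proposal, so the acceptance rate is Equation (14).
        if rng.random() < math.exp(min(0.0, log_pi(x_new) - log_pi(x))):
            x, n_accept = x_new, n_accept + 1
        xs.append(x)
    return xs, n_accept / n_iter


def log_pi(x):
    """Standard normal target, up to an additive constant."""
    return -0.5 * x * x


rng = random.Random(2)
rates = {}
for lam in (0.1, 2.4, 50.0):  # assumed values: small, moderate, large
    _, rates[lam] = rwm_chain(log_pi, 0.0, lam, 5000, rng)
    print(lam, rates[lam])
```

Small λ gives an acceptance rate near one but tiny moves, while large λ gives long proposed moves that are almost always rejected; both produce highly autocorrelated chains.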

Several authors have also shown that for certain classes of π(·), the tuning parameter, λ, should
be chosen, such that $\lambda^2 \propto n^{-1}$, so that α ↛ 0 as n → ∞ [20]. Because of this, we say that algorithm
efficiency “scales” $O(n^{-1})$ as the dimension n of π(·) increases.
Ergodicity results for a Markov chain constructed using the RWM algorithm also exist [21–23].
At least exponentially light tails are a necessity for π(x) for geometric ergodicity, which means that
$\pi(x)/e^{-\|x\|} \to c$ as ∥x∥ → ∞, for some constant, c. For super-exponential tails (where π(x) → 0 at
a faster than exponential rate), additional conditions are required [21,23]. We demonstrate with
a simple example why heavy-tailed forms of π(x) pose difficulties here (where π(x) → 0 at a rate
slower than $e^{-\|x\|}$).

Example: Take $\pi(x) \propto 1/(1 + x^2)$, so that π(·) is a Cauchy distribution. Then, if $X' \sim \mathcal{N}(x, \lambda^2)$, the ratio
$\pi(x')/\pi(x) = (1 + x^2)/(1 + (x')^2) \to 1$ as |x| → ∞. Therefore, if x0 is far away from zero, the Markov
chain will dissolve into a random walk, with almost every proposal being accepted.
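
The limit in the example is easy to see numerically. In this sketch the proposed move is fixed at x′ = x + 2, an arbitrary illustrative choice, and only the current position x varies.

```python
def cauchy_ratio(x, x_new):
    """pi(x_new) / pi(x) for pi(x) proportional to 1 / (1 + x^2)."""
    return (1.0 + x * x) / (1.0 + x_new * x_new)


# The same outward move of size 2 (an assumed value) from different positions.
for x in (1.0, 10.0, 1000.0):
    print(x, cauchy_ratio(x, x + 2.0))
```

Near the mode, an outward move of this size is penalised heavily, while far in the tail the ratio is close to one, so the accept and reject step no longer steers the chain back towards the centre.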

It should be noted that starting the chain from at or near zero can also cause problems in the
above example, as the tails of the distribution may not be explored. See [7] for more detail here.
Ergodicity results for the RWM also exist for specific classes of the statistical model. Conditions for
geometric ergodicity in the case of generalised linear mixed models are given in [24], while spherically
constrained target densities are discussed in [25]. In [26], the authors provide necessary conditions
for the geometric convergence of RWM algorithms, which are related to the existence of exponential
moments for π(·) and P(x, ·). Weaker forms of ergodicity and corresponding conditions are also
discussed in the paper.
In the remainder of the article, we will primarily discuss another approach to choosing Q, which
has been shown empirically [1] and, in some cases, theoretically [20] to be superior to the RWM
algorithm, though it should be noted that random walk proposals are still widely used in practice and
are often sufficient for more straightforward problems [16].

3. Diffusions

In MCMC, we are concerned with discrete time processes. However, often, there are benefits
to first considering a continuous time process with the properties we desire. For example, some
continuous time processes can be specified via a form of differential equation. In this section, we
derive a choice for a Metropolis–Hastings proposal kernel based on approximations to diffusions,
those continuous-time n-dimensional Markov processes (Xt)t≥0 for which any sample path t ↦ Xt(ω)
is a continuous function with probability one. For any fixed t, we assume Xt is a random variable
taking values on the measurable space (X , B) as before. The motivation for this section is to define a
class of diffusions for which π(·) is the invariant distribution. First, we provide some preliminaries,
followed by an introduction to our main object of study, the Langevin diffusion.

3.1. Preliminaries

We focus on the class of time-homogeneous Itô diffusions, whose dynamics are governed by a
stochastic differential equation of the form:

dXt = b(Xt)dt + σ(Xt)dBt, X0 = x0,
(15)

where (Bt)t≥0 is a standard Brownian motion and the drift vector, b, and volatility matrix, σ, are
Lipschitz continuous [27]. Since E[Bt+△t − Bt|Bt = bt] = 0 for any △t ≥ 0, informally, we can see that:

E[Xt+△t − Xt|Xt = xt] = b(xt)△t + o(△t),
(16)

implying that the drift dictates how the mean of the process changes over a small time interval, and if
we define the process (Mt)t≥0 through the relation:

$$M_t = X_t - \int_0^t b(X_s)\,ds \tag{17}$$

then we have:

E[(Mt+△t − Mt)(Mt+△t − Mt)T|Mt = mt, Xt = xt] = σ(xt)σ(xt)T△t + o(△t),
(18)

giving the stochastic part of the relationship between Xt+△t and Xt for small enough △t; see, e.g., [28].
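
The informal statements in Equations (16) and (18) can be checked by simulation in one dimension. The sketch below approximates the increment over a short interval △t using several Euler–Maruyama sub-steps (the discretisation the article uses later for MALA); the drift b(x) = −x, the constant volatility and all numerical settings are assumptions for this example. The variance of the increment of Xt is reported, which agrees with Equation (18) to leading order in △t.

```python
import math
import random


def simulate_increment(x0, b, sigma, dt, n_sub, rng):
    """Approximate X_{t+dt} - X_t given X_t = x0 using n_sub Euler-Maruyama sub-steps."""
    h = dt / n_sub
    x = x0
    for _ in range(n_sub):
        x += b(x) * h + sigma(x) * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x - x0


def b(x):
    return -x  # assumed drift


def sigma(x):
    return 1.0  # assumed (constant) volatility


x0, dt = 2.0, 0.01
rng = random.Random(3)
incs = [simulate_increment(x0, b, sigma, dt, 10, rng) for _ in range(20000)]

mean_inc = sum(incs) / len(incs)
var_inc = sum((d - mean_inc) ** 2 for d in incs) / len(incs)
print(mean_inc, b(x0) * dt)        # Equation (16): mean is about b(x0) * dt
print(var_inc, sigma(x0) ** 2 * dt)  # Equation (18): variance is about sigma^2 * dt
```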
While Equation (15) is often a suitable description of an Itô diffusion, it can also be characterised
through an infinitesimal generator, A, which describes how functions of the process are expected to
evolve. We define this partial differential operator through its action on a function, f ∈ C0(X ), as:

$$\mathcal{A} f(X_t) = \lim_{\triangle t \to 0} \frac{E[f(X_{t+\triangle t}) | X_t = x_t] - f(x_t)}{\triangle t}, \tag{19}$$

though A can be associated with the drift and volatility of (Xt)t≥0 by the relation:

$$\mathcal{A} f(x) = \sum_i b_i(x) \frac{\partial f}{\partial x_i}(x) + \frac{1}{2} \sum_{i,j} V_{ij}(x) \frac{\partial^2 f}{\partial x_i \partial x_j}(x), \tag{20}$$

where Vij(x) denotes the component in row i and column j of σ(x)σ(x)T [27].
As in the discrete case, we can describe the transition kernel of a continuous time Markov process,
Pt(x0, ·). In the case of an Itô diffusion, Pt(x0, ·) admits a density, pt(x|x0), which, in fact, varies
smoothly as a function of t. The Fokker–Planck equation describes this variation in terms of the drift
and volatility and is given by:

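$$\frac{\partial}{\partial t} p_t(x|x_0) = -\sum_i \frac{\partial}{\partial x_i} [b_i(x)\,p_t(x|x_0)] + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} [V_{ij}(x)\,p_t(x|x_0)]. \tag{21}$$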


Although, typically, the form of Pt(x0, ·) is unknown, the expectation and variance of Xt ∼ Pt(x0, ·)
are given by the integral equations:

$$E[X_t | X_0 = x_0] = x_0 + E\left[\int_0^t b(X_s)\,ds\right],$$

$$E[(X_t - E[X_t])(X_t - E[X_t])^T | X_0 = x_0] = E\left[\int_0^t \sigma(X_s)\sigma(X_s)^T\,ds\right],$$

where the second of these is a result of the Itô isometry [27]. Continuing the analogy, a natural question
is whether a diffusion process has an invariant distribution, π(·), and whether:

$$\lim_{t \to \infty} P^t(x_0, A) = \pi(A) \tag{22}$$

for any A ∈ B and any x0 ∈ X , in some sense. For a large class of diffusions (which we confine
ourselves to), this is, in fact, the case. Specifically, in the case of positive Harris recurrent diffusions
with invariant distribution π(·), all compact sets must be small for some skeleton chain; see [29] for
details. In addition, Equation (21) provides a means of finding π(·), given b and σ. Setting the left-hand
side of Equation (21) to zero gives:

$$\sum_i \frac{\partial}{\partial x_i} [b_i(x)\pi(x)] = \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} [V_{ij}(x)\pi(x)], \tag{23}$$

which can be solved to find π(·).

3.2. Langevin Diffusions

Given Equation (23), our goal becomes clearer: find drift and volatility terms, so that the resulting
dynamics describe a diffusion, which converges to some user-defined invariant distribution, π(·). This
process can then be used as a basis for choosing Q in a Metropolis–Hastings algorithm. The Langevin
diffusion, first used to describe the dynamics of molecular systems [30], is such a process, given by the
solution to the stochastic differential equation:

$$dX_t = \frac{1}{2} \nabla \log \pi(X_t)\,dt + dB_t, \quad X_0 = x_0. \tag{24}$$

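Since $V_{ij}(x) = \mathbb{1}_{\{i=j\}}$, it is clear that

$$\frac{1}{2} \frac{\partial}{\partial x_i} [\log \pi(x)]\,\pi(x) = \frac{1}{2} \frac{\partial}{\partial x_i} \pi(x), \quad \forall i, \tag{25}$$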

which is a sufficient condition for Equation (23) to hold. Therefore, for any case in which π(x) is
suitably regular, so that ∇ log π(x) is well-defined and the derivatives in Equation (23) exist, we can
use (24) to construct a diffusion, which has invariant distribution, π(·).
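
A one-dimensional numerical check of Equation (23) for the Langevin drift can be sketched with central differences. The unnormalised target below is a hypothetical choice (Equation (23) is linear in π, so the normalising constant plays no role), and the step size h and evaluation points are arbitrary.

```python
import math


def pi_unnorm(x):
    """Assumed unnormalised target: exp(-x^4/4 - x^2/2)."""
    return math.exp(-0.25 * x ** 4 - 0.5 * x ** 2)


def grad_log_pi(x):
    return -x ** 3 - x


def drift(x):
    """Langevin drift from Equation (24): b(x) = (1/2) d/dx log pi(x)."""
    return 0.5 * grad_log_pi(x)


def stationarity_residual(x, h=1e-3):
    """d/dx[b(x) pi(x)] - (1/2) d^2/dx^2[pi(x)] by central differences (V = 1)."""
    lhs = (drift(x + h) * pi_unnorm(x + h) - drift(x - h) * pi_unnorm(x - h)) / (2.0 * h)
    rhs = 0.5 * (pi_unnorm(x + h) - 2.0 * pi_unnorm(x) + pi_unnorm(x - h)) / h ** 2
    return lhs - rhs


residuals = [stationarity_residual(x) for x in (-1.5, -0.3, 0.0, 0.7, 2.0)]
print(residuals)  # all close to zero, up to discretisation error
```

The residual vanishes because b(x)π(x) equals one half of the derivative of π(x), which is Equation (25) in one dimension; differentiating once more gives Equation (23).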
Roberts and Tweedie [31] give sufficient conditions on π(·) under which a diffusion, (Xt)t≥0,
with dynamics given by Equation (24), will be ergodic, meaning:

∥Pt(x0, ·) − π(·)∥TV → 0
(26)

as t → ∞, for any x0 ∈ X .

3.3. Metropolis-Adjusted Langevin Algorithm

We can use Langevin diffusions as a basis for MCMC in many ways, but a popular
variant is known as the Metropolis-adjusted Langevin algorithm (MALA), whereby Q(x, ·) is
constructed through an Euler–Maruyama discretisation of (24) and used as a candidate kernel in
a Metropolis–Hastings algorithm. The resulting Q is:

$$Q(x, \cdot) \equiv \mathcal{N}\left(x + \frac{\lambda^2}{2} \nabla \log \pi(x),\; \lambda^2 I\right), \tag{27}$$

where λ is again a tuning parameter.
Before we discuss the theoretical properties of the approach, we first offer an intuition for the
dynamics. From Equation (27), it can be seen that Langevin-type proposals comprise a deterministic
shift towards a local mode of π(x), combined with some random additive Gaussian noise, with
variance $\lambda^2$ for each component. The relative weights of the deterministic and random parts are fixed,
given as they are by the parameter, λ. Typically, if $\lambda^{1/2} \gg \lambda$, then the random part of the proposal will
dominate and vice versa in the opposite case, though this also depends on the form of ∇ log π(x) [31].
Again, since this is a Metropolis–Hastings method, choosing λ is a balance between proposing
large enough jumps and ensuring that a reasonable proportion are accepted. It has been shown that in
the limit, as n → ∞, the optimal acceptance rate for the algorithm is 0.574 [20] for forms of π(·), which
either have independent and identically distributed components or whose components only differ
by some scaling factor [20]. In these cases, as n → ∞, the parameter, λ, must be chosen such that
$\lambda^2 \propto n^{-1/3}$, so we say the algorithm efficiency scales $O(n^{-1/3})$. Note that these results compare
favourably with the $O(n^{-1})$ scaling of the random walk algorithm.
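
A one-dimensional MALA iteration, combining the proposal of Equation (27) with the Metropolis–Hastings acceptance step, might look as follows. The function name, the standard normal target, the starting point and the choice λ = 1 are all assumptions for this sketch. Note that the proposal is not symmetric, so both proposal densities enter the acceptance ratio.

```python
import math
import random


def mala_step(x, log_pi, grad_log_pi, lam, rng):
    """One MALA iteration in one dimension, using the proposal of Equation (27)."""
    def mean(z):
        return z + 0.5 * lam ** 2 * grad_log_pi(z)

    def log_q(z_to, z_from):
        # log N(z_to; mean(z_from), lam^2), up to a constant that cancels
        return -((z_to - mean(z_from)) ** 2) / (2.0 * lam ** 2)

    x_new = mean(x) + lam * rng.gauss(0.0, 1.0)
    log_ratio = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return x_new, True
    return x, False


def log_pi(x):
    return -0.5 * x * x  # assumed target: standard normal


def grad_log_pi(x):
    return -x


rng = random.Random(4)
x, chain, n_accept = 3.0, [], 0
for _ in range(20000):
    x, accepted = mala_step(x, log_pi, grad_log_pi, 1.0, rng)
    chain.append(x)
    n_accept += accepted

kept = chain[1000:]  # discard an (arbitrary) burn-in period
sample_mean = sum(kept) / len(kept)
sample_var = sum((v - sample_mean) ** 2 for v in kept) / len(kept)
acceptance = n_accept / len(chain)
print(sample_mean, sample_var, acceptance)
```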
Convergence properties of the method have also been established. Roberts and Tweedie [31]
highlight some cases in which MALA is either geometrically ergodic or not. Typically, results are
based on the tail behaviour of π(x). If these tails are heavier than exponential, then the method is
typically not geometrically ergodic, and similarly if the tails are lighter than Gaussian. However, in the
in-between case, the method typically is geometrically ergodic. We again offer two simple examples for intuition here.

Example: Take $\pi(x) \propto 1/(1 + x^2)$ as in the previous example. Then, $\nabla \log \pi(x) = -2x/(1 + x^2) \to 0$
as |x| → ∞. Therefore, if x0 is far away from zero, then the MALA will be approximately equal to the RWM
algorithm and, so, will also dissolve into a random walk.

Example: Take $\pi(x) \propto e^{-x^4}$. Then, $\nabla \log \pi(x) = -4x^3$ and, from Equation (27), $X' \sim \mathcal{N}(x - 2\lambda^2 x^3, \lambda^2)$. Therefore, for any fixed
λ, there exists c > 0, such that, for |x0| > c, we have $|2\lambda^2 x^3| \gg |x|$ and $|x - 2\lambda^2 x^3| \gg \lambda$, suggesting that
MALA proposals will quickly spiral further and further away from any neighbourhood of zero, and hence, nearly
all will be rejected.
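
Both tail regimes can be illustrated by evaluating the proposal mean of Equation (27) at a few points. The sketch below uses λ = 1 and the positions 1, 5 and 50, all arbitrary choices; for the light-tailed target it also evaluates the acceptance probability of a proposal made exactly at the proposal mean, as a representative case.

```python
import math


def mala_mean(x, grad_log_pi, lam):
    """Proposal mean from Equation (27): x + (lam^2 / 2) * grad log pi(x)."""
    return x + 0.5 * lam ** 2 * grad_log_pi(x)


def cauchy_grad(x):
    return -2.0 * x / (1.0 + x * x)  # pi(x) proportional to 1 / (1 + x^2)


def quartic_grad(x):
    return -4.0 * x ** 3  # pi(x) proportional to exp(-x^4)


def quartic_accept_at_mean(x, lam):
    """Acceptance probability when the proposal equals its own mean (quartic target)."""
    x_new = mala_mean(x, quartic_grad, lam)
    log_q_forward = 0.0  # the proposal sits at the centre of q(. | x)
    log_q_reverse = -((x - mala_mean(x_new, quartic_grad, lam)) ** 2) / (2.0 * lam ** 2)
    log_ratio = (-x_new ** 4 + log_q_reverse) - (-x ** 4 + log_q_forward)
    return math.exp(min(0.0, log_ratio))


lam = 1.0  # assumed tuning parameter
for x in (1.0, 5.0, 50.0):
    print(x,
          mala_mean(x, cauchy_grad, lam) - x,  # heavy tails: the drift vanishes
          mala_mean(x, quartic_grad, lam),     # light tails: the drift overshoots
          quartic_accept_at_mean(x, lam))
```

In the heavy-tailed case the deterministic shift becomes negligible relative to the noise, while in the light-tailed case the proposal mean lands far on the other side of zero and is accepted with negligible probability.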

For cases where there is a strong correlation between elements of x or each element has a different
marginal variance, the MALA can also be “pre-conditioned” in a similar way to the RWM, so that the
covariance structure of proposals more accurately reflects that of π(x) [32]. In this case, proposals take
the form:

$$Q(x, \cdot) \equiv \mathcal{N}\left(x + \frac{\lambda^2}{2} \Sigma \nabla \log \pi(x),\; \lambda^2 \Sigma\right), \tag{28}$$

where λ is again a tuning parameter. It can be shown that provided Σ is a constant matrix, π(x) is still
the invariant distribution for the diffusion on which Equation (28) is based [33].
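
A draw from the pre-conditioned proposal in Equation (28) can be sketched in two dimensions using a Cholesky factor of Σ to generate the correlated noise. Everything specific here is an assumption for illustration: the matrix `SIGMA`, the Gaussian target with that same covariance, the current point and λ = 0.5. Only the proposal is shown; the accept and reject step is as in Algorithm 1.

```python
import math
import random

SIGMA = [[4.0, 1.8], [1.8, 1.0]]  # assumed constant pre-conditioning matrix


def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]


def cholesky_2x2(S):
    """Lower-triangular L with L L^T = S, for a 2x2 positive-definite S."""
    l11 = math.sqrt(S[0][0])
    l21 = S[1][0] / l11
    l22 = math.sqrt(S[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]


def grad_log_pi(x):
    """Gradient of log N(0, SIGMA), i.e. -SIGMA^{-1} x (assumed target)."""
    det = SIGMA[0][0] * SIGMA[1][1] - SIGMA[0][1] * SIGMA[1][0]
    inv = [[SIGMA[1][1] / det, -SIGMA[0][1] / det],
           [-SIGMA[1][0] / det, SIGMA[0][0] / det]]
    return [-g for g in matvec(inv, x)]


def preconditioned_proposal(x, lam, rng):
    """Draw from N(x + (lam^2/2) SIGMA grad log pi(x), lam^2 SIGMA)."""
    drift = matvec(SIGMA, grad_log_pi(x))
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    noise = matvec(cholesky_2x2(SIGMA), z)  # covariance SIGMA
    return [x[i] + 0.5 * lam ** 2 * drift[i] + lam * noise[i] for i in range(2)]


rng = random.Random(5)
x, lam = [2.0, -1.0], 0.5
draws = [preconditioned_proposal(x, lam, rng) for _ in range(20000)]
mean = [sum(d[i] for d in draws) / len(draws) for i in range(2)]
cov01 = sum((d[0] - mean[0]) * (d[1] - mean[1]) for d in draws) / len(draws)
print(mean, cov01)
```

With this particular target, Σ∇ log π(x) = −x, so the proposal mean is (1 − λ²/2)x and the proposal covariance is λ²Σ, which the empirical moments should reproduce.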

4. Geometric Concepts in Markov Chain Monte Carlo

Ideas from information geometry have been successfully applied to statistics from as early as [34].
More widely, other geometric ideas have also been applied, offering new insight into common problems
(e.g., [35,36]). A survey is given in [37]. In this section, we suggest why some ideas from differential
geometry may be beneficial for sampling methods based on Markov chains. We then review what is
meant by a “diffusion on a manifold”, before turning to the specific case of Equation (24). After this,
we discuss what can be learned from work in information geometry in this context.

4.1. Manifolds and Markov Chains

We often make assumptions in MCMC about the properties of the space, X , in which our
Markov chains evolve. Often X = Rn or a simple re-parametrisation would make it so. However,
here, Rn = {(a1, ..., an) : ai ∈ (−∞, ∞) ∀i}. The additional assumption that is often made is that Rn is
Euclidean, an inner product space with the induced distance metric:

$$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}. \tag{29}$$

For sampling methods based on Markov chains that explore the space locally, like the RWM and
MALA, it may be advantageous to instead impose a different metric structure on the space, X , so that
some points are drawn closer together and others pushed further apart. Intuitively, one can picture
distances in the space being defined, such that if the current position in the chain is far from an area
of X , which is “likely to occur” under π(·), then the distance to such a typical set could be reduced.
Similarly, once this region is reached, the space could be “stretched” or “warped”, so that it is explored
as efficiently as possible.
While the idea is attractive, it is far from a constructive definition. We only have the pre-requisite
that (X , d) must be a metric space. However, as Langevin dynamics use gradient information, we will
require (X , d) to be a space on which we can do differential calculus. Riemannian manifolds are an
appropriate choice, therefore, as the rules of differentiation are well understood for functions defined
on them [38,39], while we are still free to define a more local notion of distance than Euclidean. In this
section, we write Rn to denote the Euclidean vector space.

4.2. Preliminaries

We do not provide a full overview of Riemannian geometry here (see [38–40]). We simply note that
for our purposes, we can consider an n-dimensional Riemannian manifold (henceforth, manifold)
to be an n-dimensional metric space, in which distances are defined in a specific way. We also only
consider manifolds for which a global coordinate chart exists, meaning that a mapping r : Rn → M
exists, which is both differentiable and invertible and for which the inverse is also differentiable (a
diffeomorphism). Although this restricts the class of manifolds available (the sphere, for example, is
not in this class), it is again suitable for our needs and avoids the practical challenges of switching
between coordinate patches. The connection with Rn defined through r is crucial for making sense of
differentiability in M. We say a function f : M → R is “differentiable” if ( f ◦ r) : Rn → R is [39].
As has been stated, Equation (29) can be induced via a Euclidean inner product, which we denote
⟨·, ·⟩. However, it will aid intuition to think of distances in Rn via curves:

γ : [0, 1] → Rn.
(30)

We could think of the distance between two points x, y ∈ Rn as the minimum length among all
curves that pass through x and y. If γ(0) = x and γ(1) = y, the length is defined as:

$$L(\gamma) = \int_0^1 \sqrt{\langle \gamma'(t), \gamma'(t) \rangle}\,dt, \tag{31}$$

giving the metric:
d(x, y) = inf {L(γ) : γ(0) = x, γ(1) = y} .
(32)

In Rn, the curve with a minimum length will be a straight line, so that Equation (32) agrees with
Equation (29). More generally, we call a solution to Equation (32) a geodesic [38].
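
The claim that the straight line minimises Equation (31) in Rn can be illustrated by approximating the length of a few curves with the same endpoints. In this sketch the integral is approximated by a sum of chord lengths on a uniform grid, and the three curves from (0, 0) to (1, 1) are hypothetical examples.

```python
import math


def curve_length(gamma, n=10000):
    """Approximate L(gamma) in Equation (31) by summing n chord lengths on [0, 1]."""
    total = 0.0
    prev = gamma(0.0)
    for k in range(1, n + 1):
        cur = gamma(k / n)
        total += math.sqrt(sum((c - p) ** 2 for c, p in zip(cur, prev)))
        prev = cur
    return total


# Three assumed curves from x = (0, 0) to y = (1, 1).
def line(t):
    return (t, t)


def parabola(t):
    return (t, t * t)


def quarter_circle(t):
    theta = 0.5 * math.pi * t
    return (1.0 - math.cos(theta), math.sin(theta))  # unit circle centred at (1, 0)


for name, gamma in (("line", line), ("parabola", parabola), ("arc", quarter_circle)):
    print(name, curve_length(gamma))
print("Equation (29):", math.sqrt(2.0))
```

Comparing three curves does not prove the infimum in Equation (32), but it shows the straight line attaining the Euclidean distance of Equation (29) while the other curves are longer.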


In a vector space, metric properties can always be induced through an inner product (which also
gives a notion of orthogonality). Such a space can be thought of as “flat”, since for any two points, y
and z, the straight line ay + (1 − a)z, a ∈ [0, 1] is also contained in the space. In general, manifolds do
not have vector space structure globally, but do so at the infinitesimal level. As such, we can think
of them as “curved”. We cannot always define an inner product, but we can still define distances
through (32). We define a curve on a manifold, M, as γM : [0, 1] → M. At each point γM(t) = p ∈ M,
the velocity vector, γ′M(t), lies in an n-dimensional vector space, which touches M at p. These are
known as tangent spaces, denoted TpM, which can be thought of as local linear approximations to M.
We can define an inner product on each as gp : TpM × TpM → R, which allows us to define a generalisation
of (31) as:

$$L(\gamma_M) = \int_0^1 \sqrt{g_p(\gamma_M'(t), \gamma_M'(t))} \, dt,$$
(33)

and provides a means to define a distance metric on the manifold as d(x, y) =
inf {L(γM) : γM(0) = x, γM(1) = y}. We emphasise the difference between this distance metric on M
and gp, which is called a Riemannian metric or metric tensor and which defines an inner product on
TpM.

Embeddings and Local Coordinates

So far, we have introduced manifolds as abstract objects. In fact, they can also be considered as
objects that are embedded in some higher-dimensional Euclidean space. A simple example is any
two-dimensional surface, such as the unit sphere, lying in R3. If a manifold is embedded in this way,
then metric properties can be induced from the ambient Euclidean space.
We seek to make these ideas more concrete through an example, the graph of a function, f (x1, x2),
of two variables, x1 and x2. The resulting map, r, is:

r : R2 → M
(34)

r(x1, x2) = (x1, x2, f (x1, x2)).
(35)

We can see that M is embedded in R3, but that any point can be identified using only two coordinates,
x1 and x2. In this case, each TpM is a plane, and therefore, a two-dimensional subspace of R3, so: (i) it
inherits the Euclidean inner product, ⟨·, ·⟩; and (ii) any vector, v ∈ TpM, can be expressed as a linear
combination of any two linearly independent basis vectors (a canonical choice is the partial derivatives
∂r/∂x1 := r1 and r2, evaluated at x = r−1(p) ∈ R2). The resulting inner product, gp(v, w), between
two vectors, v, w ∈ TpM, can be induced from the Euclidean inner product as:

⟨v, w⟩ = ⟨v1r1(x) + v2r2(x), w1r1(x) + w2r2(x)⟩,

= v1w1⟨r1(x), r1(x)⟩ + v1w2⟨r1(x), r2(x)⟩ + v2w1⟨r2(x), r1(x)⟩ + v2w2⟨r2(x), r2(x)⟩,

= vTG(x)w,

where:

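$$G(x) = \begin{pmatrix} \langle r_1(x), r_1(x) \rangle & \langle r_1(x), r_2(x) \rangle \\ \langle r_1(x), r_2(x) \rangle & \langle r_2(x), r_2(x) \rangle \end{pmatrix}$$
(36)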

and we use vi, wi to denote the components of v and w. To write (31) using this notation, we define the
curve, x(t) ∈ R2, corresponding to γM(t) ∈ M as x = (r−1 ◦ γM) : [0, 1] → R2. Equation (31) can then
be written:

$$L(\gamma_M) = \int_0^1 \sqrt{x'(t)^T G(x(t)) \, x'(t)} \, dt,$$
(37)

which can be used in (32) as before.
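The construction of G(x) in (36) and the length formula (37) can be checked numerically. The sketch below, assuming NumPy, uses the surface r(x1, x2) = (x1, x2, sin(x1) + 1) and the local-coordinate curve x(t) = (t, t); restricting t to [0, 1] and all function names are assumptions made for illustration. The length computed from G(x) in local coordinates is compared with the length of the same curve measured directly in R3.

```python
import numpy as np

def r(x):
    """Embedding r(x1, x2) = (x1, x2, sin(x1) + 1)."""
    return np.array([x[0], x[1], np.sin(x[0]) + 1.0])

def jacobian(x):
    """Columns are the tangent vectors r1 = dr/dx1 and r2 = dr/dx2."""
    return np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [np.cos(x[0]), 0.0]])

def G(x):
    """Induced metric of (36): G_ij = <r_i(x), r_j(x)>."""
    J = jacobian(x)
    return J.T @ J

def length_local(x_curve, x_velocity, n_steps=20_000):
    """Midpoint-rule approximation of (37), using only local coordinates."""
    h = 1.0 / n_steps
    total = 0.0
    for s in (np.arange(n_steps) + 0.5) * h:
        v = x_velocity(s)
        total += np.sqrt(v @ G(x_curve(s)) @ v) * h
    return total

def length_ambient(x_curve, n_steps=20_000):
    """Chord-length approximation of the same curve, measured in R^3."""
    t = np.linspace(0.0, 1.0, n_steps + 1)
    pts = np.array([r(x_curve(s)) for s in t])
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

x_curve = lambda t: np.array([t, t])          # x(t) = (t, t)
x_velocity = lambda t: np.array([1.0, 1.0])   # x'(t)

L_local = length_local(x_curve, x_velocity)
L_ambient = length_ambient(x_curve)
```

The two lengths should agree up to discretisation error, illustrating that the embedding is only needed to derive G(x), not to evaluate lengths.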


The key point is that, although we have started with an object embedded in R3, we can compute
the Riemannian metric, gp(v, w) (and, hence, distances in M), using only the two-dimensional “local”
coordinates (x1, x2). We also need not have explicit knowledge of the mapping, r, only the components
of the positive definite matrix, G(x). The Nash embedding theorem [41] in essence enables us to define
manifolds by the reverse process: simply choose the matrix, G(x), so that we define a metric space
with suitable distance properties, and some object embedded in some higher-dimensional Euclidean
space will exist for which these metric properties can be induced as above. Therefore, to define our new
space, we simply choose an appropriate matrix-valued map, G(x) (we discuss this choice in Section
4.4). If G(x) does not depend on x, then M has a vector space structure and can be thought of as “flat”.
Trivially, G(x) = I gives Euclidean n-space.
We can also define volumes on a Riemannian manifold in local coordinates. Following standard
coordinate transformation rules, we can see that for the above example, the area element, dx, in R2
will change according to a Jacobian J = |(Dr)^T (Dr)|^{1/2}, where Dr = ∂(p1, p2, p3)/∂(x1, x2). This
reduces to J = |G(x)|^{1/2}, which is also the case for more general manifolds [38]. We therefore define
the Riemannian volume measure on a manifold, M, in local coordinates as:

$$\mathrm{Vol}_M(dx) = |G(x)|^{1/2} \, dx.$$
(38)

If G(x) = I, then this reduces to the Lebesgue measure.
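As a small illustration of (38), the sketch below (assuming NumPy) integrates the volume density |G(x)|^{1/2} for the surface r(x1, x2) = (x1, x2, sin(x1) + 1) over a coordinate rectangle. The rectangle [0, π] × [0, 1], the grid size and the function names are arbitrary choices made for this example.

```python
import numpy as np

def G(x):
    # Metric induced by r(x1, x2) = (x1, x2, sin(x1) + 1)
    c = np.cos(x[0])
    return np.array([[1.0 + c * c, 0.0],
                     [0.0, 1.0]])

def volume_density(x):
    """|G(x)|^{1/2}: density of the Riemannian volume measure w.r.t. Lebesgue measure."""
    return np.sqrt(np.linalg.det(G(x)))

# Midpoint rule over the coordinate patch [0, pi] x [0, 1].
n = 200
x1 = (np.arange(n) + 0.5) * np.pi / n
x2 = (np.arange(n) + 0.5) / n
cell = (np.pi / n) * (1.0 / n)
surface_area = sum(volume_density((a, b)) for a in x1 for b in x2) * cell
flat_area = np.pi * 1.0   # Lebesgue measure of the same coordinate patch
```

Because |G(x)|^{1/2} ≥ 1 for this surface, the area measured on the manifold exceeds the Lebesgue area of the coordinate patch.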

4.3. Diffusions on Manifolds

By a “diffusion on a manifold” in local coordinates, we actually mean a diffusion defined on
Euclidean space. For example, a realisation of Brownian motion on the surface, S ⊂ R3, defined
in Figure 2 through r(x1, x2) = (x1, x2, sin(x1) + 1) will be a sample path, which is defined on S
and “looks locally” like Brownian motion in a neighbourhood of any point, p ∈ S. However, the
pre-image of this sample path (through r−1) will not be a realisation of a Brownian motion defined on
R2, owing to the nonlinearity of the mapping. Therefore, to define “Brownian motion on S”, we define
some diffusion (Xt)t≥0 that takes values in R2, for which the process (r(Xt))t≥0 “looks locally” like a
Brownian motion (and lies on S). See [42] for more intuition here.
Our goal, therefore, is to define a diffusion on Euclidean space, which, when mapped onto a
manifold through r, becomes the Langevin diffusion described in (24) by the above procedure. Such a
diffusion takes the form:

$$dX_t = \frac{1}{2} \tilde{\nabla} \log \tilde{\pi}(X_t) \, dt + d\tilde{B}_t,$$
(39)

where those objects marked with a tilde must be defined appropriately. The next few paragraphs
are technical, and readers aiming to simply grasp the key points may wish to skip to the end of
this Subsection.
We turn first to $(\tilde{B}_t)_{t \geq 0}$, which we use to denote Brownian motion on a manifold. Intuitively,
we may think of a construction based on embedded manifolds, by setting $\tilde{B}_0 = p \in M$, and for
each increment sampling some random vector in the tangent space TpM, and then moving along the
manifold in the prescribed direction for an infinitesimal period of time before re-sampling another
velocity vector from the next tangent space [42]. In fact, we can define such a construction using
Stratonovich calculus and show that the infinitesimal generator can be written using only local
coordinates [28]. Here, we instead take the approach of generalising the generator directly from
Euclidean space to the local coordinates of a manifold, arriving at the same result. We then deduce the
stochastic differential equation describing $(\tilde{B}_t)_{t \geq 0}$ in Itô form using (20).
For a standard Brownian motion on Rn, A = △/2, where △ denotes the Laplace operator:

$$\triangle f = \sum_i \frac{\partial^2 f}{\partial x_i^2} = \mathrm{div}(\nabla f).$$
(40)


Substituting A = △/2 into (20) trivially gives bi(x) = 0 ∀i, Vij(x) = 1{i=j}, as required. The Laplacian,
△ f (x), is the divergence of the gradient vector field of some function, f ∈ C2(Rn), and its value at
x ∈ Rn can be thought of as measuring how far the average value of f in some small neighbourhood of x lies from f (x) [43].

Figure 2. A two-dimensional manifold (surface) embedded in R3 through r(x1, x2) = (x1, x2, sin(x1) +
1), parametrised by the local coordinates, x1 and x2. The distance between points A and B is given by
the length of the curve γ(t) = (t, t, sin(t) + 1).

To define a Brownian motion on any manifold, the gradient and divergence must be generalised.
We provide a full derivation in Appendix B, which shows that the gradient operator on a manifold can
be written in local coordinates as ∇M = G−1(x)∇. Combining with the operator, divM, we can define
a generalisation of the Laplace operator, known as the Laplace–Beltrami operator (e.g., [44,45]), as:

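$$\triangle_{LB} f = \mathrm{div}_M(\nabla_M f) = |G(x)|^{-1/2} \sum_{i=1}^n \frac{\partial}{\partial x_i} \left( |G(x)|^{1/2} \sum_{j=1}^n \{G^{-1}(x)\}_{ij} \frac{\partial f}{\partial x_j} \right),$$
(41)

for some $f \in C_0^2(M)$.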
The generator of a Brownian motion on M is △LB/2 [44]. Using (20), the resulting diffusion has
dynamics given by:

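$$d\tilde{B}_t = \Omega(X_t) \, dt + \sqrt{G^{-1}(X_t)} \, dB_t,$$

$$\Omega_i(X_t) = \frac{1}{2} |G(X_t)|^{-1/2} \sum_{j=1}^n \frac{\partial}{\partial x_j} \left( |G(X_t)|^{1/2} \{G^{-1}(X_t)\}_{ij} \right).$$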

Those familiar with the Itô formula will not be surprised by the additional drift term, Ω(Xt). As Itô
integrals do not follow the chain rule of ordinary calculus, non-linear mappings of martingales, such
as (Bt)t≥0, typically result in drift terms being added to the dynamics (e.g., [27]).
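To make the drift term concrete, the following sketch (assuming NumPy) simulates an Euler–Maruyama discretisation of the dynamics above in the local coordinates of the surface r(x1, x2) = (x1, x2, sin(x1) + 1). For this surface |G(x)| = 1 + cos²(x1), and working through the formula for Ω by hand gives Ω1(x) = sin(2x1)/(4|G(x)|²) and Ω2(x) = 0; this hand derivation, the step size and the function names are assumptions of the example rather than part of the original text.

```python
import numpy as np

def a(x1):
    # |G(x)| for the surface r(x1, x2) = (x1, x2, sin(x1) + 1)
    return 1.0 + np.cos(x1) ** 2

def omega(x):
    """Drift Omega(x), obtained by differentiating |G|^{1/2} G^{-1} by hand."""
    return np.array([np.sin(2.0 * x[0]) / (4.0 * a(x[0]) ** 2), 0.0])

def sqrt_G_inv(x):
    """Matrix square root of G^{-1}(x); G is diagonal here, so it is elementwise."""
    return np.diag([1.0 / np.sqrt(a(x[0])), 1.0])

def simulate(x0, n_steps, h, rng):
    """Euler-Maruyama scheme for dX = Omega(X) dt + sqrt(G^{-1}(X)) dB."""
    path = np.empty((n_steps + 1, 2))
    path[0] = x0
    for k in range(n_steps):
        x = path[k]
        noise = rng.standard_normal(2)
        path[k + 1] = x + omega(x) * h + np.sqrt(h) * sqrt_G_inv(x) @ noise
    return path

rng = np.random.default_rng(0)
local_path = simulate(np.zeros(2), n_steps=1000, h=1e-3, rng=rng)

# Map the local-coordinate path onto the surface S in R^3 through r.
surface_path = np.column_stack([local_path[:, 0],
                                local_path[:, 1],
                                np.sin(local_path[:, 0]) + 1.0])
```

The path in local coordinates is not a Brownian motion on R2, but its image under r lies on S by construction.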
To define $\tilde{\nabla}$, we simply note that this is again the gradient operator on a general manifold, so
$\tilde{\nabla} = G^{-1}(x)\nabla$. For the density, $\tilde{\pi}(x)$, we note that this density will now implicitly be defined with
respect to the volume measure, $|G(x)|^{1/2} dx$, on the manifold. Therefore, to ensure the diffusion (39) has
the correct invariant density with respect to the Lebesgue measure, we define:

$$\tilde{\pi}(x) = \pi(x) |G(x)|^{-1/2}.$$
(42)

Putting these three elements together, Equation (39) becomes:


$$dX_t = \frac{1}{2} G^{-1}(X_t) \nabla \log \left( \pi(X_t) |G(X_t)|^{-1/2} \right) dt + \Omega(X_t) \, dt + \sqrt{G^{-1}(X_t)} \, dB_t,$$

which, upon simplification, becomes:

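$$dX_t = \frac{1}{2} G^{-1}(X_t) \nabla \log \pi(X_t) \, dt + \Lambda(X_t) \, dt + \sqrt{G^{-1}(X_t)} \, dB_t,$$
(43)

$$\Lambda_i(X_t) = \frac{1}{2} \sum_j \frac{\partial}{\partial x_j} \{G^{-1}(X_t)\}_{ij}.$$
(44)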

It can be shown that this diffusion has invariant Lebesgue density π(x), as required [33]. Intuitively,
when a set is mapped onto the manifold, distances are changed by a factor, $\sqrt{G(x)}$. Therefore, to end
up with the initial distances, they must first be changed by a factor of $\sqrt{G^{-1}(x)}$ before the mapping,
which explains the volatility term in Equation (43).
The resulting Metropolis–Hastings proposal kernel for this “MALA on a manifold” was clarified
in [33] and is given by:

$$Q(x, \cdot) \equiv \mathcal{N}\left( x + \frac{\lambda^2}{2} G^{-1}(x) \nabla \log \pi(x) + \lambda^2 \Lambda(x), \; \lambda^2 G^{-1}(x) \right),$$
(45)

where λ² is a tuning parameter. The nonlinear drift term here is slightly different to that reported
in [1,32], for reasons discussed in [33].
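A hedged sketch of one Metropolis–Hastings step with the proposal (45) is given below, assuming NumPy. The function names, the one-dimensional standard normal target and the metric G(x) = 1 + x² are illustrative assumptions chosen so that Λ(x) = −x/(1 + x²)² is available in closed form; they are not taken from [33].

```python
import numpy as np

def log_gauss(z, mean, cov):
    """Log density of N(mean, cov) at z, dropping the constant -(n/2) log(2 pi)."""
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(cov, diff)

def proposal_moments(x, lam, grad_log_pi, G, Lam):
    """Mean and covariance of the proposal Q(x, .) in Equation (45)."""
    G_inv = np.linalg.inv(G(x))
    mean = x + 0.5 * lam**2 * G_inv @ grad_log_pi(x) + lam**2 * Lam(x)
    return mean, lam**2 * G_inv

def mmala_step(x, lam, log_pi, grad_log_pi, G, Lam, rng):
    """One Metropolis-Hastings step using the position-dependent proposal (45)."""
    mean_x, cov_x = proposal_moments(x, lam, grad_log_pi, G, Lam)
    y = mean_x + np.linalg.cholesky(cov_x) @ rng.standard_normal(len(x))
    mean_y, cov_y = proposal_moments(y, lam, grad_log_pi, G, Lam)
    # Full Hastings ratio: the proposal density is not symmetric in (x, y).
    log_ratio = (log_pi(y) + log_gauss(x, mean_y, cov_y)
                 - log_pi(x) - log_gauss(y, mean_x, cov_x))
    return y if np.log(rng.uniform()) < log_ratio else x

# Illustrative one-dimensional setup (all choices below are assumptions).
log_pi = lambda x: -0.5 * x @ x                               # standard normal target
grad_log_pi = lambda x: -x
G = lambda x: np.array([[1.0 + x[0] ** 2]])                   # hypothetical metric
Lam = lambda x: np.array([-x[0] / (1.0 + x[0] ** 2) ** 2])    # (1/2) d/dx G^{-1}(x)

rng = np.random.default_rng(0)
x = np.zeros(1)
samples = np.empty(5000)
for k in range(5000):
    x = mmala_step(x, 1.0, log_pi, grad_log_pi, G, Lam, rng)
    samples[k] = x[0]
```

Because the accept/reject step uses the full Hastings ratio, the chain targets π(x) for any positive-definite choice of G(x); the metric only affects efficiency.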

4.4. Choosing a Metric

We now turn to the question of which metric structure to put on the manifold, or equivalently,
how to choose G(x). In this section, we sometimes switch notation slightly, denoting the target density,
π(x|y), as some of the discussion is directed towards Bayesian inference, where π(·) is the posterior
distribution for some parameter, x, after observing some data, y. The problem statement is: what is an
appropriate choice of distance between points in the sample space of a given probability distribution?
A related (but distinct) question is how to define a distance between two probability distributions
from the same parametric family, but with different parameters. This has been a key theme in
information geometry, explored by Rao [46] and others [2] for many years.
Although generic
measures of distance between distributions (such as total variation) are often appropriate, based on
information-theoretic principles, one can deduce that for a given parametric family, {px(y) : x ∈ X },
it is in some sense natural to consider this “space of distributions” to be a manifold, where the Fisher
information is the matrix, G(x) (with the α = 0 connection employed; see [2] for details).
Because of this, Girolami and Calderhead [1] proposed a variant of the Fisher metric for geometric
Markov chain Monte Carlo, as:

$$G(x) = \mathbb{E}_{y|x}\left[ -\frac{\partial^2}{\partial x_i \partial x_j} \log f(y|x) \right] - \frac{\partial^2}{\partial x_i \partial x_j} \log \pi_0(x),$$
(46)

where π(x|y) ∝ f (y|x)π0(x) is the target density, f denotes the likelihood and π0 the prior. The metric
is tailored to Bayesian problems, which are a common use for MCMC, so the Fisher information is
combined with the negative Hessian of the log-prior. One can also view this metric as the expected
negative Hessian of the log target, since this naturally reduces to (46).
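As an illustration of (46), consider Bayesian logistic regression with a design matrix X and a Gaussian prior x ~ N(0, αI); this model, the variable names and the simulated design matrix are assumptions made for the example. The Fisher information of the logistic likelihood is the standard expression XᵀWX, where W is diagonal with the Bernoulli variances, and the negative Hessian of the log-prior is I/α. A minimal sketch, assuming NumPy:

```python
import numpy as np

def metric_logistic(x, X, alpha):
    """G(x) of Equation (46) for logistic regression with prior x ~ N(0, alpha * I)."""
    p = 1.0 / (1.0 + np.exp(-X @ x))       # success probabilities at parameter x
    W = p * (1.0 - p)                      # Bernoulli variances
    fisher = X.T @ (W[:, None] * X)        # expected negative Hessian of log-likelihood
    prior_term = np.eye(len(x)) / alpha    # negative Hessian of the Gaussian log-prior
    return fisher + prior_term

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))           # hypothetical design matrix
G0 = metric_logistic(np.zeros(3), X, alpha=100.0)
```

In this sketch the likelihood term is positive semi-definite and the prior term is positive-definite, so G(x) is positive-definite for every x.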
The motivation for a Hessian-style metric can also be understood from studying MCMC proposals.
From (45) and by the same logic as for general pre-conditioning methods [32], the objective is to choose
G−1(x) to match the covariance structure of π(x|y) locally. If the target density were Gaussian with
covariance matrix, Σ, then:

$$-\frac{\partial^2}{\partial x_i \partial x_j} \log \pi(x|y) = \{\Sigma^{-1}\}_{ij}.$$
(47)


In the non-Gaussian case, the negative Hessian is no longer constant, but we can imagine that it
matches the correlation structure of π(x|y) locally at least. Such ideas have been discussed in the
geostatistics literature previously [47]. One problem with simply using (47) to define a metric is that
unless π(x|y) is log-concave, the negative Hessian will not be globally positive-definite, although
Petra et al. [48] conjecture that it may be appropriate for use in some realistic scenarios and suggest
some computationally efficient approximation procedures [48].

Example: Take π(x) ∝ 1/(1 + x²), and set G(x) = −∂² log π(x)/∂x². Then, G−1(x) = (1 + x²)²/(2 −
2x²), which is negative if x² > 1, so unusable as a proposal variance.

Girolami and Calderhead [1] use the Fisher metric in part to counteract this problem. Taking
expectations over the data ensures that the likelihood contribution to G(x) in (46) will be positive
(semi-)definite globally (e.g., [49]); so, provided a log-concave prior is chosen, then (46) should
be a suitable choice for G(x). Indeed, Girolami and Calderhead [1] provide several examples in
which geometric MCMC methods using this Fisher metric perform better than their “non-geometric”
counterparts.
Betancourt [50] also starts from the viewpoint that the Hessian (47) is an appropriate choice for
G(x) and defines a mapping from the set of n × n matrices to the set of positive-definite n × n matrices
by taking a “smooth” absolute value of the eigenvalues of the Hessian. This is done in a way such that
derivatives of G(x) are still computable, leading the author to name it the SoftAbs metric. For a fixed
value of x, the negative Hessian, H(x), is first computed and, then, decomposed into U^T D U, where
D is the diagonal matrix of eigenvalues. Each diagonal element of D is then altered by the mapping
tα : R → R, given by:
tα(λi) = λi coth(αλi),
(48)

where α is a tuning parameter (typically chosen to be as large as possible for which eigenvalues remain
non-zero numerically). The function, tα, acts as a smooth absolute value function, but also lifts eigenvalues
that are close to zero to ≈ 1/α. It should be noted that while the Fisher metric is only defined for
models in which a likelihood is present and for which the expectation is tractable, the SoftAbs metric
can be found for any target distribution, π(·).
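The mapping (48) can be sketched in a few lines, assuming NumPy. The function name `softabs`, the threshold used to detect numerically zero eigenvalues and the test matrix are assumptions of this sketch, not details from [50]. Note that `numpy.linalg.eigh` returns the decomposition in the form H = U D Uᵀ, with eigenvectors as columns of U.

```python
import numpy as np

def softabs(H, alpha):
    """Map a symmetric matrix H to a positive-definite one via t(l) = l * coth(alpha * l)."""
    lam, U = np.linalg.eigh(H)                  # H = U diag(lam) U^T
    small = np.abs(alpha * lam) < 1e-8          # eigenvalues that are numerically zero
    safe = np.where(small, 1.0, lam)            # placeholder avoids division by zero
    # l * coth(alpha * l) tends to 1 / alpha as l tends to zero
    t = np.where(small, 1.0 / alpha, safe / np.tanh(alpha * safe))
    return U @ np.diag(t) @ U.T

H = np.diag([2.0, -3.0, 0.0])                   # indefinite, singular test matrix
G_soft = softabs(H, alpha=1e6)
```

For large α the positive eigenvalue is essentially unchanged, the negative one is reflected, and the zero eigenvalue is lifted to 1/α.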
Several authors (e.g., [1,48]) have noted that for many problems, the terms involving derivatives
of G(x) are small, and so, it is not always worth the computational effort of evaluating them.
Girolami and Calderhead [1] propose the simplified manifold MALA, in which proposals are of
the form:

$$Q(x, \cdot) \equiv \mathcal{N}\left( x + \frac{\lambda^2}{2} G^{-1}(x) \nabla \log \pi(x), \; \lambda^2 G^{-1}(x) \right).$$
(49)

Using this method means derivatives of G(x) are no longer needed, so more pragmatic ways of
regularising the Hessian are possible. One simple approach would be to take the absolute values
of each eigenvalue, giving G(x) = U^T |D| U, where H(x) = U^T D U is the negative Hessian and |D|
is a diagonal matrix with {|D|}ii = |λi| (this approach may fall into difficulties if eigenvalues are
numerically zero). Another would be choosing G(x) as the “nearest” positive-definite matrix to the
negative Hessian, according to some distance metric on the set of n × n matrices. The problem has, in
fact, been well-studied in mathematical finance, in the context of finding correlations using incomplete
data sets [51], and tackled using distances induced by the Frobenius norm. Approximate solution
algorithms are discussed in Higham [51]. It is not clear to us at present whether small changes to
the Hessian would result in large changes to the corresponding positive definite matrix under a
given distance or, indeed, whether given a distance metric on the space of matrices, there is always a
well-defined unique “nearest” positive definite matrix. Below, we provide two simple examples
showing how a “Hessian-style metric” can alleviate some of the difficulties associated with both heavy
and light-tailed target densities.
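The two pragmatic regularisations mentioned above can be sketched as follows, assuming NumPy. For a symmetric input, flooring the eigenvalues at zero is the standard Frobenius-norm projection onto the positive semi-definite cone; the small positive floor used here, the function names and the test matrix are our own assumptions, chosen to keep the result invertible.

```python
import numpy as np

def abs_eig(H):
    """Replace each eigenvalue of the symmetric matrix H by its absolute value."""
    lam, U = np.linalg.eigh(H)                 # H = U diag(lam) U^T
    return U @ np.diag(np.abs(lam)) @ U.T

def clip_eig(H, floor=1e-6):
    """Raise every eigenvalue of H that lies below `floor` up to `floor`."""
    lam, U = np.linalg.eigh(H)
    return U @ np.diag(np.maximum(lam, floor)) @ U.T

H = np.array([[1.0, 2.0],
              [2.0, 1.0]])                     # eigenvalues 3 and -1

G_abs = abs_eig(H)
G_clip = clip_eig(H)
dist_abs = np.linalg.norm(H - G_abs, "fro")    # Frobenius distance to H
dist_clip = np.linalg.norm(H - G_clip, "fro")
```

In this example the clipped matrix lies closer to H in the Frobenius norm than the absolute-value version, but it is nearly singular, so the implied proposal covariance G−1(x) is very large in one direction.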


Example: Take π(x) ∝ 1/(1 + x²), and set G(x) = | − ∂² log π(x)/∂x²|. Then, G−1(x)∇ log π(x) =
−x(1 + x²)/|1 − x²|, which no longer tends to zero as |x| → ∞, suggesting a manifold variant of MALA with
a Hessian-style metric may avoid some of the pitfalls of the standard algorithm. Note that the drift may become
very large if |x| ≈ 1, but since this event occurs with probability zero, we do not see it as a major cause for
concern.

Example: Take π(x) ∝ exp(−x⁴), and set G(x) = | − ∂² log π(x)/∂x²|. Then, G−1(x)∇ log π(x) = −x/3,
which is O(x), so alleviating the problem of spiralling proposals for light-tailed targets demonstrated by MALA
in an earlier example.
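Both drift expressions in the examples above can be checked numerically with central finite differences. The sketch below assumes NumPy; the helper name `hessian_drift`, the step size and the evaluation point x = 3 are illustrative choices.

```python
import numpy as np

def hessian_drift(log_pi, x, eps=1e-4):
    """G^{-1}(x) d/dx log pi(x), with G(x) = |-(d^2/dx^2) log pi(x)|, by central differences."""
    grad = (log_pi(x + eps) - log_pi(x - eps)) / (2.0 * eps)
    hess = (log_pi(x + eps) - 2.0 * log_pi(x) + log_pi(x - eps)) / eps**2
    return grad / abs(hess)

log_cauchy = lambda x: -np.log1p(x**2)    # pi(x) proportional to 1 / (1 + x^2)
log_quartic = lambda x: -x**4             # pi(x) proportional to exp(-x^4)

x = 3.0
drift_cauchy = hessian_drift(log_cauchy, x)      # compare with -x (1 + x^2) / |1 - x^2|
drift_quartic = hessian_drift(log_quartic, x)    # compare with -x / 3
exact_cauchy = -x * (1.0 + x**2) / abs(1.0 - x**2)
exact_quartic = -x / 3.0
```

The numerical and closed-form values should agree to several decimal places at any point away from |x| = 1 (first example) and x = 0 (second example), where G(x) vanishes.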

Other choices for G(x) have been proposed, which are not based on the Hessian. These have the
advantage that gradients need not be computed (either analytically or using computational methods).
Sejdinovic et al. [52] propose a Metropolis–Hastings method, which can be viewed as a geometric
variant of the RWM, where the choice for G(x) is based on mapping samples to an appropriate
feature space, and performing principal component analysis on the resulting features to choose a local
covariance structure for proposals.
If we consider the RWM with Gaussian proposals to be an Euler–Maruyama discretisation of
Brownian motion on a manifold, then proposals will take the form Q(x, ·) ≡ N (x + λ²Ω(x), λ²G−1(x)).
If we assume (as in the simplified manifold MALA) that Ω(x) ≈ 0, then we have proposals centred
at the current point in the Markov chain with a local covariance structure (the full Hastings acceptance
rate must now be used as q(x′|x) ≠ q(x|x′) in general).
As no gradient information is needed, the Sejdinovic et al. metric can be used in conjunction with
the pseudo-marginal MCMC algorithm, so that π(x|y) need not be known exactly. Examples from the
article demonstrate the power of the approach [52].
An important property of any Riemannian metric is how it transforms under coordinate change
(e.g., [2]). The Fisher information metric commonly studied in information geometry is an example of a
“coordinate invariant” choice for G(x). If we consider two parametrisations for a statistical model given
by x and z = t(x), computing the Fisher information under x and then transforming this matrix using
the Jacobian for the mapping, t, will give the same result as computing the Fisher information under z.
It should be noted that because of either the prior contribution in (46) or the nonlinear transformations
applied in other cases, none of the metrics we have reviewed here have this property, which means
that we have no principled way of understanding how G(x) will relate to G(z). It is intuitive, however,
that using information from all of π(x), rather than only the likelihood contribution, f (y|x), would
seem sensible when trying to sample from π(·).

5. Survey of Applications

Rather than conduct our own simulation study, we instead highlight some cases in the literature
where geometric MCMC methods have been used with success.
Martin et al. [53] consider Bayesian inference for a statistical inverse problem, in which a surface
explosion causes seismic waves to travel down into the ground (the subsurface medium). Often,
the properties of the subsurface vary with distance from ground level or because of obstacles in the
medium, in which case, a fraction of the waves will scatter off these boundaries and be reflected
back up to ground level at later times. The observations here are the initial explosion and the waves,
which return to the surface, together with return times. The challenge is to infer the properties of the
subsurface medium from this data. The authors construct a likelihood based on the wave equation for
the data and perform Bayesian inference using a variant of the manifold MALA. Figures are provided
showing the local correlations present in the posterior and, therefore, highlighting the need for an
algorithm that can navigate the high density region efficiently. Several methods are compared in the
paper, but the variant of MALA that incorporates a local correlation structure is shown to be the most
efficient, particularly as the dimension of the problem increases [53].


Calderhead and Girolami [54] dealt with two models for biological phenomena based on nonlinear
dynamical systems. A model of circadian control in the Arabidopsis thaliana plant comprised a system
of six nonlinear differential equations, with twenty two parameters to be inferred. Another model
for cell signalling consisted of a system of six nonlinear differential equations with eight parameters,
with inference complicated by the fact that observations of the model are not recorded directly [54].
The resulting inference was performed using RWM, MALA and geometric methods, with the results
highlighting the benefits of taking the latter approach. The simplified variant of MALA on a manifold
is reported to have produced the most efficient inferences overall, in terms of effective sample size per
unit of computational time.
Stathopoulos and Girolami [55] considered the problem of inferring parameters in Markov jump
processes. In the paper, a linear noise approximation is shown, which can make inference in such
models more straightforward, enabling an approximate likelihood to be computed. Models based on
chemical reaction dynamics are considered; one such from chemical kinetics contained four unknown
parameters; another from gene expression consisted of seven. Inference was performed using the
RWM, the simplified manifold MALA and Hamiltonian methods, with the MALA reported as most
efficient according to the chosen diagnostics. The authors note that the simplified manifold method is
both conceptually simple and able to account for local correlations, making it an attractive choice for
inference [55].
Konukoglu et al. [56] designed a method for personalising a generic model for a physiological
process to a specific patient, using clinical data. The personalisation took the form of patient-specific
parameter inference. The authors highlight some of the difficulties of this task in general, including the
complexity of the models and the relative sparsity of the datasets, which often result in a parameter
identifiability issue [56]. The example discussed in the paper is the Eikonal-diffusion model describing
electrical activity in cardiac tissue, which results in a likelihood for the data based on a nonlinear partial
differential equation, combined with observation noise [56]. A method for inference was developed by
first approximating the likelihood using a spectral representation and then using geometric MCMC
methods on the resulting approximate posterior. The method was first evaluated on synthetic data
and then repeated on clinical data taken from a study for ventricular tachycardia radio-frequency
ablation [56].

6. Discussion

The geometric viewpoint is not necessary to understand manifold variants of the MALA. Indeed,
several authors [32,33] have discussed these algorithms without considering them to be “geometric”,
rather simply Metropolis–Hastings methods in which proposal kernels have a position-dependent
covariance structure. We do not claim that the geometric view is the only one that should be taken.
Our goal is merely to point out that such position-dependent methods can often be viewed as methods
defined on a manifold and that studying the structure of the manifold itself may lead to new insights on
the methods. For example, taking the geometric viewpoint and noting the connection with information
geometry enabled Girolami and Calderhead to adopt the Fisher metric for calculations [1]. We list here
a few open questions on which the geometric viewpoint may help shed some insight.
Computationally-minded readers will have noted that using position-dependent covariance
matrices adds a significant computational overhead in practice, with additional O(n3) matrix inversions
required at each step of the corresponding Metropolis–Hastings algorithms. Clearly, there will be
many problems for which the matrix, G(x), does not change very much, and therefore, choosing
a constant covariance G−1(x) = Σ may result in a more efficient algorithm overall. Geometrically,
this would correspond to a manifold with scalar curvature close to zero everywhere. It may be that
geometric ideas could be used to understand whether the manifold is flat enough that a constant choice
of G(x) is sufficient. Making proper sense of this would require a relationship between curvature, an
inherently local property, and more global statements about the manifold. Many results in differential
geometry, beginning with the celebrated Gauss–Bonnet theorem, have previously related global and
local properties in this way [57]. It is unknown to the authors whether results exist relating the
curvature of a manifold to some global property, but this is an interesting avenue for further research.
A related question is when to choose the simplified manifold MALA over the full method.
Problems in which the term, ∥Λ(x)∥, is sufficiently large to warrant calculation correspond to those for
which the manifold has very high curvature in many places; so again, making some global statement
related to curvature could help here.
Although there is a reasonably intuitive argument for why the Hessian is an appropriate starting
point for G(x), the lack of positive-definiteness may be seen as a cause for concern by some. After
all, it could be argued that if the curvature is not positive-definite in a region, then it cannot be a
reasonable approximation to the local covariance structure. Many statistical models used to describe
natural phenomena are characterised by distributions with heavy tails or multiple modes, for which
this is the case. In addition, for target densities of the form π(x) ∝ exp(−|x|), the Hessian is everywhere
equal to zero! The attempts to force positive-definiteness we have described will typically result in
small moves being proposed in such regions of the sample space, which may not be an optimal strategy.
Much work in information geometry has centred on the geometry of Hessian structures [58], and some
insights from this field may help to better understand the question of choosing an appropriate metric.
In addition, the field of pseudo-Riemannian geometry deals with forms of G(x), which need not be
positive-definite [39]; so again, understanding could be gained from here.
Some recent work in high-dimensional inference has centred on defining MCMC methods for
which efficiency scales O(1) with respect to the dimension, n, of π(·) [19,59]. In the case where X
takes values in some infinite-dimensional function space, this can be done provided a Gaussian prior
measure is defined for X. A striking result from infinite-dimensional probability is that two
different probability measures defined over some infinite-dimensional space tend
to have disjoint supports [60]. The key challenge for MCMC is to define transition kernels for which
proposed moves are inside the support for π(·). A straight-forward approach is to define proposals for
which the prior is invariant, since the likelihood contribution to the posterior typically will not alter
its support from that of the prior [19]. However, the posterior may still look very different from the
prior, as noted in [61], so this proposal mechanism, though O(1), can still result in slow exploration.
Understanding the geometry of the support and defining methods that incorporate the likelihood term,
but also respect this geometry, so as to ensure proposals remain in the support of π(·), is an intriguing
research proposition.
The methods reviewed in this paper are based on first order Langevin diffusions. Algorithms
have also been developed that are based on second order Langevin diffusions, in which a stochastic
differential equation governs the behaviour of the velocity of a process [62,63]. A natural extension to
the work of Girolami and Calderhead [1] and Xifara et al. [33] would be to map such diffusions onto
a manifold and derive Metropolis–Hastings proposal kernels based on the resulting dynamics. The
resulting scheme would be a generalisation of [63], though the most appropriate discretisation scheme
for a second order process to facilitate sampling is unclear and perhaps a question worthy of further
exploration.
We have focused primarily here on the sample space X = Rn and on defining an appropriate
manifold on which to construct Markov chains. In some inference problems, however, the sample
space is a pre-defined manifold, for example the set of n × n rotation matrices, commonly found in the
field of directional statistics [64]. Such manifolds are often not globally mappable to Euclidean n-space.
Methods have been devised for sampling from such spaces [65,66]. In order to use the methods
described here for such problems, an appropriate approach for switching between coordinate patches
at the relevant time would need to be devised, which could be an interesting area of further study.
Alongside these geometric problems, we can also discuss geometric MCMC methods from a
statistical perspective. The examples given at the end of Section 4.4 hinted that the manifold MALA
may cope better with target distributions with heavy or light tails. In fact, Latuszynski et al. [67] have shown
that, in one dimension, the manifold MALA is geometrically ergodic for a class of targets of the
form π(x) ∝ exp(−|x|^β) for any choice of β ≠ 1. This incorporates cases where tails are heavier
than exponential and lighter than Gaussian, two scenarios under which geometric ergodicity fails for
the MALA.
Finding optimal acceptance rates and scaling of λ with dimension are two other related challenges.
In this case, the picture is more complex. Traditional results have been shown for Metropolis–Hastings
methods in the case where the components of the target distribution are independent and identically
distributed, or where π(·) has some other suitable symmetry and regularity in its shape.
Manifold methods are, however,
specifically tailored to scenarios in which this is not the case, scenarios in which there is a high
correlation between components of x, which changes depending on the value of x. It is less clear how
to proceed with finding relevant results that can serve as guidelines to practitioners here. Indeed,
Sherlock [18] notes that a requirement for optimal acceptance rate results for the RWM to be appropriate
is that the curvature of π(x) does not change too much, yet this is the very scenario in which we would
want to use a manifold method.

Acknowledgments: We thank the two reviewers for helpful comments and suggestions. Samuel Livingstone is
funded by a PhD Scholarship from Xerox Research Centre Europe. Mark Girolami is funded by an Engineering
and Physical Sciences Research Council Established Career Research Fellowship, EP/J016934/1, and a Royal
Society Wolfson Research Merit Award.

Author Contributions: The article was written by Samuel Livingstone under the guidance of Mark Girolami. All authors
have read and approved the final manuscript.

Appendix

Appendix A. Total Variation Distance

We show how to obtain (10) from (9). Denoting two probability distributions, μ(·) and ν(·), and
associated densities, μ(x) and ν(x), we have:

$$\|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|.$$

Define the set B = {x ∈ X : μ(x) > ν(x)}. To see that B ∈ B, note that B = ∪q∈Q{x ∈ X : μ(x) >
q} ∩ {x ∈ X : ν(x) < q}, and the result follows from properties of B (e.g., [68]). Now, for any A ∈ B:

μ(A) − ν(A) ≤ μ(A ∩ B) − ν(A ∩ B) ≤ μ(B) − ν(B),

and similarly:
ν(A) − μ(A) ≤ ν(Bc) − μ(Bc),

so, the supremum will be attained either at B or Bc. However, since μ(X ) = ν(X ) = 1, then:

[μ(B) − ν(B)] − [ν(Bc) − μ(Bc)] = 0,

so that
|μ(B) − ν(B)| = |μ(Bc) − ν(Bc)|.

Using these facts gives an alternative characterisation of the total variation distance as:

$$\|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2} \left( |\mu(B) - \nu(B)| + |\mu(B^c) - \nu(B^c)| \right) = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)| \, dx,$$

as required.
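The equivalence of the two characterisations can be checked by brute force on a finite sample space, where the integral becomes a sum. This discrete analogue, the function names and the two probability vectors are assumptions made purely for illustration; the sketch assumes NumPy.

```python
import itertools
import numpy as np

def tv_sup(p, q):
    """sup_A |mu(A) - nu(A)|, by enumerating every subset A of a finite space."""
    best = 0.0
    for mask in itertools.product([False, True], repeat=len(p)):
        A = np.array(mask)
        best = max(best, abs(p[A].sum() - q[A].sum()))
    return best

def tv_half_l1(p, q):
    """(1/2) sum_x |mu(x) - nu(x)|, the discrete analogue of the integral form."""
    return 0.5 * np.abs(p - q).sum()

p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.2, 0.4])

tv_by_sup = tv_sup(p, q)
tv_by_l1 = tv_half_l1(p, q)
```

Here the supremum is attained at the set B = {x : μ(x) > ν(x)}, which contains only the first point, in agreement with the argument above.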


Appendix B. Gradient and Divergence Operators on a Riemannian Manifold

The gradient of a function on Rn is the unique vector field, such that, for any unit vector, u:

\[
\langle \nabla f(x), u \rangle = D_u[f(x)] = \lim_{h \to 0} \left\{ \frac{f(x + hu) - f(x)}{h} \right\}, \tag{A1}
\]

the directional derivative of f along u at x ∈ Rn.
On a manifold, the gradient operator, $\nabla_M$, can still be defined, such that the inner product
$g_p(\nabla_M f(x), u) = D_u[f(x)]$. Setting $\nabla_M = G(x)^{-1}\nabla$ gives:

\[
g_p(\nabla_M f(x), u) = \left(G^{-1}(x)\nabla f(x)\right)^T G(x)\,u = \langle \nabla f(x), u \rangle,
\]

which is equal to the directional derivative along u as required.
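The identity above can be checked numerically. The following is a minimal sketch in two dimensions; the function f, the metric G(x) and the helper names are all hypothetical choices made for illustration only.

```python
def grad_f(x):
    """Euclidean gradient of the illustrative function f(x) = x0^2 + 3*x1."""
    return [2.0 * x[0], 3.0]

def metric(x):
    """An illustrative symmetric positive-definite, position-dependent metric G(x)."""
    return [[1.0 + x[0] ** 2, 0.5],
            [0.5, 2.0]]

def solve_2x2(G, b):
    """Solve G z = b for a 2x2 matrix G using Cramer's rule."""
    (a, c), (d, e) = G
    det = a * e - c * d
    return [(b[0] * e - c * b[1]) / det, (a * b[1] - d * b[0]) / det]

def metric_inner(G, v, u):
    """Inner product g_p(v, u) = v^T G u."""
    return sum(v[i] * G[i][j] * u[j] for i in range(2) for j in range(2))

x = [1.0, -2.0]
u = [0.6, 0.8]                                 # a unit vector
manifold_grad = solve_2x2(metric(x), grad_f(x))  # G(x)^{-1} grad f(x)

lhs = metric_inner(metric(x), manifold_grad, u)    # g_p(grad_M f, u)
rhs = sum(g * ui for g, ui in zip(grad_f(x), u))   # <grad f, u>
print(lhs, rhs)
```

The two printed values coincide up to rounding, whatever positive-definite G(x) is used.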
The divergence of some vector field, v, at a point, x ∈ Rn, is the net outward flow generated by
v through some small neighbourhood of x. Mathematically, the divergence of v(x) ∈ R3 is given by
∑i ∂vi/∂xi. On a more general manifold, the divergence is also a sum of derivatives, but here, they
are covariant derivatives. A short introduction is provided in Appendix C. Here, we simply state
that the covariant derivative of a vector field, v, at a point p ∈ M is the orthogonal projection of the
directional derivative onto the tangent space, TpM. Intuitively, a vector field on a manifold is a field
of vectors, each of which lies in the tangent space to a point, p ∈ M. It therefore only makes sense to
discuss how vector fields change along the manifold or in the direction of vectors, which also lie in the
tangent space. Although the idea seems simple, the covariant derivative has some attractive geometric
properties; notably, it can be written completely in local coordinates, and so does not depend on
knowledge of an embedding in some ambient space.
The divergence of a vector field, v, defined on a manifold, M, at the point, p ∈ M, is defined as:

\[
\mathrm{div}_M(v) = \sum_{i=1}^{n} D^c_{e_i}[v^i],
\]

where ei denotes the i-th basis vector for the tangent space, TpM, at p ∈ M, and vi denotes the i-th
coefficient. This can be written in local coordinates (see Appendix C) as:

\[
\mathrm{div}_M(v) = |G(x)|^{-\frac{1}{2}} \sum_{i=1}^{n} \frac{\partial}{\partial x_i}\left(|G(x)|^{\frac{1}{2}} v^i\right),
\]

and can be combined with ∇M to form the Laplace–Beltrami operator (41).
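To make the local-coordinate formula concrete, the sketch below evaluates it by central finite differences for the plane in polar coordinates (r, θ), where G = diag(1, r²) and |G|^{1/2} = r, and compares the result with the ordinary Cartesian divergence. The vector field, the step size and all names are assumptions chosen for illustration.

```python
import math

def cartesian_field(x, y):
    """Illustrative vector field in Cartesian components; its divergence is 3x."""
    return (x * x - y, x * y)

def polar_components(r, theta):
    """Components (v^r, v^theta) of the same field in the coordinate basis."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    vx, vy = cartesian_field(x, y)
    v_r = (x * vx + y * vy) / r
    v_theta = (-y * vx + x * vy) / (r * r)
    return (v_r, v_theta)

def sqrt_det_metric(r):
    """|G|^(1/2) for the polar metric G = diag(1, r^2)."""
    return r

def divergence_polar(r, theta, h=1e-5):
    """|G|^(-1/2) * sum_i d/dx_i (|G|^(1/2) v^i), by central differences."""
    def flux(i, r_, t_):
        return sqrt_det_metric(r_) * polar_components(r_, t_)[i]
    d_r = (flux(0, r + h, theta) - flux(0, r - h, theta)) / (2 * h)
    d_theta = (flux(1, r, theta + h) - flux(1, r, theta - h)) / (2 * h)
    return (d_r + d_theta) / sqrt_det_metric(r)

def divergence_cartesian(x, y, h=1e-5):
    """Usual sum of partial derivatives, by central differences."""
    d_x = (cartesian_field(x + h, y)[0] - cartesian_field(x - h, y)[0]) / (2 * h)
    d_y = (cartesian_field(x, y + h)[1] - cartesian_field(x, y - h)[1]) / (2 * h)
    return d_x + d_y

r, theta = 1.5, 0.7
x, y = r * math.cos(theta), r * math.sin(theta)
print(divergence_polar(r, theta), divergence_cartesian(x, y))
```

Both evaluations approximate the same scalar, as expected, since the divergence does not depend on the coordinate chart.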

Appendix C. Vector Fields and the Covariant Derivative

Here, we provide a short introduction to vector fields and differentiation on a smooth manifold;
see [38,39]. The following geometric notation is used here: (i) vector components are indexed with a
superscript, e.g., $v = (v^1, \ldots, v^n)$; and (ii) repeated subscripts and superscripts are summed over, e.g.,
$v^i e_i = \sum_i v^i e_i$ (known as the Einstein summation convention).
For any smooth manifold, M, the set of all tangent vectors to points on M is known as the tangent
bundle and denoted TM.
A $C^r$ vector field defined on M is a mapping that assigns to each point, p ∈ M, a tangent vector,
$v(p) \in T_pM$. In addition, the components of v(p) in any basis for $T_pM$ must also be $C^r$ [38]. We
will denote the set of all vector fields on M as Γ(TM). For some vector field, v ∈ Γ(TM), at any
point, p ∈ M, the vector, $v(p) \in T_pM$, can be written as a linear combination of some n basis vectors
$\{e_1, \ldots, e_n\}$ as $v = v^i e_i$. To understand how v will change in a particular direction along M, it only
makes sense, therefore, to consider derivatives along vectors in $T_pM$. Two other things must be
considered when defining a derivative along a manifold: (i) how the components, $v^i$, of the vector
field will change; and (ii) how each basis vector, $e_i$, itself will change. For the usual directional
derivative on $\mathbb{R}^n$, the basis vectors do not change, as the tangent space is the same at each point, but
for a more general manifold, this is no longer the case: the $e_i$'s are referred to as a “local” basis for each
$T_pM$.
The covariant derivative, $D^c$, is defined so as to account for these shortcomings. When considering
differentiation along a vector, $u^* \notin T_pM$, $u^*$ is simply projected onto the tangent space. The derivative
with respect to any $u \in T_pM$ can now be decomposed into a linear combination of derivatives of basis
vectors and vector components:

\[
D^c_u[v] = D^c_{u^i e_i}[v^j e_j], \tag{A2}
\]

where the argument, p, has been dropped, but is implied for both components and local basis vectors.
The operator, $D^c_u[v]$, is defined to be linear in both u and v and to satisfy the product rule [38]; so,
Equation (A2) can be decomposed into:

\[
D^c_u[v] = u^i\left(D^c_{e_i}[v^j]\,e_j + v^j D^c_{e_i}[e_j]\right). \tag{A3}
\]

The operator, $D^c$, need, therefore, only be defined along the direction of basis vectors $e_i$ and for vector
component $v^i$ and basis vector $e_i$ arguments.
For components $v^i$, $D^c_{e_j}[v^i]$ is defined as simply the partial derivative $\partial_j v^i := \partial v^i/\partial x_j$. The
directional derivative of some basis vector $e_i$ along some $e_j$ is best understood through the example of
a regular surface $\Sigma \subset \mathbb{R}^3$. Here, $D_{e_j}[e_i]$ will be a vector, $w \in \mathbb{R}^3$. Taking the basis for this space at the
point, p, as $\{e_1, e_2, \hat{n}\}$, where $\hat{n}$ denotes the unit normal to $T_p\Sigma$, we can write $w = \alpha e_1 + \beta e_2 + \kappa \hat{n}$. The
covariant derivative, $D^c_{e_j}[e_i]$, is simply the projection of w onto $T_p\Sigma$, given by $w^* = \alpha e_1 + \beta e_2$. More
generally, at some point, p, in a smooth manifold, M, the covariant derivative $D^c_{e_j}[e_i] = \Gamma^k_{ji} e_k$ (with
upper and lower indices summed over). The coefficients, $\Gamma^k_{ji}$, are known as the Christoffel symbols: $\Gamma^k_{ji}$
denotes the coefficient of the k-th basis vector when taking the derivative of the i-th with respect to the
j-th. If a Riemannian metric, g, is chosen for M, then they can be expressed completely as a function
of g (or in local coordinates as a function of the matrix, G). Using these definitions, Equation (A3) can
be re-written as:

\[
D^c_u[v] = u^i\left(\partial_i v^k + v^j \Gamma^k_{ij}\right) e_k. \tag{A4}
\]

The divergence of a vector field, v ∈ Γ(TM), at the point, p ∈ M, is given by:

\[
\mathrm{div}_M(v) = D^c_{e_i}[v^i], \tag{A5}
\]

where, again, repeated indices are summed over. If $M = \mathbb{R}^n$, this reduces to the usual sum of partial
derivatives, $\partial_i v^i$. On a more general manifold, M, the equivalent expression is:

\[
D^c_{e_i}[v^i] = \partial_i v^i + v^i \Gamma^j_{ij}, \tag{A6}
\]

where, again, repeated indices are summed. As has been previously stated, if a metric, g, and coordinate
chart is chosen for M, the Christoffel symbols can be written in terms of the matrix, G(x). In this
case [69]:

\[
\Gamma^j_{ij} = |G(x)|^{-\frac{1}{2}}\,\partial_i\left(|G(x)|^{\frac{1}{2}}\right), \tag{A7}
\]

so Equation (A6) becomes:

\[
D^c_{e_i}[v^i] = |G(x)|^{-\frac{1}{2}}\,\partial_i\left(|G(x)|^{\frac{1}{2}} v^i\right), \tag{A8}
\]

where v = v(x).
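The step from Equations (A6) and (A7) to Equation (A8) is an application of the product rule. As a brief check, writing $|G|$ for $|G(x)|$:

\[
\partial_i v^i + v^i\,|G|^{-\frac{1}{2}}\,\partial_i\left(|G|^{\frac{1}{2}}\right)
= |G|^{-\frac{1}{2}}\left(|G|^{\frac{1}{2}}\,\partial_i v^i + v^i\,\partial_i |G|^{\frac{1}{2}}\right)
= |G|^{-\frac{1}{2}}\,\partial_i\left(|G|^{\frac{1}{2}} v^i\right).
\]

As an illustration that is not taken from the article, consider the unit sphere with coordinates $(\theta, \phi)$, $0 < \theta < \pi$, and the round metric $G = \mathrm{diag}(1, \sin^2\theta)$, so that $|G|^{\frac{1}{2}} = \sin\theta$. Equation (A8) then gives the familiar expression:

\[
\mathrm{div}_M(v) = \frac{1}{\sin\theta}\,\frac{\partial}{\partial\theta}\left(\sin\theta\, v^{\theta}\right) + \frac{\partial v^{\phi}}{\partial\phi}.
\]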

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B 2011, 73, 123–214.
2. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2007; Volume 191.
3. Marriott, P.; Salmon, M. Applications of Differential Geometry to Econometrics; Cambridge University Press: Cambridge, UK, 2000.
4. Betancourt, M.; Girolami, M. Hamiltonian Monte Carlo for Hierarchical Models. 2013, arXiv: 1312.0906.
5. Neal, R. MCMC using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo; Chapman and Hall/CRC: Boca Raton, FL, USA, 2011; pp. 113–162.
6. Betancourt, M.; Stein, L.C. The Geometry of Hamiltonian Monte Carlo. 2011, arXiv: 1112.4118.
7. Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer: New York, NY, USA, 2004; Volume 319.
8. Tierney, L. Markov chains for exploring posterior distributions. Ann. Stat. 1994, 22, 1701–1728.
9. Kipnis, C.; Varadhan, S. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Commun. Math. Phys. 1986, 104, 1–19.
10. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2012.
11. Plummer, M.; Best, N.; Cowles, K.; Vines, K. CODA: Convergence diagnosis and output analysis for MCMC. R News 2006, 6, 7–11.
12. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435.
13. Jones, G.L.; Hobert, J.P. Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Stat. Sci. 2001, 16, 312–334.
14. Jones, G.L. On the Markov chain central limit theorem. Probab. Surv. 2004, 1, 299–320.
15. Gelman, A.; Rubin, D.B. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992, 7, 457–472.
16. Sherlock, C.; Fearnhead, P.; Roberts, G.O. The random walk Metropolis: Linking theory and practice through a case study. Stat. Sci. 2010, 25, 172–190.
17. Sherlock, C.; Roberts, G. Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 2009, 15, 774–798.
18. Sherlock, C. Optimal scaling of the random walk Metropolis: General criteria for the 0.234 acceptance rule. J. Appl. Probab. 2013, 50, 1–15.
19. Beskos, A.; Kalogeropoulos, K.; Pazos, E. Advanced MCMC methods for sampling on diffusion pathspace. Stoch. Processes Appl. 2013, 123, 1415–1453.
20. Roberts, G.O.; Rosenthal, J.S. Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 2001, 16, 351–367.
21. Roberts, G.O.; Tweedie, R.L. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 1996, 83, 95–110.
22. Mengersen, K.L.; Tweedie, R.L. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat. 1996, 24, 101–121.
23. Jarner, S.F.; Hansen, E. Geometric ergodicity of Metropolis algorithms. Stoch. Processes Appl. 2000, 85, 341–361.
24. Christensen, O.F.; Møller, J.; Waagepetersen, R.P. Geometric ergodicity of Metropolis–Hastings algorithms for conditional simulation in generalized linear mixed models. Methodol. Comput. Appl. Probab. 2001, 3, 309–327.
25. Neal, P.; Roberts, G. Optimal scaling for random walk Metropolis on spherically constrained target densities. Methodol. Comput. Appl. Probab. 2008, 10, 277–297.
26. Jarner, S.F.; Tweedie, R.L. Necessary conditions for geometric and polynomial ergodicity of random-walk-type. Bernoulli 2003, 9, 559–578.
27. Øksendal, B. Stochastic Differential Equations; Springer: New York, NY, USA, 2003.


28. Rogers, L.C.G.; Williams, D. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus; Cambridge University Press: Cambridge, UK, 2000; Volume 2.
29. Meyn, S.P.; Tweedie, R.L. Stability of Markovian processes III: Foster–Lyapunov criteria for continuous-time processes. Adv. Appl. Probab. 1993, 25, 518–518.
30. Coffey, W.; Kalmykov, Y.P.; Waldron, J.T. The Langevin Equation: with Applications to Stochastic Problems in Physics, Chemistry, and Electrical Engineering; World Scientific: Singapore, Singapore, 2004; Volume 14.
31. Roberts, G.O.; Tweedie, R.L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 1996, 2, 341–363.
32. Roberts, G.O.; Stramer, O. Langevin diffusions and Metropolis–Hastings algorithms. Methodol. Comput. Appl. Probab. 2002, 4, 337–357.
33. Xifara, T.; Sherlock, C.; Livingstone, S.; Byrne, S.; Girolami, M. Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Stat. Probab. Lett. 2013, 91, 14–19.
34. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1946, 186, 453–461.
35. Critchley, F.; Marriott, P.; Salmon, M. Preferred point geometry and statistical manifolds. Ann. Stat. 1993, 21, 1197–1224.
36. Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 77–93.
37. Barndorff-Nielsen, O.; Cox, D.; Reid, N. The role of differential geometry in statistical theory. Int. Stat. Rev. 1986, 54, 83–96.
38. Boothby, W.M. An Introduction to Differentiable Manifolds and Riemannian Geometry; Academic Press: San Diego, CA, USA, 1986; Volume 120.
39. Lee, J.M. Smooth Manifolds; Springer: New York, NY, USA, 2003.
40. Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
41. Nash, J.F., Jr. The imbedding problem for Riemannian manifolds. In The Essential John Nash; Princeton University Press: Princeton, NJ, USA, 2002; p. 151.
42. Manton, J.H. A Primer on Stochastic Differential Geometry for Signal Processing. 2013, arXiv: 1302.0430.
43. Stewart, J. Multivariable Calculus; Cengage Learning: Boston, MA, USA, 2011.
44. Hsu, E.P. Stochastic Analysis on Manifolds; American Mathematical Society: Providence, RI, USA, 2002; Volume 38.
45. Kent, J. Time-reversible diffusions. Adv. Appl. Probab. 1978, 10, 819–835.
46. Radhakrishna Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
47. Christensen, O.F.; Roberts, G.O.; Sköld, M. Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models. J. Comput. Graph. Stat. 2006, 15, 1–17.
48. Petra, N.; Martin, J.; Stadler, G.; Ghattas, O. A computational framework for infinite-dimensional Bayesian inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet flow inverse problems. 2013, arXiv: 1308.6221.
49. Pawitan, Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood; Oxford University Press: Oxford, UK, 2001.
50. Betancourt, M. A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. In Geometric Science of Information; Springer: New York, NY, USA, 2013; pp. 327–334.
51. Higham, N.J. Computing the nearest correlation matrix—a problem from finance. IMA J. Numer. Anal. 2002, 22, 329–343.
52. Sejdinovic, D.; Garcia, M.L.; Strathmann, H.; Andrieu, C.; Gretton, A. Kernel Adaptive Metropolis–Hastings. 2013, arXiv: 1307.5302.
53. Martin, J.; Wilcox, L.C.; Burstedde, C.; Ghattas, O. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM J. Sci. Comput. 2012, 34, A1460–A1487.
54. Calderhead, B.; Girolami, M. Statistical analysis of nonlinear dynamical systems using differential geometric sampling methods. Interface Focus 2011, 1, 821–835.
55. Stathopoulos, V.; Girolami, M.A. Markov chain Monte Carlo inference for Markov jump processes via the linear noise approximation. Philos. Trans. R. Soc. A 2013, 371, 20110541.


56. Konukoglu, E.; Relan, J.; Cilingir, U.; Menze, B.H.; Chinchapatnam, P.; Jadidi, A.; Cochet, H.; Hocini, M.; Delingette, H.; Jaïs, P.; et al. Efficient probabilistic model personalization integrating uncertainty on data and parameters: Application to eikonal-diffusion models in cardiac electrophysiology. Prog. Biophys. Mol. Biol. 2011, 107, 134–146.
57. Do Carmo, M.P. Differential Geometry of Curves and Surfaces; Prentice-Hall: Englewood Cliffs, NJ, USA, 1976; Volume 2.
58. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, Singapore, 2007; Volume 1.
59. Cotter, S.; Roberts, G.; Stuart, A.; White, D. MCMC methods for functions: Modifying old algorithms to make them faster. Stat. Sci. 2013, 28, 424–446.
60. Da Prato, G.; Zabczyk, J. Stochastic Equations in Infinite Dimensions; Cambridge University Press: Cambridge, UK, 2008.
61. Law, K.J. Proposals which speed up function-space MCMC. J. Comput. Appl. Math. 2014, 262, 127–138.
62. Ottobre, M.; Pillai, N.S.; Pinski, F.J.; Stuart, A.M. A Function Space HMC Algorithm With Second Order Langevin Diffusion Limit. 2013, arXiv: 1308.0543.
63. Horowitz, A.M. A generalized guided Monte Carlo algorithm. Phys. Lett. B 1991, 268, 247–252.
64. Mardia, K.V.; Jupp, P.E. Directional Statistics; Wiley: New York, NY, USA, 2009; Volume 494.
65. Byrne, S.; Girolami, M. Geodesic Monte Carlo on embedded manifolds. Scand. J. Stat. 2013, 40, 825–845.
66. Diaconis, P.; Holmes, S.; Shahshahani, M. Sampling from a manifold. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton; Institute of Mathematical Statistics: Washington, DC, USA, 2013; pp. 102–125.
67. Latuszynski, K.; Roberts, G.O.; Thiery, A.; Wolny, K. Discussion on “Riemann manifold Langevin and Hamiltonian Monte Carlo methods” (by Girolami, M. and Calderhead, B.). J. R. Stat. Soc. Ser. B 2011, 73, 188–189.
68. Capinski, M.; Kopp, P.E. Measure, Integral and Probability; Springer: New York, NY, USA, 2004.
69. Schutz, B.F. Geometrical Methods of Mathematical Physics; Cambridge University Press: Cambridge, UK, 1984.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).


Article
Variational Bayes for Regime-Switching Log-Normal Models

Hui Zhao and Paul Marriott *

University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada; E-Mail: h6zhao@uwaterloo.ca
* E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.

Received: 14 April 2014; in revised form: 12 June 2014 / Accepted: 7 July 2014 /
Published: 14 July 2014

Abstract: The power of projection using divergence functions is a major theme in information
geometry. One version of this is the variational Bayes (VB) method. This paper looks at VB in the
context of other projection-based methods in information geometry. It also describes how to apply
VB to the regime-switching log-normal model and how it provides a computationally fast solution to
quantify the uncertainty in the model specification. The results show that the method can recover
the model structure exactly, gives reasonable point estimates and is computationally very efficient.
The potential problems of the method in quantifying the parameter uncertainty are discussed.

Keywords: information geometry; variational Bayes; regime-switching log-normal model; model
selection; covariance estimation

1. Introduction

While, in principle, the calculation of the posterior distribution is mathematically straightforward,
in practice, the computation of many of its features, such as posterior densities, normalizing constants
and posterior moments, is a major challenge in Bayesian analysis. Such computations typically
involve high dimensional integrals, which often have no analytical or tractable forms. The variational
Bayes (VB) method was developed to generate tractable approximations to these quantities. This
method provides analytic approximations to the posterior distribution by minimizing the
Kullback–Leibler (KL) divergence from the approximations to the actual posterior and has been
demonstrated to be computationally very fast.
VB gains its computational advantages by making simplifying assumptions about the posterior
dependence structure.
For example, in the simplest form, it assumes posterior independence
between selected sets of parameters. Under these assumptions, the resultant approximate posterior
is either known analytically or can be computed by a simple iteration algorithm similar to the
Expectation-Maximization (EM) algorithm. In this paper, we show that, as well as having advantages
of computational speed, the VB algorithm does an excellent job of model selection, in particular in
finding the appropriate number of regimes.
While the simplification in the dependence gives computational advantages, it also comes at
a cost. For example, we also found that the posterior variance may be underestimated. In [1], we
propose a novel method to compute the true posterior covariance matrix by only using the information
obtained from VB approximations.
The use of projections to particular families is, of course, not new to information geometry (IG).
In [2], we find the famous Pythagorean results concerning projection using α-divergences to α-families,
and other important results on projections based on divergences can be found in [3] and [4] (Chapter 7).

Entropy 2014, 16, 3832–3847; doi:10.3390/e16073832
www.mdpi.com/journal/entropy

1.1. Variational Bayes

Suppose that, in a Bayesian inference problem, we use q(τ) to approximate the posterior p(τ|y),
where y is the data and $\tau = \{\tau_1, \cdots, \tau_p\}$ is the model parameter vector. The KL divergence between
them is defined as

\[
KL\left[q(\tau)\,\|\,p(\tau|y)\right] = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)}\,d\tau, \tag{1}
\]

provided the integral exists. We want to balance two things: keeping the discrepancy between p and q
small, while keeping q tractable. Hence, we seek the q(τ) that minimizes Equation (1), while
keeping q(τ) in an analytically tractable form. First, note that the evaluation of Equation (1) requires
p(τ|y), which may be unavailable, since in the general Bayesian problem, its normalizing constant is
one of the main intractable integrals. However, we note that:

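\[
\begin{aligned}
KL\left[q(\tau)\,\|\,p(\tau|y)\right] &= \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)\,p(y)}\,d\tau + \log p(y) \\
&= -\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\,d\tau + \log p(y).
\end{aligned} \tag{2}
\]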

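Thus, minimizing Equation (1) is equivalent to maximizing the integral $\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\,d\tau$, which
appears with a negative sign in the first term of the right-hand side of Equation (2), since $\log p(y)$ does
not depend on q. The key computational point is that, often, the term p(τ, y) is available even when the
full posterior $p(\tau, y)/\int p(\tau, y)\,d\tau$ is not.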

Definition 1. Let $F(q) = \int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\,d\tau$ and:

\[
\hat{q} = \arg\max_{q \in Q} F(q), \tag{3}
\]

where Q is a predetermined set of probability density functions over the parameter space. Then $\hat{q}$ is called the
variational approximation or variational posterior distribution, and functions of $\hat{q}$ (such as mean, variance, etc.)
are called variational parameters.
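The relationship between the KL divergence, F(q) and log p(y) can be checked on a toy problem. The sketch below assumes a parameter τ taking three values, so that integrals become sums; the joint values p(τ, y) and the candidate q are hypothetical numbers chosen only for illustration.

```python
import math

def free_energy(q, joint):
    """F(q) = sum_tau q(tau) * log(p(tau, y) / q(tau)) for a discrete parameter."""
    return sum(qi * math.log(pi / qi) for qi, pi in zip(q, joint) if qi > 0)

def kl_divergence(q, p):
    """KL[q || p] = sum_tau q(tau) * log(q(tau) / p(tau))."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical joint values p(tau, y) for the observed y and three values of tau.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # p(y), the normalizing constant
posterior = [p / evidence for p in joint]    # p(tau | y)

q = [0.3, 0.5, 0.2]                          # an arbitrary approximating distribution

# Equation (2): KL[q || p(.|y)] = log p(y) - F(q)
print(kl_divergence(q, posterior), math.log(evidence) - free_energy(q, joint))
# F(q) is bounded above by log p(y), with equality at the exact posterior.
print(free_energy(q, joint), free_energy(posterior, joint), math.log(evidence))
```

Note that F(q) only requires the joint p(τ, y), which is why the optimization can proceed without the normalizing constant.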

Some of the power of Definition 1 comes when we assume that all elements of Q have tractable
posteriors. In that case, all variational parameters will then also be tractable when the optimization
can be achieved. A prime example of a choice for Q is the set of all densities that factorize as

\[
q(\tau) = \prod_{i=1}^{d} q_i(\tau_i).
\]

This reduces the computational problem from computing a high dimensional integral to one of
computing a number of one-dimensional ones. Furthermore, as we see in the example of this paper, it
is often the case that the variational families are standard exponential families (since they are often
‘maximum entropy models’ in some sense), and the optimisation problem (3) can be solved by simple
iterative methods with very fast convergence.
The method builds on the principle of variational free energy minimization in physics,
which is concerned with finding the maxima and minima of a functional over
a class of functions, and the method gains its name from this root. Early developments of the method
can be found in machine learning, especially in applications on neural networks [5,6]. The method
has been successfully applied in many different disciplines and domains, for example, in independent
component analysis [7,8], graphical models [9,10], information retrieval [11] and factor analysis [12].


In the statistical literature, an early application of the variational principle can be found in the
work of [13] to construct Bayes estimators. In recent years, the method has received more attention
from both applied and theoretical perspectives; see, for example, [14–18].

1.2. Regime-Switching Models

In this paper, we illustrate the strengths and weaknesses of VB through a detailed case study.
In particular, we look at a model that is used in finance, risk management and actuarial science, the
so-called regime-switching log-normal model (RSLN) proposed, in this context, by [19].
Switching between different states, or regimes, is a common phenomenon in many time series,
and regime-switching models, originally proposed by [20], have been used to model these switching
processes. As demonstrated in [21], the maximum likelihood estimate (MLE) does not give a simple
method to deal with parameter uncertainty; for details of this method, see [21]. The asymptotic
normality of maximum likelihood estimators may not apply for sample sizes commonly found in
practice. Hence, to understand parameter uncertainty, [21] considered the RSLN model in a Bayesian
framework using the Metropolis–Hastings algorithm. Furthermore, model uncertainty, in particular
selecting the correct number of regimes, is a major issue. Hence, model selection criteria have to be
used to choose the “best” model. Hardy [19] found that a two-regime RSLN model maximized the
Bayes information criterion (BIC) [22] for both monthly TSE 300 total return data and S&P 500 total
return data; however, according to the Akaike information criterion (AIC) [23], a three-regime model
was optimal for the S&P data. To account for the model uncertainty associated with the number of
regimes, [24] offered a trans-dimensional model using reversible jump MCMC [25]. We note that BIC
is not necessarily ideal for model selection with state space models [26], although it is still commonly used
in the literature.
MCMC methods make possible the computation of all posterior quantities; however, there are a
number of practical issues associated with their implementation. A primary concern is determining
that the generated chain has, in fact, “converged”. In practice, MCMC practitioners have to resort
to convergence diagnostic techniques. Furthermore, the computational cost can be a concern. Other
implementational issues include the difficulty of making good initialisation choices and of deciding
whether to run the MCMC algorithm as one long chain or as several shorter chains in parallel.
Detailed discussions can be found in [27].
One of the main contributions of this paper is to apply the variational Bayes (VB) method to the
RSLN model and present a solution to quantify the uncertainty in model specification. The VB method
is a technique that provides analytical approximations to the posterior quantities, and in practice, it has
been demonstrated to be a much faster alternative to MCMC methods.

2. Variational Bayes and Informational Geometry

In this section, we explore the relationship between VB and IG, in particular the statistical
properties of divergence-based projections onto exponential families. Here, we used the IG of [2], in
particular the ±1 dual affine parameters for exponential families. One of the most striking results
from [2] is the Pythagorean property of these dual affine coordinate systems. This is illustrated in
Figure 1, which shows a schematic representing a model space containing the distribution f0(x) and
an exponential family f (x; θ).


Figure 1. Projections onto an exponential family. [Schematic showing the distribution f0(x), the exponential family f(x; θ), and the −1- and +1-geodesics.]

The Pythagorean result comes from using the KL divergence to project onto the exponential family
f (x; θ) = ν(x) exp {s(x)θ − ψ(θ)}, i.e.,

\[
\min_{\theta} \int -\log \frac{f(x; \theta)}{f_0(x)}\, f_0(x)\,dx.
\]

All distributions that project to the same point form a −1-flat space defined by all distributions f(x)
with the same mean, i.e.,

\[
E_{\hat{\theta}}(s(x)) = E_{f(x)}(s(x)),
\]

and further, it is Fisher orthogonal to the +1-flat family f(x; θ). The statistical interpretation of this
concerns the behaviour of a model f(x; θ) when the data generation process does not lie in the model.
In contrast to this, we have the VB method, which uses the reverse KL divergence for the projection,
i.e.,

\[
\min_{\theta} \int \log \frac{f(x; \theta)}{f_0(x)}\, f(x; \theta)\,dx.
\]

This results in a Fisher orthogonal projection, shown in Figure 1, but now using a +1-flat family.
This does not have the property that the mean of s(x) is constant, but as we shall see, it does have nice
computational properties when used in the context of Bayesian analysis.
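The mean-matching property of the forward KL projection stated above follows from a first-order condition. As a brief sketch, assuming a scalar θ and enough regularity to differentiate under the integral sign, write $\log f(x; \theta) = \log \nu(x) + s(x)\theta - \psi(\theta)$; the terms that do not involve θ drop out, giving:

\[
\frac{\partial}{\partial \theta} \int -\log \frac{f(x; \theta)}{f_0(x)}\, f_0(x)\,dx
= \frac{\partial}{\partial \theta} \left( \psi(\theta) - \theta\, E_{f_0}(s(x)) \right)
= \psi'(\theta) - E_{f_0}(s(x)).
\]

Setting this to zero and using the standard exponential family identity $\psi'(\theta) = E_{\theta}(s(x))$ shows that the minimizer matches the mean of s(x) under f0. For the reverse KL divergence, the expectation is taken under f(x; θ) itself, so the derivative no longer reduces to a difference of means, which is consistent with the mean of s(x) not being preserved.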
In order to investigate the information geometry of VB, we consider two examples. The first,
in Section 3.1, is selected to maximally illustrate the underlying geometric issues and to get some
understanding of the quality of the VB approximation. The second, in Section 3.2, shows an important
real-world application from actuarial science and is illustrated with simulated and real data.

3. Applications of Variational Bayes

3.1. Geometric Foundation

We consider the simplest model that shows dependence. Let X1, X2 be two binary random
variables, with distribution π := (π00, π10, π01, π11), where P(X1 = i, X2 = j) = πij, i, j ∈ {0, 1}.
Further, let the marginal distributions be denoted by π1 = P(X1 = 1), π2 = P(X2 = 1). We want to
consider the geometry of the VB projection from a general distribution to the family of independent
distributions. This represents the way that VB gains its computational advantages by simplifying the
posterior dependence structure.
The model space is illustrated in Figure 2, where π is represented by a point in the three simplex,
and the independence surface, where π00π11 = π10π01, is also shown.


Figure 2. Space of distributions with independence surface: marginal probability and dependence. [Three-dimensional plot with axes x, y and z.]

Both the interior of the simplex and independence surface are exponential families, and it is
convenient to use the natural parameters for the interior of the simplex:

\[
\xi_1 = \log \frac{\pi_{10}}{\pi_{00}}, \quad \xi_2 = \log \frac{\pi_{01}}{\pi_{00}}, \quad \xi_3 = \log \frac{\pi_{11}\pi_{00}}{\pi_{10}\pi_{01}},
\]

where the independence surface is given by $\xi_3 = 0$. The independence surface can also be
parameterised by the marginal distributions $\pi_1, \pi_2$ or the corresponding natural parameters
$\xi^{ind}_i := \log(\pi_i/(1 - \pi_i))$. Any distribution, π, represented in natural parameters by $(\xi_1, \xi_2, \xi_3)$, has its VB
approximation defined implicitly by the simultaneous equations:

\[
\xi^{ind}_1(\pi_1) = \xi_1 + \xi_3 \pi_2, \tag{4}
\]

\[
\xi^{ind}_2(\pi_2) = \xi_2 + \xi_3 \pi_1. \tag{5}
\]

These can be solved, as is typical with VB methods, by iterating updated estimates of π1 and π2
across the two equations. We show this in a realistic example in the following section.
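The iteration over Equations (4) and (5) can be sketched in a few lines. The code below is a minimal illustration, assuming a strictly positive joint distribution; the example probabilities, the starting values, the tolerance and the function names are our own choices and are not taken from the article.

```python
import math

def natural_parameters(p00, p10, p01, p11):
    """Natural parameters (xi1, xi2, xi3) of a strictly positive 2x2 distribution."""
    xi1 = math.log(p10 / p00)
    xi2 = math.log(p01 / p00)
    xi3 = math.log(p11 * p00 / (p10 * p01))
    return xi1, xi2, xi3

def sigmoid(z):
    """Inverse of the logit map p -> log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def vb_marginals(xi1, xi2, xi3, pi1=0.5, pi2=0.5, tol=1e-12, max_iter=1000):
    """Alternate the updates implied by Equations (4) and (5) until they stabilise."""
    for _ in range(max_iter):
        new_pi1 = sigmoid(xi1 + xi3 * pi2)       # Equation (4)
        new_pi2 = sigmoid(xi2 + xi3 * new_pi1)   # Equation (5)
        converged = abs(new_pi1 - pi1) < tol and abs(new_pi2 - pi2) < tol
        pi1, pi2 = new_pi1, new_pi2
        if converged:
            break
    return pi1, pi2

# Hypothetical dependent distribution (p00, p10, p01, p11).
xi = natural_parameters(0.4, 0.2, 0.1, 0.3)
pi1, pi2 = vb_marginals(*xi)
print(pi1, pi2)
```

When ξ3 = 0, the updates do not depend on each other, and the iteration returns the marginals of the original distribution after a single pass.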
Having seen the VB solution in this simple model, we can investigate the quality of the
approximation. If we were using the forward KL projection, as proposed by [2], then the mean would
be preserved by the approximation, while, of course, the variance structure would be distorted. In the case of
we can investigate the distortion explicitly. Let (ξ1(α), ξ2(α), ξ3(α)) be a +1-geodesic, which cuts
the independence surface orthogonally and is parameterised by α, where α = 0 corresponds to the
independence surface. In this example, all such geodesics can be computed explicitly. Figure 3 shows
the distortion associated with the VB approximation. In the left-hand panel, we show the mean, which
is the marginal probability, P(X1 = 1), for all points on the orthogonal geodesic. We see, as expected,
that this is not constant, but it is locally constant at α = 0, showing that the distortion of the mean can
be small near the independence surface. The right-hand panel shows the dependence, as measured
by the log-odds, for points on the geodesic. As expected, the VB does not preserve the dependence
structure; indeed, it is designed to exploit the simplification of the dependence structure.


Figure 3. Distortion implied by variational Bayes (VB) approximation.

3.2. Variational Bayes for the RSLN Model

The regime-switching log-normal model [19] with a fixed finite number, K, of regimes can be
described as a bivariate discrete time process with the observed data sequence $w_{1:T} = \{w_t\}_{t=1}^{T}$ and
the unobserved regime sequence $S_{1:T} = \{S_t\}_{t=1}^{T}$, where $S_t \in \{1, \cdots, K\}$ and T is the number of
observations. The logarithm of $w_t$, denoted by $y_t = \log w_t$, is assumed normally distributed, having
mean $\mu_i$ and variance $\sigma_i^2$ both dependent on the hidden regime $S_t$. The sequence of $S_{1:T}$ is assumed
to follow a first order Markov chain having transition probabilities $A = (a_{ij})$ with the probabilities
$\pi = (\pi_i)_{i=1}^{K}$ to start the first regime.
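The generative description above can be turned into a short simulation. This is a minimal sketch only: regimes are labelled 0, ..., K−1 rather than 1, ..., K, and the function name and the two-regime parameter values in the usage lines are hypothetical, chosen purely for illustration.

```python
import math
import random

def simulate_rsln(T, initial, transition, means, sds, seed=0):
    """Simulate regimes S_1..S_T and observations w_1..w_T from an RSLN model.

    initial    : list of K starting probabilities (pi)
    transition : K x K list of rows a_i of the transition matrix A
    means, sds : per-regime mean and standard deviation of y_t = log w_t
    """
    rng = random.Random(seed)
    K = len(initial)
    regimes, observations = [], []
    state = rng.choices(range(K), weights=initial)[0]
    for t in range(T):
        if t > 0:
            # First-order Markov transition using row `state` of A.
            state = rng.choices(range(K), weights=transition[state])[0]
        y = rng.gauss(means[state], sds[state])   # y_t is normal given the regime
        regimes.append(state)
        observations.append(math.exp(y))          # w_t = exp(y_t) is log-normal
    return regimes, observations

# Hypothetical two-regime setting: a calm regime and a volatile regime.
regimes, w = simulate_rsln(
    T=120,
    initial=[0.5, 0.5],
    transition=[[0.96, 0.04], [0.21, 0.79]],
    means=[0.012, -0.016],
    sds=[0.035, 0.078],
    seed=1,
)
print(regimes[:10], [round(v, 3) for v in w[:5]])
```

Simulated series of this kind are one way to check whether a fitting procedure recovers a known number of regimes.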
The RSLN model is a special case of more general state-space models, which were studied in
detail by [28]. In this paper, we use this model and simulated and real data to illustrate the VB method
in practice. We also calibrate its performance by referring to [24], which used MCMC methods to fit the
same model to the same data. Here, we are regarding the MCMC analysis as a form of “gold standard”,
but with the cost of being orders of magnitude slower than VB in computational time.
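In the Bayesian framework, we use a symmetric Dirichlet prior for $\pi$, that is,
$p(\pi) = \mathrm{Dir}\left(\pi; \frac{C^{\pi}}{K}, \cdots, \frac{C^{\pi}}{K}\right)$, for $C^{\pi} > 0$.
Let $a_i$ denote the $i$-th row vector of $A$. The prior for $A$ is chosen as
$p(A) = \prod_{i=1}^{K} p(a_i) = \prod_{i=1}^{K} \mathrm{Dir}\left(a_i; \frac{C^{A}}{K}, \cdots, \frac{C^{A}}{K}\right)$, for $C^{A} > 0$,
and the prior distribution for $\{(\mu_i, \sigma_i^2)\}_{i=1}^{K}$ is chosen to be normal-inverse gamma,
$p(\{\mu_i, \sigma_i^2\}_{i=1}^{K}) = \prod_{i=1}^{K} N\left(\mu_i | \sigma_i^2; \gamma, \frac{\sigma_i^2}{\eta^2}\right) IG(\sigma_i^2; \alpha, \beta)$.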

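In the above setting, $C^{\pi}$, $C^{A}$, $\gamma$, $\eta^2$, $\alpha$ and $\beta$ are hyper-parameters. Thus, the joint posterior distribution
of $\pi$, $A$, $\{\mu_i, \sigma_i^2\}_{i=1}^{K}$ and $S_{1:T}$ is $P(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T} | y_{1:T})$ and is proportional to:

$$p(S_1|\pi) \prod_{t=1}^{T-1} p(S_{t+1}|S_t; A) \prod_{t=1}^{T} p(y_t|S_t; \{\mu_i, \sigma_i^2\}_{i=1}^{K}) \, p(\pi) \, p(A) \, p(\{\mu_i, \sigma_i^2\}_{i=1}^{K}). \tag{6}$$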

This posterior distribution and its corresponding marginal posterior distributions are analytically
intractable. In VB, we seek an approximation of Equation (6), denoted by $q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T})$,
that balances two requirements: keeping the discrepancy between Equation (6) and $q$ small,
while keeping $q$ tractable. In general, there are two ways to choose $q$. The first is to specify a particular
distributional family for $q$, for example, the multivariate normal distribution. The other is to choose
$q$ with a simpler dependency structure than that of Equation (6); for example, we choose $q$ to
factorize as:

$$q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}) = q(\pi) \prod_{i=1}^{K} q(a_i) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) q(\sigma_i^2) \, q(S_{1:T}). \tag{7}$$


The Kullback–Leibler (KL) divergence [29] can be used as the measure of dissimilarity between
Equations (6) and (7). For succinctness, we denote $\tau = (\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T})$; thus the KL divergence
is defined as:

$$\mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)} \, d\tau. \tag{8}$$

Note that the evaluation of Equation (8) requires $p(\tau|y)$, which is unavailable. However, we note that:

$$\mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \log p(y) - \int q(\tau) \log \frac{p(\tau, y)}{q(\tau)} \, d\tau.$$
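The identity above can be checked numerically on a toy problem. The following is a minimal sketch, assuming a latent variable with three values (so that the integral becomes a sum) and arbitrary illustrative numbers for the joint distribution; none of these numbers come from the RSLN model.

```python
import math

# Toy joint p(tau, y) for one fixed observation y and a latent tau in {0, 1, 2}.
# The values are arbitrary and purely illustrative.
p_joint = [0.10, 0.25, 0.05]
p_y = sum(p_joint)                          # evidence p(y)
posterior = [v / p_y for v in p_joint]      # p(tau | y)

q = [0.2, 0.5, 0.3]                         # any approximating distribution over tau

# Left-hand side: KL(q || p(tau | y)), which needs the posterior.
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
# Right-hand side: log p(y) minus a term that needs only the joint p(tau, y).
bound = sum(qi * math.log(pj / qi) for qi, pj in zip(q, p_joint))

print(kl, math.log(p_y) - bound)            # the two numbers agree
```

Since $\log p(y)$ does not depend on $q$, minimizing the KL divergence is the same as maximizing the second term, which is why the unavailable posterior is never needed.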

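Given the factorization Equation (7), this can be written as:

$$\mathrm{KL}(q(\tau) \,\|\, p(\tau|y)) = \log p(y) - \int \sum_{S_{1:T}} q(\pi) q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) q(\sigma_i^2) \, q(S_{1:T}) \log \frac{p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})}{q(\pi) q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) q(\sigma_i^2) \, q(S_{1:T})} \, d\pi \, dA \, d\{\mu_i, \sigma_i^2\}_{i=1}^{K}.$$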

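Consider first the $q(\pi)$ term. The right-hand side can be rearranged as:

$$\mathrm{KL}\left( q(\pi) \,\Bigg\|\, \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T}) q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T}) \, dA \, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \right\}}{Z_{\pi}} \right) + K_{\pi}, \tag{9}$$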

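where:

$$K_{\pi} = \int \sum_{S_{1:T}} q(S_{1:T}) q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) q(\sigma_i^2) \log \left[ q(S_{1:T}) q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) q(\sigma_i^2) \right] dA \, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} - \log Z_{\pi} + \log p(y),$$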

and $Z_{\pi}$ is a normalizing term. The first term of Equation (9) is the only term that depends on $q(\pi)$.
Thus, the minimum value of $\mathrm{KL}(q(\tau) \,\|\, p(\tau|y))$ is achieved when this term equals zero. Hence, we
obtain:

$$q(\pi) = \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T}) q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2) q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T}) \, dA \, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \right\}}{Z_{\pi}}. \tag{10}$$

Given the joint distribution $p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})$ in the form of Equation (6), the
straightforward evaluation of Equation (10) results in:

$$q(\pi) \propto \prod_{i=1}^{K} \pi_i^{\frac{C^{\pi}}{K} + w_i^{s} - 1} = \mathrm{Dir}(\pi; w_1^{\pi}, \cdots, w_K^{\pi}); \quad w_i^{\pi} = \frac{C^{\pi}}{K} + w_i^{s}, \quad w_i^{s} = E_{q(S_{1:T})}[S_{1,i}], \tag{11}$$

where $S_{1,i} = 1$ if the process is in state $i$ at time 1, and zero otherwise.


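Similarly, we can rearrange Equation (9) with respect to $\{q(a_i)\}_{i=1}^{K}$, $\{q(\mu_i|\sigma_i^2)\}_{i=1}^{K}$, $\{q(\sigma_i^2)\}_{i=1}^{K}$ and
$q(S_{1:T})$, respectively, and, using the same arguments, obtain:

$$q(A) = \prod_{i=1}^{K} \mathrm{Dir}(a_i; w_{i1}^{A}, \ldots, w_{iK}^{A}); \quad w_{ij}^{A} = \frac{C^{A}}{K} + v_{ij}^{s}, \tag{12}$$

$$q(\mu_i|\sigma_i^2) = N\left(\gamma_i', \frac{\sigma_i^2}{\kappa_i}\right), \quad \gamma_i' = \frac{\eta^2 \gamma + p_i^{s}}{\eta^2 + q_i^{s}}, \quad \kappa_i = \eta^2 + q_i^{s}, \tag{13}$$

$$q(\sigma_i^2) = IG(\alpha_i', \beta_i'), \quad \alpha_i' = \alpha + \frac{q_i^{s}}{2}, \quad \beta_i' = \beta + \frac{r_i^{s}}{2} + \frac{\eta^2}{2}(\gamma_i' - \gamma)^2, \tag{14}$$

$$q(S_{1:T}) = \frac{1}{\tilde{Z}} \prod_{i=1}^{K} (\pi_i^{*})^{S_{1,i}} \prod_{t=1}^{T-1} \prod_{i=1}^{K} \prod_{j=1}^{K} (a_{ij}^{*})^{S_{t,i} S_{t+1,j}} \prod_{t=1}^{T} \prod_{i=1}^{K} (\theta_{i,t}^{*})^{S_{t,i}}, \tag{15}$$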

where $S_{t,i} = 1$ if the process is in state $i$ at time $t$, and zero otherwise, and with
$\pi_i^{*} = e^{E_{q(\pi)}[\log \pi_i]}$, $a_{ij}^{*} = e^{E_{q(A)}[\log a_{ij}]}$, $\theta_{i,t}^{*} = e^{E_{q(\mu_i|\sigma_i^2) q(\sigma_i^2)}[\log \phi_i(y_t)]}$,
$v_{ij}^{s} = \sum_{t=1}^{T-1} E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}]$, $p_i^{s} = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}] y_t$,
$q_i^{s} = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}]$ and $r_i^{s} = \sum_{t=1}^{T} (\gamma_i' - y_t)^2 E_{q(S_{1:T})}[S_{t,i}]$. Here, $\psi$ is the digamma function, $\phi$ is the normal
density function and the exact functional forms used in the updates are shown in Algorithm 1.
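The closed-form updates of Equations (11) to (14) translate almost line by line into code once the expectations under $q(S_{1:T})$ are available. The following is a minimal sketch under stated assumptions: the function name `update_hyperparameters`, the argument names and the toy inputs are illustrative, regimes are indexed from 0, and $r_i^{s}$ is evaluated with the freshly updated $\gamma_i'$ (Algorithm 1 instead carries $r_i^{s}$ over from the previous step).

```python
def update_hyperparameters(y, resp, pair, C_pi, C_A, gamma, eta2, alpha, beta):
    """Closed-form updates of Equations (11)-(14).

    resp[t][i] stands for E_q[S_{t,i}] and pair[t][i][j] for E_q[S_{t,i} S_{t+1,j}].
    """
    T, K = len(y), len(resp[0])
    # Expected sufficient statistics under q(S_{1:T}).
    w_s = list(resp[0])
    v_s = [[sum(pair[t][i][j] for t in range(T - 1)) for j in range(K)] for i in range(K)]
    p_s = [sum(resp[t][i] * y[t] for t in range(T)) for i in range(K)]
    q_s = [sum(resp[t][i] for t in range(T)) for i in range(K)]
    # Dirichlet parameters, Equations (11) and (12).
    w_pi = [C_pi / K + w_s[i] for i in range(K)]
    w_A = [[C_A / K + v_s[i][j] for j in range(K)] for i in range(K)]
    # Normal parameters, Equation (13).
    gamma_new = [(eta2 * gamma + p_s[i]) / (eta2 + q_s[i]) for i in range(K)]
    kappa = [eta2 + q_s[i] for i in range(K)]
    # Inverse-gamma parameters, Equation (14); r_s uses the updated gamma'.
    r_s = [sum((gamma_new[i] - y[t]) ** 2 * resp[t][i] for t in range(T)) for i in range(K)]
    alpha_new = [alpha + q_s[i] / 2 for i in range(K)]
    beta_new = [beta + r_s[i] / 2 + eta2 / 2 * (gamma_new[i] - gamma) ** 2 for i in range(K)]
    return {"w_pi": w_pi, "w_A": w_A, "gamma": gamma_new, "kappa": kappa,
            "alpha": alpha_new, "beta": beta_new}


# Toy inputs: four observations with hard (0/1) regime memberships and K = 2.
y = [0.01, -0.02, 0.03, 0.00]
resp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
pair = [[[resp[t][i] * resp[t + 1][j] for j in range(2)] for i in range(2)] for t in range(3)]
params = update_hyperparameters(y, resp, pair, C_pi=1.0, C_A=1.0,
                                gamma=0.0, eta2=1.0, alpha=2.0, beta=0.1)
```

In this toy case the Dirichlet parameters of $q(A)$ sum to $C^{A} \times K + (T - 1)$, since every one of the $T - 1$ transitions is counted exactly once.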

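Algorithm 1 Variational Bayes algorithm for the regime-switching log-normal (RSLN) model.

Initialize $w_i^{s(0)}$, $v_{ij}^{s(0)}$, $p_i^{s(0)}$, $q_i^{s(0)}$ and $r_i^{s(0)}$ at step 0
while $w_i^{\pi(t-1)}$, $w_{ij}^{A(t-1)}$, $\gamma_i^{\prime(t-1)}$, $\alpha_i^{\prime(t-1)}$, $\beta_i^{\prime(t-1)}$, $\pi_i^{*(t-1)}$, $a_{ij}^{*(t-1)}$ and $\theta_{i,t}^{*(t-1)}$ do not converge do

1. Compute $w_i^{\pi(t)}$, $w_{ij}^{A(t)}$, $\gamma_i^{\prime(t)}$, $\kappa_i^{(t)}$, $\alpha_i^{\prime(t)}$ and $\beta_i^{\prime(t)}$ at step $t$ by:

$$w_i^{\pi(t)} = \frac{C^{\pi}}{K} + w_i^{s(t-1)}, \quad w_{ij}^{A(t)} = \frac{C^{A}}{K} + v_{ij}^{s(t-1)}, \quad \gamma_i^{\prime(t)} = \frac{\eta^2 \gamma + p_i^{s(t-1)}}{\eta^2 + q_i^{s(t-1)}},$$

$$\kappa_i^{(t)} = \eta^2 + q_i^{s(t-1)}, \quad \alpha_i^{\prime(t)} = \alpha + \frac{q_i^{s(t-1)}}{2}, \quad \beta_i^{\prime(t)} = \beta + \frac{r_i^{s(t-1)}}{2} + \frac{\eta^2}{2}\left(\gamma_i^{\prime(t)} - \gamma\right)^2$$

2. Compute $\pi_i^{*(t)}$, $\theta_{i,t}^{*(t)}$ and $a_{ij}^{*(t)}$ at step $t$ by:

$$\pi_i^{*(t)} = \exp\left\{ \psi\left(w_i^{\pi(t)}\right) - \psi\left(\sum_{i=1}^{K} w_i^{\pi(t)}\right) \right\}, \quad a_{ij}^{*(t)} = \exp\left\{ \psi\left(w_{ij}^{A(t)}\right) - \psi\left(\sum_{j=1}^{K} w_{ij}^{A(t)}\right) \right\},$$

$$\theta_{i,t}^{*(t)} = \exp\left\{ -\frac{1}{2} \log 2\pi - \frac{1}{2}\left(\log \beta_i^{\prime(t)} - \psi\left(\alpha_i^{\prime(t)}\right)\right) - \frac{1}{2}\left[ \left(y_t - \gamma_i^{\prime(t)}\right)^2 \frac{\alpha_i^{\prime(t)}}{\beta_i^{\prime(t)}} + \frac{1}{\kappa_i^{(t)}} \right] \right\}$$

3. Compute $w_i^{s(t)}$, $v_{ij}^{s(t)}$, $p_i^{s(t)}$, $q_i^{s(t)}$ and $r_i^{s(t)}$ at step $t$ by:

$$w_i^{s(t)} = E_{q^{(t)}(S_{1:T})}[S_{1,i}], \quad v_{ij}^{s(t)} = \sum_{t=1}^{T-1} E_{q^{(t)}(S_{1:T})}[S_{t,i} S_{t+1,j}], \quad p_i^{s(t)} = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}] y_t,$$

$$q_i^{s(t)} = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}], \quad r_i^{s(t)} = \sum_{t=1}^{T} \left(\gamma_i^{\prime(t)} - y_t\right)^2 E_{q^{(t)}(S_{1:T})}[S_{t,i}]$$

$t \Leftarrow t + 1$
end while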
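Step 2 of Algorithm 1 only needs the digamma function and elementary operations. The following is a minimal sketch, assuming a standard-library-only environment: `digamma` is a hand-written approximation (recurrence plus a truncated asymptotic series) used here in place of a library routine, and the function name `expected_log_weights` and the toy parameter values are illustrative assumptions.

```python
import math


def digamma(x):
    """Digamma for x > 0: recurrence psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def expected_log_weights(y, w_pi, w_A, gamma_p, kappa, alpha_p, beta_p):
    """Step 2 of Algorithm 1: returns pi_star[i], a_star[i][j], theta_star[t][i]."""
    K = len(w_pi)
    pi_star = [math.exp(digamma(w_pi[i]) - digamma(sum(w_pi))) for i in range(K)]
    a_star = [[math.exp(digamma(w_A[i][j]) - digamma(sum(w_A[i]))) for j in range(K)]
              for i in range(K)]
    theta_star = [[math.exp(-0.5 * math.log(2 * math.pi)
                            - 0.5 * (math.log(beta_p[i]) - digamma(alpha_p[i]))
                            - 0.5 * ((yt - gamma_p[i]) ** 2 * alpha_p[i] / beta_p[i]
                                     + 1.0 / kappa[i]))
                   for i in range(K)] for yt in y]
    return pi_star, a_star, theta_star


# Toy variational parameters for K = 2 regimes (illustrative values only).
pi_star, a_star, theta_star = expected_log_weights(
    y=[0.01, -0.02],
    w_pi=[1.5, 0.5],
    w_A=[[2.5, 0.5], [0.5, 1.5]],
    gamma_p=[0.01, -0.02], kappa=[4.0, 2.0],
    alpha_p=[3.5, 2.5], beta_p=[0.01, 0.02],
)
```

Note that the starred quantities are exponentiated expected logarithms, so `pi_star` and the rows of `a_star` sum to less than one; they act as unnormalized weights, with the normalization absorbed into $\tilde{Z}$ in Equation (15).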

The VB method proceeds, as was shown with the simple Equations (4) and (5), by iteratively
updating the variational parameters to solve a set of simultaneous equations. In this example, the
update equations for the variables $\pi$, $A$, $\{\mu_i, \sigma_i^2\}_{i=1}^{K}$, $S_{1:T}$ are given explicitly by Algorithm 1. For the
initialisation, we choose symmetric values for most of the parameters and choose random values for
others, as appropriate. For this example, this worked very satisfactorily, although we note that for more
general state space models [28], finding good initial values can be non-trivial.

3.3. Interpretation of Results

First, all approximating distributions above turn out to lie in well-known parametric families.
The only unknown quantities are the parameters of these distributions, which are often called the
variational parameters.
The evaluation of the parameters of $q(\pi)$, $q(A)$, $q(\mu_i|\sigma_i^2)$ and $q(\sigma_i^2)$ requires knowledge of $q(S_{1:T})$,
and also, the evaluation of $\pi_i^{*}$, $a_{ij}^{*}$ and $\theta_{i,t}^{*}$ requires knowledge of $q(\pi)$, $q(A)$, $q(\mu_i|\sigma_i^2)$ and $q(\sigma_i^2)$. This
structure leads to an iterative updating scheme, described in Algorithm 1.
The main computational effort in Algorithm 1 is computing $E_{q(S_{1:T})}[S_{t,i}]$ and $E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}]$,
which have no simple tractable forms. We note that the distributional form of $q(S_{1:T})$ has a structure
very similar to that of the conditional distribution $p(S_{1:T}|Y_{1:T}, \tau)$, for which the forward-backward
algorithm [30] is commonly used to compute $E_{p(S_{1:T}|Y_{1:T},\tau)}[S_{t,i}|Y_{1:T}, \tau]$ and $E_{p(S_{1:T}|Y_{1:T},\tau)}[S_{t,i} S_{t+1,j}|Y_{1:T}, \tau]$.
Therefore, we also use the forward-backward algorithm to compute $E_{q(S_{1:T})}[S_{t,i}]$ and $E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}]$.
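Because $q(S_{1:T})$ in Equation (15) has the same chain structure as a hidden Markov model posterior, a standard scaled forward-backward pass can be run directly on the weights $\pi_i^{*}$, $a_{ij}^{*}$ and $\theta_{i,t}^{*}$. The following is a minimal sketch, assuming the layout `theta_star[t][i]`, regimes indexed from 0 and arbitrary toy weights; the function name `forward_backward` is illustrative and the per-step rescaling is one common way to avoid numerical underflow, not necessarily the one used for the results in this paper.

```python
def forward_backward(pi_star, a_star, theta_star):
    """Return resp[t][i] = E_q[S_{t,i}] and pair[t][i][j] = E_q[S_{t,i} S_{t+1,j}]."""
    T, K = len(theta_star), len(pi_star)
    # Forward pass, rescaled at every step so that each alpha[t] sums to one.
    alpha = [[0.0] * K for _ in range(T)]
    scale = [0.0] * T
    alpha[0] = [pi_star[i] * theta_star[0][i] for i in range(K)]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        alpha[t] = [theta_star[t][j] * sum(alpha[t - 1][i] * a_star[i][j] for i in range(K))
                    for j in range(K)]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    # Backward pass, using the same scaling constants.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(a_star[i][j] * theta_star[t + 1][j] * beta[t + 1][j] for j in range(K))
                   / scale[t + 1] for i in range(K)]
    # Single-time and pairwise expectations.
    resp = [[alpha[t][i] * beta[t][i] for i in range(K)] for t in range(T)]
    pair = [[[alpha[t][i] * a_star[i][j] * theta_star[t + 1][j] * beta[t + 1][j] / scale[t + 1]
              for j in range(K)] for i in range(K)] for t in range(T - 1)]
    return resp, pair


# Toy (sub-normalized) weights for K = 2 regimes and T = 3 time points.
resp, pair = forward_backward(
    pi_star=[0.6, 0.3],
    a_star=[[0.7, 0.2], [0.3, 0.5]],
    theta_star=[[1.0, 0.5], [0.2, 1.5], [0.8, 0.9]],
)
```

The weights do not need to be normalized: any constant factors cancel in the rescaling, and the product of the scaling constants plays the role of $\tilde{Z}$.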

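The conditional distribution $q(\mu_i|\sigma_i^2)$ is $N\left(\mu_i|\sigma_i^2; \gamma_i', \frac{\sigma_i^2}{\kappa_i}\right)$; hence, the marginal distribution of $\mu_i$
is the location-scale $t$ distribution, denoted as $t_{2\alpha_i'}\left(\mu_i; \gamma_i', \frac{\kappa_i}{\beta_i'/\alpha_i'}\right)$, where the density function of $t_{\nu}(x; \mu, \lambda)$
is defined as:

$$p(x|\nu, \mu, \lambda) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi \nu}\right)^{\frac{1}{2}} \left(1 + \frac{\lambda (x - \mu)^2}{\nu}\right)^{-\frac{\nu+1}{2}},$$

for $x, \mu \in (-\infty, +\infty)$ and $\nu, \lambda > 0$.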
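The density above is easy to evaluate directly. The following is a minimal sketch using only the standard library; the function names `t_pdf` and `t_sd` are illustrative, and the standard deviation formula $\sqrt{\nu / ((\nu - 2)\lambda)}$ (valid for $\nu > 2$) is the standard result for this parameterization, in which $\lambda$ acts as a precision.

```python
import math


def t_pdf(x, nu, mu, lam):
    """Density of the location-scale t distribution t_nu(x; mu, lam)."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(lam * (x - mu) ** 2 / nu))


def t_sd(nu, lam):
    """Standard deviation of t_nu(x; mu, lam); finite only for nu > 2."""
    return math.sqrt(nu / ((nu - 2) * lam))


# With nu = 1, mu = 0 and lam = 1 the density reduces to the standard Cauchy, 1/pi at x = 0.
print(t_pdf(0.0, 1.0, 0.0, 1.0), 1 / math.pi)
# Parameters reported for mu_1 in Table 5; the result rounds to the listed s.d. of 0.00165.
print(t_sd(454.61, 370778.19))
```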

4. Numerical Studies

4.1. Simulated Data

In this section, we apply the VB solutions to four sets of simulated data, which are used in [24].
Through these simulated studies, we test the performance of VB in detecting the number of
regimes and compare it with those of the BIC and the MCMC methods [24]. For this paper, we present
only an initial study with a relatively small number of datasets. The results are highly promising,
but more extensive studies are needed to draw comprehensive conclusions. Furthermore, see [28] for
general results on VB in hidden state space models.
To estimate the number of regimes, we construct a matrix, called the relative magnitude matrix
(RMM), defined as $A' = (\hat{a}'_{ij})$, where $\hat{a}'_{ij} = \frac{w_{ij}^{A}}{w_0^{A}}$, $w_0^{A} = \sum_{i=1}^{K} \sum_{j=1}^{K} w_{ij}^{A}$ and $w_{ij}^{A}$ is the parameter of $q(A)$.
Our model selection procedure is to fit a VB with a large number of regimes and to examine the rows
and columns in the RMM. If the values of the entries in the $i$-th row and the $i$-th column of $A'$ are
all equal to $\frac{C^{A}/K}{T - 1 + C^{A} \times K}$, then we declare regime $i$ nonexistent. This method is validated by the
following observations. It can be shown that the term $v_{ij}^{s}$ in $w_{ij}^{A}$ is equal to the expected number of times
the process leaves regime $i$ and enters regime $j$. Therefore, for the $i$-th regime, values of zero for
all of $v_{ji}^{s}$ and $v_{ij}^{s}$ with $j = 1, \cdots, K$ indicate that there is no transition entering or leaving regime $i$.
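The regime-detection rule can be sketched in a few lines. In the following minimal example, the function names, the relative tolerance `rel_tol` and the toy transition counts are assumptions made for illustration (in practice the entries are compared with the baseline only approximately); regimes are indexed from 0.

```python
def relative_magnitude_matrix(w_A):
    """Divide every Dirichlet parameter w_A[i][j] by the grand total w_0^A."""
    total = sum(sum(row) for row in w_A)
    return [[w / total for w in row] for row in w_A]


def absent_regimes(w_A, C_A, T, rel_tol=1e-6):
    """Regimes whose whole row and column of the RMM sit at the prior-only baseline."""
    K = len(w_A)
    baseline = (C_A / K) / (T - 1 + C_A * K)
    rmm = relative_magnitude_matrix(w_A)
    absent = []
    for i in range(K):
        entries = rmm[i] + [rmm[j][i] for j in range(K)]
        if all(abs(e - baseline) <= rel_tol * baseline for e in entries):
            absent.append(i)
    return absent


# Toy example: K = 3 fitted regimes, T = 11 observations, so T - 1 = 10 transitions,
# none of which enters or leaves the third regime.
C_A, T = 1.0, 11
v_s = [[6.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
w_A = [[C_A / 3 + v for v in row] for row in v_s]
rmm = relative_magnitude_matrix(w_A)
print(absent_regimes(w_A, C_A, T))
```

The baseline follows from $w_0^{A} = C^{A} \times K + \sum_{i,j} v_{ij}^{s} = C^{A} \times K + T - 1$, so an entry with $v_{ij}^{s} = 0$ equals $(C^{A}/K) / (T - 1 + C^{A} \times K)$.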
Table 1 specifies the parameters for the four cases, and we generate 671 observations for each
case (equal to the number of months from January 1956 to September 2011). The parameters used in
Case 1 are identical to the maximum likelihood estimates for TSX monthly return data from 1956 to
1999 [19]. Case 2 only has one regime present. Case 3 is similar to Case 1, but the two regimes have the
same mean. Case 4 adds a third regime. For each case, we use MLE to fit a one-regime, two-regime,
three-regime and four-regime RSLN model and report the corresponding BIC and log-likelihood scores.
We then misspecify the number of regimes and run a four-regime VB algorithm.


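Table 1. Parameters of the simulated data.

| Case | Regime 1 $(\mu_i, \sigma_i)$ | Regime 2 $(\mu_i, \sigma_i)$ | Regime 3 $(\mu_i, \sigma_i)$ | Transition Probability |
|---|---|---|---|---|
| 1 | (0.012, 0.035) | (−0.016, 0.078) | - | $\begin{pmatrix} 0.963 & 0.037 \\ 0.210 & 0.790 \end{pmatrix}$ |
| 2 | (0.014, 0.050) | - | - | - |
| 3 | (0.000, 0.035) | (0.000, 0.078) | - | $\begin{pmatrix} 0.963 & 0.037 \\ 0.210 & 0.790 \end{pmatrix}$ |
| 4 | (0.012, 0.035) | (−0.016, 0.078) | (0.04, 0.01) | $\begin{pmatrix} 0.953 & 0.037 & 0.01 \\ 0.210 & 0.780 & 0.01 \\ 0.80 & 0.190 & 0.01 \end{pmatrix}$ |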

Table 2 shows the number of iterations that VB takes to converge in each case and the
corresponding computational time (on a MacBook, 2 GHz processor). On average, VB converges after
about a hundred iterations and takes about one minute. On the same computer, a $10^4$-iteration Reversible Jump
MCMC (RJMCMC) will take about 10 h to finish. Using diagnostics, this seemed to be enough for
convergence, while not being an “unfair” comparison in terms of time with VB. We can see that
computational efficiency is a very attractive feature of the VB method. The results of the BIC with
the log-likelihood (in parentheses), the relative magnitude matrices and the posterior probabilities
for the models with different numbers of regimes estimated by MCMC (cited from Hartman and
Heaton [24]) are given in Table 3. In Case 1, the BIC favors the two-regime model. The posterior
probability estimated by MCMC for the one-regime model is the largest, but there is still a large
probability for the two-regime model. Note that the prior specification for the number of regimes
can affect these numbers and is always an issue with these forms of multidimensional MCMC. The
relative magnitude matrix clearly shows that there are only two regimes whose $\hat{a}'_{ij}$ are not negligible.
This implies that VB removes excess transition and emission processes and discovers the exact number of
hidden regimes. In Case 2 and Case 3, both VB and the BIC can select the correct number of regimes,
and the posterior probability for the one-regime model estimated by MCMC is still the largest. In Case
4, VB does not detect the third regime. The transition probability to this regime is only 0.01, and the
means and standard deviations of Regime 1 make the rare data from Regime 3 easily merged within
the data from Regime 1. From Table 3, it is clear that for all of the cases, the log-likelihood always
increases as the number of regimes increases.

Table 2. Computational efficiency of VB.

| | Case 1 | Case 2 | Case 3 | Case 4 |
|---|---|---|---|---|
| Iterations to converge | 62 | 182 | 132 | 94 |
| Computational time [s] | 27.161 | 80.842 | 58.510 | 45.044 |


Table 3. The estimated number of regimes by VB, BIC and MCMC.

| Case | No. of Regimes | MLE: BIC (Log Likelihood) | RJMCMC: Posterior Probability |
|---|---|---|---|
| 1 | 1 | 1,108.875 (1,115.384) | 0.647 |
| 1 | 2 | 1,158.227 (1,174.499) | 0.214 |
| 1 | 3 | 1,156.370 (1,182.405) | 0.088 |
| 1 | 4 | 1,153.150 (1,188.948) | <0.052 |
| 2 | 1 | 1,045.448 (1,051.957) | 0.864 |
| 2 | 2 | 1,038.360 (1,054.632) | 0.109 |
| 2 | 3 | 1,030.733 (1,056.768) | 0.020 |
| 2 | 4 | 1,026.882 (1,062.680) | <0.006 |
| 3 | 1 | 1,110.903 (1,117.411) | 0.629 |
| 3 | 2 | 1,139.214 (1,155.486) | 0.221 |
| 3 | 3 | 1,131.904 (1,157.719) | 0.098 |
| 3 | 4 | 1,121.921 (1,157.940) | <0.052 |
| 4 | 1 | 1,044.819 (1,051.328) | 0.641 |
| 4 | 2 | 1,092.610 (1,108.881) | 0.203 |
| 4 | 3 | 1,087.435 (1,113.470) | 0.094 |
| 4 | 4 | 1,080.240 (1,116.038) | <0.06 |

VB: Relative Magnitude Matrix, by case:

Case 1: $\begin{pmatrix} 0.14357 & 0.00004 & 0.00004 & 0.03153 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.03018 & 0.00004 & 0.00004 & 0.79428 \end{pmatrix}$

Case 2: $\begin{pmatrix} 0.99944 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \end{pmatrix}$

Case 3: $\begin{pmatrix} 0.11322 & 0.00004 & 0.00004 & 0.02647 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.02659 & 0.00004 & 0.00004 & 0.83327 \end{pmatrix}$

Case 4: $\begin{pmatrix} 0.22643 & 0.00004 & 0.00004 & 0.05518 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.05377 & 0.00004 & 0.00004 & 0.66417 \end{pmatrix}$

4.2. Real Data

In this section, we apply the VB solution to the TSX monthly total return index in the period from
January 1956 to December 1999 (528 observations in total, studied in [19,21]).
A four-regime VB is implemented first. VB converges after 100 iterations in about 34.284 s (on a
MacBook, 2 GHz processor). The relative magnitude matrix, given in Table 4, clearly shows that VB
identifies two regimes. This matches both the BIC- and AIC-based results [19]. Based on these results,
we then fit a two-regime VB, which converges after 83 iterations in about 14.241 s. Table 5 gives the
marginal distributions for all of the parameters. Figure 4 presents the corresponding density functions,
where we can see that all of the plots show a symmetric and bell-shaped pattern.
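The means and standard deviations attached to the marginal distributions in Table 5 follow from textbook moment formulas. The following is a minimal sketch, assuming the shape-scale parameterization $IG(\alpha, \beta)$ with mean $\beta / (\alpha - 1)$; the function names are illustrative. Because the parameters in Table 5 are printed to two decimals, the results only approximately reproduce the listed values, and the inverse-gamma means in particular are sensitive to the rounding of the second parameter.

```python
import math


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def inverse_gamma_mean_sd(alpha, beta):
    """Mean (needs alpha > 1) and standard deviation (needs alpha > 2) of IG(alpha, beta)."""
    mean = beta / (alpha - 1)
    sd = mean / math.sqrt(alpha - 2)
    return mean, sd


# Marginal of p_{1,2} in Table 5: Beta(15.21, 434.78), listed mean 0.0338 and s.d. 0.00851.
p12_mean, p12_sd = beta_mean_sd(15.21, 434.78)
# Marginal of sigma_1^2 in Table 5: IG(227.30, 0.28), listed mean 0.00122 and s.d. 0.00008.
s1_mean, s1_sd = inverse_gamma_mean_sd(227.30, 0.28)
print(p12_mean, p12_sd, s1_mean, s1_sd)
```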

Table 4. Estimations of the number of regimes for TSX data.

January 1956–December 1999

R. M. M.: $\begin{pmatrix} 0.11496 & 0.00005 & 0.00005 & 0.02803 \\ 0.00005 & 0.00005 & 0.00005 & 0.00005 \\ 0.00005 & 0.00005 & 0.00005 & 0.00005 \\ 0.02853 & 0.00005 & 0.00005 & 0.82791 \end{pmatrix}$

Table 5. The marginal distributions of the parameters estimated by VB.

| Parameter | Distribution | Mean | s.d. |
|---|---|---|---|
| $\mu_1$ | $t_{454.61}(0.0123, 370778.19)$ | 0.0123 | 0.00165 |
| $\sigma_1^2$ | $IG(227.30, 0.28)$ | 0.00122 (0.0349) | 0.00008 |
| $\mu_2$ | $t_{80.39}(-0.0161, 12987.55)$ | −0.0161 | 0.00889 |
| $\sigma_2^2$ | $IG(40.20, 0.24)$ | 0.00603 (0.0777) | 0.00098 |
| $p_{1,2}$ | $\mathrm{Beta}(15.21, 434.78)$ | 0.0338 | 0.00851 |
| $p_{2,1}$ | $\mathrm{Beta}(15.00, 61.21)$ | 0.1969 | 0.04525 |

Transition Probability: $\begin{pmatrix} 0.9662 & 0.0338 \\ 0.1969 & 0.8031 \end{pmatrix}$

Figure 4. The VB marginal distributions of the parameters. (a) $\mu_2$ (left) and $\mu_1$ (right); (b) $\sigma_1^2$ (left) and
$\sigma_2^2$ (right); (c) $p_{1,2}$ (left) and $p_{2,1}$ (right).

Table 6 gives the maximum likelihood estimates (cited from [19]), mean
parameters computed by the MCMC method (cited from [21]) and mean parameters computed
by VB. It clearly shows that the point estimates by VB are very close to those by MLE and MCMC.
The numbers in parentheses in Table 6 are the standard deviations computed by the three methods,
respectively. It is worth noting that all of the standard deviations estimated by VB are smaller than those by the
MLE or MCMC methods. In fact, other researchers also report the underestimation of posterior
variance in other VB applications, for example [31,32]. In the paper [1], we look at some diagnostic
methods that can assess how well the VB approximates the true posterior, particularly with regard to
its covariance structure. The methods proposed also allow us to generate simple corrections when the
approximation error is large.

Table 6. Estimates and standard deviations by VB, MLE and MCMC.

| | $\mu_1$ | $\sigma_1$ | $p_{1,2}$ | $\mu_2$ | $\sigma_2$ | $p_{2,1}$ |
|---|---|---|---|---|---|---|
| VB | 0.0123 (0.00165) | 0.0349 (0.00008) | 0.0338 (0.00851) | −0.0161 (0.00889) | 0.0777 (0.00098) | 0.1969 (0.04525) |
| MLE | 0.0123 (0.002) | 0.0347 (0.001) | 0.0371 (0.012) | −0.0157 (0.010) | 0.0778 (0.009) | 0.2101 (0.086) |
| MCMC | 0.0122 (0.002) | 0.0351 (0.002) | 0.0334 (0.012) | −0.0164 (0.010) | 0.0804 (0.009) | 0.2058 (0.065) |

5. Conclusions

Variational Bayes can be thought of in terms of information geometry as a projection-based
approximation technique; it provides a framework to approximate posteriors. We applied this method
to the regime-switching log-normal model and provided solutions to account for both model uncertainty
and parameter uncertainty. The numerical results show that our method can recover exactly the
number of regimes and gives reasonable point estimates. The VB method is also demonstrated to be
very computationally efficient.
The application to the TSX monthly total return index data in the period from January 1956 to
December 1999 confirms similar results in the literature on the number of regimes.

Author Contributions

The article was written by Hui Zhao under the guidance of Paul Marriott. All authors have read
and approved the final manuscript.


Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhao, H.; Marriott, P. Diagnostics for variational bayes approximations. 2013, arXiv:1309.5117.
2. Amari, S.-I. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1990.
3. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803.
4. Kass, R.; Vos, P. Geometrical Foundations of Asymptotic Inference; Wiley: New York, NY, USA, 1997.
5. Hinton, G.E.; van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th ACM Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; ACM: New York, NY, USA, 1993.
6. MacKay, D. Developments in Probabilistic Modelling with Neural Networks – Ensemble Learning. In Neural Networks: Artificial Intelligence and Industrial Applications; Springer: London, UK, 1995; pp. 191–198.
7. Attias, H. Independent Factor Analysis. Neur. Comput. 1999, 11, 803–851.
8. Lappalainen, H. Ensemble Learning For Independent Component Analysis. In Proceedings of the First International Workshop on Independent Component Analysis, Aussois, France, 11–15 January 1999; pp. 7–12.
9. Beal, M.; Ghahramani, Z. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. Bayesian Stat. 2003, 7, 453–463.
10. Winn, J. Variational Message Passing and its Applications. Ph.D. Thesis, Department of Physics, University of Cambridge, Cambridge, UK, 2003.
11. Blei, D.M.; Ng, A.Y.; Jordan, M.I.; Lafferty, J. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
12. Ghahramani, Z.; Beal, M.J. A Variational Inference for Bayesian Mixtures of Factor Analysers. Adv. Neur. Inf. Process. Syst. 2000, 12, 449–455.
13. Haff, L.R. The Variational Form of Certain Bayes Estimators. Ann. Stat. 1991, 19, 1163–1190.
14. Faes, C.; Ormerod, J.T.; Wand, M.P. Variational Bayesian Inference for Parametric and Nonparametric Regression With Missing Data. J. Am. Stat. Assoc. 2011, 106, 959–971.
15. McGrory, C.; Titterington, D.; Reeves, R.; Pettitt, A.N. Variational Bayes for estimating the parameters of a hidden Potts model. Stat. Comput. 2009, 19, 329–340.
16. Ormerod, J.T.; Wand, M.P. Gaussian Variational Approximate Inference for Generalized Linear Mixed Models. J. Comput. Graph. Stat. 2011, 21, 1–16.
17. Hall, P.; Humphreys, K.; Titterington, D.M. On the Adequacy of Variational Lower Bound Functions for Likelihood-Based Inference in Markovian Models with Missing Values. J. R. Stat. Soc. Ser. B 2002, 64, 549–564.
18. Wang, B.; Titterington, M. Convergence Properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Anal. 2006, 1, 625–650.
19. Hardy, M.R. A Regime-Switching Model of Long-Term Stock Returns. N. Am. Actuar. J. 2001, 5, 41–53.
20. Hamilton, J.D. A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica 1989, 57, 357–384.
21. Hardy, M.R. Bayesian Risk Management for Equity-Linked Insurance. Scand. Actuar. J. 2002, 2002, 185–211.
22. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
23. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
24. Hartman, B.M.; Heaton, M.J. Accounting for regime and parameter uncertainty in regime-switching models. Insur. Math. Econ. 2011, 49, 429–437.
25. Green, P.J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 1995, 82, 711–732.
26. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: Cambridge, UK, 2009.
27. Brooks, S.P. Markov Chain Monte Carlo Method and Its Application. J. R. Stat. Soc. Ser. D 1998, 47, 69–100.
28. Ghahramani, Z.; Hinton, G.E. Variational learning for switching state-space models. Neur. Comput. 1998, 12, 831–864.
29. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
30. Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 1970, 41, 164–171.
31. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B 2009, 71, 319–392.
32. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
On Clustering Histograms with k-Means by Using
Mixed α-Divergences

Frank Nielsen 1,2,*, Richard Nock 3 and Shun-ichi Amari 4

1 Sony Computer Science Laboratories, Inc, Tokyo 141-0022, Japan
2 École Polytechnique, 91128 Palaiseau Cedex, France
3 NICTA and The Australian National University, Locked Bag 9013, Alexandria NSW 1435, Australia
4 RIKEN Brain Science Institute, 2-1 Hirosawa Wako City, Saitama 351-0198, Japan; E-Mail: amari@brain.riken.jp
* E-Mail: Frank.Nielsen@acm.org; Tel.: +81-3-5448-4380.

Received: 15 May 2014; in revised form: 10 June 2014 / Accepted: 13 June 2014 /
Published: 17 June 2014

Abstract: Clustering sets of histograms has become popular thanks to the success of the generic
method of bag-of-X used in text categorization and in visual categorization applications. In this paper,
we investigate the use of a parametric family of distortion measures, called the α-divergences, for
clustering histograms. Since it usually makes sense to deal with symmetric divergences in information
retrieval systems, we symmetrize the α-divergences using the concept of mixed divergences. First,
we present a novel extension of k-means clustering to mixed divergences. Second, we extend the
k-means++ seeding to mixed α-divergences and report a guaranteed probabilistic bound. Finally, we
describe a soft clustering technique for mixed α-divergences.

Keywords: bag-of-X; α-divergence; Jeffreys divergence; centroid; k-means clustering; k-means seeding

1. Introduction: Motivation and Background

1.1. Clustering Histograms in the Bag-of-Word Modeling Paradigm

A common task of information retrieval (IR) systems is to classify documents into categories.
Given a training set of documents labeled with categories, one asks to classify new incoming documents.
Text categorisation [1,2] proceeds by first defining a dictionary of words from a corpus. It then
models each document by a word count, yielding a word distribution histogram per document (see
the University of California, Irvine (UCI) machine learning repository for such data-sets [3]). The
importance of the words in the dictionary can be weighted by the term frequency-inverse document
frequency [2] (tf-idf), which takes into account both the frequency of the words in a given document
and the frequency of the words in all documents: namely, the tf-idf weight for a given word in a
given document is the product of the frequency of that word in the document and the logarithm of
the ratio of the number of documents to the document frequency of the word [2]. Defining a
proper distance between histograms allows one to:

• Classify a new on-line document: we first calculate its word distribution histogram signature and
seek the labeled document that has the most similar histogram to deduce its category tag.
• Find the initial set of categories: we cluster all document histograms and assign a category
per cluster.
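The tf-idf weighting described above can be sketched as follows. This is a minimal example under stated assumptions: documents are given as lists of tokens, the term frequency is taken as the relative frequency of the word within the document, and the function name `tf_idf` and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each word occurs at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [["a", "b", "a"], ["a", "c"]]
weights = tf_idf(docs)
print(weights[0])   # "a" occurs in every document, so its weight is zero
```

A word that appears in all documents receives a zero weight, which reflects the intent of the weighting: such a word does not help to discriminate between documents.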

This text classification method based on the representation of the bag-of-words (BoWs) has also
been instrumental in computer vision for efficient object categorization [4] and recognition in natural
images [5]. This paradigm is called bag-of-features [6] (BoFs) in the general case. It first requires one
to create a dictionary of “visual words” by quantizing keypoints (e.g., affine invariant descriptors of
image patches) of the training database. Quantization is performed using the k-means [7–9] algorithm

Entropy 2014, 16, 3273–3301; doi:10.3390/e16063273
www.mdpi.com/journal/entropy

that partitions n data X = {x1, ..., xn} into k pairwise disjoint clusters C1, ..., Ck, where each data
element belongs to the closest cluster center (i.e., the cluster prototype). From a given initialization,
batched k-means first assigns data points to their closest centers and then updates the cluster centers
and reiterates this process until convergence is met to a local minimum (not necessarily the global
minimum) after a provably finite number of steps. Csurka et al. [4] used the squared Euclidean
distance for building the visual vocabulary. Depending on the chosen features, other distances
have proven useful. For example, the symmetrized Kullback–Leibler (KL) divergence was shown to
perform experimentally better than the Euclidean or squared Euclidean distances for a compressed
histogram of gradient descriptors [10] (CHoGs), even though it is not a metric distance, since it fails to
satisfy the triangle inequality. To summarize, k-means histogram clustering with respect to the
symmetrized KL (called Jeffreys divergence J) can be used to quantize both visual words and document
categories. Nowadays, the seminal bag-of-word method has been generalized fruitfully to various
settings using the generic bag-of-X paradigm, like the bag-of-textons [6], the bag-of-readers [11], etc.
Bag-of-X represents each datum (e.g., document, image, etc.) as a histogram of codeword count indices.
Furthermore, the semantic space [12] paradigm has been recently explored to overcome two drawbacks
of the bag-of-X paradigms: the high-dimensionality of the histograms (number of bins) and difficult
human interpretation of the codewords due to the lack of semantic information. In semantic space,
modeling relies on semantic multinomials that are discrete frequency histograms; see [12].
In summary, clustering histograms with respect to symmetric distances (like the symmetrized KL
divergence) is playing an increasing role. It turns out that the symmetrized KL divergence belongs to a
1-parameter family of divergences, called symmetrized α-divergences, or Jeffreys α-divergence [13].
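As a concrete illustration of the symmetrized KL divergence on frequency histograms, the following minimal sketch computes it for three two-bin histograms and exhibits a violation of the triangle inequality. The function names and the histograms are illustrative, and the Jeffreys divergence is taken here as the plain sum $KL(p : q) + KL(q : p)$; conventions differ by a constant factor of one half.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p : q) between positive frequency histograms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def jeffreys(p, q):
    """Symmetrized KL: KL(p : q) + KL(q : p) = sum_i (p_i - q_i) log(p_i / q_i)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


p = [0.9, 0.1]
q = [0.5, 0.5]
r = [0.1, 0.9]

print(jeffreys(p, r))                    # direct dissimilarity between p and r
print(jeffreys(p, q) + jeffreys(q, r))   # going through q is "shorter"
```

The direct value exceeds the sum of the two intermediate values, so the Jeffreys divergence is symmetric but not a metric.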

1.2. Contributions

Since divergences D(p : q) are usually asymmetric distortion measures between two objects
p and q, one often has to consider two kinds of centroids obtained by carrying out the minimization
process either on the left argument or on the right argument of the divergences; see [14]. In theory,
it is enough to consider only one type of centroid, say the right centroid, since the left centroid with
respect to a divergence D(p : q) is equivalent to the right centroid with respect to the mirror divergence
D′(p : q) = D(q : p).
In this paper, we consider mixed divergences [15] that allow one to handle in a unified way the
arithmetic symmetrization $S(p, q) = \frac{1}{2}(D(p : q) + D(q : p))$ of a given divergence $D(p : q)$ with both
the sided divergences: $D(p : q)$ and its mirror divergence $D'(p : q)$. The mixed α-divergence is the
mixed divergence obtained for the α-divergence. We term α-clustering the clustering with respect
to α-divergences and mixed α-clustering the clustering w.r.t. mixed α-divergences [16]. Our main
contributions are to extend the celebrated batched k-means [7–9] algorithm to mixed divergences
by associating two dual centroids per cluster and to generalize the probabilistically guaranteed
good seeding of k-means++ [17] to mixed α-divergences. The mixed α-seedings provide guaranteed
probabilistic clustering bounds by picking up seeds from the data and do not require explicitly
computing centroids. This yields a fast clustering technique in practice, even when cluster
centers are not available in closed form. We also consider clustering histograms by explicitly building
the symmetrized α-centroids and end up with a variational k-means when the centroids are not
available in closed form. Finally, we investigate soft mixed α-clustering and discuss topics related to
α-clustering. Note that clustering with respect to non-symmetrized α-divergences has been recently
investigated independently in [18] and proven useful in several applications.

1.3. Outline of the Paper

The paper is organized as follows: Section 2 introduces the notion of mixed divergences, presents
an extension of k-means to mixed divergences and recalls some properties of α-divergences. Section 3
describes the α-seeding techniques and reports a probabilistically-guaranteed bound on the clustering
quality. Section 4 investigates the various sided/symmetrized/mixed calculations of the α-centroids.


Section 5 presents the soft α-clustering with respect to α-mixed divergences. Finally, Section 6
summarises the contributions, discusses related topics and hints at further perspectives. The paper is
followed by two appendices. Appendix B studies several properties of α-divergences that are used to
derive the guaranteed probabilistic performance of the α-seeding. Appendix C proves that α-sided
centroids are quasi-arithmetic means for the power generator functions.

2. Mixed Centroid-Based k-Means Clustering

2.1. Divergences, Centroids and k-Means

Consider a set $H$ of $n$ histograms $h_1, ..., h_n$, each with $d$ bins, with all positive real-valued bins:
$h_j^i > 0$, $\forall\, 1 \leq i \leq d$, $1 \leq j \leq n$. A histogram $h$ is called a frequency histogram when its bins sum up
to one: $w(h) = w_h = \sum_i h^i = 1$. Otherwise, it is called a positive histogram that can eventually be
normalized to a frequency histogram:

$$\tilde{h} \doteq \frac{h}{w(h)}. \tag{1}$$

The frequency histograms belong to the $(d-1)$-dimensional open probability simplex $\Delta_d$:

$$\Delta_d \doteq \left\{ (x^1, ..., x^d) \in \mathbb{R}^d \;\middle|\; \forall i, \; x^i > 0, \text{ and } \sum_{i=1}^{d} x^i = 1 \right\}. \tag{2}$$

That is, although frequency histograms have $d$ bins, the constraint that those bin values should
sum up to one yields $d-1$ degrees of freedom. In probability theory, frequency or counting
histograms model either discrete multinomial probabilities or discrete positive measures (also called
positive arrays [19]).
The celebrated k-means clustering [8,9] is one of the most famous clustering techniques and has
been generalized in many ways [20,21]. In information geometry [22], a divergence $D(p : q)$ is a
smooth $C^3$ differentiable dissimilarity measure that is not necessarily symmetric ($D(p : q) \neq D(q : p)$,
hence the notation “:” instead of the classical “,” reserved for metric distances), but is non-negative and
satisfies the separability property: $D(p : q) = 0$ iff $p = q$. More precisely, let $\partial_i D(x : y) = \frac{\partial}{\partial x^i} D(x : y)$
and $\partial_{,i} D(x : y) = \frac{\partial}{\partial y^i} D(x : y)$. Then, we require $\partial_i D(x : x) = \partial_{,i} D(x : x) = 0$ and $-\partial_i \partial_{,j} D(x : y)$ positive
definite for defining a divergence. For a distance function $D(\cdot : \cdot)$, we denote by $D(x : H)$ the weighted
average distance of $x$ to a set of weighted histograms:

$$D(x : H) \doteq \sum_{j=1}^{n} w_j D(x : h_j). \tag{3}$$

An important class of divergences on frequency histograms is the f-divergences [23–25] defined for a
convex generator f (with f (1) = f ′(1) = 0 and f ′′(1) = 1):

$$I_f(p : q) := \sum_{i=1}^{d} q^i f\!\left(\frac{p^i}{q^i}\right).$$

Those divergences preserve information monotonicity [19] under any arbitrary transition probability
(Markov morphisms). f-divergences can be extended to positive arrays [19].
The k-means algorithm on a set of weighted histograms can be tailored to any divergence as
follows: First, we initialize the k cluster centers $C = \{c_1, ..., c_k\}$ (say, by picking k distinct seeds at random). Then, we iteratively repeat the following two steps until convergence:

• Assignment: Assign each histogram $h_j$ to its closest cluster center:

$$l(h_j) := \arg\min_{l=1}^{k} D(h_j : c_l).$$

This yields a partition of the histogram set $H = \cup_{l=1}^{k} A_l$, where $A_l$ denotes the set of histograms of the l-th cluster: $A_l = \{h_j \mid l(h_j) = l\}$.
• Center relocation: Update the cluster centers by taking their centroids:

$$c_l := \arg\min_{x} \sum_{h_j \in A_l} w_j D(h_j : x).$$

Throughout this paper, centroid shall be understood in the broader sense of a barycenter when
weights are non-uniform.

2.2. Mixed Divergences and Mixed k-Means Clustering

Since divergences are potentially asymmetric, we can define two-sided k-means or always consider
a right-sided k-means, but then define another sided divergence D′(p : q) = D(q : p). We can
also consider the symmetrized k-means with respect to the symmetrized divergence: S(p, q) =
D(p : q) + D(q : p). Eventually, we may skew the symmetrization with a parameter λ ∈ [0, 1]:
Sλ(p, q) = λD(p : q) + (1 − λ)D(q : p) (and consider other averaging schemes instead of the arithmetic
mean).
In order to handle those sided and symmetrized k-means under the same framework, let us
introduce the notion of mixed divergences [15] as follows:

Definition 1 (Mixed divergence).

$$M_\lambda(p : q : r) := \lambda D(p : q) + (1 - \lambda) D(q : r), \tag{4}$$

for λ ∈ [0, 1].

A mixed divergence includes the sided divergences for λ ∈ {0, 1} and the symmetrized (arithmetic mean) divergence for λ = 1/2.
We generalize k-means clustering to mixed k-means clustering [15] by considering two centers
per cluster (for the special cases of λ = 0, 1, it is enough to consider only one). Algorithm 1 sketches
the generic mixed k-means algorithm. Note that a simple initialization consists of choosing randomly
the k distinct seeds from the dataset with li = ri.

Algorithm 1: Mixed divergence-based k-means clustering.

Input: Weighted histogram set H, divergence D(· : ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds $C = \{(l_i, r_i)\}_{i=1}^{k}$;
repeat
    // Assignment
    for i = 1, 2, ..., k do
        $C_i \leftarrow \{h \in H : i = \arg\min_j M_\lambda(l_j : h : r_j)\}$;
    // Dual-sided centroid relocation
    for i = 1, 2, ..., k do
        $r_i \leftarrow \arg\min_x D(C_i : x) = \sum_{h \in C_i} w_h D(h : x)$;
        $l_i \leftarrow \arg\min_x D(x : C_i) = \sum_{h \in C_i} w_h D(x : h)$;
until convergence;
Output: Partition of H into k clusters following C;

Notice that the mixed k-means clustering is different from the k-means clustering with respect to
the symmetrized divergences Sλ that considers only one centroid per cluster.

2.3. Sided, Symmetrized and Mixed α-Divergences

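For α ≠ ±1, we define the family of α-divergences [26] on positive arrays [27] as:

$$D_\alpha(p : q) := \sum_{i=1}^{d} \frac{4}{1 - \alpha^2} \left( \frac{1 - \alpha}{2} p^i + \frac{1 + \alpha}{2} q^i - (p^i)^{\frac{1-\alpha}{2}} (q^i)^{\frac{1+\alpha}{2}} \right) = D_{-\alpha}(q : p), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\}, \tag{5}$$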

with the limit cases D−1(p : q) = KL(p : q) and D1(p : q) = KL(q : p), where KL is the extended
Kullback–Leibler divergence:

$$KL(p : q) := \sum_{i=1}^{d} p^i \log \frac{p^i}{q^i} + q^i - p^i. \tag{6}$$

Divergence $D_0$ is the squared Hellinger symmetric distance (scaled by a multiplicative factor of four) extended to positive arrays:

$$D_0(p : q) = 2 \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx = 4 H^2(p, q), \tag{7}$$

with the Hellinger distance:

$$H(p, q) = \sqrt{ \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx }. \tag{8}$$

Note that α-divergences are defined for the full range of α values: α ∈ R.
Observe that
α-divergences of Equation (5) are homogeneous of degree one: Dα(λp : λq) = λDα(p : q) for
λ > 0.
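The definitions of Equations (5) and (6) can be checked numerically. The following is a minimal sketch in plain Python, assuming histograms are stored as lists of strictly positive floats; the function names `kl_ext` and `alpha_div` are illustrative choices, not names from the paper.

```python
import math

def kl_ext(p, q):
    """Extended Kullback-Leibler divergence of Equation (6) on positive arrays."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

def alpha_div(p, q, alpha):
    """alpha-divergence of Equation (5), with the KL limit cases at alpha = -1 and +1."""
    if alpha == -1:
        return kl_ext(p, q)
    if alpha == 1:
        return kl_ext(q, p)
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total
```

With this sketch, the reference duality $D_\alpha(p : q) = D_{-\alpha}(q : p)$ and the homogeneity of degree one can be verified on arbitrary positive arrays, and values of α close to −1 approach `kl_ext(p, q)`.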
When histograms p and q are both frequency histograms, we have:

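$$D_\alpha(\tilde{p} : \tilde{q}) = \frac{4}{1 - \alpha^2} \left( 1 - \sum_{i=1}^{d} (\tilde{p}^i)^{\frac{1-\alpha}{2}} (\tilde{q}^i)^{\frac{1+\alpha}{2}} \right) = D_{-\alpha}(\tilde{q} : \tilde{p}), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\}, \tag{9}$$

and the extended Kullback–Leibler divergence reduces to the traditional Kullback–Leibler divergence: $KL(\tilde{p} : \tilde{q}) = \sum_{i=1}^{d} \tilde{p}^i \log \frac{\tilde{p}^i}{\tilde{q}^i}$.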
The Kullback–Leibler divergence between frequency histograms ˜p and ˜q (α = ±1) is interpreted
as the cross-entropy minus the Shannon entropy:

$$KL(\tilde{p} : \tilde{q}) := H^{\times}(\tilde{p} : \tilde{q}) - H(\tilde{p}).$$

Often, ˜p denotes the true model (hidden by nature), and ˜q is the estimated model from observations.
However, in information retrieval, both ˜p and ˜q play the same symmetrical role, and we prefer to deal
with a symmetric divergence.
The Pearson and Neyman χ2 distances are obtained for α = −3 and α = 3, respectively:

$$D_3(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{p}^i}, \tag{10}$$

$$D_{-3}(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{q}^i}. \tag{11}$$

The α-divergences belong to the class of Csiszár f-divergences with the following generator:

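$$f(t) = \begin{cases} \frac{4}{1-\alpha^2} \left( 1 - t^{(1+\alpha)/2} \right), & \text{if } \alpha \neq \pm 1, \\ t \ln t, & \text{if } \alpha = 1, \\ -\ln t, & \text{if } \alpha = -1. \end{cases} \tag{12}$$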

Remark 1. Historically, the α-divergences were introduced by Chernoff [28,29] in the context of hypothesis testing. In Bayesian binary hypothesis testing, we are asked to decide whether an observation belongs to one class or the other, based on priors $w_1$ and $w_2$ and class-conditional probabilities $p_1$ and $p_2$. The average expected error of the best decision rule, the maximum a posteriori (MAP) rule, is called the probability of error, denoted by $P_e$. When prior probabilities are identical ($w_1 = w_2 = \frac{1}{2}$), we have $P_e(p_1, p_2) = \frac{1}{2} \int \min(p_1(x), p_2(x)) dx$. Let $S(p, q) = \int \min(p(x), q(x)) dx$ denote the intersection similarity measure, with 0 < S ≤ 1 (generalizing the histogram intersection distance often used in computer vision [30]). S is bounded by the α-Chernoff affinity coefficient:

$$S(p, q) \leq C_\beta(p, q) = \int p^{\beta}(x) q^{1-\beta}(x) dx,$$

for all β ∈ [0, 1]. We can convert the affinity coefficient $0 < C_\beta \leq 1$ into a divergence $D_\beta$ by simply taking $D_\beta = 1 - C_\beta$. Since the absolute value of divergences does not matter, we can rescale the divergence appropriately. One nice rescaling multiplies by $\frac{1}{\beta(1-\beta)}$: $D_\beta = \frac{1}{\beta(1-\beta)} (1 - C_\beta)$. This makes the parameterized divergence coincide with the fundamental Kullback–Leibler divergence for the limit values β ∈ {0, 1}. Last, choosing $\beta = \frac{1-\alpha}{2}$ yields the well-known expression of the α-divergences.

Interestingly, the α-divergences can be interpreted as a generalized α-Kullback–Leibler
divergence [26] with deformed logarithms.
Next, we introduce the mixed α-divergence of a histogram x to two histograms p and q as follows:

Definition 2 (Mixed α-divergence). The mixed α-divergence of a histogram x to two histograms p and q is
defined by:

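$$\begin{aligned} M_{\lambda,\alpha}(p : x : q) &= \lambda D_\alpha(p : x) + (1 - \lambda) D_\alpha(x : q), \\ &= \lambda D_{-\alpha}(x : p) + (1 - \lambda) D_{-\alpha}(q : x), \\ &= M_{1-\lambda,-\alpha}(q : x : p). \end{aligned} \tag{13}$$

The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

$$S_\alpha(p, q) = M_{\frac{1}{2},\alpha}(q : p : q) = M_{\frac{1}{2},\alpha}(p : q : p).$$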

The skew symmetrized α-divergence is defined by:

Sλ,α(p : q) = λDα(p : q) + (1 − λ)Dα(q : p).
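The identities of Equation (13) lend themselves to a quick numerical check. The sketch below is an illustration under stated assumptions: histograms are lists of positive floats, α ≠ ±1, and the names `mixed_alpha_div` and `skew_symmetrized` are hypothetical helpers introduced here.

```python
def alpha_div(p, q, alpha):
    """alpha-divergence of Equation (5); this sketch assumes alpha is not +1 or -1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total

def mixed_alpha_div(p, x, q, lam, alpha):
    """Mixed alpha-divergence M_{lam,alpha}(p : x : q) of Definition 2."""
    return lam * alpha_div(p, x, alpha) + (1 - lam) * alpha_div(x, q, alpha)

def skew_symmetrized(p, q, lam, alpha):
    """Skew symmetrized alpha-divergence S_{lam,alpha}(p : q)."""
    return lam * alpha_div(p, q, alpha) + (1 - lam) * alpha_div(q, p, alpha)
```

Swapping the two outer arguments while replacing (λ, α) by (1 − λ, −α) should leave the value unchanged, and λ = 1/2 with identical outer arguments recovers the α-Jeffreys symmetrization.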

2.4. Notations and Hard/Soft Clusterings

Throughout the paper, superscript index i denotes the histogram bin numbers and subscript
index j the histogram numbers. Index l is used to iterate on the clusters. The left-sided, right-sided
and symmetrized histogram positive and frequency α-centroids are denoted by lα, rα, sα and ˜lα, ˜rα, ˜sα,
respectively.
In this paper, we investigate the following kinds of clusterings for sets of histograms:

Hard clustering. Each histogram belongs to exactly one cluster:

• k-means with respect to mixed divergences Mλ,α.
• k-means with respect to symmetrized divergences Sλ,α.

• Randomized seeding for mixed/symmetrized k-means by extending k-means++ with
guaranteed probabilistic bounds for α-divergences.

Soft clustering. Each histogram belongs to all clusters according to some weight distribution:
the soft mixed α-clustering.

3. Coupled k-Means++ α-Seeding

It is well known that the Lloyd k-means clustering algorithm monotonically decreases the loss
function and stops after a finite number of iterations at a local optimum. Optimizing globally
the k-means loss is NP-hard [17] when d > 1 and k > 1. In practice, the performance of the
k-means algorithm heavily relies on the initialization. A breakthrough was obtained by the k-means++
seeding [17], which guarantees in expectation a good starting partition. We extend this scheme to
the coupled α-clustering. However, although k-means++ has proven popular and is
often used in practice with very good results, it has recently been pointed out that “worst case”
configurations exist, even in small dimensions, on which the algorithm cannot significantly beat
its expected approximability with high probability [31]. Still, the expected approximability ratio,
roughly in O(log(k)), is very good, as long as the number of clusters is not too large.

Algorithm 2: Mixed α-seeding; MAS(H, k, λ, α)

Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
Let C ← {($h_j$, $h_j$)}, with $h_j$ drawn from H with uniform probability;
for i = 2, 3, ..., k do
    Pick at random histogram h ∈ H with probability:

    $$\pi_H(h) := \frac{w_h M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y M_{\lambda,\alpha}(c_y : y : c_y)}, \tag{14}$$

    // where $(c_h, c_h) := \arg\min_{(z,z) \in C} M_{\lambda,\alpha}(z : h : z)$;
    C ← C ∪ {(h, h)};
Output: Set of initial cluster centers C;

Algorithm 2 provides our adaptation of k-means++ seeding [15,17]. It works for all three of our
sided/symmetrized and mixed clustering settings:

• Pick λ = 1 for the left-sided centroid initialization,
• Pick λ = 0 for the right-sided centroid initialization (a left-sided initialization for −α),
• with arbitrary λ, for the λ-Jα (skew Jeffreys) centroids or mixed λ centroids. Indeed, the
initialization is the same (see the MAS procedure in Algorithm 2).
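The seeding procedure can be sketched as follows. This is a minimal illustration of the sampling rule of Equation (14), not the authors' implementation: it assumes histograms are lists of positive floats, α ≠ ±1, and uses Python's `random.choices` for the biased draw; the function name `mixed_alpha_seeding` and the clamping of round-off negatives to zero are choices made here.

```python
import random

def alpha_div(p, q, alpha):
    """alpha-divergence of Equation (5); this sketch assumes alpha is not +1 or -1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total

def mixed_alpha_div(l, h, r, lam, alpha):
    """Mixed alpha-divergence M_{lam,alpha}(l : h : r)."""
    return lam * alpha_div(l, h, alpha) + (1 - lam) * alpha_div(h, r, alpha)

def mixed_alpha_seeding(H, k, lam, alpha, weights=None, rng=None):
    """Pick k seed pairs (h, h) from H following the distribution of Equation (14)."""
    rng = rng or random.Random(0)
    w = weights or [1.0] * len(H)
    first = rng.randrange(len(H))       # first seed: uniform draw
    centers = [(H[first], H[first])]    # left and right seeds coincide
    while len(centers) < k:
        # Weighted cost of each histogram to its closest current seed;
        # tiny negative round-off values are clamped to zero.
        costs = [w[j] * max(0.0, min(mixed_alpha_div(c, h, c, lam, alpha)
                                     for c, _ in centers))
                 for j, h in enumerate(H)]
        j = rng.choices(range(len(H)), weights=costs, k=1)[0]
        centers.append((H[j], H[j]))
    return centers
```

Since an already chosen seed has zero cost, it receives zero sampling weight, so the k seeds are distinct whenever H contains at least k distinct histograms.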

Our proof follows and generalizes the proof described for the case of mixed Bregman seeding [15]
(Lemma 2). In fact, our proof is more precise, as it quantifies the expected potential with respect to the
optimum only, whereas in [15], the optimal potential is averaged with a dual optimal potential, which
depends on the optimal centers, but may be larger than the optimum sought.

Theorem 1. Let $C_{\lambda,\alpha}$ denote for short the cost function related to the clustering type chosen (left-, right-, skew Jeffreys or mixed) in MAS and $C^{opt}_{\lambda,\alpha}$ denote the optimal related clustering in k clusters, for λ ∈ [0, 1] and α ∈ (−1, 1). Then, on average, with respect to distribution (14), the initial clustering of MAS satisfies:

$$E_\pi[C_{\lambda,\alpha}] \leq 4 \begin{cases} f(\lambda) g(k) h^2(\alpha) C^{opt}_{\lambda,\alpha} & \text{if } \lambda \in (0, 1), \\ g(k) z(\alpha) h^4(\alpha) C^{opt}_{\lambda,\alpha} & \text{otherwise.} \end{cases} \tag{15}$$

Here, $f(\lambda) = \max\left\{ \frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda} \right\}$, $g(k) = 2(2 + \log k)$, $z(\alpha) = \left( \frac{1+|\alpha|}{1-|\alpha|} \right)^{\frac{8|\alpha|^2}{(1-|\alpha|)^2}}$, $h(\alpha) = \max_i p_i^{|\alpha|} / \min_i p_i^{|\alpha|}$;
the min is defined on strictly positive coordinates, and π denotes the picking distribution of Algorithm 2.

Remark 2. The bound is particularly good when λ is close to 1/2, and in particular for the α-Jeffreys clustering,
as in these cases, the only additional penalty compared to the Euclidean case [17] is h2(α), a penalty that relies
on an optimal triangle inequality for α-divergences that we provide in Lemma A6 below.

Remark 3. This guaranteed initialization is particularly useful for α-Jeffreys clustering, as there is no closed
form solution for the centroids (except when α = ±1, see [32]).

Algorithm 3: Mixed α-hard clustering: MAhC(H, k, λ, α)

Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R;
Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
repeat
    // Assignment
    for i = 1, 2, ..., k do
        $A_i \leftarrow \{h \in H : i = \arg\min_j M_{\lambda,\alpha}(l_j : h : r_j)\}$;
    // Centroid relocation
    for i = 1, 2, ..., k do
        $r_i \leftarrow \left( \sum_{h \in A_i} w_h h^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}}$;
        $l_i \leftarrow \left( \sum_{h \in A_i} w_h h^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}}$;
until convergence;
Output: Partition of H in k clusters following C;

Algorithm 3 presents the general hard mixed k-means clustering, which can be adapted also to
left- (λ = 1) and right- (λ = 0) α-clustering.
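A compact sketch of the hard clustering loop is given below, under simplifying assumptions that are ours: uniform weights normalised within each cluster, α ≠ ±1, histograms stored as lists of positive floats, and an optional `init` argument (hypothetical) that lets the caller supply seed pairs, for instance those returned by a seeding routine; otherwise seeds are sampled uniformly.

```python
import random

def alpha_div(p, q, alpha):
    """alpha-divergence of Equation (5); this sketch assumes alpha is not +1 or -1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total

def mixed_alpha_div(l, h, r, lam, alpha):
    return lam * alpha_div(l, h, alpha) + (1 - lam) * alpha_div(h, r, alpha)

def power_mean(cluster, e):
    """Coordinate-wise power mean with exponent e != 0 and uniform weights."""
    n, d = len(cluster), len(cluster[0])
    return [(sum(h[i] ** e for h in cluster) / n) ** (1 / e) for i in range(d)]

def mixed_alpha_hard_clustering(H, k, lam, alpha, init=None, iters=100, seed=0):
    """Alternate assignment and dual centroid relocation until labels stabilise."""
    rng = random.Random(seed)
    centers = list(init) if init else [(h, h) for h in rng.sample(H, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda i: mixed_alpha_div(
                centers[i][0], h, centers[i][1], lam, alpha))
            for h in H
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [h for h, lab in zip(H, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centers
                left = power_mean(members, (1 + alpha) / 2)
                right = power_mean(members, (1 - alpha) / 2)
                centers[i] = (left, right)
    return labels, centers
```

Keeping the previous centers for an empty cluster is one possible design choice; re-seeding such a cluster would be another.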
For skew Jeffreys centers, since the centroids are not available in closed form [32], we adopt a
variational approach of k-means by updating iteratively the centroid in each cluster (thus improving
the overall loss function without computing the optimal centroids that would eventually require
infinitely many iterations).

4. Sided, Symmetrized and Mixed α-Centroids

The k-means clustering requires assigning data elements to their closest cluster center and
then updating those cluster centers by taking their centroids. This section investigates the centroid
computations for the sided, symmetrized and mixed α-divergences.
Note that the mixed α-seeding presented in Section 3 does not require computing centroids and,
yet, guarantees probabilistically a good clustering partition.
Since mixed α-divergences are f-divergences, we start with the generic f-centroids.

4.1. Csiszár f-Centroids

The centroids induced by f-divergences of a set of positive measures (which relaxes the normalisation constraint) have been studied by Ben-Tal et al. [33]. Those entropic centroids are shown to be unique, since f-divergences are convex statistical distances in both arguments. Let $E_f$ denote the energy to minimize when considering f-divergences:

$$E_f := \min_{x \in X} I_f(H : x) = \min_{x \in X} \sum_{j=1}^{n} w_j I_f(h_j : x), \tag{16}$$

$$= \min_{x \in X} \sum_{j=1}^{n} w_j \sum_{i=1}^{d} h_j^i f\!\left( \frac{x^i}{h_j^i} \right). \tag{17}$$

When the domain is the open probability simplex X = Δd, we get a constrained optimisation
problem to solve. We transform this constrained minimisation problem (i.e., x ∈ Δd) into an equivalent
unconstrained minimisation problem by using the Lagrange multiplier, γ:

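$$\min_{x \in \mathbb{R}^d} \sum_{j=1}^{n} w_j I_f(h_j : x) + \gamma \left( \sum_{i=1}^{d} x^i - 1 \right). \tag{18}$$

Taking the derivatives with respect to $x^i$, we get:

$$\forall i \in \{1, ..., d\}, \quad \sum_{j=1}^{n} w_j f'\!\left( \frac{x^i}{h_j^i} \right) - \gamma = 0. \tag{19}$$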

We now consider this equation for α-divergences and symmetrized α-divergences, both
f-divergences.

4.2. Sided Positive and Frequency α-Centroids

The positive sided α-centroids for a set of weighted histograms were reported in [34] using the
representation Bregman divergence. We summarise the results in the following theorem:

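Theorem 2 (Sided positive α-centroids [34]). The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted α-centroid coordinates of a set of n positive histograms $h_1, ..., h_n$ are weighted α-means:

$$r_\alpha^i = f_\alpha^{-1}\!\left( \sum_{j=1}^{n} w_j f_\alpha(h_j^i) \right), \quad l_\alpha^i = r_{-\alpha}^i,$$

with

$$f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\ \log x & \alpha = 1. \end{cases}$$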

Furthermore, the frequency-sided α-centroids are simply the normalized-sided α-centroids.

Theorem 3 (Sided frequency α-centroids [16]). The coordinates of the sided frequency α-centroids of a set of
n weighted frequency histograms are the normalised weighted α-means.

Table 1 summarizes the results concerning the sided positive and frequency α-centroids.

Table 1. Positive and frequency α-centroids: the frequency α-centroids are normalized positive
α-centroids, where w(h) denotes the cumulative sum of the histogram bins. The arithmetic mean
is obtained for r−1 = l1 and the geometric mean for r1 = l−1.

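Right-sided centroid
    Positive centroid: $r_\alpha^i = \left( \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}}$ for α ≠ 1; $r_1^i = \prod_{j=1}^{n} (h_j^i)^{w_j}$ for α = 1
    Frequency centroid: $\tilde{r}_\alpha^i = \frac{r_\alpha^i}{w(r_\alpha)}$

Left-sided centroid
    Positive centroid: $l_\alpha^i = r_{-\alpha}^i = \left( \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}}$ for α ≠ −1; $l_{-1}^i = \prod_{j=1}^{n} (h_j^i)^{w_j}$ for α = −1
    Frequency centroid: $\tilde{l}_\alpha^i = \tilde{r}_{-\alpha}^i = \frac{r_{-\alpha}^i}{w(r_{-\alpha})}$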
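The closed forms of Table 1 translate directly into code. The sketch below assumes weights that sum to one and histograms stored as lists of positive floats; the function names are illustrative.

```python
import math

def right_centroid(H, w, alpha):
    """Right-sided positive alpha-centroid: a coordinate-wise weighted power mean."""
    d = len(H[0])
    if alpha == 1:  # geometric mean in the limit case
        return [math.exp(sum(wj * math.log(h[i]) for wj, h in zip(w, H)))
                for i in range(d)]
    e = (1 - alpha) / 2
    return [sum(wj * h[i] ** e for wj, h in zip(w, H)) ** (1 / e)
            for i in range(d)]

def left_centroid(H, w, alpha):
    """Left-sided positive alpha-centroid, obtained as the right-sided one for -alpha."""
    return right_centroid(H, w, -alpha)

def normalize(h):
    """Frequency centroid: divide a positive centroid by its total mass w(h)."""
    total = sum(h)
    return [x / total for x in h]
```

For instance, `right_centroid(H, w, -1)` returns the weighted arithmetic mean and `right_centroid(H, w, 1)` the weighted geometric mean, matching the caption of Table 1.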

4.3. Mixed α-Centroids

The mixed α-centroids for a set of n weighted histograms are defined as the minimizers (l, r) of:

$$\sum_j w_j M_{\lambda,\alpha}(l : h_j : r). \tag{20}$$

We state the theorem generalizing [15]:

Theorem 4. The two mixed α-centroids are the left-sided and right-sided α-centroids.

Figure 1 depicts some clustering result with our α-clustering software. We remark that the clusters
found are all approximately subclusters of the “distinct” clusters that appear on the figure. When
those distinct clusters are actually the optimal clusters—which is likely to be the case when they are
separated by large minimal distance to other clusters—this is clearly a desirable qualitative property
as long as the number of experimental clusters is not too large compared to the number of optimal
clusters. We remark also that in the experiment displayed, there is no closed form solution for the
cluster centers.

Figure 1. Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins with
k = 8, α = 0.7 and λ = 1/2.

4.4. Symmetrized Jeffreys-Type α-Centroids

The Kullback–Leibler divergence can be symmetrized in various ways: Jeffreys divergence,
Jensen–Shannon divergence and Chernoff information, just to mention a few. Here, we consider the
following symmetrization of α-divergences extending Jeffreys J-divergence:

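$$S_\alpha(p, q) = \frac{1}{2} \left( D_\alpha(p : q) + D_\alpha(q : p) \right) = S_{-\alpha}(p, q) = M_{\frac{1}{2},\alpha}(p : q : p). \tag{21}$$

For α = ±1, we get half of Jeffreys divergence:

$$S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^{d} (p^i - q^i) \log \frac{p^i}{q^i}.$$

In particular, when p and q are frequency histograms, we have for α ≠ ±1:

$$J_\alpha(\tilde{p} : \tilde{q}) = \frac{8}{1 - \alpha^2} \left( 1 - \sum_{i=1}^{d} H_{\frac{1-\alpha}{2}}(\tilde{p}^i, \tilde{q}^i) \right), \tag{22}$$

where $H_{\frac{1-\alpha}{2}}(a, b)$ is a symmetric Heinz mean [35,36]:

$$H_\beta(a, b) = \frac{a^\beta b^{1-\beta} + a^{1-\beta} b^\beta}{2}.$$

Heinz means interpolate the arithmetic and geometric means and satisfy the inequality:

$$\sqrt{ab} = H_{\frac{1}{2}}(a, b) \leq H_\alpha(a, b) \leq H_0(a, b) = \frac{a + b}{2}.$$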

(Another interesting property of Heinz means is the integral representation of the logarithmic mean: $L(x, y) = \frac{x - y}{\log x - \log y} = \int_0^1 H_\beta(x, y) d\beta$. This allows one to prove easily that $\sqrt{xy} \leq L(x, y) \leq \frac{x + y}{2}$.)
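The Heinz mean inequality and the integral representation of the logarithmic mean can be illustrated numerically. The sketch below uses a plain midpoint rule for the integral; the function names and the number of quadrature points are choices made for this illustration.

```python
import math

def heinz_mean(a, b, beta):
    """Symmetric Heinz mean H_beta(a, b)."""
    return (a ** beta * b ** (1 - beta) + a ** (1 - beta) * b ** beta) / 2

def logarithmic_mean(x, y):
    """Logarithmic mean L(x, y) for distinct positive x and y."""
    return (x - y) / (math.log(x) - math.log(y))

def integrated_heinz_mean(x, y, n=10000):
    """Midpoint-rule approximation of the integral of H_beta(x, y) over beta in [0, 1]."""
    step = 1.0 / n
    return step * sum(heinz_mean(x, y, (i + 0.5) * step) for i in range(n))
```

For x = 2 and y = 5, the quadrature agrees with the logarithmic mean to within the discretisation error, and both values lie between the geometric mean and the arithmetic mean.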
The Jα-divergence is a Csiszár f-divergence [24,25].
Observe that it is enough to consider α ∈ [0, ∞) and that the symmetrized α-divergence for
positive and frequency histograms coincide only for α = ±1.
For α → ±1, $S_\alpha(p, q)$ tends to half of the Jeffreys divergence:

$$J(p, q) = KL(p : q) + KL(q : p) = \sum_{i=1}^{d} (p^i - q^i)(\log p^i - \log q^i). \tag{23}$$

The Jeffreys divergence has the same mathematical expression for frequency histograms:

$$J(\tilde{p}, \tilde{q}) = KL(\tilde{p} : \tilde{q}) + KL(\tilde{q} : \tilde{p}) = \sum_{i=1}^{d} (\tilde{p}^i - \tilde{q}^i)(\log \tilde{p}^i - \log \tilde{q}^i). \tag{24}$$

We state the results reported in [32]:

Theorem 5 (Jeffreys positive centroid [32]). The Jeffreys positive centroid $c = (c^1, ..., c^d)$ of a set $\{h_1, ..., h_n\}$ of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

$$c^i = \frac{a^i}{W\!\left( \frac{a^i}{g^i} e \right)},$$

where $a^i = \sum_{j=1}^{n} \pi_j h_j^i$ denotes the coordinate-wise arithmetic weighted means and $g^i = \prod_{j=1}^{n} (h_j^i)^{\pi_j}$ the coordinate-wise geometric weighted means.

The Lambert analytic function W [37] (positive branch) is defined by W(x)eW(x) = x for x ≥ 0.
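The closed form of Theorem 5 can be sketched with a small Newton solver for the Lambert W function. Everything below is an illustration under stated assumptions: weights $\pi_j$ sum to one, histograms are lists of positive floats, and `lambert_w` is a hand-written routine (not a library call) valid for non-negative arguments only.

```python
import math

def lambert_w(x, tol=1e-14, max_iter=100):
    """Positive branch of W for x >= 0: Newton iterations on w * exp(w) = x."""
    w = math.log1p(x)  # starting point at or above the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1))
        w -= step
        if abs(step) <= tol * (1 + abs(w)):
            break
    return w

def arithmetic_means(H, weights):
    return [sum(pj * h[i] for pj, h in zip(weights, H)) for i in range(len(H[0]))]

def geometric_means(H, weights):
    return [math.exp(sum(pj * math.log(h[i]) for pj, h in zip(weights, H)))
            for i in range(len(H[0]))]

def jeffreys_positive_centroid(H, weights):
    """Component-wise c^i = a^i / W(a^i e / g^i), following Theorem 5."""
    a = arithmetic_means(H, weights)
    g = geometric_means(H, weights)
    return [ai / lambert_w(ai * math.e / gi) for ai, gi in zip(a, g)]
```

A useful sanity check is the first-order condition $\log(c^i / g^i) + 1 - a^i / c^i = 0$, which each returned coordinate should satisfy up to round-off.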

Theorem 6 (Jeffreys frequency centroid [32]). Let $\tilde{c}$ denote the Jeffreys frequency centroid and $\tilde{c}' = \frac{c}{w_c}$ the normalised Jeffreys positive centroid. Then, the approximation factor $\alpha_{\tilde{c}'} = \frac{S_1(\tilde{c}', \tilde{H})}{S_1(\tilde{c}, \tilde{H})}$ is such that $1 \leq \alpha_{\tilde{c}'} \leq \frac{1}{w_c}$ (with $w_c \leq 1$).

Therefore, we shall consider α ̸= ±1 in the remainder.
We state the following lemma generalizing the former results in [38] that were tailored to the
symmetrized Kullback–Leibler divergence or the symmetrized Bregman divergence [14]:

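Lemma 1 (Reduction property). The symmetrized $J_\alpha$-centroid of a set of n weighted histograms amounts to computing the symmetrized α-centroid for the weighted α-mean and −α-mean:

$$\min J_\alpha(x, H) = \min_x \left( D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x) \right).$$

Proof. The minimization problem $\min_x S_\alpha(x, H) = \sum_{j=1}^{n} w_j S_\alpha(x, h_j)$ reduces to the following minimization:

$$\min \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} \bar{h}_\alpha^i - (x^i)^{\frac{1-\alpha}{2}} \bar{h}_{-\alpha}^i. \tag{25}$$

This is equivalent to minimizing:

$$\begin{aligned} &\equiv \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} \left( (\bar{h}_\alpha^i)^{\frac{2}{1-\alpha}} \right)^{\frac{1-\alpha}{2}} - (x^i)^{\frac{1-\alpha}{2}} \left( (\bar{h}_{-\alpha}^i)^{\frac{2}{1+\alpha}} \right)^{\frac{1+\alpha}{2}}, \\ &\equiv \sum_{i=1}^{d} x^i - (x^i)^{\frac{1+\alpha}{2}} (r_\alpha^i)^{\frac{1-\alpha}{2}} - (x^i)^{\frac{1-\alpha}{2}} (l_\alpha^i)^{\frac{1+\alpha}{2}}, \\ &\equiv D_\alpha(x : r_\alpha) + D_\alpha(l_\alpha : x). \end{aligned}$$

Note that for α = ±1, the lemma states that the minimization problem is equivalent to minimizing
KL(a : x) + KL(x : g) with respect to x, where $a = l_1$ and $g = r_1$ denote the arithmetic and geometric
means, respectively.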

The lemma states that the optimization problem with n weighted histograms is equivalent to the
optimization with only two equally weighted histograms.
The positive symmetrized α-centroid is equivalent to computing a representation symmetrized
Bregman centroid [14,34].
The frequency symmetrized α-centroid asks to solve the following minimization problem:

$$\min_{\tilde{x} \in \Delta_d} \sum_j w_j S_\alpha(\tilde{x}, \tilde{h}_j).$$

Instead of seeking $\tilde{x}$ in the probability simplex, we can optimize on the unconstrained
domain $\mathbb{R}^{d-1}$ by using a reparameterization. Indeed, frequency histograms belong to the exponential
families [39] (multinomials).
Exponential families also include many other continuous distributions, like the Gaussian, Beta or
Dirichlet distributions. It turns out that the α-divergences can be computed in closed form for members of
the same exponential family:

Lemma 2. The α-divergence for distributions belonging to the same exponential family amounts to computing a divergence on the corresponding natural parameters:

$$A_\alpha(p : q) = \frac{4}{1 - \alpha^2} \left( 1 - e^{-J_F^{\left( \frac{1-\alpha}{2} \right)}(\theta_p : \theta_q)} \right),$$

where $J_F^{(\beta)}(\theta_1 : \theta_2) = \beta F(\theta_1) + (1 - \beta) F(\theta_2) - F(\beta \theta_1 + (1 - \beta) \theta_2)$ is a skewed Jensen divergence defined for the log-normaliser F of the family.

The proof follows from the fact that $\int p^\alpha(x) q^{1-\alpha}(x) dx = e^{-J_F^{(\alpha)}(\theta_p : \theta_q)}$; see [40].

First, we convert a frequency histogram $\tilde{h}$ to its natural parameter θ with $\theta^i = \log \frac{\tilde{h}^i}{\tilde{h}^d}$; see [39].
The log-normaliser is a non-separable convex function $F(\theta) = \log(1 + \sum_i e^{\theta^i})$. To convert back a multinomial to a frequency histogram with d bins, we first set $\tilde{h}^d = \frac{1}{1 + \sum_{l=1}^{d-1} e^{\theta^l}}$ and then retrieve the other bin values as $\tilde{h}^i = \tilde{h}^d e^{\theta^i}$.
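The conversion between frequency histograms and natural parameters, together with the skewed Jensen divergence of Lemma 2, can be sketched as follows. The function names are illustrative, and the check at the end relies only on the identity stated in the proof of Lemma 2, specialised to discrete histograms (sums instead of integrals).

```python
import math

def to_natural(h):
    """Natural parameters theta^i = log(h^i / h^d), for i = 1, ..., d-1."""
    return [math.log(x / h[-1]) for x in h[:-1]]

def log_normalizer(theta):
    """Log-normaliser F(theta) = log(1 + sum_i exp(theta^i))."""
    return math.log(1 + sum(math.exp(t) for t in theta))

def from_natural(theta):
    """Recover the d-bin frequency histogram from its d-1 natural parameters."""
    last = 1 / (1 + sum(math.exp(t) for t in theta))
    return [last * math.exp(t) for t in theta] + [last]

def skew_jensen(theta1, theta2, beta):
    """Skewed Jensen divergence J_F^(beta)(theta1 : theta2)."""
    mix = [beta * s + (1 - beta) * t for s, t in zip(theta1, theta2)]
    return (beta * log_normalizer(theta1)
            + (1 - beta) * log_normalizer(theta2)
            - log_normalizer(mix))
```

On two frequency histograms p and q, `math.exp(-skew_jensen(to_natural(p), to_natural(q), beta))` should match the affinity $\sum_i (p^i)^\beta (q^i)^{1-\beta}$.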
The centroids with respect to skewed Jensen divergences have been investigated in [13,40].

Remark 4. Note that for the special case of α = 0 (squared Hellinger centroid), the sided and symmetrized centroids coincide. In that case, the coordinates $s_0^i$ of the squared Hellinger centroid are:

$$s_0^i = \left( \sum_{j=1}^{n} w_j \sqrt{h_j^i} \right)^2, \quad 1 \leq i \leq d.$$

Remark 5. The symmetrized positive α-centroids can be solved in special cases (α = ±3, α = ±1 corresponding
to the symmetrized χ2 and Jeffreys positive centroids). For frequency centroids, when dealing with binary
histograms (d = 2), we have only one degree of freedom and can solve the binary frequency centroids. Binary
histograms (and mixtures thereof) are used in computer vision and pattern recognition [41].

Remark 6. Since α-divergences are Csiszár f-divergences and f-divergences can always be symmetrized by taking generator $s(t) = f(t) + t f\!\left( \frac{1}{t} \right)$, we deduce that symmetrized α-divergences $S_\alpha$ are f-divergences for the generator:

$$f(t) = -\log((1 - \alpha) + \alpha t) - t \log\left( (1 - \alpha) + \frac{\alpha}{t} \right). \tag{26}$$

Hence, Sα divergences are convex in both arguments, and the sα centroids are unique.

5. Soft Mixed α-Clustering

Algorithm 4 reports the general clustering with soft membership, which can be adapted to left
(λinit = 1), right (λinit = 0) or mixed clustering. We have not considered a weighted histogram set, so
as not to overload the notation and because the extension is straightforward.
Again, for skew Jeffreys centers, we shall adopt a variational approach. Notice that the soft
clustering approach learns all parameters, including λ (if not constrained to zero or one) and α ∈ R.
This is not the case for Matsuyama’s α-expectation maximization (EM) algorithm [42] in which α is
fixed beforehand (and, thus, not learned).
Assuming we model the prior for histograms by:

$$p_{\lambda,\alpha,j}(h_i) \propto \lambda \exp\left( -D_\alpha(l_j : h_i) \right) + (1 - \lambda) \exp\left( -D_\alpha(h_i : r_j) \right), \tag{27}$$

the log-likelihood involves the α-dependent quantity:

$$\sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \log p_{\lambda,\alpha,j}(h_i) \geq -\sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j) p(j|h_i), \tag{28}$$

because of the concavity of the logarithm function. Therefore, the maximization step for α involves finding:

$$\arg\min_\alpha \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j) p(j|h_i). \tag{29}$$

Algorithm 4: Mixed α-soft clustering; MAsC(H, k, λ, α)

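Input: Histogram set H with |H| = m, integer k > 0, real λ ← λinit ∈ [0, 1], real α ∈ R;
Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
repeat
    // Expectation
    for i = 1, 2, ..., m do
        for j = 1, 2, ..., k do
            $p(j|h_i) = \frac{\pi_j \exp(-M_{\lambda,\alpha}(l_j : h_i : r_j))}{\sum_{j'} \pi_{j'} \exp(-M_{\lambda,\alpha}(l_{j'} : h_i : r_{j'}))}$;
    // Maximization
    for j = 1, 2, ..., k do
        $\pi_j \leftarrow \frac{1}{m} \sum_i p(j|h_i)$;
        $l_j \leftarrow \left( \frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i) h_i^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}}$;
        $r_j \leftarrow \left( \frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i) h_i^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}}$;
    // Alpha - Lambda
    $\alpha \leftarrow \alpha - \eta_1 \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \frac{\partial}{\partial \alpha} M_{\lambda,\alpha}(l_j : h_i : r_j)$;
    if λinit ≠ 0, 1 then
        $\lambda \leftarrow \lambda - \eta_2 \left( \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(l_j : h_i) - \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(h_i : r_j) \right)$;
    // for some small η1, η2; ensure that λ ∈ [0, 1].
until convergence;
Output: Soft clustering of H according to k densities p(j|.) following C;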
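The expectation step and the prior update of Algorithm 4 can be sketched as below. This is a partial illustration only (centroid, α and λ updates are omitted), assuming uniform histogram weights, α ≠ ±1 and list-based histograms; the function names `memberships` and `update_priors` are hypothetical.

```python
import math

def alpha_div(p, q, alpha):
    """alpha-divergence of Equation (5); this sketch assumes alpha is not +1 or -1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total

def mixed_alpha_div(l, h, r, lam, alpha):
    return lam * alpha_div(l, h, alpha) + (1 - lam) * alpha_div(h, r, alpha)

def memberships(H, centers, priors, lam, alpha):
    """Expectation step: posterior p(j | h_i) for every histogram and cluster."""
    posteriors = []
    for h in H:
        scores = [pj * math.exp(-mixed_alpha_div(l, h, r, lam, alpha))
                  for pj, (l, r) in zip(priors, centers)]
        total = sum(scores)
        posteriors.append([s / total for s in scores])
    return posteriors

def update_priors(posteriors):
    """Maximization step for the mixing weights pi_j."""
    m, k = len(posteriors), len(posteriors[0])
    return [sum(row[j] for row in posteriors) / m for j in range(k)]
```

Each row of the returned posterior matrix sums to one, and the updated priors again form a probability vector.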

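No closed-form solution is known, so we compute the gradient update in Algorithm 4 with:

$$\frac{\partial M_{\lambda,\alpha}(l_j : h_i : r_j)}{\partial \alpha} = \lambda \frac{\partial D_\alpha(l_j : h_i)}{\partial \alpha} + (1 - \lambda) \frac{\partial D_\alpha(h_i : r_j)}{\partial \alpha}, \tag{30}$$

where, differentiating Equation (5) term by term for α ≠ ±1:

$$\frac{\partial D_\alpha(p : q)}{\partial \alpha} = \frac{2}{(1 - \alpha)^2} \sum_{i=1}^{d} \left[ q^i - \left( \frac{1 - \alpha}{1 + \alpha} \right)^2 p^i - (p^i)^{\frac{1-\alpha}{2}} (q^i)^{\frac{1+\alpha}{2}} \left( \frac{4\alpha}{(1 + \alpha)^2} + \frac{1 - \alpha}{1 + \alpha} \ln \frac{q^i}{p^i} \right) \right]. \tag{31}$$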
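As a hedged sanity check on the α-gradient, the derivative of Equation (5) with respect to α can be written term by term and compared with a central finite difference. The closed form coded below is our own differentiation of Equation (5) for α ≠ ±1 (it should be checked against the source before being relied upon); function names and the step size `eps` are illustrative.

```python
import math

def alpha_div(p, q, alpha):
    """alpha-divergence of Equation (5); this sketch assumes alpha is not +1 or -1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total

def alpha_div_dalpha(p, q, alpha):
    """Derivative of Equation (5) with respect to alpha, differentiated term by term."""
    total = 0.0
    for pi, qi in zip(p, q):
        g = pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2)
        total += (-2 * pi / (1 + alpha) ** 2
                  + 2 * qi / (1 - alpha) ** 2
                  - 2 * g * math.log(qi / pi) / (1 - alpha ** 2)
                  - 8 * alpha * g / (1 - alpha ** 2) ** 2)
    return total

def central_difference(p, q, alpha, eps=1e-5):
    """Numerical derivative used as an independent reference."""
    return (alpha_div(p, q, alpha + eps) - alpha_div(p, q, alpha - eps)) / (2 * eps)
```

In practice, the finite-difference routine alone is enough to drive the α update of a soft clustering loop when an analytic gradient is not trusted.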

The update in λ is easier as:

$$\frac{\partial M_{\lambda,\alpha}(l_j : h_i : r_j)}{\partial \lambda} = D_\alpha(l_j : h_i) - D_\alpha(h_i : r_j). \tag{32}$$

Maximizing the likelihood in λ would imply choosing λ ∈ {0, 1} (a hard choice for left/right centers),
yet we prefer the soft update for the parameter, like for α.

6. Conclusions

The family of α-divergences plays a fundamental role in information geometry: These statistical
distortion measures are the canonical divergences of dual spaces on probability distribution manifolds
with constant curvature $\kappa = \frac{1 - \alpha^2}{4}$ and the canonical divergences of dually flat manifolds for positive
distribution manifolds [19].
In this work, we have presented three techniques for clustering (positive or frequency) histograms
using k-means:

(1) Sided left or right α-centroid k-means,
(2) Symmetrized Jeffreys-type α-centroid (variational) k-means, and
(3) Coupled k-means with respect to mixed α-divergences relying on dual α-centroids.

Sided and mixed dual centroids are always available in closed-forms and are therefore highly
attractive from the standpoint of implementation. Symmetrized Jeffreys centroids are in general not
available in closed-form and require one to implement a variational k-means by updating incrementally
the cluster centroids in order to monotonically decrease the k-means loss function. From the clustering
standpoint, this appears not to be a problem when guaranteed expected approximations to the optimal
clustering are enough.
Indeed, we also presented and analyzed an extension of k-means++ [17] for seeding those k-means
algorithms. The mixed α-seeding initializations do not require one to calculate centroids and behave
like a discrete k-means by picking the seeds among the data. We reported guaranteed probabilistic
clustering bounds. Thus, it yields a fast hard/soft data partitioning technique with respect to mixed or
symmetrized α-divergences. Recently, the advantage of clustering using α-divergences by tuning α in
applications has been demonstrated in [18]. We thus expect the computationally fast mixed α-seeding
with guaranteed performance to be useful in a growing number of applications.

Acknowledgments

NICTA is funded by the Australian Government as represented by the Department of Broadband,
Communication and the Digital Economy and the Australian Research Council through the ICT Centre
of Excellence program.

Author Contributions
All authors contributed equally to the design of the research. The research was carried out by all
authors. Frank Nielsen and Richard Nock wrote the paper. Frank Nielsen implemented the algorithms
and performed experiments. All authors have read and approved the final manuscript.

Conflicts of Interest
The authors declare no conflict of interest.

Appendix Proof Sketch of Theorem 1

We give here the key results allowing one to obtain the proof of the Theorem, following the proof
scheme of [15]. In order not to load the notations, weights are considered uniform. The extension to
non-uniform weights is immediate as it boils down to duplicate histograms in the histogram set and
does not change the approximation result.
Let A ⊆ H be an arbitrary cluster of Copt. Let us define UA and πA as the uniform and biased
distributions conditioned to A. The key to the proof is to relate the expected potential of A under UA
and πA to its contribution to the optimal potential.

Lemma A1. Let A ⊆ H be an arbitrary cluster of $C_{opt}$. Then:

$$E_{c \sim U_A}[M_{\lambda,\alpha}(A, c)] = M_{opt,\lambda,\alpha}(A) + M_{opt,\lambda,-\alpha}(A) = E_{c \sim U_A}[M_{\lambda,-\alpha}(A, c)],$$

where $U_A$ is the uniform distribution over A.

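Proof. α-coordinates have the property that for any subset A ⊆ H, $(1/|A|) \sum_{p \in A} u_\alpha(p) = u_\alpha(r_{\alpha,A})$.
Hence, we have:

$$\begin{aligned} \forall c \in A, \quad \sum_{p \in A} D_\alpha(p : c) &= \sum_{p \in A} D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(c)) \\ &= \sum_{p \in A} D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(r_{\alpha,A})) + |A| D_{\phi_\alpha}(u_\alpha(r_{\alpha,A}) : u_\alpha(c)) \\ &= \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) + |A| D_\alpha(r_{\alpha,A} : c). \end{aligned} \tag{A1}$$

Because $D_\alpha(p : q) = D_{-\alpha}(q : p)$ and $l_\alpha = r_{-\alpha}$, we obtain:

$$\begin{aligned} \forall c \in A, \quad \sum_{p \in A} D_\alpha(c : p) &= \sum_{p \in A} D_{-\alpha}(p : c) \\ &= \sum_{p \in A} D_{-\alpha}(p : r_{-\alpha,A}) + |A| D_{-\alpha}(r_{-\alpha,A} : c) \\ &= \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) + |A| D_\alpha(c : l_{\alpha,A}). \end{aligned} \tag{A2}$$

It follows from (A1) and (A2) that:

$$\begin{aligned} E_{c \sim U_A}[M_{\lambda,\alpha}(A, c)] &= \frac{1}{|A|} \sum_{c \in A} \sum_{p \in A} \left\{ \lambda D_\alpha(c : p) + (1 - \lambda) D_\alpha(p : c) \right\} \\ &= (1 - \lambda) \sum_{p \in A} D_\alpha(p : r_{\alpha,A}) + (1 - \lambda) \sum_{p \in A} D_\alpha(r_{\alpha,A} : p) \\ &\quad + \lambda \sum_{p \in A} D_\alpha(l_{\alpha,A} : p) + \lambda \sum_{p \in A} D_\alpha(p : l_{\alpha,A}) \\ &= (1 - \lambda) M_{opt,0,\alpha}(A) + \lambda M_{opt,1,\alpha}(A) \\ &\quad + (1 - \lambda) M_{opt,0,-\alpha}(A) + \lambda M_{opt,1,-\alpha}(A) \\ &= M_{opt,\lambda,\alpha}(A) + M_{opt,\lambda,-\alpha}(A). \end{aligned} \tag{A3}$$

This gives the left-hand side equality of the Lemma. The right-hand side follows from the fact that
$E_{c \sim U_A}[M_{\lambda,-\alpha}(A, c)] = M_{opt,1-\lambda,\alpha}(A) + M_{opt,1-\lambda,-\alpha}(A)$.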

Instead of Mopt,λ,α(A) + Mopt,λ,−α(A), we want a term depending solely on Mopt,λ,α(A) as it is
the “true” optimum. We now give two lemmata that shall be useful in obtaining this upper bound.
The first is of independent interest, as it shows that any α-divergence is a scaled, squared Hellinger
distance between geometric means of points.

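Lemma A2. For any p, q and α ≠ 1, there exists r ∈ [p, q], such that $(1 - \alpha)^2 D_\alpha(p : q) = D_0(p^{1-\alpha} r^\alpha : q^{1-\alpha} r^\alpha)$.

Proof. By the definition of Bregman divergences, for any x, y, there exists some z ∈ [x, y], such that:

$$D_{\phi_\alpha}(x : y) = \frac{1}{2} (x - y)^2 \phi''_\alpha(z) = \frac{1}{2} (x - y)^2 \left( 1 + \frac{1 - \alpha}{2} z \right)^{\frac{2\alpha}{1-\alpha}},$$

and since $u_\alpha$ is continuous and strictly increasing, for any p, q, there exists some r ∈ [p, q], such that:

$$\begin{aligned} D_\alpha(p : q) &= D_{\phi_\alpha}(u_\alpha(p) : u_\alpha(q)) \\ &= \frac{1}{2} (u_\alpha(p) - u_\alpha(q))^2 \left( 1 + \frac{1 - \alpha}{2} u_\alpha(r) \right)^{\frac{2\alpha}{1-\alpha}} \\ &= \frac{2}{(1 - \alpha)^2} \left( p^{\frac{1-\alpha}{2}} - q^{\frac{1-\alpha}{2}} \right)^2 r^\alpha \\ &= \frac{2}{(1 - \alpha)^2} \left( p^{1-\alpha} + q^{1-\alpha} - 2 (pq)^{\frac{1-\alpha}{2}} \right) r^\alpha \\ &= \frac{1}{(1 - \alpha)^2} D_0(p^{1-\alpha} r^\alpha : q^{1-\alpha} r^\alpha). \end{aligned}$$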

Lemma A3. Let discrete random variable x take non-negative values x1, x2, ..., xm with uniform probabilities.
Then, for any β > −1, we have var(x1+β/uβ) ≤ var(x), with u
.= (1 + β)β maxi xi.

Proof. First, remark that for any $\beta > -1$, the function $f(x) = x(u^{\beta} - x^{\beta})$ is increasing for $x \le u/(1+\beta)^{\beta}$.
Hence, assuming without loss of generality that the $x_i$ are put in non-increasing order, we
have $f(x_i) \ge f(x_j)$, and so, $x_i(u^{\beta} - x_i^{\beta}) \ge x_j(u^{\beta} - x_j^{\beta})$, $\forall i \ge j$, as long as $x_i \le u/(1+\beta)^{\beta}$. Choosing
$u = x_1(1+\beta)^{\beta}$ yields, after reordering and putting the exponent, $(x_i^{1+\beta} - x_j^{1+\beta})^2 \le (x_i u^{\beta} - x_j u^{\beta})^2$.
Hence:

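$$
\begin{aligned}
\frac{1}{m}\sum_i x_i^{2(1+\beta)} - \left(\frac{1}{m}\sum_i x_i^{1+\beta}\right)^2 &= \frac{1}{2m^2}\sum_{i,j}\left(x_i^{1+\beta} - x_j^{1+\beta}\right)^2 \\
&\le \frac{1}{2m^2}\sum_{i,j}\left(x_i u^{\beta} - x_j u^{\beta}\right)^2 \\
&= \frac{u^{2\beta}}{2m^2}\sum_{i,j}(x_i - x_j)^2 \\
&= u^{2\beta}\left(\frac{1}{m}\sum_i x_i^2 - \left(\frac{1}{m}\sum_i x_i\right)^2\right) .
\end{aligned}
$$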

Dividing the leftmost and rightmost terms by $u^{2\beta}$ and using the fact that $\mathrm{var}(\lambda x) = \lambda^2 \mathrm{var}(x)$ yields
the statement of the Lemma.

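We are now ready to upper bound $M_{\mathrm{opt},\lambda,-\alpha}(A)$ as a function of $M_{\mathrm{opt},\lambda,\alpha}(A)$.

Lemma A4. For any cluster $A$ of $C_{\mathrm{opt}}$,

$$
M_{\mathrm{opt},\lambda,-\alpha}(A) \le M_{\mathrm{opt},\lambda,\alpha}(A) \times
\begin{cases}
f(\lambda) & \text{if } \lambda \in (0, 1) \\
z(\alpha) h^2(\alpha) & \text{otherwise}
\end{cases} ,
$$

where $z(\alpha)$, $f(\lambda)$ and $h(\alpha)$ are defined in Theorem 1.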

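Proof. The case $\lambda \neq 0, 1$ is fast, as we have by definition:

$$
\begin{aligned}
M_{\mathrm{opt},\lambda,-\alpha}(A) &= \sum_{p\in A} \lambda D_{-\alpha}(l_{-\alpha,A} : p) + (1-\lambda) D_{-\alpha}(p : r_{-\alpha,A}) \\
&= \sum_{p\in A} \lambda D_{\alpha}(p : l_{-\alpha,A}) + (1-\lambda) D_{\alpha}(r_{-\alpha,A} : p) \\
&= \sum_{p\in A} \lambda D_{\alpha}(p : r_{\alpha,A}) + (1-\lambda) D_{\alpha}(l_{\alpha,A} : p) \\
&\le \max\left\{\frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda}\right\} M_{\mathrm{opt},\lambda,\alpha}(A) \\
&= f(\lambda)\, M_{\mathrm{opt},\lambda,\alpha}(A) .
\end{aligned}
$$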

Suppose now that $\lambda = 0$ and $\alpha \ge 0$. Because $M_{\mathrm{opt},0,-\alpha}(A) = \sum_{p\in A} D_{-\alpha}(p : r_{-\alpha,A}) = \sum_{p\in A} D_{\alpha}(l_{\alpha,A} : p) = M_{\mathrm{opt},1,\alpha}(A)$,
what we wish to do is upper bound $\sum_{p\in A} D_{\alpha}(l_{\alpha,A} : p) = M_{\mathrm{opt},1,\alpha}(A)$
as a function of $\sum_{p\in A} D_{\alpha}(p : r_{\alpha,A}) = M_{\mathrm{opt},0,\alpha}(A)$. We use Lemmata A2 and A3 in the following
derivations, using $r(p)$ to refer to the $r$ in Lemma A2, assuming $\alpha \ge 0$. We also note $\mathrm{var}_A(f(p))$ as
the variance, under the uniform distribution over $A$, of the discrete random variable $f(p)$, for $p \in A$. We
have:

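$$
\begin{aligned}
\sum_{p\in A} D_{\alpha}(l_{\alpha,A} : p) &= \sum_{p\in A} D_{-\alpha}(p : l_{\alpha,A}) \\
&= \frac{1}{(1+\alpha)^2}\sum_{p\in A} r(p)^{-\alpha}\, D_0(p^{1+\alpha} : l_{\alpha,A}^{1+\alpha}) \\
&\le \frac{1}{(1+\alpha)^2 \min_A p^{\alpha}}\sum_{p\in A} D_0(p^{1+\alpha} : l_{\alpha,A}^{1+\alpha}) \\
&= \frac{1}{(1+\alpha)^2 \min_A p^{\alpha}}\sum_{p\in A}\left(p^{1+\alpha} + l_{\alpha,A}^{1+\alpha} - 2 p^{\frac{1+\alpha}{2}}\, l_{\alpha,A}^{\frac{1+\alpha}{2}}\right) \\
&= \frac{|A|}{(1+\alpha)^2 \min_A p^{\alpha}}\left(\frac{1}{|A|}\sum_{p\in A} p^{1+\alpha} - \left(\frac{1}{|A|}\sum_{p\in A} p^{\frac{1+\alpha}{2}}\right)^2\right) \\
&= \frac{|A|\,\mathrm{var}_A\!\left(p^{\frac{1+\alpha}{2}}\right)}{(1+\alpha)^2 \min_A p^{\alpha}} .
\end{aligned}
\tag{A4}
$$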

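We have used the expression of the left centroid $l_{\alpha,A}^{1+\alpha}$ to simplify the expressions. Now, picking $x_i = p_i^{\frac{1-\alpha}{2}}$,
$\beta = 2\alpha/(1-\alpha)$ and $u = \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{2\alpha}{1-\alpha}} \max_A p^{\frac{1-\alpha}{2}}$ in Lemma A3 yields: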

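$$
\begin{aligned}
\mathrm{var}_A\!\left(p^{\frac{1+\alpha}{2}}\right) &= u^{2\beta}\,\mathrm{var}_A\!\left(p^{\frac{1+\alpha}{2}}/u^{\beta}\right) \\
&= u^{2\beta}\,\mathrm{var}_A\!\left(p^{\frac{1-\alpha}{2}} p^{\alpha}/u^{\beta}\right) \\
&= u^{2\beta}\,\mathrm{var}\!\left(x^{1+\beta}/u^{\beta}\right) \\
&\le u^{2\beta}\,\mathrm{var}(x) \\
&= u^{2\beta}\,\mathrm{var}_A\!\left(p^{\frac{1-\alpha}{2}}\right) \\
&= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}} \max_A p^{2\alpha}\;\mathrm{var}_A\!\left(p^{\frac{1-\alpha}{2}}\right) .
\end{aligned}
\tag{A5}
$$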


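Plugging this in (A4) yields:

$$
\begin{aligned}
\sum_{p\in A} D_{\alpha}(l_{\alpha,A} : p) &\le \frac{\left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}} |A| \max_A p^{2\alpha}\;\mathrm{var}_A\!\left(p^{\frac{1-\alpha}{2}}\right)}{(1+\alpha)^2 \min_A p^{\alpha}} \\
&= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}-2}\left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{|A| \min_A p^{\alpha}\;\mathrm{var}_A\!\left(p^{\frac{1-\alpha}{2}}\right)}{(1-\alpha)^2} \\
&= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}-2}\left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{\min_A p^{\alpha}}{(1-\alpha)^2}\sum_{p\in A} D_0(p^{1-\alpha} : r_{\alpha,A}^{1-\alpha}) \qquad \text{(A6)} \\
&\le \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}-2}\left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \frac{1}{(1-\alpha)^2}\sum_{p\in A} r(p)^{\alpha}\, D_0(p^{1-\alpha} : r_{\alpha,A}^{1-\alpha}) \\
&= \left(\frac{1+\alpha}{1-\alpha}\right)^{\frac{8\alpha^2}{(1-\alpha)^2}-2}\left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \sum_{p\in A} D_{\alpha}(p : r_{\alpha,A}) \\
&\le z(\alpha)\left(\frac{\max_A p}{\min_A p}\right)^{2\alpha} \times \sum_{p\in A} D_{\alpha}(p : r_{\alpha,A}) . \qquad \text{(A7)}
\end{aligned}
$$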

Here, (A6) retraces, in reverse, the derivations that led to (A4). The cases $\lambda = 1$ or $\alpha < 0$ are
obtained using the same chains of derivations, which completes the proof of Lemma A4.

Lemma A4 can be directly used to refine the bound of Lemma A1 in the uniform distribution. We
give the Lemma for the biased distribution, directly integrating the refinement of the bound.

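Lemma A5. Let $A$ be an arbitrary cluster of $C_{\mathrm{opt}}$ and $C$ an arbitrary clustering. If we add a random couple
$(c, c)$ to $C$, chosen from $A$ with $\pi$ as in Algorithm 2, then:

$$
\mathbb{E}_{c\sim\pi_A}[M_{\lambda,\alpha}(A, c)] \le 4
\begin{cases}
f(\lambda)\, h^2(\alpha)\, M_{\mathrm{opt},\lambda,\alpha}(A) & \text{if } \lambda \in (0, 1) \\
z(\alpha)\, h^4(\alpha)\, M_{\mathrm{opt},\lambda,\alpha}(A) & \text{otherwise}
\end{cases} ,
\tag{A8}
$$

where $f(\lambda)$ and $h(\alpha)$ are defined in Theorem 1.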

Proof. The proof essentially follows the proof of Lemma 3 in [15]. To complete it, we need a triangle
inequality involving α-divergences. We give it here.

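Lemma A6. For any $p, q, r$ and $\alpha$, we have:

$$
\sqrt{D_{\alpha}(p : q)} \le \left(\frac{\max_i\{p_i, q_i, r_i\}}{\min_i\{p_i, q_i, r_i\}}\right)^{|\alpha|}\left(\sqrt{D_{\alpha}(p : r)} + \sqrt{D_{\alpha}(r : q)}\right)
\tag{A9}
$$

(where the min is over strictly positive values)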

Remark: take α = 0; we find the triangle inequality for the squared Hellinger distance.

Proof. Using the proof of Lemma 2 in [15] for the Bregman divergence $D_{\phi_\alpha}$, we get:

$$
\sqrt{D_{\phi_\alpha}(x : z)} \le \rho(\alpha)\left(\sqrt{D_{\phi_\alpha}(x : y)} + \sqrt{D_{\phi_\alpha}(y : z)}\right) ,
\tag{A10}
$$


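where:

$$
\rho(\alpha) = \max_{u,v} \frac{\left(1 + \frac{1-\alpha}{2}\,u\right)^{\frac{2\alpha}{1-\alpha}}}{\left(1 + \frac{1-\alpha}{2}\,v\right)^{\frac{2\alpha}{1-\alpha}}} .
\tag{A11}
$$

Taking $x = u_\alpha(p)$, $y = u_\alpha(q)$, $z = u_\alpha(r)$ yields $\rho(\alpha) = \max_{s,t\in\{p_i,q_i,r_i\}}(s/t)^{|\alpha|}$ and the statement of
Lemma A6.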
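As a sanity check, the generalized triangle inequality (A9) can be tested on random positive arrays. This is a minimal sketch under our own assumptions: the helper names are hypothetical, the α-divergence is evaluated with the closed form used in the objective of Theorem A1, and only $|\alpha| < 1$ is sampled.

```python
import math
import random

def alpha_div(p, q, alpha):
    """D_alpha(p : q) for positive arrays p, q and alpha != +/-1 (sum over coordinates)."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return sum(4.0 / (1.0 - alpha ** 2) * (a * pi + b * qi - pi ** a * qi ** b)
               for pi, qi in zip(p, q))

def root_div(p, q, alpha):
    # Clamp at zero to absorb tiny negative rounding errors when p is close to q.
    return math.sqrt(max(0.0, alpha_div(p, q, alpha)))

def triangle_slack(p, q, r, alpha):
    """Right-hand side minus left-hand side of (A9); non-negative if (A9) holds."""
    values = p + q + r
    factor = (max(values) / min(values)) ** abs(alpha)
    return factor * (root_div(p, r, alpha) + root_div(r, q, alpha)) - root_div(p, q, alpha)

rng = random.Random(0)
worst = min(
    triangle_slack(*[[rng.uniform(0.1, 10.0) for _ in range(5)] for _ in range(3)],
                   alpha=rng.uniform(-0.9, 0.9))
    for _ in range(1000)
)
print(worst >= 0.0)
```

A non-negative `worst` slack over the random trials is consistent with the lemma; it is of course not a proof.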

The rest of the proof of Lemma A5 follows the proof of Lemma 3 in [15].

We now have all of the ingredients for our proof, and it remains to use Lemma 4 in [15] to complete the
proof of Theorem 1.

Appendix Properties of α-Divergences

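For positive arrays $p$ and $q$, the α-divergence $D_\alpha(p : q)$ can be defined as an equivalent
representational Bregman divergence [19,34] $B_{\phi_\alpha}(u_\alpha(p) : u_\alpha(q))$ over the $(u_\alpha, v_\alpha)$-structure [43] with:

$$
\phi_\alpha(x) \doteq \frac{2}{1+\alpha}\left(1 + \frac{1-\alpha}{2}\,x\right)^{\frac{2}{1-\alpha}} ,
\tag{A12}
$$

$$
u_\alpha(p) \doteq \frac{2}{1-\alpha}\left(p^{\frac{1-\alpha}{2}} - 1\right) ,
\tag{A13}
$$

$$
v_\alpha(p) \doteq \frac{2}{1+\alpha}\, p^{\frac{1+\alpha}{2}} ,
\tag{A14}
$$

where we assume that $\alpha \neq \pm 1$. Otherwise, for $\alpha = \pm 1$, we compute $D_\alpha(p : q)$ by taking the sided
Kullback–Leibler divergence extended to positive arrays.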
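The equivalence between the representational Bregman form and the closed form of the α-divergence (the one written out in the objective of Theorem A1) can be checked numerically for scalars. The sketch below is illustrative only; the function names and the hand-derived derivative `phi_prime` are our own assumptions.

```python
def phi(x, alpha):
    """Generator phi_alpha of (A12)."""
    return 2.0 / (1.0 + alpha) * (1.0 + (1.0 - alpha) / 2.0 * x) ** (2.0 / (1.0 - alpha))

def phi_prime(x, alpha):
    """Derivative of phi_alpha, obtained by differentiating (A12) by hand."""
    return 2.0 / (1.0 + alpha) * (1.0 + (1.0 - alpha) / 2.0 * x) ** ((1.0 + alpha) / (1.0 - alpha))

def u(p, alpha):
    """Representation function u_alpha of (A13)."""
    return 2.0 / (1.0 - alpha) * (p ** ((1.0 - alpha) / 2.0) - 1.0)

def bregman(x, y, alpha):
    """Scalar Bregman divergence generated by phi_alpha."""
    return phi(x, alpha) - phi(y, alpha) - (x - y) * phi_prime(y, alpha)

def alpha_div(p, q, alpha):
    """Closed form of D_alpha(p : q) for scalars and alpha != +/-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha ** 2) * (a * p + b * q - p ** a * q ** b)

p, q, alpha = 2.0, 5.0, 0.3
print(abs(bregman(u(p, alpha), u(q, alpha), alpha) - alpha_div(p, q, alpha)) < 1e-9)
```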
In the proof of Theorem 1, we have used two properties of α-divergences of independent interest:

• any α-divergence can be explained as a scaled squared Hellinger distance between geometric
means of its arguments and a point that belongs to their segment (Lemma A2);
• any α-divergence satisfies a generalized triangle inequality (Lemma A6). Notice that this Lemma
is optimal in the sense that for α = 0, it is possible to recover the triangle inequality of the
Hellinger distance.

The following lemma shows how to bound the mixed divergence as a function of an α-divergence.

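Lemma A7. For any positive arrays $l, h, r$ and $\alpha \neq \pm 1$, define $\eta \doteq \lambda(1-\alpha)/(1-\alpha(2\lambda-1)) \in [0, 1]$, $g_\eta$
with $g_\eta^i \doteq (l^i)^{\eta}(r^i)^{1-\eta}$ and $a_\eta$ with $a_\eta^i \doteq \eta l^i + (1-\eta) r^i$. Then, we have:

$$
M_{\lambda,\alpha}(l : h : r) \le \frac{1-\alpha^2(2\lambda-1)^2}{1-\alpha^2}\, D_{\alpha(2\lambda-1)}(g_\eta : h) + \frac{2(1-\alpha(2\lambda-1))}{1-\alpha^2}\sum_i\left(a_\eta^i - g_\eta^i\right) .
$$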

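Proof. For every index $i$, we have:

$$
\begin{aligned}
M_{\lambda,\alpha}(l^i : h^i : r^i) &= \lambda D_\alpha(l^i : h^i) + (1-\lambda) D_\alpha(h^i : r^i) \\
&= \frac{4}{1-\alpha^2}\left(\frac{\lambda(1-\alpha)}{2}\, l^i + \frac{(1-\lambda)(1+\alpha)}{2}\, r^i + \frac{1+\alpha(2\lambda-1)}{2}\, h^i\right. \qquad \text{(A15)} \\
&\qquad \left. -\,\lambda (l^i)^{\frac{1-\alpha}{2}} (h^i)^{\frac{1+\alpha}{2}} - (1-\lambda)(r^i)^{\frac{1+\alpha}{2}} (h^i)^{\frac{1-\alpha}{2}}\right) . \qquad \text{(A16)}
\end{aligned}
$$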


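The arithmetic-geometric-harmonic (AGH) inequality implies:

$$
\begin{aligned}
\lambda (l^i)^{\frac{1-\alpha}{2}} (h^i)^{\frac{1+\alpha}{2}} + (1-\lambda)(r^i)^{\frac{1+\alpha}{2}} (h^i)^{\frac{1-\alpha}{2}} &\ge (l^i)^{\frac{\lambda(1-\alpha)}{2}} (r^i)^{\frac{(1-\lambda)(1+\alpha)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
&= \left((l^i)^{\frac{\lambda(1-\alpha)}{1-\alpha(2\lambda-1)}} (r^i)^{\frac{(1-\lambda)(1+\alpha)}{1-\alpha(2\lambda-1)}}\right)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
&= \left((l^i)^{\eta} (r^i)^{1-\eta}\right)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} \\
&= (g_\eta^i)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}} .
\end{aligned}
$$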

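It follows that (A16) yields:

$$
\begin{aligned}
M_{\lambda,\alpha}(l^i : h^i : r^i) &\le \frac{4}{1-\alpha^2}\left(\frac{1-\alpha(2\lambda-1)}{2}\left(\eta l^i + (1-\eta) r^i\right) + \frac{1+\alpha(2\lambda-1)}{2}\, h^i - (g_\eta^i)^{\frac{1-\alpha(2\lambda-1)}{2}} (h^i)^{\frac{1+\alpha(2\lambda-1)}{2}}\right) \qquad \text{(A17)} \\
&= \frac{1-\alpha^2(2\lambda-1)^2}{1-\alpha^2}\, D_{\alpha(2\lambda-1)}(g_\eta^i : h^i) + \frac{2(1-\alpha(2\lambda-1))}{1-\alpha^2}\left(a_\eta^i - g_\eta^i\right) , \qquad \text{(A18)}
\end{aligned}
$$

out of which we get the statement of the Lemma.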

Appendix Sided α-Centroids

For the sake of completeness, we prove the following theorem:

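Theorem A1 (Sided positive α-centroids [34]). The left-sided $l_\alpha$ and right-sided $r_\alpha$ positive weighted
α-centroid coordinates of a set of $n$ positive histograms $h_1, ..., h_n$ are weighted α-means:

$$
r_\alpha^i = f_\alpha^{-1}\left(\sum_{j=1}^{n} w_j f_\alpha(h_j^i)\right) , \qquad l_\alpha^i = r_{-\alpha}^i
$$

with:

$$
f_\alpha(x) =
\begin{cases}
x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\
\log x & \alpha = 1.
\end{cases}
$$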

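Proof. We distinguish three cases: $\alpha \neq \pm 1$, $\alpha = -1$ and $\alpha = 1$.
First, consider the general case $\alpha \neq \pm 1$. We have to minimize:

$$
R_\alpha(x, H) = \frac{4}{1-\alpha^2}\sum_{j=1}^{n} w_j \times \sum_{i=1}^{d}\left(\frac{1-\alpha}{2}\, h_j^i + \frac{1+\alpha}{2}\, x^i - (h_j^i)^{\frac{1-\alpha}{2}} (x^i)^{\frac{1+\alpha}{2}}\right) .
$$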

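Removing all additive terms independent of $x^i$ and the overall constant multiplicative factor $\frac{4}{1-\alpha^2} \neq 0$,
we get the following equivalent minimisation problem:

$$
R'_\alpha(x, H) = \sum_{i=1}^{d} \frac{1+\alpha}{2}\, x^i - (x^i)^{\frac{1+\alpha}{2}} \underbrace{\left(\sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}}\right)}_{\bar{h}_\alpha^i} ,
\tag{A19}
$$

where $\bar{h}_\alpha^i$ denotes the following aggregation term:

$$
\bar{h}_\alpha^i = \sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}} .
$$

Setting coordinate-wise the derivative of Equation (A19) to zero (i.e., $\nabla_x R'(x, H) = 0$), we get:

$$
\frac{1+\alpha}{2} - \frac{1+\alpha}{2}\,(x^i)^{\frac{\alpha-1}{2}}\,\bar{h}_\alpha^i = 0
$$

Thus, we find that the coordinates of the right-sided α-centroids are:

$$
c_\alpha^i = (\bar{h}_\alpha^i)^{\frac{2}{1-\alpha}} = \left(\sum_{j=1}^{n} w_j (h_j^i)^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}} = \hat{h}_\alpha^i .
$$

We recognise the expression of a quasi-arithmetic mean for the strictly monotonic generator $f_\alpha(x)$:

$$
r_\alpha^i = f_\alpha^{-1}\left(\sum_{j=1}^{n} w_j f_\alpha(h_j^i)\right) ,
\tag{A20}
$$

with:

$$
f_\alpha(x) = x^{\frac{1-\alpha}{2}} , \qquad f_\alpha^{-1}(x) = x^{\frac{2}{1-\alpha}} , \qquad \alpha \neq \pm 1.
$$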

Therefore, we conclude that the coordinates of the positive α-centroid are the weighted α-means of
the histogram coordinates (for $\alpha \neq \pm 1$). Quasi-arithmetic means are also called in the literature
quasi-linear means or f-means.
When $\alpha = -1$, we search for the right-sided extended Kullback–Leibler divergence centroid by
minimising:

$$
R_{-1}(x; \tilde{H}) = \sum_{j=1}^{n} w_j \sum_{i=1}^{d} h_j^i \log\frac{h_j^i}{x^i} + x^i - h_j^i .
$$

It is equivalent to minimizing:

$$
R'_{-1}(x; \tilde{H}) = \sum_{i=1}^{d} x^i - \underbrace{\left(\sum_{j=1}^{n} w_j h_j^i\right)}_{a} \log x^i ,
$$

where $a$ denotes the arithmetic mean. Solving coordinate-wise, we get $c^i = a^i = \sum_{j=1}^{n} w_j h_j^i$.
When $\alpha = 1$, the right-sided reverse extended KL centroid is a left-sided extended KL centroid. The
minimisation problem is:

$$
R_{1}(x; \tilde{H}) = \sum_{j=1}^{n} w_j \sum_{i=1}^{d} x^i \log\frac{x^i}{h_j^i} + h_j^i - x^i .
$$

Since $\sum_j w_j = 1$, we solve coordinate-wise and find $\log x = \sum_j w_j \log h_j$. That is, $r_1^i$ is the geometric
mean:

$$
r_1^i = \prod_{j=1}^{n} (h_j^i)^{w_j} .
$$

Both the arithmetic mean and the geometric mean are power means in the limit case (and hence
quasi-arithmetic means). Thus,

$$
r_\alpha^i = f_\alpha^{-1}\left(\sum_{j=1}^{n} w_j f_\alpha(h_j^i)\right) ,
\tag{A21}
$$

with:

$$
f_\alpha(x) =
\begin{cases}
x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1, \\
\log x & \alpha = 1.
\end{cases}
$$
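The closed form of Theorem A1 is straightforward to evaluate. The following sketch (function names are ours; weights are assumed non-negative and summing to one, histograms strictly positive) computes both sided centroids coordinate-wise and compares the right-sided centroid against the objective $\sum_j w_j D_\alpha(h_j : x)$ at nearby points.

```python
import math

def f_alpha(x, alpha):
    """Generator of the weighted alpha-mean (log for alpha = 1)."""
    return math.log(x) if alpha == 1 else x ** ((1.0 - alpha) / 2.0)

def f_alpha_inv(y, alpha):
    return math.exp(y) if alpha == 1 else y ** (2.0 / (1.0 - alpha))

def right_centroid(hists, weights, alpha):
    """Right-sided alpha-centroid: coordinate-wise weighted alpha-mean."""
    d = len(hists[0])
    return [f_alpha_inv(sum(w * f_alpha(h[i], alpha) for h, w in zip(hists, weights)), alpha)
            for i in range(d)]

def left_centroid(hists, weights, alpha):
    """Left-sided alpha-centroid, using l_alpha = r_{-alpha}."""
    return right_centroid(hists, weights, -alpha)

def alpha_div(p, q, alpha):
    """D_alpha(p : q) for positive arrays and alpha != +/-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return sum(4.0 / (1.0 - alpha ** 2) * (a * pi + b * qi - pi ** a * qi ** b)
               for pi, qi in zip(p, q))

def right_objective(x, hists, weights, alpha):
    """Objective minimised by the right-sided centroid: sum_j w_j D_alpha(h_j : x)."""
    return sum(w * alpha_div(h, x, alpha) for h, w in zip(hists, weights))

hists, weights = [[1.0, 4.0], [4.0, 1.0]], [0.5, 0.5]
r = right_centroid(hists, weights, 0.5)
shifted = [r[0] + 0.01, r[1]]
print(right_objective(r, hists, weights, 0.5) <= right_objective(shifted, hists, weights, 0.5))
```

With $\alpha = -1$ the same routine returns the arithmetic mean and with $\alpha = 1$ the geometric mean, matching the two limit cases of the proof.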

References

1.
Baker, L.D.; McCallum, A.K. Distributional clustering of words for text classification. In Proceedings of the
21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,
Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 96–103.
2.
Bigi, B. Using Kullback–Leibler distance for text categorization.
In Proceedings of the 25th European
conference on IR research (ECIR), Pisa, Italy, 14–16 April 2003; Springer-Verlag: Berlin/Heidelberg, Germany,
2003; ECIR’03, pp. 305–319.
3.
Bag of Words Data Set. Available online: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words (accessed
on 17 June 2014).
4.
Csurka, G.; Bray, C.; Dance, C.; Fan, L. Visual Categorization with Bags of Keypoints; Workshop on Statistical
Learning in Computer Vision (ECCV); Xerox Research Centre Europe: Meylan, France, 2004, pp. 1–22.
5.
Jégou, H.; Douze, M.; Schmid, C. Improving Bag-of-Features for Large Scale Image Search. Int. J. Comput.
Vis. 2010, 87, 316–336.
6.
Yu, Z.; Li, A.; Au, O.; Xu, C. Bag of textons for image segmentation via soft clustering and convex shift. In
Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI,
USA, 16–21 June 2012; pp. 781–788.
7.
Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. 1956, 1, 801–804. (in
French)
8.
Lloyd, S.P. Least Squares Quantization in PCM; Technical Report RR-5497; Bell Laboratories: Murray Hill, NJ,
USA, 1957.
9.
Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
10.
Chandrasekhar, V.; Takacs, G.; Chen, D.M.; Tsai, S.S.; Reznik, Y.A.; Grzeszczuk, R.; Girod, B. Compressed
histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vis. 2012, 96, 384–399.
11.
Nock, R.; Nielsen, F.; Briys, E. Non-linear book manifolds: Learning from associations the dynamic geometry
of digital libraries. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, New
York, NY, USA, 2013; pp. 313–322.
12.
Kwitt, R.; Vasconcelos, N.; Rasiwasia, N.; Uhl, A.; Davis, B.C.; Häfner, M.; Wrba, F. Endoscopic image
analysis in semantic space. Med. Image Anal. 2012, 16, 1415–1422.
13.
Nielsen, F. A family of statistical symmetric divergences based on Jensen’s inequality. 2010, arXiv:1009.4004.
14.
Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
15.
Nock, R.; Luosto, P.; Kivinen, J. Mixed Bregman clustering with approximation guarantees. In Proceedings of
the European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium,
15–19 September 2008; Springer-Verlag: Berlin/Heidelberg, Germany, 2008; pp. 154–169.
16.
Amari, S. Integration of Stochastic Models by Minimizing α-Divergence. Neural Comput. 2007, 19, 2780–2796.
17.
Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth
Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, LA, USA, 7–9 January 2007;
Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035.
18.
Olszewski, D.; Ster, B. Asymmetric clustering using the alpha-beta divergence. Pattern Recognit. 2014,
47, 2031–2041.
19.
Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes.
IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
20.
Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res.
2005, 6, 1705–1749.
21.
Teboulle, M. A unified continuous optimization framework for center-based clustering methods. J. Mach.
Learn. Res. 2007, 8, 65–102.
22.
Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000.
23.
Morimoto, T. Markov Processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331.


24.
Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat.
Soc. Ser. B 1966, 28, 131–142.
25.
Csiszár, I. Information-type measures of difference of probability distributions and indirect observation.
Studi. Sci. Math. Hung. 1967, 2, 229–318.
26.
Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust
nonnegative matrix factorization. Entropy 2011, 13, 134–170.
27.
Zhu, H.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of
Neural Networks; Operations Research/Computer Science Interfaces Series; Ellacott, S., Mason, J., Anderson,
I., Eds.; Springer: New York, NY, USA, 1997; Volume 8, pp. 394–398.
28.
Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.
Ann. Math. Stat. 1952, 23, 493–507.
29.
Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett.
2013, 20, 269–272.
30.
Wu, J.; Rehg, J. Beyond the euclidean distance: creating effective visual codebooks using the histogram
intersection kernel. In Proceedings of 2009 IEEE 12th International Conference on Computer Vision, Kyoto,
Japan, 29 September–2 October 2009; pp. 630–637.
31.
Bhattacharya, A.; Jaiswal, R.; Ailon, N. A tight lower bound instance for k-means++ in constant dimension.
In Theory and Applications of Models of Computation; Lecture Notes in Computer Science; Gopal, T., Agrawal,
M., Li, A., Cooper, S., Eds.; Springer International Publishing: New York, NY, USA, 2014; Volume 8402, pp.
7–22.
32.
Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight
approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660.
33.
Ben-Tal, A.; Charnes, A.; Teboulle, M. Entropic means. J. Math. Anal. Appl. 1989, 139, 537–551.
34.
Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In
Proceedings of International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June
2009; pp. 71–78.
35.
Heinz, E. Beiträge zur Störungstheorie der Spektralzerlegung. Math. Ann. 1951, 123, 415–438. (in German)
36.
Besenyei, A. On the invariance equation for Heinz means. Math. Inequal. Appl. 2012, 15, 973–979.
37.
Barry, D.A.; Culligan-Hensley, P.J.; Barry, S.J. Real values of the W-function. ACM Trans. Math. Softw. 1995,
21, 161–171.
38.
Veldhuis, R.N.J. The centroid of the symmetrical Kullback–Leibler distance. IEEE Signal Process. Lett. 2002,
9, 96–99.
39.
Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. 2009, arXiv.org: 0911.4863.
40.
Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
41.
Romberg, S.; Lienhart, R. Bundle min-hashing for logo recognition.
In Proceedings of the 3rd ACM
Conference on International Conference on Multimedia Retrieval, Dallas, TX, USA, 16–19 April 2013; ACM:
New York, NY, USA, 2013; pp. 113–120.
42.
Matsuyama, Y. The alpha-EM algorithm: Surrogate likelihood maximization using alpha-logarithmic
information measures. IEEE Trans. Inf. Theory 2003, 49, 692–706.
43.
Amari, S.I. New developments of information geometry (26): Information geometry of convex programming
and game theory. In Mathematical Sciences (suurikagaku); Number 605; The Science Company: Denver, CO,
USA, 2013; pp. 65–74. (In Japanese)

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
New Riemannian Priors on the Univariate
Normal Model

Salem Said *, Lionel Bombrun and Yannick Berthoumieu

Groupe Signal et Image, CNRS Laboratoire IMS, Institut Polytechnique de Bordeaux, Université de Bordeaux,
UMR 5218, Talence, 33405, France; E-Mails: lionel.bombrun@u-bordeaux.fr (L.B.);
yannick.berthoumieu@u-bordeaux.fr (Y.B.)
*
E-Mail: salem.said@u-bordeaux.fr; Tel.:+33-(0)5-4000-6185.

Received: 17 April 2014; in revised form: 23 June 2014 / Accepted: 9 July 2014 /
Published: 17 July 2014

Abstract: The current paper introduces new prior distributions on the univariate normal model, with
the aim of applying them to the classification of univariate normal populations. These new prior
distributions are entirely based on the Riemannian geometry of the univariate normal model, so that
they can be thought of as “Riemannian priors”. Precisely, if $\{p_\theta; \theta \in \Theta\}$ is any parametrization of the
univariate normal model, the paper considers prior distributions $G(\bar{\theta}, \gamma)$ with hyperparameters $\bar{\theta} \in \Theta$
and $\gamma > 0$, whose density with respect to Riemannian volume is proportional to $\exp(-d^2(\theta, \bar{\theta})/2\gamma^2)$,
where $d^2(\theta, \bar{\theta})$ is the square of Rao’s Riemannian distance. The distributions $G(\bar{\theta}, \gamma)$ are termed
Gaussian distributions on the univariate normal model. The motivation for considering a distribution
$G(\bar{\theta}, \gamma)$ is that this distribution gives a geometric representation of a class or cluster of univariate
normal populations. Indeed, $G(\bar{\theta}, \gamma)$ has a unique mode $\bar{\theta}$ (precisely, $\bar{\theta}$ is the unique Riemannian
center of mass of $G(\bar{\theta}, \gamma)$, as shown in the paper), and its dispersion away from $\bar{\theta}$ is given by $\gamma$.
Therefore, one thinks of members of the class represented by $G(\bar{\theta}, \gamma)$ as being centered around $\bar{\theta}$
and lying within a typical distance determined by $\gamma$. The paper defines rigorously the Gaussian
distributions $G(\bar{\theta}, \gamma)$ and describes an algorithm for computing maximum likelihood estimates of
their hyperparameters. Based on this algorithm and on the Laplace approximation, it describes
how the distributions $G(\bar{\theta}, \gamma)$ can be used as prior distributions for Bayesian classification of
large univariate normal populations. In a concrete application to texture image classification,
it is shown that this leads to an improvement in performance over the use of conjugate priors.

Keywords: Fisher information; Riemannian metric; prior distribution; univariate normal distribution;
image classification

1. Introduction

In this paper, a new class of prior distributions is introduced on the univariate normal model. The
new prior distributions, which will be called Gaussian distributions, are based on the Riemannian
geometry of the univariate normal model. The paper introduces these new distributions, uncovers
some of their fundamental properties and applies them to the problem of the classification of univariate
normal populations. It shows that, in the context of a real-life application to texture image classification,
the use of these new prior distributions leads to improved performance in comparison with the use of
more standard conjugate priors.
To motivate the introduction of the new prior distributions, considered in the following, recall
some general facts on the Riemannian geometry of parametric models.
In information geometry [1], it is well known that a parametric model {pθ; θ ∈ Θ}, where Θ ⊂ Rp,
can be equipped with a Riemannian geometry, determined by Fisher’s information matrix, say I(θ).

Entropy 2014, 16, 4015–4031; doi:10.3390/e16074015
www.mdpi.com/journal/entropy

Indeed, assuming $I(\theta)$ is strictly positive definite, for each $\theta \in \Theta$, a Riemannian metric on $\Theta$ is defined
by:

$$
ds^2(\theta) = \sum_{i,j=1}^{p} I_{ij}(\theta)\, d\theta_i\, d\theta_j
\tag{1}
$$

The fact that the length element Equation (1) is invariant to any change of parametrization was realized
by Rao [2], who was the first to propose the application of Riemannian geometry in statistics.
Once the Riemannian metric Equation (1) is introduced, the whole machinery of Riemannian
geometry becomes available for application to statistical problems relevant to the parametric model
{pθ; θ ∈ Θ}. This includes the notion of Riemannian distance between two distributions, pθ and pθ′,
which is known as Rao’s distance, say d(θ, θ′), the notion of Riemannian volume, which is exactly the
same as Jeffreys prior [3], and the notion of Riemannian gradient, which can be used in numerical
optimization and coincides with the so-called natural gradient of Amari [4].
It is quite natural to apply Rao’s distance to the problem of classifying populations that belong to
the parametric model {pθ; θ ∈ Θ}. In the case where this parametric model is the univariate normal
model, this approach to classification is implemented in [5]. For more general parametric models,
beyond the univariate normal model, similar applications of Rao’s distance to problems of image
segmentation and statistical tests can be found in [6–8].
The idea of [5] is quite elegant. In general, it requires that some classes $\{S_L; L = 1, \ldots, C\}$ (based
on a learning sequence) have been identified with “centers” $\bar{\theta}_L \in \Theta$. Then, in order to assign a test
population, given by the parameter $\theta_t$, to a class $L^*$, it is proposed to choose the $L^*$ that minimizes the squared Rao
distance $d^2(\theta_t, \bar{\theta}_L)$ over $L = 1, \ldots, C$. In the specific context of the classification of univariate normal
populations [5], this leads to the introduction of hyperbolic Voronoi diagrams.
The present paper is also concerned with the case where the parametric model $\{p_\theta; \theta \in \Theta\}$ is a
univariate normal model. It starts from the idea that a class $S_L$ should be identified not only with
a center $\bar{\theta}_L$, as in [5], but also with a kind of “variance”, say $\gamma^2$, which will be called a dispersion
parameter. Accordingly, assigning a test population given by the parameter $\theta_t$ to a class $L$ should be
based on a tradeoff between the square of Rao’s distance $d^2(\theta_t, \bar{\theta}_L)$ and the dispersion parameter $\gamma^2$.
Of course, this idea has a strong Bayesian flavor. It proposes to give more “confidence” to classes
that have a smaller dispersion parameter. Thus, in order to implement it in a concrete way, the paper
starts by introducing prior distributions on the univariate normal model, which it calls Gaussian
distributions. By definition, a Gaussian distribution $G(\bar{\theta}, \gamma^2)$ has a probability density function, with
respect to Riemannian volume, given by:

$$
p(\theta \mid \bar{\theta}, \gamma) \propto \exp\left(-\frac{d^2(\theta, \bar{\theta})}{2\gamma^2}\right)
\tag{2}
$$

Given this definition of a Gaussian distribution (which is developed in a detailed way in Section 3),
classification of univariate normal populations can be carried out by associating to each class $S_L$ of
univariate normal populations a Gaussian distribution $G(\bar{\theta}_L, \gamma_L^2)$ and by assigning any test population
with parameter $\theta_t$ to the class $L^*$, which maximizes the likelihood $p(\theta_t \mid \bar{\theta}_L, \gamma_L)$, over $L = 1, \ldots, C$.
The present paper develops in a rigorous way the general approach to the classification of
univariate normal populations, which has just been described. It proceeds as follows.
Section 2, which is basically self-contained, provides the concepts, regarding the Riemannian
geometry of the univariate normal model, which will be used throughout the paper.
Section 3 introduces Gaussian distributions on the univariate normal model and uncovers some
of their general properties. In particular, Section 3.2 of this section gives a Riemannian gradient descent
algorithm for computing maximum likelihood estimates of the parameters ¯θ and γ of a Gaussian
distribution.
Section 4 states the general approach to classification of univariate normal populations proposed
in this paper. It deals with two problems: (i) given a class $S$ of univariate normal populations $S_i$, how
to fit a Gaussian distribution $G(\bar{z}, \gamma)$ to this class; and (ii) given a test univariate normal population $S_t$
and a set of classes $\{S_L, L = 1, \ldots, C\}$, how to assign $S_t$ to a suitable class $S_{L^*}$.
In the present paper, the chosen approach for resolving these two problems is marginalized
likelihood estimation, in the asymptotic framework where each univariate normal population contains
a large number of data points. In this asymptotic framework, the Laplace approximation plays a
major role [9]. In particular, it reduces the first problem, of fitting a Gaussian distribution to a class
of univariate normal populations, to the problem of maximum likelihood estimation, covered in
Section 3.2.
The final result of Section 4 is the decision rule Equation (37). This generalizes the one developed
in [5] and already explained above, by taking into account the dispersion parameter γ, in addition to
the center ¯θ, for each class.
In Section 5, the formalism of Section 4 is applied to texture image classification, using the
VisTeX image database [10]. This database is used to compare the performance obtained using
Gaussian distributions, as in Section 4, to that obtained using conjugate prior distributions. It is
shown that Gaussian distributions, proposed in the current paper, lead to a significant improvement
in performance.
Before going on, it should be noted that probability density functions of the form (2), on general
Riemannian manifolds, were considered by Pennec in [11]. However, they were not specifically used
as prior distributions, but rather as a representation of uncertainty in medical image analysis and
directional or shape statistics.

2. Riemannian Geometry of the Univariate Normal Model

The current section presents in a self-contained way the results on the Riemannian geometry of
the univariate normal model, which are required for the remainder of the paper. Section 2.1 recalls
the fact that the univariate normal model can be reparametrized, so that its Riemannian geometry is
essentially the same as that of the Poincaré upper half plane. Section 2.2 uses this fact to give analytic
formulas for distance, geodesics and integration on the univariate normal model. Finally, Section 2.3
presents, in general form, the Riemannian gradient descent algorithm.

2.1. Derivation of the Fisher Metric

This paper considers the Riemannian geometry of the univariate normal model, as based on the
Fisher metric (1). To be precise, the univariate normal model has a two-dimensional parameter space
$\Theta = \{\theta = (\mu, \sigma) \mid \mu \in \mathbb{R}, \sigma > 0\}$, and is given by:

$$
p_\theta(x) = |2\pi\sigma^2|^{-1/2} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\tag{3}
$$

where each $p_\theta$ is a probability density function with respect to the Lebesgue measure on $\mathbb{R}$. The Fisher
information matrix, obtained from Equation (3), is the following:

$$
I(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{pmatrix}
$$

As in [12], this expression can be made more symmetric by introducing the parametrization $z = (x, y)$,
where $x = \mu/\sqrt{2}$ and $y = \sigma$. This yields the Fisher information matrix:

$$
I(z) = 2 \times \begin{pmatrix} \dfrac{1}{y^2} & 0 \\ 0 & \dfrac{1}{y^2} \end{pmatrix}
$$


It is suitable to drop the factor two in this expression and introduce the following Riemannian metric
for the univariate normal model,

$$
ds^2(z) = \frac{dx^2 + dy^2}{y^2}
\tag{4}
$$

This is essentially the same as the Fisher metric (up to the factor two) and will be considered throughout
the following. The resulting Rao’s distance and Riemannian geometry are given in the following
paragraph.

2.2. Distance, Geodesics and Volume

The Riemannian metric (4), obtained in the last paragraph, happens to be a very well-known
object in differential geometry. Precisely, the parameter space $H = \{z = (x, y) \mid y > 0\}$ equipped with
the metric (4) is known as the Poincaré upper half plane and is a basic model of a two-dimensional
hyperbolic space [13].
Rao’s distance between two points $z_1 = (x_1, y_1)$ and $z_2 = (x_2, y_2)$ in $H$ can be expressed as follows
(for results in the present paragraph, see [13], or any suitable reference on hyperbolic geometry),

$$
d(z_1, z_2) = \operatorname{acosh}\left(1 + \frac{(x_1 - x_2)^2 + (y_1 - y_2)^2}{2 y_1 y_2}\right)
\tag{5}
$$

where acosh denotes the inverse hyperbolic cosine.
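Equation (5) translates directly into code. The sketch below is a minimal illustration with hypothetical function names: it maps a normal population $(\mu, \sigma)$ to half-plane coordinates as in Section 2.1, evaluates Rao's distance for the metric (4), and applies the minimum-distance assignment rule of [5] recalled in the Introduction. The class centers used in the example are made up.

```python
import math

def to_half_plane(mu, sigma):
    """Map (mu, sigma) to z = (x, y) with x = mu / sqrt(2), y = sigma."""
    return (mu / math.sqrt(2.0), sigma)

def rao_distance(z1, z2):
    """Rao's distance (5) between two points of the Poincare upper half plane."""
    (x1, y1), (x2, y2) = z1, z2
    arg = 1.0 + ((x1 - x2) ** 2 + (y1 - y2) ** 2) / (2.0 * y1 * y2)
    # Guard against arguments marginally below 1 caused by rounding.
    return math.acosh(max(1.0, arg))

def nearest_center(z_test, centers):
    """Index L* of the class center closest to z_test in Rao's distance."""
    return min(range(len(centers)), key=lambda L: rao_distance(z_test, centers[L]))

# Illustrative (made-up) class centers and test population.
centers = [to_half_plane(0.0, 1.0), to_half_plane(5.0, 1.0)]
z_t = to_half_plane(1.0, 1.2)
print(nearest_center(z_t, centers), rao_distance(z_t, centers[0]))
```

Because the factor two of the Fisher metric was dropped in (4), these distances differ from the Fisher-metric distances by a constant factor, which does not change the nearest center.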
Starting from $z_1$, in any given direction, it is possible to draw a unique geodesic ray $\gamma : \mathbb{R}_+ \to H$.
This is a curve having the property that $\gamma(0) = z_1$ and, for any $t \in \mathbb{R}_+$, if $\gamma(t) = z_2$ then $d(z_1, z_2) = t$.
In other words, the length of $\gamma$ between $z_1$ and $z_2$ is equal to the distance between $z_1$ and $z_2$.
The equation of a geodesic ray starting from $z \in H$ is conveniently written down in complex
notation (that is, by treating points of $H$ as complex numbers). To begin, consider the case of $z = i$
(which stands for $x = 0$ and $y = 1$). The geodesic in the direction making an angle $\psi$ with the y-axis is
the curve,

$$
\gamma_i(t) = \frac{e^{t/2}\cos(\psi/2)\, i - e^{-t/2}\sin(\psi/2)}{e^{t/2}\sin(\psi/2)\, i + e^{-t/2}\cos(\psi/2)}
\tag{6}
$$

In particular, $\psi = 0$ gives $\gamma_i(t) = e^{t} i$ and $\psi = \pi$ gives $\gamma_i(t) = e^{-t} i$. If $\psi$ is not a multiple of $\pi$, $\gamma_i(t)$
traces out a portion of a circle, which becomes parallel to the y-axis in the limit $t \to \infty$. For a general starting
point $z$, the geodesic ray in the direction making an angle $\psi$ with the y-axis can be written:

$$
\gamma_z(t, \psi) = x + y\,\gamma_i(t/y, \psi)
\tag{7}
$$

where $z = (x, y)$ and $\gamma_i(t, \psi)$ is given by Equation (6). A more detailed treatment of Rao’s distance (5)
and of geodesics in the Poincaré upper half plane, along with applications in image clustering, can be
found in [5].
The Riemannian volume (or area, since $H$ is of dimension 2) element corresponding to the
Riemannian metric (4) is $dA(z) = dx\,dy/y^2$. Accordingly, the integral of a function $f : H \to \mathbb{R}$ with
respect to $dA$ is given by:

$$
\int_H f(z)\, dA(z) = \int_0^{+\infty}\int_{-\infty}^{+\infty} \frac{f(x, y)}{y^2}\, dx\, dy
\tag{8}
$$

In many cases, the analytic computation of this integral can be greatly simplified by using polar
coordinates $(r, \varphi)$ defined with respect to some “origin” $\bar{z} \in H$. Polar coordinates $(r, \varphi)$ map to the
point $z(r, \varphi)$ given by:

$$
z(r, \varphi) = \gamma_{\bar{z}}\left(r, \frac{\pi}{2} - \varphi\right)
\tag{9}
$$


where the right-hand side is defined according to Equation (7). The polar coordinates $(r, \varphi)$ do indeed
define a global coordinate system of $H$, in the sense that the application that takes a complex number
$re^{i\varphi}$ to the point $z(r, \varphi)$ in $H$ is a diffeomorphism. The standard notation from differential geometry is:

$$
\exp_{\bar{z}}\left(re^{i\varphi}\right) = z(r, \varphi)
\tag{10}
$$

In these coordinates, the Riemannian metric (4) takes on the form:

$$
ds^2(z) = dr^2 + \sinh^2 r\, d\varphi^2
\tag{11}
$$

The integral Equation (8) can be computed in polar coordinates using the formula [13],

$$
\int_H f(z)\, dA(z) = \int_0^{2\pi}\int_0^{+\infty} (f \circ \exp_{\bar{z}})\left(re^{i\varphi}\right) \sinh(r)\, dr\, d\varphi
\tag{12}
$$

where $\exp_{\bar{z}}$ was defined in Equation (10) and $\circ$ denotes composition. This is particularly useful when
$f \circ \exp_{\bar{z}}$ does not depend on $\varphi$.
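To illustrate Equation (12) for a radial integrand, the sketch below integrates $f(z) = \exp(-d^2(z, \bar{z})/2\gamma^2)$, the unnormalized density of Equation (2), for which $f \circ \exp_{\bar{z}}$ depends on $r$ only. The function names, the truncation radius and the Simpson rule are our own choices; the closed form used for comparison is our own evaluation of the one-dimensional integral and should be read as an assumption to be checked, not as a formula quoted from the text.

```python
import math

def radial_integral(g, r_max, n=20000):
    """Approximate 2*pi * int_0^{r_max} g(r) sinh(r) dr by composite Simpson (n even)."""
    h = r_max / n
    total = g(0.0) * math.sinh(0.0) + g(r_max) * math.sinh(r_max)
    for k in range(1, n):
        r = k * h
        total += (4.0 if k % 2 else 2.0) * g(r) * math.sinh(r)
    return 2.0 * math.pi * total * h / 3.0

def normalizer_numeric(gamma):
    """Integral over H of exp(-d^2 / (2 gamma^2)), via Equation (12)."""
    # The integrand peaks near r = gamma^2; beyond this radius it is negligible.
    r_max = gamma ** 2 + 12.0 * gamma
    return radial_integral(lambda r: math.exp(-r * r / (2.0 * gamma ** 2)), r_max)

def normalizer_closed_form(gamma):
    """Assumed closed form: 2 pi sqrt(pi/2) gamma exp(gamma^2 / 2) erf(gamma / sqrt(2))."""
    return (2.0 * math.pi * math.sqrt(math.pi / 2.0) * gamma
            * math.exp(gamma ** 2 / 2.0) * math.erf(gamma / math.sqrt(2.0)))

for gamma in (0.5, 1.0, 2.0):
    print(gamma, normalizer_numeric(gamma), normalizer_closed_form(gamma))
```

The factor $2\pi$ comes from the angular integral, which is trivial here because the integrand does not depend on $\varphi$.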

2.3. Riemannian Gradient Descent

In this paper, the problem of minimizing, or maximizing, a differentiable function f : H → R will
play a central role. A popular way of handling the minimization of a differentiable function defined on
a Riemannian manifold (such as H) is through Riemannian gradient descent [14].
Here, the definition of Riemannian gradient is reviewed, and a generic description of Riemannian
gradient descent is provided. The Riemannian gradient of f is here defined as a mapping ∇ f : H → C
with the following property:

$$\frac{1}{y^2}\,\mathrm{Re}\{\nabla f(z)\,h^*\} = \mathrm{Re}\{df(z)\,h^*\} \tag{13}$$

for any complex number h, where Re denotes the real part, ∗ denotes conjugation and d f is the
“derivative”, d f = (∂ f /∂x) + (∂ f /∂y) i. For example, if f (z) = y, then d f = i, and it follows from
Equation (13) that ∇ f (z) = y² i.
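Since Equation (13) must hold for every complex h, it amounts to ∇ f (z) = y² d f (z). The following minimal Python sketch illustrates this; the function name `riemannian_gradient` and the representation of points of H as Python complex numbers x + iy are assumptions made for illustration only.

```python
def riemannian_gradient(df_dx, df_dy, z):
    """Riemannian gradient on the Poincare half plane from Euclidean partials.

    Equation (13) requires (1/y^2) Re{grad * conj(h)} = Re{df * conj(h)} for
    every complex h, with df = df_dx + 1j * df_dy, hence grad = y^2 * df.
    The point z = x + 1j*y is stored as a complex number (an assumption).
    """
    y = z.imag
    if y <= 0:
        raise ValueError("z must lie in the upper half plane (Im z > 0)")
    return (y ** 2) * complex(df_dx, df_dy)


# Example: f(z) = y has partial derivatives (0, 1), so df = 1j
# and the Riemannian gradient is y^2 * 1j.
z = complex(0.3, 2.0)
g = riemannian_gradient(0.0, 1.0, z)
```

For f (z) = y the result is purely imaginary, since d f = i.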
Riemannian gradient descent consists in following the direction of −∇ f at each step, with the
length of the step (in other words, the step size) being determined by the user. The generic algorithm
is, up to some variations, the following:

INPUT      ˆz ∈ H                        % Initial guess
WHILE      ∥∇ f (ˆz)∥ > ε                % ε ≈ 0 machine precision
           ˆz ← expˆz (−λ∇ f (ˆz))       % λ > 0 step size, depends on ˆz
END WHILE
OUTPUT     ˆz                            % near critical point of f

Here, in the condition for the while loop, ∥∇ f (zk)∥ is the Riemannian norm of the gradient
∇ f (zk), where zk = (xk, yk) denotes the current estimate ˆz. In other words,

$$\|\nabla f(z_k)\|^2 = \frac{1}{y_k^2}\,\mathrm{Re}\{\nabla f(z_k)\,\nabla f(z_k)^*\}$$

Just like a classical gradient descent algorithm, the above Riemannian gradient descent consists in
following the direction of the negative gradient −∇ f (ˆz), in order to define a new estimate. This is
repeated as long as the gradient is sensibly nonzero, in the sense of the loop condition.
The generic algorithm described above has no guarantee of convergence. Convergence and
behavior near limit points depend on the function f, on the initialization of the algorithm and on the
step sizes λ. For these aspects, the reader may consult [14] (Chapter 4).

3. Riemannian Prior on the Univariate Normal Model

The current section introduces new prior distributions on the univariate normal model. These may
be referred to as “Riemannian priors”, since they are entirely based on the Riemannian geometry of
this model, and will also be called “Gaussian distributions”, when viewed as probability distributions
on the Poincaré half plane.
Here, Section 3.1 rigorously defines Gaussian distributions on H (based on the intuitive
Formula (2)). A Gaussian distribution G(¯z, γ) has two parameters, ¯z ∈ H, called the center of mass, and
γ > 0, called the dispersion parameter. Section 3.2 uses the Riemannian gradient descent algorithm
of Section 2.3 to provide an algorithm for computing maximum likelihood estimates of ¯z and γ. Finally,
Section 3.3 proves that ¯z is the Riemannian center of mass, or Karcher mean, of the distribution G(¯z, γ)
(historically, it is more correct to speak of the “Fréchet mean”, since this concept was proposed by
Fréchet in 1948 [15]), and that γ is uniquely related to the mean squared Rao distance from ¯z.
The results of Section 3.3 are not used in the following, so that section may be skipped on a
first reading.

3.1. Gaussian Distributions on H

A Gaussian distribution G(¯z, γ) on H is a probability distribution with the following probability
density function:

$$p(z|\bar z,\gamma) = \frac{1}{Z(\gamma)}\exp\left(-\frac{d^2(z,\bar z)}{2\gamma^2}\right) \tag{14}$$

Here, ¯z ∈ H is called the center of mass and γ > 0 the dispersion parameter of the distribution G(¯z, γ).
The squared distance d2(z, ¯z) refers to Rao’s distance (5). The probability density function (14) is
understood with respect to the Riemannian volume element dA(z). In other words, the normalization
constant Z(γ) is given by:

$$Z(\gamma) = \int_H f(z)\,dA(z) \qquad f(z) = \exp\left(-\frac{d^2(z,\bar z)}{2\gamma^2}\right)$$

Using polar coordinates, as in Equation (12), it is possible to calculate this integral explicitly. To do so,
let (r, ϕ) be polar coordinates whose origin is ¯z. Then, d²(z, ¯z) = r² when z = z(r, ϕ), as in Equation (9).
It follows that:

$$(f\circ\exp_{\bar z})(r,\varphi) = \exp\left(-\frac{r^2}{2\gamma^2}\right) \tag{15}$$

According to Equation (12), the integral Z(γ) reduces to:

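$$Z(\gamma) = \int_0^{2\pi}\int_0^{+\infty} \exp\left(-\frac{r^2}{2\gamma^2}\right)\sinh(r)\,dr\,d\varphi$$

which is readily calculated,

$$Z(\gamma) = 2\pi\times\sqrt{\frac{\pi}{2}}\,\gamma\times e^{\gamma^2/2}\times\mathrm{erf}\left(\frac{\gamma}{\sqrt{2}}\right) \tag{16}$$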

where erf denotes the error function. Formula (16) completes the definition of the Gaussian distribution
G(¯z, γ). This definition is the same as suggested in [11], with the difference that, in the present work, it
has been possible to compute exactly the normalization constant Z(γ).
It is noteworthy that the normalization constant Z(γ) depends only on γ and not on ¯z. This shows
that the shape of the probability density function (14) does not depend on ¯z, which only plays the role
of a location parameter. At a deeper mathematical level, this reflects the fact that H is a homogeneous
Riemannian space [13].
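As a sanity check on Formula (16), the radial integral obtained from Equation (12) can be evaluated numerically and compared with the closed form. The sketch below uses only the Python standard library; the function names, the composite Simpson rule and the truncation radius are choices made here for illustration, not part of the original development.

```python
import math


def z_closed_form(gamma):
    """Normalization constant Z(gamma) as given by Formula (16)."""
    return (2.0 * math.pi * math.sqrt(math.pi / 2.0) * gamma
            * math.exp(gamma ** 2 / 2.0) * math.erf(gamma / math.sqrt(2.0)))


def z_numeric(gamma, n=20000):
    """Composite Simpson approximation of
    2*pi * integral_0^R exp(-r^2 / (2 gamma^2)) * sinh(r) dr,
    with R chosen (an assumption) so that the neglected tail is negligible."""
    upper = gamma ** 2 + 12.0 * gamma
    step = upper / n  # n must be even for Simpson's rule

    def integrand(r):
        return math.exp(-r * r / (2.0 * gamma ** 2)) * math.sinh(r)

    total = integrand(0.0) + integrand(upper)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * integrand(k * step)
    return 2.0 * math.pi * total * step / 3.0


for gamma in (0.3, 1.0, 2.0):
    print(gamma, z_closed_form(gamma), z_numeric(gamma))
```

The two columns printed for each γ should agree to many digits, which is consistent with Z(γ) depending on γ alone.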

The probability density function (14) bears a clear resemblance to the usual Gaussian (or normal)
probability density function. Indeed, both are proportional to the exponential of minus the “squared
distance”, but in one case, the distance is interpreted as Euclidean distance and, in the other (that of
Equation (14)), as Rao’s distance.

3.2. Maximum Likelihood Estimation of ¯z and γ

Consider the problem of computing maximum likelihood estimates of the parameters ¯z and γ of
the Gaussian distribution G(¯z, γ), based on independent samples $\{z_i\}_{i=1}^N$ from this distribution. Given
the expression (14) of the density p(z|¯z, γ), the log-likelihood function ℓ(¯z, γ) can be written,

$$\ell(\bar z,\gamma) = -N\log\{Z(\gamma)\} - \frac{1}{2\gamma^2}\sum_{i=1}^{N} d^2(z_i,\bar z) \tag{17}$$

Since ¯z only appears in the second term, the maximum likelihood estimate of ¯z, say ˆz, can be computed
first. It is given by the minimization problem:

$$\hat z = \mathrm{argmin}_{z\in H}\ \frac{1}{2}\sum_{i=1}^{N} d^2(z_i,z) \tag{18}$$

In other words, the maximum likelihood estimate ˆz minimizes the sum of squared Rao distances to the
samples zi. This exhibits ˆz as the Riemannian center of mass, also called the Karcher or the Fréchet
mean [16], of the samples zi.
The notion of Riemannian center of mass is currently a widely popular one in signal and image
processing, with applications ranging from blind source separation and radar signal processing [17,18]
to shape and motion analysis [19,20]. The definition of Gaussian distributions, proposed in the
present paper, shows how the notion of Riemannian center of mass is related to maximum likelihood
estimation, thereby giving it a statistical foundation.
An original result, due to Cartan and cited in [16], states that ˆz, as defined in
Equation (18), exists and is unique, since H, with the Riemannian metric (4), has constant negative
curvature. Here, ˆz is computed using Riemannian gradient descent, as described in Section 2.3. The
cost function f to be minimized is given by (the factor N⁻¹ is conventional),

$$f(z) = \frac{1}{2N}\sum_{i=1}^{N} d^2(z_i,z) \tag{19}$$

Its Riemannian gradient ∇ f (z) is easily found by noting the following fact. Let fi(z) = (1/2)d²(z, zi).
Then, the Riemannian gradient of this function is (see [21] (page 407)),

$$\nabla f_i(z) = -\log_z(z_i) \tag{20}$$

where logz : H → C is the inverse of expz : C → H. The minus sign reflects the fact that fi decreases
when z moves towards zi, that is, in the direction of logz(zi). It follows from Equation (20) that,

$$\nabla f(z) = -\frac{1}{N}\sum_{i=1}^{N}\log_z(z_i) \tag{21}$$

The analytic expression of logz, for any z ∈ H, will be given below (see Equation (23)).
Here, the gradient descent algorithm for computing ˆz is described. This algorithm uses a constant
step size λ, which is fixed manually.

Once the maximum likelihood estimate ˆz has been computed, using the gradient descent
algorithm, the maximum likelihood estimate of γ, say ˆγ, is found by solving the equation:

$$F(\gamma) = \frac{1}{N}\sum_{i=1}^{N} d^2(z_i,\hat z) \quad\text{where}\quad F(\gamma) = \gamma^3\times\frac{d}{d\gamma}\log\{Z(\gamma)\} \tag{22}$$
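Equation (22) is a scalar equation in γ, so it can be solved by any one-dimensional root finder. The sketch below is one possible way to do this and is not taken from the paper: the closed form used for F is obtained here by differentiating the logarithm of Formula (16), and the bisection relies on the assumption that F is increasing on (0, ∞). The names `F` and `solve_gamma` are illustrative.

```python
import math


def F(gamma):
    """F(gamma) = gamma^3 * d/dgamma log Z(gamma), with Z taken from Formula (16).

    Differentiating log Z = const + log(gamma) + gamma^2/2 + log erf(gamma/sqrt(2))
    gives d/dgamma log Z = 1/gamma + gamma
                           + sqrt(2/pi) * exp(-gamma^2/2) / erf(gamma/sqrt(2)).
    """
    ratio = (math.sqrt(2.0 / math.pi) * math.exp(-gamma ** 2 / 2.0)
             / math.erf(gamma / math.sqrt(2.0)))
    return gamma ** 2 + gamma ** 4 + gamma ** 3 * ratio


def solve_gamma(mean_sq_dist, tol=1e-12):
    """Bisection for F(gamma) = mean_sq_dist (assumes F is increasing)."""
    if mean_sq_dist <= 0.0:
        raise ValueError("mean squared distance must be positive")
    lo, hi = 1e-8, 1.0
    while F(hi) < mean_sq_dist:  # enlarge the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if F(mid) < mean_sq_dist:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Round trip: recover a known dispersion from its own value of F.
gamma_hat = solve_gamma(F(0.7))
```

In practice, the argument of `solve_gamma` would be the average of the squared Rao distances d²(zi, ˆz) appearing on the right-hand side of Equation (22).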

The gradient descent algorithm for computing ˆz is the following,

INPUT      {z1, . . . , zN}              % N independent samples from G(¯z, γ)
           ˆz ∈ H                        % Initial guess
WHILE      ∥∇ f (ˆz)∥ > ε                % ε ≈ 0 machine precision
           ˆz ← expˆz (−λ∇ f (ˆz))       % ∇ f (ˆz) given by Equation (21); step size λ is constant
END WHILE
OUTPUT     ˆz                            % near Riemannian center of mass

Application of Formula (21) requires computation of logˆz(zi) for i = 1, . . . , N. Fortunately, this
can be done analytically as follows. In general, for ˆz = ( ¯x, ¯y),

$$\log_{\hat z}(z) = \bar y\,\log_i\left(\frac{z-\bar x}{\bar y}\right) \tag{23}$$

where logi is found by inverting Equation (6). Precisely,

$$\log_i(z) = re^{i\varphi} \tag{24}$$

where, for z = (x, y) with x ≠ 0,

$$r = \mathrm{acosh}\left(1 + \frac{x^2 + (y-1)^2}{2y}\right)$$

and:

$$\cos(\varphi) = \frac{x}{y\sinh(r)} \qquad \sin(\varphi) = \frac{\cosh(r) - y^{-1}}{\sinh(r)}$$

and, for z = (0, y),

$$\log_i(z) = \ln(y)\,i$$

with ln denoting the natural logarithm.
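The formulas above are enough to sketch the whole computation of ˆz in Python. The following is a minimal illustration under stated assumptions, not the authors' implementation: points of H are stored as complex numbers x + iy; `exp_i` is obtained here by solving the formulas for r and ϕ for (x, y), rather than from Equation (6); the gradient of the cost function (19) is taken to be minus the average of the logz(zi), which is the sign needed for the update ˆz ← expˆz(−λ∇ f (ˆz)) to decrease f; and the step size, tolerance and iteration cap are arbitrary choices.

```python
import math


def log_i(w):
    """Equation (24): inverse exponential map at the base point i."""
    x, y = w.real, w.imag
    r = math.acosh(1.0 + (x * x + (y - 1.0) ** 2) / (2.0 * y))
    if r == 0.0:
        return 0j
    s = math.sinh(r)
    # For x = 0 these formulas give (0, +1) or (0, -1), i.e. log_i(w) = ln(y) * i.
    return r * complex(x / (y * s), (math.cosh(r) - 1.0 / y) / s)


def exp_i(v):
    """Exponential map at i, obtained by inverting the formulas used in log_i."""
    r = abs(v)
    if r == 0.0:
        return 1j
    cos_phi, sin_phi = v.real / r, v.imag / r
    y = 1.0 / (math.cosh(r) - math.sinh(r) * sin_phi)
    return complex(y * math.sinh(r) * cos_phi, y)


def log_map(z, w):
    """Equation (23): tangent vector at z pointing towards w."""
    return z.imag * log_i((w - z.real) / z.imag)


def exp_map(z, v):
    """Inverse of log_map at z."""
    return z.real + z.imag * exp_i(v / z.imag)


def riemannian_norm(z, v):
    """Riemannian norm of a tangent vector v at z = (x, y): |v| / y."""
    return abs(v) / z.imag


def karcher_mean(samples, step=0.5, tol=1e-10, max_iter=1000):
    """Constant-step Riemannian gradient descent for the cost function (19)."""
    z = samples[0]  # initial guess (an arbitrary choice)
    for _ in range(max_iter):
        grad = -sum(log_map(z, w) for w in samples) / len(samples)
        if riemannian_norm(z, grad) <= tol:
            break
        z = exp_map(z, -step * grad)
    return z


samples = [complex(-1.0, 1.0), complex(1.0, 1.0)]
z_hat = karcher_mean(samples)  # by symmetry, lies on the imaginary axis
```

With two samples, the center of mass is the midpoint of the geodesic joining them, which gives a simple way to check the output.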

3.3. Significance of ¯z and γ

The parameters ¯z and γ of a Gaussian distribution G(¯z, γ) have been called the center of mass
and the dispersion parameter. In the present section, it is proven that,

$$\bar z = \mathrm{argmin}_{z\in H}\ \frac{1}{2}\int_H d^2(z',z)\,p(z'|\bar z,\gamma)\,dA(z') \tag{25}$$

and also that:

$$F(\gamma) = \int_H d^2(z',\bar z)\,p(z'|\bar z,\gamma)\,dA(z') \tag{26}$$

where F(γ) was defined in Equation (22) and p(z′|¯z, γ) is the probability density function of G(¯z, γ),
given in Equation (14).

Note that Equations (25) and (26) are asymptotic versions of Equations (18) and (22). Indeed,
Equations (25) and (26) can be written:

$$\bar z = \mathrm{argmin}_{z\in H}\ \frac{1}{2}E_{\bar z,\gamma}\,d^2(z',z) \qquad F(\gamma) = E_{\bar z,\gamma}\,d^2(z,\bar z) \tag{27}$$

where E¯z,γ denotes the expectation with respect to G(¯z, γ), and the expectation is carried out on the
variable z′ in the first formula. Now, these two formulae are the same as Equations (18) and (22), but
with expectation instead of empirical mean.
Note, moreover, that Equations (25) and (26) can be interpreted as follows. If z′ is distributed
according to the Gaussian distribution G(¯z, γ), then Equation (25) states that ¯z is the unique point, out
of all z ∈ H, which minimizes the expectation of squared Rao’s distance to z′. Moreover, Equation (26)
states that the expectation of squared Rao’s distance between ¯z and z′ is equal to F(γ), so F(γ) is the
least possible expected squared Rao’s distance between a point z ∈ H and z′. This interpretation
justifies calling ¯z the center of mass of G(¯z, γ) and shows that γ is uniquely related to the expected
dispersion, as measured by squared Rao’s distance, away from ¯z.
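The identity F(γ) = E d²(z′, ¯z) of Equation (26) can be checked numerically. In polar coordinates centered at ¯z, Equation (12) reduces the expectation to a ratio of two radial integrals, which the sketch below compares with γ³ (d/dγ) log Z(γ), computed by a central difference of Formula (16). The quadrature rule, the truncation radius and the finite-difference step are assumptions made for this illustration.

```python
import math


def log_Z(gamma):
    """Logarithm of the normalization constant of Formula (16)."""
    return (math.log(2.0 * math.pi * math.sqrt(math.pi / 2.0) * gamma)
            + gamma ** 2 / 2.0
            + math.log(math.erf(gamma / math.sqrt(2.0))))


def F(gamma, eps=1e-5):
    """F(gamma) = gamma^3 * d/dgamma log Z(gamma), via a central difference."""
    return gamma ** 3 * (log_Z(gamma + eps) - log_Z(gamma - eps)) / (2.0 * eps)


def simpson(func, a, b, n=20000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    step = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * func(a + k * step)
    return total * step / 3.0


def expected_sq_distance(gamma):
    """E[d^2(z', z_bar)]: ratio of radial integrals with weight
    exp(-r^2 / (2 gamma^2)) * sinh(r); the factor 2*pi cancels out."""
    upper = gamma ** 2 + 12.0 * gamma  # beyond this radius the weight is negligible

    def weight(r):
        return math.exp(-r * r / (2.0 * gamma ** 2)) * math.sinh(r)

    numerator = simpson(lambda r: r * r * weight(r), 0.0, upper)
    denominator = simpson(weight, 0.0, upper)
    return numerator / denominator


for gamma in (0.5, 1.0, 1.5):
    print(gamma, F(gamma), expected_sq_distance(gamma))
```

For small γ both quantities approach 2γ², the value for a Gaussian distribution in the Euclidean plane.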
In order to prove Equation (25), consider the log-likelihood function,

$$\ell(\bar z,\gamma;z) = -\log\{Z(\gamma)\} - \frac{1}{2\gamma^2}\,d^2(z,\bar z) \tag{28}$$

Let fz(¯z) = (1/2)d²(z, ¯z). The score function, with respect to ¯z, is by definition the Riemannian
gradient of the log-likelihood (28),

$$\nabla_{\bar z}\,\ell(\bar z,\gamma;z) = -\frac{1}{\gamma^2}\,\nabla f_z(\bar z) \tag{29}$$

where ∇¯z indicates that the Riemannian gradient (defined in Equation (13) of Section 2.3) is taken with
respect to the variable ¯z. Under certain regularity conditions, which are here easily verified, the
expectation of the score function is identically zero, so that

$$E_{\bar z,\gamma}\nabla f_z(\bar z) = 0 \tag{30}$$

Let f (z) be defined by:

$$f(z) = E_{\bar z,\gamma}\,f_{z'}(z) = \frac{1}{2}E_{\bar z,\gamma}\,d^2(z',z)$$

with the expectation carried out on the variable z′. Clearly, f (z) is the expression to be minimized
in Equation (25) (or in the first formula in Equation (27), which is just the same). By interchanging
Riemannian gradient and expectation,

$$\nabla f(\bar z) = E_{\bar z,\gamma}\nabla f_z(\bar z) = 0$$

where the last equality follows from Equation (30).
It has just been proved that ¯z is a stationary point of f (a point where the gradient is zero).
Theorem 2.1 in [16] states that the function f has one and only one stationary point, which is moreover a
global minimizer. This concludes the proof of Equation (25).
The proof of Equation (26) follows exactly the same method, defining the score function with
respect to γ and noting that its expectation is identically zero.

4. Classification of Univariate Normal Populations

The previous section studied Gaussian distributions on H, “as they stand”, focusing on the
fundamental issue of maximum likelihood estimation of their parameters. The present Section
considers the use of Gaussian distributions as prior distributions on the univariate normal model.
The main motivation behind the introduction of Gaussian distributions is that a Gaussian
distribution G(¯z, γ) can be used to give a geometric representation of a cluster or class of univariate
normal populations. Recall that each point (x, y) ∈ H is identified with a univariate normal population
with mean μ = √2 x and standard deviation σ = y. The idea is that populations belonging to the same
cluster, represented by G(¯z, γ), should be viewed as centered on ¯z and lying within a typical distance
determined by γ.
In the remainder of this Section, it is shown how the maximum likelihood estimation algorithm of
Section 3.2 can be used to fit the hyperparameters ¯z and γ to data, consisting in a class S = {Si; i =
1, . . . , K} of univariate normal populations. This is then applied to the problem of the classification
of univariate normal populations. The whole development is based on marginalized likelihood
estimation, as follows.
Assume each population Si contains Ni points, Si = {sj; j = 1, . . . , Ni}, and the points sj, in any
class, are drawn from a univariate normal distribution with mean μ and standard deviation σ. The
focus will be on the asymptotic case where the number Ni of points in each population Si is large.
In order to fit the hyperparameters ¯z and γ to the data S, assume moreover that the distribution
of z = (x, y), where (x, y) = (μ/√2, σ), is a Gaussian distribution G(¯z, γ). Then, the distribution of S
can be written in integral form:

$$p(S|\bar z,\gamma) = \prod_{i=1}^{K}\int_H p(S_i|z)\,p(z|\bar z,\gamma)\,dA(z) \tag{31}$$

where p(z|¯z, γ) is the probability density of a Gaussian distribution G(¯z, γ), defined in Equation (14).
Moreover, expressing p(Si|z) as a product of univariate normal distributions p(sj|z), it follows,

$$p(S|\bar z,\gamma) = \prod_{i=1}^{K}\int_H \prod_{j=1}^{N_i} p(s_j|z)\,p(z|\bar z,\gamma)\,dA(z) \tag{32}$$

This expression, given the data S, is to be maximized over (¯z, γ). Using the Laplace approximation,
this task is reduced to the maximum likelihood estimation problem, addressed in Section 3.2.
The Laplace approximation will here be applied in its “basic form” [9], that is, up to terms of
order $N_i^{-1}$. To do so, write each of the integrals in Equation (32), using Equation (8) of Section 2.2.
These integrals then take on the form:

$$\int_0^{+\infty}\int_{-\infty}^{+\infty}\ \prod_{j=1}^{N_i} |2\pi y^2|^{-1/2}\exp\left(-\frac{\left(s_j - \sqrt{2}\,x\right)^2}{2y^2}\right)\times p(z|\bar z,\gamma)\times\frac{1}{y^2}\,dx\,dy \tag{33}$$

where the univariate normal distribution p(sj|z) has been replaced by its full expression. Now, the
product of these normal densities can be written $\prod_{j=1}^{N_i} p(s_j|z) = \exp[-N_i h(x,y)]$, where:

$$h(x,y) = \frac{1}{2}\ln\left(2\pi y^2\right) + \frac{B_i^2 + V_i^2}{2y^2}$$

Here, $B_i$ and $V_i^2$ are the empirical bias and variance, within population Si,

$$B_i = \hat S_i - \sqrt{2}\,x \qquad V_i^2 = N_i^{-1}\sum_{j=1}^{N_i}\left(\hat S_i - s_j\right)^2$$

where $\hat S_i$ is the empirical mean of the population, $\hat S_i = N_i^{-1}\sum_{j=1}^{N_i} s_j$.
The expression h(x, y) is minimized (so that the product of densities is maximized) when x = ˆxi and
y = ˆyi, where ˆzi = ( ˆxi, ˆyi) is the couple of maximum likelihood estimates of the parameters (x, y),
based on the population Si.
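Concretely, the bias Bi vanishes when √2 x equals the empirical mean, and the optimal y is then the empirical standard deviation Vi. A minimal Python sketch of this step is given below; the function name `population_to_half_plane` and the use of a complex number to store ˆzi = ( ˆxi, ˆyi) are assumptions made for illustration.

```python
import math


def population_to_half_plane(samples):
    """Map a population S_i to its maximum likelihood point (x_hat, y_hat) in H.

    x_hat = empirical mean / sqrt(2), y_hat = empirical standard deviation V_i,
    where the variance divides by N_i (not N_i - 1), as in the definition of V_i^2.
    The point is returned as the complex number x_hat + 1j * y_hat.
    """
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    return complex(mean / math.sqrt(2.0), math.sqrt(variance))


z_hat_i = population_to_half_plane([1.0, 3.0])  # mean 2, variance 1
```

A population with zero empirical variance would give ˆyi = 0, which lies outside H; such degenerate inputs are not handled in this sketch.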

According to the Laplace approximation, the integral Equation (33) is equal to:

$$2\pi\left|\partial^2 h(\hat x_i,\hat y_i)\right|^{-1/2}\times\exp\left[-N_i h(\hat x_i,\hat y_i)\right]\times p(\hat z_i|\bar z,\gamma)\times\frac{1}{\hat y_i^2} + O(N_i^{-1})$$

where ∂²h( ˆxi, ˆyi) is the matrix of second derivatives of h, and | · | denotes the determinant. Now, since
h is, up to sign and normalization by Ni, the logarithm of the likelihood of the population Si, a direct
calculation shows that ∂²h( ˆxi, ˆyi) is the same as the Fisher information matrix derived in Section 2.1
(where it was denoted I(z)). Thus, the first factor in the above expression is $2\pi\hat y_i^2$, and cancels out
with the last factor.
Finally, the Laplace approximation of the integral Equation (33) reads:

$$2\pi\times\exp\left[-N_i h(\hat x_i,\hat y_i)\right]\times p(\hat z_i|\bar z,\gamma) + O(N_i^{-1})$$

and the resulting approximation of the distribution of S, as given by Equation (32), can be written:

$$p(S|\bar z,\gamma) \approx \prod_{i=1}^{K}\alpha\times p(\hat z_i|\bar z,\gamma) \tag{34}$$

where α is a constant, which does not depend either on the data or on the parameters, and p(ˆzi|¯z, γ)
has the expression (14).
Accepting this expression to give the distribution of the data S, conditionally on the
hyperparameters (¯z, γ), the task of estimating these hyperparameters becomes the same as the
maximum likelihood estimation problem, described in Section 3.2.
In conclusion, if one assumes the populations Si belong to a single cluster or class S and wishes
to fit the hyperparameters ¯z and γ of a Gaussian distribution representing this cluster, it is enough to
start by computing the maximum likelihood estimates ˆxi and ˆyi for each population Si and then to
consider these as input to the maximum likelihood estimation algorithm described in Section 3.2.
The same reasoning just carried out, using the Laplace approximation, can be generalized to
the problem of classification of univariate normal populations. Indeed, assume that classes {SL, L =
1, . . . , C}, each containing some number KL of univariate normal populations, have been identified
based on some training sequence. Using the Laplace approximation and the maximum likelihood
estimation approach of Section 3.2, to each one of these classes, it is possible to fit hyperparameters
(¯zL, γL) of a Gaussian distribution G(¯zL, γL) on H.
For a test population St, the maximum likelihood rule, for deciding which of the classes SL this
test population St belongs to, requires finding the following maximum:

$$L^* = \mathrm{argmax}_L\ p(S_t|\bar z_L,\gamma_L) \tag{35}$$

and assigning the test population St to the class with label L∗. If the number of points Nt in the
population St is large, the Laplace approximation, in the same way used above, approximates the
maximum in Equation (35) by:

$$L^* = \mathrm{argmax}_L\ p(\hat z_t|\bar z_L,\gamma_L) \tag{36}$$

where ˆzt = ( ˆxt, ˆyt) is the couple of maximum likelihood estimates computed based on the test
population St and where p(ˆzt|¯zL, γL) is given by Equation (14). Now, writing out Equation (14), the
decision rule becomes:

$$L^* = \mathrm{argmax}_L\left\{-\log\{Z(\gamma_L)\} - \frac{1}{2\gamma_L^2}\,d^2(\hat z_t,\bar z_L)\right\} \tag{37}$$

Under the homoscedasticity assumption, that all of the γL are equal, this decision rule essentially
becomes the same as the one proposed in [5], which requires St to be assigned to the “nearest” cluster,
in terms of Rao’s distance. Indeed, if all the γL are equal, then Equation (37) is the same as,

$$L^* = \mathrm{argmin}_L\ d^2(\hat z_t,\bar z_L) \tag{38}$$

This decision rule is expected to be less efficient than the one proposed in Equation (37), which also takes
into account the uncertainty associated with each cluster, as measured by its dispersion parameter γL.
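The difference between the two rules can be illustrated with a short Python sketch. Everything in it is an illustrative assumption rather than the authors' code: the class centers and dispersions are made-up values, points of H are stored as complex numbers, and Rao's distance is computed with the acosh expression that follows from Equations (23) and (24), namely d(z, w) = acosh(1 + |z − w|²/(2 Im z Im w)).

```python
import math


def rao_distance(z, w):
    """Rao's distance between z and w in H (points stored as x + 1j*y)."""
    return math.acosh(1.0 + abs(z - w) ** 2 / (2.0 * z.imag * w.imag))


def log_Z(gamma):
    """Logarithm of the normalization constant of Formula (16)."""
    return (math.log(2.0 * math.pi * math.sqrt(math.pi / 2.0) * gamma)
            + gamma ** 2 / 2.0
            + math.log(math.erf(gamma / math.sqrt(2.0))))


def classify_ml(z_test, classes):
    """Rule (37): maximize -log Z(gamma_L) - d^2(z_test, center_L) / (2 gamma_L^2).

    `classes` maps a label to a (center, gamma) pair."""
    def score(item):
        center, gamma = item[1]
        return -log_Z(gamma) - rao_distance(z_test, center) ** 2 / (2.0 * gamma ** 2)
    return max(classes.items(), key=score)[0]


def classify_nearest(z_test, classes):
    """Rule (38): assign to the class whose center is nearest in Rao's distance."""
    return min(classes.items(), key=lambda item: rao_distance(z_test, item[1][0]))[0]


# Hypothetical classes: "A" is tightly concentrated, "B" is widely dispersed.
classes = {"A": (complex(0.0, 1.0), 0.2), "B": (complex(2.0, 1.0), 1.5)}
z_test = complex(0.8, 1.0)
label_ml = classify_ml(z_test, classes)
label_nearest = classify_nearest(z_test, classes)
```

In this made-up configuration the test point is closer to the center of "A", but it lies several dispersions away from it, so Rule (37) prefers the widely dispersed class "B" while Rule (38) selects "A".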

5. Application to Image Classification

In this section, the framework proposed in Section 4, for classification of univariate normal
populations, is applied to texture image classification using Gabor filters. Several authors have found
that Gabor energy features are well-suited texture descriptors. In the following, consider 24 Gabor
energy sub-bands that are the result of three scales and eight orientations. Hence, each texture image
can be decomposed as the collection of those 24 sub-bands. For more information concerning the
implementation, the interested reader is referred to [22].
Starting from the VisTex database of 40 images [10] (these are displayed in Figure 1), each image
was divided into 16 non-overlapping subimages of 128 × 128 pixels each. A training sequence was
formed by choosing randomly eight subimages out of each image. To each subimage in the training
sequence, a bank of 24 Gabor filters was applied. The result of applying a Gabor filter with scale s
and orientation o to a subimage i belonging to an image L is a univariate normal population Si,s,o of
128 × 128 points (one point for each pixel, after the filter is applied).

Figure 1. Forty images of the VisTex database.

These populations Si,s,o (called sub-bands) are considered independent, each one of them
univariate normal with mean μi,s,o = √2 xi,s,o, standard deviation σi,s,o = yi,s,o and with zi,s,o =
(xi,s,o, yi,s,o). The couple of maximum likelihood estimates for these parameters is denoted ˆzi,s,o =
( ˆxi,s,o, ˆyi,s,o). An image L (recall, there are 40 images) contains, in each sub-band, eight populations
Si,s,o, with which hyperparameters ¯zL,s,o and γL,s,o are associated, by applying the maximum likelihood
estimation algorithm of Section 3.2 to the inputs ˆzi,s,o.
If St is a test subimage, then one should begin by applying the 24 Gabor filters to it, obtaining
independent univariate normal populations St,s,o, and then compute for each population the couple
of maximum likelihood estimates ˆzt,s,o = ( ˆxt,s,o, ˆyt,s,o). The decision rule Equation (37) of Section 4
requires that St should be assigned to the image L∗, which realizes the maximum:

$$L^* = \mathrm{argmax}_L\ \sum_{s,o}\left[-\log\{Z(\gamma_{L,s,o})\} - \frac{1}{2\gamma_{L,s,o}^2}\,d^2(\hat z_{t,s,o},\bar z_{L,s,o})\right] \tag{39}$$

When considering the homoscedasticity assumption, i.e., γL,s,o = γs,o for all L, this decision rule
becomes:

$$L^* = \mathrm{argmin}_L\ \sum_{s,o} d^2(\hat z_{t,s,o},\bar z_{L,s,o}) \tag{40}$$

For this concrete application, to the VisTex database, it is pertinent to compare the rate of successful
classification (or overall accuracy) obtained using the Riemannian prior, based on the framework
of Section 4, to that obtained using a more classical conjugate prior, i.e., a normal-inverse gamma
distribution of the mean μ = √2 x and the standard deviation σ = y. This conjugate prior is given by:

$$p(\mu|\sigma,\mu_p,\kappa_p) = \frac{\sqrt{\kappa_p}}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\kappa_p}{2\sigma^2}(\mu-\mu_p)^2\right)$$

with an inverse gamma prior, on σ²,

$$p(\sigma^2|\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\left(\sigma^2\right)^{-(\alpha+1)}\exp\left(-\frac{\beta}{\sigma^2}\right) \tag{41}$$

Using this conjugate prior, instead of a Riemannian prior, and following the procedure of applying the
Laplace approximation, a different decision rule is obtained, where L∗ is taken to be the maximum of
the following expression:

$$\sum_{s,o}\left[\frac{\ln\kappa_{pL,s,o}}{2} - \frac{\kappa_{pL,s,o}}{2\hat y_{t,s,o}^2}\left(\sqrt{2}\,\hat x_{t,s,o} - \mu_{pL,s,o}\right)^2 + \alpha_{L,s,o}\ln\beta_{L,s,o} - \ln\Gamma(\alpha_{L,s,o}) - 2(\alpha_{L,s,o}+1)\ln\hat y_{t,s,o} - \frac{\beta_{L,s,o}}{\hat y_{t,s,o}^2}\right] \tag{42}$$

where, as in Equation (39), ˆxt,s,o and ˆyt,s,o are the maximum likelihood estimates computed for the
population St,s,o.
Both the Riemannian and conjugate priors have been applied to the VisTex database, with half of
the database used for training and half for testing. In the course of 100 Monte Carlo runs, a significant
gain of about 3% is observed with the Riemannian prior compared to the conjugate prior. This is
summarized in the following table.

Prior Model                                                      Overall Accuracy
Riemannian prior, Equation (39)                                  71.88% ± 2.16%
Riemannian prior, homoscedasticity assumption, Equation (40)     69.06% ± 1.96%
Conjugate prior, Equation (42)                                   68.73% ± 2.92%

Recall that the overall accuracy is the ratio of the number of successfully classified subimages
to the total number of subimages. The table shows that the use of a Riemannian prior yields an
improvement upon the use of a conjugate prior: about 3 percentage points with Equation (39), and
about 0.3 points under the homoscedasticity assumption of Equation (40).

6. Conclusions

Motivated by the problem of the classification of univariate normal populations, this paper
introduced a new class of prior distributions on the univariate normal model. With the univariate
normal model viewed as the Poincaré half plane H, these new prior distributions, called Gaussian
distributions, were meant to reflect the geometric picture (in terms of Rao’s distance) that a cluster or
class of univariate normal populations can be represented as having a center ¯z ∈ H and a “variance”
or dispersion γ². Precisely, a Gaussian distribution G(¯z, γ) has a probability density function p(z),
with respect to the Riemannian volume of the Poincaré half plane, which is proportional to
exp(−d²(z, ¯z)/(2γ²)).

Using Gaussian distributions as prior distributions in the problem of the classification of univariate
normal populations was shown to lead to a new, more general and efficient decision rule. This
decision rule was implemented in a real-world application to texture image classification, where it
led to significant improvement in performance, in comparison to decision rules obtained by using
conjugate priors.
The general approach proposed in this paper contains several simplifications and approximations,
which could be improved upon in future work. First, it is possible to use different prior distributions,
which are more geometrically rich than Gaussian distributions, to represent classes of univariate normal
populations. For example, it may be helpful to replace Gaussian distributions that are “isotropic”, in
the sense of having a scalar dispersion parameter γ, by non-isotropic distributions, with a dispersion
matrix Γ (a 2 × 2 symmetric positive definite matrix). Another possibility would be to represent
each class of univariate normal populations by a finite mixture of Gaussian distributions, instead of
representing it by a single Gaussian distribution.
These variants, which would allow classes with a more complex geometric structure to be taken
into account, can be integrated in the general framework proposed in the paper, based on: (i) fitting
each class to a prior distribution (Gaussian non-isotropic, mixture of Gaussians); and (ii) choosing, for
a test population, the most adequate class, based on a decision rule. These two steps can be realized as
above, through the Laplace approximation and maximum likelihood estimation, or through alternative
techniques, based on Markov chain Monte Carlo stochastic optimization.
In addition to generalizing the approach of this paper and improving its performance, a further
important objective for future work will be to extend it to other parametric models, beyond univariate
normal models. Indeed, there is an increasing number of parametric models (generalized Gaussian,
elliptical models, etc.), whose Riemannian geometry is becoming well understood and where the
present approach may be helpful.

Author Contributions

Salem Said carried out the mathematical development and specified the algorithms appearing in
Sections 2, 3 and 4. Lionel Bombrun carried out all numerical simulations and proposed the theoretical
development of Section 4. Yannick Berthoumieu devised the main idea of the paper, that is, the use of
Riemannian priors as a geometric representation of a class or cluster of univariate normal populations.
All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000.
2. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
3. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–234.
4. Amari, S.I. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
5. Nielsen, F.; Nock, R. Hyperbolic Voronoi diagrams made easy. 2009, arXiv:0903.3287.
6. Lenglet, C.; Rousson, M.; Deriche, R.; Faugeras, O. Statistics on the manifold of multivariate normal distributions: Theory and application to diffusion tensor MRI processing. J. Math. Imaging Vis. 2006, 25, 423–444.
7. Verdoolaege, G.; Scheunders, P. On the geometry of multivariate generalized Gaussian models. J. Math. Imaging Vis. 2012, 43, 180–193.
8. Berkane, M.; Oden, K. Geodesic estimation in elliptical distributions. J. Multival. Anal. 1997, 63, 35–46.
9. Erdélyi, A. Asymptotic Expansions; Dover Books: Mineola, NY, USA, 2010.
10. MIT Vision and Modeling Group. Vision Texture. Available online: http://vismod.media.mit.edu/pub/VisTex (accessed on 10 June 2014).
11. Pennec, X. Intrinsic statistics on Riemannian manifold: Basic tools for geometric measurements. J. Math. Imaging Vis. 2006, 25, 127–154.
12. Atkinson, C.; Mitchell, A.F.S. Rao’s distance measure. Sankhya Ser. A 1981, 43, 345–365.
13. Gallot, S.; Hulin, D.; Lafontaine, J. Riemannian Geometry; Springer-Verlag: Berlin, Germany, 2004.
14. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008.
15. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’I.H.P. 1948, 10, 215–310. (In French)
16. Afsari, B. Riemannian Lp center of mass: Existence, uniqueness and convexity. Proc. Am. Math. Soc. 2011, 139, 655–673.
17. Manton, J.H. A centroid (Karcher mean) approach to the joint approximate diagonalisation problem: The real symmetric case. Digit. Sign. Process. 2006, 16, 468–478.
18. Arnaudon, M.; Barbaresco, F. Riemannian medians and means with applications to RADAR signal processing. IEEE J. Sel. Top. Sign. Process. 2013, 7, 595–604.
19. Le, H. On the consistency of procrustean mean shapes. Adv. Appl. Prob. 1998, 30, 53–63.
20. Turaga, P.; Veeraraghavan, A.; Chellappa, R. Statistical Analysis on Stiefel and Grassmann Manifolds with Applications in Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; doi: 10.1109/CVPR.2008.4587733.
21. Chavel, I. Riemannian Geometry: A Modern Introduction; Cambridge University Press: Cambridge, UK, 2006.
22. Grigorescu, S.E.; Petkov, N.; Kruizinga, P. Comparison of texture features based on Gabor filter. IEEE Trans. Image Process. 2002, 11, 1160–1167.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).

entropy

Article
Combinatorial Optimization with Information
Geometry: The Newton Method

Luigi Malagò 1 and Giovanni Pistone 2,*

1 Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico, 39/41, 20135 Milano, Italy;
E-Mail: malago@di.unimi.it
2 de Castro Statistics, Collegio Carlo Alberto, Via Real Collegio 30, 10024 Moncalieri, Italy
*
E-Mail: giovanni.pistone@carloalberto.org; Tel.: +39-011-670-5033; Fax: +39-011-670-5082.

Received: 31 March 2014; in revised form: 10 July 2014 / Accepted: 11 July 2014 /
Published: 28 July 2014

Abstract: We discuss the use of the Newton method in the computation of max(p ↦ Ep [ f ]), where
p belongs to a statistical exponential family on a finite state space. In a number of papers, the authors
have applied first order search methods based on information geometry. Second order methods
have been widely used in optimization on manifolds, e.g., matrix manifolds, but appear to be new
in statistical manifolds. These methods require the computation of the Riemannian Hessian in a
statistical manifold. We use a non-parametric formulation of information geometry in view of further
applications in the continuous state space cases, where the construction of a proper Riemannian
structure is still an open problem.

Keywords: statistical manifold; Riemannian Hessian; combinatorial optimization; Newton method

1. Introduction

In this paper, statistical exponential families [1] are thought of as differentiable manifolds along
the approach called information geometry [2] or the exponential statistical manifold [3]. Specifically,
our aim is to discuss optimization on statistical manifolds using the Newton method, as is suggested
in ([4] (Ch. 5 and 6)); see also the monograph [5]. This method is based on classical Riemannian
geometry [6], but here, we put our emphasis on coordinate-free differential geometry; see [7,8].
We mainly refer to the above-mentioned references [2,4], with one notable exception in the
description of the tangent space. Our manifold will be an exponential family EV of positive densities,
V being a vector space of sufficient statistics. Given a one-dimensional statistical model p(t) ∈ EV,
t ∈ I, we define its velocity at time t to be its Fisher score s(t) = (d/dt) ln p(t) [9]. The Fisher score s(t)
is a random variable with zero expectation with respect to p(t), Ep(t) [s(t)] = 0. Because of that, the
tangent space at p ∈ EV is a vector space of random variables with zero expectation at p. A vector field
is a mapping from p to a random variable V(p), such that for all p ∈ E, the random variable V(p) is
centered at p, Ep [V(p)] = 0. In other words, each point of the manifold has a different tangent space,
and this tangent space can be used as a non-parametric model space of the manifold. In this formalism,
a vector field is a mapping from densities to centered random variables, that is, it is what in statistics is
called a pivot of the statistical model. To avoid confusion with the product of random variables, we
do not use the standard notation for the action of a vector field on a real function. This approach is
possibly unusual in differential geometry, but it is fully natural from the statistical point of view, where
the Fisher score has a central place. Moreover, this approach scales nicely from the finite state space to
the general state space; see the discussion in [9] and the review in [3].
A complete construction of the geometric framework based on the idea of using the Fisher scores
as elements of the tangent bundle has already been worked out. In this paper, we go further by considering
a second order geometry based on the non-parametric setting.

Entropy 2014, 16, 4260–4289; doi:10.3390/e16084260

Our main motivation for such a geometrical construction is its application to combinatorial
optimization using exponential families, whose first order version was developed in [10–14]. We give
here an illustration of the methods in the following toy example.
Consider the function f(x1, x2) = a0 + a1x1 + a2x2 + a12x1x2, with x1, x2 = ±1 and a0, a1, a2, a12 ∈ R.
The function f is a real random variable on the sample space Ω = {+1, −1}² with the uniform
probability λ.
Note that the coordinate mappings X1, X2 of Ω generate an orthonormal basis
1, X1, X2, X1X2 of L2(Ω, λ) and that f is the general form of a real random variable on such a space. Let
P> be the open simplex of positive densities on (Ω, λ), and let EV be a statistical model, i.e., a subset
of P>. The relaxed mapping F: EV → R,

F(p) = Ep [ f ] = a0 + a1 Ep [X1] + a2 Ep [X2] + a12 Ep [X1X2] ,
(1)

is strictly bounded by the maximum of f, F(p) = Ep [ f ] < maxx∈Ω f (x), unless f is constant. We are
looking for a sequence pn, n ∈ N, such that Epn [ f ] → maxx∈Ω f (x) as n → ∞. The existence of such a
sequence is a nontrivial condition for the model EV. Precisely, the closure of EV must contain a density,
whose support is contained in the set of maxima {x ∈ Ω| f (x) = max f }. This condition is satisfied by
the independence model, V = Span {X1, X2}, where we can write:

F(η1, η2) = a0 + a1η1 + a2η2 + a12η1η2,
ηi = Ep [Xi] ,
(2)

See Figure 1.
The gradient of Equation (2) has components ∂1F = a1 + a12η2, ∂2F = a2 + a12η1, and the flow
along the gradient produces increasing values for F. This gradient flow, however, does not converge to
the maximum of F; see the dotted line in Figure 2. One can instead follow the suggestion by [15] and
use a modified gradient flow (the “natural” gradient flow), which produces better results in our problem; see
Figure 3. Full details on this example are given in Section 2.5.2.
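
The toy example can be checked numerically. The following is a minimal sketch, not part of the original computations: it uses the coefficients a1 = 1, a2 = 2, a12 = 3 of Figure 1, takes a0 = 0 as an assumption (the value is not stated), and uses hypothetical function names. It compares Equation (2) with a direct expectation under the independence model and evaluates the gradient at the starting point of Figure 3.

```python
from itertools import product

# Coefficients of Figure 1; A0 = 0 is an assumption (its value is not stated).
A0, A1, A2, A12 = 0.0, 1.0, 2.0, 3.0

OMEGA = list(product((+1, -1), repeat=2))  # sample space {+1, -1}^2


def f(x1, x2):
    """Objective function of the toy example."""
    return A0 + A1 * x1 + A2 * x2 + A12 * x1 * x2


def relaxed_F(eta1, eta2):
    """Equation (2): relaxed function in the expectation parameters."""
    return A0 + A1 * eta1 + A2 * eta2 + A12 * eta1 * eta2


def grad_F(eta1, eta2):
    """Euclidean gradient of Equation (2)."""
    return (A1 + A12 * eta2, A2 + A12 * eta1)


def expectation_of_f(eta1, eta2):
    """E_p[f] summed over OMEGA for independent coordinates with
    P(X_j = x) = (1 + eta_j * x) / 2."""
    return sum(
        f(x1, x2) * (1 + eta1 * x1) / 2 * (1 + eta2 * x2) / 2
        for x1, x2 in OMEGA
    )


f_max = max(f(x1, x2) for x1, x2 in OMEGA)
start = (-0.25, -0.25)
print(f_max, relaxed_F(*start), expectation_of_f(*start), grad_F(*start))
```

Inside the open square the relaxed value stays strictly below max f, and it approaches max f only near the vertex (1, 1), where the maximum of f is attained for these coefficients.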

Figure 1. Relaxation of the Function (2) on the independence model, plotted against the expectation parameters (η1, η2). a1 = 1, a2 = 2, a12 = 3.


Figure 2. Gradient flow of the Function (2), plotted against the expectation parameters (η1, η2). The domain has been increased to include values outside
the square [−1, +1]².

Figure 3. Gradient vs. natural gradient: gradient flow (blue line) and natural gradient flow (black line) for the Function (2), plotted against the expectation parameters (η1, η2) and starting
at (−1/4, −1/4).

In combinatorial optimization, the values of the function f are assumed to be available at each
point, and the curve of steepest ascent of the relaxed function is learned through a simulation procedure
based on exponential statistical models.


In this paper, we introduce, in Section 2, the geometry of exponential families and its first order
calculus. The second order calculus and the Hessian are discussed in Section 3. Finally, in Section 4,
we apply the formalism to the discussion of the Newton method in the context of the maximization of
the relaxed function.

2. Models on a Finite State Space

We consider here the exponential statistical manifold on the set of positive densities on a measure
space (Ω, μ) with Ω finite and counting measure μ. The setup we describe below is not strictly required
in the finite case, because in such a case, other approaches are possible, but it provides a mathematical
formalism that has its own pros and that scales naturally to the infinite case.
We provide below a schematic presentation of our formalism as an introduction to this section.

• Two different exponential families can be the same statistical model, in the sense that the sets of
densities in the two families are equal. This is due both to the
arbitrariness of the reference density and to the fact that the sufficient statistics are just one
basis of the vector space they generate. In a non-parametric approach,
we can refer directly to the vector space of centered log-densities, while the change of reference
density is geometrically interpreted as a change of chart. The set of all such charts
defines a manifold.
• We make a specific interpretation of the tangent bundle as the vector space of Fisher’s scores at
each density and use such tangent spaces as the space of coordinates. This produces a different
tangent space/space of coordinates at each density, and different tangent spaces are mapped
one onto another by a proper parallel transport, which is nothing else than the re-centering of
random variables.
• If a basis is chosen, a parametrization is given, and such a parametrization is, in fact, a new
chart, whose values are real vectors. In the real parametrization, the natural scalar product in
each scores space is given by Fisher’s information matrix.
• Riemannian gradients are defined in the usual way. It is customary in information geometry to
call “natural gradient” the real coordinate presentation of the Riemannian gradient. The natural
gradient is computed by applying the inverse of the Fisher information matrix to the Euclidean
gradient. It seems that there are three gradients involved, but they all represent the same object
when correctly understood.
• The classical notion of expectation parameters for exponential families carries on as another
chart on the statistical manifold, which gives rise to a further presentation of a geometrical
object.
• While the statistical manifold is unique, there are at least three relevant connections as structures
on the vector bundles of the manifold: one relating to the exponential charts, one relating to the
expectation charts and one depending on the Riemannian structure.

2.1. Exponential Families As Manifolds

On the finite sample space Ω, #Ω = n, let a set of random variables B = {X1, . . . , Xm} be
given, such that ∑j αjXj is constant if, and only if, the αj’s are zero, or, equivalently, such that X0 =
1, X1, . . . , Xm are affinely independent. The condition implies, necessarily, the linear independence of
B. A common choice is to take a set of linearly independent and μ-centered random variables.
We write V = Span {X1, . . . , Xm} and define the following exponential family of positive densities
p ∈ P>:
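$$\mathcal{E}_V = \left\{ q \in \mathcal{P}_> \;\middle|\; q \propto e^{U} p,\ U \in V \right\}. \tag{3}$$

Given any couple p, q ∈ EV, there exists a unique set of parameters θ = θp(q), such that:

$$q = \exp\Big(\sum_{j} \theta_j\, {}^{e}U_p X_j - \psi_p(\theta)\Big) \cdot p, \tag{4}$$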


where eUp is the centering at p, that is,

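$${}^{e}U_p : V \ni U \mapsto U - \mathrm{E}_p[U] \in {}^{e}U_p V. \tag{5}$$

The linear mapping ${}^{e}U_p$ is one-to-one on V, and ${}^{e}U_p X_j$, j = 1, . . . , m, is a basis of ${}^{e}U_p V$. We view
each choice of a specific reference p as providing a chart centered at p on the exponential family EV,
namely:

$$\sigma_p : \exp\Big(\sum_{j} \theta_j\, {}^{e}U_p X_j - \psi_p(\theta)\Big) \cdot p \mapsto \theta. \tag{6}$$

If:

$$U = {}^{e}U_p U + \mathrm{E}_p[U] = \sum_{j=1}^{m} \theta_j\, {}^{e}U_p X_j + \mathrm{E}_p[U], \tag{7}$$

then:

$$\mathrm{E}_p\big[U\, {}^{e}U_p X_i\big] = \sum_{j=1}^{m} \theta_j\, \mathrm{E}_p\big[{}^{e}U_p X_i\, {}^{e}U_p X_j\big], \tag{8}$$

so that $\theta = I_B^{-1}(p)\, \mathrm{E}_p[U\, {}^{e}U_p X]$, where:

$$I_B(p) = \big[\operatorname{Cov}_p(X_i, X_j)\big]_{ij} = \mathrm{E}_p[XX'] - \mathrm{E}_p[X]\, \mathrm{E}_p[X'] \tag{9}$$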

is the Fisher information matrix of the basis B = {X1, . . . , Xm}.
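The mappings:

$$\sigma_p : \mathcal{E}_V \ni q \mapsto U \mapsto \theta \in \mathbb{R}^m, \tag{10}$$

where:

$$s_p : q \mapsto U = \log\Big(\frac{q}{p}\Big) - \mathrm{E}_p\Big[\log\Big(\frac{q}{p}\Big)\Big], \tag{11}$$

$$\sigma_p : q \mapsto \theta = I_B^{-1}(p)\, \mathrm{E}_p\big[U\, {}^{e}U_p X\big] = I_B^{-1}(p)\, \mathrm{E}_p\Big[\log\Big(\frac{q}{p}\Big)\, {}^{e}U_p X\Big], \tag{12}$$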

are global charts in the non-parametric and parametric coordinates, respectively.
Notice that
Equation (12) provides the regression coefficients of the least squares estimate on eUpV of the
log-likelihood.
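
As an illustration of Equations (12) and (13), the following minimal sketch builds a density q from given coordinates θ and recovers θ from q by the regression formula. The sample space {+1, −1}² with the coordinate statistics X1, X2, the particular reference density p, and all function names are assumptions made only for this example.

```python
import math
from itertools import product

OMEGA = list(product((+1, -1), repeat=2))
X = [lambda x: x[0], lambda x: x[1]]  # sufficient statistics X1, X2 (assumed)


def expect(p, h):
    """Expectation of h under the density p (dict keyed by sample points)."""
    return sum(p[x] * h(x) for x in OMEGA)


def centered(p, h):
    """Centering of h at p, i.e. h - E_p[h]."""
    m = expect(p, h)
    return lambda x: h(x) - m


def e_p(p, theta):
    """Equation (13): density with coordinates theta in the chart centered at p."""
    cX = [centered(p, Xj) for Xj in X]
    w = {x: math.exp(sum(t * c(x) for t, c in zip(theta, cX))) * p[x] for x in OMEGA}
    z = sum(w.values())  # z = exp(psi_p(theta))
    return {x: w[x] / z for x in OMEGA}


def fisher(p):
    """Equation (9): covariance matrix of the basis at p."""
    cX = [centered(p, Xj) for Xj in X]
    return [[expect(p, lambda x, a=a, b=b: a(x) * b(x)) for b in cX] for a in cX]


def sigma_p(p, q):
    """Equation (12): coordinates of q in the chart centered at p (2x2 case)."""
    cX = [centered(p, Xj) for Xj in X]
    b = [expect(p, lambda x, c=c: math.log(q[x] / p[x]) * c(x)) for c in cX]
    (i11, i12), (i21, i22) = fisher(p)
    det = i11 * i22 - i12 * i21
    return [(i22 * b[0] - i12 * b[1]) / det, (-i21 * b[0] + i11 * b[1]) / det]


p = {(1, 1): 0.4, (1, -1): 0.3, (-1, 1): 0.2, (-1, -1): 0.1}
theta = [0.7, -1.1]
q = e_p(p, theta)
recovered = sigma_p(p, q)
print(recovered)
```

The recovered coordinates agree with θ up to rounding, because the centered statistics have zero mean at p and the normalizing constant drops out of the covariance.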
We denote by ep : Rm → EV the inverse of σp, i.e.,

$$e_p(\theta) = \exp\Big(\sum_{j=1}^{m} \theta_j\, {}^{e}U_p X_j - \psi_p(\theta)\Big) \cdot p, \tag{13}$$

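so that the representation of the divergence q ↦ D(p ∥ q) in the chart σp is ψp:

$$\psi_p(\theta) = \log\Big(\mathrm{E}_p\Big[e^{\sum_{j=1}^{m} \theta_j\, {}^{e}U_p X_j}\Big]\Big) = \mathrm{E}_p\Big[\log\Big(\frac{p}{e_p(\theta)}\Big)\Big] = D\big(p \,\|\, e_p(\theta)\big). \tag{14}$$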

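The mapping $I_B : p \mapsto \operatorname{Cov}_p(X, X) \in \mathbb{R}^{m \times m}$ is represented in the chart centered at p by:

$$I_{B,p}(\theta) = I_B(e_p(\theta)) = \big[\operatorname{Cov}_{e_p(\theta)}(X_i, X_j)\big]_{i,j} = \operatorname{Hess} \psi_p(\theta), \tag{15}$$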

See [1].
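
Equations (14) and (15) can be verified numerically in a one-parameter case. The sketch below is only an illustration under stated assumptions: a three-point sample space, a single statistic X(x) = x, an arbitrary reference density p, and a central finite difference in place of the exact Hessian.

```python
import math

X = [0.0, 1.0, 2.0]      # values of the single statistic (assumed)
p = [0.5, 0.3, 0.2]      # reference density (assumed)
mean_p = sum(pi * xi for pi, xi in zip(p, X))


def psi(theta):
    """Cumulant function psi_p of the p-centered statistic."""
    return math.log(sum(pi * math.exp(theta * (xi - mean_p)) for pi, xi in zip(p, X)))


def e_p(theta):
    """Equation (13) for m = 1."""
    return [pi * math.exp(theta * (xi - mean_p) - psi(theta)) for pi, xi in zip(p, X)]


def kl(p1, p2):
    """Kullback-Leibler divergence D(p1 || p2)."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2))


def variance(q):
    """Variance of X under q, i.e. the 1x1 Fisher information."""
    m = sum(qi * xi for qi, xi in zip(q, X))
    return sum(qi * (xi - m) ** 2 for qi, xi in zip(q, X))


theta = 0.8
h = 1e-4
hess_fd = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h**2
print(psi(theta), kl(p, e_p(theta)), hess_fd, variance(e_p(theta)))
```

The first two printed values coincide, which is the identity ψp(θ) = D(p ∥ ep(θ)), and the last two agree up to the finite-difference error, which is the identity Hess ψp(θ) = IB,p(θ).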


2.2. Change of Chart

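Fix p, p̄ ∈ EV; then, we can express p in the chart centered at p̄,

$$p = \exp\big(\bar U - k_{\bar p}(\bar U)\big) \cdot \bar p, \qquad \bar U \in {}^{e}U_{\bar p} V, \qquad k_{\bar p}(\bar U) = \log\big(\mathrm{E}_{\bar p}\big[e^{\bar U}\big]\big). \tag{16}$$

In coordinates, $\bar U = \sum_{j=1}^{m} \bar\theta_j\, {}^{e}U_{\bar p} X_j$.
For all q ∈ EV, $q = \exp(U - k_p(U))\, p$, $U \in {}^{e}U_p V$, $k_p(U) = \log(\mathrm{E}_p[e^{U}])$, in coordinates
$U = \sum_{j=1}^{m} \theta_j\, {}^{e}U_p X_j$, we can write:

$$\begin{aligned}
q &= \exp\big(U - k_p(U)\big) \cdot p \\
&= \exp\big(U - k_p(U)\big) \exp\big(\bar U - k_{\bar p}(\bar U)\big) \cdot \bar p \\
&= \exp\big(U - k_p(U) + \bar U - k_{\bar p}(\bar U)\big) \cdot \bar p \\
&= \exp\Big(\big((U + \bar U) - \mathrm{E}_{\bar p}[U]\big) - \big(k_p(U) + k_{\bar p}(\bar U) - \mathrm{E}_{\bar p}[U]\big)\Big) \cdot \bar p,
\end{aligned} \tag{17}$$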

hence, the non-parametric coordinate of q in the chart centered at ¯p is U + ¯U − E ¯p [U] = eU ¯p(U) + ¯U.
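From Equation (12):

$$\sigma_{\bar p}(q) = I_B^{-1}(\bar p)\, \mathrm{E}_{\bar p}\big[({}^{e}U_{\bar p} U + \bar U)\, {}^{e}U_{\bar p} X\big] = \theta + \bar\theta. \tag{18}$$

This provides the change of charts $\sigma_{\bar p} \circ \sigma_p^{-1} : \theta \mapsto \theta + \bar\theta$. This atlas of charts defines the affine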
manifold (EV, (σp)). This fact has deep consequences that we do not discuss here, e.g., our manifold is
an instance of a Hessian manifold [16].
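
The affine change of chart of Equation (18) can be checked on a small example. The sketch below assumes a three-point sample space with one statistic X(x) = x and an arbitrary reference density; the names `e_chart` and `sigma` are hypothetical stand-ins for ep and σp in the case m = 1.

```python
import math

X = [0.0, 1.0, 2.0]  # one sufficient statistic on a three-point sample space (assumed)


def mean_x(p):
    return sum(pi * xi for pi, xi in zip(p, X))


def e_chart(p, theta):
    """Density with coordinate theta in the chart centered at p (Equation (13), m = 1)."""
    m = mean_x(p)
    w = [pi * math.exp(theta * (xi - m)) for pi, xi in zip(p, X)]
    z = sum(w)
    return [wi / z for wi in w]


def sigma(p, q):
    """Coordinate of q in the chart centered at p (Equation (12), m = 1)."""
    m = mean_x(p)
    var = sum(pi * (xi - m) ** 2 for pi, xi in zip(p, X))
    cov = sum(pi * math.log(qi / pi) * (xi - m) for pi, qi, xi in zip(p, q, X))
    return cov / var


p_bar = [0.2, 0.5, 0.3]
theta_bar, theta = 0.4, -1.3
p = e_chart(p_bar, theta_bar)   # p has coordinate theta_bar at p_bar
q = e_chart(p, theta)           # q has coordinate theta at p
print(sigma(p_bar, q), theta + theta_bar)
```

Both printed numbers agree: moving the reference from p to p̄ shifts the coordinate by the constant θ̄, so the transition maps are translations.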

2.3. Tangent Bundle

The space of Fisher scores at p is eUpV, and it is identified with the tangent space of the manifold
at p, TpEV; see the discussion in [3,9]. Let us check the consistency of this statement with our
θ-parametrization.
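Let:

$$q(\tau) = \exp\Big(\sum_{j=1}^{m} \theta_j(\tau)\, {}^{e}U_{q(0)} X_j - \psi_{q(0)}(\theta(\tau))\Big) \cdot q(0), \tag{19}$$

τ ∈ I, I an open interval containing zero, be a curve in EV. In the chart centered at q(0), we have from
Equation (12):

$$\begin{aligned}
\sigma_{q(0)}(q(\tau)) &= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Big[\log\Big(\frac{q(\tau)}{q(0)}\Big)\, {}^{e}U_{q(0)} X\Big] \\
&= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Big[\Big(\sum_{j=1}^{m} \theta_j(\tau)\, {}^{e}U_{q(0)} X_j - \psi_{q(0)}(\theta(\tau))\Big)\, {}^{e}U_{q(0)} X\Big] \\
&= I_B^{-1}(q(0)) \sum_{j=1}^{m} \theta_j(\tau)\, \mathrm{E}_{q(0)}\big[{}^{e}U_{q(0)} X_j\, {}^{e}U_{q(0)} X\big] \\
&= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\big[{}^{e}U_{q(0)} X\, ({}^{e}U_{q(0)} X)'\big]\, \theta(\tau) \\
&= \theta(\tau).
\end{aligned} \tag{20}$$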

The vector space eUpV is represented by the coordinates in the base eUpB. The tangent bundle
TEV as a manifold is defined by the charts (σp, ˙σp) on the domain:

$$T\mathcal{E}_V = \big\{ (p, v) \;\big|\; p \in \mathcal{E}_V,\ v \in T_p\mathcal{E}_V \big\} \tag{21}$$


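with:

$$(\sigma_p, \dot\sigma_p) : (q, V) \mapsto \Big( I_B^{-1}(p)\, \mathrm{E}_p\Big[\log\Big(\frac{q}{p}\Big)\, {}^{e}U_p X\Big],\ I_B^{-1}(p)\, \mathrm{E}_p\big[V\, {}^{e}U_p X\big] \Big). \tag{22}$$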

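The dot notation $\dot\sigma_p$ for the charts on the tangent spaces is justified by the computation in Equation (23)
below:

$$\frac{d}{d\tau} \sigma_{q(0)}(q(\tau))\Big|_{\tau=0} = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Big[\frac{d}{d\tau} \log(q(\tau))\Big|_{\tau=0}\, {}^{e}U_{q(0)} X\Big] = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\big[\delta q(0)\, {}^{e}U_{q(0)} X\big] = \dot\sigma_{q(0)}(\delta q(0)). \tag{23}$$

The velocity at τ = 0 is $\delta q(0) = \frac{d}{d\tau} \log(q(\tau))\big|_{\tau=0} \in T_{q(0)}\mathcal{E}_V$ and:

$$\frac{d}{d\tau} \theta(\tau)\Big|_{\tau=0} = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Big[\frac{d}{d\tau} \log(q(\tau))\Big|_{\tau=0}\, {}^{e}U_{q(0)} X\Big] = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\big[\delta q(0)\, {}^{e}U_{q(0)} X\big], \tag{24}$$

which is consistent both with the definition of the tangent space as the set of Fisher scores and with the chart
of the tangent bundle as defined in Equation (22).
The velocity at a generic τ is $\delta q(\tau) = \frac{d}{d\tau} \log(q(\tau)) \in T_{q(\tau)}\mathcal{E}_V$ and has coordinates in the chart centered at q(0):

$$\frac{d}{d\tau} \theta(\tau) = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\Big[\frac{d}{d\tau} \log(q(\tau))\, {}^{e}U_{q(0)} X\Big] = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\big[\delta q(\tau)\, {}^{e}U_{q(0)} X\big]. \tag{25}$$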

If V, W are vector fields on TEV, i.e., V(p), W(p) ∈ TpEV = eUpV, p ∈ EV, we define a Riemannian
metric g(V, W) by:

$$g(V, W)(p) = g_p(V(p), W(p)) = \mathrm{E}_p[V(p) W(p)]. \tag{26}$$

In coordinates at p, $V(p) = \sum_j \dot\sigma_p^j(V)\, {}^{e}U_p X_j$, $W(p) = \sum_j \dot\sigma_p^j(W)\, {}^{e}U_p X_j$, so that:

gp(V(p), W(p)) = ˙σp(V)′IB(p) ˙σp(W).
(27)

2.4. Gradients

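Given a function φ: EV → R, let $\phi_p = \phi \circ e_p$, $e_p = \sigma_p^{-1}$, be its representation in the chart centered
at p:

$$\mathbb{R}^m \xrightarrow{\ e_p\ } \mathcal{E}_V \xrightarrow{\ \phi\ } \mathbb{R}, \qquad \phi_p = \phi \circ e_p : \mathbb{R}^m \to \mathbb{R}. \tag{28}$$

The derivative of $\theta \mapsto \phi_p(\theta)$ at θ = 0 along α ∈ Rm is:

$$\nabla\phi_p(0)\alpha = \nabla\phi_p(0) I_B^{-1}(p) I_B(p) \alpha = \big(I_B^{-1}(p) \nabla\phi_p(0)'\big)' I_B(p) \alpha = g_p\big(I_B^{-1}(p) \nabla\phi_p(0)', \alpha\big). \tag{29}$$

The mapping $\widetilde{\nabla}\phi : p \mapsto I_B^{-1}(p)(\nabla\phi_p(0))' \in \mathbb{R}^m$ that appears in Equation (29) is Amari’s natural
gradient of φ: EV; see [15]. It is a standard notion in Riemannian geometry; cf. [4] (p. 46).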


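More generally, the derivative of $\theta \mapsto \phi_p(\theta)$ at θ along α ∈ Rm is:

$$\nabla\phi_p(\theta)\alpha = \nabla\phi_p(\theta) I_B^{-1}(e_p(\theta)) I_B(e_p(\theta)) \alpha = \big(I_B^{-1}(e_p(\theta)) \nabla\phi_p(\theta)'\big)' I_B(e_p(\theta)) \alpha = g_{e_p(\theta)}\big(I_B^{-1}(e_p(\theta)) \nabla\phi_p(\theta)', \alpha\big). \tag{30}$$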

Let us compare ∇φq(0) and ∇φp(θ) when q = ep(θ). As φp = φ ◦ ep and φq = φ ◦ eq, we have the
change of charts:
φq = φ ◦ eq = φ ◦ ep ◦ σp ◦ eq = φp ◦ σp ◦ eq,
(31)

hence ∇φq(0) = ∇φp(σp(q))J(σp ◦ eq)(0), where J(σp ◦ eq) is the Jacobian of σp ◦ eq. As σp ◦ eq(θ) =
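θ + σp(q), we have J(σp ◦ eq) = Id, and in conclusion, $\nabla\phi_{e_p(\theta)}(0) = \nabla\phi_p(\theta)$. For all p ∈ EV and
θ ∈ Rm,

$$\widetilde{\nabla}\phi(e_p(\theta)) = I_B^{-1}(e_p(\theta)) \nabla\phi_p(\theta). \tag{32}$$

Alternatively, for all q, p ∈ EV, $\widetilde{\nabla}\phi : \mathcal{E}_V \to \mathbb{R}^m$ is defined by:

$$\widetilde{\nabla}\phi(q) = I_B^{-1}(q) \nabla\phi_p(\sigma_p(q)). \tag{33}$$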

The Riemannian gradient of φ: EV is the vector field ∇φ, such that DYφ = g(∇φ, Y). Note that
the Riemannian gradient takes values in the tangent bundle, while the natural gradient takes values in
Rm. We compute the Riemannian gradient at p as follows. If $y = \dot\sigma_p(Y(p))$,

$$D_Y\phi(p) = d\phi_p(0) y = g_p(\widetilde{\nabla}\phi(p), y) = \mathrm{E}_p[\nabla\phi(p) Y(p)], \tag{34}$$

hence $\widetilde{\nabla}\phi(p) = I_B^{-1}(p) \nabla\phi_p(0)'$ is the representation in the chart centered at p of the vector field
∇φ: EV. Explicitly, we have (see Equation (22)),

$$\widetilde{\nabla}\phi(p) = I_B^{-1}(p)(\nabla\phi_p(0))' = I_B^{-1}(p)\, \mathrm{E}_p\big[\nabla\phi(p)\, {}^{e}U_p X\big], \tag{35}$$

$$\nabla\phi(p) = \sum_j (\widetilde{\nabla}\phi(p))_j\, {}^{e}U_p X_j. \tag{36}$$

The Euclidean gradient ∇φp(θ) is sometimes called the “vanilla gradient.” It is equal to the
covariance between the Riemannian gradient ∇φ(p) and the basis X, $(\nabla\phi_p(0))' = \mathrm{E}_p[\nabla\phi(p)\, {}^{e}U_p X]$.
We summarize in a display the relations between our three gradients: Euclidean ∇φp(0), natural
$\widetilde{\nabla}\phi(p)$ and Riemannian ∇φ(p).

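$$\begin{array}{ccc}
T\mathcal{E}_V & \xrightarrow{\ (\sigma_p, \dot\sigma_p)\ } & \mathbb{R}^{2m} \\
\downarrow{\scriptstyle \pi} & & \downarrow{\scriptstyle \pi_1} \\
\mathcal{E}_V & \xrightarrow{\ \sigma_p\ } & \mathbb{R}^{m}
\end{array}
\qquad
\begin{array}{ccc}
T_p\mathcal{E}_V & \xrightarrow{\ \dot\sigma_p\ } & \mathbb{R}^{m} \\
\uparrow{\scriptstyle \nabla\phi(p)} & & \downarrow{\scriptstyle I_B(p)} \\
\mathcal{E}_V & \xrightarrow{\ \nabla\phi_p(0)\ } & \mathbb{R}^{m}
\end{array}
\qquad
\dot\sigma_p \circ \nabla\phi(p) = I_B^{-1} \nabla\phi_p(0) = \widetilde{\nabla}\phi(p) \tag{37}$$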
In the following, we shall frequently use the fact that the representation of the gradient vector
field ∇φ in a generic chart centered at p is:

$$(\nabla\phi)_p(\theta) = \dot\sigma_p(\nabla\phi(e_p(\theta))) = (\widetilde{\nabla}\phi)(e_p(\theta)) = I_{B,p}^{-1}(\theta) \nabla\phi_p(\theta). \tag{38}$$

It should be noted that the leftmost term (∇φ)p(θ) is the presentation of the gradient in the charts
of the tangent bundle, while in the rightmost term, ∇φp(θ) denotes the Euclidean gradient of the
presentation of the function φ in the charts of the manifold.


2.4.1. Expectation Parameters

As ψp is strictly convex, the gradient mapping $\theta \mapsto (\nabla\psi_p(\theta))'$ is a homeomorphism from the
space of parameters Rm to the interior of the convex set generated by the image of ${}^{e}U_p X$; see [1]. The
function μp : EV defined by:

$$\mu_p(q) = \mathrm{E}_q\big[{}^{e}U_p X\big] = \mathrm{E}_q[X] - \mathrm{E}_p[X] = (\nabla\psi_p(\theta))', \qquad \theta = \sigma_p(q) \tag{39}$$

is a chart for all p ∈ EV. The value of the inverse q = Lp(μ) is characterized as the unique q ∈ EV, such
that $\mu = \mathrm{E}_q[{}^{e}U_p X]$, i.e., the maximum likelihood estimator.
Let us compute the change of chart from p to p̄:

$$\mu_{\bar p} \circ \mu_p^{-1}(\eta) = \bar\eta = \eta + \mathrm{E}_p[X] - \mathrm{E}_{\bar p}[X]. \tag{40}$$

In fact, $\mu = \mathrm{E}_{L_p(\mu)}[{}^{e}U_p X]$ and $\bar\mu = \mu_{\bar p}(L_p(\mu)) = \mathrm{E}_{L_p(\mu)}[{}^{e}U_{\bar p} X]$.
We do not discuss here the rich theory started in [2] about the duality between σp and μp. We
limit ourselves to the computation of the Riemannian gradient in the expectation parameters. If φ: EV,

φp(θ) = φ ◦ ep(θ) = φ ◦ Lp ◦ μp ◦ ep(θ) = (φ ◦ Lp) ◦ (∇ψp)(θ),
(41)

because μp ◦ ep(θ) = Eep(θ) [eUpX] = ∇ψp(θ), hence:

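$$\nabla\phi_p(\theta) = \nabla(\phi \circ L_p)(\nabla\psi_p(\theta)) \operatorname{Hess} \psi_p(\theta), \tag{42}$$

$$\widetilde{\nabla}\phi(p) = I_B(p)^{-1}\big(\nabla(\phi \circ L_p)(0) \operatorname{Hess} \psi_p(0)\big)' = (\nabla(\phi \circ L_p)(0))', \tag{43}$$

$$\nabla\phi(p) = \nabla(\phi \circ L_p)(0)\, {}^{e}U_p X, \tag{44}$$

that is, the natural gradient $\widetilde{\nabla}\phi$ at p = Lp(0) is equal to the Euclidean gradient of $\mu \mapsto \phi \circ L_p(\mu)$ at
μ = 0.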

2.4.2. Vector Fields

If V is a vector field of TEV and φ: EV is a real function, then we define the action of V on φ, ∇Vφ,
to be the real function:

$$\nabla_V\phi : \mathcal{E}_V \ni p \mapsto \nabla_V\phi(p) = \nabla\phi_p(0)\, \dot\sigma_p(V(p)). \tag{45}$$

We prefer to avoid the standard notation Vφ, because in our setting, V(p) is a random variable, and
the product V(p)φ(p) is otherwise defined as the ordinary product.
Let us represent ∇Vφ in the chart centered at p:

$$(\nabla_V\phi)_p(\theta) = \nabla_V\phi(e_p(\theta)) = \nabla\phi_{e_p(\theta)}(0)\, \dot\sigma_{e_p(\theta)}\big(V(e_p(\theta))\big) = \nabla\phi_p(\theta) V_p(\theta), \tag{46}$$

where we have used the equality $\nabla\phi_{e_p(\theta)}(0) = \nabla\phi_p(\theta)$ and $V_p(\theta) = \dot\sigma_{e_p(\theta)}\big(V(e_p(\theta))\big)$.
If W is a vector field, we can compute ∇W∇Vφ at p as:

∇W∇Vφ(p) = ∇(∇Vφ)p(0) ˙σp(W(p))

= Vp(0)′ Hess φp(0)Wp(0) + ∇φp(0)JVp(0)Wp(0),
(47)

where J denotes the Jacobian matrix.
The Lie bracket [W, V]φ (see [7] (§4.2), [8] (V, §1), [4] (Section 5.3.1)) is given by:

$$[W, V]\phi(p) = \nabla_W \nabla_V \phi(p) - \nabla_V \nabla_W \phi(p) = \nabla\phi_p(0)\big(JV_p(0) W_p(0) - JW_p(0) V_p(0)\big), \tag{48}$$

because of Equation (47) and the symmetry of the Hessian.


The flow of the smooth vector field V : EV is a family of curves γ(t, p), p ∈ EV, t ∈ Jp, Jp open real
interval containing zero, such that for all p ∈ EV and t ∈ Jp,

γ(0, p) = p,
(49)

δγ(t, p) = V(γ(t, p)).
(50)

As uniqueness holds in Equation (50) (see [8] (VI, §1) or [7] (§4.1)), we have the semi-group property
γ(s + t, p) = γ(s, γ(t, p)), and Equation (50) is equivalent to δγ(0, p) = V(γ(0, p)), p ∈ EV.
If a flow of V is available, we have an interpretation of ∇Vφ as a derivative of φ along γ(t, p),

$$\frac{d}{dt} \phi(\gamma(t, p))\Big|_{t=0} = \nabla\phi_p(\sigma_p(\gamma(t, p)))\, \frac{d}{dt} \sigma_p(\gamma(t, p))\Big|_{t=0} = \nabla\phi_p(0)\, \dot\sigma_p(V(p)) = \nabla_V\phi(p). \tag{51}$$

2.5. Examples

The following examples are intended to show how the formalism of gradients is usable in
performing basic computations.

2.5.1. Expectation

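Let f be any random variable, and define F: EV by F(p) = Ep [ f ]. In the chart centered at p, we
have:

$$F_p(\theta) = \int f \exp\Big(\sum_j \theta_j\, {}^{e}U_p X_j - \psi_p(\theta)\Big) \cdot p \, d\mu \tag{52}$$

and the Euclidean gradient:

$$\nabla F_p(0) = \operatorname{Cov}_p(f, X) \in (\mathbb{R}^m)'. \tag{53}$$

The natural gradient is:

$$\widetilde{\nabla} F(p) = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p(X, f) \in \mathbb{R}^m, \tag{54}$$

and the Riemannian gradient is:

$$\nabla F(p) = (\widetilde{\nabla} F(p))'\, {}^{e}U_p X = \operatorname{Cov}_p(f, X) \operatorname{Cov}_p(X, X)^{-1}\, {}^{e}U_p X \in T_p\mathcal{E}_V. \tag{55}$$

From Equation (55), it follows that ∇F(p) is the L2(p)-projection of f onto eUpV, while the components of $\widetilde{\nabla} F(p)$ in
Equation (54) are the coordinates of the projection. Let us consider the family of curves:

$$\gamma(t, p) = \exp\Big(\sum_{j=1}^{m} t (\widetilde{\nabla} F(p))_j\, {}^{e}U_p X_j - \psi_p(t \widetilde{\nabla} F(p))\Big) \cdot p, \qquad t \in \mathbb{R}. \tag{56}$$

The velocity is:

$$\delta\gamma(t, p) = \frac{d}{dt}\Big(\sum_{j=1}^{m} t (\widetilde{\nabla} F(p))_j\, {}^{e}U_p X_j - \psi_p(t \widetilde{\nabla} F(p))\Big) = \nabla F(p) - \mathrm{E}_{\gamma(t,p)}[\nabla F(p)], \tag{57}$$

which is different from ∇F(γ(t, p)), unless f ∈ V ⊕ R. Then, γ is not, in general, the flow of ∇F, but it
is a local approximation, as δγ(0, p) = ∇F(p).
These computations are the basis of model-based methods in combinatorial optimization; see [10–14].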
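
Equations (53)–(55) can be made concrete with a small computation. The sketch below is an illustration only: it assumes the sample space {+1, −1}², the coordinate statistics X1, X2, the toy objective of the Introduction with a0 = 0, and a product density with expectation parameters (−1/4, −1/4); all function names are hypothetical.

```python
from itertools import product

OMEGA = list(product((+1, -1), repeat=2))
A0, A1, A2, A12 = 0.0, 1.0, 2.0, 3.0  # toy coefficients; A0 = 0 is an assumption
X = [lambda x: x[0], lambda x: x[1]]  # sufficient statistics X1, X2


def f(x):
    return A0 + A1 * x[0] + A2 * x[1] + A12 * x[0] * x[1]


def product_density(eta):
    """Independent coordinates with E[X_j] = eta_j."""
    return {x: (1 + eta[0] * x[0]) / 2 * (1 + eta[1] * x[1]) / 2 for x in OMEGA}


def expect(p, h):
    return sum(p[x] * h(x) for x in OMEGA)


def cov(p, g, h):
    return expect(p, lambda x: g(x) * h(x)) - expect(p, g) * expect(p, h)


def gradients(p):
    """Return the Euclidean gradient (53), the natural gradient (54) and the
    Riemannian gradient (55), the last one as a function on OMEGA."""
    euclid = [cov(p, f, Xj) for Xj in X]
    i11, i12, i22 = cov(p, X[0], X[0]), cov(p, X[0], X[1]), cov(p, X[1], X[1])
    det = i11 * i22 - i12 * i12
    natural = [
        (i22 * euclid[0] - i12 * euclid[1]) / det,
        (-i12 * euclid[0] + i11 * euclid[1]) / det,
    ]
    means = [expect(p, Xj) for Xj in X]

    def riemannian(x):
        return sum(n * (Xj(x) - m) for n, Xj, m in zip(natural, X, means))

    return euclid, natural, riemannian


p = product_density((-0.25, -0.25))
euclid, natural, riemannian = gradients(p)
mean_f = expect(p, f)
# Residual of the projection of f onto the centered statistics.
residuals = [
    expect(p, lambda x, Xj=Xj: (f(x) - mean_f - riemannian(x)) * (Xj(x) - expect(p, Xj)))
    for Xj in X
]
print(euclid, natural, residuals)
```

The residuals vanish up to rounding, which is the projection property stated after Equation (55); for this product density the natural gradient equals (a1 + a12η2, a2 + a12η1).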


2.5.2. Binary Independent Variables

Here, we present, in full generality, the toy example of the Introduction; see [17] for more
information on the application to combinatorial optimization. Our example is a very special case of
Ising exactly solvable models [18], our aim being here to explore the geometric framework.
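Let Ω = {+1, −1}^m with counting measure μ, and let the space V be generated by the coordinate
projections B = {X1, . . . , Xd}. Note that we use here the coding +1, −1 (from physics) instead of
the coding 0, 1, which is more common in combinatorial optimization. The exponential family is
$\mathcal{E}_V = \big\{ \exp\big(\sum_{j=1}^{m} \theta_j X_j - \psi_\lambda(\theta)\big) \cdot 2^{-m} \big\}$, $\lambda(x) = 2^{-m}$ for x ∈ Ω being the uniform density. The
independence of the sufficient statistics Xj under all distributions in EV implies:

$$\psi_\lambda(\theta) = \sum_{j=1}^{m} \psi(\theta_j), \qquad \psi(\theta) = \log(\cosh(\theta)). \tag{58}$$

We have:

$$\nabla\psi_\lambda(\theta) = [\tanh(\theta_j) : j = 1, \dots, d] = \eta_\lambda(\theta), \tag{59}$$

$$\operatorname{Hess} \psi_\lambda(\theta) = \operatorname{diag}\big(\cosh^{-2}(\theta_j) : j = 1, \dots, d\big) = \operatorname{diag}\big(e^{-2\psi(\theta_j)} : j = 1, \dots, d\big) = I_{B,\lambda}(\theta), \tag{60}$$

$$I_{B,\lambda}(\theta)^{-1} = \operatorname{diag}\big(\cosh^{2}(\theta_j) : j = 1, \dots, d\big) = \operatorname{diag}\big(e^{2\psi(\theta_j)} : j = 1, \dots, d\big). \tag{61}$$

The quadratic function $f(X) = a_0 + \sum_j a_j X_j + \sum_{\{i,j\}} a_{i,j} X_i X_j$ has expected value at p = eλ(θ), i.e.,
relaxed value, equal to:

$$F(p) = F_\lambda(\theta) = \mathrm{E}_\theta[f(X)] = a_0 + \sum_j a_j \tanh(\theta_j) + \sum_{\{i,j\}} a_{i,j} \tanh(\theta_i) \tanh(\theta_j), \tag{62}$$

and covariance with Xk ∈ B equal to:

$$\begin{aligned}
\operatorname{Cov}_\theta(f(X), X_k) &= \sum_j a_j \operatorname{Cov}_\theta(X_j, X_k) + \sum_{\{i,j\}} a_{i,j} \operatorname{Cov}_\theta(X_i X_j, X_k) \\
&= a_k \operatorname{Var}_\theta(X_k) + \sum_{i \neq k} a_{i,k}\, \mathrm{E}_\theta[X_i] \operatorname{Var}_\theta(X_k) \\
&= \cosh^{-2}(\theta_k)\Big(a_k + \sum_{i \neq k} a_{i,k} \tanh(\theta_i)\Big).
\end{aligned} \tag{63}$$

In the computation, we have used the independence and the special algebra of ±1, which implies
$X_i^2 = 1$, so that $\operatorname{Cov}_\theta(X_i X_j, X_k) = 0$ if i, j ≠ k, otherwise $\operatorname{Cov}_\theta(X_i X_k, X_k) = \mathrm{E}_\theta[X_i] - \mathrm{E}_\theta[X_i]\, \mathrm{E}_\theta[X_k]^2$;
see [13].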


The Euclidean gradient, the natural gradient and the Riemannian gradient are, respectively,

$$\nabla F_\lambda(\theta) = \Big[\cosh^{-2}(\theta_j)\Big(a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i)\Big) : j = 1, \dots, d\Big], \tag{64}$$

$$\widetilde{\nabla} F(e_\lambda(\theta)) = \Big[a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i) : j = 1, \dots, d\Big], \tag{65}$$

$$\nabla F(e_\lambda(\theta)) = \sum_{j=1}^{m} \Big(a_j + \sum_{i \neq j} a_{i,j}\, \mathrm{E}_\theta[X_i]\Big)\big(X_j - \mathrm{E}_\theta[X_j]\big). \tag{66}$$

The (natural) gradient flow equations are:

$$\dot\theta_j(t) = a_j + \sum_{i \neq j} a_{i,j} \tanh(\theta_i(t)), \qquad j = 1, \dots, d. \tag{67}$$

Equations (64)–(66) are usable in practice if the aj’s and the ai,j’s are estimable. Otherwise, one
can use Equation (63) and the following forms of the gradients:

$$\nabla F_\lambda(\theta) = \big[\operatorname{Cov}_\theta(X_j, f(X)) : j = 1, \dots, d\big], \tag{68}$$

$$\widetilde{\nabla} F(e_\lambda(\theta)) = \big[\cosh^{2}(\theta_j) \operatorname{Cov}_\theta(f(X), X_j) : j = 1, \dots, d\big], \tag{69}$$

in which case, the gradient flow equations are:

$$\dot\theta_j(t) = \cosh^{2}(\theta_j) \operatorname{Cov}_\theta(f(X), X_j), \qquad j = 1, \dots, d. \tag{70}$$

Let us study the relaxed function in the expectation parameters ηj = ηj(θ), j = 1, . . . , d,

$$F_\lambda(\eta) = a_0 + \sum_j a_j \eta_j + \sum_{\{i,j\}} a_{i,j} \eta_i \eta_j, \qquad \eta \in\, ]{-1}, +1[^{m}. \tag{71}$$

The Euclidean gradient with respect to η has components:

$$\partial_j F_\lambda(\eta) = a_j + \sum_{i \neq j} a_{i,j} \eta_i, \tag{72}$$

which are equal to the components of the natural gradient; see Section 2.4.1. As:

$$\dot\eta_j(t) = \frac{d}{dt} \tanh(\theta_j(t)) = \cosh^{-2}(\theta_j(t))\, \dot\theta_j(t) = \big(1 - \eta_j(t)^2\big)\, \dot\theta_j(t), \qquad j = 1, \dots, m, \tag{73}$$

the gradient flow expressed in the η-parameters has equations:

$$\dot\eta_j(t) = \big(1 - \eta_j(t)^2\big)\Big(a_j + \sum_{i \neq j} a_{i,j} \eta_i(t)\Big), \qquad j = 1, \dots, d. \tag{74}$$

Alternatively, in vector form,

$$\dot\eta(t) = \operatorname{diag}\big(1 - \eta_j(t)^2 : j = 1, \dots, d\big)\, (a + A\eta(t)), \tag{75}$$

where $a = [a_j : j = 1, \dots, d]^t$ and $A_{i,j} = 0$ if i = j, $A_{i,j} = a_{i,j}$ otherwise. The matrix A is symmetric with zero
diagonal, and it has the meaning of the adjacency matrix of the (weighted) interaction graph. We do
not know a closed-form solution of Equation (74). An example of a numerical solution is shown in
Figure 3.
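
A numerical solution of Equation (74) can be sketched with an explicit Euler scheme. The code below is only an illustration and is not the procedure used for Figure 3: the step size, the number of steps and the function names are assumptions, the coefficients are those of Figure 1 with a0 omitted, and the plain gradient flow is read as η̇ = ∇F(η) in the expectation parameters.

```python
A1, A2, A12 = 1.0, 2.0, 3.0  # coefficients of Figure 1; the constant a0 is omitted


def F(eta):
    """Relaxed function (71) for two variables, without the constant term."""
    return A1 * eta[0] + A2 * eta[1] + A12 * eta[0] * eta[1]


def euclidean_gradient(eta):
    """Equation (72) for two variables."""
    return (A1 + A12 * eta[1], A2 + A12 * eta[0])


def euler_flow(eta0, natural, dt=0.01, steps=1000):
    """Explicit Euler integration of the flow in the expectation parameters.

    natural=True integrates Equation (74); natural=False integrates the plain
    gradient flow d(eta)/dt = grad F(eta) (an assumed reading of the text)."""
    eta = tuple(eta0)
    path = [eta]
    for _ in range(steps):
        g = euclidean_gradient(eta)
        if natural:
            g = tuple((1.0 - e * e) * gj for e, gj in zip(eta, g))
        eta = tuple(e + dt * gj for e, gj in zip(eta, g))
        path.append(eta)
    return path


start = (-0.25, -0.25)
natural_path = euler_flow(start, natural=True)
plain_path = euler_flow(start, natural=False)
print(natural_path[-1], F(natural_path[-1]))
print(any(abs(c) > 1.0 for eta in plain_path for c in eta))
```

With these settings the natural gradient path stays inside the square and approaches the vertex (1, 1), where F equals a1 + a2 + a12 = 6, while the plain gradient path leaves the square [−1, +1]².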


2.5.3. Escort Probabilities

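For a given a > 0, consider the function C(a) : EV defined by $C^{(a)}(p) = \int p^{a} \, d\mu$. We have:

$$C^{(a)}_p(\theta) = \int \exp\Big(a \sum_{j=1}^{m} \theta_j\, {}^{e}U_p X_j - a \psi_p(\theta)\Big)\, p^{a} \, d\mu \tag{76}$$

and:

$$dC^{(a)}_p(0)\alpha = \int a \Big(\sum_{j=1}^{m} \alpha_j\, {}^{e}U_p X_j\Big) p^{a} \, d\mu = \sum_{j=1}^{m} \alpha_j \int a\, {}^{e}U_p X_j\, p^{a} \, d\mu = \sum_{j=1}^{m} \alpha_j \operatorname{Cov}_p\big(X_j, a p^{a-1}\big), \tag{77}$$

that is, the Euclidean gradient is $\nabla C^{(a)}_p(0) = \operatorname{Cov}_p(a p^{a-1}, X)$ (row vector). The natural gradient is
computed from Equation (35) as:

$$\widetilde{\nabla} C^{(a)}(p) = I_B^{-1}(p)\big(\nabla C^{(a)}_p(0)\big)' = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\big(X, a p^{a-1}\big), \tag{78}$$

while the Riemannian gradient follows from Equation (36):

$$\nabla C^{(a)}(p) = \operatorname{Cov}_p\big(a p^{a-1}, X\big) \operatorname{Cov}_p(X, X)^{-1}\, {}^{e}U_p X. \tag{79}$$

Note that the Riemannian gradient is the orthogonal projection of the random variable $a p^{a-1}$ onto
the tangent space TpEV = eUpV.
The probability density $p^{a} / C^{(a)}(p)$ is called the escort density in the literature on non-extensive
statistical mechanics; see, e.g., [19] (Section 7.4).
We compute now the tangent mapping of $\mathcal{E}_V \ni p \mapsto p^{a} / C^{(a)}(p) \in \mathcal{P}_>$. Let us extend the basis
X1, . . . , Xm to a basis X1, . . . , Xn, n ≥ m, whose exponential family is full, i.e., equal to P>. The
non-parametric coordinate of $q = \big(\exp\big(\sum_{j=1}^{m} \theta_j\, {}^{e}U_p X_j - \psi_p(\theta)\big)\, p\big)^{a} / C^{(a)}_p(\theta)$ in the chart centered at
$\bar p = p^{a} / C^{(a)}_p(0)$ is the p̄-centering of the random variable:

$$\begin{aligned}
\log\Big(\frac{q}{\bar p}\Big) &= \log\left(\frac{\big(\exp\big(\sum_{j=1}^{m} \theta_j\, {}^{e}U_p X_j - \psi_p(\theta)\big)\, p\big)^{a} / C^{(a)}_p(\theta)}{p^{a} / C^{(a)}_p(0)}\right) \\
&= a \sum_{j=1}^{m} \theta_j\, {}^{e}U_p X_j - a \psi_p(\theta) + \ln C^{(a)}_p(0) - \ln C^{(a)}_p(\theta),
\end{aligned} \tag{80}$$

that is,

$$v = a \sum_{j=1}^{m} \theta_j\, {}^{e}U_{\bar p} X_j. \tag{81}$$

The coordinates of v in the basis ${}^{e}U_{\bar p} X_1, \dots, {}^{e}U_{\bar p} X_n$ are $(a\theta_1, \dots, a\theta_m, 0, \dots, 0)$, and the Jacobian of
$\theta \mapsto (a\theta, 0_{n-m})$ is the m × n matrix $[a I_m \,|\, 0_{m \times (n-m)}]$.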


2.5.4. Polarization Measure

The polarization measure has been introduced in Economics by [20]. Here, we consider the
qualitative version of [21]. If π is a distribution on a finite set, the probability that in three independent
samples from π there are exactly two equal is $3 \sum_j \pi_j^2 (1 - \pi_j)$. If p ∈ EV, define:

$$G(p) = \int p^2 (1 - p) \, d\mu = C^{(2)}(p) - C^{(3)}(p), \tag{82}$$

where C(2) and C(3) are defined as in Example 2.5.3.
From Equation (78), we find the natural gradient:

$$\widetilde{\nabla} G(p) = \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\big(X, 2p - 3p^2\big). \tag{83}$$

Note that $\widetilde{\nabla} G(p) = 0$ if p is constant; see Figure 4.

Figure 4. Normalized polarization.
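
The probabilistic interpretation of the polarization measure can be checked by enumeration. The sketch below is an illustration under assumptions: a three-point distribution chosen arbitrarily, the counting measure, and hypothetical function names. It compares the probability that exactly two of three independent samples coincide with the formula 3 ∑j πj²(1 − πj) and with Equation (82).

```python
from itertools import product


def prob_exactly_two_equal(pi):
    """Probability that exactly two of three independent samples from pi coincide,
    computed by enumerating all ordered triples of outcomes."""
    total = 0.0
    for i, j, k in product(range(len(pi)), repeat=3):
        if len({i, j, k}) == 2:
            total += pi[i] * pi[j] * pi[k]
    return total


def polarization_formula(pi):
    """Closed form 3 * sum_j pi_j^2 (1 - pi_j)."""
    return 3 * sum(p * p * (1 - p) for p in pi)


def G(pi):
    """Equation (82) with counting measure: C^(2)(p) - C^(3)(p)."""
    return sum(p**2 for p in pi) - sum(p**3 for p in pi)


pi = [0.5, 0.3, 0.2]
print(prob_exactly_two_equal(pi), polarization_formula(pi), 3 * G(pi))
```

The three printed values coincide, so G is one third of the probability of observing exactly two equal outcomes.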

3. Second Order Calculus

In this section, we turn to considering second order calculus, in particular Hessians, in order to
prepare the discussion of the Newton method for the relaxed optimization of Section 4.

3.1. Metric Derivative (Levi–Civita connection)

Let V, W : EV be vector fields, that is, V(p), W(p) ∈ TpEV = eUpV, p ∈ EV. Consider the real
function R = g(V, W): EV → R, whose value at p ∈ EV is R(p) = gp(V(p), W(p)) = Ep [V(p)W(p)].
Assuming smoothness, we want to compute the derivative of R along the vector field Y : EV, that is,
(DYR)(p) = dRp(0)α, with α = ˙σp(Y(p)). The expression of R in the chart centered at p is, according
to Equation (27),

θ �→ Rp(θ) = ˙σp(V(ep(θ)))′IB(ep(θ)) ˙σp(W(ep(θ))) = Vp(θ)′IB,p(θ)Wp(θ),
(84)

where Vp and Wp are the presentation in the chart of the vector fields V and W, respectively.


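The i-th component ∂iRp(θ) of the Euclidean gradient ∇Rp(θ) is:

$$\begin{aligned}
\partial_i R_p(\theta) &= \partial_i\big(V_p(\theta)' I_{B,p}(\theta) W_p(\theta)\big) \\
&= \partial_i V_p(\theta)' I_{B,p}(\theta) W_p(\theta) + V_p(\theta)' \partial_i I_{B,p}(\theta) W_p(\theta) + V_p(\theta)' I_{B,p}(\theta) \partial_i W_p(\theta) \\
&= \Big(\partial_i V_p(\theta) + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \partial_i I_{B,p}(\theta) V_p(\theta)\Big)' I_{B,p}(\theta) W_p(\theta) \\
&\quad + V_p(\theta)' I_{B,p}(\theta) \Big(\partial_i W_p(\theta) + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \partial_i I_{B,p}(\theta) W_p(\theta)\Big),
\end{aligned} \tag{85}$$

so that the derivative at θ along $\alpha = \dot\sigma_{e_p(\theta)}(Y(e_p(\theta)))$ is:

$$\begin{aligned}
dR_p(\theta)\alpha &= \Big(dV_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \big(dI_{B,p}(\theta)\alpha\big) V_p(\theta)\Big)' I_{B,p}(\theta) W_p(\theta) \\
&\quad + V_p(\theta)' I_{B,p}(\theta) \Big(dW_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \big(dI_{B,p}(\theta)\alpha\big) W_p(\theta)\Big).
\end{aligned} \tag{86}$$

Proposition 1. If we define DYV to be the vector field on EV, whose value at q = ep(θ) has coordinates
centered at p given by:

$$\dot\sigma_p(D_Y V(q)) = dV_p(\theta)\alpha + \tfrac{1}{2} I_{B,p}^{-1}(\theta) \big(dI_{B,p}(\theta)\alpha\big) V_p(\theta), \qquad \alpha = \dot\sigma_p(Y(q)), \tag{87}$$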

then:
DYg(V, W) = g(DYV, W) + g(V, DYW),
(88)

i.e., Equation (87) is a metric covariant derivative; see [6] (Ch. 2 §3), [8] (VIII §4), [4] (§5.3.2).

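The metric derivative Equation (87) could be computed from the flow of the vector field Y. Let
$(t, p) \mapsto \gamma(t, p)$ be the flow of the vector field Y, i.e., δγ(t, p) = Y(γ(t, p)) and γ(0, p) = p. Using
Equation (23), we have:

$$\begin{aligned}
\frac{d}{dt} \dot\sigma_p(V(\gamma(t, p)))\Big|_{t=0} &= \frac{d}{dt} V_p(\sigma_p(\gamma(t, p)))\Big|_{t=0} \\
&= dV_p(\sigma_p(\gamma(t, p)))\, \frac{d}{dt} \sigma_p(\gamma(t, p))\Big|_{t=0} = dV_p(0)\, \dot\sigma_p(\delta\gamma(0, p)) = dV_p(0)\, \dot\sigma_p(Y(p)),
\end{aligned} \tag{89}$$

and:

$$\frac{d}{dt} I_B(\gamma(t, p))\Big|_{t=0} = \frac{d}{dt} I_{B,p}(\sigma_p(\gamma(t, p)))\Big|_{t=0} = dI_{B,p}(0)\, \dot\sigma_p(\delta\gamma(0, p)) = dI_{B,p}(0)\, \dot\sigma_p(Y(p)), \tag{90}$$

so that:

$$\dot\sigma_p(D_Y V(p)) = \frac{d}{dt} \dot\sigma_p(V(\gamma(t, p)))\Big|_{t=0} + \frac{1}{2} I_B^{-1}(p) \Big(\frac{d}{dt} I_B(\gamma(t, p))\Big|_{t=0}\Big) V_p(0). \tag{91}$$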

Let us check the symmetry of the metric covariant derivative to show that it is actually the unique
Riemannian or Levi–Civita affine connection; see [6] (Th. 3.6).
The Lie bracket of the vector fields V and W is the vector field [V, W], whose coordinates are:

[V, W]p(θ) = dVp(0)˙σp(W(p)) − dWp(0) ˙σp(V(p)).
(92)


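As the ij entry of ∂kIB,p(0) is ∂k∂i∂jψp(0), the symmetry (dIB,p(0)α)β = (dIB,p(0)β)α holds,
and we have:

$$\begin{aligned}
\dot\sigma_p\big(D_W V(p) - D_V W(p)\big) &= dV_p(0)\, \dot\sigma_p(W(p)) + \tfrac{1}{2} I_B^{-1}(p) \big(dI_{B,p}(0)\, \dot\sigma_p(W(p))\big) V_p(0) \\
&\quad - dW_p(0)\, \dot\sigma_p(V(p)) - \tfrac{1}{2} I_B^{-1}(p) \big(dI_{B,p}(0)\, \dot\sigma_p(V(p))\big) W_p(0) \\
&= \dot\sigma_p([V, W](p)).
\end{aligned} \tag{93}$$

The term $\Gamma^k(p) = \tfrac{1}{2} I_{B,p}^{-1}(0)\, \partial_k I_{B,p}(0)$ of Equation (87) is sometimes referred to as the Christoffel
matrix, but we do not use this terminology in this paper. As:

$$I_{B,p}(\theta) = I_B(e_p(\theta)) = \big[\operatorname{Cov}_{e_p(\theta)}(X_i, X_j)\big]_{i,j=1,\dots,m} = \big[\partial_i \partial_j \psi_p(\theta)\big]_{i,j=1,\dots,m}, \tag{94}$$

we have $\partial_k I_B(e_p(\theta)) = [\partial_i \partial_j \partial_k \psi_p(\theta)]_{i,j=1,\dots,m} = \big[\operatorname{Cov}_{e_p(\theta)}(X_i, X_j, X_k)\big]_{i,j=1,\dots,m}$ and:

$$\Gamma^k(p) = \frac{1}{2} \big[\operatorname{Cov}_p(X_i, X_j)\big]^{-1}_{i,j=1,\dots,m} \big[\operatorname{Cov}_p(X_i, X_j, X_k)\big]_{i,j=1,\dots,m}. \tag{95}$$

If V, W are vector fields of TEV, we have:

$$\Gamma(p, V, W) = \frac{1}{2} I_B^{-1}(p) \operatorname{Cov}_p(X, V, W) = \frac{1}{2} I_B^{-1}(p)\, \mathrm{E}_p\big[{}^{e}U_p X\, V W\big], \tag{96}$$

which is the projection of V(p)W(p)/2 on eUpV.
Notice also that:

$$\big(dI_{B,p}^{-1}(0)\alpha\big) I_{B,p}(0) = -I_{B,p}^{-1}(0) \big(dI_{B,p}(0)\alpha\big) I_{B,p}^{-1}(0) I_{B,p}(0) = -I_{B,p}^{-1}(0) \big(dI_{B,p}(0)\alpha\big). \tag{97}$$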

3.2. Acceleration

Let p(t), t ∈ I, be a smooth curve in EV. Then, the velocity $\delta p(t) = \frac{d}{dt} \log(p(t))$ is a vector field
V(p(t)) = δp(t), defined on the support p(I) of the curve. As the curve is the flow of the velocity
field, we can compute the metric derivative of the velocity along the velocity itself, Dδpδp, from
Equation (91) with V(p(0)) = δp(0), to get:

$$\begin{aligned}
\dot\sigma_{p(0)}\big(D_{\delta p}\delta p\big)(p(0)) &= \frac{d}{dt} \dot\sigma_{p(0)}(\delta p(t))\Big|_{t=0} + \frac{1}{2} I_B^{-1}(p(0)) \Big(\frac{d}{dt} I_B(p(t))\Big|_{t=0}\Big)\, \dot\sigma_{p(0)}(\delta p(0)) \\
&= \frac{d^2}{dt^2} \sigma_{p(0)}(p(t))\Big|_{t=0} + \frac{1}{2} I_B^{-1}(p(0)) \Big(\frac{d}{dt} I_B(p(t))\Big|_{t=0}\Big)\, \dot\sigma_{p(0)}(\delta p(0)),
\end{aligned} \tag{98}$$

which can be defined to be the Riemannian acceleration of the curve at t = 0.
Let us write θ(t) = σp(p(t)), p = p(0) and:

$$p(t) = \exp\Big(\sum_{j=1}^{m} \theta_j(t)\, {}^{e}U_p X_j - \psi_p(\theta(t))\Big) \cdot p, \tag{99}$$


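so that $\dot\sigma_p(\delta p)(0) = \dot\theta(0)$ and $\frac{d^2}{dt^2} \sigma_p(p(t))\big|_{t=0} = \ddot\theta(0)$. We have:

$$\frac{d}{dt} I_B(p(t))\Big|_{t=0} = \frac{d}{dt} I_{B,p}(\theta(t))\Big|_{t=0} = \frac{d}{dt} \operatorname{Hess} \psi_p(\theta(t))\Big|_{t=0} = \operatorname{Cov}_p\Big(X, X, \sum_{j=1}^{m} \dot\theta_j(0) X_j\Big), \tag{100}$$

so that the acceleration at p has coordinates:

$$\ddot\theta(0) + \frac{1}{2} \sum_{i,j=1}^{m} \dot\theta_i(0)\, \dot\theta_j(0) \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p(X, X_i, X_j) = \ddot\theta(0) + \frac{1}{2} \operatorname{Cov}_p(X, X)^{-1} \operatorname{Cov}_p\Big(X, \sum_{i=1}^{m} \dot\theta_i(0) X_i, \sum_{j=1}^{m} \dot\theta_j(0) X_j\Big). \tag{101}$$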

A geodesic is a curve whose acceleration is zero at each point. The exponential map is the mapping
Exp: TEV → EV defined by:

$$(p, U) \mapsto \operatorname{Exp}_p U = p(1), \tag{102}$$

where $t \mapsto p(t)$ is the geodesic, such that p(0) = p and δp(0) = U, for all U, such that the geodesic
exists for t = 1.
The exponential map is a particular retraction, that is, a family of mappings Rp, p ∈ EV, from
the tangent space at p to the manifold, Rp: TpEV → EV, such that Rp(0) = p and dRp(0) = Id;
see [4] (§5.4). It should be noted that exponential manifolds have natural retractions other than Exp, a
notable one being the exponential family itself. A retraction provides a crucial step in gradient search
algorithms by mapping a direction of increase of the objective function to a new trial point.

3.2.1. Example: Binary Independent 2.5.2 Continued.

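Let us consider the binary independent model of Section 2.5.2. We have

$$I_B(e_\lambda(\theta)) = I_{B,\lambda}(\theta) = \operatorname{diag}\big(\cosh^{-2}(\theta_j) : j = 1, \dots, d\big), \tag{103}$$

it follows that

$$\partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\big(\cosh^{-2}(\theta_j) : j = 1, \dots, d\big) = -2 \cosh^{-3}(\theta_k) \sinh(\theta_k) E_{kk}, \tag{104}$$

where Ekk is the d × d matrix with entry one at (k, k), zero otherwise. The k-th Christoffel’s matrix in
the second term in the definition of the metric derivative (aka Levi–Civita connection) is:

$$\Gamma^k_B(e_\lambda(\theta)) = \Gamma^k_\lambda(\theta) = \frac{1}{2} I_{B,\lambda}^{-1}(\theta)\, \partial_k I_{B,\lambda}(\theta) = -\tanh(\theta_k) E_{kk}. \tag{105}$$

In terms of the moments, we have $I_{B,\lambda}(\theta) = \operatorname{Cov}_\theta(X, X') = \operatorname{Hess} \psi_\lambda(\theta)$. As $\partial_k \partial_i \partial_j \psi_\lambda(\theta) =
\operatorname{Cov}_\theta(X_k, X_i, X_j)$, we can write:

$$\partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\big(\operatorname{Var}_\theta(X_j) : j = 1, \dots, d\big) = \operatorname{Cov}_\theta(X_k, X_k, X_k) E_{kk} \tag{106}$$

and:

$$\Gamma^k_\lambda(\theta) = \frac{1}{2} \operatorname{Cov}_\theta(X_k, X_k)^{-1} \operatorname{Cov}_\theta(X_k, X_k, X_k) E_{kk} = \frac{1}{2} \big(1 - \eta_k^2\big)^{-1} \big({-2\eta_k} + 2\eta_k^3\big) E_{kk} = -\eta_k E_{kk}. \tag{107}$$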


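The equations for the geodesics starting from $\theta(0)$ with velocity $\dot\theta(0) = u$ are:

$$\ddot\theta_k(t) + \sum_{i,j=1}^{m} \Gamma^k_{ij}(\theta(t))\,\dot\theta_i(t)\,\dot\theta_j(t) = \ddot\theta_k(t) - \tanh(\theta_k(t))\,(\dot\theta_k(t))^2 = 0, \qquad k = 1, \dots, d. \tag{108}$$

The ordinary differential equation:

$$\ddot\theta - \tanh(\theta)\,\dot\theta^2 = 0 \tag{109}$$

has the closed form solution:

$$\theta(t) = \operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t\right) = \tanh^{-1}\!\left(\sin\!\left(\operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t\right)\right) \tag{110}$$

for all $t$ such that:

$$-\pi/2 < \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t < \pi/2, \tag{111}$$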

where gd: R →] − π/2, +π/2[ is the Gudermannian function, that is, gd′(x) = 1/ cosh x, gd(0) = 0;
in closed form, gd(x) = arcsin(tanh(x)). In fact, if θ is a solution of Equation (109), then:

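$$\frac{d}{dt}\operatorname{gd}(\theta(t)) = \frac{\dot\theta(t)}{\cosh(\theta(t))} \tag{112}$$

$$\frac{d^2}{dt^2}\operatorname{gd}(\theta(t)) = -\frac{\sinh(\theta(t))\,(\dot\theta(t))^2}{\cosh^2(\theta(t))} + \frac{\ddot\theta(t)}{\cosh(\theta(t))} = \frac{1}{\cosh(\theta(t))}\left(\ddot\theta(t) - \tanh(\theta(t))\,(\dot\theta(t))^2\right) = 0, \tag{113}$$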

so that t �→ gd(θ(t)) coincides (where it is defined) with an affine function characterized by the initial
conditions.
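The closed form of Equation (110) can be checked against a direct numerical integration of Equation (109). The following sketch uses a hand-written classical Runge-Kutta scheme; the integrator, the step count and the initial data `theta0`, `v0` are assumptions made only for illustration, chosen so that condition (111) holds at $t = 1$.

```python
import math

def gd(x):
    """Gudermannian function, gd(x) = arcsin(tanh(x))."""
    return math.asin(math.tanh(x))

def gd_inv(v):
    """Inverse Gudermannian on ]-pi/2, pi/2[: gd^{-1}(v) = artanh(sin(v))."""
    return math.atanh(math.sin(v))

def geodesic_closed_form(theta0, v0, t):
    """Equation (110); only valid while condition (111) holds."""
    return gd_inv(gd(theta0) + v0 * t / math.cosh(theta0))

def geodesic_rk4(theta0, v0, t_end, steps=2000):
    """Integrate theta'' = tanh(theta) * theta'^2 (Equation (109)) with classical RK4."""
    def rhs(th, v):
        return v, math.tanh(th) * v * v
    h = t_end / steps
    th, v = theta0, v0
    for _ in range(steps):
        k1 = rhs(th, v)
        k2 = rhs(th + 0.5 * h * k1[0], v + 0.5 * h * k1[1])
        k3 = rhs(th + 0.5 * h * k2[0], v + 0.5 * h * k2[1])
        k4 = rhs(th + h * k3[0], v + h * k3[1])
        th += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        v += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return th

theta0, v0 = 0.5, 0.7                        # illustrative initial data (assumption)
exact = geodesic_closed_form(theta0, v0, 1.0)
numeric = geodesic_rk4(theta0, v0, 1.0)
```

The two values should coincide up to the integration error, which illustrates that $t \mapsto \operatorname{gd}(\theta(t))$ is affine along the geodesic.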
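In particular, at $t = 1$, the geodesic Equation (110) defines the Riemannian exponential
$\operatorname{Exp}\colon TE_V \to E_V$.
If $(p, U) \in TE_V$, that is, $p \in E_V$ and $U \in T_pE_V$, then $\sigma_\lambda(p) = \theta(0)$ and
$U = \sum_j u_j\,{}^{e}U_p X_j$, $\dot\sigma_\lambda(U) = u$. If:

$$-\pi/2 < \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} < \pi/2, \tag{114}$$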

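then we can take $\dot\theta(0) = u$ and $t = 1$, so that:

$$\operatorname{Exp}_p \colon U \overset{\dot\sigma_\lambda}{\longmapsto} u \mapsto \left(\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right) : j = 1, \dots, d\right) \overset{e_\lambda}{\longmapsto} \prod_{j=1}^{m} \exp\!\left(\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right) X_j - \psi\!\left(\operatorname{gd}^{-1}\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)\right)\right) 2^{-m}. \tag{115}$$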

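We have:

$$\exp\!\left(\operatorname{gd}^{-1}(v)\right) = \exp\!\left(\tanh^{-1}(\sin(v))\right) = \sqrt{\frac{1 + \sin v}{1 - \sin v}} \tag{116}$$

and, since $\psi(\theta) = \log\cosh(\theta)$ for each binary component:

$$\psi\!\left(\operatorname{gd}^{-1}(v)\right) = \log\cosh\!\left(\operatorname{gd}^{-1}(v)\right) = \log\!\left(\frac{1}{\cos v}\right), \tag{117}$$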


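hence $u \mapsto \operatorname{Exp}_p\!\left(\sum_{j=1}^{d} u_j\,{}^{e}U_p X_j\right)$ is given for:

$$u \in \mathop{\times}_{j=1}^{d} \left]\cosh(\theta_j)\left(-\pi/2 - \operatorname{gd}(\theta_j)\right),\ \cosh(\theta_j)\left(\pi/2 - \operatorname{gd}(\theta_j)\right)\right[, \tag{118}$$

by:

$$\operatorname{Exp}_\theta(u) = \prod_{j=1}^{m} \cos\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right) \left(\frac{1 + \sin\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)}{1 - \sin\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right)}\right)^{X_j/2} 2^{-m} = \prod_{j=1}^{m} \left(1 + \sin\!\left(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\right) X_j\right) 2^{-m} \in E_V. \tag{119}$$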

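The expectation parameters are:

$$\eta_i(t) = \mathbb{E}_{\theta=0}\!\left[X_i \prod_{j=1}^{m} \left(1 + \sin\!\left(\operatorname{gd}(\theta_j) + \frac{t u_j}{\cosh(\theta_j)}\right) X_j\right)\right] = \sin\!\left(\operatorname{gd}(\theta_i) + \frac{t u_i}{\cosh(\theta_i)}\right), \tag{120}$$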

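and:

$$\operatorname{gd}(\theta_j) = \arcsin(\eta_j), \qquad \cosh(\theta_j) = \frac{1}{\left(1 - \eta_j^2\right)^{1/2}}, \tag{121}$$

so that the exponential in terms of the expectation parameters is:

$$\operatorname{Exp}_\eta(u) = \left(\sin\!\left(\arcsin\eta_j + \left(1 - \eta_j^2\right)^{1/2} u_j\right) : j = 1, \dots, m\right). \tag{122}$$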

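The inverse of the Riemannian exponential provides a notion of translation between two elements
of the exponential model, which is a particular parametrization of the model. Writing $\eta_1^j$ and $\eta_2^j$ for the
$j$-th components of the two points $\eta_1$ and $\eta_2$:

$$\overrightarrow{\eta_1\eta_2} = \operatorname{Exp}^{-1}_{\eta_1}\eta_2 = \left(\left(1 - (\eta_1^j)^2\right)^{-1/2}\left(\arcsin\eta_2^j - \arcsin\eta_1^j\right) : j = 1, \dots, m\right) \tag{123}$$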
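Equations (122) and (123) translate directly into code. The sketch below is a minimal illustration in the expectation parameters; the function names `exp_eta` and `log_eta` and the sample points are assumptions introduced here, and the check is simply that the two maps invert each other.

```python
import math

def exp_eta(eta, u):
    """Equation (122): componentwise sin(arcsin(eta_j) + sqrt(1 - eta_j^2) * u_j)."""
    return [math.sin(math.asin(e) + math.sqrt(1 - e * e) * uj)
            for e, uj in zip(eta, u)]

def log_eta(eta1, eta2):
    """Equation (123): tangent vector at eta1 whose geodesic reaches eta2 at t = 1."""
    return [(math.asin(b) - math.asin(a)) / math.sqrt(1 - a * a)
            for a, b in zip(eta1, eta2)]

eta1 = [0.75, 0.75]        # starting point, as in Figure 5
eta2 = [-0.2, 0.4]         # arbitrary target point (assumption)
u = log_eta(eta1, eta2)
back = exp_eta(eta1, u)    # should reproduce eta2
```

Because each component is handled independently, the round trip works coordinate by coordinate, which reflects the product structure of the binary independent model.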

In particular, at $\theta = 0$, we have the geodesic:

$$t \mapsto \prod_{j=1}^{d} \left(1 + \sin(t u_j) X_j\right) 2^{-m}, \qquad |t| < \frac{\pi}{2\max_j |u_j|} \tag{124}$$

Figure 5 shows some geodesic curves.

[Figure 5: plot in the expectation parameters, with axes η1 and η2 each ranging from −1.0 to 1.0.]

Figure 5. Geodesics from η = (0.75, 0.75).

3.3. Riemannian Hessian

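Let $\phi\colon E_V \to \mathbb{R}$ with Riemannian gradient $\nabla\phi(p) = \sum_i (\widetilde\nabla\phi)_i(p)\,{}^{e}U_p X_i$, $\widetilde\nabla\phi(p) = I_B^{-1}(p)\nabla\phi_p(0)$.
The Riemannian Hessian of $\phi$ is the metric derivative of the gradient $\nabla\phi$ along the vector field $Y$, that
is, $\operatorname{Hess}_Y\phi = D_Y\nabla\phi$; see [6] (Ch. 6, Ex. 11), [4] (§5.5). In the following, we denote by the symbol Hess,
without a subscript, the ordinary Hessian matrix.
From Equation (87), we have the coordinates of $\operatorname{Hess}_Y\phi(p)$. Given a generic tangent vector $\alpha$, we
compute from Equation (38):

$$\left. d(\nabla\phi)_p(\theta)\alpha\right|_{\theta=0} = \left. d\!\left(I^{-1}_{B,p}(\theta)\nabla\phi_p(\theta)\right)\alpha\right|_{\theta=0} = \left(dI^{-1}_{B,p}(0)\alpha\right)\nabla\phi_p(0) + I^{-1}_{B,p}(0)\operatorname{Hess}\phi_p(0)\alpha = -I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p) + I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha \tag{125}$$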

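and, upon substitution of $(\nabla\phi)_p$ to $V_p$ in Equation (87), with $\alpha = \dot\sigma_p(Y(p))$,

$$\begin{aligned}
\dot\sigma_p(\operatorname{Hess}_Y\phi(p)) &= d(\nabla\phi)_p(0)\alpha + \frac{1}{2} I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)(\nabla\phi)_p(0) \\
&= -I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p) + I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha + \frac{1}{2} I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p) \\
&= I_B^{-1}(p)\operatorname{Hess}\phi_p(0)\alpha - \frac{1}{2} I_B^{-1}(p)\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p) \\
&= I_B^{-1}(p)\left(\operatorname{Hess}\phi_p(0)\alpha - \frac{1}{2}\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p)\right)
\end{aligned} \tag{126}$$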

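$\operatorname{Hess}_Y\phi$ is characterized by knowing the value of the function $g(\operatorname{Hess}_Y\phi, X)$ on $E_V$ for all vector fields $X$. We have
from Equation (126), with $\alpha = \dot\sigma_p(Y(p))$ and $\beta = \dot\sigma_p(X(p))$,

$$g_p\!\left(\operatorname{Hess}_{Y(p)}\phi(p), X(p)\right) = \beta'\operatorname{Hess}\phi_p(0)\alpha - \frac{1}{2}\beta'\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p). \tag{127}$$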


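This is the presentation of the Riemannian Hessian as a bi-linear form on $TE_V$; see the comments
in [4] (Prop. 5.5.2-3). Note that the Riemannian Hessian is positive semidefinite if:

$$\alpha'\operatorname{Hess}\phi_p(0)\alpha \ge \frac{1}{2}\alpha'\left(dI_{B,p}(0)\alpha\right)\widetilde\nabla\phi(p), \qquad \alpha \in \mathbb{R}^m, \tag{128}$$

and positive definite if the inequality is strict for every $\alpha \neq 0$.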

4. Application to Combinatorial Optimization

We conclude our paper by showing how the geometric method applies to the problem of finding
the maximum of the expected value of a function.

4.1. Hessian of a Relaxed Function

Here is a key example of a vector field. Let $f$ be any bounded random variable, and define the
relaxed function to be $\phi(p) = \mathbb{E}_p[f]$, $p \in \mathcal{P}_>$. Define $F(p)$ to be the projection of $f$, as an element of
$L^2(p)$, onto $T_pE_V = {}^{e}U_pV$, i.e., $F(p)$ is the element of ${}^{e}U_pV$ such that:

$$\mathbb{E}_p\left[(f - F(p))\,v\right] = 0, \qquad v \in {}^{e}U_pV \tag{129}$$

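In the basis ${}^{e}U_pB$, we have $F(p) = \sum_i \hat f_{p,i}\,{}^{e}U_pX_i$ and:

$$\operatorname{Cov}_p(f, X_j) = \sum_i \hat f_{p,i}\,\mathbb{E}_p\!\left[{}^{e}U_pX_i\,{}^{e}U_pX_j\right], \qquad j = 1, \dots, m, \tag{130}$$

so that $\hat f_p = I_B^{-1}(p)\operatorname{Cov}_p(X, f)$ and

$$F(p) = \hat f_p'\,{}^{e}U_pX = \operatorname{Cov}_p(f, X)\,I_B^{-1}(p)\,{}^{e}U_pX. \tag{131}$$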

Let us compute the gradient of the relaxed function $\phi = \mathbb{E}_\cdot[f]$ on $E_V$. We have $\phi_p(\theta) = \mathbb{E}_{e_p(\theta)}[f]$,
and from the properties of exponential families, the Euclidean gradient is $\nabla\phi_p(0) = \operatorname{Cov}_p(f, X)$. It
follows that the natural gradient is:

$$\widetilde\nabla\phi_p(0) = I_B^{-1}(p)\operatorname{Cov}_p(X, f) = \hat f_p, \tag{132}$$

and the Riemannian gradient is $\nabla\phi(p) = F(p)$.
From the properties of exponential families, we have:

$$\operatorname{Hess}\phi_p(0) = \operatorname{Cov}_p(X, X, f),$$

so that, in this case, Equation (127), when written in terms of the moments, is:

$$\beta'\operatorname{Cov}_p(X, X, f)\,\alpha - \frac{1}{2}\beta'\operatorname{Cov}_p(X, X, \alpha\cdot X)\operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, f). \tag{133}$$

4.1.1. Example: Binary Independent 2.5.2 and 3.2.1 Continued

We list below the computation of the Hessian in the case of two binary independent variables.
Computations were done with Sage [22], which allows both the reduction $x_i^2 = 1$ in the ring of
polynomials and the simplifications in the symbolic ring of parameters.

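$$\operatorname{Cov}_\eta(X, f) = \begin{bmatrix} -\left(\eta_1^2 - 1\right)a_1 - \left(\eta_1^2\eta_2 - \eta_2\right)a_{12} \\ -\left(\eta_2^2 - 1\right)a_2 - \left(\eta_1\eta_2^2 - \eta_1\right)a_{12} \end{bmatrix} = \begin{bmatrix} -(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1) \\ -(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2) \end{bmatrix} \tag{134}$$

$$\operatorname{Cov}_\eta(X, X) = \begin{bmatrix} -\eta_1^2 + 1 & 0 \\ 0 & -\eta_2^2 + 1 \end{bmatrix} = \begin{bmatrix} -(\eta_1 - 1)(\eta_1 + 1) & 0 \\ 0 & -(\eta_2 - 1)(\eta_2 + 1) \end{bmatrix} \tag{135}$$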


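$$\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, f) = \begin{bmatrix} a_{12}\eta_2 + a_1 \\ a_{12}\eta_1 + a_2 \end{bmatrix} = \nabla F(\eta) \tag{136}$$

$$\operatorname{Cov}_\eta(X, X, f) = \begin{bmatrix} 2\left(\eta_1^3 - \eta_1\right)a_1 + 2\left(\eta_1^3\eta_2 - \eta_1\eta_2\right)a_{12} & \left(\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1\right)a_{12} \\ \left(\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1\right)a_{12} & 2\left(\eta_1\eta_2^3 - \eta_1\eta_2\right)a_{12} + 2\left(\eta_2^3 - \eta_2\right)a_2 \end{bmatrix} = \begin{bmatrix} 2(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1)\eta_1 & (\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)a_{12} \\ (\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)a_{12} & 2(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2)\eta_2 \end{bmatrix} \tag{137}$$

$$\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, f) = \begin{bmatrix} -2(a_{12}\eta_2 + a_1)\eta_1 & -a_{12}\eta_2^2 + a_{12} \\ -a_{12}\eta_1^2 + a_{12} & -2(a_{12}\eta_1 + a_2)\eta_2 \end{bmatrix} \tag{138}$$

$$\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) = \begin{bmatrix} 2(a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\eta_1 & 0 \\ 0 & 2(a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\eta_2 \end{bmatrix} \tag{139}$$

$$\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) = \begin{bmatrix} -2(a_{12}\eta_2 + a_1)\eta_1 & 0 \\ 0 & -2(a_{12}\eta_1 + a_2)\eta_2 \end{bmatrix} \tag{140}$$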

The Riemannian Hessian as a matrix in the basis of the tangent space is:

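$$\operatorname{Hess}F(\eta) = \operatorname{Cov}_\eta(X, X)^{-1}\left(\operatorname{Cov}_\eta(X, X, f) - \frac{1}{2}\operatorname{Cov}_\eta(X, X, \nabla F(\eta))\right) = \begin{bmatrix} -(a_{12}\eta_2 + a_1)\eta_1 & -a_{12}(\eta_2 + 1)(\eta_2 - 1) \\ -a_{12}(\eta_1 + 1)(\eta_1 - 1) & -(a_{12}\eta_1 + a_2)\eta_2 \end{bmatrix} \tag{141}$$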
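The symbolic results (134) to (141) can be cross-checked numerically without Sage. The sketch below enumerates the four points of $\{-1, +1\}^2$, computes the second- and third-order centered moments directly, and assembles the matrix of Equation (141). It assumes, as Equations (134) and (142) suggest, that $f(x) = a_1x_1 + a_2x_2 + a_{12}x_1x_2$; all function names and the test values are hypothetical.

```python
import itertools

def riemannian_hessian(eta, a1, a2, a12):
    """Matrix of Equation (141), built from moments by brute-force enumeration."""
    pts = list(itertools.product((-1, 1), repeat=2))
    # Independent binary variables with means eta_1, eta_2
    prob = {x: (1 + eta[0] * x[0]) * (1 + eta[1] * x[1]) / 4 for x in pts}

    def f(x):
        return a1 * x[0] + a2 * x[1] + a12 * x[0] * x[1]

    def mean(g):
        return sum(prob[x] * g(x) for x in pts)

    def cov(*gs):
        # Centered product moment; for orders 2 and 3 this equals the cumulant
        ms = [mean(g) for g in gs]
        def centered_product(x):
            out = 1.0
            for g, m in zip(gs, ms):
                out *= g(x) - m
            return out
        return mean(centered_product)

    X = [lambda x: x[0], lambda x: x[1]]
    var = [cov(X[i], X[i]) for i in range(2)]        # Cov(X, X) is diagonal here
    nat = [cov(X[i], f) / var[i] for i in range(2)]  # natural gradient, Equation (136)

    def lin(x):
        return nat[0] * x[0] + nat[1] * x[1]

    return [[(cov(X[i], X[j], f) - 0.5 * cov(X[i], X[j], lin)) / var[i]
             for j in range(2)] for i in range(2)]

def closed_form_hessian(eta, a1, a2, a12):
    """Right-hand side of Equation (141)."""
    A = a12 * eta[1] + a1
    B = a12 * eta[0] + a2
    return [[-A * eta[0], -a12 * (eta[1] ** 2 - 1)],
            [-a12 * (eta[0] ** 2 - 1), -B * eta[1]]]

H_enum = riemannian_hessian([0.3, -0.5], 1.0, 2.0, 3.0)
H_formula = closed_form_hessian([0.3, -0.5], 1.0, 2.0, 3.0)
```

The two matrices should agree entry by entry; note that the matrix is not symmetric, since it represents the Hessian as an operator in the basis of the tangent space rather than as a bilinear form.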

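As a check, let us compute the Riemannian Hessian as a natural Hessian in the Riemannian
parameters, $\left.\operatorname{Hess}\phi\circ\operatorname{Exp}_p(u)\right|_{u=0}$; see [4] (Prop. 5.5.4). We have:

$$F\circ\operatorname{Exp}_\eta(u) = a_{12}\sin\!\left(\sqrt{-\eta_1^2 + 1}\,u_1 + \arcsin(\eta_1)\right)\sin\!\left(\sqrt{-\eta_2^2 + 1}\,u_2 + \arcsin(\eta_2)\right) + a_1\sin\!\left(\sqrt{-\eta_1^2 + 1}\,u_1 + \arcsin(\eta_1)\right) + a_2\sin\!\left(\sqrt{-\eta_2^2 + 1}\,u_2 + \arcsin(\eta_2)\right) \tag{142}$$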

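and:

$$\left.\operatorname{Hess}F\circ\operatorname{Exp}_\eta(u)\right|_{u=0} = \begin{bmatrix} \left(\eta_1^2 - 1\right)a_{12}\eta_1\eta_2 + \left(\eta_1^2 - 1\right)a_1\eta_1 & \left(\eta_1^2 - 1\right)\left(\eta_2^2 - 1\right)a_{12} \\ \left(\eta_1^2 - 1\right)\left(\eta_2^2 - 1\right)a_{12} & \left(\eta_2^2 - 1\right)a_{12}\eta_1\eta_2 + \left(\eta_2^2 - 1\right)a_2\eta_2 \end{bmatrix} = \begin{bmatrix} (a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\eta_1 & a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) \\ a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) & (a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\eta_2 \end{bmatrix}. \tag{143}$$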


Note the presence of the factor Covη (X, X).
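The factor $\operatorname{Cov}_\eta(X, X)$ can be made explicit numerically: the ordinary Hessian of $u \mapsto F\circ\operatorname{Exp}_\eta(u)$ at $u = 0$, estimated by finite differences, should equal $\operatorname{Cov}_\eta(X, X)$ times the matrix of Equation (141). The sketch below is one way to check this; the finite-difference step `h`, the helper names and the sample values are assumptions.

```python
import math

def F(eta, a1, a2, a12):
    """Relaxed function in the expectation parameters (independent binary model)."""
    return a1 * eta[0] + a2 * eta[1] + a12 * eta[0] * eta[1]

def F_exp(eta, u, a):
    """F composed with the Riemannian exponential of Equation (122)."""
    new = [math.sin(math.asin(e) + math.sqrt(1 - e * e) * uj)
           for e, uj in zip(eta, u)]
    return F(new, *a)

def fd_hessian(g, h=1e-4):
    """Central second differences of g: R^2 -> R at the origin."""
    H = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            def at(si, sj):
                u = [0.0, 0.0]
                u[i] += si * h
                u[j] += sj * h
                return g(u)
            H[i][j] = (at(1, 1) - at(1, -1) - at(-1, 1) + at(-1, -1)) / (4 * h * h)
    return H

def cov_times_hessian(eta, a):
    """Cov_eta(X, X) multiplied by the matrix of Equation (141)."""
    a1, a2, a12 = a
    A = a12 * eta[1] + a1
    B = a12 * eta[0] + a2
    hess = [[-A * eta[0], -a12 * (eta[1] ** 2 - 1)],
            [-a12 * (eta[0] ** 2 - 1), -B * eta[1]]]
    var = [1 - eta[0] ** 2, 1 - eta[1] ** 2]
    return [[var[i] * hess[i][j] for j in range(2)] for i in range(2)]

eta = [0.3, -0.5]          # arbitrary interior point (assumption)
a = (1.0, 2.0, 3.0)        # a1, a2, a12 as in Figure 6
H_fd = fd_hessian(lambda u: F_exp(eta, u, a))
H_ref = cov_times_hessian(eta, a)
```

Multiplying by the covariance also symmetrizes the matrix, which is consistent with Equation (143) being an ordinary Hessian.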

4.2. Newton Method

The Newton method is an iterative method that generates a sequence of points $p_t$, with $t = 0, 1, \dots$,
that converges towards a stationary point $\hat p$ of $F(p) = \mathbb{E}_p[f]$, $p \in E_V$, that is, a critical point of the
vector field $p \mapsto \nabla F(p)$, $\nabla F(\hat p) = 0$. Here, we follow [4] (Ch. 5–6), and in particular Algorithm 5 on
Page 113.
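Let $\nabla F$ be a gradient field. We reproduce in our case the basic derivation of the Newton method
in the following. Note that, in this section, we use the notation $\operatorname{Hess}\bullet[\alpha]$ to denote $\operatorname{Hess}_\alpha\bullet$. Using
the definition of metric derivative, we have for a geodesic curve $[0, 1] \ni t \mapsto p(t) \in E_V$ connecting
$p = p(0)$ to $\hat p = p(1)$ that:

$$\frac{d}{dt}\, g_{p(t)}\!\left(\nabla F(p(t)), \delta p(t)\right) = g_{p(t)}\!\left(\operatorname{Hess}F(p(t))[\delta p(t)], \delta p(t)\right) \tag{144}$$

hence the increment from $p$ to $\hat p$ is:

$$g_{\hat p}\!\left(\nabla F(\hat p), \delta p(1)\right) - g_p\!\left(\nabla F(p), \delta p(0)\right) = \int_0^1 g_{p(t)}\!\left(\operatorname{Hess}F(p(t))[\delta p(t)], \delta p(t)\right) dt. \tag{145}$$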

Now, we assume that $\nabla F(\hat p) = 0$ and that in Equation (145), the integral is approximated by the
initial value of the integrand, that is to say, the Hessian is approximately constant on the geodesic from
$p$ to $\hat p$; we obtain:

$$-g_p\!\left(\nabla F(p), \delta p(0)\right) = g_p\!\left(\operatorname{Hess}F(p)[\delta p(0)], \delta p(0)\right) + \epsilon. \tag{146}$$

If we can solve the Newton equation:

Hess F(p(t))[u] = −∇F(p)
(147)

then u is approximately equal to the initial velocity of the geodesic connecting p to ˆp, that is, ˆp =
Expp(u).
The particular structure of the exponential manifold suggests at least two natural retractions
that could be used to move from $u$ to $\hat p$.
Namely, we have the Riemannian exponential
$(\theta_t, \theta_{t+1}) \mapsto \operatorname{Exp}_{\theta_t}(\theta_{t+1} - \theta_t)$ and the e-retraction coming from the exponential family itself and
defined by $(\theta_t, \theta_{t+1}) \mapsto e_{\theta_t}(\theta_{t+1} - \theta_t)$, with $\theta_{t+1} - \theta_t = u_t$.
In the $\theta$ parameters, with the e-retraction, the Newton method generates a sequence $(\theta_t)$ according
to the following updating rule:

$$\theta_{t+1} = \theta_t - \lambda\operatorname{Hess}F(\theta_t)^{-1}\,\widetilde\nabla F(\theta_t) \tag{148}$$

where λ > 0 is an extra parameter intended to control the step size and, in turn, the convergence to ˆθ;
see [5].
We can rewrite Equation (148) in terms of covariances as:

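$$\theta_{t+1} = \theta_t - \lambda\left(\operatorname{Cov}_{\theta_t}(X, X, f) - \frac{1}{2}\operatorname{Cov}_{\theta_t}\!\left(X, X, \widetilde\nabla F(\theta_t)\right)\right)^{-1}\widetilde\nabla F(\theta_t). \tag{149}$$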

4.3. Example: Binary Independent

In the η parameters, the Newton step is:

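$$u = -\operatorname{Hess}F(\eta)^{-1}\nabla F(\eta) = \begin{pmatrix} \dfrac{a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} \\[2ex] \dfrac{a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} \end{pmatrix} \tag{150}$$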


and the new η in the Riemannian retraction is:

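$$\operatorname{Exp}_\eta(u) = \begin{pmatrix} \sin\!\left(\dfrac{\left(a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\eta_2\right)\sqrt{-\eta_1^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} + \arcsin(\eta_1)\right) \\[2ex] \sin\!\left(\dfrac{\left(a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2\right)\sqrt{-\eta_2^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} + \arcsin(\eta_2)\right) \end{pmatrix}. \tag{151}$$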
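A minimal sketch of one damped Newton update in the $\eta$ parameters follows, combining the Hessian of Equation (141), the step of Equation (150) and the Riemannian retraction of Equation (122). The singularity tolerance, the function names and the placement of the step size $\lambda$ inside the retraction (as in Equation (152)) are choices made for this illustration.

```python
import math

def newton_step_eta(eta, a1, a2, a12):
    """Solve Hess F(eta) u = -grad F(eta), with the 2x2 matrix of Equation (141)."""
    A = a12 * eta[1] + a1          # first component of the gradient, Equation (136)
    B = a12 * eta[0] + a2          # second component of the gradient
    H = [[-A * eta[0], -a12 * (eta[1] ** 2 - 1)],
         [-a12 * (eta[0] ** 2 - 1), -B * eta[1]]]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if abs(det) < 1e-12:           # tolerance is an arbitrary assumption
        raise ZeroDivisionError("Newton step undefined: singular Riemannian Hessian")
    # Cramer's rule for H u = (-A, -B)
    u1 = (-A * H[1][1] + B * H[0][1]) / det
    u2 = (-B * H[0][0] + A * H[1][0]) / det
    return [u1, u2]

def retract(eta, u, lam=1.0):
    """Riemannian retraction of Equation (122), with the step scaled by lam."""
    return [math.sin(math.asin(e) + lam * math.sqrt(1 - e * e) * uj)
            for e, uj in zip(eta, u)]

eta = [0.3, -0.5]                               # arbitrary starting point (assumption)
u = newton_step_eta(eta, 1.0, 2.0, 3.0)         # a1 = 1, a2 = 2, a12 = 3
eta_next = retract(eta, u, lam=0.05)
```

The points where `det` vanishes are those where the Newton step is not defined, and the sine in the retraction keeps every iterate inside $[-1, 1]^2$ regardless of the step length.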

In Figure 6, we represented the vector field associated with the Newton step in the η parameters,
with λ = 0.05, using the Riemannian retraction, for the case a1 = 1, a2 = 2 and a12 = 3, with:

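$$\operatorname{Exp}_\eta(u) = \begin{pmatrix} \sin\!\left(\lambda\,\dfrac{\sqrt{-\eta_1^2 + 1}\,\left((3\eta_1 + 2)\eta_2 + 9\eta_1 + 6\right)}{3(2\eta_1 + 3)\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\eta_2 - 9} + \arcsin(\eta_1)\right) \\[2ex] \sin\!\left(\lambda\,\dfrac{\left(3(2\eta_1 + 3)\eta_2 + 2\eta_1 + 3\right)\sqrt{-\eta_2^2 + 1}}{3(2\eta_1 + 3)\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\eta_2 - 9} + \arcsin(\eta_2)\right) \end{pmatrix}. \tag{152}$$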

The red dotted lines represented in the figure identify the basins of attraction of the vector field and
correspond to the solutions of the explicit equation in η for which the Newton step u is not defined.
This vector field can be compared to that in Figure 7, associated with the Newton step for F(η) using
the Euclidean geometry. In the Euclidean geometry, F(η) is a quadratic function with one saddle point,
so that from any η, the Newton step points in the direction of the critical point. This makes the Newton
step unsuitable for an optimization algorithm. On the other hand, in the Riemannian geometry, the
vertices of the polytope are critical points for F(η), and they determine the presence of multiple basins
of attraction, as expected.

[Figure 6: vector-field plot in the expectation parameters, with axes η1 and η2 each ranging from −1.0 to 1.0.]

Figure 6. The Newton step in the η parameters, Riemannian retraction, λ = 0.05. The red dotted lines
identify the different basins of attraction and correspond to the points for which the Newton step is not
defined; cf. Equation (150). The instability close to the critical lines is represented by the longer arrows.

[Figure 7: vector-field plot in the expectation parameters, with axes η1 and η2 each ranging from −1.0 to 1.0.]

Figure 7. The Newton step in the η parameters, Euclidean geometry, λ = 0.05.

[Figure 8: vector-field plot in the natural parameters, with axes θ1 and θ2 each ranging from −2 to 2.]

Figure 8. The Newton step in the θ parameters, exponential retraction, λ = 0.015. The red dotted
lines identify the different basins of attraction and correspond to the points for which the Newton step
is not defined. The instability along the critical lines, which identifies the basins of attraction, is not
represented.

[Figure 9: vector-field plot in the natural parameters, with axes θ1 and θ2 each ranging from −2 to 2.]

Figure 9. The Newton step in the θ parameters, Euclidean geometry, λ = 0.15. The red dotted lines
identify the different basins of attraction and correspond to the points for which the Newton step is
not defined. The instability along the critical lines, which identifies the basins of attraction, is not
represented.

Figure 8 shows the Newton step in the θ parameters based on the e-retraction of Equation (149),
while Figure 9 represents the Newton step evaluated with respect to the Euclidean geometry. A
comparison of the two vector fields shows that, unlike in the η parameters, the number of
basins of attraction is the same in the two geometries; however, the scale of the vectors is different.
In particular, notice how on the plateau, for diverging θ, the Newton step in the Euclidean geometry
vanishes, while in the Riemannian geometry, it gets larger. This behavior suggests better convergence
properties for an optimization algorithm based on the Newton step evaluated using the proper
Riemannian geometry. In the θ parameters, the boundaries of the basins of attraction represented by
the red dotted lines have been computed numerically and correspond to the values of θ for which the
update step is not defined.
Finally, notice that in both the η and θ parameters, the step is not always in the direction of descent
for the function, a common behavior of the Newton method, which converges to the critical points.

5. Discussion and Conclusions

In this paper, we introduced second-order calculus over a statistical manifold, following the
approach described in [4], which has been adapted to the special case of exponential statistical
models [2,3]. By defining the Riemannian Hessian and using the notion of retraction, we developed
the proper machinery necessary for the definition of the updating rule of the Newton method for the
optimization of a function defined over an exponential family.
The examples discussed in the paper show that by taking into account the proper Riemannian
geometry of a statistical exponential family, the vector fields associated with the Newton step in the
different parametrizations change profoundly. Not only do new basins of attraction associated with local
and global minima appear, as for the expectation parameters, but the magnitude of the Newton
step is also affected, as over the plateau in the natural parameters. Such differences are expected to have a
strong impact on the performance of an optimization algorithm based on the Newton step, from both
the point of view of achievable convergence and the speed of convergence to the optimum.


The Newton method is a popular second order optimization technique based on the computation
of the Hessian of the function to be optimized and is well known for its super-linear convergence
properties. However, the use of the Newton method poses a number of issues in practice.
First of all, as the examples in Figures 6 and 8 show, the Newton step does not always point
in the direction of the natural gradient, and the algorithm may not converge to a (local) optimum
of the function. Such behavior is not unexpected; indeed the Newton method tends to converge to
critical points of the function to be optimized, which include local minima, local maxima and saddle
points. In order to obtain a direction of ascent for the function to be optimized, the Hessian must
be negative-definite, i.e., its eigenvalues must be strictly negative, which is not guaranteed in the
general case. Another important remark is related to the computational complexity associated with
the evaluation of the Hessian, compared to the (natural) gradient. Indeed, to obtain the Newton step d,
Christoffel matrices have to be evaluated, together with the third order covariances between sufficient
statistics and the function, and the Hessian has to be inverted. Finally, notice that when the Hessian is
close to being non-invertible, numerical problems may arise in the computation of the Newton step,
and the algorithm may become unstable and diverge.
In the literature, different methods have been proposed to overcome these issues. Among them,
we mention quasi-Newton methods, where the update vector is obtained using a modified Hessian,
which has been made negative-definite, for instance, by adding a proper correction matrix.
This paper represents the first step in the design of an algorithm based on the Newton method
for the optimization over a statistical model. The authors are working on the computational aspects
related to the implementation of the method, and a new paper with experimental results is in progress.

Acknowledgments: Luigi Malagò was supported by the Xerox University Affairs Committee Award and by
de Castro Statistics, Collegio Carlo Alberto, Moncalieri. Giovanni Pistone is supported by de Castro Statistics,
Collegio Carlo Alberto, Moncalieri, and is a member of GNAMPA–INdAM, Roma.

Author Contributions

All authors contributed to the design of the research. The research was carried out by all authors.
The study of the Hessian and of the Newton method in statistical manifolds was originally suggested
by Luigi Malagò. The manuscript was written by Luigi Malagò and Giovanni Pistone. All authors
have read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes. Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986; p. 283.
2. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; p. 206.
3. Pistone, G. Nonparametric Information Geometry. In Geometric Science of Information, Proceedings of the First International Conference, GSI 2013, Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science, Volume 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36.
4. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008; pp. xvi+224.
5. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006; pp. xxii+664.
6. Do Carmo, M.P. Riemannian geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston, MA, USA, 1992; pp. xiv+300.
7. Abraham, R.; Marsden, J.E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications, 2nd ed.; Applied Mathematical Sciences, Volume 75; Springer: New York, NY, USA, 1988; pp. x+654.
8. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1995; pp. xiv+364.
9. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2010.
10. Malagò, L.; Matteucci, M.; Dal Seno, B. An information geometry perspective on estimation of distribution algorithms: Boundary analysis. In Proceedings of the 2008 GECCO Conference Companion On Genetic and Evolutionary Computation (GECCO ’08); ACM: New York, NY, USA, 2008; pp. 2081–2088.
11. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
12. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by Estimation of Empirical Covariances. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
13. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA ’11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
14. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
15. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
16. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007; pp. xiv+246.
17. Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis, Politecnico di Milano, Milano, Italy, 2012.
18. Gallavotti, G. Statistical Mechanics: A Short Treatise; Texts and Monographs in Physics; Springer: Berlin, Germany, 1999; pp. xiv+339.
19. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149.
20. Esteban, J.; Ray, D. On the Measurement of Polarization. Econometrica 1994, 62, 819–851.
21. Montalvo, J.; Reynal-Querol, M. Ethnic polarization, potential conflict, and civil wars. Am. Econ. Rev. 2005, 796–816.
22. Stein, W. et al. Sage Mathematics Software (Version 6.0). The Sage Development Team, 2013. Available online: http://www.sagemath.org (accessed on 27 March 2014).

c⃝ 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).


Article
Information Geometric Complexity of a Trivariate
Gaussian Statistical Model

Domenico Felice 1,2,*, Carlo Cafaro 3 and Stefano Mancini 1,2

1 School of Science and Technology, University of Camerino, I-62032 Camerino, Italy; E-Mail:
stefano.mancini@unicam.it
2 INFN-Sezione di Perugia, Via A. Pascoli, I-06123 Perugia, Italy
3 Department of Mathematics, Clarkson University, Potsdam, 13699 NY, USA; E-Mail: carlocafaro2000@yahoo.it
*
E-Mail: domenico.felice@unicam.it

Received: 1 April 2014; in revised form: 21 May 2014 / Accepted: 22 May 2014 /
Published: 26 May 2014

Abstract: We evaluate the information geometric complexity of entropic motion on low-dimensional
Gaussian statistical manifolds in order to quantify how difficult it is to make macroscopic predictions
about systems in the presence of limited information. Specifically, we observe that the complexity of
such entropic inferences not only depends on the amount of available pieces of information but also
on the manner in which such pieces are correlated. Finally, we uncover that, for certain correlational
structures, the impossibility of reaching the most favorable configuration from an entropic inference
viewpoint seems to lead to an information geometric analog of the well-known frustration effect that
occurs in statistical physics.

Keywords: probability theory; Riemannian geometry; complexity

1. Introduction

One of the main efforts in physics is modeling and predicting natural phenomena using relevant
information about the system under consideration. Theoretical physics has had a general measure of
the uncertainty associated with the behavior of a probabilistic process for more than 100 years: the
Shannon entropy [1]. The Shannon information theory was applied to dynamical systems and became
successful in describing their unpredictability [2].
Along a similar avenue we may place Entropic Dynamics (ED) [3], which makes use of inductive inference
(Maximum Entropy Methods [4]) and Information Geometry [5]. This is clearly remarkable given
that microscopic dynamics can be far removed from the phenomena of interest, such as in complex
biological or ecological systems. Extension of ED to temporally-complex dynamical systems on curved
statistical manifolds led to relevant measures of chaoticity [6]. In particular, an information geometric
approach to chaos (IGAC) has been pursued studying chaos in informational geodesic flows describing
physical, biological or chemical systems. It is the information geometric analogue of conventional
geometrodynamical approaches [7] where the classical configuration space is being replaced by a
statistical manifold with the additional possibility of considering chaotic dynamics arising from non
conformally flat metrics. Within this framework, it seems natural to consider as a complexity measure
the (time average) statistical volume explored by geodesic flows, namely an Information Geometry
Complexity (IGC).
This quantity might help uncover connections between microscopic dynamics and experimentally
observable macroscopic dynamics which is a fundamental issue in physics [8].
An interesting
manifestation of such a relationship appears in the study of the effects of microscopic external
noise (noise imposed on the microscopic variables of the system) on the observed collective motion
(macroscopic variables) of a globally coupled map [9]. These effects are quantified in terms of the
complexity of the collective motion. Furthermore, it turns out that noise at a microscopic level reduces

Entropy 2014, 16, 2944–2958; doi:10.3390/e16062944
www.mdpi.com/journal/entropy

the complexity of the macroscopic motion, which in turn is characterized by the number of effective
degrees of freedom of the system.
The investigation of the macroscopic behavior of complex systems in terms of the underlying
statistical structure of its microscopic degrees of freedom also reveals effects due to the presence of
microcorrelations [10]. In this article we first show which macro-states should be considered in a
Gaussian statistical model in order to have a reduction in time of the Information Geometry Complexity.
Then, dealing with correlated bivariate and trivariate Gaussian statistical models, the ratio between
the IGC in the presence and in the absence of microcorrelations is explicitly computed, finding an
intriguing, even though not yet deeply understood, connection with the phenomenon of geometric
frustration [11].
The layout of the article is as follows. In Section 2 we introduce a general statistical model
discussing its geometry and describing both its dynamics and information geometry complexity. In
Section 3, Gaussian statistical models (up to a trivariate model) are considered. There, we compute
the asymptotic temporal behaviors of their IGCs. Finally, in Section 4 we draw our conclusions by
outlining our findings and proposing possible further investigations.

2. Statistical Models and Information Geometry Complexity

Given n real-valued random variables X1, . . . , Xn defined on the sample space Ω with joint
probability density p : Rn → R satisfying the conditions

$$p(x) \ge 0 \ \ (\forall x \in \mathbb{R}^n) \qquad \text{and} \qquad \int_{\mathbb{R}^n} dx\, p(x) = 1, \tag{1}$$

let us consider a family P of such distributions and suppose that they can be parametrized using m
real-valued variables (θ1, . . . , θm) so that

P = {pθ = p(x|θ)|θ = (θ1, . . . , θm) ∈ Θ},
(2)

where Θ ⊆ Rm is the parameter space and the mapping θ → pθ is injective. In such a way, P is an
m-dimensional statistical model on Rn.
The mapping ϕ : P → Rm defined by ϕ(pθ) = θ allows us to consider ϕ = [θi] as a coordinate
system for P. Assuming parametrizations which are C∞, we can turn P into a C∞ differentiable
manifold (thus, P is called statistical manifold) [5].
The values x1, . . . , xn taken by the random variables define the micro-state of the system, while the
values θ1, . . . , θm taken by parameters define the macro-state of the system.
Let P = {pθ|θ ∈ Θ} be an m-dimensional statistical model. Given a point θ, the Fisher information
matrix of P in θ is the m × m matrix G(θ) = [gij], where the (i, j) entry is defined by

$$g_{ij}(\theta) := \int_{\mathbb{R}^n} dx\, p(x|\theta)\,\partial_i \log p(x|\theta)\,\partial_j \log p(x|\theta), \tag{3}$$

with $\partial_i$ standing for $\frac{\partial}{\partial\theta^i}$. The matrix $G(\theta)$ is symmetric, positive semidefinite and determines a
Riemannian metric on the parameter space Θ [5]. Hence, it is possible to define a Riemannian statistical
manifold $M := (\Theta, g)$, where $g = g_{ij}\,d\theta^i \otimes d\theta^j$ $(i, j = 1, \dots, m)$ is the metric whose components $g_{ij}$ are
given by Equation (3) (throughout the paper we use the Einstein sum convention).
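To make Equation (3) concrete, the sketch below approximates the Fisher information matrix of a univariate Gaussian with parameters $\theta = (\mu, \sigma)$ by numerical quadrature. The choice of model, the trapezoidal rule, the truncation width and the finite-difference score are all assumptions made for illustration; for this model the well-known result is $G = \operatorname{diag}(1/\sigma^2, 2/\sigma^2)$.

```python
import math

def log_p(x, mu, sigma):
    """Log-density of the univariate Gaussian N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def score(x, theta, h=1e-5):
    """Central-difference gradient of log p(x|theta) with respect to theta = (mu, sigma)."""
    g = []
    for i in range(2):
        up = list(theta)
        dn = list(theta)
        up[i] += h
        dn[i] -= h
        g.append((log_p(x, *up) - log_p(x, *dn)) / (2 * h))
    return g

def fisher_matrix(theta, n=4001, width=12.0):
    """Trapezoidal approximation of Equation (3) on [mu - width*sigma, mu + width*sigma]."""
    mu, sigma = theta
    lo, hi = mu - width * sigma, mu + width * sigma
    dx = (hi - lo) / (n - 1)
    G = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n):
        x = lo + k * dx
        w = dx * (0.5 if k in (0, n - 1) else 1.0)   # trapezoid end-point weights
        p = math.exp(log_p(x, mu, sigma))
        s = score(x, theta)
        for i in range(2):
            for j in range(2):
                G[i][j] += w * p * s[i] * s[j]
    return G

theta = (0.2, 1.5)          # arbitrary macro-state (mu, sigma), an assumption
G = fisher_matrix(theta)
```

The off-diagonal entries should vanish and the diagonal should be close to $(1/\sigma^2, 2/\sigma^2)$ at the chosen point.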
Given the Riemannian manifold $M = (\Theta, g)$, it is well known that there exists only one
linear connection $\nabla$ (the Levi–Civita connection) on $M$ that is compatible with the metric $g$ and
symmetric [12]. We remark that the manifold $M$ has one chart, Θ being an open set of $\mathbb{R}^m$, and the
Levi–Civita connection is uniquely defined by means of the Christoffel coefficients

$$\Gamma^k_{ij} = \frac{1}{2}\, g^{kl}\left(\frac{\partial g_{lj}}{\partial\theta^i} + \frac{\partial g_{il}}{\partial\theta^j} - \frac{\partial g_{ij}}{\partial\theta^l}\right), \qquad (i, j, k = 1, \dots, m) \tag{4}$$


where $g^{kl}$ is the $(k, l)$ entry of the inverse of the Fisher matrix $G(\theta)$.
The idea of curvature is the fundamental tool to understand the geometry of the manifold
$M = (\Theta, g)$. Actually, it is the basic geometric invariant and the intrinsic way to obtain it is by
means of geodesics. It is well known that, given any point $\theta \in M$ and any vector $v$ tangent to
$M$ at $\theta$, there is a unique geodesic starting at $\theta$ with initial tangent vector $v$. Indeed, within the
considered coordinate system, the geodesics are solutions of the following nonlinear second order
coupled ordinary differential equations [12]

$$\frac{d^2\theta^k}{d\tau^2} + \Gamma^k_{ij}\,\frac{d\theta^i}{d\tau}\,\frac{d\theta^j}{d\tau} = 0, \tag{5}$$

with τ denoting the time.
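The Christoffel coefficients of Equation (4), which enter the geodesic Equation (5), can be approximated for any metric given as a function of $\theta$. The sketch below does this with central differences for a two-dimensional example; the metric used, $\operatorname{diag}(1/\sigma^2, 2/\sigma^2)$ for the univariate Gaussian in coordinates $(\mu, \sigma)$, is an illustrative assumption rather than one of the models analyzed in this article, and the helper names are hypothetical.

```python
def metric(theta):
    """Fisher metric of the univariate Gaussian in coordinates (mu, sigma)."""
    _, sigma = theta
    return [[1.0 / sigma ** 2, 0.0], [0.0, 2.0 / sigma ** 2]]

def christoffel(theta, h=1e-5):
    """Gamma[k][i][j] from Equation (4), with metric derivatives by central differences."""
    m = 2
    g = metric(theta)
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    ginv = [[g[1][1] / det, -g[0][1] / det],
            [-g[1][0] / det, g[0][0] / det]]
    # dg[l][i][j] approximates d g_ij / d theta^l
    dg = []
    for l in range(m):
        up = list(theta)
        dn = list(theta)
        up[l] += h
        dn[l] -= h
        g_up, g_dn = metric(up), metric(dn)
        dg.append([[(g_up[i][j] - g_dn[i][j]) / (2 * h) for j in range(m)]
                   for i in range(m)])
    return [[[0.5 * sum(ginv[k][l] * (dg[i][l][j] + dg[j][i][l] - dg[l][i][j])
                        for l in range(m))
              for j in range(m)] for i in range(m)] for k in range(m)]

theta = [0.2, 1.5]                 # arbitrary point (mu, sigma), an assumption
Gamma = christoffel(theta)
```

For this metric the nonzero coefficients are $\Gamma^\mu_{\mu\sigma} = \Gamma^\mu_{\sigma\mu} = -1/\sigma$, $\Gamma^\sigma_{\mu\mu} = 1/(2\sigma)$ and $\Gamma^\sigma_{\sigma\sigma} = -1/\sigma$, which gives a simple reference against which to compare `Gamma`.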
The recipe to compute some curvatures at a point $\theta \in M$ is the following: first, select a
2-dimensional subspace Π of the tangent space to $M$ at $\theta$; second, follow the geodesics through
$\theta$ whose initial tangent vectors lie in Π and consider the 2-dimensional submanifold $S_\Pi$ swept out
by them, inheriting a Riemannian metric from $M$; finally, compute the Gaussian curvature of $S_\Pi$ at $\theta$,
which can be obtained from its Riemannian metric as stated in the Theorema Egregium [13]. The number
$K(\Pi)$ found in such manner is called the sectional curvature of $M$ at $\theta$ associated with the plane Π. In
terms of local coordinates, to compute the sectional curvature we need the curvature tensor,

$$R^h_{ijk} = \frac{\partial\Gamma^h_{jk}}{\partial\theta^i} - \frac{\partial\Gamma^h_{ik}}{\partial\theta^j} + \Gamma^l_{jk}\Gamma^h_{il} - \Gamma^l_{ik}\Gamma^h_{jl}. \tag{6}$$

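For any basis $(\xi, \eta)$ for a 2-plane $\Pi \subset T_\theta M$, the sectional curvature at $\theta \in M$ is given by [12]

$$K(\xi, \eta) = \frac{R(\xi, \eta, \eta, \xi)}{|\xi|^2|\eta|^2 - \langle\xi, \eta\rangle^2}, \tag{7}$$

where $R$ is the Riemann curvature tensor which is written in coordinates as $R = R_{ijkl}\,d\theta^i \otimes d\theta^j \otimes d\theta^k \otimes d\theta^l$
with $R_{ijkl} = g_{lh}R^h_{ijk}$ and $\langle\cdot, \cdot\rangle$ is the inner product defined by the metric $g$.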
The sectional curvature is directly related to the topology of the manifold; along this direction
the Cartan-Hadamard Theorem [13] is enlightening by stating that any complete, simply connected
n-dimensional manifold with non positive sectional curvature is diffeomorphic to Rn.
On the statistical manifold M = (Θ, g), we can regard the macro-variables θ as accessible
information and then derive the information dynamical Equation (5) from a standard principle of least
action of Jacobi type [3]. The geodesic Equations (5) describe a reversible dynamics whose solution is
the trajectory between an initial and a final macrostate θinitial and θfinal, respectively. The trajectory can
be equally traversed in both directions [10]. Actually, an equation relating instability with geometry
exists, and it raises the hope that some global information about the average degree of instability (chaos)
of the dynamics is encoded in global properties of the statistical manifolds [7]. That this can
happen is shown by the special case of constant-curvature manifolds, for which the Jacobi-Levi-Civita
equation simplifies to [7]

$$\frac{d^2 J^i}{d\tau^2} + K J^i = 0, \tag{8}$$

where K is the constant sectional curvature of the manifold (see Equation (7)) and J is the geodesic
deviation vector field. On a positively curved manifold, the norm of the separating vector J does not
grow, whereas on a negatively curved manifold, the norm of J grows exponentially in time; if the
manifold is compact, so that its geodesics are sooner or later obliged to fold, this provides an example of
chaotic geodesic motion [14].



Entropy 2014, 16, 2944–2958

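Taking these facts into consideration, we single out as a suitable indicator of dynamical (temporal)
complexity the information geometric complexity (IGC), defined as the average dynamical statistical
volume [15]

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right] := \frac{1}{\tau}\int_0^{\tau} d\tau'\, \mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right], \tag{9}$$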

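where

$$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right] := \int_{\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')} \sqrt{\det(G(\theta))}\; d\theta, \tag{10}$$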

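with G(θ) the information matrix whose components are given by Equation (3). The integration space
$\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')$ is defined as follows

$$\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau') := \left\{\theta = (\theta^1, \ldots, \theta^m) : \theta^k(0) \le \theta^k \le \theta^k(\tau')\right\}, \tag{11}$$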

where $\theta^k \equiv \theta^k(s)$ with 0 ≤ s ≤ τ′ such that $\theta^k(s)$ satisfies (5). The quantity
$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right]$ is the
volume of the effective parameter space explored by the system at time τ′. The temporal average
has been introduced in order to average out the possibly very complex fine details of the entropic
dynamical description of the system’s complexity dynamics.
Relevant properties, concerning complexity of geodesic paths on curved statistical manifolds, of
the quantity (10) compared to the Jacobi vector field are discussed in [16].

3. The Gaussian Statistical Model

In the following we devote our attention to a Gaussian statistical model P whose elements are
multivariate normal joint distributions for n real-valued variables X1, . . . , Xn given by

$$p(x|\theta) = \frac{1}{\sqrt{(2\pi)^n \det C}} \exp\left[-\frac{1}{2}(x-\mu)^t C^{-1}(x-\mu)\right], \tag{12}$$

where μ = (E(X1), . . . , E(Xn)) is the n-dimensional mean vector and C denotes the n × n covariance
matrix with entries $c_{ij} = E(X_iX_j) - E(X_i)E(X_j)$, i, j = 1, . . . , n. Since μ is an n-dimensional real vector
and C is an n × n symmetric matrix, the number of parameters involved in this model is $n + \frac{n(n+1)}{2}$.
Moreover C is a symmetric, positive definite matrix, hence we have the parameter space given by

$$\Theta := \{(\mu, C)\,|\,\mu \in \mathbb{R}^n,\; C \in \mathbb{R}^{n\times n},\; C > 0\}. \tag{13}$$

Hereafter we consider the statistical model given by Equation (12) when the covariance matrix C has
only the variances $\sigma_i^2 = E(X_i^2) - (E(X_i))^2$ as parameters. In fact we assume that the non-diagonal entry
(i, j) of the covariance matrix C equals $\rho\sigma_i\sigma_j$, with ρ ∈ R quantifying the degree of correlation.
We may further notice that the function $f_{ij}(x) := \partial_i \log p(x|\theta)\,\partial_j \log p(x|\theta)$, when p(x|θ) is given
by Equation (12), is a polynomial in the variables $x_i$ (i = 1, . . . , n) whose degree is not greater than four.
Indeed, we have that

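$$\partial_i \log p(x|\theta) = \frac{1}{p(x|\theta)}\,\partial_i p(x|\theta) = \partial_i \log\frac{1}{\sqrt{(2\pi)^n \det C}} + \partial_i\left[-\frac{1}{2}(x-\mu)^t C^{-1}(x-\mu)\right], \tag{14}$$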

and, therefore, the differentiation does not affect variables xi. With this in mind, in order to compute
the integral in (3), we can use the following formula [17]

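$$\frac{1}{\sqrt{(2\pi)^n \det C}} \int dx\, f_{ij}(x) \exp\left[-\frac{1}{2}(x-\mu)^t C^{-1}(x-\mu)\right] = \exp\left[\frac{1}{2}\sum_{h,k=1}^{n} c_{hk}\,\frac{\partial}{\partial x_h}\frac{\partial}{\partial x_k}\right] f_{ij}\Big|_{x=\mu}, \tag{15}$$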

where the exponential denotes the power series over its argument (the differential operator).
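To make formula (15) concrete, the following Python sketch applies it in the one-dimensional case, where the operator reduces to exp[(σ²/2) d²/dx²] and the power series terminates on polynomials. The representation of a polynomial as a list of coefficients and the function name `gaussian_expectation` are illustrative assumptions, not part of the paper.

```python
import math

def deriv(p):
    """Derivative of a polynomial given as [c0, c1, c2, ...]."""
    return [k * p[k] for k in range(1, len(p))]

def evaluate(p, x):
    return sum(c * x**k for k, c in enumerate(p))

def gaussian_expectation(p, mu, var):
    """E[f(X)] for X ~ N(mu, var) and polynomial f, computed as
    exp((var / 2) d^2/dx^2) f evaluated at x = mu (1-D case of Eq. (15))."""
    total, term, m = 0.0, list(p), 0
    while term:  # the series stops once the polynomial is differentiated away
        total += (var / 2.0) ** m / math.factorial(m) * evaluate(term, mu)
        term = deriv(deriv(term))
        m += 1
    return total

mu, var = 0.5, 2.0
fourth_moment = gaussian_expectation([0, 0, 0, 0, 1], mu, var)
# Known closed form of the fourth raw moment of a Gaussian, for comparison
print(fourth_moment, mu**4 + 6 * mu**2 * var + 3 * var**2)
```

Since the functions f_ij entering the Fisher matrix are polynomials of degree at most four, at most three terms of the series contribute in this setting.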

3.1. The monovariate Gaussian Statistical Model

We now start to apply the concepts of the previous section to a Gaussian statistical model of
Equation (12) for n = 1. In this case, the dimension of the statistical Riemannian manifold M = (Θ, g)
is at most two. Indeed, to describe elements of the statistical model P given by Equation (12), we
basically need the mean μ = E(X) and variance σ² = E[(X − μ)²]. We deal separately with the
cases when the monovariate model has only μ as macro-variable (Case 1), when σ is the unique
macro-variable (Case 2), and finally when both μ and σ are macro-variables (Case 3).

3.1.1. Case 1

Consider the monovariate model with only μ as macro-variable by setting σ = 1. In this case
the manifold M is trivially the real flat straight line, since μ ∈ (−∞, +∞). Indeed, the integral
in (3) is equal to 1 when the distribution p(x|θ) reads as
$p(x|\mu) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}(x-\mu)^2\right]$; so the metric
is g = dμ². Furthermore, from Equations (4) and (5) the information dynamics is described by
the geodesic μ(τ) = A1τ + A2, where A1, A2 ∈ R. Hence, the volume of Equation (10) is
$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int d\mu = A_1\tau + A_2$;
since this quantity must be positive we assume A1, A2 > 0.

Finally, the asymptotic behavior of the IGC (9) is

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{A_1}{2}\right)\tau. \tag{16}$$

This shows that the complexity increases linearly in time, meaning that acquiring information about μ
and updating it is not enough to increase our knowledge about the micro-state of the system.
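The step from the volume A1τ + A2 to the linear growth in Equation (16) is just the temporal average (9). A minimal Python sketch of this average is given below; the trapezoidal quadrature, the function name `time_average` and the numerical values of A1 and A2 are assumptions made for illustration only.

```python
def time_average(vol, tau, steps=10_000):
    """(1 / tau) * integral_0^tau vol(t) dt by the composite trapezoidal rule."""
    dt = tau / steps
    total = 0.5 * (vol(0.0) + vol(tau))
    total += sum(vol(n * dt) for n in range(1, steps))
    return total * dt / tau

A1, A2 = 0.7, 1.3  # hypothetical positive integration constants

def vol_case1(t):
    # Volume of Case 1 as a function of time, A1 * t + A2
    return A1 * t + A2

for tau in (10.0, 100.0, 1000.0):
    # The average equals A1 * tau / 2 + A2, so this ratio approaches A1 / 2
    print(tau, time_average(vol_case1, tau) / tau)
```

The constant A2 only contributes a term that becomes negligible relative to (A1/2)τ at large τ, which is why it does not appear in Equation (16).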

3.1.2. Case 2

Consider now the monovariate Gaussian statistical model of Equation (12) when μ = E(X) = 0
and the macro-variable is only σ. In this case the probability distribution function reads
$p(x|\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right)$,
while the Fisher–Rao metric becomes $g = \frac{2}{\sigma^2}\,d\sigma^2$. Noting that the
manifold is flat in this case as well, we derive the information dynamics by means of Equations (4) and (5) and we
obtain the geodesic σ(τ) = A1 exp(A2τ). The volume in Equation (10) then is

$$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \frac{\sqrt{2}}{\sigma}\, d\sigma = \sqrt{2}\,\log\left[A_1 \exp(A_2\tau)\right]. \tag{17}$$

Again, to have positive volume we have to assume A1, A2 > 0. Finally, the (asymptotic) IGC (9)
becomes

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{\sqrt{2}A_2}{2}\right)\tau. \tag{18}$$

This shows that also in this case the complexity increases linearly in time, meaning that acquiring
information about σ and updating it is not enough to increase our knowledge about the micro-state of
the system.
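As a hedged cross-check of the metric g = (2/σ²) dσ² used in Case 2, the Python sketch below evaluates the Fisher information of the zero-mean Gaussian with respect to σ by direct quadrature of the squared score. The integration window, the step count and the function name `fisher_sigma` are illustrative assumptions.

```python
import math

def fisher_sigma(sigma, half_width=12.0, steps=20_000):
    """Integral of (d log p / d sigma)^2 p(x|sigma) dx for p = N(0, sigma^2),
    using the trapezoidal rule on [-half_width * sigma, half_width * sigma]."""
    a = -half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    total = 0.0
    for n in range(steps + 1):
        x = a + n * dx
        p = math.exp(-x * x / (2.0 * sigma**2)) / (math.sqrt(2.0 * math.pi) * sigma)
        score = -1.0 / sigma + x * x / sigma**3  # derivative of log p w.r.t. sigma
        weight = 0.5 if n in (0, steps) else 1.0
        total += weight * score**2 * p * dx
    return total

sigma = 1.7
print(fisher_sigma(sigma), 2.0 / sigma**2)  # the two numbers should agree closely
```

With this metric, the substitution u = √2 log σ turns g into du², which is one way to see the flatness noted in the text.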

3.1.3. Case 3

The take-home message of the previous cases is that we have to account for both mean μ and
variance σ as macro-variables to look for possible non-increasing complexity. Hence, consider the
probability distribution function given by

$$p(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right]. \tag{19}$$

The dimension of the Riemannian manifold M = (Θ, g) is two, where the parameter space Θ is given
by Θ = {(μ, σ)|μ ∈ (−∞, +∞), σ > 0} and the Fisher–Rao metric reads as
$g = \frac{1}{\sigma^2}\,d\mu^2 + \frac{2}{\sigma^2}\,d\sigma^2$. Here,
the sectional curvature given by Equation (7) is a negative function and despite the fact that it is not
constant, we expect a decreasing behavior in time of the IGC. Thanks to Equation (4), we find that the
only nonvanishing Christoffel coefficients are $\Gamma^1_{12} = -\frac{1}{\sigma}$, $\Gamma^2_{11} = \frac{1}{2\sigma}$ and
$\Gamma^2_{22} = -\frac{1}{\sigma}$. Substituting them
into Equation (5) we derive the following geodesic equations

$$\begin{cases}
\dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma}\dfrac{d\sigma}{d\tau}\dfrac{d\mu}{d\tau} = 0, \\[2ex]
\dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma}\left(\dfrac{d\sigma}{d\tau}\right)^2 + \dfrac{1}{2\sigma}\left(\dfrac{d\mu}{d\tau}\right)^2 = 0.
\end{cases} \tag{20}$$

The integration of the above coupled differential equations is non-trivial. We follow the method
described in [10] and arrive at

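$$\sigma(\tau) = \frac{2\sigma_0 \exp\left(\frac{\sigma_0|A_1|}{\sqrt{2}}\tau\right)}{1 + \exp\left(\frac{2\sigma_0|A_1|}{\sqrt{2}}\tau\right)}, \qquad
\mu(\tau) = -\frac{2\sigma_0\sqrt{2}\,A_1}{|A_1|\left[1 + \exp\left(\frac{2\sigma_0|A_1|}{\sqrt{2}}\tau\right)\right]}, \tag{21}$$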

where σ0 and A1 are real constants. Then, using (21), the volume of Equation (10) results

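$$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \frac{\sqrt{2}}{\sigma^2}\, d\sigma\, d\mu = \frac{\sqrt{2}A_1}{|A_1|}\exp\left(-\frac{\sigma_0|A_1|}{\sqrt{2}}\tau\right). \tag{22}$$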

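Since the last quantity must be positive, we assume A1 > 0. Finally, substituting the above expression
into Equation (9) we arrive at

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{2}{\sigma_0 A_1}\right)\frac{1}{\tau}. \tag{23}$$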

We can now see a reduction in time of the complexity, meaning that acquiring information about both
μ and σ and updating them allows us to increase our knowledge about the micro-state of the system.
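The closed-form geodesic (21) can be checked against the geodesic Equations (20) numerically. The Python sketch below integrates (20) with a classical fourth-order Runge-Kutta scheme, starting from the values that (21) takes at τ = 0, and compares the result with (21) at a later time. The integrator, the step count and the parameter values are illustrative assumptions; the method used in [10] to obtain (21) is analytical and is not reproduced here.

```python
import math

def geodesic_rhs(state):
    """Equation (20) as a first-order system in (mu, sigma, mu', sigma')."""
    mu, sigma, dmu, dsigma = state
    return (dmu, dsigma,
            2.0 * dsigma * dmu / sigma,
            dsigma**2 / sigma - dmu**2 / (2.0 * sigma))

def rk4(state, tau_end, steps=2000):
    h = tau_end / steps
    for _ in range(steps):
        k1 = geodesic_rhs(state)
        k2 = geodesic_rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k1)))
        k3 = geodesic_rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k2)))
        k4 = geodesic_rhs(tuple(s + h * k for s, k in zip(state, k3)))
        state = tuple(s + h * (a + 2 * b + 2 * c + d) / 6.0
                      for s, a, b, c, d in zip(state, k1, k2, k3, k4))
    return state

def closed_form(tau, sigma0, A1):
    """(mu, sigma) from Equation (21)."""
    a = sigma0 * abs(A1) / math.sqrt(2.0)
    e = math.exp(2.0 * a * tau)
    sigma = 2.0 * sigma0 * math.exp(a * tau) / (1.0 + e)
    mu = -2.0 * sigma0 * math.sqrt(2.0) * A1 / (abs(A1) * (1.0 + e))
    return mu, sigma

def max_deviation(sigma0, A1, tau_end):
    mu0, s0 = closed_form(0.0, sigma0, A1)
    # At tau = 0, Equation (21) gives sigma' = 0 and mu' = A1 * sigma0**2
    state = rk4((mu0, s0, A1 * sigma0**2, 0.0), tau_end)
    mu_exact, sigma_exact = closed_form(tau_end, sigma0, A1)
    return max(abs(state[0] - mu_exact), abs(state[1] - sigma_exact))

print(max_deviation(sigma0=1.0, A1=1.0, tau_end=3.0))  # should be very small
```

The initial velocities follow from differentiating (21) at τ = 0, so the comparison tests the whole trajectory rather than a single point.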
Hence, comparing Equations (16), (18) and (23) we conclude that entropic inference on a
Gaussian distributed micro-variable is carried out in a more efficient manner when both its mean and
its variance are available in the form of information constraints. Macroscopic predictions when only
one of these pieces of information is available are more complex.
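The 1/τ decay in Equation (23) follows from averaging the exponentially decaying volume (22). The Python sketch below, with hypothetical function names and parameter values, compares the exact temporal average of (22) with the asymptotic expression (23) for A1 > 0.

```python
import math

def igc_case3(tau, sigma0, A1):
    """Exact temporal average (9) of the volume (22), assuming A1 > 0:
    (1/tau) * integral_0^tau sqrt(2) * exp(-a t) dt with a = sigma0 * A1 / sqrt(2)."""
    a = sigma0 * A1 / math.sqrt(2.0)
    return math.sqrt(2.0) * (1.0 - math.exp(-a * tau)) / (a * tau)

def igc_case3_asymptotic(tau, sigma0, A1):
    """Leading large-tau behavior, Equation (23)."""
    return 2.0 / (sigma0 * A1 * tau)

sigma0, A1 = 1.0, 1.0  # hypothetical values
for tau in (1.0, 10.0, 100.0):
    print(tau, igc_case3(tau, sigma0, A1), igc_case3_asymptotic(tau, sigma0, A1))
```

The two expressions differ by the factor 1 − exp(−σ0 A1 τ/√2), which tends to one exponentially fast; this is the sense of the symbol ≈ in Equation (23).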

3.2. Bivariate Gaussian Statistical Model

Consider now the Gaussian statistical model P of the Equation (12) when n = 2. In this case
the dimension of the Riemannian manifold M = (Θ, g) is at most four. From the analysis of the
monovariate Gaussian model in Section 3.1 we have understood that both mean and variance should
be considered. Hence the minimal assumption is to consider E(X1) = E(X2) = μ and E[(X1 − μ)²] =
E[(X2 − μ)²] = σ². Furthermore, in this case we have also to take into account the possible presence of
(micro) correlations, which appear at the level of macro-states as off-diagonal terms in the covariance
matrix. In short, this implies considering the following probability distribution function

$$p(x_1, x_2|\mu, \sigma) = \frac{\exp\left\{-\frac{1}{2\sigma^2(1-\rho^2)}\left[(x_1-\mu)^2 - 2\rho(x_1-\mu)(x_2-\mu) + (x_2-\mu)^2\right]\right\}}{2\pi\sigma^2\sqrt{1-\rho^2}}, \tag{24}$$

where ρ ∈ (−1, 1).
Thanks to Equation (15) we compute the Fisher information matrix G and find
$g = g_{11}\,d\mu^2 + g_{22}\,d\sigma^2$ with

$$g_{11} = \frac{2}{\sigma^2(\rho+1)}; \qquad g_{22} = \frac{4}{\sigma^2}. \tag{25}$$
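The entries in Equation (25) can be cross-checked with the well-known closed form of the Fisher matrix of a Gaussian family with mean m(θ) and covariance C(θ), namely G_ab = (∂_a m)ᵗ C⁻¹ (∂_b m) + ½ tr(C⁻¹ ∂_a C C⁻¹ ∂_b C). This route differs from the use of Equation (15) in the text; the Python sketch below, with illustrative helper names, is only meant as an independent check.

```python
def inverse_2x2(C):
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    return [[C[1][1] / det, -C[0][1] / det],
            [-C[1][0] / det, C[0][0] / det]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def fisher_metric_bivariate(sigma, rho):
    """Return (g11, g12, g22) in the coordinates (mu, sigma) for Equation (24)."""
    C = [[sigma**2, rho * sigma**2], [rho * sigma**2, sigma**2]]
    Cinv = inverse_2x2(C)
    # Derivatives of the mean vector (mu, mu) and of C with respect to mu and sigma
    dm = {"mu": [1.0, 1.0], "sigma": [0.0, 0.0]}
    dC = {"mu": [[0.0, 0.0], [0.0, 0.0]],
          "sigma": [[2.0 * sigma, 2.0 * rho * sigma],
                    [2.0 * rho * sigma, 2.0 * sigma]]}

    def entry(a, b):
        mean_part = sum(dm[a][i] * Cinv[i][j] * dm[b][j]
                        for i in range(2) for j in range(2))
        M = matmul(matmul(Cinv, dC[a]), matmul(Cinv, dC[b]))
        return mean_part + 0.5 * (M[0][0] + M[1][1])

    return entry("mu", "mu"), entry("mu", "sigma"), entry("sigma", "sigma")

sigma, rho = 1.3, 0.4
print(fisher_metric_bivariate(sigma, rho))
print(2.0 / (sigma**2 * (1.0 + rho)), 0.0, 4.0 / sigma**2)  # Equation (25)
```

The vanishing off-diagonal entry confirms that the metric is diagonal in the coordinates (μ, σ), as stated above Equation (25).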

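The only nontrivial Christoffel coefficients (4) are $\Gamma^1_{12} = -\frac{1}{\sigma}$,
$\Gamma^2_{11} = \frac{1}{2\sigma(\rho+1)}$ and $\Gamma^2_{22} = -\frac{1}{\sigma}$. In this case
as well, the sectional curvature (Equation (7)) of the manifold M is a negative function and so we may
expect a decreasing asymptotic behavior for the IGC. From Equation (5) it follows that the geodesic
equations are

$$\begin{cases}
\dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma}\dfrac{d\sigma}{d\tau}\dfrac{d\mu}{d\tau} = 0, \\[2ex]
\dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma}\left(\dfrac{d\sigma}{d\tau}\right)^2 + \dfrac{1}{2(1+\rho)\sigma}\left(\dfrac{d\mu}{d\tau}\right)^2 = 0,
\end{cases} \tag{26}$$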

whose solutions are,

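$$\sigma(\tau) = \frac{2\sigma_0 \exp\left(\frac{\sigma_0|A_1|}{\sqrt{2(1+\rho)}}\tau\right)}{1 + \exp\left(\frac{2\sigma_0|A_1|}{\sqrt{2(1+\rho)}}\tau\right)}, \qquad
\mu(\tau) = -\frac{2\sigma_0\sqrt{2(1+\rho)}\,A_1}{|A_1|\left[1 + \exp\left(\frac{2\sigma_0|A_1|}{\sqrt{2(1+\rho)}}\tau\right)\right]}. \tag{27}$$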

Using (27) in Equation (10) gives the volume,

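$$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \frac{2\sqrt{2}}{\sqrt{1+\rho}\;\sigma^2}\, d\sigma\, d\mu = \frac{4A_1}{|A_1|}\exp\left(-\frac{\sigma_0|A_1|}{\sqrt{2(1+\rho)}}\tau\right). \tag{28}$$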

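To have it positive we have to assume A1 > 0. Finally, employing (28) in (9) leads to the IGC,

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{4\sqrt{2}}{\sigma_0 A_1}\right)\frac{\sqrt{1+\rho}}{\tau}, \tag{29}$$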

with ρ ∈ (−1, 1). We may compare the asymptotic expression of the IGCs in the presence and in the
absence of correlations, obtaining

$$R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right]_{\rho=0}} = \sqrt{1+\rho}, \tag{30}$$

where “strong” stands for the fully connected lattice underlying the micro-variables. The ratio
$R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)$ is a monotonically increasing function of ρ.
While the temporal behavior of the IGC (29) is similar to the IGC in (23), here correlations play
a fundamental role. From Equation (30), we conclude that entropic inferences on two Gaussian
distributed micro-variables on a fully connected lattice are carried out in a more efficient manner when
the two micro-variables are negatively correlated. Instead, when such micro-variables are positively
correlated, macroscopic predictions become more complex than in the absence of correlations.
Intuitively, this is due to the fact that for anticorrelated variables, an increase in one variable
implies a decrease in the other one (different directional change): variables become more distant, thus
more distinguishable in the Fisher–Rao information metric sense. Similarly, for positively correlated
variables, an increase or decrease in one variable always predicts the same directional change for the
second variable: variables do not become more distant, and thus not more distinguishable, in the Fisher–Rao
information metric sense. This may lead us to guess that in the presence of anticorrelations, motion on
curved statistical manifolds via the Maximum Entropy updating methods becomes less complex.

3.3. Trivariate Gaussian Statistical Model

In this section we consider a Gaussian statistical model P of the Equation (12) when n = 3.
In this case as well, in order to understand the asymptotic behavior of the IGC in the presence of
correlations between the micro-states, we make the minimal assumption that, given the random vector
X = (X1, X2, X3) distributed according to a trivariate Gaussian, then E(X1) = E(X2) = E(X3) = μ
and E[(X1 − μ)²] = E[(X2 − μ)²] = E[(X3 − μ)²] = σ². Therefore, the space of the parameters of P is
given by Θ = {(μ, σ)|μ ∈ R, σ > 0}.
The manifold M = (Θ, g) changes its metric structure depending on the number of correlations
between micro-variables, namely, one, two, or three. The covariance matrices corresponding to these
cases read, modulo the congruence via a permutation matrix [17],

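$$C_1 = \sigma^2\begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
C_2 = \sigma^2\begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & 0 \\ \rho & 0 & 1 \end{pmatrix}, \qquad
C_3 = \sigma^2\begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}. \tag{31}$$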
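Each of the matrices in Equation (31) must be positive definite, which restricts the admissible values of ρ differently in the three cases. A minimal Python sketch of this check, based on Sylvester's criterion (all leading principal minors positive), is given below; the function names and the integer label `kind` for the three structures are assumptions made for illustration.

```python
def covariance(kind, rho, sigma=1.0):
    """C1, C2 or C3 of Equation (31); `kind` is the number of correlated pairs."""
    r12 = rho
    r13 = rho if kind >= 2 else 0.0
    r23 = rho if kind == 3 else 0.0
    pattern = [[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]]
    return [[sigma**2 * entry for entry in row] for row in pattern]

def leading_minors(C):
    m1 = C[0][0]
    m2 = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    m3 = (C[0][0] * (C[1][1] * C[2][2] - C[1][2] * C[2][1])
          - C[0][1] * (C[1][0] * C[2][2] - C[1][2] * C[2][0])
          + C[0][2] * (C[1][0] * C[2][1] - C[1][1] * C[2][0]))
    return m1, m2, m3

def is_positive_definite(C):
    """Sylvester's criterion for a symmetric 3 x 3 matrix."""
    return all(m > 0 for m in leading_minors(C))

for kind in (1, 2, 3):
    admissible = [r / 100.0 for r in range(-99, 100)
                  if is_positive_definite(covariance(kind, r / 100.0))]
    print(kind, min(admissible), max(admissible))
```

On this grid the admissible values run over (−1, 1) for C1, roughly (−0.71, 0.71) for C2 and (−0.5, 1) for C3, in line with the determinants σ⁶(1 − ρ²), σ⁶(1 − 2ρ²) and σ⁶(1 − ρ)²(1 + 2ρ).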

3.3.1. Case 1

First, we consider the trivariate Gaussian statistical model of Equation (12) when C ≡ C1. Then
proceeding like in Section 3.2 we have $g = g_{11}\,d\mu^2 + g_{22}\,d\sigma^2$, where
$g_{11} = \frac{3+\rho}{(1+\rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. Also in
this case we find that the sectional curvature of Equation (7) is a negative function. Hence, as we state
in Section 2, we may expect a decreasing (in time) behavior of the information geometric complexity.
Furthermore, we obtain the geodesics

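$$\sigma(\tau) = \frac{2\sigma_0 \exp\left(\sigma_0\sqrt{A(\rho)}\,\tau\right)}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \qquad
\mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\,\frac{1}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \tag{32}$$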

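where $A(\rho) = \frac{A_1^2(3+\rho)}{6(1+\rho)}$ and A1 ∈ R. We remark that A(ρ) > 0 for all ρ ∈ (−1, 1). Then, the volume (10)
becomes

$$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \sqrt{\frac{6(3+\rho)}{1+\rho}}\,\frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6A_1}{|A_1|}\exp\left(-\sigma_0\sqrt{A(\rho)}\,\tau\right), \tag{33}$$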

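requiring A1 > 0 for its positivity. Finally, using (33) in (9) we arrive at the asymptotic behavior of
the IGC

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{6\sqrt{6}}{\sigma_0 A_1}\right)\sqrt{\frac{1+\rho}{3+\rho}}\;\frac{1}{\tau}. \tag{34}$$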

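Comparing (34) in the presence and in the absence of correlations yields

$$R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right]_{\rho=0}} = \sqrt{3}\,\sqrt{\frac{1+\rho}{3+\rho}}, \tag{35}$$

where “weak” stands for low degree of connection in the lattice underlying the micro-variables.
Notice that $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$ is a monotonically increasing function of the argument ρ ∈ (−1, 1).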

3.3.2. Case 2

When the trivariate Gaussian statistical model of Equation (12) has C ≡ C2, the condition C > 0
constrains the correlation coefficient to be $\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$. Proceeding again like in Section 3.2 we
have $g = g_{11}\,d\mu^2 + g_{22}\,d\sigma^2$, where $g_{11} = \frac{3-4\rho}{(1-2\rho^2)\sigma^2}$ and
$g_{22} = \frac{6}{\sigma^2}$. The sectional curvature of Equation (7)
is a negative function as well and so we may apply the arguments of Section 2, expecting a decrease
in time of the complexity. Furthermore, we obtain the geodesics

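$$\sigma(\tau) = \frac{2\sigma_0 \exp\left(\sigma_0\sqrt{A(\rho)}\,\tau\right)}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \qquad
\mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\,\frac{1}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \tag{36}$$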

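where $A(\rho) = \frac{A_1^2(3-4\rho)}{6(1-2\rho^2)}$ and A1 ∈ R. We remark that A(ρ) > 0 for all
$\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$. Then, the
volume (10) becomes

$$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \sqrt{\frac{6(3-4\rho)}{1-2\rho^2}}\,\frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6A_1}{|A_1|}\exp\left(-\sigma_0\sqrt{A(\rho)}\,\tau\right). \tag{37}$$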

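We have to set A1 > 0 for the positivity of the volume (37), and using it in (9) we arrive at the
asymptotic behavior of the IGC

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{6\sqrt{6}}{\sigma_0 A_1}\right)\sqrt{\frac{1-2\rho^2}{3-4\rho}}\;\frac{1}{\tau}. \tag{38}$$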

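Then, comparing (38) in the presence and in the absence of correlations yields

$$R^{\text{mildly weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right]_{\rho=0}} = \sqrt{3}\,\sqrt{\frac{1-2\rho^2}{3-4\rho}}, \tag{39}$$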

where “mildly weak” stands for a lattice (underlying micro-variables) neither fully connected nor with
minimal connection.
This is a function of the argument $\rho \in \left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$ that attains the maximum
$\sqrt{\frac{3}{2}}$ at $\rho = \frac{1}{2}$, while
at the extrema of the interval $\left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right)$ it tends to zero.
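The location and height of the maximum of the ratio (39) can be confirmed with a simple grid search. The Python sketch below uses a hypothetical function name and a grid of step 0.001 inside the admissible interval; it is a numerical check, not a derivation.

```python
import math

def ratio_mildly_weak(rho):
    """Right-hand side of Equation (39)."""
    return math.sqrt(3.0 * (1.0 - 2.0 * rho**2) / (3.0 - 4.0 * rho))

# Grid strictly inside (-sqrt(2)/2, sqrt(2)/2), which is about (-0.7071, 0.7071)
grid = [n / 1000.0 for n in range(-707, 708)]
rho_peak = max(grid, key=ratio_mildly_weak)

print(rho_peak, ratio_mildly_weak(rho_peak), math.sqrt(1.5))
print(ratio_mildly_weak(grid[0]), ratio_mildly_weak(grid[-1]))  # small near the ends
```

Analytically, the derivative of (1 − 2ρ²)/(3 − 4ρ) is proportional to (2ρ − 1)(ρ − 1), whose only zero inside the admissible interval is ρ = 1/2.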

3.3.3. Case 3

Last, we consider the trivariate Gaussian statistical model of Equation (12) when C ≡ C3. In
this case, the condition C > 0 requires the correlation coefficient to be $\rho \in \left(-\frac{1}{2}, 1\right)$. Proceeding again
like in Section 3.2 we have $g = g_{11}\,d\mu^2 + g_{22}\,d\sigma^2$, where
$g_{11} = \frac{3}{(1+2\rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. We find that the
sectional curvature of Equation (7) is a negative function; hence, we may expect a decreasing (in time)
behavior of the complexity. The geodesics are

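$$\sigma(\tau) = \frac{2\sigma_0 \exp\left(\sigma_0\sqrt{A(\rho)}\,\tau\right)}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \qquad
\mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\,\frac{1}{1 + \exp\left(2\sigma_0\sqrt{A(\rho)}\,\tau\right)}, \tag{40}$$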

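where $A(\rho) = \frac{A_1^2}{2(1+2\rho)}$ and A1 ∈ R. We note that A(ρ) > 0 for all $\rho \in \left(-\frac{1}{2}, 1\right)$. Using (40), we compute

$$\mathrm{vol}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau')\right] = \int \frac{3\sqrt{2}}{\sqrt{1+2\rho}}\,\frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6\sqrt{2}A_1}{|A_1|}\exp\left(-\sigma_0\sqrt{A(\rho)}\,\tau\right). \tag{41}$$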

Also in this case we need to assume A1 > 0 to have positive volume. Finally, substituting Equation (41)
into Equation (9), the asymptotic behavior of the IGC is

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right] \approx \left(\frac{12}{\sigma_0 A_1}\right)\sqrt{1+2\rho}\;\frac{1}{\tau}. \tag{42}$$

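The comparison of (42) in the presence and in the absence of correlations yields

$$R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}_\Theta^{(\mathrm{geodesic})}(\tau)\right]_{\rho=0}} = \sqrt{1+2\rho}, \tag{43}$$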

where “strong” stands for a fully connected lattice underlying the (three) micro-variables. We remark that
the latter ratio is a monotonically increasing function of the argument $\rho \in \left(-\frac{1}{2}, 1\right)$.
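The three values of g11 quoted in Cases 1 to 3 can be cross-checked by noting that, for a Gaussian whose mean vector is μ(1, 1, 1) and whose covariance does not depend on μ, the μμ entry of the Fisher matrix is 1ᵗC⁻¹1 (a standard result, used here in place of Equation (15)). The Python sketch below solves C x = 1 by Gaussian elimination; the helper names and the label `kind` are illustrative assumptions.

```python
def covariance(kind, rho, sigma):
    """C1, C2 or C3 of Equation (31); `kind` is the number of correlated pairs."""
    r12 = rho
    r13 = rho if kind >= 2 else 0.0
    r23 = rho if kind == 3 else 0.0
    pattern = [[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]]
    return [[sigma**2 * entry for entry in row] for row in pattern]

def solve(C, b):
    """Solve C x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(C)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def g11(kind, rho, sigma):
    """mu-mu entry of the Fisher metric, computed as 1^t C^{-1} 1."""
    return sum(solve(covariance(kind, rho, sigma), [1.0, 1.0, 1.0]))

rho, sigma = 0.3, 1.2
print(g11(1, rho, sigma), (3 + rho) / ((1 + rho) * sigma**2))
print(g11(2, rho, sigma), (3 - 4 * rho) / ((1 - 2 * rho**2) * sigma**2))
print(g11(3, rho, sigma), 3 / ((1 + 2 * rho) * sigma**2))
```

Each printed pair should agree to rounding error, which supports the expressions for g11 used in the three trivariate cases.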

The behaviors of R(ρ) of Equations (30), (35), (39) and (43) are reported in Figure 1.

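[Figure 1: plot of R(ρ) (vertical axis) against ρ ∈ (−1, 1) (horizontal axis); the value ρpeak is marked on the horizontal axis.]

Figure 1. Ratio R(ρ) of volumes vs. degree of correlations ρ. Solid line refers to $R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)$; dotted line
refers to $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$; dashed line refers to $R^{\text{mildly weak}}_{\mathrm{trivariate}}(\rho)$; dash-dotted line refers to
$R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)$.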

The non-monotonic behavior of the ratio $R^{\text{mildly weak}}_{\mathrm{trivariate}}(\rho)$ in Equation (39) corresponds to the
information geometric complexities for the mildly weak connected three-dimensional lattice.
Interestingly, the growth stops at a critical value $\rho_{\mathrm{peak}} = \frac{1}{2}$ at which
$R^{\text{mildly weak}}_{\mathrm{trivariate}}(\rho_{\mathrm{peak}}) = R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho_{\mathrm{peak}})$. From
Equation (43), we conclude that entropic inferences on three Gaussian distributed micro-variables on
a fully connected lattice are carried out in a more efficient manner when the micro-variables are
negatively correlated. Instead, when such micro-variables are positively correlated, macroscopic
predictions become more complex than in the absence of correlations.
Furthermore, the ratio
$R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)$ of the information geometric complexities for this fully connected three-dimensional
lattice increases in a monotonic fashion. These conclusions are similar to those presented for the
bivariate case. However, there is a key feature of the IGC to emphasize when passing from the
two-dimensional to the three-dimensional manifolds associated with fully connected lattices: the
effects of negative correlations and positive correlations are amplified with respect to the respective
absence-of-correlations scenarios,

$$\frac{R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)}{R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)} = \sqrt{\frac{1+2\rho}{1+\rho}}, \tag{44}$$

where $\rho \in \left(-\frac{1}{2}, 1\right)$.
Specifically, carrying out entropic inferences on the higher-dimensional manifold in the presence
of anti-correlations, that is for $\rho \in \left(-\frac{1}{2}, 0\right)$, is less complex than on the lower-dimensional manifold, as is
evident from Equation (44). The converse is true in the presence of positive correlations, that is for
ρ ∈ (0, 1).
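The amplification expressed by Equation (44) is easy to tabulate. The short Python sketch below, with hypothetical function names, evaluates the two fully connected ratios (30) and (43) and their quotient for a few values of ρ in (−1/2, 1).

```python
import math

def r_strong_bivariate(rho):
    return math.sqrt(1.0 + rho)        # Equation (30)

def r_strong_trivariate(rho):
    return math.sqrt(1.0 + 2.0 * rho)  # Equation (43)

def amplification(rho):
    """Left-hand side of Equation (44)."""
    return r_strong_trivariate(rho) / r_strong_bivariate(rho)

for rho in (-0.4, -0.2, 0.0, 0.3, 0.8):
    print(rho, r_strong_bivariate(rho), r_strong_trivariate(rho), amplification(rho))
```

The quotient is below one for negative ρ and above one for positive ρ, which is the numerical counterpart of the statement that both effects are amplified in the trivariate case.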

4. Concluding Remarks

In summary, we considered low dimensional Gaussian statistical models (up to a trivariate model)
and have investigated their dynamical (temporal) complexity. This has been quantified by the volume
of geodesics for parameters characterizing the probability distribution functions. To the best of our
knowledge, there is no dynamic measure of complexity of geodesic paths on curved statistical manifolds
that could be compared to our IGC. However, it could be worthwhile to understand the connection, if
any, between our IGC and the complexity of paths of dynamic systems introduced in [20]. Specifically,
according to the Alekseev-Brudno theorem in the algorithmic theory of dynamical systems [21], a way
to predict each new segment of chaotic trajectory is obtained by adding information proportional to the
length of this segment and independent of the full previous length of trajectory. This means that this
information cannot be extracted from observation of the previous motion, even an infinitely long one!
If the instability is a power law, then the required information per unit time is inversely proportional
to the full previous length of the trajectory and, asymptotically, the prediction becomes possible.
For the sake of completeness, we also point out that the relevance of volumes in quantifying the
static model complexity of statistical models was already pointed out in [22] and [23]: complexity is
related to the volume of a model in the space of distributions regarded as a Riemannian manifold
of distributions with a natural metric defined by the Fisher–Rao metric tensor. Finally, we would
like to point out that two of the Authors have recently associated Gaussian statistical models to
networks [17]. Specifically, it is assumed that random variables are located on the vertices of the
network while correlations between random variables are regarded as weighted edges of the network.
Within this framework, a static network complexity measure has been proposed as the volume of the
corresponding statistical manifold. We emphasize that such a static measure could be, in principle,
applied to time-dependent networks by accommodating time-varying weights on the edges [24]. This
requires the consideration of a time-sequence of different statistical manifolds. Thus, we could follow
the time-evolution of a network complexity through the time evolution of the volumes of the associated
manifolds.
In this work we uncover that in order to have a reduction in time of the complexity one has to
consider both mean and variance as macro-variables. This leads to different topological structures of
the parameter space in (13); in particular, we have to consider at least a 2-dimensional manifold in
order to have effects such as a power law decay of the complexity. Hence, the minimal hypothesis in a
multivariate Gaussian model consists in considering all mean values equal and all covariances equal.
In such a case, however, the complexity shows interesting features depending on the correlation among
micro-variables (as summarized in Figure 1). For a trivariate model with only two correlations the
information geometric complexity ratio exhibits a non monotonic behavior in ρ (correlation parameter)
taking zero value at the extrema of the range of ρ. In contrast, for closed configurations (bivariate
and trivariate models with all micro-variables correlated with each other) the complexity ratio exhibits a
monotonic behavior in terms of the correlation parameter. The fact that in such a case this ratio cannot
be zero at the extrema of the range of ρ is reminiscent of the geometric frustration phenomenon that
occurs in the presence of loops [11].
Specifically, recall that a geometrically frustrated system cannot simultaneously minimize all
interactions because of geometric constraints [11,18]. For example, geometric frustration can occur
in an Ising model which is an array of spins (for instance, atoms that can take states ±1) that are
magnetically coupled to each other. If one spin is, say, in the +1 state then it is energetically favorable
for its immediate neighbors to be in the same state in the case of a ferromagnetic model. On the
contrary, in antiferromagnetic systems, nearest neighbor spins want to align in opposite directions.
This rule can be easily satisfied on a square. However, due to geometrical frustration, it is not possible
to satisfy it on a triangle: for an antiferromagnetic triangular Ising model, any three neighboring spins
are frustrated. Geometric frustration in triangular Ising models can be observed by considering spin
configurations with total spin J = ±1 and analyzing the fluctuations in energy of the spin system as
a function of temperature. There is no peak at all in the standard deviation of the energy in the case
J = −1, and a monotonic behavior is recorded. This indicates that the antiferromagnetic system does
not have a phase transition to a state with long-range order. Instead, in the case J = +1, a peak in
the energy fluctuations emerges. This significant change in the behavior of energy fluctuations as a
function of temperature in triangular configurations of spin systems is a signature of the presence of
frustrated interactions in the system [19].
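The statement that the antiferromagnetic rule can be satisfied on a square but not on a triangle can be verified by brute-force enumeration. The Python sketch below counts, for each spin configuration, the bonds whose two spins are equal (and therefore violate the antiferromagnetic rule); the function name and the encoding of the lattices as lists of bonds are illustrative assumptions.

```python
from itertools import product

def min_unsatisfied_bonds(n_sites, bonds):
    """Smallest number of antiferromagnetic bonds left unsatisfied
    (that is, joining two equal spins) over all 2**n_sites configurations."""
    best = len(bonds)
    for spins in product((-1, +1), repeat=n_sites):
        unsatisfied = sum(1 for i, j in bonds if spins[i] == spins[j])
        best = min(best, unsatisfied)
    return best

triangle = [(0, 1), (1, 2), (2, 0)]
square = [(0, 1), (1, 2), (2, 3), (3, 0)]

print(min_unsatisfied_bonds(3, triangle))  # at least one bond is always frustrated
print(min_unsatisfied_bonds(4, square))    # all bonds can be satisfied
```

On the square, alternating spins satisfy every bond, whereas on the triangle any assignment of two spin values to three sites forces at least two neighbors to be equal.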

In this article, we observe a significant change in the behavior of the information geometric
complexity ratios as a function of the correlation coefficient in the trivariate Gaussian statistical models.
Specifically, in the fully connected trivariate case, no peak arises and a monotonic behavior in ρ of
the information geometric complexity ratio is observed. In the mildly weak connected trivariate
case, instead, a peak in the information geometric complexity ratio is recorded at ρpeak ≥ 0. This
dramatic disparity of behavior can be ascribed to the fact that when carrying out statistical inferences
with positively correlated Gaussian random variables, the maximum entropy favorable scenario is
incompatible with these working hypotheses. Thus, the system appears frustrated.
These considerations lead us to conclude that we have uncovered a very interesting information
geometric resemblance of the more standard geometric frustration effect in Ising spin models. However,
for a conclusive claim of the existence of an information geometric analog of the frustration effect, we
feel we have to further deepen our understanding. A forthcoming research project along these lines
will be a detailed investigation of both arbitrary triangular and square configurations of correlated
Gaussian random variables where we take into consideration both the presence of different intensities
and signs of pairwise interactions (ρij ≠ ρik if j ≠ k, ∀i).

Acknowledgments: Domenico Felice and Stefano Mancini acknowledge the financial support of the Future
and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the
European Commission, under the FET-Open grant agreement TOPDRIM, number FP7-ICT-318121.

Author Contributions: The authors have equally contributed to the paper. All authors read and approved the
final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Feldman, D.F.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
2. Kolmogorov, A.N. A new metric invariant of transitive dynamical systems and of automorphism of Lebesgue
spaces. Doklady Akademii Nauk SSSR 1958, 119, 861–864.
3. Caticha, A. Entropic Dynamics. Bayesian Inference and Maximum Entropy Methods in Science and Engineering,
the 22nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and
Engineering, Moscow, Idaho, 3-7 August 2002; Fry, R.L., Ed.; American Institute of Physics: College Park,
MD, USA, 2002; Volume 617, p. 302.
4. Caticha, A.; Preuss, R. Maximum entropy and Bayesian data analysis: Entropic prior distributions. Phys. Rev.
E 2004, 70, 046127.
5. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000.
6. Cafaro, C. Works on an information geometrodynamical approach to chaos. Chaos Solitons Fractals 2009, 41,
886–891.
7. Pettini, M. Geometry and Topology in Hamiltonian Dynamics and Statistical Mechanics; Springer-Verlag:
Berlin/Heidelberg, Germany, 2007.
8. Lebowitz, J.L. Microscopic Dynamics and Macroscopic Laws. Ann. N. Y. Acad. Sci. 1981, 373, 220–233.
9. Shibata, T.; Chawanya, T.; Chawanya, K. Noiseless Collective Motion out of Noisy Chaos. Phys. Rev. Lett.
1999, 82, doi: http://dx.doi.org/10.1103/PhysRevLett.82.4424.
10. Ali, S.A.; Cafaro, C.; Kim, D.-H.; Mancini, S. The effect of the microscopic correlations on the information
geometric complexity of Gaussian statistical models. Physica A 2010, 389, 3117–3127.
11. Sadoc, J.F.; Mosseri, R. Geometrical Frustration; Cambridge University Press: Cambridge, UK, 2006.
12. Lee, J.M. Riemannian Manifolds: An Introduction to Curvature; Springer: New York, NY, USA, 1997.
13. Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
14. Cafaro, C.; Ali, S.A. Jacobi fields on statistical manifolds of negative curvature. Physica D 2007, 234, 70–80.
15. Cafaro, C.; Giffin, A.; Ali, S.A.; Kim, D.-H. Reexamination of an information geometric construction of
entropic indicators of complexity. Appl. Math. Comput. 2010, 217, 2944–2951.
16. Cafaro, C.; Mancini, S. Quantifying the complexity of geodesic paths on curved statistical manifolds through
information geometric entropies and Jacobi fields. Physica D 2011, 240, 607–618.

17. Felice, D.; Mancini, S.; Pettini, M. Quantifying Networks Complexity from Information Geometry Viewpoint.
J. Math. Phys. 2014, 55, 043505.
18. Moessner, R.; Ramirez, A.P. Geometrical Frustration. Phys. Today 2006, 59, 24–29.
19. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge,
UK, 2003.
20. Brudno, A.A. The complexity of the trajectories of a dynamical system. Uspekhi Mat. Nauk 1978, 33, 207–208.
21. Alekseev, V.M.; Yacobson, M.V. Symbolic dynamics and hyperbolic dynamic systems. Phys. Rep. 1981, 75,
290–325.
22. Myung, J.; Balasubramanian, V.; Pitt, M.A. Counting probability distributions: differential geometry and
model selection. Proc. Natl. Acad. Sci. USA 2000, 97, 11170.
23. Rodriguez, C.C. The volume of bitnets. AIP Conf. Proc. 2004, 735, 555–564.
24. Motter, A.E.; Albert, R. Networks in motion. Phys. Today 2012, 65, 43–48.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
Learning from Complex Systems: On the Roles of
Entropy and Fisher Information in Pairwise Isotropic
Gaussian Markov Random Fields

Alexandre Levada

Computing Department, Federal University of São Carlos, Rod. Washington Luiz, km 235, São Carlos, SP, Brazil;
E-Mail: alexandre@dc.ufscar.br

Received: 4 December 2013 / Accepted: 30 January 2014 / Published: 18 February 2014

Abstract: Markov random field models are powerful tools for the study of complex systems.
However, little is known about how the interactions between the elements of such systems are
encoded, especially from an information-theoretic perspective.
In this paper, our goal is to
shed light on the connection between Fisher information, Shannon entropy, information geometry and
the behavior of complex systems modeled by isotropic pairwise Gaussian Markov random fields.
We propose analytical expressions to compute local and global versions of these measures using
Besag’s pseudo-likelihood function, characterizing the system’s behavior through its Fisher curve, a
parametric trajectory across the information space that provides a geometric representation for the
study of complex systems in which temperature deviates from infinity. Computational experiments
show how the proposed tools can be useful in extracting relevant information from complex patterns.
The obtained results quantify and support our main conclusion, which is: in terms of information,
moving towards higher entropy states (A –> B) is different from moving towards lower entropy states
(B –> A), since the Fisher curves are not the same, given a natural orientation (the direction of time).

Keywords: Markov random fields; information theory; Fisher information; entropy; maximum
pseudo-likelihood estimation

1. Introduction

With the increasing value of information in modern society and the massive volume of digital
data that is available, there is an urgent need for developing novel methodologies for data filtering and
analysis in complex systems. In this scenario, the notion of what is informative or not is a top priority.
Sometimes, patterns that at first may appear to be locally irrelevant may turn out to be extremely
informative in a more global perspective. In complex systems, this is a direct consequence of the
intricate non-linear relationship between the pieces of data along different locations and scales.
Within this context, information theoretic measures play a fundamental role in a huge variety of
applications, since they represent statistical knowledge in a systematic, elegant and formal framework.
Since the first works of Shannon [1], and later with many other generalizations [2–4], the concept of
entropy has been adapted and successfully applied to almost every field of science, among which we
can cite physics [5], mathematics [6–8], economics [9] and, fundamentally, information theory [10–12].
Similarly, the concept of Fisher information [13,14] has been shown to reveal important properties of
statistical procedures, from lower bounds on estimation methods [15–17] to information geometry [18,19].
Roughly speaking, Fisher information can be thought of as the likelihood analog of entropy, which is a
probability-based measure of uncertainty.
In general, classical statistical inference is focused on capturing information about location and
dispersion of unknown parameters of a given family of distributions and studying how this information
is related to uncertainty in estimation procedures. In typical situations, an exponential family of
distributions and an independence hypothesis (independent random variables) are often assumed, giving
the likelihood function a series of desirable mathematical properties [15–17].

Entropy 2014, 16, 1002–1036; doi:10.3390/e16021002
www.mdpi.com/journal/entropy

Although mathematically convenient for many problems, in complex systems modeling, the
independence assumption is not reasonable, because much of the information is somehow encoded
in the relations between the random variables [20,21]. In order to overcome this limitation, Markov
random field (MRF) models appear to be a natural generalization of the classical approach by the
replacement of the independence assumption by a more realistic conditional independence assumption.
Basically, in every MRF, knowledge of a finite-support neighborhood around a given variable isolates it
from all the remaining variables. A further simplification consists in considering a pairwise interaction
model, constraining the size of the maximum clique to be two (in other words, the model captures
only binary relationships). Moreover, if the MRF model is isotropic, which means that the parameter
controlling the interactions between neighboring variables is invariant to change in the directions,
all the information regarding the spatial dependence structure of the system is conveyed by a single
parameter, from now on denoted by β (or simply, the inverse temperature).
In this paper, we assume an isotropic pairwise Gaussian Markov random field (GMRF) model [22,23],
also known as an auto-normal model or a conditional auto-regressive model [24,25]. Basically, the question
that motivated this work and that we are trying to elucidate here is: What kind of information is encoded
by the β parameter in such a model? We want to know how this parameter, and as a consequence, the
whole spatial dependence structure of a complex system modeled by a Gaussian Markov random field, is
related to both local and global information theoretic measures, more precisely the observed and expected
Fisher information, as well as self-information and Shannon entropy.
In searching for answers to our fundamental question, investigations led us to an exact expression
for the asymptotic variance of the maximum pseudo-likelihood (MPL) estimator of β in an isotropic
pairwise GMRF model, suggesting that asymptotic efficiency is not guaranteed. In the context of statistical
data analysis, Fisher information plays a central role in providing tools and insights for modeling
the interactions between complex systems and their components. The advantage of MRF models
over the traditional statistical ones is that MRFs take into account the dependence between pieces of
information as a function of the system’s temperature, which may even vary over time. Briefly,
this investigation aims to explore ways to measure and quantify distances between complex
systems operating in different thermodynamical conditions. By analyzing and comparing the behavior
of local patterns observed throughout the system (defined over a regular 2D lattice), it is possible
to measure how informative those patterns are for a given inverse temperature, β (which
encodes the expected global behavior).
In summary, our idea is to describe the behavior of a complex system in terms of information
as its temperature deviates from infinity (when the particles are statistically independent) to a lower
bound. The obtained results suggest that, in the beginning, when the temperature is infinite and the
information equilibrium prevails, the information is somehow spread along the system. However,
when the temperature is low and this equilibrium condition does not hold anymore, we have a
sparser representation in terms of information, since this information is concentrated in the boundaries
of the regions that define a smooth global configuration. In the vast remainder of this “universe”, due
to this smoothness constraint, the strong alignment between the particles prevails, which is exactly the
expected global behavior for temperatures below a critical value (making the majority of the interaction
patterns along the system uninformative).
The remainder of the paper is organized as follows: Section 2 discusses a technique for the
estimation of the inverse temperature parameter, called the maximum pseudo-likelihood (MPL)
approach, and provides derivations for the observed Fisher information in an isotropic pairwise GMRF
model. Intuitive interpretations for the two versions of this local measure are discussed. In Section 3,
we derive analytical expressions for the computation of the expected Fisher information, which allows
us to assign a global information measure for a given system configuration. Similarly, in Section 4, an
expression for the global entropy of a system modeled by a GMRF is shown. The results suggest a
connection between maximum pseudo-likelihood and minimum entropy criteria in the estimation of
the inverse temperature parameter on GMRFs. Section 5 discusses the uncertainty in the estimation
of this important parameter by defining an expression for the asymptotic variance of its maximum
pseudo-likelihood estimator in terms of both forms of Fisher information. In Section 6, the definition
of the Fisher curve of a system as a parametric trajectory in the information space is proposed. Section
7 shows the experimental setup. Computational simulations with both Markov chain Monte Carlo
algorithms and some real data were conducted, showing how the proposed tools can be used to extract
relevant information from complex systems. Finally, Section 8 presents our conclusions, final remarks
and possibilities for future works.

2. Fisher Information in Isotropic Pairwise GMRFs

The remarkable Hammersley–Clifford theorem [26] states the equivalence between Gibbs random
fields (GRF) and Markov random fields (MRF), which implies that any MRF can be defined either in
terms of a global (joint Gibbs distribution) or a local (set of local conditional density functions) model.
For our purposes, we will choose the latter representation.

Definition 1. An isotropic pairwise Gaussian Markov random field regarding a local neighborhood system,
ηi, defined on a lattice S = {s1, s2, . . . , sn} is completely characterized by a set of n local conditional density
functions p(xi|ηi,⃗θ), given by:

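$$p\left(x_i \mid \eta_i, \vec{\theta}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \right\} \tag{1}$$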

with ⃗θ = (μ, σ2, β), where μ and σ2 are the expected value and the variance of the random variables,
and β = 1/T is the parameter that controls the interaction between the variables (inverse temperature).
Note that, for β = 0, the model degenerates to the usual Gaussian distribution. From an information
geometry perspective [18,19], this means that we are constrained to a sub-manifold within the
Riemannian manifold of probability distributions, where the natural Riemannian metric (tensor)
is given by the Fisher information. It has been shown that the geometric structure of exponential
family distributions exhibits constant curvature. However, little is known about information geometry
on more general statistical models, such as GMRFs. For β > 0, some degree of correlation between
the observations is expected, making the interactions grow stronger. Typical choices for ηi are the
first and second order non-causal neighborhood systems, defined by the sets of four and eight nearest
neighbors, respectively.
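
As an illustration, Equation (1) can be evaluated for a single site as in the minimal sketch below. The function name `local_conditional_density` and its argument layout are hypothetical choices made for this example, not part of the original formulation; the neighbor values are simply passed as a flat sequence.

```python
import numpy as np

def local_conditional_density(x_i, neighbors, mu, sigma2, beta):
    """Evaluate Equation (1) for one site, given the values of its neighbors.

    Hypothetical helper: `neighbors` holds the x_j for j in eta_i.
    """
    neighbors = np.asarray(neighbors, dtype=float)
    # Conditional mean: mu plus beta times the summed neighbor deviations.
    cond_mean = mu + beta * np.sum(neighbors - mu)
    norm = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    return norm * np.exp(-((x_i - cond_mean) ** 2) / (2.0 * sigma2))
```

With `beta = 0` the neighbor term vanishes and the function reduces to the ordinary Gaussian density with mean μ and variance σ2, in agreement with the remark above.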

2.1. Maximum Pseudo-Likelihood Estimation

Maximum likelihood estimation is intractable in MRF parameter estimation, due to the existence
of the partition function in the joint Gibbs distribution. An alternative, proposed by Besag [24], is
maximum pseudo-likelihood estimation, which is based on the conditional independence principle.
The pseudo-likelihood function is defined as the product of the local conditional density functions
(LCDFs) for all the n variables of the system, modeled as a random field.

Definition 2. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the set corresponding to the observations at time
t, the pseudo-likelihood function of the model is defined by:

$$L\left(\vec{\theta}; X^{(t)}\right) = \prod_{i=1}^{n} p\left(x_i \mid \eta_i, \vec{\theta}\right) \tag{2}$$

Note that the pseudo-likelihood function is a function of the parameters. For better mathematical
tractability, it is usual to take the logarithm of L(⃗θ; X(t)). Plugging Equation (1) into Equation (2) and
taking the logarithm leads to:

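$$\log L\left(\vec{\theta}; X^{(t)}\right) = -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \tag{3}$$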
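
A minimal sketch of how Equation (3) might be evaluated on a 2D lattice is given below. The helper names are hypothetical, and the periodic (toroidal) boundary handling through `np.roll` is an assumption of this example, since the text does not prescribe how sites at the border of the lattice are treated.

```python
import numpy as np

def neighbor_deviation_sum(field, mu, order=1):
    """Sum of (x_j - mu) over the 4 (order=1) or 8 (order=2) nearest neighbors.

    Assumption: periodic boundaries, so every site has the same number of neighbors.
    """
    d = np.asarray(field, dtype=float) - mu
    shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if order == 2:
        shifts += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    return sum(np.roll(d, s, axis=(0, 1)) for s in shifts)

def log_pseudo_likelihood(field, mu, sigma2, beta, order=1):
    """Log pseudo-likelihood of Equation (3) for one global configuration."""
    field = np.asarray(field, dtype=float)
    # Residual of each site with respect to its conditional mean.
    resid = field - mu - beta * neighbor_deviation_sum(field, mu, order)
    n = field.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - np.sum(resid ** 2) / (2.0 * sigma2)
```

For `beta = 0` the result coincides with the log-likelihood of n independent Gaussian observations.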

By differentiating Equation (3) with respect to each parameter and properly solving the
pseudo-likelihood equations, we obtain the following maximum pseudo-likelihood estimators for the
parameters, μ, σ2 and β:

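$$\hat{\beta}_{MPL} = \frac{\displaystyle\sum_{i=1}^{n} \left[ \left( x_i - \mu \right) \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]}{\displaystyle\sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2} \tag{4}$$

$$\hat{\mu}_{MPL} = \frac{1}{n \left( 1 - k\beta \right)} \sum_{i=1}^{n} \left( x_i - \beta \sum_{j \in \eta_i} x_j \right) \tag{5}$$

$$\hat{\sigma}^2_{MPL} = \frac{1}{n} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \tag{6}$$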

where k denotes the cardinality of the non-causal neighborhood set ηi. Note that if β = 0, the MPL
estimators of both μ and σ2 become the widely known sample mean and sample variance.
Since the cardinality of the neighborhood system, k = |ηi|, is spatially invariant (we are assuming
a regular neighborhood system) and each variable is dependent on a fixed number of neighbors on a
lattice, ˆβMPL can be rewritten in terms of cross-covariances:

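$$\hat{\beta}_{MPL} = \frac{\displaystyle\sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\displaystyle\sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}} \tag{7}$$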

where ˆσij denotes the sample covariance between the central variable, xi, and xj ∈ ηi. Similarly, ˆσjk
denotes the sample covariance between two variables belonging to the neighborhood system, ηi (the
definition of the neighborhood system, ηi, does not include the location, si).
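
The estimators in Equations (4) and (6) can be sketched as follows for a single lattice configuration. This is only an illustrative implementation: the function names are hypothetical, μ is treated as known (for instance, replaced by the sample mean), and periodic boundaries are assumed so that every site has exactly k neighbors.

```python
import numpy as np

def neighbor_deviation_sum(field, mu, order=1):
    """Sum of (x_j - mu) over the 4 (order=1) or 8 (order=2) nearest neighbors.

    Assumption: periodic boundaries implemented with np.roll.
    """
    d = np.asarray(field, dtype=float) - mu
    shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if order == 2:
        shifts += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    return sum(np.roll(d, s, axis=(0, 1)) for s in shifts)

def beta_mpl(field, mu, order=1):
    """Maximum pseudo-likelihood estimate of beta, Equation (4)."""
    d = np.asarray(field, dtype=float) - mu
    nsum = neighbor_deviation_sum(field, mu, order)
    # Undefined (division by zero) for a perfectly constant field.
    return np.sum(d * nsum) / np.sum(nsum ** 2)

def sigma2_mpl(field, mu, beta, order=1):
    """Maximum pseudo-likelihood estimate of sigma^2, Equation (6)."""
    d = np.asarray(field, dtype=float) - mu
    resid = d - beta * neighbor_deviation_sum(field, mu, order)
    return np.mean(resid ** 2)
```

At the estimate returned by `beta_mpl`, the residuals are orthogonal to the neighbor sums, which is the pseudo-likelihood equation for β; with `beta = 0`, `sigma2_mpl` reduces to the (biased) sample variance around μ.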

2.2. Fisher Information of Spatial Dependence Parameters

Basically, Fisher information measures the amount of information a sample conveys about
an unknown parameter. It can be thought of as the likelihood analog of entropy, which is a
probability-based measure of uncertainty. Often, when we are dealing with independent and identically
distributed (i.i.d.) random variables, the computation of the global Fisher information present in
a random sample $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ is quite straightforward, since each observation, xi,
i = 1, 2, . . . , n, brings exactly the same amount of information (when we are dealing with independent
samples, the superscript, t, is usually suppressed, since the underlying dependence structure does
not change through time). However, this is not true for spatial dependence parameters in MRFs,
since different configuration patterns (xi ∪ ηi) provide distinct contributions to the local observed
Fisher information, which can be used to derive a reasonable approximation to the global Fisher
information [27].


2.3. The Information Equality

It is widely known from statistical inference theory that, under certain regularity conditions,
information equality holds in the case of independent observations in the exponential family [15–17].
In other words, we can compute the Fisher information of a random sample regarding a parameter of
interest, θ, by:

$$I\left(\theta; X^{(t)}\right) = E\left[ \left( \frac{\partial}{\partial\theta} \log L\left(\theta; X^{(t)}\right) \right)^2 \right] = -E\left[ \frac{\partial^2}{\partial\theta^2} \log L\left(\theta; X^{(t)}\right) \right] \tag{8}$$

where $L\left(\theta; X^{(t)}\right)$ denotes the likelihood function at a time instant, t. In our investigations, to avoid the
joint Gibbs distribution, often intractable due to the presence of the partition function (global Gibbs
field), we replace the usual likelihood function by Besag’s pseudo-likelihood function, and then, we
work with the local model instead (local Markov field).
However, given the intrinsic spatial dependence structure of Gaussian Markov random field
models, information equilibrium is not a natural condition. As we will discuss later, in general,
information equality fails.
Thus, in a GMRF model, we have to consider two kinds of Fisher
information, from now on denoted by Type I (due to the first derivative of the pseudo-likelihood
function) and Type II (due to the second derivative of the pseudo-likelihood function). Eventually,
when certain conditions are satisfied, these two values of information will converge to a unique bound.
Essentially, β is the parameter that controls whether both forms of information converge or
diverge. Knowing the role of β (inverse temperature) in a GMRF model, it is expected that for β = 0
(or T → ∞), information equilibrium prevails. In fact, we will see in the following sections that as β
deviates from zero (and long-term correlations start to emerge), the divergence between the two kinds
of information increases.
In terms of information geometry, it has been shown that the geometric structure of the exponential
family of distributions is basically given by the Fisher information matrix, which is the natural
Riemannian metric (metric tensor) [18,19]. So, when the inverse temperature parameter is zero, the
geometric structure of the model is a surface since the parametric space is 2D (μ and σ2). However,
as the inverse temperature parameter starts to increase, the original surface is gradually transformed
to a 3D Riemannian manifold, equipped with a novel metric tensor (the 3 × 3 Fisher information
matrix for μ, σ2 and β). In this context, by measuring the Fisher information regarding the inverse
temperature parameter along an interval ranging from βMIN = A = 0 to βMAX = B, we are essentially
trying to capture part of the deformation in the geometric structure of the model. In this paper, we
focus on the computation of this measure. In future works we expect to derive the complete Fisher
information matrix in order to completely characterize the transformations in the metric tensor.

2.4. Observed Fisher Information

In order to quantify the amount of information conveyed by a local configuration pattern in a
complex system, the concept of observed Fisher information must be defined.

Definition 3. Consider an MRF defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood system, ηi. The
Type I local observed Fisher information for the observation, xi, regarding the spatial dependence parameter, β, is
defined in terms of its local conditional density function as:

$$\phi_\beta(x_i) = \left[ \frac{\partial}{\partial\beta} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \right]^2 \tag{9}$$

Hence, for an isotropic pairwise GMRF model, the Type I local observed Fisher information
regarding β for the observation, xi, is given by:

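$$\begin{aligned}
\phi_\beta(x_i) &= \frac{1}{\sigma^4} \left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right] \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right] \right\}^2 \\
&= \frac{1}{\sigma^4} \left[ \sum_{j \in \eta_i} \left( x_i - \mu \right) \left( x_j - \mu \right) - \beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( x_j - \mu \right) \left( x_k - \mu \right) \right]^2
\end{aligned} \tag{10}$$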

Definition 4. Consider an MRF defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood system, ηi. The
Type II local observed Fisher information for the observation, xi, regarding the spatial dependence parameter, β,
is defined in terms of its local conditional density function as:

$$\psi_\beta(x_i) = -\frac{\partial^2}{\partial\beta^2} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \tag{11}$$

In the case of an isotropic pairwise GMRF model, the Type II local observed Fisher information
regarding β for the observation, xi, is given by:

$$\psi_\beta(x_i) = \frac{1}{\sigma^2} \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \tag{12}$$

Note that ψβ(xi) does not depend on xi, only on the neighborhood system, ηi.
Therefore, we have two local measures, φβ(xi) and ψβ(xi), that can be assigned to every element of
a system modeled by an isotropic pairwise GMRF. In the following, we will discuss some interpretations
for what is being measured with the proposed tools and how to define global versions for these
measures by means of the expected Fisher information.
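
The two local measures can be computed for every site of a lattice at once, as in the sketch below, which follows the closed forms obtained from Equations (9) and (11) for the GMRF model. The function name is hypothetical, and periodic boundaries are assumed purely for convenience.

```python
import numpy as np

def observed_fisher_maps(field, mu, sigma2, beta, order=1):
    """Return per-site maps (phi, psi) of Type I and Type II observed Fisher information.

    Assumption: periodic boundaries; order=1 uses 4 neighbors, order=2 uses 8.
    """
    d = np.asarray(field, dtype=float) - mu
    shifts = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if order == 2:
        shifts += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    # Summed neighbor deviations for every site.
    nsum = sum(np.roll(d, s, axis=(0, 1)) for s in shifts)
    # Type I: squared derivative of the local log-density with respect to beta.
    phi = ((d - beta * nsum) * nsum) ** 2 / sigma2 ** 2
    # Type II: minus the second derivative; it depends only on the neighbors.
    psi = nsum ** 2 / sigma2
    return phi, psi
```

For example, on a 3 × 3 lattice that is zero everywhere except for a unit value at the center (with μ = 0, σ2 = 1 and β = 0.5), the center itself has φ = ψ = 0, while each of its four nearest neighbors has ψ = 1 and φ = 0.25.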

2.5. The Role of Fisher Information in GMRF Models

At this point, a relevant issue is the interpretation of these Fisher information measures in a
complex system modeled by an isotropic pairwise GMRF. Roughly speaking, φβ(xi) is the quadratic
rate of change of the logarithm of the local likelihood function at xi, given a global value of β. As
this global value of β determines what would be the expected global behavior (if β is large, a high
degree of correlation among the observations is expected and if β is close to zero, the observations are
independent), it is reasonable to assume that configuration patterns showing values of φβ(xi) close to
zero are more likely to be observed throughout the field, since their likelihood values are high (close to
the maximum local likelihood condition). In other words, these patterns are more “aligned” to what is
considered to be the expected global behavior, and therefore, they convey little information about the
spatial dependence structure (these samples are not informative, since they are expected to exist in a
system operating at that particular value of inverse temperature).
Now, let us move on to configuration patterns showing high values of φβ(xi). Those samples
can be considered landmarks, because they convey a large amount of information about the global
spatial dependence structure. Roughly speaking, those points are very informative, since they are
not expected to exist for that particular value of β (which guides the expected global behavior of the
system). Therefore, Type I local observed Fisher information minimization in GMRFs can be a useful
tool in producing novel configuration patterns that are more likely to exist given the chosen value of
inverse temperature. Basically, φβ(xi) tells us how informative a given pattern is for that specific global
behavior (represented by a single parameter in an isotropic pairwise GMRF model). In summary, this
measure quantifies the degree of agreement between an observation, xi, and the configuration defined
by its neighborhood system for a given β.
As we will see later in the experiments section, typical informative patterns (those showing
high values of φβ(xi)) in an organized system are located at the boundaries of the regions defining
homogeneous areas (since these boundary samples show an unexpected behavior for large β, which is:
there is no strong agreement between xi and its neighbors).
Let us now analyze the Type II local observed Fisher information, ψβ(xi). Informally speaking, this
measure can be interpreted as a curvature measure, that is, how curved the local likelihood function
is at xi. Thus, patterns showing low values of ψβ(xi) tend to have a nearly flat local likelihood function.
This means that we are dealing with a pattern that could have been observed for a variety of β values
(a large set of β values have approximately the same likelihood). An implication of this fact is that
in a system dominated by this kind of pattern (patterns for which ψβ(xi) is close to zero), small
perturbations may cause a sharp change in β (and, therefore, in the expected global behavior). In other
words, these patterns are more susceptible to changes, since they do not have a “stable” configuration
(which raises our uncertainty about the true value of β).
On the other hand, if the global configuration is mostly composed of patterns exhibiting large
values of ψβ(xi), changes on the global structure are unlikely to happen (uncertainty on β is sufficiently
small). Basically, ψβ(xi) measures the degree of agreement or dependence among the observations
belonging to the same neighborhood system. If at a given xi, the observations belonging to ηi are
totally symmetric around the mean value, ψβ(xi) would be zero. This is reasonable to expect in this
situation, as there is no information about the induced spatial dependence structure (this means that
there is no contextual information available at this point). Notice that the role of ψβ(xi) is not the same
as that of φβ(xi). Actually, these two measures are almost inversely related, since if at xi the value of φβ(xi)
is high (it is a landmark or boundary pattern), then it is expected that ψβ(xi) will be low (in decision
boundaries or edges, the uncertainty about β is higher, causing ψβ(xi) to be small). In fact, we will
observe this behavior in some computational experiments conducted in later sections of the paper.
It is important to mention that these rather informal arguments define the basis for understanding
the meaning of the asymptotic variance of maximum pseudo-likelihood estimators, as we will discuss
ahead. In summary, ψβ(xi) is a measure of how sure or confident we are about the local spatial
dependence structure (at a given point, xi), since a high average curvature is desired for predicting the
system’s global behavior in a reasonable manner (reducing the uncertainty of β estimation).

3. Expected Fisher Information

In order to avoid the use of approximations in the computation of the global Fisher information
in an isotropic pairwise GMRF, in this section, we provide an exact expression for ˆφβ and ˆψβ as Type
I and Type II expected Fisher information. One advantage of using the expected Fisher information
instead of its global observed counterpart is the faster computing time. As we will see, instead of
computing a single local measure for each observation, xi ∈ X, and then taking the average, both
Φβ and Ψβ expressions depend only on the covariance matrix of the configuration patterns observed
along the random field.

3.1. The Type I Expected Fisher Information

Recall that the Type I expected Fisher information, from now on denoted by Φβ, is given by:

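$$\Phi_\beta = E\left[ \left( \frac{\partial}{\partial\beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right)^2 \right] \tag{13}$$

The Type II expected Fisher information, from now on denoted by Ψβ, is given by:

$$\Psi_\beta = -E\left[ \frac{\partial^2}{\partial\beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right] \tag{14}$$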

We first proceed to the definition of Φβ. Plugging Equation (3) in Equation (13), and after some
algebra, we obtain the following expression, which is composed of four main terms:

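$$\begin{aligned}
\Phi_\beta &= \frac{1}{\sigma^4} E\left\{ \left[ \sum_{s=1}^{n} \left( x_s - \mu - \beta \sum_{j \in \eta_s} \left( x_j - \mu \right) \right) \left( \sum_{j \in \eta_s} \left( x_j - \mu \right) \right) \right]^2 \right\} \\
&= \frac{1}{\sigma^4} E\left\{ \sum_{s=1}^{n} \sum_{r=1}^{n} \left[ x_s - \mu - \beta \sum_{j \in \eta_s} \left( x_j - \mu \right) \right] \left[ x_r - \mu - \beta \sum_{k \in \eta_r} \left( x_k - \mu \right) \right] \right. \\
&\qquad \left. \times \left[ \sum_{j \in \eta_s} \left( x_j - \mu \right) \right] \left[ \sum_{k \in \eta_r} \left( x_k - \mu \right) \right] \right\} \\
&= \frac{1}{\sigma^4} E\left\{ \sum_{s=1}^{n} \sum_{r=1}^{n} \left[ \left( x_s - \mu \right) \left( x_r - \mu \right) - \beta \sum_{k \in \eta_r} \left( x_s - \mu \right) \left( x_k - \mu \right) - \beta \sum_{j \in \eta_s} \left( x_r - \mu \right) \left( x_j - \mu \right) \right. \right. \\
&\qquad \left. \left. + \beta^2 \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \left[ \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \right\} \\
&= \frac{1}{\sigma^4} \sum_{s=1}^{n} \sum_{r=1}^{n} \left\{ \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ \left( x_s - \mu \right) \left( x_r - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \right. \\
&\qquad - \beta \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ \left( x_s - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \left( x_l - \mu \right) \right] \\
&\qquad - \beta \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ \left( x_r - \mu \right) \left( x_m - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \\
&\qquad \left. + \beta^2 \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ \left( x_m - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \left( x_l - \mu \right) \right] \right\}
\end{aligned} \tag{15}$$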

Hence, the expression for Φβ is composed of four main terms, each one of them involving
a summation of higher-order cross-moments. According to Isserlis’ theorem [28], for zero-mean jointly
normally distributed random variables, we can compute higher order moments in terms of the covariance
matrix through the following identity:

$$E\left[X_1 X_2 X_3 X_4\right] = E\left[X_1 X_2\right] E\left[X_3 X_4\right] + E\left[X_1 X_3\right] E\left[X_2 X_4\right] + E\left[X_2 X_3\right] E\left[X_1 X_4\right] \tag{16}$$
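
As a small illustration of this identity, the fourth-order moment of four zero-mean jointly Gaussian variables can be written directly in terms of the entries of their covariance matrix. The helper below is a hypothetical sketch (its name and index-based interface are choices made for this example only).

```python
import numpy as np

def isserlis_fourth_moment(cov, a, b, c, d):
    """E[X_a X_b X_c X_d] for zero-mean jointly Gaussian X with covariance matrix cov.

    Implements the three pairings of Equation (16).
    """
    cov = np.asarray(cov, dtype=float)
    return cov[a, b] * cov[c, d] + cov[a, c] * cov[b, d] + cov[b, c] * cov[a, d]
```

Two familiar special cases follow immediately: taking all four indices equal recovers E[X^4] = 3σ^4 for a single Gaussian variable, and for two independent variables the moment E[X^2 Y^2] factorizes into the product of the two variances.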

Then, the first term of Equation (15) is reduced to:

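$$\begin{aligned}
&\sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ \left( x_s - \mu \right) \left( x_r - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \\
&\quad = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \Big\{ E\left[ \left( x_s - \mu \right) \left( x_r - \mu \right) \right] E\left[ \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \\
&\qquad + E\left[ \left( x_s - \mu \right) \left( x_j - \mu \right) \right] E\left[ \left( x_r - \mu \right) \left( x_k - \mu \right) \right] \\
&\qquad + E\left[ \left( x_r - \mu \right) \left( x_j - \mu \right) \right] E\left[ \left( x_s - \mu \right) \left( x_k - \mu \right) \right] \Big\} \\
&\quad = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left[ \sigma_{sr} \sigma_{jk} + \sigma_{sj} \sigma_{rk} + \sigma_{rj} \sigma_{sk} \right]
\end{aligned} \tag{17}$$

where σsr denotes the covariance between variables xs and xr (note that in an MRF, we have σsr = 0 if
xr ∉ ηs). We now proceed to the expansion of the second main term of Equation (15). Similarly, by
applying Isserlis’ identity, we have: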

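$$\sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ \left( x_s - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \left( x_l - \mu \right) \right] = \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left[ \sigma_{sj} \sigma_{kl} + \sigma_{sk} \sigma_{jl} + \sigma_{jk} \sigma_{sl} \right] \tag{18}$$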

The third term of Equation (15) can be rewritten as:

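$$\begin{aligned}
&\sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} E\left[ \left( x_r - \mu \right) \left( x_m - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \\
&\quad = \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left[ \sigma_{rm} \sigma_{jk} + \sigma_{rj} \sigma_{mk} + \sigma_{mj} \sigma_{rk} \right]
\end{aligned} \tag{19}$$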

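Finally, the fourth term is:

$$\begin{aligned}
&\sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} E\left[ \left( x_m - \mu \right) \left( x_j - \mu \right) \left( x_k - \mu \right) \left( x_l - \mu \right) \right] \\
&\quad = \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left[ \sigma_{mj} \sigma_{kl} + \sigma_{mk} \sigma_{jl} + \sigma_{ml} \sigma_{jk} \right]
\end{aligned} \tag{20}$$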

Therefore, by combining Equations (17)–(20), we have the complete expression for Φβ,
the Type I expected Fisher information for an isotropic pairwise GMRF model regarding the inverse
temperature parameter, as:

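$$\begin{aligned}
\Phi_\beta = \frac{1}{\sigma^4} \sum_{s=1}^{n} \sum_{r=1}^{n} \Bigg\{ &\sum_{j \in \eta_s} \sum_{k \in \eta_r} \left[ \sigma_{sr} \sigma_{jk} + \sigma_{sj} \sigma_{rk} + \sigma_{rj} \sigma_{sk} \right] \\
&- \beta \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left[ \sigma_{sj} \sigma_{kl} + \sigma_{sk} \sigma_{jl} + \sigma_{jk} \sigma_{sl} \right] \\
&- \beta \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \left[ \sigma_{rm} \sigma_{jk} + \sigma_{rj} \sigma_{mk} + \sigma_{mj} \sigma_{rk} \right] \\
&+ \beta^2 \sum_{m \in \eta_s} \sum_{j \in \eta_s} \sum_{k \in \eta_r} \sum_{l \in \eta_r} \left[ \sigma_{mj} \sigma_{kl} + \sigma_{mk} \sigma_{jl} + \sigma_{ml} \sigma_{jk} \right] \Bigg\}
\end{aligned} \tag{21}$$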

However, since we are interested in studying how the spatial correlations change as the system evolves,
we need to estimate a value for Φβ given a single global state $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$. Hence, to
compute Φβ from a single static configuration X(t) (a photograph of the system at a given moment),
we consider n = 1 in the previous equation, which means, among other things, that s = r (which
implies ηs = ηr) and that observations belonging to different local neighborhoods are independent
from each other (as we are dealing with a pairwise interaction Markovian process, it does not make
sense to model the interactions between variables that are far away from each other in the lattice).
Before proceeding, we would like to clarify some points regarding the estimation of the β
parameter and the computation of the expected Fisher information in the isotropic pairwise GMRF
model. Basically, there are two main possibilities: (1) the parameter is spatially-invariant, which
means that we have a unique value, $\hat{\beta}^{(t)}$, for a global configuration of the system, $X^{(t)}$ (this is our
assumption); or (2) the parameter is spatially-variant, which means that we have a set of $\hat{\beta}_s$ values,
for s = 1, 2, . . . , n, each one of them estimated from $X_s = \{x_s^{(1)}, x_s^{(2)}, \ldots, x_s^{(t)}\}$ (we are observing the
outcomes of a random pattern along time in a fixed position of the lattice). When we are dealing with
the first model (β is spatially-invariant), all possible observation patterns (samples) are extracted from
the global configuration by a sliding window (with the shape of the neighborhood system) that moves
through the lattice at a fixed time instant, t. In this case, we are interested in studying the spatial
correlations, not the temporal ones. In other words, we would like to investigate how the spatial
structure of a GMRF model is related to Fisher information (this is exactly the scenario described
above, for which n = 1). Our motivation here is to characterize, via information-theoretic measures,
the behavior of the system as it evolves from states of minimum entropy to states of maximum entropy
(and vice versa) by providing a geometrical tool based on the definition of the Fisher curve, which will
be introduced in the following sections.
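
The sliding-window extraction described above can be sketched as follows for the second-order (eight-neighbor) system. This is an illustrative implementation under stated assumptions: the function names are hypothetical, only interior sites are used (no boundary padding), and each 3 × 3 window is flattened in row-major order so that the central variable occupies the middle position of the resulting vector.

```python
import numpy as np

def extract_patterns(field, radius=1):
    """Stack every (2r+1) x (2r+1) window of the lattice as one row vector.

    Assumption: interior sites only; windows are flattened in row-major order.
    """
    field = np.asarray(field, dtype=float)
    size = 2 * radius + 1
    windows = np.lib.stride_tricks.sliding_window_view(field, (size, size))
    return windows.reshape(-1, size * size)

def pattern_covariance(field, radius=1):
    """Sample covariance matrix of the local configuration patterns."""
    patterns = extract_patterns(field, radius)
    # Each column is one position inside the window, each row one sample.
    return np.cov(patterns, rowvar=False)
```

With `radius=1`, each pattern has nine entries and the central variable sits at index 4. A first-order (four-neighbor) system could be obtained, under the same assumptions, by keeping only the columns with indices 1, 3, 4, 5 and 7.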
Therefore, in our case (n = 1), Equation (21) is further simplified for practical usage. By unifying
s and r to a unique index, i, we have a final expression for Φβ in terms of the local covariances between
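the random variables in a given neighborhood system (e.g., for the eight nearest neighbors):

$$\begin{aligned}
\Phi_\beta = \frac{1}{\sigma^4} \Bigg\{ &\sum_{j \in \eta_i} \sum_{k \in \eta_i} \left[ \sigma^2 \sigma_{jk} + 2 \sigma_{ij} \sigma_{ik} \right] - 2\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \left[ \sigma_{ij} \sigma_{kl} + \sigma_{ik} \sigma_{jl} + \sigma_{il} \sigma_{jk} \right] \\
&+ \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \left[ \sigma_{jk} \sigma_{lm} + \sigma_{jl} \sigma_{km} + \sigma_{jm} \sigma_{kl} \right] \Bigg\}
\end{aligned} \tag{22}$$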

Note that we have two types of covariances in the definition of Φβ for an isotropic pairwise GMRF: (1)
covariances between the central variable, xi, and a neighboring variable, xj, denoted by σij, for j ∈ ηi;
and (2) covariances between two neighboring variables, xj and xk, for j, k ∈ ηi. In the next sections, we
will see how to compute the value of Ψβ directly from the covariance matrix of the local patterns.

3.2. The Type II Expected Fisher Information

Following the same methodology of replacing the likelihood function by the pseudo-likelihood
function of the GMRF model, a closed form expression for Ψβ is developed. Plugging Equation (3)
into Equation (14) leads us to:

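$$\begin{aligned}
\Psi_\beta &= \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left\{ \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \right\} \\
&= \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \\
&= \frac{1}{\sigma^2} \sum_{i=1}^{n} \left\{ \sum_{j \in \eta_i} \sum_{k \in \eta_i} E\left[ \left( x_j - \mu \right) \left( x_k - \mu \right) \right] \right\} \\
&= \frac{1}{\sigma^2} \sum_{i=1}^{n} \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk}
\end{aligned} \tag{23}$$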
Note that unlike Φβ, Ψβ does not depend explicitly on β (inverse temperature). As we have seen
before, Φβ is a quadratic function of the spatial dependence parameter.
In order to simplify the notations and also to make computations easier, the expressions for Φβ
and Ψβ can be rewritten in a matrix-vector form. Let Σp be the covariance matrix of the random vectors
⃗pi, i = 1, 2, . . . , n, obtained by lexicographic ordering of the local configuration patterns xi ∪ ηi. Thus,
considering a neighborhood system, ηi, of size K, we have Σp given by a (K + 1) × (K + 1) symmetric
matrix (for K + 1 odd, i.e., K = 4, 8, 12, . . .):

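$$\Sigma_p = \begin{pmatrix}
\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+1} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+1} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
\sigma_{(K/2)+1,1} & \cdots & \sigma_{(K/2)+1,K/2} & \sigma_{(K/2)+1,(K/2)+1} & \sigma_{(K/2)+1,(K/2)+2} & \cdots & \sigma_{(K/2)+1,K+1} \\
\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+1} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+1} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
\end{pmatrix}$$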

Let $\Sigma_p^-$ be the submatrix of dimensions K × K obtained by removing the central row and central
column of Σp (the covariances between xi and each one of its neighbors, xj). Then, for K + 1 odd, we
have:

$$\Sigma_p^- = \begin{pmatrix}
\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
\end{pmatrix} \tag{24}$$

Thus, $\Sigma_p^-$ is a matrix that stores only the covariances among the neighboring variables. Furthermore,
let $\vec{\rho}$ be the vector of dimensions K × 1 formed by all the elements of the central row of Σp, excluding
the middle one (which is actually a variance), that is:

$$
\vec{\rho} =
\begin{bmatrix}
\sigma_{(K/2)+1,1} & \cdots & \sigma_{(K/2)+1,K/2} & \sigma_{(K/2)+1,(K/2)+2} & \cdots & \sigma_{(K/2)+1,K+1}
\end{bmatrix}
\tag{25}
$$
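As a rough illustration of how Σp, the submatrix of Equation (24) and the vector of Equation (25) might be estimated from a single lattice, the sketch below collects every 3 × 3 local pattern (K = 8), flattens it in lexicographic order and takes the sample covariance. The function name `local_pattern_statistics`, the use of only interior windows (boundary sites are dropped) and the random test lattice are assumptions made for this example, not part of the original method description.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_pattern_statistics(lattice, window=3):
    """Return (Sigma_p, Sigma_p_minus, rho) estimated from a 2D lattice.

    Each window x window patch is flattened in lexicographic (row-major)
    order, so the central variable x_i sits at 0-based index K/2,
    where K = window**2 - 1 is the neighborhood size.
    """
    patches = sliding_window_view(lattice, (window, window))
    patterns = patches.reshape(-1, window * window)    # one row per local pattern
    sigma_p = np.cov(patterns, rowvar=False)           # (K+1) x (K+1) covariance

    center = (window * window) // 2                    # position of x_i
    # Drop the central row and column: covariances among neighbors only.
    sigma_p_minus = np.delete(np.delete(sigma_p, center, axis=0), center, axis=1)
    # Central row without its middle entry: covariances between x_i and neighbors.
    rho = np.delete(sigma_p[center], center)
    return sigma_p, sigma_p_minus, rho


# Hypothetical input: independent Gaussian noise with variance 5.
rng = np.random.default_rng(0)
lattice = rng.normal(0.0, np.sqrt(5.0), size=(64, 64))
sigma_p, sigma_p_minus, rho = local_pattern_statistics(lattice)
```

Other neighborhood sizes with a square support (for example K = 24) only require a different `window`; non-square neighborhoods such as K = 4 would need a custom pattern extraction.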

Therefore, we can rewrite Equation (23) (for n = 1) using Kronecker products. The following definition
provides a fast way to compute Φβ by exploiting these tensor products.

Definition 5. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
system, ηi, of size K (usual choices for K are even values: four, eight, 12, 20 or 24). Assuming that
$X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t and $\vec{\rho}$ and $\Sigma_p^-$ are defined as in
Equations (25) and (24), the Type I expected Fisher information, Φβ, for this state, $X^{(t)}$, is:

$$
\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+ - 6\beta \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right]
\tag{26}
$$

where ∥A∥+ denotes the summation of all the entries of the matrix, A (not to be confused with a matrix
norm) and ⊗ denotes the Kronecker (tensor) product. From an information geometry perspective,
the presence of tensor products indicates the intrinsic differential geometry of a manifold in the form
of the Riemann curvature tensor [18]. Note that all the necessary information for computing the
Fisher information is somehow encoded in the covariance matrix of the local configuration patterns,
(xi ∪ ηi), i = 1, 2, . . . , n, as would be expected in the case of Gaussian variables (second-order statistics).
The same procedure is applied to the Type II expected Fisher information.
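A minimal sketch of Equation (26) is given below, first evaluated literally with `numpy.kron` and then with a shortcut. The shortcut relies on the fact that every entry of A ⊗ B is a product of one entry of A and one entry of B, so the sum of all entries satisfies ∥A ⊗ B∥+ = ∥A∥+ ∥B∥+. The function names and the toy inputs are assumptions made for this example.

```python
import numpy as np


def entry_sum(a):
    """||A||_+ : plain sum of all entries (not a matrix norm)."""
    return float(np.sum(a))


def type1_fisher_kron(beta, sigma2, rho, sigma_minus):
    """Equation (26) evaluated literally with Kronecker products."""
    row = rho.reshape(1, -1)    # rho^T as a 1 x K row vector
    col = rho.reshape(-1, 1)    # rho as a K x 1 column vector
    total = (sigma2 * entry_sum(sigma_minus)
             + 2.0 * entry_sum(np.kron(col, row))
             - 6.0 * beta * entry_sum(np.kron(row, sigma_minus))
             + 3.0 * beta**2 * entry_sum(np.kron(sigma_minus, sigma_minus)))
    return total / sigma2**2


def type1_fisher_fast(beta, sigma2, rho, sigma_minus):
    """Same quantity, using ||A (x) B||_+ = ||A||_+ * ||B||_+."""
    r, s = entry_sum(rho), entry_sum(sigma_minus)
    return (sigma2 * s + 2.0 * r**2 - 6.0 * beta * r * s
            + 3.0 * beta**2 * s**2) / sigma2**2


# Hypothetical toy inputs for K = 4 neighbors.
rng = np.random.default_rng(1)
a = rng.normal(size=(4, 4))
sigma_minus = a @ a.T           # symmetric, positive semi-definite
rho = rng.normal(size=4)
phi_kron = type1_fisher_kron(0.2, 5.0, rho, sigma_minus)
phi_fast = type1_fisher_fast(0.2, 5.0, rho, sigma_minus)
```

The shortcut avoids building the K² × K² matrix, which matters little for K = 8 but keeps the cost linear in the number of covariance entries.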

Definition 6. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
system, ηi, of size K (usual choices for K are four, eight, 12, 20 or 24). Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$
denotes the global configuration of the system at time t and $\Sigma_p^-$ is defined as in Equation (24), the Type II expected
Fisher information, Ψβ, for this state, $X^{(t)}$, is given by:

$$
\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^- \right\|_+
\tag{27}
$$

3.3. Information Equilibrium in GMRF Models

From the definition of both Φβ and Ψβ, a natural question that arises is: under what
conditions do we have Φβ = Ψβ in an isotropic pairwise GMRF model? As we can see from
Equations (26) and (27), the difference between Φβ and Ψβ, from now on denoted by $\Delta_\beta(\vec{\rho}, \Sigma_p^-)$,
is simply:

$$
\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) = \frac{1}{\sigma^4} \left[ 2 \left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+ - 6\beta \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right]
\tag{28}
$$

Then, intuitively, the condition for information equality is achieved when $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$. As
$\Delta_\beta(\vec{\rho}, \Sigma_p^-)$ is a simple quadratic function of the inverse temperature parameter, β, we can easily find
that the value, β∗, for which $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$, is:

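$$
\beta^* = \frac{\left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+}{\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+}
\pm \frac{\sqrt{3}}{3} \, \frac{\sqrt{3 \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+^2 - 2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+}}{\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+}
\tag{29}
$$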

provided that $3 \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+^2 \geq 2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+$ and $\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \neq 0$.
Note that if $\left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+ = 0$, then one solution for the above equation is β∗ = 0.
In other words, when σij = 0, ∀j ∈ ηi (no correlation between xi and its neighbors, xj), information equilibrium is achieved for
β∗ = 0, which in this case, is the maximum pseudo-likelihood estimate of β, since in this matrix-vector
notation, $\hat{\beta}_{MPL}$ is given by:

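$$
\hat{\beta}_{MPL} = \frac{\sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}} = \frac{\left\| \vec{\rho} \right\|_+}{\left\| \Sigma_p^- \right\|_+}
\tag{30}
$$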

In the isotropic pairwise GMRF model, if β = 0, then we have $\left\| \vec{\rho} \right\|_+ = 0$, and as a consequence,
Φβ = Ψβ. However, the converse is not necessarily true, that is, we may observe that Φβ = Ψβ for a
non-zero β. One example is β∗, a solution of $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$.
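The roots in Equation (29) can be checked numerically with a short sketch such as the one below. It uses the entry-sum identity ∥A ⊗ B∥+ = ∥A∥+ ∥B∥+ to avoid forming the Kronecker products; under that identity the discriminant reduces to (∥ρ∥+ ∥Σp−∥+)², so the two roots are real whenever the denominator is non-zero. The function names and the toy covariance values are assumptions made for this illustration.

```python
import math
import numpy as np


def equilibrium_betas(rho, sigma_minus):
    """Roots of Delta_beta = 0 as in Equation (29); () if none are real."""
    r = float(np.sum(rho))            # ||rho||_+
    s = float(np.sum(sigma_minus))    # ||Sigma_p^-||_+
    outer = r * r                     # ||rho (x) rho^T||_+
    cross = r * s                     # ||rho^T (x) Sigma_p^-||_+
    quad = s * s                      # ||Sigma_p^- (x) Sigma_p^-||_+
    if quad == 0.0:
        raise ValueError("||Sigma_p^- (x) Sigma_p^-||_+ must be non-zero")
    disc = 3.0 * cross**2 - 2.0 * quad * outer
    if disc < 0.0:
        return ()
    shift = (math.sqrt(3.0) / 3.0) * math.sqrt(disc) / quad
    return (cross / quad - shift, cross / quad + shift)


def delta(beta, sigma2, rho, sigma_minus):
    """Equation (28), written with the same entry-sum shortcuts."""
    r, s = float(np.sum(rho)), float(np.sum(sigma_minus))
    return (2.0 * r**2 - 6.0 * beta * r * s + 3.0 * beta**2 * s**2) / sigma2**2


# Hypothetical K = 4 example.
rho = np.array([1.0, 0.5, 0.5, 1.0])
sigma_minus = np.full((4, 4), 0.5) + 0.5 * np.eye(4)
roots = equilibrium_betas(rho, sigma_minus)
residuals = [delta(b, 5.0, rho, sigma_minus) for b in roots]
```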

4. Entropy in Isotropic Pairwise GMRFs

Our definition of entropy follows the same process employed to derive Φβ and Ψβ.
Since the entropy of a random variable x is defined as the expected value of its self-information,
−log p(x), it can be thought of as a probability-based counterpart to the Fisher information.

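Definition 7. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t,
then the entropy, Hβ, for this state $X^{(t)}$ is given by:

$$
\begin{aligned}
H_\beta &= -E\left[ \log L\left(\vec{\theta}; X^{(t)}\right) \right] = -E\left[ \log \prod_{i=1}^{n} p\left(x_i \mid \eta_i, \vec{\theta}\right) \right] \\
&= \frac{n}{2} \log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} E\left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} (x_j - \mu) \right]^2 \right\} \\
&= \frac{n}{2} \log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left\{ E\left[ (x_i - \mu)^2 \right] - 2\beta E\left[ \sum_{j \in \eta_i} (x_i - \mu)(x_j - \mu) \right] + \beta^2 E\left\{ \left[ \sum_{j \in \eta_i} (x_j - \mu) \right]^2 \right\} \right\}
\end{aligned}
\tag{31}
$$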

After some algebra, the expression for Hβ becomes:

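$$
\begin{aligned}
H_\beta &= \frac{n}{2} \log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ \sigma^2 - 2\beta \sum_{j \in \eta_i} \sigma_{ij} + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right] \\
&= \left[ \frac{n}{2} \log\left(2\pi\sigma^2\right) + \frac{n}{2} \right] - \frac{\beta}{\sigma^2} \sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \sigma_{ij} \right] + \frac{\beta^2}{2\sigma^2} \sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right]
\end{aligned}
\tag{32}
$$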

Using the same matrix-vector notation introduced in the previous sections, we can further simplify the
expression for Hβ (considering n = 1).

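Definition 8. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t
and $\vec{\rho}$ and $\Sigma_p^-$ are defined as in Equations (25) and (24), the entropy, Hβ, for this state, $X^{(t)}$, is given by:

$$
H_\beta = H_G - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2\sigma^2} \left\| \Sigma_p^- \right\|_+ \right]
= H_G - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2} \Psi_\beta \right]
\tag{33}
$$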

where HG denotes the entropy of a Gaussian random variable with variance σ² and Ψβ is the Type II
expected Fisher information.
Note that Shannon entropy is a quadratic function of the spatial dependence parameter, β.
Since the coefficient of the quadratic term is non-negative (Ψβ is the Type II expected Fisher
information), entropy is a convex function of β. Furthermore, as expected, when β = 0 and there is no
induced spatial dependence in the system, the resulting expression for Hβ is the usual entropy of a
Gaussian random variable, HG. Thus, there is a value, $\hat{\beta}_{MH}$, for the inverse temperature parameter,
which minimizes the entropy of the system. In fact, $\hat{\beta}_{MH}$ is given by:

$$
\frac{\partial H_\beta}{\partial \beta} = \frac{\beta}{\sigma^2} \left\| \Sigma_p^- \right\|_+ - \frac{1}{\sigma^2} \left\| \vec{\rho} \right\|_+ = 0
\tag{34}
$$

$$
\hat{\beta}_{MH} = \frac{\left\| \vec{\rho} \right\|_+}{\left\| \Sigma_p^- \right\|_+} = \hat{\beta}_{MPL}
$$


showing that the maximum pseudo-likelihood and the minimum-entropy estimates are equivalent
in an isotropic pairwise GMRF model. Moreover, using the derived equations, we see a relationship
between Φβ, Ψβ and Hβ:

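$$
\Phi_\beta - \Psi_\beta = \Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right)
\tag{35}
$$

$$
\frac{\partial^2 H_\beta}{\partial \beta^2} = \Psi_\beta
$$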

where the functional $\Delta_\beta(\vec{\rho}, \Sigma_p^-)$ that represents the difference between Φβ and Ψβ is defined by
Equation (28). These equations relate the entropy and one form of Fisher information (Ψβ) in GMRF
models, showing that Ψβ can be roughly viewed as the curvature of Hβ. In this sense, in a hypothetical
information equilibrium condition Ψβ = Φβ = 0, the entropy's curvature would be null (Hβ would
never change). These results suggest that an increase in the value of Ψβ, which means stability (a
measure of agreement between the neighboring observations of a given point), contributes to the curvature
of Hβ and, therefore, to inducing a change in the entropy of the system. In this context, the analysis of the
Fisher information could provide insights for predicting the entropy of a system.
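The relations in Equations (33) to (35) can be verified numerically on a toy example. The sketch below evaluates Hβ for n = 1, locates the minimizer from Equation (34) and compares a finite-difference estimate of the second derivative with Ψβ. The function name `entropy`, the step size and the covariance values are assumptions chosen for this illustration only.

```python
import math
import numpy as np


def entropy(beta, sigma2, rho, sigma_minus):
    """Equation (33) for n = 1, with H_G = 0.5 * (log(2*pi*sigma^2) + 1)."""
    h_gauss = 0.5 * (math.log(2.0 * math.pi * sigma2) + 1.0)
    r, s = float(np.sum(rho)), float(np.sum(sigma_minus))
    return h_gauss - (beta * r / sigma2 - beta**2 * s / (2.0 * sigma2))


# Hypothetical K = 4 example.
sigma2 = 5.0
rho = np.array([1.0, 0.5, 0.5, 1.0])
sigma_minus = np.full((4, 4), 0.5) + 0.5 * np.eye(4)

beta_mh = float(np.sum(rho) / np.sum(sigma_minus))   # Equation (34)
psi = float(np.sum(sigma_minus)) / sigma2             # Equation (27)

# Central second difference; exact up to rounding for a quadratic in beta.
h = 1e-3
curvature = (entropy(beta_mh + h, sigma2, rho, sigma_minus)
             - 2.0 * entropy(beta_mh, sigma2, rho, sigma_minus)
             + entropy(beta_mh - h, sigma2, rho, sigma_minus)) / h**2
```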

5. Asymptotic Variance of MPL Estimators

It is known from the statistical inference literature that unbiasedness is a property that is not
guaranteed by maximum likelihood estimation, nor by maximum pseudo-likelihood (MPL) estimation.
In fact, there is no universal method that guarantees the existence of unbiased estimators for a fixed
sample of size n. Often, in the exponential family of distributions, maximum likelihood estimators
(MLEs) coincide with the UMVU (uniformly minimum-variance unbiased) estimators, because MLEs
are functions of complete sufficient statistics. There is an important result in statistical inference that
shows that if the MLE is unique, then it is a function of sufficient statistics. Many other properties
make maximum likelihood estimation a reference
method [15–17]. One of the most important concerns the asymptotic behavior of MLEs:
as the sample size grows to infinity (n → ∞), MLEs become asymptotically unbiased and
efficient. Unfortunately, there is no result showing that the same occurs in maximum pseudo-likelihood
estimation. The objective of this section is to propose a closed-form expression for the asymptotic variance
of the maximum pseudo-likelihood estimator of β in an isotropic pairwise GMRF model. Unsurprisingly, this
variance is completely defined as a function of both forms of expected Fisher information, Ψβ and Φβ,
since, for general values of the inverse temperature parameter, the information equality condition fails.

5.1. The Asymptotic Variance of the Inverse Temperature Parameter

In mathematical statistics, asymptotic evaluations uncover several fundamental properties of
inference methods, providing a powerful and general tool for studying and characterizing the behavior
of estimators. In this section, our objective is to derive an expression for the asymptotic variance
of the maximum pseudo-likelihood estimator of the inverse temperature parameter (β) in isotropic
pairwise GMRF models. It is known from the statistical inference literature that both maximum
likelihood and maximum pseudo-likelihood estimators share two important properties: consistency
and asymptotic normality [29,30]. It is possible, therefore, to completely characterize their behaviors
in the limiting case. In other words, the asymptotic distribution of ˆβMPL is normal, centered around
the real parameter value (since consistency means that the estimator is asymptotically unbiased),
with the asymptotic variance representing the uncertainty about how far we are from the mean (real
value). From a statistical perspective, $\hat{\beta}_{MPL} \approx N\left(\beta, \upsilon_\beta\right)$, where υβ denotes the asymptotic variance
of the maximum pseudo-likelihood estimator. It is known that the asymptotic covariance matrix of
maximum pseudo-likelihood estimators is given by [31]:

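$$
C(\vec{\theta}) = H^{-1}(\vec{\theta}) \, J(\vec{\theta}) \, H^{-1}(\vec{\theta})
\tag{36}
$$

with:

$$
H(\vec{\theta}) = E_\beta\left[ \nabla^2 \log L\left(\vec{\theta}; X^{(t)}\right) \right]
\tag{37}
$$

$$
J(\vec{\theta}) = \mathrm{Var}_\beta\left[ \nabla \log L\left(\vec{\theta}; X^{(t)}\right) \right]
\tag{38}
$$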

where H and J denote, respectively, the expected Hessian matrix and the covariance matrix of the
gradient of the logarithm of the pseudo-likelihood function, as given by Equations (37) and (38). Thus,
considering the parameter of interest, β, we have the following
definition for its asymptotic variance, υβ (the derivatives are taken with respect to β):

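$$
\upsilon_\beta = \frac{\mathrm{Var}_\beta\left[ \frac{\partial}{\partial \beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right]}{E_\beta^2\left[ \frac{\partial^2}{\partial \beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right]}
= \frac{E_\beta\left[ \left( \frac{\partial}{\partial \beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right)^2 \right] - E_\beta^2\left[ \frac{\partial}{\partial \beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right]}{E_\beta^2\left[ \frac{\partial^2}{\partial \beta^2} \log L\left(\vec{\theta}; X^{(t)}\right) \right]}
\tag{39}
$$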

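However, note that the expected value of the first derivative of $\log L(\vec{\theta}; X^{(t)})$ with respect to β is zero:

$$
E\left[ \frac{\partial}{\partial \beta} \log L\left(\vec{\theta}; X^{(t)}\right) \right]
= \frac{1}{\sigma^2} \sum_{i=1}^{n} \left[ E\left[ x_i - \mu \right] - \beta \sum_{j \in \eta_i} E\left[ x_j - \mu \right] \right] = 0
\tag{40}
$$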

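Therefore, the second term of the numerator of Equation (39) vanishes and the final expression for the
asymptotic variance of the inverse temperature parameter is given as the ratio between Φβ and $\Psi_\beta^2$:

$$
\begin{aligned}
\upsilon_\beta = \frac{1}{\left[ \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \right]^2} \Bigg[
& \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left( \sigma^2 \sigma_{jk} + 2 \sigma_{ij} \sigma_{ik} \right)
- 2\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \left( \sigma_{ij} \sigma_{kl} + \sigma_{ik} \sigma_{jl} + \sigma_{il} \sigma_{jk} \right) \\
& + \beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \left( \sigma_{jk} \sigma_{lm} + \sigma_{jl} \sigma_{km} + \sigma_{jm} \sigma_{kl} \right) \Bigg]
\end{aligned}
\tag{41}
$$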

This derivation leads us to another definition concerning an isotropic pairwise GMRF.

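Definition 9. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
system, ηi. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time t,
and $\vec{\rho}$ and $\Sigma_p^-$ are defined as in Equations (25) and (24), the asymptotic variance of the maximum pseudo-likelihood
estimator of the inverse temperature parameter, β, is given by (using the same matrix-vector notation from the
previous sections):

$$
\begin{aligned}
\upsilon_\beta &= \frac{\sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+ - 6\beta \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+}{\left\| \Sigma_p^- \right\|_+^2} \\
&= \frac{\sigma^2}{\left\| \Sigma_p^- \right\|_+} + \frac{\sigma^4 \Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right)}{\left\| \Sigma_p^- \right\|_+^2}
= \frac{1}{\Psi_\beta} + \frac{1}{\Psi_\beta^2} \left( \Phi_\beta - \Psi_\beta \right)
\end{aligned}
\tag{42}
$$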


Note that when information equilibrium prevails, that is Φβ = Ψβ, the asymptotic variance is
given by the inverse of the expected Fisher information. However, the interpretation of this equation
indicates that the uncertainty in the estimation of the inverse temperature parameter is minimized when
Ψβ is maximized. Essentially, this means that on average, the local pseudo-likelihood functions are not
flat, that is, small changes in the local configuration patterns along the system cannot cause abrupt
changes in the expected global behavior (the global spatial dependence structure is not susceptible to
sharp changes). To reach this condition, there must be a reasonable degree of agreement between the
neighboring elements throughout the system, a behavior that is usually associated with low temperature
states (β is above a critical value and there is a visible induced spatial dependence structure).
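As a small numerical check of Equation (42), the sketch below evaluates the asymptotic variance once from the covariance sums and once as Φβ / Ψβ², using the entry-sum identity ∥A ⊗ B∥+ = ∥A∥+ ∥B∥+ for the Kronecker terms. The helper names and the toy values of ρ and Σp− are assumptions made for this example.

```python
import numpy as np


def fisher_pair(beta, sigma2, rho, sigma_minus):
    """Return (Phi_beta, Psi_beta) from Equations (26) and (27)."""
    r, s = float(np.sum(rho)), float(np.sum(sigma_minus))
    phi = (sigma2 * s + 2.0 * r**2 - 6.0 * beta * r * s
           + 3.0 * beta**2 * s**2) / sigma2**2
    psi = s / sigma2
    return phi, psi


def asymptotic_variance(beta, sigma2, rho, sigma_minus):
    """First form of Equation (42): numerator over ||Sigma_p^-||_+ squared."""
    r, s = float(np.sum(rho)), float(np.sum(sigma_minus))
    numerator = sigma2 * s + 2.0 * r**2 - 6.0 * beta * r * s + 3.0 * beta**2 * s**2
    return numerator / s**2


# Hypothetical K = 4 example evaluated at beta = 0.3.
sigma2, beta = 5.0, 0.3
rho = np.array([1.0, 0.5, 0.5, 1.0])
sigma_minus = np.full((4, 4), 0.5) + 0.5 * np.eye(4)

phi, psi = fisher_pair(beta, sigma2, rho, sigma_minus)
variance = asymptotic_variance(beta, sigma2, rho, sigma_minus)
decomposed = 1.0 / psi + (phi - psi) / psi**2   # last form of Equation (42)
```

When Φβ = Ψβ the second term of `decomposed` vanishes and the variance reduces to 1/Ψβ, which is the information equilibrium case described above.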

6. The Fisher Curve of a System

With the definition of Φβ, Ψβ and Hβ, we have the necessary tools to compute three important
information-theoretic measures of a global configuration of the system. Our idea is that we can study
the behavior of a complex system by constructing a parametric curve in this information-theoretic
space as a function of the inverse temperature parameter, β. Our expectation is that the resulting
trajectory provides a geometrical interpretation of how the system moves from an initial configuration,
A (with a low entropy value for instance), to a desired final configuration, B (with a greater value
of entropy, for instance), since the Fisher information plays an important role in providing a natural
metric to the Riemannian manifolds of statistical models [18,19]. We call the path from global State
A to global State B the Fisher curve (from A to B) of the system, denoted by $\vec{F}_A^B(\beta)$. Instead of using
time as the parameter to build the curve, $\vec{F}$, we parametrize $\vec{F}$ by the inverse temperature parameter, β.

Definition 10. Let an isotropic pairwise GMRF be defined on a lattice S = {s1, s2, . . . , sn} with a neighborhood
system, ηi, and X(β1), X(β2), . . . , X(βn) be a sequence of outcomes (global configurations) produced by different
values of βi (inverse temperature parameters) for which A = βMIN = β1 < β2 < · · · < βn = βMAX = B.
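The system's Fisher curve from A to B is defined as the function $\vec{F} : \mathbb{R} \to \mathbb{R}^3$ that maps each configuration,
X(βi), to a point $\left(\Phi_{\beta_i}, \Psi_{\beta_i}, H_{\beta_i}\right)$ from the information space, that is:

$$
\vec{F}_A^B(\beta) = \left(\Phi_\beta, \Psi_\beta, H_\beta\right), \quad \beta = A, \ldots, B
\tag{43}
$$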

where Φβ, Ψβ and Hβ denote the Type I expected Fisher information, the Type II expected Fisher
information and the Shannon entropy of the global configuration, X(β), defined by:

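$$
\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \otimes \vec{\rho}^T \right\|_+ - 6\beta \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right]
\tag{44}
$$

$$
\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^- \right\|_+
\tag{45}
$$

$$
H_\beta = \frac{1}{2} \left[ \log\left(2\pi\sigma^2\right) + 1 \right] - \left[ \frac{\beta}{\sigma^2} \left\| \vec{\rho} \right\|_+ - \frac{\beta^2}{2} \Psi_\beta \right]
\tag{46}
$$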

In the next sections, we show some computational experiments that illustrate the effectiveness of
the proposed tools in measuring the information encoded in complex systems. We want to investigate
what happens to the Fisher curve as the inverse temperature parameter is modified in order to control
the system's global behavior. Our main conclusion, which is supported by experimental analysis, is
that $\vec{F}_A^B(\beta) \neq \vec{F}_B^A(\beta)$. In other words, in terms of information, moving towards higher entropy states
is not the same as moving towards lower entropy states, since the Fisher curves that represent the
trajectory between the initial State A and the final State B are significantly different.
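To make Definition 10 concrete, the following sketch maps a sequence of configurations, each summarized by its β value and its estimated ρ and Σp−, to points (Φβ, Ψβ, Hβ) using Equations (44) to (46). The function names and the synthetic covariance values are assumptions for illustration; in an actual experiment, ρ and Σp− would be estimated from each sampled configuration.

```python
import math
import numpy as np


def fisher_curve_point(beta, sigma2, rho, sigma_minus):
    """Map one configuration to (Phi, Psi, H), Equations (44)-(46)."""
    r, s = float(np.sum(rho)), float(np.sum(sigma_minus))
    phi = (sigma2 * s + 2.0 * r**2 - 6.0 * beta * r * s
           + 3.0 * beta**2 * s**2) / sigma2**2
    psi = s / sigma2
    h_gauss = 0.5 * (math.log(2.0 * math.pi * sigma2) + 1.0)
    h = h_gauss - (beta * r / sigma2 - 0.5 * beta**2 * psi)
    return phi, psi, h


def fisher_curve(samples, sigma2):
    """samples: iterable of (beta, rho, sigma_minus), one per configuration."""
    return np.array([fisher_curve_point(b, sigma2, rho, sm)
                     for b, rho, sm in samples])


# Synthetic stand-in for estimated covariances (K = 8): correlations are
# simply assumed to grow with beta. This is NOT output of a real simulation.
sigma2 = 5.0
betas = np.linspace(0.0, 0.15, 4)
samples = [(b, np.full(8, 2.0 * b), sigma2 * np.eye(8) + b) for b in betas]
curve = fisher_curve(samples, sigma2)   # rows are points (Phi, Psi, H)
```

In this toy setting, the β = 0 point has eight uncorrelated neighbors of variance σ² = 5, which gives Φβ = Ψβ = 8 and Hβ = HG.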


7. Computational Simulations

This section discusses some numerical experiments proposed to illustrate some applications of
the derived tools in both simulations and real data. Our computational investigations were divided
into two main sets of experiments:

(1) Local analysis: analysis of the local and global versions of the measures (φβ, ψβ, Φβ, Ψβ and Hβ),
considering a fixed inverse temperature parameter;
(2) Global analysis: analysis of the global versions of the measures (Φβ, Ψβ and Hβ) along Markov
chain Monte Carlo (MCMC) simulations in which the inverse temperature parameter is modified
to control the expected global behavior.

7.1. Learning from Spatial Data with Local Information-Theoretic Measures

First, in order to illustrate a simple application of both forms of local observed Fisher
information, φβ and ψβ, we performed an experiment using some synthetic images generated by
the Metropolis–Hastings algorithm. The basic idea of this simulation process is to start at an initial
configuration in which temperature is infinite (or β = 0). This basic initial condition is randomly
chosen, and after a fixed number of steps, the algorithm produces a configuration that is considered to
be a valid outcome of an isotropic pairwise GMRF model. Figure 1 shows an example of the initial
condition and the resulting system configuration after 1,000 iterations considering a second order
neighborhood system (eight nearest neighbors). The model parameters were chosen as: μ = 0, σ2 = 5
and β = 0.8.
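A simulation of this kind could be sketched as follows. This is only an assumed, simplified single-site Metropolis–Hastings scheme, not the authors' implementation: each site receives a Gaussian random-walk proposal, and the acceptance ratio uses the local conditional density N(μ + β Σj (xj − μ), σ²) of the isotropic pairwise GMRF, which is the only part of the joint density that changes when a single site is updated. The toroidal boundary, the proposal step, the lattice size and the small β used in the demonstration are all assumptions.

```python
import numpy as np


def neighbor_sum(x, i, j, mu):
    """Sum of (x_j - mu) over the 8 nearest neighbors (toroidal wrap-around)."""
    rows, cols = x.shape
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            total += x[(i + di) % rows, (j + dj) % cols] - mu
    return total


def metropolis_sweep(x, beta, mu, sigma2, rng, step=1.0):
    """One in-place sweep: every site gets one random-walk proposal."""
    rows, cols = x.shape
    for i in range(rows):
        for j in range(cols):
            mean = mu + beta * neighbor_sum(x, i, j, mu)
            proposal = x[i, j] + rng.normal(0.0, step)
            # Log ratio of the local conditional densities N(mean, sigma2).
            log_ratio = ((x[i, j] - mean) ** 2
                         - (proposal - mean) ** 2) / (2.0 * sigma2)
            if np.log(rng.uniform()) < log_ratio:
                x[i, j] = proposal
    return x


# Toy run: random initial state (beta = 0), then a few sweeps.
rng = np.random.default_rng(42)
mu, sigma2, beta = 0.0, 5.0, 0.05       # beta kept small for this short demo
x = rng.normal(mu, np.sqrt(sigma2), size=(16, 16))
for _ in range(5):
    metropolis_sweep(x, beta, mu, sigma2, rng)
```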


Figure 1. Example of Gaussian Markov random field (GMRF) model outputs. The values of the inverse
temperature parameter, β, in the left and right configurations are zero and 0.8, respectively.

Three Fisher information maps were generated from both initial and resulting configurations.
The first map was obtained by calculating the value, φβ(xi), for every point of the system, that is for
i = 1, 2, . . . , n. Similarly, the second one was obtained by using ψβ(xi). The last information map was
built by using the ratio between φβ(xi) and ψβ(xi), motivated by the fact that boundaries are often
composed of patterns that are not expected to be “aligned” to the global behavior (and, therefore, show
high values of φβ(xi)) and also are somehow unstable (show low values of ψβ(xi)). We will call
this measure, Lβ(xi) = φβ(xi)/ψβ(xi), the local L-information, since it is defined in terms of the first
two derivatives of the logarithm of the local pseudo-likelihood function. Figure 2 shows the obtained
information maps as images. Note that while φβ has a strong response for boundaries (the edges are
light), ψβ has a weak one (so the edges are dark), evidence in favor of considering L-information in
boundary detection procedures. Note also that in the initial condition, when the temperature is infinite,
the informative patterns are almost uniformly spread all over the system, while the final configuration
shows a more sparse representation in terms of information. Figure 3 shows the distribution of local
L-information for both system configurations depicted in Figure 1.

Figure 2. Fisher information maps. The first row shows the information maps of the system when the
temperature is infinite (β = 0). The second row shows the same maps when the temperature is low
(β = 0.8). The first and second columns show information maps that were generated by computing
φβ(xi) and ψβ(xi) for each observation in the lattice. The third column was produced by computing the
local L-information, that is, the ratio between both local information measures. In terms of information,
low temperature configurations are more sparse, since most local patterns are uninformative, due to
the strong alignment of the particles throughout the system, which is the expected global behavior for
β above a certain critical value.


Figure 3. Distribution of local L-information. When the temperature is infinite, the information
is spread along the system. For low temperature configurations, the number of local patterns with
zero information content significantly increases, that is, the system is more sparse in terms of Fisher
information.

7.2. Analyzing Dynamical Systems with Global Information-Theoretic Measures

In order to study the behavior of a complex system that evolves from an initial State A to
another State B, we use the Metropolis–Hastings algorithm, an MCMC simulation method, to generate
a sequence of valid isotropic pairwise GMRF model outcomes for different values of the inverse
temperature parameter, β. This process is an attempt to perform a random walk on the state space of
the system, that is, in the space of all possible global configurations in order to analyze the behavior of
the proposed global measures: entropy and both forms of Fisher information. The main purpose of
this experiment is to observe what happens to Φβ, Ψβ and Hβ when the system evolves from a random
initial state to other global configurations. In other words, we want to investigate the Fisher curve of
the system in order to characterize its behavior in terms of information. Basically, the idea is to use the
Fisher curve as a kind of signature for the expected behavior of any system modeled by an isotropic
pairwise GMRF, making it possible to gain insights into the understanding of large complex systems.


To simulate a system in which we can control the inverse temperature parameter, we define an
updating rule for β based on fixed increments. In summary, we start with a minimum value βMIN
(when βMIN = 0, the temperature of the system is infinite). Then, the value of β in iteration t is
defined as the value of β in t − 1 plus a small increment (Δβ), until it reaches a pre-defined upper bound,
βMAX. The process is then repeated with negative increments −Δβ, until the inverse temperature
reaches its minimum value, βMIN, again. This process continues for a fixed number of iterations, NMAX,
during an MCMC simulation. As a result of this approach, a sequence of GMRF samples is produced.
We use this sequence to calculate Φβ, Ψβ and Hβ and define the Fisher curve $\vec{F}$, for β = βMIN, . . . , βMAX.
Figure 4 shows some of the system's configurations along an MCMC simulation. In this experiment, the
parameters were defined as: βMIN = 0, Δβ = 0.001, βMAX = 0.15, NMAX = 1,000, μ = 0, σ² = 5
and ηi = {(i − 1, j − 1), (i − 1, j), (i − 1, j + 1), (i, j − 1), (i, j + 1), (i + 1, j − 1), (i + 1, j), (i + 1, j + 1)}.
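The updating rule for β described above can be sketched as a triangular sweep. The function name `beta_schedule` and the choice of tracking an integer step counter (to avoid accumulating floating-point error when repeatedly adding Δβ) are assumptions of this example.

```python
def beta_schedule(beta_min, beta_max, delta_beta, n_max):
    """Triangular sweep beta_min -> beta_max -> beta_min -> ..., n_max values."""
    n_steps = round((beta_max - beta_min) / delta_beta)   # increments per half-cycle
    if n_steps < 1:
        raise ValueError("beta_max must exceed beta_min by at least delta_beta")
    k, direction = 0, 1
    schedule = []
    for _ in range(n_max):
        schedule.append(beta_min + k * delta_beta)
        if k + direction > n_steps or k + direction < 0:
            direction = -direction                         # turn around at a bound
        k += direction
    return schedule


# Parameter values quoted in the text for this experiment.
betas = beta_schedule(0.0, 0.15, 0.001, 1000)
```

With these values, one full cycle (up and back down) takes 300 iterations, so 1,000 iterations cover three complete cycles plus part of a fourth.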

Figure 4. Global configurations along a Markov chain Monte Carlo (MCMC) simulation. Evolution of
the system as the inverse temperature parameter, β, is modified to control the expected global behavior.

A plot of both forms of the expected Fisher information, Φβ and Ψβ, for each iteration of
the MCMC simulation is shown in Figure 5. The graph produced by this experiment shows
some interesting results. First of all, regarding upper and lower bounds on these measures, it is
possible to note that when there is no induced spatial dependence structure (β ≈ 0), we have an
information equilibrium condition (Φβ = Ψβ and the information equality holds). In this condition,
the observations are practically independent in the sense that all local configuration patterns convey
approximately the same amount of information. Thus, it is hard to find and separate the two categories
of patterns we know: the informative and the non-informative ones. Since they all behave in a similar
manner, there is no informative pattern to highlight. Moreover, in this information equilibrium
situation, Ψβ reaches its lower bound (in this simulation, we observed that in the equilibrium
Φβ ≈ Ψβ ≈ 8), indicating that this condition emerges when the system is most susceptible to a
change in the expected global behavior, since the uncertainty about β is maximum at this moment. In
other words, modification in the behavior of a small subset of local patterns may guide the system to a
totally different stable configuration in the future.
The results also show that the difference between Φβ and Ψβ is maximum when the system
operates with large values of β, that is, when organization emerges and there is a strong dependence
structure among the random variables (the global configuration shows clear visible clusters and
boundaries between them). In such states, it is expected that the majority of patterns be aligned to the
global behavior, which causes the appearance of few, but highly informative patterns: those connecting
elements from different regions (boundaries). Besides that, the results suggest that it takes more time
for the system to go from the information equilibrium state to organization than the opposite. We
will see how this fact becomes clear by analyzing the Fisher curve along Markov chain Monte Carlo
(MCMC) simulations. Finally, the results also suggest that both Φβ and Ψβ are bounded above by a
value possibly related to the size of the neighborhood system.

Figure 5. Evolution of Fisher information along an MCMC simulation. As the difference between Φβ
and Ψβ is maximized (*), the uncertainty about the real inverse temperature parameter is minimized
and the number of informative patterns increases. In the information equilibrium condition (**), it is
hard to find informative patterns, since there is no induced spatial dependence structure.

Figure 6 shows the real parameter values used to generate the GMRF outputs (blue line), the
maximum pseudo-likelihood estimates used to calculate Φβ and Ψβ (red line) and also a plot of the
asymptotic variances (uncertainty about the inverse temperature) along the entire MCMC simulation.


Figure 6. Real and estimated inverse temperatures along the MCMC simulation. The system's global
behavior is controlled by the real inverse temperature parameter values (blue line), used to generate
the GMRF outputs. The maximum pseudo-likelihood estimate is used to compute both Φβ and Ψβ.
Note that the uncertainty about the inverse temperature increases as β → 0 and the system approaches
the information equilibrium condition.

We now proceed to the analysis of the Shannon entropy of the system along the simulation.
Despite showing a behavior similar to Ψβ, the range of values for entropy is significantly smaller. In
this simulation, we observed that 0 ≤ Hβ ≤ 4.5, 0 ≤ Φβ ≤ 18 and 8 ≤ Ψβ ≤ 61. An interesting point is
that knowledge of Φβ and Ψβ allows us to infer the entropy of the system. For example, looking at
Figures 5 and 7, we can see that Φβ and Ψβ start to diverge a little earlier (t ≈ 80) than the entropy
in a GMRF model begins to grow (t ≈ 120). Therefore, in an isotropic pairwise GMRF model, if the
system is close to the information equilibrium condition, then Hβ is low, since there is little variability
in the observed configuration patterns. When the difference between Φβ and Ψβ is large, Hβ increases.


Figure 7. Evolution of Shannon entropy along an MCMC simulation. Hβ starts to grow when the
system leaves the equilibrium condition, where the entropy in the isotropic pairwise GMRF model is
identical to the entropy of a simple Gaussian random variable (since β → 0).

Another interesting global information-theoretic measure is L-information, from now on denoted
by Lβ, since it conveys all the information about the likelihood function (in a GMRF model, only the
first two derivatives of L(⃗θ; X(t)) are not null). Lβ is defined as the ratio between the two forms of
expected Fisher information, Φβ and Ψβ. A nice property about this measure is that 0 ≤ Lβ ≤ 1. With
this single measurement, it is possible to gain insights about the global system behavior. Figure 8 shows
that a value close to one indicates a system approximating the information equilibrium condition, while
a value close to zero indicates a system close to the maximum entropy condition (a stable configuration
with boundaries and informative patterns).


Figure 8. Evolution of L-information along an MCMC simulation. When Lβ approaches one, the
system tends to the information equilibrium condition. For values close to zero, the system tends to the
maximum entropy condition.

To investigate the intrinsic non-linear connection between Φβ, Ψβ and Hβ in a complex system
modeled by an isotropic pairwise GMRF model, we now analyze its Fisher curves. The first curve,
which is a planar one, is defined as $\vec{F}(\beta) = (\Phi_\beta, \Psi_\beta)$, for A = βmin to B = βmax, and shows how Fisher
information changes when the inverse temperature of the system is modified to control the global
behavior. Figure 9 shows the results. In the first image, the blue piece of the curve is the path from
A to B, that is, $\vec{F}(\beta)_A^B$, and the red piece is the inverse path (from B to A), that is, $\vec{F}(\beta)_B^A$. We must
emphasize that $\vec{F}(\beta)_A^B$ is the trajectory from a lower entropy global configuration to a higher entropy
global configuration. On the other hand, when the system moves from B to A, we are moving towards
entropy minimization. To make this clear, the second image of Figure 9 illustrates the same Fisher
curve as before, but now in three dimensions, that is, $\vec{F}(\beta) = (\Phi_\beta, \Psi_\beta, H_\beta)$. For comparison purposes,
Figure 10 shows the Fisher curves for another MCMC simulation with different parameter settings.
Note that the shapes of the curves are quite similar to those in Figure 9.

[Figure 9 plot panels: "2D Fisher curve for a GMRF model" (axes PHI and PSI; points A and B; equilibrium line) and "3D Fisher curve for a GMRF model" (axes PHI, PSI and H; points A and B).]

Figure 9. 2D and 3D Fisher curves of a complex system along an MCMC simulation. The graph shows
a parametric curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from
a differential geometry perspective, as the divergence between Φβ and Ψβ increases, the torsion of the
parametric curve becomes evident (the curve leaves the plane of constant entropy).

[Figure 10 plot panels: "2D Fisher curve for an isotropic pairwise GMRF model" (axes PHI and PSI; points A and B; equilibrium line) and "3D Fisher curve for an isotropic pairwise GMRF model" (axes PHI, PSI and H; points A and B).]

Figure 10. 2D and 3D Fisher curves along another MCMC simulation. The graph shows a parametric
curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from a geometrical
perspective, the properties of these curves are essentially the same as the ones from the previous
simulation.

We can see that the majority of points along the Fisher curve are concentrated around two regions of high curvature: (A) around the information equilibrium condition (an absence of short-term and long-term correlations, since β = 0); and (B) around the maximum entropy value, where the divergence between the information values is maximum (self-organization emerges, since β is greater than a critical value, βc). The points that lie in the middle of the path connecting these two regions represent the system undergoing a phase transition. Its properties change rapidly and in an asymmetric way, since $\vec{F}(\beta)_A^B \neq \vec{F}(\beta)_B^A$ for a given natural orientation.
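The asymmetry between the forward and return paths can be checked numerically. The following is a minimal sketch, not the procedure used in the experiments: it assumes that the values of Φβ and Ψβ recorded along a full β sweep (from A to B and back) are available as arrays, and it uses the shoelace formula to measure the area enclosed by the closed 2D Fisher curve. The function name `enclosed_area` and the synthetic curves are illustrative assumptions; an area of zero would mean that the return path retraces the forward path.

```python
import numpy as np

def enclosed_area(phi, psi):
    """Signed area of the closed polygon through the points (phi[i], psi[i]),
    computed with the shoelace formula (hypothetical helper)."""
    x = np.asarray(phi, dtype=float)
    y = np.asarray(psi, dtype=float)
    return 0.5 * float(np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y))

# Synthetic stand-in data, NOT output of the MCMC simulation.
t_up = np.linspace(0.0, 1.0, 50)      # sweep from A to B
t_down = t_up[::-1]                   # sweep from B back to A

# Case 1: the return path differs from the forward path.
phi = np.concatenate([10 * t_up, 10 * t_down])
psi = np.concatenate([20 * t_up**2, 20 * t_down])
area = enclosed_area(phi, psi)

# Case 2: the return path retraces the forward path exactly.
psi_retraced = np.concatenate([20 * t_up**2, 20 * t_down**2])
area_retraced = enclosed_area(phi, psi_retraced)

print(area, area_retraced)
```

A clearly non-zero area for recorded data would be one simple indication that the two paths are distinct; the sign of the area depends on the orientation in which the curve is traversed.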
Some observations can now be highlighted. First, the natural orientation of the Fisher curve defines the direction of time. The natural A–B path (increase in entropy) is given by the blue curve and the natural B–A path (decrease in entropy) is given by the red curve. In other words, the only possible way to walk from A to B (increase Hβ) by the red path or to walk from B to A (decrease Hβ) by the blue path would be moving back in time (by running the recorded simulation backwards). We believe that a possible explanation for this fact could be that the deformation process that takes the original geometric structure (with constant curvature) of the usual Gaussian model (A) to the novel geometric structure of the isotropic pairwise GMRF model (B) is not reversible. In other words, the way the model is "curved" is not simply the reversal of the "flattening" process (when it is restored to its constant curvature form). Thus, even the basic notion of time seems to be deeply connected with the relationship between entropy and Fisher information in a complex system: in the natural orientation (forward in time), it seems that the divergence between Φβ and Ψβ is the cause of an increase in the entropy, and the decrease of entropy is the cause of the convergence of Φβ and Ψβ. During the experimental analysis, we repeated the MCMC simulations with different parameter settings, and the observed behavior for Fisher information and entropy was the same. Figure 11 shows the graphs of Φβ, Ψβ and Hβ for another recorded MCMC simulation. The results indicate that in the natural orientation (in the direction of time), an increase in Ψβ seems to be a trigger for an increase in the entropy and a decrease in the entropy seems to be a trigger for a decrease in Ψβ. Roughly speaking, Ψβ “pushes Hβ up” and Hβ “pushes Ψβ down”.

Figure 11. Relations between entropy and Fisher information. When a system modeled by an isotropic pairwise GMRF evolves in the natural orientation (forward in time), two rules that relate Fisher information and entropy can be observed: (1) an increase in Ψβ is the cause of an increase in Hβ (the increase in Hβ is a consequence of the increase in Ψβ); (2) a decrease in Hβ is the cause of a decrease in Ψβ (the decrease in Ψβ is a consequence of the decrease in Hβ). In other words, when moving towards higher entropy states, changes in Fisher information precede changes in entropy (Ψβ “pushes Hβ up”). When moving towards lower entropy states, changes in entropy precede changes in Fisher information (Hβ “pushes Ψβ down”).

In summary, the central idea discussed here is that while entropy provides a measure of order/disorder of the system at a given configuration, X(t), Fisher information links these thermodynamical states through a path (the Fisher curve). Thus, Fisher information is a powerful mathematical tool in the study of complex and dynamical systems, since it establishes how these different thermodynamical states are related along the evolution of the inverse temperature. Instead of knowing only whether the entropy, Hβ, is increasing or decreasing, with Fisher information it is possible to know how and why this change is happening.

7.2.1. The Effect of Induced Perturbations in the System

To test whether a system can recover part of its original configuration after a perturbation is induced, we conducted another computational experiment. During a stable simulation, two kinds of perturbations were induced in the system: (1) the value of the inverse temperature parameter was set to zero for the next two consecutive iterations; (2) the value of the inverse temperature parameter was set to the equilibrium value, β∗ (the solution of Equation 28), for the next two consecutive iterations. In both cases, the original value of β is restored after these two iterations are completed.
When the system is disturbed by setting β to zero, the simulations indicate that the system does not succeed in recovering components from its previous stable configuration (note that Φβ and Ψβ clearly touch one another in the graph). When the same perturbation is induced, but using the smallest of the two β∗ values (minimum solution of Equation 28), after a short period of turbulence, the system can recover parts (components, clusters) of its previous stable state. This behavior suggests that this softer perturbation is not enough to remove all the information encoded within the spatial dependence structure of the system, preserving some of the long-term correlations in the data (stronger bonds) and slightly remodeling the large clusters present in the system. Figures 12 and 13 illustrate the results.
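The perturbation protocol described above can be written as a schedule of β values, one per iteration. This is only an illustrative sketch: the function name `perturbed_schedule`, the stable value 0.8, the iteration count and the stand-in value 0.3 for β∗ are assumptions made for the example (β∗ is the solution of Equation 28, which is not recomputed here).

```python
def perturbed_schedule(beta, n_iter, start, value, length=2):
    """Return one beta value per iteration: `beta` everywhere, except for
    `length` consecutive iterations starting at `start`, where `value` is used.
    After the perturbation, the original beta is restored."""
    schedule = [beta] * n_iter
    for t in range(start, min(start + length, n_iter)):
        schedule[t] = value
    return schedule

# Perturbation (1): beta is set to zero for two consecutive iterations.
sched_zero = perturbed_schedule(beta=0.8, n_iter=10, start=4, value=0.0)

# Perturbation (2): beta is set to an equilibrium value beta_star.
beta_star = 0.3  # hypothetical placeholder, not the solution of Equation 28
sched_star = perturbed_schedule(beta=0.8, n_iter=10, start=4, value=beta_star)

print(sched_zero)
print(sched_star)
```

Each entry of the schedule would then be passed to one iteration of the sampler, so the two experiments differ only in the value used during the two perturbed iterations.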

7.3. Considerations and Final Remarks

The goal of this section is to summarize the main results obtained in this paper, focusing on the interpretation of the Fisher curve of a system modeled by a GMRF. First, our system is initialized with a random configuration, simulating that, at the moment of its creation, the temperature is infinite (β = 0). We observe two important things at this moment: (1) there is a perfect symmetry in information, since the equilibrium condition prevails, that is, Φβ = Ψβ; (2) the entropy of the system is minimal. By mere convention, we name this initial state of minimal entropy A.
By reducing the global temperature (β increases), this “universe” deviates from this initial condition. As the system drifts away from the initial condition, we clearly see a break in the symmetry of information (Φβ diverges from Ψβ), which apparently is the cause of an increase in the system’s entropy, since this symmetry break seems to precede an increase in the entropy, H. This is a fundamental symmetry break, since other forms of rupture that will happen in the future and will give rise to several properties of the system, including the basic notion of time as an irreversible process, follow from this first one. During this first stage of evolution, the system evolves to the condition of maximum entropy, named B.
Hence, after the break in the information equilibrium condition, there is a significant increase in the entropy as the system continues to evolve. This stage lasts while the temperature of the system is further reduced or held constant. When the temperature starts to increase (β decreases), another form of symmetry break takes place. By moving towards the initial condition (A) from B, changes in the entropy seem to precede changes in Fisher information (when moving from A to B, we observe exactly the opposite). Moreover, the variations in entropy and Fisher information towards A are not symmetric with the variations observed when we move towards B, a direct consequence of that first fundamental break of the information equilibrium condition. By continuing this process of increasing the temperature of the system towards infinity (β approaches zero), we take our system to a configuration that is equivalent to the initial condition, that is, where information equilibrium prevails.
This fundamental symmetry break becomes evident when we look at the Fisher curve of the system. We clearly see that the path from the state of minimum entropy, A, to the state of maximum entropy, B, defined by the curve $\vec{F}_A^B$ (the blue trajectory in Figure 9), is not the same as the path from B to A, defined by the curve $\vec{F}_B^A$ (the red trajectory in Figure 9). An implication of this behavior is that if the system is moving along the arrow of time, then we are moving through the Fisher curve in the clockwise orientation. Thus, the only way to go from A to B along $\vec{F}_B^A$ (the red path) is going back in time.
Therefore, if that first fundamental symmetry break did not exist, or even if it had happened, but all the subsequent evolution of Φβ, Ψβ and Hβ were absolutely symmetric (i.e., the variations in these measures were exactly the same when moving from A to B and when moving from B to A), what we would actually see is that $\vec{F}_A^B = \vec{F}_B^A$. As a consequence, decreasing/increasing the system’s temperature would be like moving towards the future/past. In fact, the basic notion of time in that system would be compromised, since time would be a perfectly reversible process (just like a spatial dimension, in which we can move in both directions). In other words, we would not be able to distinguish whether the system is moving forward or backward in time.

Figure 12. Disturbing the system to induce changes. Variation of Φβ and Ψβ after the system is disturbed by an abrupt change in the value of β. In the first image, the inverse temperature is set to zero. Note that Φβ and Ψβ touch one another, indicating that no residual information is kept, as if the simulation had been restarted from a random configuration. In the second image, the inverse temperature is set to the equilibrium value, β∗. The results suggest that this kind of perturbation is not enough to remove all the information within the spatial dependence structure, allowing the system to recover a significant part of its original configuration after a short stabilization period.

Figure 13. The sequence of outputs along the MCMC simulation before and after the system is disturbed. The first row (when β is set to zero) shows that the system evolved to a different stable configuration after the perturbation. The second row (when β is set to β∗) indicates that the system was able to recover a significant part of its previous stable configuration.

8. Conclusions

Defining what information is in a complex system is fundamental to the study of many problems. In this paper, we discussed the roles of two important statistical measures in isotropic pairwise Markov random fields composed of Gaussian variables: Shannon entropy and Fisher information. By using the pseudo-likelihood function of the GMRF model, we derived analytical expressions for these measures. The definition of a Fisher curve as a geometric representation for the study and analysis of complex systems allowed us to reveal the intrinsic non-linear relation between these information-theoretic measures and gain insights about the behavior of such systems. Computational experiments demonstrate the effectiveness of the proposed tools in decoding information from the underlying spatial dependence structure of a Gaussian-Markov random field. Typical informative patterns in a complex system are located at the boundaries of the clusters. One of the main conclusions of this investigation concerns the notion of time in a complex system. The obtained results suggest that the relationship between Fisher information and entropy determines whether the system is moving forward or backward in time. Apparently, in the natural orientation (when the system is evolving forward in time), when β is growing, that is, the temperature of the system is decreasing, an increase in Fisher information leads to an increase in the system’s entropy, and when β is decreasing, that is, the temperature of the system is growing, a decrease in the system’s entropy leads to a decrease in the Fisher information. In future work, we expect to completely characterize the metric tensor that represents the geometric structure of the isotropic pairwise GMRF model by specifying all the elements of the Fisher information matrix. Future investigations should also include the definition and analysis of the proposed tools in other Markov random field models, such as the Ising and Potts pairwise interaction models. In addition, a topic of interest concerns the investigation of minimum and maximum information paths in graphs to explore intrinsic similarity measures between objects belonging to a common surface or manifold in ℜn. We believe this study could benefit some pattern recognition and data analysis applications.

Acknowledgments: The author would like to thank CNPQ (Brazilian Council for Research and Development) for the financial support through research grant number 475054/2011-3.

Conflicts of Interest: The author declares no conflict of interest.

References

1.
Shannon, C.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana,
Chicago, IL & London, USA, 1949.
2.
Rényi, A. On measures of information and entropy. In Proceedings of the 4th Berkeley Symposium on
Mathematics, Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California
Press: Berkeley, CA, USA, 1961. pp. 547–561
3.
Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
4.
Bashkirov, A. Rényi entropy as a statistical entropy for complex systems.
Theor. Math. Phys. 2006,
149, 1559–1573.
5.
Jaynes, E. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630.
6.
Grad, H. The many faces of entropy. Comm. Pure. Appl. Math. 1961, 14, 323–254.
7.
Adler, R.; Konheim, A.; McAndrew, A. Topological entropy. Trans. Am. Math. Soc. 1965, 114, 309–319.
8.
Goodwyn, L. Comparing topological entropy with measure-theoretic entropy. Am. J. Math. 1972, 94, 366–388.
9.
Samuelson, P.A. Maximum principles in analytical economics. Am. Econ. Rev. 1972, 62, 249–262.
10.
Costa, M. Writing on dirty paper. IEEE T. Inform. Theory 1983, 29, 439–441.
11.
Dembo, A.; Cover, T.; Thomas, J.
Information theoretic inequalities.
IEEE T. Inform.
Theory 1991,
37, 1501–1518.
12.
Cover, T.; Zhang, Z. On the maximum entropy of the sum of two dependent random variables. IEEE T.
Inform. Theory 1994, 40, 1244–1246.
13.
Frieden, B.R. Science from Fisher Information: A Unification; Cambridge University Press: Cambridge, London,
2004.
14.
Frieden, B.R.; Gatenby, R.A. Exploratory Data Analysis Using Fisher Information; Springer: London, UK, 2006.
15.
Lehmann, E.L. Theory of Point Estimation; Wiley: New York, NY, USA, 1983.
16.
Bickel, P.J. Mathematical Statistics; Holden Day: New York, NY, USA, 1991.
17.
Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Duxbury: New York, NY, USA, 2002.
18.
Amari, S. Nagaoka, H. Methods of information geometry (Translations of mathematical monographs vol. 191); AMS
Bookstore: Tokyo, Japan, 2000.
19.
Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–234.
20.
Anandkumar, A.; Tong, L.; Swami, A. Detection of Gauss-Markov random fields with nearest-neighbor
dependency. IEEE T. Inform. Theory 2009, 55, 816–827.
21.
Gómez-Villegas, M.A.; Main, P.; Susi, R. The effect of block parameter perturbations in Gaussian Bayesian
networks: Sensitivity and robustness. Inform. Sci. 2013, 222, 439–458.
22.
Moura, J.; Balram, N. Recursive structure of noncausal Gauss-Markov random fields. IEEE T. Inform. Theory
1992, 38, 334–354.
23.
Moura, J.; Goswami, S. Gauss-Markov random fields (GMrf) with continuous indices. IEEE Trans. Inform.
Theory 1997, 43, 1560–1573.

281


Entropy 2014, 16, 1002–1036

24.
Besag, J. Spatial interaction and the statistical analysis of lattice systems. J. Roy. Stat. Soc. B Stat. Meth. 1974,
36, 192–236.
25.
Besag, J. Statistical analysis of non-lattice data. The Statistician 1975, 24, 179–195.
26.
Hammersley, J.; Clifford, P. (University of California, Berkeley, Oxford and Bristol). Markov Field on Finite
Graphs and Lattices. Unpublished work, 1971.
27.
Efron, B.F.; Hinkley, D.V. Assessing the accuracy of the ml estimator: Observed versus expected fisher
information. Biometrika 1978, 65, 457–487.
28.
Isserlis, L. On a formula for the product-moment coefficient of any order of a normal frequency distribution
in any number of variables. Biometrika 1918, 12, 134–139.
29.
Jensen, J.; Künsh, H. On asymptotic normality of pseudo likelihood estimates for pairwise interaction
processes. Ann. Inst. Stat. Math. 1994, 46, 475–486.
30.
Winkler, G. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods: A Mathematical Introduction;
Springer-Verlag New York, Inc.: Secaucus, NJ, USA, 2006.
31.
Liang, G.; Yu, B. Maximum pseudo likelihood estimation in network tomography. IEEE T. Signal Proces.
2003, 51, 2043–2053.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article
Network Decomposition and Complexity Measures:
An Information Geometrical Approach

Masatoshi Funabashi

Sony Computer Science Laboratories, Inc., Takanawa Muse Bldg. 3F, 3-14-13, Higashi Gotanda, Shinagawa-ku,
Tokyo 141-0022, Japan; E-Mail: masa_funabashi@csl.sony.co.jp; Tel.: +81-3-5448-4380; Fax: +81-3-5448-4273

Received: 28 March 2014; in revised form: 24 June 2014 / Accepted: 14 July 2014 /
Published: 23 July 2014

Abstract: We consider the graph representation of a stochastic model with n binary variables, and develop an information-theoretical framework to measure the degree of statistical association existing between subsystems, as well as that represented by each edge of the graph. In addition, we consider novel measures of complexity with respect to the system decompositionability, by introducing the geometric product of Kullback–Leibler (KL-) divergences. The novel complexity measures satisfy the boundary condition of vanishing at the limits of completely random and completely ordered states, and also in the presence of an independent subsystem of any size. Such complexity measures based on geometric means are relevant to the heterogeneity of dependencies between subsystems, and to the amount of information propagation shared across the entire system.

Keywords: information geometry; complexity measure; complex network; system decompositionability; geometric mean

1. Introduction

Complex systems sciences emphasize the importance of non-linear interactions that cannot be easily approximated linearly. In other words, the degrees of non-linear interaction are the source of complexity. The classical reductionist approach generally decomposes a system into its components with linear interactions, and tries to evaluate whether the property of the system as a whole can still be reproduced. If this decomposition of a system destroys too much information to reproduce the system’s whole property, the plausibility of such reductionism is lost. Conversely, if we can evaluate how much information is ignored by the decomposition, we can estimate how much of the complexity of the whole system is lost. This gives us a way to measure the complexity of a system with respect to the system decomposition.
In stochastic systems described as a set of joint distributions, the interaction can basically be expressed as the statistical association between the variables. The simplest reductionist approach is to separate the whole system into some subsets of variables, and assume independence between them. If such a decomposition does not affect the system’s property, the isolated subsystem is independent of the rest. On the other hand, if the decomposition loses too much information, then the subsystem is inside a larger subsystem with strong internal dependencies and cannot be easily separated.
Stochastic models have often been represented as graphs and treated under the name of complex networks [1–3]. Generally, the nodes represent the variables and the weights on the edges are the statistical association between them. However, if we consider the information contained in the different orders of dependencies among variables, a graph with a single kind of edge is not sufficient to express the whole information of the system [4]. An edge of a graph with n nodes contains the information of statistical association up to the n-th order dependencies among n variables. If we try to decompose the system into independent parts by cutting this information, we have to consider what it means to cut an edge of the graph from the information-theoretical point of view.

Entropy 2014, 16, 4132–4167; doi:10.3390/e16074132
www.mdpi.com/journal/entropy

Indeed, analysis of the degree of dependencies existing between variables has produced many definitions of complexity in stochastic models [5], which have been mostly studied from an information-theoretical perspective. Beginning with the seminal works of Lempel and Ziv (e.g., [6]), the computation-oriented definition of complexity takes a deterministic formalization and measures the information necessary to reproduce a given symbolic sequence exactly, which is classified under the name of algorithmic complexity [7–9].
On the other hand, the statistical approach to complexity, namely statistical complexity, assumes some stochastic model as its theoretical basis, and refers to the structure of the information source on it in a measure-theoretic way [10–12].
One of the most classical statistical complexities is the mutual information between two stochastic variables; its generalization to measure dependence among n variables has been proposed (e.g., [13]) and explored in relation to statistical models and theories by several authors [14–16].
We should also recall that complexity is not necessarily conditioned only by information theory, but is also motivated by the organization of living systems, such as brain activity. The TSE complexity extends generalized mutual information further into a biological context, where complexity exists as the heterogeneity between different system hierarchies [17]. These statistical complexities are all based on the boundary condition of vanishing at the limits of completely random and completely ordered states [18].
The complexity measure is usually a projection from the system’s variables to a one-dimensional quantity, which is composed to express the degree of the characteristic that we define to be important in what “complexity” means. Since the complexity measure is always a many-to-one association, it both compresses information, to classify the system from simple to complex, and loses resolution of the system’s phase space. If the system has n variables, we generally need n independent complexity measures to completely characterize the system with real-value resolution. The problem of defining a complexity measure lies in balancing the compression of information on the system’s complexity, with theoretical support, against the resolution of the system identification, which must be kept high enough to avoid trivial classification. The latter criterion becomes more important as the system size grows. A better complexity measure is therefore a set of indices, as few as possible, that characterizes the major features related to the complexity of the system. In this sense, the ensemble of complexity measures is also analogous to the feature space of a support vector machine. A non-trivial set of complexity measures needs to be complementary to each other in parameter space for the best possible discrimination of different systems.
In this paper, we first consider the stochastic system with binary variables and theoretically develop a way to measure the information between subsystems, which is consistent with the information represented by the edges of the graph representation.
Next, we focus in particular on the generalized mutual information as a starting point of the argument, and further consider incorporating network heterogeneity into novel measures of complexity with respect to the system’s decompositionability. This approach will be shown to be complementary to TSE complexity, as the difference between arithmetic and geometric means of information.

2. System Decomposition

Let us consider the stochastic system with $n$ binary variables $x = (x_1, \cdots, x_n)$, where $x_i \in \{0, 1\}$ $(1 \le i \le n)$. We denote the joint distribution of $x$ by $p(x)$. We define the decomposition $p^{\mathrm{dec}}(x)$ of $p(x)$ into two subsystems $y^1 = (x^1_1, \cdots, x^1_{n_1})$ and $y^2 = (x^2_1, \cdots, x^2_{n_2})$ ($n_1 + n_2 = n$, $y^1 \cup y^2 = x$, $y^1 \cap y^2 = \emptyset$) as follows:

$$p^{\mathrm{dec}}(x) = p(y^1)\,p(y^2), \tag{1}$$

where $p(y^1)$ and $p(y^2)$ are the joint distributions of $y^1$ and $y^2$, respectively. For simplicity, hereafter we denote the system decomposition using the smallest subscript of the variables in each subsystem. For example, in the case $n = 4$, $y^1 = (x_1, x_3)$ and $y^2 = (x_2, x_4)$, we describe the decomposed system $p^{\mathrm{dec}}(x)$ as $\langle 1212 \rangle$. The system decomposition means cutting all statistical association between the two subsystems, which is expressed by setting them to be independent of each other.
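As an illustration of Equation (1), the following minimal sketch builds the decomposed distribution for the example ⟨1212⟩ from a joint distribution stored as an array with one axis per binary variable. The function names `decompose` and `kl_divergence`, the 0-based axis indices and the randomly generated distribution are assumptions made for this example only; the KL-divergence from $p(x)$ to the decomposed system is computed as one way of quantifying what the decomposition discards.

```python
import numpy as np

def decompose(p, subsystem1):
    """Return p(y1) * p(y2) with the same shape as p.
    `p` has shape (2,) * n; `subsystem1` lists the 0-based axes of y1."""
    n = p.ndim
    axes1 = tuple(sorted(subsystem1))
    axes2 = tuple(i for i in range(n) if i not in axes1)
    p_y1 = p.sum(axis=axes2, keepdims=True)  # marginal of y1, broadcastable
    p_y2 = p.sum(axis=axes1, keepdims=True)  # marginal of y2, broadcastable
    return p_y1 * p_y2

def kl_divergence(p, q):
    """KL-divergence D[p : q] in nats, skipping zero-probability states of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Arbitrary joint distribution of n = 4 binary variables (illustrative data).
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()

# Decomposition <1212>: y1 = (x1, x3), y2 = (x2, x4), i.e., axes 0 and 2.
p_dec = decompose(p, subsystem1=[0, 2])
lost = kl_divergence(p, p_dec)
print(lost)
```

By construction the marginals of $y^1$ and $y^2$ are unchanged by the decomposition, and the divergence is zero exactly when the two subsystems are already independent.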
We further consider Equation (1) in terms of the graph representation. We define the undirected graph $\Gamma := (V, E)$ of the system $p(x)$, whose vertices $V = \{x_1, \cdots, x_n\}$ and edges $E = V \times V$ represent the variables and the statistical association, respectively. To express the system, we set the value of each vertex as the value of the corresponding variable, and the weight of each edge as the degree of dependency between the connected variables.
There is, however, a problem with a representation that uses a single kind of edge. The statistical association among variables is not only between two variables, but can be independently defined among multiple variables up to the $n$-th order. Therefore, the exact definition of the weight of the edges remains unclear. To clarify this problem, we consider the hierarchical marginal distributions $\boldsymbol{\eta}$ as another coordinate system of $p(x)$, as follows:

$$\boldsymbol{\eta} = (\boldsymbol{\eta}^1; \boldsymbol{\eta}^2; \cdots; \boldsymbol{\eta}^n), \tag{2}$$

where

$$\begin{aligned}
\boldsymbol{\eta}^1 &= (\eta_1, \cdots, \eta_i, \cdots, \eta_n), \quad (1 < i < n), \\
\boldsymbol{\eta}^2 &= (\eta_{1,2}, \cdots, \eta_{i,j}, \cdots, \eta_{n-1,n}), \quad (1 < i < j < n), \\
&\;\;\vdots \\
\boldsymbol{\eta}^n &= \eta_{1,2,\cdots,n},
\end{aligned} \tag{3}$$

and

$$\begin{aligned}
\eta_1 &= \sum_{i_2, \cdots, i_n \in \{0,1\}} p(1, i_2, \cdots, i_n), \\
&\;\;\vdots \\
\eta_n &= \sum_{i_1, \cdots, i_{n-1} \in \{0,1\}} p(i_1, \cdots, i_{n-1}, 1), \\
\eta_{1,2} &= \sum_{i_3, \cdots, i_n \in \{0,1\}} p(1, 1, i_3, \cdots, i_n), \\
&\;\;\vdots \\
\eta_{n-1,n} &= \sum_{i_1, \cdots, i_{n-2} \in \{0,1\}} p(i_1, \cdots, i_{n-2}, 1, 1), \\
&\;\;\vdots \\
\eta_{1,2,\cdots,n} &= p(1, 1, \cdots, 1).
\end{aligned} \tag{4}$$

Since the definition of $\boldsymbol{\eta}$ is a linear transformation of $p(x)$, both coordinates have $\sum_{k=1}^{n} {}_n C_k$ degrees of freedom.
The subcoordinates $\boldsymbol{\eta}^1$ are simply the set of marginal distributions of each variable. The subcoordinates $\boldsymbol{\eta}^k$ $(1 < k \le n)$ include the statistical association among $k$ variables that cannot be expressed with the coordinates of order lower than $k$. This means that the different statistical associations exist independently in each order among the corresponding sets of variables. The statistical association represented by the weight of a graph edge $\{x_i, x_j\}$ is therefore the superposition of the different dependencies defined on every subset of $x$ including $x_i$ and $x_j$.
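The hierarchical marginal coordinates of Equations (2)–(4) can be computed directly from a joint distribution. The sketch below assumes that $p(x)$ is stored as an array with one axis per binary variable; the function name `eta_coordinates` and the random test distribution are illustrative choices. Keys are 1-based index tuples, so `eta[(1, 2)]` corresponds to $\eta_{1,2}$. The number of coordinates equals $\sum_{k=1}^{n} {}_n C_k = 2^n - 1$.

```python
from itertools import combinations
import numpy as np

def eta_coordinates(p):
    """Map every non-empty index subset S (1-based tuple) to
    eta_S = P(x_i = 1 for all i in S), summing over the other variables."""
    n = p.ndim
    eta = {}
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            # Fix the variables in the subset to 1, keep the others free.
            index = tuple(1 if i in subset else slice(None) for i in range(n))
            eta[tuple(i + 1 for i in subset)] = float(np.sum(p[index]))
    return eta

# Arbitrary joint distribution of n = 3 binary variables (illustrative data).
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()

eta = eta_coordinates(p)
print(len(eta))      # number of eta coordinates
print(eta[(1, 2)])   # eta_{1,2}
```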
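To measure the degree of statistical association in each order, information geometry established the following setting [19]. We first define another set of coordinates $\boldsymbol{\theta} = (\boldsymbol{\theta}^1; \boldsymbol{\theta}^2; \cdots; \boldsymbol{\theta}^n)$ that are the dual coordinates of $\boldsymbol{\eta}$ with respect to the Legendre transformation of the exponential family’s potential function $\psi(\boldsymbol{\theta})$ to its conjugate potential $\phi(\boldsymbol{\eta})$, as follows:

$$\begin{aligned}
\boldsymbol{\theta}^1 &= (\theta_1, \cdots, \theta_n), \\
\boldsymbol{\theta}^2 &= (\theta_{1,2}, \cdots, \theta_{n-1,n}), \\
&\;\;\vdots \\
\boldsymbol{\theta}^n &= \theta_{1,2,\cdots,n},
\end{aligned} \tag{5}$$

where

$$\begin{aligned}
\psi(\boldsymbol{\theta}) &= \log \frac{1}{p(0, \cdots, 0)}, \\
\phi(\boldsymbol{\eta}) &= \sum_{i} \theta_i \eta_i + \sum_{i<j} \theta_{i,j} \eta_{i,j} + \cdots + \theta_{1,2,\cdots,n}\, \eta_{1,2,\cdots,n} - \psi(\boldsymbol{\theta}), \\
\theta_i &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_i}, \quad (1 \le i \le n), \\
\theta_{i,j} &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_{i,j}}, \quad (1 \le i < j \le n), \\
&\;\;\vdots \\
\theta_{1,2,\cdots,n} &= \frac{\partial \phi(\boldsymbol{\eta})}{\partial \eta_{1,2,\cdots,n}}.
\end{aligned} \tag{6}$$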

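Note that $\boldsymbol{\eta}$ can be inversely derived from $\boldsymbol{\theta}$, following the Legendre transformation between $\phi(\boldsymbol{\eta})$ and $\psi(\boldsymbol{\theta})$:

$$\begin{aligned}
\eta_i &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_i}, \quad (1 \le i \le n), \\
\eta_{i,j} &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_{i,j}}, \quad (1 \le i < j \le n), \\
&\;\;\vdots \\
\eta_{1,2,\cdots,n} &= \frac{\partial \psi(\boldsymbol{\theta})}{\partial \theta_{1,2,\cdots,n}}.
\end{aligned} \tag{7}$$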

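Using the coordinates $\boldsymbol{\theta}$, the system is described in the form of the exponential family as follows:

$$\log p(x) = \sum_{i} \theta_i x_i + \sum_{i<j} \theta_{i,j} x_i x_j + \cdots + \theta_{1,2,\cdots,n}\, x_1 x_2 \cdots x_n - \psi(\boldsymbol{\theta}). \tag{8}$$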
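For a strictly positive distribution, the $\theta$ coordinates of the log-linear expansion in Equation (8) can be obtained by Möbius inversion of $\log p$ over subsets of variables: $\theta_S = \sum_{T \subseteq S} (-1)^{|S| - |T|} \log p(\mathbf{1}_T)$, where $\mathbf{1}_T$ is the state with ones exactly on $T$. The sketch below uses this standard identity rather than the Legendre derivatives of Equation (6); the function names and the random test distribution are assumptions made for illustration. It then checks that the expansion reproduces $\log p(x)$ for every state, with $\psi = -\log p(0, \cdots, 0)$.

```python
from itertools import combinations, product
import numpy as np

def subsets(items):
    """All subsets of `items`, including the empty one."""
    for k in range(len(items) + 1):
        yield from combinations(items, k)

def theta_coordinates(p):
    """theta_S for every non-empty subset S (1-based tuple), assuming p > 0."""
    n = p.ndim
    logp = np.log(p)
    theta = {}
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            total = 0.0
            for T in subsets(S):
                state = tuple(1 if i in T else 0 for i in range(n))
                total += (-1) ** (len(S) - len(T)) * logp[state]
            theta[tuple(i + 1 for i in S)] = total
    return theta

def log_p_from_theta(theta, psi, x):
    """Right-hand side of the log-linear expansion for a binary state x."""
    active = sum(t for S, t in theta.items() if all(x[i - 1] for i in S))
    return active - psi

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2)) + 0.1   # strictly positive, illustrative data
p /= p.sum()

theta = theta_coordinates(p)
psi = -np.log(p[0, 0, 0])
max_error = max(abs(log_p_from_theta(theta, psi, x) - np.log(p[x]))
                for x in product([0, 1], repeat=3))
print(max_error)
```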

Information geometry revealed that the exponential family of probability distributions forms a manifold with a dual-flat structure. More precisely, the coordinates $\boldsymbol{\theta}$ form a flat manifold with respect to the Fisher information matrix as the Riemannian metric, and the α-connection with α = 1. Dually to $\boldsymbol{\theta}$, the coordinates $\boldsymbol{\eta}$ are flat with respect to the same metric but the α-connection with α = −1. It is known that $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are orthogonal to each other with respect to the Fisher information matrix. This structure gives us a way to decompose the degree of statistical association among variables into separate elements of arbitrary orders. We define the so-called $k$-cut mixture coordinates $\boldsymbol{\zeta}^k$ as follows [14]:

$$\begin{aligned}
\boldsymbol{\zeta}^k &= (\boldsymbol{\eta}^{k-}; \boldsymbol{\theta}^{k+}), && (9) \\
\boldsymbol{\eta}^{k-} &= (\boldsymbol{\eta}^1, \cdots, \boldsymbol{\eta}^k), && (10) \\
\boldsymbol{\theta}^{k+} &= (\boldsymbol{\theta}^{k+1}, \cdots, \boldsymbol{\theta}^n). && (11)
\end{aligned}$$

We also define the $k$-cut mixture coordinates $\boldsymbol{\zeta}^k_0 = (\boldsymbol{\eta}^{k-}; 0, \cdots, 0)$ with no dependency above the $k$-th order. We denote the systems specified by $\boldsymbol{\zeta}^k$ and $\boldsymbol{\zeta}^k_0$ as $p(x, \boldsymbol{\zeta}^k)$ and $p(x, \boldsymbol{\zeta}^k_0)$, respectively.
Then the degree of statistical association above the $k$-th order in the system can be measured by the Kullback–Leibler (KL-) divergence $D[p(x, \boldsymbol{\zeta}) : p(x, \boldsymbol{\zeta}^k_0)]$.

$$2N \cdot D[p(x, \boldsymbol{\zeta}) : p(x, \boldsymbol{\zeta}^k_0)] \sim \chi^2\Bigl(\sum_{i=k+1}^{n} {}_n C_i\Bigr), \tag{12}$$

where $D[\cdot : \cdot]$ is the KL-divergence from the first system to the second one.
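A small numerical sketch of Equation (12) for the case $k = 1$ follows. For $k = 1$, fixing the first-order marginals and setting all higher-order $\theta$ coordinates to zero yields the product of the one-variable marginals, so $p(x, \boldsymbol{\zeta}^1_0)$ can be built without solving for mixed coordinates. The function names, the random distribution and the value of $N$ are assumptions for this example; in particular, $N$ is taken here to be a sample size, which this section does not define explicitly.

```python
from math import comb
import numpy as np

def chi2_dof(n, k):
    """Degrees of freedom in Equation (12): sum of nCi for i = k+1, ..., n."""
    return sum(comb(n, i) for i in range(k + 1, n + 1))

def independent_model(p):
    """Product of the one-variable marginals of p (the k = 1 cut)."""
    n = p.ndim
    q = np.ones_like(p)
    for i in range(n):
        others = tuple(j for j in range(n) if j != i)
        q = q * p.sum(axis=others, keepdims=True)
    return q

def kl_divergence(p, q):
    """KL-divergence D[p : q] in nats, skipping zero-probability states of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))      # illustrative joint distribution, n = 3
p /= p.sum()

q = independent_model(p)
N = 1000                       # hypothetical sample size
statistic = 2 * N * kl_divergence(p, q)
dof = chi2_dof(n=3, k=1)
print(statistic, dof)
```

The statistic would then be compared with the χ² distribution with `dof` degrees of freedom; that comparison is left out here to keep the example free of additional dependencies.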
Here, the decomposition is performed according to the orders of statistical association, which does not spatially distinguish the vertices. If we define the weight of an edge $\{x_i, x_j\}$ with the KL-divergence, the above $k$-cut coordinates $\boldsymbol{\zeta}^k$ are not appropriate to measure the information represented in each edge. We need to set other mixture coordinates so as to separate only the information existing between $x_i$ and $x_j$, regardless of its order.
Let us return to the definition of the system decomposition and consider it in the dual-flat coordinates $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$.

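Proposition 1. The independence between the two decomposed systems $y^1 = (x^1_1, \cdots, x^1_{n_1})$ and $y^2 = (x^2_1, \cdots, x^2_{n_2})$ can be expressed on the new coordinates $\boldsymbol{\eta}^{\mathrm{dec}}$ as follows:

$$\begin{aligned}
\eta^{\mathrm{dec}}_{i} &= \eta_i, \quad (1 \le i \le n), \\
\eta^{\mathrm{dec}}_{i,j} &=
\begin{cases}
\eta_{i,j}, & \text{if } \{x_i, x_j\} \subseteq y^1 \text{ or } \subseteq y^2 \\
\eta_i \eta_j, & \text{else}
\end{cases}
\quad (1 \le i < j \le n), \\
\eta^{\mathrm{dec}}_{i,j,k} &=
\begin{cases}
\eta_{i,j,k}, & \text{if } \{x_i, x_j, x_k\} \subseteq y^1 \text{ or } \subseteq y^2 \\
\eta_{i,j}\, \eta_k, & \text{else if } \{x_i, x_j\} \subseteq y^1 \text{ or } \subseteq y^2 \\
\eta_i\, \eta_{j,k}, & \text{else if } \{x_j, x_k\} \subseteq y^1 \text{ or } \subseteq y^2 \\
\eta_j\, \eta_{i,k}, & \text{else (if } \{x_i, x_k\} \subseteq y^1 \text{ or } \subseteq y^2)
\end{cases}
\quad (1 \le i < j < k \le n), \\
&\;\;\vdots \\
\eta^{\mathrm{dec}}_{1,2,\cdots,n} &= \eta_{s[i, k_1, \cdots, k_{n_1 - 1}]}\, \eta_{s[j, l_1, \cdots, l_{n_2 - 1}]}, \quad (x_i \in y^1,\; x_j \in y^2),
\end{aligned} \tag{13}$$

where $s[\cdots]$ is the ascending sort of the internal sequence.
Then the corresponding dual coordinates $\boldsymbol{\theta}^{\mathrm{dec}}$ take zero elements as follows:

$$\begin{aligned}
\theta^{\mathrm{dec}}_{i,j} &= 0, \quad (1 \le i < j \le n), && \text{if } \{x_i, x_j\} \cap y^1 \neq \emptyset \text{ and } \{x_i, x_j\} \cap y^2 \neq \emptyset, \\
\theta^{\mathrm{dec}}_{i,j,k} &= 0, \quad (1 \le i < j < k \le n), && \text{if } \{x_i, x_j, x_k\} \cap y^1 \neq \emptyset \text{ and } \{x_i, x_j, x_k\} \cap y^2 \neq \emptyset, \\
&\;\;\vdots \\
\theta^{\mathrm{dec}}_{1,2,\cdots,n} &= 0.
\end{aligned} \tag{14}$$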

Proof. For simplicity, we show the cases of n = 2 and n = 3 for the first node separation.


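For n = 2, the $\boldsymbol{\eta}^{dec}$ defined above for the system decomposition < 12 > gives its dual coordinates
$\boldsymbol{\theta}^{dec}$ as follows:

$$
\begin{aligned}
\theta^{dec}_{1} &= \log \frac{\eta^{dec}_{1} - \eta^{dec}_{1,2}}{1 - \eta^{dec}_{1} - \eta^{dec}_{2} + \eta^{dec}_{1,2}} = \log \frac{\eta_{1}}{1 - \eta_{1}}, \\
\theta^{dec}_{2} &= \log \frac{\eta^{dec}_{2} - \eta^{dec}_{1,2}}{1 - \eta^{dec}_{1} - \eta^{dec}_{2} + \eta^{dec}_{1,2}} = \log \frac{\eta_{2}}{1 - \eta_{2}}, \\
\theta^{dec}_{1,2} &= \log \frac{\eta^{dec}_{1,2}\,(1 - \eta^{dec}_{1} - \eta^{dec}_{2} + \eta^{dec}_{1,2})}{(\eta^{dec}_{1} - \eta^{dec}_{1,2})(\eta^{dec}_{2} - \eta^{dec}_{1,2})} = 0,
\end{aligned}
\tag{15}
$$

which means that the first and second nodes are independent.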
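For n = 3, the $\boldsymbol{\eta}^{dec}$ defined above for the system decomposition < 122 > gives its dual coordinates
$\boldsymbol{\theta}^{dec}$ as follows:

$$
\begin{aligned}
\theta^{dec}_{1} &= \log \frac{\eta^{dec}_{1} - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3}}{1 - \eta^{dec}_{1} - \eta^{dec}_{2} - \eta^{dec}_{3} + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}} = \log \frac{\eta_{1}}{1 - \eta_{1}}, \\
\theta^{dec}_{2} &= \log \frac{\eta^{dec}_{2} - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3}}{1 - \eta^{dec}_{1} - \eta^{dec}_{2} - \eta^{dec}_{3} + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}} = \log \frac{\eta_{2} - \eta_{2,3}}{1 - \eta_{2} - \eta_{3} + \eta_{2,3}}, \\
\theta^{dec}_{3} &= \log \frac{\eta^{dec}_{3} - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3}}{1 - \eta^{dec}_{1} - \eta^{dec}_{2} - \eta^{dec}_{3} + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}} = \log \frac{\eta_{3} - \eta_{2,3}}{1 - \eta_{2} - \eta_{3} + \eta_{2,3}},
\end{aligned}
\tag{16}
$$

$$
\begin{aligned}
\theta^{dec}_{1,2} &= \log \frac{(\eta^{dec}_{1,2} - \eta^{dec}_{1,2,3})(1 - \eta^{dec}_{1} - \eta^{dec}_{2} - \eta^{dec}_{3} + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}{(\eta^{dec}_{1} - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_{2} - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})} = 0, \\
\theta^{dec}_{1,3} &= \log \frac{(\eta^{dec}_{1,3} - \eta^{dec}_{1,2,3})(1 - \eta^{dec}_{1} - \eta^{dec}_{2} - \eta^{dec}_{3} + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}{(\eta^{dec}_{1} - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_{3} - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})} = 0, \\
\theta^{dec}_{2,3} &= \log \frac{(\eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})(1 - \eta^{dec}_{1} - \eta^{dec}_{2} - \eta^{dec}_{3} + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}{(\eta^{dec}_{2} - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_{3} - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})} = \log \frac{\eta_{2,3}\,(1 - \eta_{2} - \eta_{3} + \eta_{2,3})}{(\eta_{2} - \eta_{2,3})(\eta_{3} - \eta_{2,3})},
\end{aligned}
\tag{17}
$$

$$
\begin{aligned}
\theta^{dec}_{1,2,3} = \log \Bigg[ &\frac{\eta^{dec}_{1,2,3}}{(\eta^{dec}_{1,2} - \eta^{dec}_{1,2,3})(\eta^{dec}_{1,3} - \eta^{dec}_{1,2,3})(\eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})} \\
&\times \frac{(\eta^{dec}_{1} - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_{2} - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_{3} - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})}{1 - \eta^{dec}_{1} - \eta^{dec}_{2} - \eta^{dec}_{3} + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}} \Bigg] = 0,
\end{aligned}
\tag{18}
$$

which means that the first node is independent of the other nodes.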
The generalization is possible by using a recurrence formula between system sizes n and
n + 1, according to the symmetry of the model and the Legendre transformation between the $\boldsymbol{\eta}^{dec}$ and $\boldsymbol{\theta}^{dec}$
coordinates.
A numerical proof can be obtained by directly computing the 0 elements of $\boldsymbol{\theta}^{dec}$ from $\boldsymbol{\eta}^{dec}$.
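The numerical check mentioned above can be sketched as follows for n = 3 and the decomposition < 122 >. This is a minimal illustration, not the authors' code: the helper names (`eta_from_p`, `decompose_eta`, `p_from_eta`, `theta_from_p`) and the random test distribution are assumptions introduced here, and nodes are indexed from 0.

```python
import itertools
import math
import numpy as np

def subsets(items):
    """All subsets of `items` as sorted tuples, including the empty tuple."""
    items = tuple(items)
    return [c for r in range(len(items) + 1)
            for c in itertools.combinations(items, r)]

def eta_from_p(p):
    """eta_S = P(x_i = 1 for all i in S); eta of the empty set is 1."""
    n = p.ndim
    return {S: float(p[tuple(1 if i in S else slice(None) for i in range(n))].sum())
            for S in subsets(range(n))}

def decompose_eta(eta, blocks):
    """Eq. (13): eta_S becomes the product of eta over the part of S in each block."""
    return {S: math.prod(eta[tuple(i for i in S if i in B)] for B in blocks)
            for S in eta}

def p_from_eta(eta, n):
    """Inclusion-exclusion: P(x_S = 1, all other x = 0) from the eta coordinates."""
    p = np.zeros((2,) * n)
    for S in subsets(range(n)):
        rest = [i for i in range(n) if i not in S]
        x = tuple(1 if i in S else 0 for i in range(n))
        p[x] = sum((-1) ** len(E) * eta[tuple(sorted(S + E))] for E in subsets(rest))
    return p

def theta_from_p(p):
    """theta_S from the log-linear expansion of log p (Moebius inversion)."""
    n = p.ndim
    theta = {}
    for S in subsets(range(n)):
        if S:
            theta[S] = sum(
                (-1) ** (len(S) - len(T))
                * math.log(p[tuple(1 if i in T else 0 for i in range(n))])
                for T in subsets(S))
    return theta

# Hypothetical strictly positive joint distribution of 3 binary variables.
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2)) + 0.1
p /= p.sum()

blocks = [(0,), (1, 2)]                      # decomposition <122>
eta_dec = decompose_eta(eta_from_p(p), blocks)
theta_dec = theta_from_p(p_from_eta(eta_dec, 3))

# Coordinates whose subscripts traverse both subsystems should vanish.
crossing = [S for S in theta_dec if 0 in S and len(S) > 1]
print({S: round(theta_dec[S], 12) for S in crossing})
```

The coordinate θ for the pair inside the second subsystem is left untouched by the decomposition, which is the behaviour stated in Equation (17).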


The definition of $\boldsymbol{\eta}^{dec}$ decomposes the hierarchical marginal distributions $\boldsymbol{\eta}$ into the
products of the subsystems' marginal distributions whenever the subscripts traverse the two subsystems.
Therefore, only the statistical associations between the two subsystems are set to be independent, while
the internal dependencies of each subsystem remain unchanged. This is analytically equivalent to
composing another set of mixture coordinates $\boldsymbol{\xi}$, namely the < · · · >-cut coordinates, where < · · · > describes
the system decomposition. The $\boldsymbol{\xi}$ consists of the $\boldsymbol{\eta}$ coordinates whose subscripts
do not traverse the decomposed subsystems, and the $\boldsymbol{\theta}$ coordinates whose subscripts traverse
them.
For simplicity, we only describe here the case n = 4 and the decomposition < 1133 > (the first and
second nodes form one subsystem, and the third and fourth nodes form the other). The system p(x) is expressed
with the < 1133 >-cut coordinates $\boldsymbol{\xi}$ as

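$$
\begin{aligned}
\xi_{1} &= \eta_{1}, \\
&\ \ \vdots \\
\xi_{4} &= \eta_{4}, \\
\xi_{1,2} &= \eta_{1,2}, \\
\xi_{1,3} &= \theta_{1,3}, \\
\xi_{1,4} &= \theta_{1,4}, \\
\xi_{2,3} &= \theta_{2,3}, \\
\xi_{2,4} &= \theta_{2,4}, \\
\xi_{3,4} &= \eta_{3,4}, \\
\xi_{1,2,3} &= \theta_{1,2,3}, \\
&\ \ \vdots \\
\xi_{2,3,4} &= \theta_{2,3,4}, \\
\xi_{1,2,3,4} &= \theta_{1,2,3,4}.
\end{aligned}
\tag{19}
$$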
The decomposed system with no statistical association between the two subsystems has the
following coordinates $\boldsymbol{\xi}^{dec}$, which, for any decomposition, is equivalent to setting all θ in $\boldsymbol{\xi}$ to 0:

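$$
\begin{aligned}
\xi^{dec}_{1} &= \eta_{1}, \\
&\ \ \vdots \\
\xi^{dec}_{4} &= \eta_{4}, \\
\xi^{dec}_{1,2} &= \eta_{1,2}, \\
\xi^{dec}_{1,3} &= 0, \\
\xi^{dec}_{1,4} &= 0, \\
\xi^{dec}_{2,3} &= 0, \\
\xi^{dec}_{2,4} &= 0, \\
\xi^{dec}_{3,4} &= \eta_{3,4}, \\
\xi^{dec}_{1,2,3} &= 0, \\
&\ \ \vdots \\
\xi^{dec}_{2,3,4} &= 0, \\
\xi^{dec}_{1,2,3,4} &= 0.
\end{aligned}
\tag{20}
$$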


This is analytically equivalent to the definition of the decomposition (13)–(14) in the case of < 1133 >.
Therefore, the KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{dec})]$ measures the information lost by the system
decomposition. The following asymptotic agreement with the χ² test also holds.

Proposition 2.

$$2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{dec})] \sim \chi^{2}(\sharp\theta(\boldsymbol{\xi})), \tag{21}$$

where $\sharp\theta(\boldsymbol{\xi})$ is the number of $\boldsymbol{\theta}$ coordinates appearing in the $\boldsymbol{\xi}$ coordinates.
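As an illustration of Proposition 2, the sketch below computes the information lost by the decomposition < 1133 > for n = 4 and counts the θ coordinates that traverse both subsystems. It assumes that the decomposed system is the product of the subsystems' marginal distributions, as described above; the function names, the random distribution, and the sample size N are hypothetical, and nodes are indexed from 0.

```python
import itertools
import numpy as np

def block_product(p, blocks):
    """Product of the block marginals of p, i.e. the decomposed system."""
    n = p.ndim
    q = np.ones_like(p)
    for B in blocks:
        other = tuple(i for i in range(n) if i not in B)
        q = q * (p.sum(axis=other, keepdims=True) if other else p)
    return q

def kl(p, q):
    """KL-divergence D[p : q] for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def crossing_count(n, blocks):
    """Number of subscripts (order >= 2) that touch at least two blocks."""
    return sum(1
               for r in range(2, n + 1)
               for S in itertools.combinations(range(n), r)
               if sum(bool(set(S) & set(B)) for B in blocks) >= 2)

# Hypothetical strictly positive joint distribution of 4 binary variables.
rng = np.random.default_rng(1)
p = rng.random((2, 2, 2, 2)) + 0.1
p /= p.sum()

blocks = [(0, 1), (2, 3)]          # decomposition <1133>
p_dec = block_product(p, blocks)
divergence = kl(p, p_dec)          # information lost by the decomposition
dof = crossing_count(4, blocks)    # degrees of freedom of the chi-square law
N = 1000                           # hypothetical sample size
statistic = 2 * N * divergence     # to be compared with chi-square(dof)
print(dof, divergence, statistic)
```

For < 1133 > the count is 9, which agrees with the nine θ entries of Equation (19): four pairs, four triples, and the single fourth-order term.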

3. Edge Cutting

We further extend the concept of system decomposition to eventually quantify the total amount of
information expressed by an edge of the graph. Let us consider cutting an edge {xi, xj} (1 ≤ i < j ≤ n)
of the graph with n vertices. Hereafter we call this operation the edge cutting i − j. In the same way
as the system decomposition, the edge cutting corresponds to modifying the $\boldsymbol{\eta}$ coordinates to produce the $\boldsymbol{\eta}^{ec}$
coordinates as follows:

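$$
\begin{aligned}
\eta^{ec}_{i,j} &= \eta_{i}\eta_{j}, \\
\eta^{ec}_{s[i,j,k_{1}]} &= \eta_{s[i,k_{1}]}\,\eta_{s[j,k_{1}]}, \\
\eta^{ec}_{s[i,j,k_{1},k_{2}]} &= \eta_{s[i,k_{1},k_{2}]}\,\eta_{s[j,k_{1},k_{2}]}, \\
&\ \ \vdots \\
\eta^{ec}_{s[i,j,k_{1},\cdots,k_{n-2}]} &= \eta_{s[i,k_{1},\cdots,k_{n-2}]}\,\eta_{s[j,k_{1},\cdots,k_{n-2}]}, \\
(\{i, j, k_{1}, \cdots, k_{n-2}\} &= \{1, \cdots, n\}),
\end{aligned}
\tag{22}
$$

and the rest of $\boldsymbol{\eta}^{ec}$ remains the same as that of $\boldsymbol{\eta}$.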
The formation of $\boldsymbol{\eta}^{ec}$ from $\boldsymbol{\eta}$ consists of replacing the k-th order elements (k ≥ 3) of $\boldsymbol{\eta}$ that include both
i and j in their subscripts with the product of the (k − 1)-th order $\boldsymbol{\eta}$ of the maximal subgraphs (k − 1 vertices),
each including i or j. This means that all orders of statistical association including the variables xi and
xj are set to be independent only between them. Other relations that do not simultaneously include xi
and xj remain unchanged.
Certain combinations of edge cuttings coincide with system decompositions. For example, in the case
n = 4, the edge cuttings 1 − 2, 1 − 3, and 1 − 4 together are equivalent to the system decomposition < 1222 >.
We define the i − j-cut mixture coordinates $\boldsymbol{\xi}$ for the orthogonal decomposition of the statistical
association represented by the edge {xi, xj}. Although the actual calculation can be performed with the $\boldsymbol{\eta}$
coordinates alone, this generalization is necessary to have a geometrical definition of the orthogonality. For
simplicity, we only describe the $\boldsymbol{\xi}$ for the edge cutting 1 − 2 in the case of n = 4:


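$$
\begin{aligned}
\xi_{1} &= \eta_{1}, \\
&\ \ \vdots \\
\xi_{4} &= \eta_{4}, \\
\xi_{1,2} &= \theta_{1,2}, \\
\xi_{1,3} &= \eta_{1,3}, \\
\xi_{1,4} &= \eta_{1,4}, \\
\xi_{2,3} &= \eta_{2,3}, \\
\xi_{2,4} &= \eta_{2,4}, \\
\xi_{3,4} &= \eta_{3,4}, \\
\xi_{1,2,3} &= \theta_{1,2,3}, \\
\xi_{1,2,4} &= \theta_{1,2,4}, \\
\xi_{1,3,4} &= \eta_{1,3,4}, \\
\xi_{2,3,4} &= \eta_{2,3,4}, \\
\xi_{1,2,3,4} &= \theta_{1,2,3,4},
\end{aligned}
\tag{23}
$$

where orthogonality between the elements of $\boldsymbol{\eta}$ and $\boldsymbol{\theta}$ holds with respect to the Fisher information
matrix.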
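Calculating the dual coordinates $\boldsymbol{\theta}^{ec}$ of $\boldsymbol{\eta}^{ec}$, we can define the coordinates $\boldsymbol{\xi}^{ec}$ of the system after
the edge cutting 1 − 2 as follows:

$$
\begin{aligned}
\xi^{ec}_{1} &= \eta_{1}, \\
&\ \ \vdots \\
\xi^{ec}_{4} &= \eta_{4}, \\
\xi^{ec}_{1,2} &= \theta^{ec}_{1,2}, \\
\xi^{ec}_{1,3} &= \eta_{1,3}, \\
\xi^{ec}_{1,4} &= \eta_{1,4}, \\
\xi^{ec}_{2,3} &= \eta_{2,3}, \\
\xi^{ec}_{2,4} &= \eta_{2,4}, \\
\xi^{ec}_{3,4} &= \eta_{3,4}, \\
\xi^{ec}_{1,2,3} &= \theta^{ec}_{1,2,3}, \\
\xi^{ec}_{1,2,4} &= \theta^{ec}_{1,2,4}, \\
\xi^{ec}_{1,3,4} &= \eta_{1,3,4}, \\
\xi^{ec}_{2,3,4} &= \eta_{2,3,4}, \\
\xi^{ec}_{1,2,3,4} &= \theta^{ec}_{1,2,3,4}.
\end{aligned}
$$

Note that the edge cutting cannot be defined simply by setting the corresponding elements of $\boldsymbol{\theta}^{ec}$ to 0.
Then the KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{ec})]$ represents the total amount of information
carried by the edge 1 − 2.
The following asymptotic agreement with the χ² test also holds: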


Proposition 3.

$$2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{ec})] \sim \chi^{2}\Big(1 + \sum_{k=1}^{n-2} {}_{n-2}C_{k}\Big). \tag{24}$$

We call this χ² value, or the KL-divergence itself, the edge information of edge 1 − 2.
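As a short worked check of the degrees of freedom in Equation (24), the sum of binomial coefficients can be written in closed form; the case n = 4 is then compared with the coordinates listed in Equation (23).

```latex
\[
1 + \sum_{k=1}^{n-2} {}_{n-2}C_{k}
  \;=\; \sum_{k=0}^{n-2} {}_{n-2}C_{k}
  \;=\; 2^{\,n-2}.
\]
% For n = 4: 1 + {}_{2}C_{1} + {}_{2}C_{2} = 1 + 2 + 1 = 4,
% matching the four theta coordinates of Equation (23):
% \theta_{1,2}, \theta_{1,2,3}, \theta_{1,2,4}, \theta_{1,2,3,4}.
```

In other words, the count equals the number of subscripts that contain both endpoints of the cut edge.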

4. Generalized Mutual Information as Complexity with Respect to the Total System Decomposition

In the previous sections, we have introduced a measure of complexity in terms of system
decomposition, by measuring the KL-divergence between a given system and its independently
decomposed subsystems.
We consider here the total system decomposition, and measure the
informational distance I between the system and the totally decomposed system, in which all elements
are independent:

$$I := \sum_{i=1}^{n} H(x_{i}) - H(x_{1}, \cdots, x_{n}), \tag{25}$$

where

$$H(x) := -\sum_{x} p(x) \log p(x). \tag{26}$$

This quantity is the generalization of mutual information, and is named in various ways, such as
generalized mutual information, integration, complexity, multi-information, etc., according to different
authors. For simplicity, we call I the "multi-information", following [15]. This quantity can be
interpreted as a measure of complexity that sums up the order-wise statistical association existing in
each subset of components with information geometrical formalization [14].
For simplicity, we denote the multi-information I of n-dimensional stochastic binary variables as
follows, using the notation of the system decomposition:

$$I = D[\langle 111 \cdots 1 \rangle : \langle 123 \cdots n \rangle]. \tag{27}$$
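The two expressions for the multi-information, Equations (25) and (27), can be compared numerically. The following is a minimal sketch; the function names and the random test distribution are assumptions made for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum p log p (natural logarithm)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def marginal(p, i):
    """Marginal distribution of the i-th variable."""
    axes = tuple(a for a in range(p.ndim) if a != i)
    return p.sum(axis=axes)

def multi_information(p):
    """Eq. (25): sum of single-variable entropies minus the joint entropy."""
    return sum(entropy(marginal(p, i)) for i in range(p.ndim)) - entropy(p)

def independent_model(p):
    """Totally decomposed system <123...n>: product of all single marginals."""
    q = np.ones_like(p)
    for i in range(p.ndim):
        shape = [1] * p.ndim
        shape[i] = p.shape[i]
        q = q * marginal(p, i).reshape(shape)
    return q

def kl(p, q):
    """KL-divergence D[p : q] for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical strictly positive joint distribution of 3 binary variables.
rng = np.random.default_rng(2)
p = rng.random((2, 2, 2)) + 0.1
p /= p.sum()

I = multi_information(p)             # Eq. (25)
D = kl(p, independent_model(p))      # Eq. (27)
print(I, D)
```

Both values coincide up to floating-point error, and the multi-information of the totally decomposed system itself is zero.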

5. Rectangle-Bias Complexity

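The multi-information contains some degrees of freedom in the case n > 2. That is, we can define a
set of distributions {p(x)|I = const.} with different parameters but the same I value. This fact can be
clearly explained with the use of information geometry. From the Pythagorean relation, we obtain the
following in the case of n = 3:

$$
\begin{aligned}
D[\langle 111 \rangle : \langle 113 \rangle] + D[\langle 113 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle], \\
D[\langle 111 \rangle : \langle 121 \rangle] + D[\langle 121 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle], \\
D[\langle 111 \rangle : \langle 122 \rangle] + D[\langle 122 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle].
\end{aligned}
\tag{28}
$$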
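The Pythagorean relations in Equation (28) can be verified numerically. The sketch below assumes that each decomposed system is the product of the corresponding block marginals; the helper names and the random distribution are illustrative only, and nodes are indexed from 0.

```python
import numpy as np

def block_product(p, blocks):
    """Product of the block marginals of p, i.e. the decomposed system."""
    n = p.ndim
    q = np.ones_like(p)
    for B in blocks:
        other = tuple(i for i in range(n) if i not in B)
        q = q * (p.sum(axis=other, keepdims=True) if other else p)
    return q

def kl(p, q):
    """KL-divergence D[p : q] for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

# The three intermediate decompositions for n = 3, in the <...> notation.
DECOMPOSITIONS = {
    "<113>": [(0, 1), (2,)],
    "<121>": [(0, 2), (1,)],
    "<122>": [(0,), (1, 2)],
}

# Hypothetical strictly positive joint distribution of 3 binary variables.
rng = np.random.default_rng(3)
p = rng.random((2, 2, 2)) + 0.1
p /= p.sum()

total = block_product(p, [(0,), (1,), (2,)])   # <123>
I = kl(p, total)                               # D[<111> : <123>]

legs = {}
for name, blocks in DECOMPOSITIONS.items():
    mid = block_product(p, blocks)
    legs[name] = (kl(p, mid), kl(mid, total))
    print(name, legs[name], sum(legs[name]), I)
```

Each pair of divergences sums to I, so the three intermediate systems can be placed on a circle whose diameter corresponds to the total decomposition.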

Using these relations, we can schematically represent the decomposed systems on a circle diagram
with diameter $\sqrt{I}$. This representation is based on the analogy between the Pythagorean relation
of the KL-divergence and that of Euclidean geometry, where the angle inscribed in a semicircle
is always π/2.
Figure 1 represents two different cases with the same I value for n = 3. The distance between
two systems in the same diagram corresponds to the square root of the KL-divergence between
them. Clearly the left and right figures represent different dependencies between nodes, although
they both have the same I value. Such geometrical variation is possible because the dual coordinates have
more degrees of freedom than the given constraint removes. There exist 7 degrees of freedom in the η
or θ coordinates for n = 3, while the only constraint is the invariance of the I value, which removes only 1
degree of freedom. The remaining 6 degrees of freedom can then be deployed to produce geometrical
variation in the circle diagram. In terms of system decomposition, the left system is difficult to
decompose without losing much information, while the right system admits a
relatively easy decomposition < 122 >, which loses less information than any decomposition of the left
system. We call this degree of ease of decomposition, with respect to the information lost, system
decompositionability. In this sense, the left system is more complex although the two systems have the
same I value. In particular, in the case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] = D[< 111 >:< 121 >],
the system does not have any easiest way of decomposition, and any isolation of a node loses a significant
amount of information.
To further incorporate this geometrical structure, which reflects system decompositionability, into a
measure of complexity, we consider a mathematical way to distinguish between these two figures.
Although the total sum of the KL-divergences along a sequence of system decompositions is always
identical to I by the Pythagorean relation, their product can vary according to the geometrical composition
in the circle diagram. This is analogous to the isoperimetric inequality for rectangles, where the square
gives the maximum area among rectangles of constant perimeter.
We provisionally propose a new measure of complexity, namely the rectangle-bias
complexity Cr, as follows:

$$C_{r} = \frac{1}{|SD| - 2} \sum_{\langle \cdots \rangle \in SD} D[\langle 11 \cdots 1 \rangle : \langle \cdots \rangle] \cdot D[\langle \cdots \rangle : \langle 12 \cdots n \rangle], \tag{29}$$

where SD is the set of possible system decompositions in n binary variables, and |SD| is the number of
elements of SD. For example, SD = {< 111 >, < 122 >, < 121 >, < 113 >, < 123 >} and |SD| = 5
for n = 3. This measure distinguishes between the two systems in Figure 1, and gives a larger value
for the left figure. It also takes its maximum value in the case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] =
D[< 111 >:< 121 >].
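A minimal sketch of the rectangle-bias complexity for n = 3 is given below. It assumes that each decomposed system is the product of its block marginals; since D[x : x] = 0, the terms for < 111 > and < 123 > vanish and only the |SD| − 2 = 3 intermediate decompositions contribute. The function names and test distribution are illustrative, and nodes are indexed from 0.

```python
import numpy as np

def block_product(p, blocks):
    """Product of the block marginals of p, i.e. the decomposed system."""
    n = p.ndim
    q = np.ones_like(p)
    for B in blocks:
        other = tuple(i for i in range(n) if i not in B)
        q = q * (p.sum(axis=other, keepdims=True) if other else p)
    return q

def kl(p, q):
    """KL-divergence D[p : q] for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

# Intermediate decompositions <122>, <121>, <113> for n = 3.
INTERMEDIATE = [[(0,), (1, 2)], [(0, 2), (1,)], [(0, 1), (2,)]]

def rectangle_bias_complexity(p):
    """Rectangle-bias complexity for 3 binary variables (sketch)."""
    total = block_product(p, [(0,), (1,), (2,)])      # <123>
    products = []
    for blocks in INTERMEDIATE:
        mid = block_product(p, blocks)
        products.append(kl(p, mid) * kl(mid, total))
    # <111> and <123> contribute zero, so the sum has |SD| - 2 = 3 terms.
    return sum(products) / len(products)

# Hypothetical strictly positive joint distribution of 3 binary variables.
rng = np.random.default_rng(4)
p = rng.random((2, 2, 2)) + 0.1
p /= p.sum()

I = kl(p, block_product(p, [(0,), (1,), (2,)]))
Cr = rectangle_bias_complexity(p)
print(Cr, I ** 2 / 4)
```

Because the two factors of each product sum to I, each product is at most I²/4, so the value of this sketch never exceeds I²/4.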


Figure 1. Circle diagrams of system decomposition in a 3-node network. Both systems have the same value of
multi-information I, which is expressed as the identical diameter length of the circles. Two variations are
shown, where the left system is more complex (Cr high) in the sense that any system decomposition
loses more information than the easiest one (< 122 >) in the right figure (Cr low).

6. Complementarity between Complexities Defined with Arithmetic and Geometric Means

We evaluate the possibilities and the limits of the rectangle-bias complexity Cr in comparison with other
proposed measures of complexity.
Interest in measuring network heterogeneity has developed toward the incorporation
of multi-scale characteristics into complexity measures. The TSE complexity is motivated by the
structure of the functional differentiation of brain activity, and measures the difference in neural
integration between subsystems of all sizes and the whole system [17]. The biologically motivated TSE
complexity has also been investigated from a theoretical point of view, to give it further desirable properties
as a universal complexity measure independent of system size [20]. The hierarchical structure of
the exponential family in information geometry also leads to the order-wise description of statistical
association, which can be regarded as a multi-scale complexity measure [14]. The relation between
the order-wise dependencies and the TSE complexity has been theoretically investigated to establish the
order-wise component correspondence between them [15].
These indices of network heterogeneity, however, all depend on the arithmetic mean of a
component-wise information theoretical measure. We show that these arithmetic means still fail to
measure certain kinds of modularity based on the statistical independence between subsystems.
Figure 2 presents simplified cases that complexity measures based on arithmetic means fail to
distinguish. We consider two systems with different heterogeneity but identical multi-information
I. Here, the multi-information cannot reflect the network heterogeneity. The TSE complexity and its
information geometrical counterpart in [15] are sensitive to the network heterogeneity,
but since the arithmetic mean is taken over all subsystems, they do not distinguish the component-wise
breaking of symmetry between different scales. The TSE complexity renormalized with respect to the
multi-information I has the same insensitivity. Even when the information of each
subsystem scale is incorporated, the arithmetic mean can balance out the scale-wise variations, and a wide
range of heterogeneity at different scales can realize the same value of these complexities. For
applications in neuroscience, the assumption of a model with simple parametric heterogeneity and the
comparison of TSE complexity between different I values alleviate this limitation [17].


Figure 2. Schematic examples of stochastic systems with identical multi-information I where complexity
measures with arithmetic mean fail to distinguish. (a): Example 1 of stochastic system with
homogeneous mean of edge information and symmetric fluctuation of its heterogeneity; (b):
Example 2 of heterogeneous stochastic system with bimodal edge information distribution and
identical multi-information I and complexity based on arithmetic mean as example 1; (c): schematic
representation of the distribution of statistical association (edge information) in upper network; (d):
schematic representation of the distribution of statistical association (edge information) in upper
network.

In contrast to complexities based on the arithmetic mean, the rectangle-bias complexity Cr is related to
the geometric mean. The Cr can distinguish the two systems in Figure 2, giving a relatively high Cr
value to the left system and a low value to the right one.
This does not mean, however, that the Cr has a finer resolution than other complexity
measures. The constant conditions of complexity measures are constraints on the $\sum_{k=1}^{n} {}_{n}C_{k}$ degrees of
freedom in the model parameter space, which define different geometrical compositions of the corresponding
submanifolds. We basically need $\sum_{k=1}^{n} {}_{n}C_{k}$ independent measures to assure real-valued resolution
of network feature characterization. Complexities based on arithmetic and geometric means simply give
complementary information on network heterogeneity, that is, different constant-complexity submanifold
structures in the statistical manifold, as depicted in Figure 3. Therefore, it is also possible to construct a
class of systems that have identical I and Cr values but different TSE complexity. Complexity measures
should be used in combination, with respect to the non-linear separation capacity of the network
features of interest.


Figure 3. Schematic representation of complementarity between complexity measures based on arithmetic
mean (Ca) and geometric mean (Cg) of informational distance. Examples of the n − 1 dimensional
constant-complexity submanifolds with respect to the Ca = const. and Cg = const. conditions are depicted
as yellow and orange surfaces, respectively. The dimension of the whole statistical manifold S is the
parameter number n.

7. Cuboid-Bias Complexity with Respect to System Decompositionability

We consider the extension of Cr to general system size n. The situation for n ≥ 4 differs from that for
n = 3 and below in the existence of a hierarchical structure between system decompositions.
Figure 4 shows the hierarchy of the system decompositions in the case n = 4. This hierarchical
structure between system decompositions is not homogeneous with respect to the number of
subsystems, and depends on the isomorphic types of the decomposed systems. This fact gives the
individual KL-divergences different meanings in terms of complexity when considering the system
decompositionability.

Figure 4. Hierarchy of system decomposition for 4 nodes network (n = 4). Possible sequences of Seq =
{SD1(is) → SD2(is) → SD3(is) → SD4(is)|1 ≤ is ≤ |Seq| = 18} are connected with the lines.

A simple example in 4 nodes network is shown in Figure 5.


Figure 5. Hierarchical effect of sequential system decomposition on cuboid volume and rectangle surface
on circle graph. We consider increasing the diameter of the green circle from the dotted to the dashed one
without changing those of the red and blue circles, which has different effects on the change of
D[< 1222 >:< 1233 >] and D[< 1133 >:< 1134 >] according to the hierarchical structure of the
decomposition sequences.

We consider the modification of two KL-divergences in the figure, D[< 1111 >:< 1222 >] and
D[< 1111 >:< 1133 >], from the diameter of the green dotted circle to that of the dashed one.
The joint distribution P(x1, x2, x3, x4) of 4 binary variables
(x1, x2, x3, x4 ∈ {0, 1}) has $2^{4} - 1 = 15$ parameters, which define the dually flat
coordinates of the statistical manifold in information geometry.
On the other hand, the possible system decompositions for n = 4 are as follows:

$$
\begin{aligned}
SD := \{ &\langle 1111 \rangle, \langle 1114 \rangle, \langle 1131 \rangle, \langle 1211 \rangle, \langle 1222 \rangle, \\
&\langle 1133 \rangle, \langle 1212 \rangle, \langle 1221 \rangle, \langle 1134 \rangle, \langle 1214 \rangle, \\
&\langle 1231 \rangle, \langle 1224 \rangle, \langle 1232 \rangle, \langle 1233 \rangle, \langle 1234 \rangle \}.
\end{aligned}
\tag{31}
$$

Since the number of possible system decompositions is 15, and each is associated with the
modification of a different set of P(x1, x2, x3, x4) parameters, the system decompositions and the KL-divergences
between them can be defined independently. This holds even under the constant
condition of the I value or of other complexity measures, except those imposing dependency between
system decompositions.
This means that we can independently modify the diameter of the green dotted circle in Figure 5,
without changing the diameters of the red and blue circles, which define the system decompositions
< 1233 > and < 1134 > in the sub-hierarchy of < 1222 > and < 1133 >, respectively. Other
KL-divergences can also be maintained at given constant values for the same reason.
The rectangle-bias complexity Cr increases with such a modification, but does not reflect
the heterogeneity of the KL-divergences according to the hierarchy of system decompositions. If we
consider the system decompositionability as the mean ease of decomposing the given system into
its finest components with respect to all possible system decompositions, such a hierarchical
difference also has a meaning in the definition of complexity.
The effect of modifying the diameter of the green dotted circle differs between the
decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and < 1111 >→< 1133 >→< 1134 >→< 1234 >.
The decrease of the KL-divergence D[< 1222 >:< 1233 >]
is less than that of D[< 1133 >:< 1134 >], since the diameter of the red dotted circle is larger than that of the
blue one in Figure 5. This means that changing the KL-divergences
D[< 1111 >:< 1222 >] and D[< 1111 >:< 1133 >] by the same amount produces a larger effect on the sequence
< 1111 >→< 1133 >→< 1134 >→< 1234 > than on < 1111 >→< 1222 >→< 1233 >→< 1234 >, when
compared at the sequence level. The rectangle-bias complexity Cr does not reflect such characteristics,
since it does not distinguish the hierarchical structure between the diameters of the green, red
and blue dotted circles.
To incorporate such a hierarchical effect into a complexity measure based on the geometric mean, we
naturally extend the rectangle-bias complexity Cr to the cuboid-bias complexity Cc, which is
defined as follows:

$$C_{c} := \frac{1}{|Seq|} \sum_{i_{s}=1}^{|Seq|} \prod_{i=1}^{n-1} D[SD_{i}(i_{s}) : SD_{i+1}(i_{s})], \tag{32}$$

where Seq represents the possible sequences of hierarchical system decompositions as follows:

$$Seq = \{SD_{1}(i_{s}) \to SD_{2}(i_{s}) \to \cdots SD_{i}(i_{s}) \cdots \to SD_{n}(i_{s}) \mid 1 \le i_{s} \le |Seq|\}. \tag{33}$$

The elements $SD_{i}(i_{s})$ of Seq correspond to system decompositions, which are aligned according to
the hierarchy by the following algorithmic procedure (based on [15]):

(1) Initialization: Set the initial system decomposition of all sequences in Seq to the whole
system, $SD_{1}(i_{s}) := \langle 111 \cdots 1 \rangle$ $(1 \le i_{s} \le |Seq|)$.
(2) Step i → i + 1: If the system decomposition is the total system decomposition ($SD_{i}(i_{s}) = \langle 123 \cdots n \rangle$),
then stop. Otherwise, choose a non-decomposed subsystem $SS_{i}(i_{s})$ of the system
decomposition $SD_{i}(i_{s})$, and further divide it into two independent subsystems $SS^{1}_{i}(i_{s})$ and
$SS^{2}_{i}(i_{s})$, different for each $i_{s}$. $SD_{i+1}(i_{s})$ is then defined as the system decomposition of the total system
that separates the subsystems $SS^{1}_{i}(i_{s})$ and $SS^{2}_{i}(i_{s})$ independently, in addition to the previous
decomposition $SD_{i}(i_{s})$.
(3) Go to the next step i + 1 → i + 2.

The value of |Seq| corresponds to the number of different sequences generated by this algorithm. For
example, |Seq| = 3 and |Seq| = 18 hold for n = 3 and n = 4, respectively. The general analytical form
$|Seq|_{n}$ of |Seq| with system size n is obtained as the following recurrence formula:

$$|Seq|_{n} = \sum_{i=1}^{\lfloor n/2 \rfloor} {}_{n}C_{i}\, |Seq|_{n-i}\, |Seq|_{i}, \tag{34}$$

where ⌊·⌋ is the floor function, with the formal definition $|Seq|_{1} := 1$.
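The algorithmic procedure above can be sketched by direct enumeration. The code below is an illustrative implementation, not the authors' program: it represents a system decomposition as a set of blocks, generates every sequence of successive binary splits, and labels each decomposition in the < · · · > notation by giving every node the smallest node number of its subsystem. All function names are assumptions.

```python
from itertools import combinations

def whole(n):
    """The undecomposed system <11...1> as a partition with a single block."""
    return frozenset([frozenset(range(n))])

def binary_splits(block):
    """All unordered splits of a block into two non-empty parts."""
    items = sorted(block)
    first, rest = items[0], items[1:]
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            part = frozenset((first,) + extra)
            yield part, frozenset(block) - part

def sequences(partition):
    """Every sequence from `partition` down to the total decomposition."""
    splittable = [b for b in partition if len(b) > 1]
    if not splittable:
        yield [partition]
    for block in splittable:
        for a, b in binary_splits(block):
            for tail in sequences((partition - {block}) | {a, b}):
                yield [partition] + tail

def label(partition, n):
    """<...> notation: each node carries the smallest node number of its block."""
    owner = {i: min(b) for b in partition for i in b}
    return "<" + "".join(str(owner[i] + 1) for i in range(n)) + ">"

seqs3 = list(sequences(whole(3)))
seqs4 = list(sequences(whole(4)))
labels4 = {label(part, 4) for seq in seqs4 for part in seq}

print(len(seqs3), len(seqs4), len(labels4))
for seq in seqs3:
    print(" -> ".join(label(part, 3) for part in seq))
```

The enumeration gives 3 sequences for n = 3 and 18 for n = 4, and the decompositions visited for n = 4 are the 15 elements of SD in Equation (31).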
The products of KL-divergences along the hierarchical sequences of system decompositions
in Equation (32) are related to the volumes of (n − 1)-dimensional cuboids in the circle diagram. An
example for n = 4 is presented in Figure 5, where two cuboids with 3 orthogonal edges are depicted for the
different decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and
< 1111 >→< 1133 >→< 1134 >→< 1234 >, whose cuboid volumes are

$$\sqrt{D[\langle 1111 \rangle : \langle 1222 \rangle]\, D[\langle 1222 \rangle : \langle 1233 \rangle]\, D[\langle 1233 \rangle : \langle 1234 \rangle]}, \tag{35}$$

and

$$\sqrt{D[\langle 1111 \rangle : \langle 1133 \rangle]\, D[\langle 1133 \rangle : \langle 1134 \rangle]\, D[\langle 1134 \rangle : \langle 1234 \rangle]}, \tag{36}$$

respectively.


In the same way as for Cr, we took the arithmetic average of the cuboid volumes in the definition of Cc
in order to renormalize the combinatorial increase of the decomposition paths (|Seq|) with the
system size n.
Note that, on the other hand, we did not renormalize the rectangle-bias complexity Cr and the
cuboid-bias complexity Cc by taking the exact geometric mean of each product of KL-divergences,
such as $\sqrt[n-1]{\prod_{i=1}^{n-1} D[SD_{i}(i_{s}) : SD_{i+1}(i_{s})]}$. This is for better accessibility to theoretical analysis such as the
variational method (see the "Further Consideration" section), and does not change the qualitative behavior
of Cr and Cc since the power root is a monotonically increasing function. This treatment can be
interpreted as taking the (n − 1)-th power of the geometric means for the hierarchical sequences of
KL-divergences.
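The definition of Cc in Equation (32) can be sketched numerically as follows. The code assumes that every system decomposition is realized as the product of its subsystems' marginal distributions, and enumerates the sequences by successive binary splits; the function names and the random distribution are illustrative assumptions, and nodes are indexed from 0.

```python
from itertools import combinations
import numpy as np

def binary_splits(block):
    """All unordered splits of a block into two non-empty parts."""
    items = sorted(block)
    first, rest = items[0], items[1:]
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            part = frozenset((first,) + extra)
            yield part, frozenset(block) - part

def sequences(partition):
    """Every sequence from `partition` down to the total decomposition."""
    splittable = [b for b in partition if len(b) > 1]
    if not splittable:
        yield [partition]
    for block in splittable:
        for a, b in binary_splits(block):
            for tail in sequences((partition - {block}) | {a, b}):
                yield [partition] + tail

def block_product(p, partition):
    """Product of the block marginals of p, i.e. the decomposed system."""
    n = p.ndim
    q = np.ones_like(p)
    for block in partition:
        other = tuple(i for i in range(n) if i not in block)
        q = q * (p.sum(axis=other, keepdims=True) if other else p)
    return q

def kl(p, q):
    """KL-divergence D[p : q] for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def sequence_divergences(p):
    """D[SD_i : SD_{i+1}] along every decomposition sequence."""
    start = frozenset([frozenset(range(p.ndim))])
    result = []
    for seq in sequences(start):
        models = [block_product(p, part) for part in seq]
        result.append([kl(a, b) for a, b in zip(models, models[1:])])
    return result

def cuboid_bias_complexity(p):
    """Eq. (32): average over sequences of the product of KL-divergences."""
    return float(np.mean([np.prod(d) for d in sequence_divergences(p)]))

# Hypothetical strictly positive joint distribution of 4 binary variables.
rng = np.random.default_rng(5)
p = rng.random((2, 2, 2, 2)) + 0.1
p /= p.sum()

divs = sequence_divergences(p)
I = sum(divs[0])                   # identical along every sequence
Cc = cuboid_bias_complexity(p)
print(len(divs), I, Cc)
```

Since the n − 1 divergences of each sequence sum to I, the inequality of arithmetic and geometric means bounds every product by (I/(n − 1))^(n−1), which is the isoperimetric picture used in the text.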
A more comprehensive example of the utility of the cuboid-bias complexity Cc compared with
the rectangle-bias complexity Cr is shown in Figure 6. We consider 6-node networks (n = 6) with the
same I and Cr values but different heterogeneity. The system in the top left figure has a circularly
connected structure with medium intensity, while that in the top right figure has 3 strongly connected
subsystems. These systems have five qualitatively different ways of system decomposition, which are
the basic generators of all hierarchical sequences Seq = {SD1(is) → · · · → SD5(is)|1 ≤ is ≤ |Seq|} for
these networks. The five basic system decompositions are indicated by the numbers ①, ②, ②′, ③ and
④ in the top figures.
The circle diagrams of these systems are depicted in the middle figures. To impose the same
constant value of Cr on both systems, the following condition is satisfied in the middle right figure:
D[< 111111 >: ②] < D[< 111111 >: ① in the middle left figure] < D[< 111111 >: ①] <
D[< 111111 >: ② in the middle left figure] < D[< 111111 >: ③] < D[< 111111 >: ④]. Furthermore, the
total surface of the right triangles sharing the circle diameter as hypotenuse in the middle left and
middle right figures is conditioned to be identical; therefore, the rectangle-bias complexity Cr fails to
distinguish them.
On the other hand, under the same condition, the cuboid-bias complexity Cc distinguishes between
these two systems and gives a higher value to the left one. The volumes of the 5-dimensional cuboids of the
decomposition sequence $\langle 111111 \rangle \xrightarrow{①\,②\,②′\,③\,④} \langle 123456 \rangle$ are schematically shown in the bottom
figures, maintaining the quantitative difference between the KL-divergences. Since the multi-information
I is identical between the two systems, so is the value of the KL-divergence D[< 111111 >:< 123456 >],
which is the sum of the KL-divergences along the sequence $\langle 111111 \rangle \xrightarrow{①\,②\,②′\,③\,④} \langle 123456 \rangle$
by the Pythagorean theorem. This means that the inequality between the cuboid volumes can be
represented as the isoperimetric inequality of a high-dimensional cuboid. As a consequence, the left
system has a quantitatively higher value of Cc than the right one. The cuboid-bias complexity Cc is thus
sensitive to such heterogeneity as well.


Figure 6. Meaning of taking geometric mean over the sequence of system decomposition in cuboid-bias complexity
Cc. (a): Example of 6-node network with circularly connected structure with medium intensity. Edge
width is proportional to edge information; (b): Example of 6-node network with strongly connected
3 subsystems. Edge width is proportional to edge information. The multi-information I of the two
systems in the top figures is conditioned to be identical; the dotted lines schematically represent possible
system decompositions. (c,d): Circle diagrams of each system decomposition in upper networks; the
total surface of right triangles sharing the circle diameter as hypotenuse in (c) and (d) is conditioned to
be identical, therefore the rectangle-bias complexity Cr fails to distinguish them. (e,f): 5-dimensional cuboids
of upper networks (Figure 6a,b) whose edges are the square roots of the KL-divergences for the sequence of system
decomposition $\langle 111111 \rangle \xrightarrow{①\,②\,②′\,③\,④} \langle 123456 \rangle$. Only the first 3-dimensional part is shown with
solid line, and the remaining 2-dimensional part is represented with dotted line. The volume of the
cuboid in (e) is larger than the one in (f), according to the isoperimetric inequality of high-dimensional
cuboid. The total squared length of each side is identical between the two cuboids, which represents the
multi-information I = D[< 111111 >:< 123456 >].

8. Regularized Cuboid-Bias Complexity with Respect to Generalized Mutual Information

We further consider the geometrical composition of system decompositions in the circle diagram
and argue for the necessity of renormalizing the cuboid-bias complexity Cc with the multi-information I,
which gives another measure of complexity, namely the "regularized cuboid-bias complexity $C^{R}_{c}$".
We consider the situation in actual data where the multi-information I varies. Figure 7 shows
the n = 3 cases that Cc fails to distinguish. Both the blue and red systems are supposed to have
the same Cc value, obtained by adjusting the red system to have relatively smaller values of the KL-divergences
D[< 111 >:< 122 >] and D[< 113 >:< 123 >] than the blue one. Such conditioning is possible since
the KL-divergences are parameters independent of each other.

Figure 7. Examples of the 3-node systems with identical cuboid-bias complexity Cc but different
multi-information I on circle graph. (a): System with smaller I but larger $C^{R}_{c}$; (b): System with larger I but
smaller $C^{R}_{c}$; (c): Superposition of the above two systems. The regularized cuboid-bias complexity $C^{R}_{c}$
distinguishes between the blue and red systems.

Although the Cc value is identical, the two systems have different geometrical compositions of
system decompositions in the circle diagram. The red system has a relatively easier way of decomposition,
< 111 >→< 122 >, if renormalized with the total system decomposition < 111 >→< 123 >. This
relative decompositionability with respect to the renormalization with the multi-information I can
be clearly understood by superimposing the circle diagrams of the two systems and comparing the
angles between each decomposition path and the total decomposition path (bottom figure). The red system has a larger angle
between the decomposition paths < 111 >→< 122 > and < 111 >→< 123 > than any other in the
blue system, which represents the relative ease of the decomposition under renormalization with I.
In these terms, the paths < 111 >→< 121 > in the red and blue systems do not differ in relative ease,
and the paths < 111 >→< 113 > are easier in the blue system.
To express the system decompositionability based on these geometrical compositions in a
comprehensive manner, we define the regularized cuboid-bias complexity $C^{R}_{c}$ as follows:

$$
C^{R}_{c} := \frac{1}{|Seq|} \sum_{i_{s}=1}^{|Seq|} \prod_{i=1}^{n-1} \frac{D[SD_{i}(i_{s}) : SD_{i+1}(i_{s})]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]}
= \frac{C_{c}}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}}
= \frac{C_{c}}{I^{n-1}}.
\tag{37}
$$

The red system then has a quantitatively smaller $C_c^R$ value than the blue system in Figure 7.
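The effect of the regularization in Equation (37) can be illustrated with a minimal Python sketch. The step-wise KL-divergence values below are hypothetical numbers chosen for illustration (they are not taken from Figure 7), and the function names are assumptions of this sketch. Each inner list holds the n − 1 divergences D[SD_i : SD_{i+1}] along one decomposition sequence of a 3-node system.

```python
import math


def cuboid_bias_complexity(step_divergences):
    """Arithmetic mean over sequences of the product of step-wise KL-divergences."""
    products = [math.prod(steps) for steps in step_divergences]
    return sum(products) / len(products)


def regularized_cuboid_bias_complexity(step_divergences, multi_information):
    """Divide C_c by I^(n-1), where n-1 is the number of steps per sequence."""
    n_minus_1 = len(step_divergences[0])
    return cuboid_bias_complexity(step_divergences) / multi_information ** n_minus_1


# Hypothetical values: every sequence sums to the multi-information I.
system_a = [[0.30, 0.20], [0.35, 0.15], [0.45, 0.05]]  # I = 0.5
system_b = [[0.40, 0.20], [0.55, 0.05], [0.55, 0.05]]  # I = 0.6

print(cuboid_bias_complexity(system_a), cuboid_bias_complexity(system_b))
print(regularized_cuboid_bias_complexity(system_a, 0.5),
      regularized_cuboid_bias_complexity(system_b, 0.6))
```

Under these assumed values both systems share $C_c$ = 0.045 up to rounding, while the regularized values differ (about 0.18 for the system with smaller I and about 0.125 for the one with larger I), which is the kind of distinction described for Figure 7.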

9. Modular Complexity with Respect to the Easiest System Decomposition Path

We have so far considered the system decompositionability with respect to all possible
decomposition sequences. This was also a way to keep local fluctuations of the network
heterogeneity, reflected in specific decomposition paths, from dominating the measure. On the other hand, the easiest
decomposition is particularly important when considering the modularity of the system. If there exists
a hierarchical structure of modularity at different scales with different degrees of coherence, the
KL-divergences and the sequence of the easiest decomposition carry much information.


Figure 8 schematically shows a typical example with two levels of modularity. Such a
structure with different scales of statistical coherence appears as functional segregation in neural
systems [17], and is expected to be observed widely in complex systems.
The hierarchical topology of the easiest decomposition path reflects these structures. For
example, in the system of Figure 8, the decompositions between < 1 1 · · · 1 > and
< 1 1 1 1 5 5 5 5 9 9 9 9 13 13 13 13 > are easier than those inside the 4-node subsystems. The values of
KL-divergence also reflect the hierarchy, giving relatively low values for the decompositions between
the 4-node subsystems, and high values inside them. By examining the shortest decomposition
path and the associated KL-divergences among the possible Seq, one can reveal the hierarchical structure of the
modularity existing in the system.

Figure 8. Example of a 16-node system < 11 · · · 1 > with different levels of modularity. The four 4-node
subsystems < 1111 > (blue blocks) are loosely connected and easy to decompose, while the inside of each
component (red blocks) is tightly connected. The degree of connection represents statistical dependency
or edge information between subsystems. Such a hierarchical structure can be detected by observing the
decomposition path of the modular complexity $C_m$.

For this reason, we define the modular complexity $C_m$ as follows, which is the shortest path
component of the cuboid-bias complexity $C_c$:

$$
C_m := \prod_{i=1}^{n-1} D[SD_i(i_{\min}) : SD_{i+1}(i_{\min})], \tag{38}
$$

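where the index $i_{\min}$ of the sequence $SD_1(i_{\min}) \to SD_2(i_{\min}) \to \cdots \to SD_n(i_{\min})$ is chosen as follows:

$$
i_{\min} = \{i_1\} \cap \{i_2\} \cap \cdots \cap \{i_{n-1}\}, \tag{39}
$$

where

$$
\begin{aligned}
\{i_1\} &= \underset{i_s}{\operatorname{argmin}} \, \{ D[SD_1(i_s) : SD_2(i_s)] \mid 1 \le i_s \le |Seq| \}, \\
\{i_2\} &= \underset{i_1}{\operatorname{argmin}} \, \{ D[SD_2(i_1) : SD_3(i_1)] \mid i_1 \in \{i_1\} \}, \\
&\;\;\vdots \\
\{i_{n-1}\} &= \underset{i_{n-2}}{\operatorname{argmin}} \, \{ D[SD_{n-1}(i_{n-2}) : SD_n(i_{n-2})] \mid i_{n-2} \in \{i_{n-2}\} \},
\end{aligned} \tag{40}
$$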


which eventually gives

$$
i_{\min} = i_{n-1}. \tag{41}
$$

This means that, beginning from the undecomposed state < 11 · · · 1 >, we repeatedly choose
the shortest decomposition path in the next hierarchy of system decomposition. The minimization of
the path length is guaranteed by the sequential minimization, since the geometric mean of an isometric
path division is bounded below by its minimum component. The index $i_{\min}$ is unique if the system is completely
heterogeneous (i.e., $D[SD_1(i_k) : SD_2(i_k)] \neq D[SD_1(i_l) : SD_2(i_l)]$ for $1 \le i_k < i_l \le |Seq|$); otherwise, multiple
decomposition paths that give the same $C_m$ value are possible, according to the homogeneity of the
system. Besides its value, the modular complexity $C_m$ should be used together with the sequence information
of the shortest decomposition path to evaluate the modularity structure of a system.
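The sequential selection of Equations (38)–(41) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the table of step-wise KL-divergences is hypothetical (a 4-node system with only three of its sequences listed, and no check that the numbers are realizable by an actual distribution), and the function name `modular_complexity` is introduced here for illustration only.

```python
import math


def modular_complexity(step_divergences):
    """Sequential argmin of Equation (40), then the product of Equation (38).

    step_divergences[s][i] is D[SD_{i+1}(s) : SD_{i+2}(s)] for sequence s.
    Returns (C_m, list of indices of the selected sequences).
    """
    candidates = list(range(len(step_divergences)))
    for step in range(len(step_divergences[0])):
        smallest = min(step_divergences[s][step] for s in candidates)
        # Keep only sequences attaining the minimum at this hierarchy level.
        candidates = [s for s in candidates if step_divergences[s][step] == smallest]
    return math.prod(step_divergences[candidates[0]]), candidates


# Hypothetical sequences; the first two share the same first decomposition.
sequences = [
    [0.2, 0.3, 0.1],
    [0.2, 0.1, 0.3],
    [0.5, 0.05, 0.05],
]
c_m, i_min = modular_complexity(sequences)
print(c_m, i_min)
```

In this example the first step keeps sequences 0 and 1, the second step keeps only sequence 1, and the returned index list is what the text calls the sequence information to be reported together with the $C_m$ value. When several sequences remain after the last step, the list contains all of them.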
Cases where $C_m$ is identical but $C_c$ differs can be constructed by varying the system
decompositions other than those in the shortest path $SD_1(i_{\min}) \to SD_2(i_{\min}) \to \cdots \to SD_n(i_{\min})$ without
modifying the index $i_{\min}$. Conversely, there also exist examples with identical $C_c$ and different $C_m$, due to
the complementarity between $C_m$ and $C_c$.
We finally define the regularized modular complexity $C_m^R$ as follows, for the same reason as for defining
$C_c^R$ from $C_c$:

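$$
C_m^R := \prod_{i=1}^{n-1} \frac{D[SD_i(i_{\min}) : SD_{i+1}(i_{\min})]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]}
= \frac{C_m}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}}
= \frac{C_m}{I^{n-1}}. \tag{42}
$$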

Proposition 4. The cuboid-bias complexities $C_c$ and $C_c^R$ are bounded by the modular complexities $C_m$ and $C_m^R$,
respectively:

$$
C_c \le C_m, \tag{43}
$$

$$
C_c^R \le C_m^R. \tag{44}
$$

They coincide at the maximum values under the given multi-information I:

$$
\max\{C_m \mid I = \mathrm{const.}\} = \max\{C_c \mid I = \mathrm{const.}\}, \tag{45}
$$

$$
\max\{C_m^R\} = \max\{C_c^R\}. \tag{46}
$$

The relations (43)–(46) are shown numerically in the “Numerical Comparison” section.
The modular complexities are the larger ones because of the hierarchical dependency of the
KL-divergence values along decomposition paths. In the shortest decomposition path defining the modular
complexities, the easier system decompositions take relatively larger values since they involve a larger
number of edge cuttings. Since we eventually cut all edges to obtain < 12 · · · n > at the end of the
decomposition sequence, collecting the edges with relatively weak edge information and cutting them
together increases the product of KL-divergences. The modular complexities are then
the maximum components among the possible decomposition paths over which the cuboid-bias
complexities are calculated:

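$$
C_m = \max \left\{ \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)] \;\middle|\; 1 \le i_s \le |Seq| \right\}, \tag{47}
$$

$$
C_m^R = \max \left\{ \frac{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}} \;\middle|\; 1 \le i_s \le |Seq| \right\}. \tag{48}
$$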
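As a small numerical illustration of Equations (43) and (47), the following Python sketch evaluates the three decomposition sequences of a 3-node binary system. It is only a check on one randomly generated distribution for n = 3, not a proof; the helper names (`marginal`, `decompose`, `kl`) and the random seed are assumptions of this sketch, and system decompositions are taken as products of block marginals, as described in the text.

```python
import itertools
import math
import random


def marginal(p, idx):
    """Marginal distribution over the variables listed in idx."""
    m = {}
    for x, px in p.items():
        key = tuple(x[i] for i in idx)
        m[key] = m.get(key, 0.0) + px
    return m


def decompose(p, blocks):
    """System decomposition: product of the marginals of the given blocks."""
    margs = [marginal(p, b) for b in blocks]
    return {x: math.prod(m[tuple(x[i] for i in b)] for m, b in zip(margs, blocks))
            for x in p}


def kl(p, q):
    """KL-divergence D[p : q] in nats."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


random.seed(0)
states = list(itertools.product((0, 1), repeat=3))
weights = [random.random() + 0.05 for _ in states]
total = sum(weights)
p = {x: w / total for x, w in zip(states, weights)}      # <111>

independent = decompose(p, [(0,), (1,), (2,)])            # <123>
bipartitions = {"<122>": [(0,), (1, 2)],
                "<121>": [(1,), (0, 2)],
                "<113>": [(2,), (0, 1)]}
sequences = {}
for name, blocks in bipartitions.items():
    sd2 = decompose(p, blocks)
    sequences[name] = [kl(p, sd2), kl(sd2, independent)]

products = {name: math.prod(steps) for name, steps in sequences.items()}
c_c = sum(products.values()) / len(products)              # mean over sequences
easiest = min(sequences, key=lambda name: sequences[name][0])
c_m = products[easiest]                                   # easiest first step
print(c_c, c_m, max(products.values()))
```

For n = 3 only the first step needs to be selected, since each bipartition has a single continuation to < 123 >. In this run the product along the easiest path coincides with the largest product over all sequences, so the mean $C_c$ does not exceed $C_m$.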


The difference between the cuboid-bias complexities and the modular complexities is an index of
the geometrical variation of decomposed systems in the circle graph, which reflects the fluctuation of
the sequence-wise system decompositionability. If the system decompositionability varies strongly
from one system decomposition to another, the modular complexities tend to give higher values
than the cuboid-bias complexities.

10. Numerical Comparison

We numerically investigate the complementarity between the proposed complexities $C_c$, $C_c^R$, $C_m$,
and $C_m^R$. Since the minimum node number giving non-trivial meaning to these measures is n = 4, the
corresponding dimension of the parameter space is $\sum_{k=1}^{n} \binom{n}{k} = 2^4 - 1 = 15$. The constant-complexity submanifolds
are therefore difficult to visualize due to the high dimensionality. For simplicity, we focus on a
2-dimensional subspace of this parameter space whose first axis ranges from random to maximum
dependency of the system, and whose second axis represents the system decompositionability of
< 1133 >.
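For this purpose, we introduce the following parameters α and β (0 ≤ α, β ≤ 1) in the η-coordinates
of the discrete distribution of a 4-dimensional binary stochastic variable:

$$
\begin{aligned}
\eta_1 &= \eta_0, \\
\eta_2 &= \eta_0, \\
\eta_3 &= \eta_0, \\
\eta_4 &= \eta_0, \\
\eta_{1,2} &= \eta_1 \eta_2 + \alpha(\eta_0 - \epsilon - \eta_1 \eta_2), \\
\eta_{3,4} &= \eta_3 \eta_4 + \alpha(\eta_0 - \epsilon - \eta_3 \eta_4), \\
\eta_{1,3} &= \eta_1 \eta_3 + \alpha\beta(\eta_0 - \epsilon - \eta_1 \eta_3), \\
\eta_{1,4} &= \eta_1 \eta_4 + \alpha\beta(\eta_0 - \epsilon - \eta_1 \eta_4), \\
\eta_{2,3} &= \eta_2 \eta_3 + \alpha\beta(\eta_0 - \epsilon - \eta_2 \eta_3), \\
\eta_{2,4} &= \eta_2 \eta_4 + \alpha\beta(\eta_0 - \epsilon - \eta_2 \eta_4), \\
\eta_{1,2,3} &= \eta_{1,2} \eta_3 + \alpha\beta(\eta_0 - 2\epsilon - \eta_{1,2} \eta_3), \\
\eta_{1,2,4} &= \eta_{1,2} \eta_4 + \alpha\beta(\eta_0 - 2\epsilon - \eta_{1,2} \eta_4), \\
\eta_{1,3,4} &= \eta_1 \eta_{3,4} + \alpha\beta(\eta_0 - 2\epsilon - \eta_1 \eta_{3,4}), \\
\eta_{2,3,4} &= \eta_2 \eta_{3,4} + \alpha\beta(\eta_0 - 2\epsilon - \eta_2 \eta_{3,4}), \\
\eta_{1,2,3,4} &= \eta_{1,2} \eta_{3,4} + \alpha\beta(\eta_0 - 3\epsilon - \eta_{1,2} \eta_{3,4}).
\end{aligned} \tag{49}
$$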

Here α represents the degree of statistical association from random (α = 0) to maximum (α = 1),
and β controls the system decompositionability of < 1133 >. If β = 1, the system has the maximum
KL-divergence D[< 1111 >:< 1133 >] under the constraint of the α parameter, and β = 0 gives
D[< 1111 >:< 1133 >] = 0.
ϵ is the minimum value of the joint distribution of the 4-dimensional variable, which is defined to be
greater than 0 to avoid singularity in the dually flat coordinates of the statistical manifold. The values
$\epsilon = 1.0 \times 10^{-10}$ and $\eta_0 = 0.5$ were chosen for the calculation.


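The system with maximum statistical association under given $\eta_0$ corresponds to the α = β = 1
condition in the given parameters, whose η-coordinates become as follows:

$$
\begin{aligned}
\eta_1 &= \eta_0, \\
&\;\;\vdots \\
\eta_4 &= \eta_0, \\
\eta_{1,2} &= \eta_0 - \epsilon, \\
&\;\;\vdots \\
\eta_{3,4} &= \eta_0 - \epsilon, \\
\eta_{1,2,3} &= \eta_0 - 2\epsilon, \\
&\;\;\vdots \\
\eta_{2,3,4} &= \eta_0 - 2\epsilon, \\
\eta_{1,2,3,4} &= \eta_0 - 3\epsilon.
\end{aligned} \tag{50}
$$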

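On the other hand, the totally decomposed system corresponds to the α = 0 condition, and the
η-coordinates are:

$$
\begin{aligned}
\eta_1 &= \eta_0, \\
&\;\;\vdots \\
\eta_4 &= \eta_0, \\
\eta_{1,2} &= \eta_0 \eta_0, \\
&\;\;\vdots \\
\eta_{3,4} &= \eta_0 \eta_0, \\
\eta_{1,2,3} &= \eta_0 \eta_0 \eta_0, \\
&\;\;\vdots \\
\eta_{2,3,4} &= \eta_0 \eta_0 \eta_0, \\
\eta_{1,2,3,4} &= \eta_0 \eta_0 \eta_0 \eta_0.
\end{aligned} \tag{51}
$$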

Note that the completely deterministic case $\eta_0 = 1.0$ and α = β = 1 gives I = 0.
The intuitive meaning of the parameters α and β is also schematically depicted in Figure 9d.
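The parameterization of Equation (49) and its two limiting cases, Equations (50) and (51), can be written out as a short Python sketch. The function name `eta_coordinates`, the helper `mix`, and the dictionary layout keyed by node tuples are assumptions made for this illustration; the default values of η0 and ϵ are the ones stated in the text.

```python
def eta_coordinates(alpha, beta, eta0=0.5, eps=1.0e-10):
    """Return the eta-coordinates of Equation (49) as a dict keyed by node tuples."""

    def mix(independent, weight, order):
        # Interpolate from the independent product towards eta0 - (order - 1) * eps.
        target = eta0 - (order - 1) * eps
        return independent + weight * (target - independent)

    eta = {(i,): eta0 for i in (1, 2, 3, 4)}
    for i, j in [(1, 2), (3, 4)]:                  # pairs inside {1,2} and {3,4}
        eta[(i, j)] = mix(eta[(i,)] * eta[(j,)], alpha, 2)
    for i, j in [(1, 3), (1, 4), (2, 3), (2, 4)]:  # pairs across the two blocks
        eta[(i, j)] = mix(eta[(i,)] * eta[(j,)], alpha * beta, 2)
    eta[(1, 2, 3)] = mix(eta[(1, 2)] * eta[(3,)], alpha * beta, 3)
    eta[(1, 2, 4)] = mix(eta[(1, 2)] * eta[(4,)], alpha * beta, 3)
    eta[(1, 3, 4)] = mix(eta[(1,)] * eta[(3, 4)], alpha * beta, 3)
    eta[(2, 3, 4)] = mix(eta[(2,)] * eta[(3, 4)], alpha * beta, 3)
    eta[(1, 2, 3, 4)] = mix(eta[(1, 2)] * eta[(3, 4)], alpha * beta, 4)
    return eta


maximal = eta_coordinates(alpha=1.0, beta=1.0)      # limiting case of Equation (50)
independent = eta_coordinates(alpha=0.0, beta=1.0)  # limiting case of Equation (51)
split = eta_coordinates(alpha=1.0, beta=0.0)        # blocks {1,2} and {3,4} independent
print(maximal[(1, 2, 3, 4)], independent[(1, 2, 3, 4)], split[(1, 3)])
```

With β = 0 every coordinate that mixes the two blocks reduces to a product of block coordinates, which is consistent with the statement that D[< 1111 >:< 1133 >] = 0 in that case.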


Figure 9. Contour plots of the complexity landscape of I, $C_c$, $C_m$, $C_c^R$, and $C_m^R$ on the α-β plane. (a) Superposition
of the contour plots of $C_c$ and $C_m$; (b) superposition of the contour plots of $C_c^R$ and $C_m^R$; (c) contour plot of I.
The colors of the contour plots correspond to the color gradients of the 3D plots in Figure 10; (d) schematic
representation of the system in different regions of the α-β plane. Edge width represents the degree of edge
information, and independence is depicted with a dotted line.

Figure 10 shows the landscape of the proposed complexities on the α-β plane. Their contour plots
are depicted in Figure 9. The proposed complexities differ from one another at almost every
point of the α-β plane, except on the intersection lines. These measures therefore serve as independent
features of the system, each with its specific meaning with respect to the system decompositionability.
The α-β plane shows a section of the actual structure of the complementarity between the proposed
complexity measures expressed in Figure 3.
The relations between the cuboid-bias complexities and the modular complexities in
Equations (43)–(46) are also numerically confirmed. The modular complexities are greater than or equal to
the corresponding cuboid-bias complexities, and they coincide at the parameter values α = β = 1, which give the
maximum values and dependencies in this parameterization.


Figure 10. Landscape of the complexities I, $C_c$, $C_m$, $C_c^R$, and $C_m^R$ on the α-β plane. (a) Multi-information I;
(b) cuboid-bias complexity $C_c$; (c) modular complexity $C_m$; (d) regularized cuboid-bias complexity
$C_c^R$; (e) regularized modular complexity $C_m^R$. All complexity measures show complementarity,
intersecting with each other, and all except the multi-information I satisfy the boundary conditions of
vanishing at α = 0 and β = 0. Note that the regularized complexities $C_c^R$ and $C_m^R$ show a singularity of convergence at
α → 0 due to the regularization by an infinitesimal value.

In the general case without the parameterization by α, β and $\eta_0$, the boundary conditions of $C_c$, $C_c^R$,
$C_m$ and $C_m^R$ include those of the multi-information I, which vanishes at the completely random or ordered
state. This is common to other complexity measures such as the LMC complexity, and fits the basic

intuition that complexity is situated equally far from the completely predictable and the completely
disordered states [21,22].
The proposed complexities further incorporate boundary conditions that vanish with the existence
of a completely independent subsystem of any size. This means that the $C_c$, $C_c^R$, $C_m$ and $C_m^R$ of a system
become 0 if we add another independent variable. This property does not match the intuition of
complexity defined by the arithmetic average of statistical measures. The proposed complexities
find their meaning best in comparison with other complexity measures such as the multi-information
I, and by interactively changing the system scale to avoid trivial results caused by small independent
subsystems. For example, the proposed complexities could be used as information criteria
for model selection problems, especially with an approximate modular structure based on the
statistical independence of data between subsystems. We emphasize that the complementarity principle
between multiple complexity measures with different foundations is the key to understanding complexity
in a comprehensive manner.
To characterize the properties of $C_c$, $C_c^R$, $C_m$ and $C_m^R$ in relation to the diverse composition of each
system decomposition, it is useful to consider the geometry of their contour structure, as compared
in Figure 9. The contours can be formalized as $C_c$, $C_c^R$, $C_m$, $C_m^R$ = const. for each complexity measure,
and $D[\langle 11 \cdots 1 \rangle : SD_i(i_s)]$ = const. ($1 \le i \le n - 1$, $1 \le i_s \le |Seq|$) for each system decomposition.
For that purpose, algebraic geometry is a promising tool. Algebraic
geometry investigates the geometrical properties of polynomial equations [23]. The complexities $C_c$, $C_c^R$,
$C_m$ and $C_m^R$ can be interpreted as polynomial functions by taking each system decomposition as a new
coordinate, and are therefore directly accessible to algebraic geometry. However, if we want to investigate the
contours of the complexities on the p parameter space, the logarithmic function appears in the definition of the
KL-divergence; it is a transcendental function and lies outside the scope of algebraic
geometry. To make the p parameter space of information geometry compatible with
algebraic geometry, it suffices to describe the model by replacing the logarithmic functions with another n
variables such as q = log p, and then to intersect the result from algebraic geometry
on the coordinates (p, q) with the condition q = log p. The contours of $C_c$, $C_c^R$, $C_m$ and $C_m^R$ are also important
for exploring the use of these measures as a potential that interprets the dynamics of statistical association
as geodesics.

11. Further Consideration

11.1. Pythagorean Relations in System Decomposition and Edge Cutting

We now revisit the system decomposition and edge cutting in terms of the Pythagorean
relation between KL-divergences, which is based on the orthogonality between the θ and η coordinates.
In system decomposition, the distribution of the decomposed system is analytically obtained from
the product of the subsystems’ η coordinates, which is equivalent to setting all $\theta^{dec}$ parameters to 0 in the mixture
coordinate $\xi^{dec}$. Since the $\theta^{dec}$ parameters in $\xi^{dec}$ are consistently 0 in all system decompositions,
we have the Pythagorean relation according to the inclusion relation of system decompositions. For
example, the following holds:

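$$
\begin{aligned}
D[\langle 1111 \rangle : \langle 1234 \rangle] = {} & D[\langle 1111 \rangle : \langle 1222 \rangle] \\
& + D[\langle 1222 \rangle : \langle 1233 \rangle] \\
& + D[\langle 1233 \rangle : \langle 1234 \rangle].
\end{aligned} \tag{52}
$$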

The proof proceeds in the same way as for the k-cut coordinates isolating the k-tuple statistical association between
variables [14].
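Equation (52) can be checked numerically with a short Python sketch. The randomly generated distribution, the seed, and the helper names (`marginal`, `decompose`, `kl`) are assumptions of this illustration; system decompositions are computed as products of block marginals, as described in the text.

```python
import itertools
import math
import random


def marginal(p, idx):
    """Marginal distribution over the variables listed in idx."""
    m = {}
    for x, px in p.items():
        key = tuple(x[i] for i in idx)
        m[key] = m.get(key, 0.0) + px
    return m


def decompose(p, blocks):
    """System decomposition: product of the marginals of the given blocks."""
    margs = [marginal(p, b) for b in blocks]
    return {x: math.prod(m[tuple(x[i] for i in b)] for m, b in zip(margs, blocks))
            for x in p}


def kl(p, q):
    """KL-divergence D[p : q] in nats."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


random.seed(1)
states = list(itertools.product((0, 1), repeat=4))
weights = [random.random() + 0.05 for _ in states]
total = sum(weights)
p = {x: w / total for x, w in zip(states, weights)}   # <1111>

sd_1222 = decompose(p, [(0,), (1, 2, 3)])
sd_1233 = decompose(p, [(0,), (1,), (2, 3)])
sd_1234 = decompose(p, [(0,), (1,), (2,), (3,)])

lhs = kl(p, sd_1234)
rhs = kl(p, sd_1222) + kl(sd_1222, sd_1233) + kl(sd_1233, sd_1234)
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding. In terms of mutual information, the three terms on the right-hand side are I(X1; X2X3X4), I(X2; X3X4) and I(X3; X4), whose sum is the multi-information by the chain rule.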
On the other hand, the edge cutting previously defined using the product of the remaining maximum
cliques’ η coordinates does not coincide with the $\theta^{ec} = 0$ condition in the mixture coordinates $\xi^{ec}$. We
have defined the $\eta^{ec}$ values of edge cutting based only on the orthogonal relation between η and θ

coordinates, by generalizing the rule of system decomposition in ηec coordinates, and did not consider
the Pythagorean relation between different edge cuttings.
It is then possible to define another way of edge cutting using the $\theta^{ec} = 0$ condition in $\xi^{ec}$. Indeed,
in the k-cut mixture coordinates, the $\theta_{k+} = 0$ condition is derived from the independence condition of the
variables in all orders, and the k-tuple statistical association is measured by reestablishing the η parameters
for the statistical association up to the (k − 1)-tuple order. In the same way, we can set the $\theta^{dec} = 0$ condition for
$\xi^{dec}$ of a system decomposition, and reestablish the edges with respect to the η parameters, except the one
in focus for edge cutting.
As a simple example, consider the system decomposition < 1222 > and the edge cutting 1 − 2 in a
4-node graph. We have the mixture coordinate $\xi^{dec}$ for the system decomposition as follows:

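$$
\begin{aligned}
\xi^{dec}_{1,2} &= \theta^{dec}_{1,2} = 0, \\
\xi^{dec}_{1,3} &= \theta^{dec}_{1,3} = 0, \\
\xi^{dec}_{1,4} &= \theta^{dec}_{1,4} = 0, \\
\xi^{dec}_{1,2,3} &= \theta^{dec}_{1,2,3} = 0, \\
\xi^{dec}_{1,3,4} &= \theta^{dec}_{1,3,4} = 0, \\
\xi^{dec}_{1,2,3,4} &= \theta^{dec}_{1,2,3,4} = 0,
\end{aligned} \tag{53}
$$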

where all the remaining $\xi^{dec}$ coordinates are equal to the corresponding η coordinates.
We then consider the new way of edge cutting 1 − 2 by recovering the statistical association in
edges 1 − 3 and 1 − 4 from the system decomposition < 1222 >, orthogonally to that of edge 1 − 2. The
new mixture coordinate $\xi^{EC}$ becomes the following:

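$$
\begin{aligned}
\xi^{EC}_{1,2} &= \theta^{EC}_{1,2} = 0, \\
\xi^{EC}_{1,3} &= \eta_{1,3}, \\
\xi^{EC}_{1,4} &= \eta_{1,4}, \\
\xi^{EC}_{1,2,3} &= \theta^{EC}_{1,2,3} = 0, \\
\xi^{EC}_{1,3,4} &= \eta_{1,3,4}, \\
\xi^{EC}_{1,2,3,4} &= \theta^{EC}_{1,2,3,4} = 0,
\end{aligned} \tag{54}
$$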

and the rest are equal to the corresponding η coordinates.
This new $\xi^{EC}$ is also compatible with the k-cut coordinate formalization owing to its simple $\theta^{EC} = 0$
conditions. To obtain $\xi^{EC}$ for an arbitrary edge cutting i − j, one should take the $\theta^{EC}$ containing i and j in
their subscripts, set them to 0, and combine them with the η coordinates for the remaining subscripts. For multiple
edge cuttings i − j, · · · , k − l (1 ≤ i, j, k, l ≤ n), it suffices to take the $\theta^{EC}$ containing i and j, ..., k and l in
their subscripts, respectively, and set them to 0.
We finally obtain the Pythagorean relation between edge cuttings. Denoting the coordinates of general edge
cutting(s) as $\xi^{i-j, \cdots, k-l}$, the following holds for the example of the system decomposition
< 1222 >:

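$$
\begin{aligned}
D[\langle 1111 \rangle : \langle 1222 \rangle] = {} & D[\langle 1111 \rangle : p(\xi^{1-2})] \\
& + D[p(\xi^{1-2}) : p(\xi^{1-2,1-3})] \\
& + D[p(\xi^{1-2,1-3}) : p(\xi^{1-2,1-3,1-4})].
\end{aligned} \tag{55}
$$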

Despite the consistency with the dual structure between θ and η, we do not generally have an
analytical solution to determine the $\eta^{EC}$ values from the $\theta^{EC} = 0$ conditions. We must resort to a
numerical algorithm to solve the $\theta^{EC} = 0$ conditions with respect to the $\eta^{EC}$ values, which are in general
high-degree simultaneous polynomial equations. Furthermore, numerical convergence of the solution has to be


very strict, since a tiny deviation from the conditions can become non-negligible after passing through the fractional
and logarithmic functions of the θ coordinates.
On the other hand, the previously defined edge cutting with $\xi^{ec}$, which uses the product of the
subgraphs’ η coordinates, is analytically simple and does not need to consider the recovery of the other edges
from the system decomposition or the independence hypothesis. We therefore chose the previous way of edge
cutting for both calculability and clarity of the concept.
There have been many attempts to approximate complex networks by low-dimensional systems
with the use of statistical physics and network theory. As a contemporary example, moment-closure
approximation provides various ways to abstract essential dynamics, e.g., in discrete adaptive
networks [24]. Although the approximation relies on several theoretical assumptions such as the random graph
approximation, it is difficult to quantitatively reproduce the dynamics even in some of the simplest models.
This is partly due to the homogeneous treatment of statistics, such as truncation at the pair-wise order. The
edge cutting can offer a complementary view on the evaluation of moment-closure approximations.
Using the orthogonal decomposition of edge information, one can evaluate which part of the network
links and which order of statistics contain essential information, which does not necessarily conform to
the top-down theoretical treatment.

11.2. Complexity of the Systems with Continuous Phase Space

We have developed the concept of system decompositionability based on discrete binary variables.
One can also apply the same principle to continuous variables.
For an ergodic map G : X → X in a continuous space X, the KS entropy h(μ, G) is defined as the
supremum of the entropy rate with respect to all possible system decompositions A, when the invariant
measure μ exists:

$$
h(\mu, G) = \sup_{A} h(\mu, G, A), \tag{56}
$$

where A is a disjoint decomposition of X that consists of non-trivial sets $a_i$, whose total number is
n(A), defined as

$$
X = \bigcup_{i=1}^{n(A)} a_i, \tag{57}
$$

$$
a_i \cap a_j = \emptyset, \quad i \neq j, \quad 1 \le i, j \le n(A), \tag{58}
$$

which is the natural extension of system decomposition to continuous space.
The entropy rate h(μ, G, A) in Equation (56) is defined as

$$
h(\mu, G, A) = \lim_{n \to \infty} \frac{1}{n} H(\mu, A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{59}
$$

according to the entropy H(μ, A) based on the decomposition $A = \{a_i\}$,

$$
H(\mu, A) = - \sum_{i=1}^{n(A)} \mu(a_i) \ln \mu(a_i), \tag{60}
$$

and the product C = A ∨ B, defined as

$$
C = A \vee B = \{ c_i = a_j \cap b_k \mid 1 \le j \le n(A), \, 1 \le k \le n(B) \}. \tag{61}
$$
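Equations (60) and (61) can be illustrated on a finite sample space with a minimal Python sketch. The eight-point space with uniform measure and the function names `join` and `entropy` are assumptions chosen for illustration; empty intersections are dropped from the product so that all cells are non-trivial sets.

```python
import math


def join(partition_a, partition_b):
    """Product A ∨ B of Equation (61): all non-empty intersections a_j ∩ b_k."""
    return [a & b for a in partition_a for b in partition_b if a & b]


def entropy(mu, partition):
    """H(mu, A) of Equation (60), with mu given as a dict point -> mass."""
    total = 0.0
    for cell in partition:
        mass = sum(mu[x] for x in cell)
        if mass > 0:
            total -= mass * math.log(mass)
    return total


points = range(8)
mu = {x: 1 / 8 for x in points}                       # uniform measure (assumed)
parity = [frozenset(x for x in points if x % 2 == 0),
          frozenset(x for x in points if x % 2 == 1)]
halves = [frozenset(range(4)), frozenset(range(4, 8))]

refined = join(parity, halves)
print(len(refined), entropy(mu, parity), entropy(mu, refined))
```

Here the product of the two 2-cell decompositions has four cells of equal measure, so its entropy is ln 4, twice the entropy ln 2 of each decomposition taken alone.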


In a more general case, the topological entropy $h_T(G)$ is defined simply by the number of
subsystem elements obtained by decomposition with preimages, as follows, requiring neither ergodicity
nor the existence of an invariant measure μ:

$$
h_T(G) = \sup_{A} \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)). \tag{62}
$$

Topological entropy takes the supremum over the possible preimage divisions, in order to
measure the complexity in terms of the mixing degree of the orbits. For example, if the KS entropy
is positive, h(μ, G) > 0, the dynamics of G on an invariant set of the invariant measure μ is chaotic for
almost every initial condition. If the topological entropy is positive, $h_T(G) > 0$, the dynamics
of G contain chaotic orbits, but not necessarily an attracting chaotic invariant set, since $h_T(G) \ge h(\mu, G)$
and the KS entropy can be zero.
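The counting inside Equation (62) can be sketched in Python for one fixed decomposition. The example map (the doubling map G(x) = 2x mod 1 on [0, 1)), the two-cell decomposition at 1/2, the sample grid, and the function names are all assumptions chosen for illustration. The supremum over A is not searched, and counting the cells that contain at least one sample point gives in general only a lower bound on the number of cells.

```python
import math
from fractions import Fraction


def doubling_map(x):
    """Example map G(x) = 2x mod 1, evaluated exactly on rationals."""
    return (2 * x) % 1


def count_cells(n, samples):
    """Number of cells of A ∨ G^-1(A) ∨ ... ∨ G^-n+1(A) hit by the samples.

    A = {[0, 1/2), [1/2, 1)}; a cell is identified by the sequence of
    symbols visited by x, G(x), ..., G^(n-1)(x).
    """
    words = set()
    for x in samples:
        word = []
        for _ in range(n):
            word.append(0 if x < Fraction(1, 2) else 1)
            x = doubling_map(x)
        words.add(tuple(word))
    return len(words)


samples = [Fraction(k, 1024) for k in range(1024)]
estimates = {n: math.log(count_cells(n, samples)) / n for n in (2, 4, 6)}
print(estimates)
```

For this decomposition the number of cells doubles with each additional preimage, so every estimate equals ln 2, a positive value consistent with the chaotic character of the doubling map.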
Although these definitions are useful to characterize the existence of chaotic dynamics, the system
decompositionability is another property, representing a different aspect of the system complexity. It
concerns rather the existence of independent dynamical components, or the degree of orbit
localization between arbitrary system decompositions. We propose the following “geometric topological
entropy” $h_g(G)$, applying the same principle of taking the geometric product over all hierarchical
structures of the system decomposition A:

$$
h_g(G) := \prod_{\sigma(A) > 0} \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{63}
$$

where σ(A) > 0 means that only components of A with positive Lebesgue measure on X are taken.
This gives 0 if the preimage of a certain $a_i \in A$ is $a_i$ itself, meaning that there exists a subsystem $a_i$ whose
range is invariant under G and closed in itself. The system X can then be completely divided into $a_i$ and the
rest. This corresponds to the existence of an independent subsystem in the cuboid-bias and modular
complexities. In case such independent components do not exist, $h_g(G)$ still reflects the degree of orbit
localization for all possible system decompositions in a multiplicative manner. The condition σ(A) > 0
avoids trivial cases such as the existence of an unstable limit cycle, whose Lebesgue measure is 0.
A typical example giving $h_g(G) = 0$ is a map having independent ergodic components, such
as the Chirikov-Taylor map with an appropriate parameter [25].

12. Conclusions and Discussion

We have theoretically developed a framework to measure the degree of statistical association
existing between subsystems, as well as that represented by each edge of the graph representation.
We then reconsidered the problem of how to define complexity measures in terms of the construction
of a non-linear feature space. We defined a new type of complexity based on the geometric product of
KL-divergences representing the degree of system decompositionability. Different complexity measures,
including the newly proposed ones, were compared on a complementarity basis on the statistical manifold.
Applications of the presented theory can encompass a large field of complex systems and data science,
such as social networks, gene expression networks, neural activity, ecological databases, and any
kind of complex network with binary co-occurrence matrix data, e.g., [26–29]; databases: [30–34].
Continuous variables are also accessible by appropriate discretization of the information source with, e.g., the
entropy maximization principle.
In contrast to the arithmetic mean of information over the whole system, the geometric mean has not been
investigated sufficiently in the analysis of complex networks. In a different field, however, theoretical
ecology has already pointed out the importance of the geometric mean when considering the long-term
fitness of a species population in a randomly varying environment [35,36]. Long-term fitness refers
to the ecological complexity of a survival strategy under large stochastic fluctuations. Here, we can
find a useful analogy between the growth rate of a population in ecology and the spatio-temporal

propagation rate of information between subsystems in general. If we take an arbitrary subsystem
and consider the amount of information it can exchange with all other subsystems, the proposed
complexity measures with the geometric mean reflect the minimum amount among all possible
other subsystems, which cannot be distinguished with the arithmetic mean. The propagation rate of a
population in ecology and the information transmission in a complex network have a mathematically
analogous structure. In population ecology, the variance of the growth rate is crucial for evaluating the
long-term survival of the population. Even if the arithmetic mean of the growth rate is high, a large variance
leads to a low geometric mean even with a few situations of exceptionally low fitness,
which ecologically means the extinction of an entire species. In a stochastic network, the variance of system
decompositionability is essential for evaluating the amount of information shared between subsystems, or the
persistence of information in the entire network. Even if the multi-information I is high, a large heterogeneity
of edge information can lead to the informational isolation of a certain subsystem, which means the extinction
of its information. If such a subsystem is situated on a transmission pathway, information cannot
propagate across these nodes. Therefore, the proposed complexity measures $C_c$, $C_c^R$, $C_m$ and $C_m^R$
generally reflect the minimum rate of information propagation over the entire system,
without excepting isolated divisions.
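The sensitivity of the geometric mean to a single exceptionally small value, on which this analogy rests, can be made concrete with a few lines of Python. The growth-rate values below are hypothetical and serve only as an illustration.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))


# Hypothetical per-generation growth rates.
steady = [1.1] * 10                 # small variance
volatile = [1.5] * 9 + [0.01]       # one exceptionally bad generation

print(arithmetic_mean(steady), geometric_mean(steady))
print(arithmetic_mean(volatile), geometric_mean(volatile))
```

The volatile series has the higher arithmetic mean, but its geometric mean falls below 1, meaning that the population shrinks in the long run. In the same way, a single near-zero KL-divergence along a decomposition path pulls down a product-based complexity measure even when the multi-information is large.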
Some recent studies on adaptive networks focus on the evolution of network topology in response
to node activity, such as game-theoretic evolution of strategies [37], opinion dynamics on an evolving
network [38], epidemic spreading on an adaptive network [39], etc. Analysis of the coevolution
between variables and interactions in a network can capture important dynamical features of complex systems. The
newly proposed complexity measures can complement topological network analysis with
an analysis of statistical dynamics. In addition to the topological change of the network model (e.g., linking
dynamics in game theory, community structure of opinion networks, contact networks of epidemic
transmission), one can evaluate the emergent statistical association between variables, which does
not necessarily coincide with the network topology. An interesting feature of non-linear dynamics is the
unexpected correlation between distant variables, which is quantified as Tsallis entropy [40]. The
complementary relation between concrete interactions and the resulting statistical association can provide a
twofold methodology to characterize the coevolutionary dynamics of adaptive networks. Such a strategy
can promote integrated science from laboratory experiments to open-field in natura situations, where
actual multi-scale problems remain to be solved [41].
Arithmetic and geometric means can be integrated into a common formula called the generalized
mean [42]. Therefore, the proposed complexity measures with the geometric mean of KL-divergences are
an extension of preexisting complexity measures with mixture coordinates. Table 1 summarizes the
generalization of complexity measures in this article. Based on the k-cut coordinates ζ, the weighted
sum of KL-divergences representing the k-tuple order of statistical association yields complexity measures
with the (weighted) arithmetic mean, such as the multi-information I and the TSE complexity. On the other hand,
we showed that subsystem-wise correlation can also be isolated with the use of mixture coordinates,
namely the < · · · >-cut coordinates ξ. To quantify the heterogeneity of system decompositionability, we
generally took a weighted geometric mean of KL-divergences in $C_c$, $C_c^R$, $C_m$ and $C_m^R$. Here, the shortest
path selection of $C_m$ and $C_m^R$, and the regularization of $C_c^R$ and $C_m^R$ with respect to the multi-information I,
can be interpreted as the weight function of the geometric mean. This perspective leads to a definition
of a generalized class of complexity measures based on mixture coordinates and the generalized
mean of KL-divergences. The information discrepancy can also be generalized from the KL-divergence to the
Bregman divergence, providing access to the concept of multiple centroids in large stochastic data
analysis such as image processing [43]. The blank cells of Table 1 imply the possibility of
other complexity measures in this class. For example, the weighted geometric mean of KL-divergences
defined between k-cut coordinates is expected to yield complexity measures that are sensitive to
the heterogeneity of correlation orders. The weighted arithmetic mean of KL-divergences defined
between < · · · >-cut coordinates should be sensitive to the mean decompositionability of arbitrary
subsystems. Since these measures take analytically different forms on mixture coordinates and/or mean

functions, their derivatives do not coincide, which gives independent information about the system on
a complementary basis on the statistical manifold, as long as the number of complexity measures is
smaller than the number of degrees of freedom of the system.

Table 1. Classification of complexity measures with KL-divergence on mixture coordinates.

                       Generalized Mean of KL-Divergence
Mixture Coordinates    Arithmetic Mean       Geometric Mean
k-cut ζ                TSE complexity, I     (blank)
< · · · >-cut ξ        (blank)               $C_c$, $C_c^R$, $C_m$, $C_m^R$

Acknowledgments: This study was partially supported by CNRS, the long-term study abroad support program
of the University of Tokyo, and the French government (Promotion Simone de Beauvoir).

Conflicts of Interest: The author declares no conflict of interest.

References

1.
Boccalettia, S.; Latorab, V.; Morenod, Y.; Chavezf, M.; Hwang, D.U. Complex Networks: Structure and
Dynamics. Phys. Rep. 2006, 424, 175–308.
2.
Strogatz, S.H. Exploring Complex Networks. Nature 2001, 410, 268–276.
3.
Wasserman, S.; Faust, K. Social Network Analysis; Cambridge University Press: Cambridge, UK, 1994.
4.
Funabashi, M.; Cointet, J.P.; Chavalarias, D. Complex Network. In Studies in Computational Intelligence;
Springer: Berlin/Heidelberg, Germany, 2009; Volume 207, pp. 161–172.
5.
Badii, R.; Politi, A. Complexity: Hierarchical Structures and Scaling in Physics; Cambridge University Press:
Cambridge, UK, 2008.
6.
Lempel, A.; Ziv, J. On the Complexity of Finite Sequences. IEEE Trans. Inf. Theory 1976, 22, 75–81.
7.
Li, M.; Vitanyi, P. Texts in Computer Science. In An Introduction to Kolmogorov Complexity and Its Applications,
2nd ed.; Springer: Berlin/Heidelberg, Germany, 1997.
8.
Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 2006.
9.
Bennett, C. On the Nature and Origin of Complexity in Discrete, Homogeneous, Locally-Interacting Systems.
Found. Phys. 1986, 16, 585–592.
10.
Grassberger, P. Toward a Quantitative Theory of Self-Generated Complexity. Int. J. Theor. Phys. 1986,
25, 907–938.
11.
Crutchfield, J.P.; Feldman, D.P. Regularities Unseen, Randomness Observed: The Entropy Convergence
Hierarchy. Chaos 2003, 15, 25–54.
12.
Crutchfield, J.P. Inferring Statistical Complexity. Phys. Rev. Lett. 1989, 63, 105–108.
13.
Prichard, D.; Theiler, J. Generalized Redundancies for Time Series Analysis. Physica D 1995, 84, 476–493.
14.
Amari, S. Information Geometry on Hierarchy of Probability Distributions. IEEE Trans. Inf. Theory 2001,
47, 1701–1711.
15.
Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A Unifying Framework for Complexity Measures of Finite Systems;
Report 06-08-028; Santa Fe Institute: Santa Fe, NM, USA, 2006.
16.
MacKay, R.S. Nonlinearity in Complexity Science. Nonlinearity 2008, 21, T273–T281.
17.
Tononi, G.; Sporns, O.; Edelman, M. A Measure for Brain Complexity: Relating Functional Segregation and
Integration in the Nervous System. Proc. Natl. Acad. Sci. USA 1994, 91, 5033.
18.
Feldman, D.P.; Crutchfield, J.P. Measures of statistical complexity: Why? Phys. Lett. A 1998, 238, 244–252.
19.
Nakahara, H.; Amari, S. Information-Geometric Measure for Neural Spikes. Neural Comput. 2002, 14, 2269–2316.
20.
Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How Should Complexity Scale with System Size? Eur. Phys. J. B
2008, 63, 407–415.
21.
Feldman, D.P.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
22.
Lopez-Ruiz, R.; Mancini, H.; Calbet, X. A Statistical Measure of Complexity. Phys. Lett. A 1995, 209, 321–326.

23.
Hodge, W.; Pedoe, D. Methods of Algebraic Geometry; Cambridge Mathematical Library, Cambridge University
Press: Cambridge, UK, 1994; Volume 1–3.
24.
Demirel, G.; Vazquez, F.; Bohme, G.; Gross, T. Moment-closure Approximations for Discrete Adaptive
Networks. Physica D 2014, 267, 68–80.
25.
Fraser, G., Ed. The New Physics for the Twenty-First Century; Cambridge University Press: Cambridge, UK,
2006; p. 335.
26.
Scott, J. Social Network Analysis: A Handbook; SAGE Publications Ltd.: London, UK, 2000.
27.
Geier, F.; Timmer, J.; Fleck, C. Reconstructing Gene-Regulatory Networks from Time Series, Knock-Out Data,
and Prior Knowledge. BMC Syst. Biol. 2007, 1, doi:10.1186/1752-0509-1-11.
28.
Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple Neural Spike Train Data Analysis: State-of-the-Art and Future
Challenges. Nat. Neurosci. 2004, 7, 456–461.
29.
Yee, T.W. The Analysis of Binary Data in Quantitative Plant Ecology. Ph.D. Thesis, The University of
Auckland, New Zealand, 1993.
30.
Stanford Large Network Dataset Collection. Available online: http://snap.stanford.edu/data/ (accessed on
19 July 2014).
31.
BioGRID. Available online: http://thebiogrid.org/ (accessed on 19 July 2014).
32.
Neuroscience Information Framework. Available online: http://www.neuinfo.org/ (accessed on 19 July 2014).
33.
Global Biodiversity Information Facility. Available online: http://www.gbif.org/ (accessed on 19 July 2014).
34.
UCI Network Data Repository. Available online: http://networkdata.ics.uci.edu/index.php (accessed on 19
July 2014).
35.
Lewontin, R.C.; Cohen, D. On Population Growth in a Randomly Varying Environment. Proc. Natl. Acad.
Sci. USA 1969, 62, 1056–1060.
36.
Yoshimura, J.; Clark, C.W. Individual Adaptations in Stochastic Environments. Evol. Ecol. 1969, 5, 173–192.
37.
Wu, B.; Zhou, D.; Wang, L. Evolutionary Dynamics on Stochastic Evolving Networks for Multiple-Strategy
Games. Phys. Rev. E 2011, 84, 046111.
38.
Fu, F.; Wang, L. Coevolutionary Dynamics of Opinions and Networks: From Diversity to Uniformity.
Phys. Rev. E 2008, 78, 016104.
39.
Gross, T.; D’Lima, C.J.D.; Blasius, B. Epidemic Dynamics on an Adaptive Network. Phys. Rev. Lett. 2006, 96, 208701.
40.
Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. 1988, 52, 479–487.
41.
Quintana-Murci, L.; Alcais, A.; Abel, L.; Casanova, J.L. Immunology in natura: Clinical, Epidemiological
and Evolutionary Genetics of Infectious Diseases. Nat. Immunol. 2007, 8, 1165–1171.
42.
Hardy, G.; Littlewood, J.; Polya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1967; Chapter 3.
43.
Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).



entropy

Article
The Entropy-Based Quantum Metric

Roger Balian

Institut de Physique Théorique, CEA/Saclay, F-91191 Gif-sur-Yvette Cedex, France;
E-Mail: roger@balian.fr

Received: 15 May 2014; in revised form: 25 June 2014 / Accepted: 11 July 2014 /
Published: 15 July 2014

Abstract: The von Neumann entropy $S(\hat{D})$ generates in the space of quantum density matrices
$\hat{D}$ the Riemannian metric $ds^2 = -d^2S(\hat{D})$, which is physically founded and which characterises
the amount of quantum information lost by mixing $\hat{D}$ and $\hat{D} + d\hat{D}$. A rich geometric structure is
thereby implemented in quantum mechanics. It includes a canonical mapping between the spaces
of states and of observables, which involves the Legendre transform of $S(\hat{D})$. The Kubo scalar
product is recovered within the space of observables. Applications are given to equilibrium and
non-equilibrium quantum statistical mechanics. There the formalism is specialised to the relevant space of
observables and to the associated reduced states issued from the maximum entropy criterion, which
result from the exact states through an orthogonal projection. Von Neumann’s entropy specialises
into a relevant entropy. Comparison is made with other metrics. The Riemannian properties of the
metric $ds^2 = -d^2S(\hat{D})$ are derived. The curvature arises from the non-Abelian nature of quantum
mechanics; its general expression and its explicit form for q-bits are given, as well as geodesics.

Keywords: quantum entropy; metric; q-bit; information; geometry; geodesics; relevant entropy

1. A Physical Metric for Quantum States

Quantum physical quantities pertaining to a given system, termed “observables” $\hat{O}$, behave
as non-commutative random variables and are elements of a C*-algebra. We will consider below
systems for which these observables can be represented by $n$-dimensional Hermitean matrices in a
finite-dimensional Hilbert space $H$. In quantum (statistical) mechanics, the “state” of such a system
encompasses the expectation values of all its observables [1]. It is represented by a density matrix $\hat{D}$,
which plays the rôle of a probability distribution, and from which one can derive the expectation value
of $\hat{O}$ in the form

$$\langle \hat{O} \rangle = \operatorname{Tr} \hat{D}\hat{O} = (\hat{D}; \hat{O}) . \tag{1}$$

Density matrices should be Hermitean ($\langle \hat{O} \rangle$ is real for $\hat{O} = \hat{O}^\dagger$), normalised (the expectation
value of the unit observable is $\operatorname{Tr} \hat{D} = 1$) and non-negative (variances
$\langle \hat{O}^2 \rangle - \langle \hat{O} \rangle^2$ are non-negative). They depend on $n^2 - 1$ real parameters. If we keep aside the
multiplicative structure of the set of operators and focus on their linear vector space structure,
Equation (1) appears as a linear mapping of the space of observables onto real numbers. We can
therefore regard the observables and the density operators $\hat{D}$ as elements of two dual vector spaces,
and expectation values (1) appear as scalar products.
It is of interest to define a metric in the space of states. For instance, the distance between an
exact state $\hat{D}$ and an approximation $\hat{D}_{\rm app}$ would then characterise the quality of this approximation.
However, all physical quantities come out in the form (1) which lies astride the two dual spaces of
observables and states. In order to build a metric having physical relevance, we need to rely on another
meaningful quantity which pertains only to the space of states.

Entropy 2014, 16, 3878–3888; doi:10.3390/e16073878
www.mdpi.com/journal/entropy

We note at this point that quantum states are probabilistic objects that gather information about
the considered system. Then, the amount of missing information is measured by von Neumann’s
entropy

$$S(\hat{D}) \equiv -\operatorname{Tr} \hat{D} \ln \hat{D} . \tag{2}$$

Introduced in the context of quantum measurements, this quantity is identified with the
thermodynamic entropy when $\hat{D}$ is an equilibrium state. In non-equilibrium statistical mechanics,
it encompasses, in the form of “relevant entropy” (see Section 5 below), various entropies defined
through the maximum entropy criterion. It is also introduced in quantum computation. Alternative
entropies have been introduced in the literature, but they do not present all the distinctive and natural
features of von Neumann’s entropy, such as additivity and concavity.
As $S(\hat{D})$ is a concave function, and as it is the sole physically meaningful quantity apart from
expectation values, it is natural to rely on it for our purpose. We thus define [2] the distance $ds$ between
two neighbouring density matrices $\hat{D}$ and $\hat{D} + d\hat{D}$ as the square root of

$$ds^2 = -d^2S(\hat{D}) = \operatorname{Tr} d\hat{D}\, d\ln\hat{D} . \tag{3}$$

This Riemannian metric is of the Hessian form since the metric tensor is generated by taking second
derivatives of the function $S(\hat{D})$ with respect to the $n^2 - 1$ coordinates of $\hat{D}$. We may take for such
coordinates the real and imaginary parts of the matrix elements, or equivalently (Section 6) some linear
transform of these (keeping aside the norm $\operatorname{Tr} \hat{D} = 1$).
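As a numerical sanity check of definition (3), the following sketch compares the two expressions of $ds^2$ for a randomly drawn state. It is an illustration under stated assumptions: the helper names (`funm_herm`, `random_state`, `random_shift`), the dimension $n = 3$, the random seed and the step size are choices made here, and the differentials are replaced by central finite differences.

```python
import numpy as np

def funm_herm(a, f):
    """Apply a scalar function f to a Hermitian matrix via its eigen-decomposition."""
    w, u = np.linalg.eigh(a)
    return (u * f(w)) @ u.conj().T

def entropy(d):
    """Von Neumann entropy S(D) = -Tr D ln D, computed from the eigenvalues of D."""
    w = np.linalg.eigvalsh(d)
    return float(-np.sum(w * np.log(w)))

def random_state(n, rng):
    """Hypothetical helper: a full-rank n x n density matrix."""
    a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    d = a @ a.conj().T + np.eye(n)
    return d / np.trace(d).real

def random_shift(n, rng):
    """Hypothetical helper: a traceless Hermitian direction (keeps Tr D = 1)."""
    a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    h = (a + a.conj().T) / 2
    return h - np.trace(h).real / n * np.eye(n)

rng = np.random.default_rng(0)
n, eps = 3, 1e-4
D = random_state(n, rng)
dD = eps * random_shift(n, rng)

# ds^2 = Tr dD d(ln D), with d(ln D) taken as a central difference
dlnD = funm_herm(D + dD / 2, np.log) - funm_herm(D - dD / 2, np.log)
ds2_trace = float(np.trace(dD @ dlnD).real)

# ds^2 = -d^2 S, with d^2 S taken as a second central difference of S
ds2_hessian = -(entropy(D + dD) + entropy(D - dD) - 2 * entropy(D))

print(ds2_trace, ds2_hessian)
```

The two printed numbers should agree up to finite-difference error, and both should be positive, reflecting the concavity of $S$.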

2. Interpretation in the Context of Quantum Information

The simplest example, related to quantum information theory, is that of a q-bit (two-level system
or spin $\tfrac12$) for which $n = 2$. Its states, represented by $2 \times 2$ Hermitean normalised density matrices $\hat{D}$,
can conveniently be parameterised, on the basis of Pauli matrices, by the components $r_\mu = D_{12} + D_{21}$,
$i(D_{12} - D_{21})$, $D_{11} - D_{22}$ ($\mu = 1, 2, 3$) of a 3-dimensional vector $\mathbf{r}$ lying within the unit Poincaré–Bloch
sphere ($r \le 1$). From the corresponding entropy

$$S = \frac{1+r}{2}\ln\frac{2}{1+r} + \frac{1-r}{2}\ln\frac{2}{1-r} , \tag{4}$$

we derive the metric

$$ds^2 = \frac{1}{1-r^2}\left(\frac{\mathbf{r}\cdot d\mathbf{r}}{r}\right)^2 + \frac{1}{2r}\ln\frac{1+r}{1-r}\left|\frac{\mathbf{r}\times d\mathbf{r}}{r}\right|^2 , \tag{5}$$

which is a natural Riemannian metric for q-bits, or more generally for positive $2 \times 2$ matrices. The
metric tensor characterizing (5) diverges in the vicinity of pure states $r = 1$, due to the singularity of
the entropy (2) for vanishing eigenvalues of $\hat{D}$. However, the distance between two arbitrary (even
pure) states $\hat{D}'$ and $\hat{D}''$ measured along a geodesic is always finite. We shall see (Equation (29)) that
for $n = 2$ the geodesic distance $s$ between two neighbouring pure states $\hat{D}'$ and $\hat{D}''$, represented by
unit vectors $\mathbf{r}'$ and $\mathbf{r}''$ making a small angle $\delta\varphi \sim |\mathbf{r}' - \mathbf{r}''|$, behaves as
$\delta s^2 \sim \delta\varphi^2 \ln(4\sqrt{\pi}/\delta\varphi)$. The singularity of the metric tensor manifests itself through this logarithmic factor.
Identifying von Neumann’s entropy with a measure of missing information, we can give a simple
interpretation to the distance between two states. Indeed, the concavity of entropy expresses that some
information is lost when two statistical ensembles described by different density operators merge. By
mixing two equal size populations described by the neighbouring distributions
$\hat{D}' = \hat{D} + \tfrac12\delta\hat{D}$ and $\hat{D}'' = \hat{D} - \tfrac12\delta\hat{D}$ separated by a distance $\delta s$, we lose an amount of
information given by

$$\Delta S \equiv S(\hat{D}) - \frac{S(\hat{D}') + S(\hat{D}'')}{2} \sim \frac{\delta s^2}{8} , \tag{6}$$

which is thereby directly related to the distance $\delta s$ defined by (3). The proof of this equivalence relies on
the expansion of the entropies $S(\hat{D}')$ and $S(\hat{D}'')$ around $\hat{D}$, and is valid when $\operatorname{Tr}\delta\hat{D}^2$ is negligible
compared to the smallest eigenvalue of $\hat{D}$. If $\hat{D}'$ and $\hat{D}''$ are distant, the quantity $8\Delta S$ cannot be
regarded as the square of a distance that would be generated by a local metric. The equivalence (6) for
neighbouring states shows that $ds^2$ is the metric that is the best suited to measure losses of information
by mixing.
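The equivalence (6) can be checked numerically for a q-bit from the closed forms (4) and (5). The sketch below is only an illustration: the Bloch vector `rvec`, the small separation `dr` and the function names are arbitrary choices made here.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def entropy(r):
    """Entropy (4) of a q-bit whose Bloch vector has length r < 1."""
    if r == 0.0:
        return math.log(2.0)
    return (0.5 * (1 + r) * math.log(2 / (1 + r))
            + 0.5 * (1 - r) * math.log(2 / (1 - r)))

def ds2(rvec, dr):
    """Metric (5): radial and transverse parts of the shift dr at the point rvec."""
    r = norm(rvec)
    radial = sum(a * b for a, b in zip(rvec, dr)) / r      # (r . dr) / r
    transverse2 = sum(x * x for x in dr) - radial ** 2      # |r x dr|^2 / r^2
    # atanh(r) / r equals (1 / 2r) ln((1 + r) / (1 - r))
    return radial ** 2 / (1 - r * r) + math.atanh(r) / r * transverse2

rvec = (0.3, 0.0, 0.4)          # a mixed state with r = 0.5
dr = (2e-3, -1e-3, 1.5e-3)      # separation between the two populations

r_plus = norm([a + 0.5 * b for a, b in zip(rvec, dr)])
r_minus = norm([a - 0.5 * b for a, b in zip(rvec, dr)])

# Information lost by mixing the two populations, left-hand side of (6)
delta_S = entropy(norm(rvec)) - 0.5 * (entropy(r_plus) + entropy(r_minus))

print(delta_S, ds2(rvec, dr) / 8)
```

For such a small separation the two printed values should coincide to several digits; the agreement degrades as `dr` grows, in line with the validity condition stated above.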
The singularity of $\delta s^2$ at the edge of the positivity domain of $\hat{D}$ may suggest that the result (6)
holds only within this domain. In fact, this equivalence remains nearly valid even in the limit of
pure states because $\Delta S$ itself involves a similar singularity. Indeed, if the states $\hat{D}' = |\psi'\rangle\langle\psi'|$
and $\hat{D}'' = |\psi''\rangle\langle\psi''|$ are pure and close to each other, the loss of information $\Delta S$ behaves as
$8\Delta S \sim \delta\varphi^2 \ln(4/\delta\varphi)$ where $\delta\varphi^2 \sim 2\operatorname{Tr}\delta\hat{D}^2$. This result should be compared to various geodesic
distances between pure quantum states, which behave as $\delta s^2 \sim \delta\varphi^2 \ln(4\sqrt{\pi}/\delta\varphi)$ for the present metric,
and as $\delta s^2_{\rm BH} = 4\delta s^2_{\rm FS} \sim \delta\varphi^2 \sim \operatorname{Tr}(\hat{D}' - \hat{D}'')^2$ for the Bures–Helstrom and the quantum Fubini–Study
metrics, respectively (see Section 7; these behaviours hold not only for $n = 2$ but for arbitrary $n$ since
only the space spanned by $|\psi'\rangle$ and $|\psi''\rangle$ is involved). Thus, among these metrics, only $ds^2 = -d^2S$
can be interpreted in terms of information loss, whether the states $\hat{D}'$ and $\hat{D}''$ are pure or mixed.
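The pure-state behaviour of $\Delta S$ quoted above can be explored numerically. The sketch below assumes two pure q-bit states whose Bloch vectors make the angle $\delta\varphi$, so that their equal-weight mixture has Bloch length $\cos(\delta\varphi/2)$; the function name and the sample angles are illustrative choices.

```python
import math

def mixing_loss(dphi):
    """Delta S for an equal-weight mixture of two pure q-bit states whose Bloch
    vectors make the angle dphi (the pure states themselves have zero entropy)."""
    p = math.sin(dphi / 4) ** 2      # small eigenvalue (1 - cos(dphi / 2)) / 2
    return -p * math.log(p) - (1 - p) * math.log1p(-p)

angles = (1e-2, 1e-3, 1e-4)
ratios = [8 * mixing_loss(a) / (a * a * math.log(4 / a)) for a in angles]

for a, q in zip(angles, ratios):
    print(a, q)
```

In this sketch the ratio of $8\Delta S$ to $\delta\varphi^2\ln(4/\delta\varphi)$ stays slightly above 1 and decreases only slowly as $\delta\varphi$ shrinks, because the correction is suppressed by a logarithm rather than a power of $\delta\varphi$.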
At the other extreme, around the most disordered state $\hat{D} = \hat{I}/n$, in the region $\| n\hat{D} - \hat{I} \| \ll 1$,
the metric becomes Euclidean since $ds^2 = \operatorname{Tr} d\hat{D}\, d\ln\hat{D} \sim n\operatorname{Tr}(d\hat{D})^2$ (for $n = 2$, $ds^2 = d\mathbf{r}^2$). For a given
shift $d\hat{D}$, the qualitative change of a state $\hat{D}$, as measured by the distance $ds$, gets larger and larger as
the state $\hat{D}$ becomes purer and purer, that is, when the information content of $\hat{D}$ increases.

3. Geometry of Quantum Statistical Mechanics

A rich geometric structure is generated for both states and observables by von Neumann’s
entropy through introduction of the metric $ds^2 = -d^2S$. Now, this metric (3) supplements the
algebraic structure of the set of observables and the above duality between the vector spaces of states
and of observables, with scalar product (1). Accordingly, we can define naturally within the space of
states scalar products, geodesics, angles, curvatures.
We can also regard the coordinates of $d\hat{D}$ and $d\ln\hat{D}$ as covariant and contravariant components
of the same infinitesimal vector (Section 6). To this aim, let us introduce the mapping

$$\hat{D} \equiv \frac{e^{\hat{X}}}{\operatorname{Tr} e^{\hat{X}}} \tag{7}$$

between $\hat{D}$ in the space of states and $\hat{X}$ in the space of observables. The operator $\hat{X}$ appears as a
parameterisation of $\hat{D}$. (The normalisation of $\hat{D}$ entails that $\hat{X}$, defined up to an arbitrary additive
constant operator $X_0\hat{I}$, also depends on $n^2 - 1$ independent real parameters.) The metric (3) can then
be re-expressed in terms of $\hat{X}$ in the form

$$ds^2 = \operatorname{Tr} d\hat{D}\, d\hat{X} = \operatorname{Tr}\int_0^1 d\xi\, \hat{D}\, e^{-\xi\hat{X}}\, d\hat{X}\, e^{\xi\hat{X}}\, d\hat{X} - (\operatorname{Tr}\hat{D}\, d\hat{X})^2 = d^2\ln\operatorname{Tr} e^{\hat{X}} = d^2F , \tag{8}$$

where we introduced the function

$$F(\hat{X}) \equiv \ln\operatorname{Tr} e^{\hat{X}} \tag{9}$$

of the observable $\hat{X}$ (the addition of $X_0\hat{I}$ to $\hat{X}$ results in the addition of the irrelevant constant $X_0$ to $F$).
This mapping provides us with a natural metric in the space of observables, from which we recover
the scalar product between $d\hat{X}_1$ and $d\hat{X}_2$ in the form of a Kubo correlation in the state $\hat{D}$. The metric
(8) has been quoted in the literature under the name of Bogoliubov–Kubo–Mori metric.

4. Covariance and Legendre Transformation

We can recover the above geometric mapping (7) between $\hat{D}$ and $\hat{X}$, or between the covariant
and contravariant coordinates of $d\hat{D}$, as the outcome of a Legendre transformation, by considering
the function $F(\hat{X})$. Taking its differential $dF = \operatorname{Tr} e^{\hat{X}} d\hat{X}/\operatorname{Tr} e^{\hat{X}}$, we identify the partial derivatives
of $F(\hat{X})$ with the coordinates of the state $\hat{D} = e^{\hat{X}}/\operatorname{Tr} e^{\hat{X}}$, so that $\hat{D}$ appears as conjugate to $\hat{X}$ in the
sense of Legendre transformations. Expressing then $\hat{X}$ as a function of $\hat{D}$ and inserting into $F - \operatorname{Tr}\hat{D}\hat{X}$,
we recognise that the Legendre transform of $F(\hat{X})$ is von Neumann’s entropy
$F - \operatorname{Tr}\hat{D}\hat{X} = S(\hat{D}) = -\operatorname{Tr}\hat{D}\ln\hat{D}$. The conjugation between $\hat{D}$ and $\hat{X}$ is embedded in the equations

$$dF = \operatorname{Tr}\hat{D}\, d\hat{X} ; \qquad dS = -\operatorname{Tr}\hat{X}\, d\hat{D} . \tag{10}$$
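For a q-bit the Legendre structure can be checked in closed form. The sketch below assumes $\hat{X} = \chi\,\mathbf{n}\cdot\hat{\boldsymbol{\sigma}}$ with a unit vector $\mathbf{n}$, for which $\operatorname{Tr} e^{\hat{X}} = 2\cosh\chi$ and the state (7) has polarisation $r = \tanh\chi$ along $\mathbf{n}$, so that $\operatorname{Tr}\hat{D}\hat{X} = r\chi$. The value of `chi` and the finite-difference step are arbitrary illustrative choices.

```python
import math

def F(chi):
    """F = ln Tr exp(X) for X = chi * (n . sigma), using Tr exp(X) = 2 cosh(chi)."""
    return math.log(2 * math.cosh(chi))

def S(r):
    """Von Neumann entropy of the q-bit state with eigenvalues (1 +/- r) / 2."""
    p, q = 0.5 * (1 + r), 0.5 * (1 - r)
    return -p * math.log(p) - q * math.log(q)

chi = 0.8
r = math.tanh(chi)          # polarisation of D = exp(X) / Tr exp(X)

# Legendre transform: F - Tr D X - S is expected to vanish
legendre_gap = F(chi) - r * chi - S(r)

# Relations (10) restricted to the radial direction, by central differences
h = 1e-5
dF_dchi = (F(chi + h) - F(chi - h)) / (2 * h)   # expected: Tr D (n . sigma) = r
dS_dr = (S(r + h) - S(r - h)) / (2 * h)         # expected: -chi

print(legendre_gap, dF_dchi - r, dS_dr + chi)
```

All three printed numbers should be zero up to rounding and finite-difference error.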

Legendre transformations are commonly used in equilibrium thermodynamics. Let us show that
they come out in this context directly as a special case of the present general formalism. The entropy of
thermodynamics is a function of the extensive variables, energy, volume, particle numbers, etc. Let us
focus for illustration on the energy $U$, keeping the other extensive variables fixed. The thermodynamic
entropy $S(U)$, a function of the single variable $U$, generates the inverse temperature as $\beta = \partial S/\partial U$.
Its Legendre transform is the Massieu potential $F(\beta) = S - \beta U$. In order to compare these properties
with the present formalism, we recall how thermodynamics comes out in the framework of statistical
mechanics. The thermodynamic entropy $S(U)$ is identified with the von Neumann entropy (2) of the
Boltzmann–Gibbs canonical equilibrium state $\hat{D}$, and the internal energy with $U = \operatorname{Tr}\hat{D}\hat{H}$. In the
relation (7), the operator $\hat{X}$ reads $\hat{X} = -\beta\hat{H}$ (within an irrelevant additive constant). By letting $U$ or
$\beta$ vary, we select within the spaces of states and of observables a one-dimensional subset. In these
restricted subsets, $\hat{D}$ is parameterised by the single coordinate $U$, and the corresponding $\hat{X}$ by the
coordinate $-\beta$.
By specialising the general relations (10) to these subsets, we recover the thermodynamic relations
$dF = -U\,d\beta$ and $dS = \beta\,dU$. We also recover, by restricting the metric (3) or (8) to these subsets, the
usual thermodynamic metric $ds^2 = -(\partial^2S/\partial U^2)\,dU^2 = -dU\,d\beta$.
More generally, we can consider the Boltzmann–Gibbs states of equilibrium statistical mechanics
as the points of a manifold embedded in the full space of states. The thermodynamic extensive
variables, which parameterise these states, are the expectation values of the conserved macroscopic
observables, that is, they are a subset of the expectation values (1) which parameterise arbitrary
density operators. Then the standard geometric structure of thermodynamics simply results from the
restriction of the general metric (3) to this manifold of Boltzmann–Gibbs states. The commutation of
the conserved observables simplifies the reduced thermodynamic metric, which presents the same
features as a Fisher metric (see Section 6).

5. Relevant Entropy and Geometry of the Projection Method

The above ideas also extend to non-equilibrium quantum statistical mechanics [2–4]. When
introducing the metric (3), we indicated that it may be used to estimate the quality of an approximation.
Let us illustrate this point with the Nakajima–Zwanzig–Mori–Robertson projection method, best
introduced through maximum entropy. Consider some set $\{\hat{A}_k\}$ of “relevant observables”, whose
time-dependent expectation values $a_k \equiv \langle\hat{A}_k\rangle = \operatorname{Tr}\hat{D}\hat{A}_k$ we wish to follow, discarding all other
variables. The exact state $\hat{D}$ encodes the variables $\{a_k\}$ that we are interested in, but also the expectation
values (1) of the other observables that we wish to eliminate. This elimination is performed by
associating at each time with $\hat{D}$ a “reduced state” $\hat{D}_R$ which is equivalent to $\hat{D}$ as regards the set
$a_k = \operatorname{Tr}\hat{D}_R\hat{A}_k$, but which provides no more information than the values $\{a_k\}$. The former condition
provides the constraints $\langle\hat{A}_k\rangle = a_k$, and the latter condition is implemented by means of the
maximum entropy criterion: one expresses that, within the set of density matrices compatible with
these constraints, $\hat{D}_R$ is the one which maximises von Neumann’s entropy (2), that is, which contains
solely the information about the relevant variables $a_k$. The least biased state $\hat{D}_R$ thus defined has the
form $\hat{D}_R = e^{\hat{X}_R}/\operatorname{Tr} e^{\hat{X}_R}$, where $\hat{X}_R \equiv \sum_k \lambda_k\hat{A}_k$ involves the time-dependent Lagrange multipliers $\lambda_k$,
which are related to the set $a_k$ through $\operatorname{Tr}\hat{D}_R\hat{A}_k = a_k$.


The von Neumann entropy $S(\hat{D}_R) \equiv S_R\{a_k\}$ of this reduced state $\hat{D}_R$ is called the “relevant
entropy” associated with the considered relevant observables $\hat{A}_k$. It measures the amount of missing
information, when only the values $\{a_k\}$ of the relevant variables are given. During its evolution, $\hat{D}$
keeps track of the initial information about all the variables $\langle\hat{O}\rangle$ and its entropy $S(\hat{D})$ remains
constant in time. It is therefore smaller than the relevant entropy $S(\hat{D}_R)$ which accounts for the
loss of information about the irrelevant variables. Depending on the choice of relevant observables
$\{\hat{A}_k\}$, the corresponding relevant entropies $S_R\{a_k\}$ encompass various commonly used entropies, such as the
non-equilibrium thermodynamic entropy or Boltzmann’s H-entropy.
The same structure as the one introduced above for the full spaces of observables and states is
recovered in this context. Here, for arbitrary values of the parameters $\lambda_k$, the exponents $\hat{X}_R = \sum_k \lambda_k\hat{A}_k$
constitute a subspace of the full vector space of observables, and the parameters $\{\lambda_k\}$ appear as the
coordinates of $\hat{X}_R$ on the basis $\{\hat{A}_k\}$. The corresponding states $\hat{D}_R$, parameterised by the set $\{a_k\}$,
constitute a subset of the space of states, the manifold $R$ of “reduced states” (note that this manifold is
not a hyperplane, contrary to the space of relevant observables; it is embedded in the full vector space
of states, but does not constitute a subspace). By regarding $S_R\{a_k\}$ as a function of the coordinates $\{a_k\}$,
we can define a metric $ds^2 = -d^2S_R\{a_k\}$ on the manifold $R$, which is the restriction of the metric (3).
Its alternative expression $ds^2 = \sum_k da_k\, d\lambda_k = d^2F_R\{\lambda_k\}$, where $F_R\{\lambda_k\} \equiv \ln\operatorname{Tr}\exp\sum_k \lambda_k\hat{A}_k$, is a
restriction of (8). The correspondence between the two parameterisations $\{a_k\}$ and $\{\lambda_k\}$ is again
implemented by the Legendre transformation which relates $S_R\{a_k\}$ and $F_R\{\lambda_k\}$.
The projection method relies on the mapping $\hat{D} \mapsto \hat{D}_R$ which associates $\hat{D}_R$ to $\hat{D}$. It consists
in replacing the Liouville–von Neumann equation of motion for $\hat{D}$ by the corresponding dynamical
equation for $\hat{D}_R$ on the manifold $R$, or equivalently for the coordinates $\{a_k\}$ or for the coordinates $\{\lambda_k\}$,
a programme that is in practice achieved through some approximations. This mapping is obviously
a projection in the sense that $\hat{D} \mapsto \hat{D}_R \mapsto \hat{D}_R$, but moreover the introduction of the metric (3) shows
that the vector $\hat{D} - \hat{D}_R$ in the space of states is perpendicular to the manifold $R$ at the point $\hat{D}_R$.
This property is readily shown by writing, in this metric, the scalar product $\operatorname{Tr} d\hat{D}\, d\hat{X}'$ of the vector
$d\hat{D} = \hat{D} - \hat{D}_R$ with an arbitrary vector $d\hat{D}'$ in the tangent plane of $R$. The latter is conjugate to any
combination $d\hat{X}'$ of observables $\hat{A}_k$, and this scalar product vanishes because $\operatorname{Tr}\hat{D}\hat{A}_k = \operatorname{Tr}\hat{D}_R\hat{A}_k$. Thus
the mapping $\hat{D} \mapsto \hat{D}_R$ appears as an orthogonal projection, so that the relevant state $\hat{D}_R$ associated
with $\hat{D}$ may be regarded as its best possible approximation on the manifold $R$.
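The projection can be made concrete on a q-bit. The sketch below is a toy illustration, not an example taken from the original: it assumes that the relevant observables are the two Pauli operators $\hat{\sigma}_x$ and $\hat{\sigma}_z$, and it represents states by their Bloch vectors, using $\operatorname{Tr}\hat{D}\hat{\sigma}_k = r_k$. The numerical values and variable names are arbitrary.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def entropy(rvec):
    """Von Neumann entropy of the q-bit state (I + r . sigma) / 2, with |r| < 1."""
    r = norm(rvec)
    if r == 0.0:
        return math.log(2.0)
    p, q = 0.5 * (1 + r), 0.5 * (1 - r)
    return -p * math.log(p) - q * math.log(q)

relevant = (0, 2)             # assumed relevant observables: sigma_x and sigma_z
r_exact = (0.3, 0.5, 0.4)     # Bloch vector of the exact state D

# Reduced state D_R: same expectation values a_k on the relevant axes, and the
# smallest possible |r| (largest entropy) compatible with them
r_reduced = tuple(x if k in relevant else 0.0 for k, x in enumerate(r_exact))

# Lagrange multipliers of X_R = sum_k lambda_k A_k: |lambda| = atanh |r_R|
rR = norm(r_reduced)
lam = tuple(math.atanh(rR) * x / rR for x in r_reduced)

# Consistency: exp(X_R) / Tr exp(X_R) has Bloch vector tanh|lambda| lambda / |lambda|
lam_norm = norm(lam)
r_rebuilt = tuple(math.tanh(lam_norm) * x / lam_norm for x in lam)

# Scalar product Tr (D - D_R) dX' for dX' along each relevant observable:
# Tr[(D - D_R) sigma_k] is the k-th component of r_exact - r_reduced
overlaps = [r_exact[k] - r_reduced[k] for k in relevant]

print(overlaps, entropy(r_exact), entropy(r_reduced))
```

The overlaps vanish, which is the orthogonality property in this toy setting, and the relevant entropy of the reduced state exceeds the entropy of the exact state.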

6. Properties of the Metric

The metric tensor can be evaluated explicitly in a basis where the matrix $\hat{D}$ is diagonal. Denoting
by $D_i$ its eigenvalues and by $dD_{ij}$ the matrix elements of its variations, we obtain from (3)

$$ds^2 = \operatorname{Tr}\int_0^\infty d\xi\left(\frac{d\hat{D}}{\hat{D}+\xi}\right)^2 = \sum_{ij}\frac{\ln D_i - \ln D_j}{D_i - D_j}\, dD_{ij}\, dD_{ji} . \tag{11}$$

(For $D_i = D_j$, whether or not $i = j$, the ratio is defined as $1/D_i$ by continuity.) In the same basis, the
form (8) of the metric reads

$$ds^2 = \frac{1}{Z}\sum_{ij}\frac{e^{X_i} - e^{X_j}}{X_i - X_j}\, dX_{ij}\, dX_{ji} - \left(\frac{\sum_i e^{X_i}\, dX_{ii}}{Z}\right)^2 , \tag{12}$$

with $Z = \sum_i e^{X_i}$ (for $X_i = X_j$, the ratio is $e^{X_i}$). The singularity of the metric (11) in the vicinity of
vanishing eigenvalues of $\hat{D}$, in particular near pure states (end of Section 2), is not apparent in the
representation (12) of this metric, because the mapping from $\hat{D}$ to $\hat{X}$ sends the eigenvalue $X_i$ to $-\infty$
when $D_i$ tends to zero.
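The agreement between the representations (11) and (12), and with the trace form $\operatorname{Tr} d\hat{D}\, d\hat{X}$ of (8), can be tested numerically. In the sketch below, which is an illustration under stated assumptions, $\hat{X}$ and the direction $d\hat{X}$ are random Hermitian $3 \times 3$ matrices, the induced variation $d\hat{D}$ is obtained by a central difference of the mapping (7), and the helper names (`divided_difference`, `state`) are chosen here.

```python
import numpy as np

def divided_difference(f, fprime, x):
    """Matrix of [f(x_i) - f(x_j)] / (x_i - x_j), with f'(x_i) where x_i = x_j."""
    num = f(x)[:, None] - f(x)[None, :]
    den = x[:, None] - x[None, :]
    same = np.isclose(den, 0.0)
    return np.where(same, fprime(x)[:, None], num / np.where(same, 1.0, den))

def state(x_mat):
    """D = exp(X) / Tr exp(X) for a Hermitian X, Equation (7)."""
    w, u = np.linalg.eigh(x_mat)
    e = np.exp(w - w.max())          # constant shift of X: irrelevant for D
    return (u * e) @ u.conj().T / e.sum()

rng = np.random.default_rng(1)
n, h = 3, 1e-5
a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
b = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
X = (a + a.conj().T) / 2
dX = (b + b.conj().T) / 2

# Variation of D induced by dX (derivative along dX, by central difference)
dD = (state(X + h * dX) - state(X - h * dX)) / (2 * h)

# Basis where X, hence D, is diagonal
x, u = np.linalg.eigh(X)
Z = np.exp(x).sum()
d = np.exp(x) / Z
dXe = u.conj().T @ dX @ u
dDe = u.conj().T @ dD @ u

# Equation (11): sum_ij (ln D_i - ln D_j) / (D_i - D_j) dD_ij dD_ji
L = divided_difference(np.log, lambda t: 1 / t, d)
ds2_eq11 = float(np.sum(L * dDe * dDe.T).real)

# Equation (12)
E = divided_difference(np.exp, np.exp, x)
ds2_eq12 = float(np.sum(E * dXe * dXe.T).real / Z
                 - (np.sum(np.exp(x) * np.diag(dXe)).real / Z) ** 2)

# Trace form of (8)
ds2_trace = float(np.trace(dD @ dX).real)

print(ds2_eq11, ds2_eq12, ds2_trace)
```

The three values are the same quadratic form evaluated on the direction `dX`, so they should agree up to the finite-difference error.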
Let us compare the expression (11) with the corresponding classical metric, which is obtained
by starting from Shannon’s entropy instead of von Neumann’s entropy. For discrete probabilities $p_i$,
we have then $S\{p_i\} = -\sum_i p_i\ln p_i$ and hence the same definition $ds^2 = -d^2S\{p_i\}$ as above of an
entropy-based metric yields $ds^2 = \sum_i dp_i^2/p_i$, which is identified with the Fisher information metric.
The present metric thus appears as the extension to quantum statistical mechanics of the Fisher metric
when the latter is interpreted in terms of entropy. In fact, the terms of (11) which involve the diagonal
elements $i = j$ of the variations $d\hat{D}$ reduce to $dD_{ii}^2/D_i$. This result was expected since density matrices
behave as probability distributions if both $\hat{D}$ and $d\hat{D}$ are diagonal.
Let us more generally consider in (11), instead of solely diagonal variations $dD_{ii}$, variations $dD_{ij}$
with indices $i$ and $j$ such that $|D_i - D_j| \ll D_i + D_j$. The expansion of $D_i$ and $D_j$ around $\tfrac12(D_i + D_j)$
in the corresponding ratios of (11) yields $(\ln D_i - \ln D_j)/(D_i - D_j) \sim 2/(D_i + D_j)$. The considered
terms of (11) are therefore the same as in the Bures–Helstrom metric

$$ds^2_{\rm BH} = \sum_{ij}\frac{2}{D_i + D_j}\, dD_{ij}\, dD_{ji} , \tag{13}$$

introduced long ago as an extension to matrices of the Fisher metric [5]. We thus recover this
Bures–Helstrom metric as an approximation of the present entropy-based metric $ds^2 = -d^2S(\hat{D})$.
For $n = 2$, $ds^2_{\rm BH}$ is obtained from the expression (5) of $ds^2$ by omitting the factor $\tanh^{-1} r/r$ entering
the second term.
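The relation between the coefficients of (11) and (13) is easy to tabulate for a q-bit, whose eigenvalues are $D_{1,2} = \tfrac12(1 \pm r)$. The sketch below uses illustrative function names and sample radii chosen here; it compares the two off-diagonal weights and checks that their ratio is the factor $\tanh^{-1} r/r$ mentioned above.

```python
import math

def entropy_weight(di, dj):
    """(ln D_i - ln D_j) / (D_i - D_j): coefficient of dD_ij dD_ji in (11)."""
    if di == dj:
        return 1.0 / di
    return (math.log(di) - math.log(dj)) / (di - dj)

def bures_weight(di, dj):
    """2 / (D_i + D_j): coefficient of dD_ij dD_ji in (13)."""
    return 2.0 / (di + dj)

rows = []
for r in (0.02, 0.5, 0.9, 0.999):
    d1, d2 = 0.5 * (1 + r), 0.5 * (1 - r)
    ratio = entropy_weight(d1, d2) / bures_weight(d1, d2)
    rows.append((r, ratio, math.atanh(r) / r))

for row in rows:
    print(row)
```

The ratio is close to 1 when the two eigenvalues are nearly equal (small $r$), where the Bures–Helstrom metric approximates the entropy-based one, and it grows without bound as $r \to 1$. The diagonal weights coincide identically, since $1/D_i = 2/(D_i + D_i)$.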
In order to express the properties of the Riemannian metric (3) in a general form, which will
exhibit the tensor structure, we use a Liouville representation. There, the observables $\hat{O} = O_\mu\hat{\Omega}^\mu$,
regarded as elements of a vector space, are represented by their coordinates $O_\mu$ on a complete basis
$\hat{\Omega}^\mu$ of $n^2$ observables. The space of states is spanned by the dual basis $\hat{\Sigma}_\mu$, such that
$\operatorname{Tr}\hat{\Omega}^\nu\hat{\Sigma}_\mu = \delta^\nu_\mu$, and the states $\hat{D} = D^\mu\hat{\Sigma}_\mu$ are represented by their coordinates $D^\mu$. Thus, the
expectation value (1) is the scalar product $D^\mu O_\mu$. In the matrix representation which appears as a special case,
$\mu$ denotes a pair of indices $i, j$, $\hat{\Omega}^\mu$ stands for $|j\rangle\langle i|$, $\hat{\Sigma}_\mu$ for $|i\rangle\langle j|$, $O_\mu$ denotes the matrix
element $O_{ji}$ and $D^\mu$ the element $D_{ij}$. For the q-bit ($n = 2$) considered in Section 2, we have chosen the
Pauli operators $\hat{\sigma}^\mu$ as basis $\hat{\Omega}^\mu$ for observables, and $\tfrac12\hat{\sigma}_\mu$ as dual basis $\hat{\Sigma}_\mu$ for states, so that the
coordinates $D^\mu = \operatorname{Tr}\hat{D}\hat{\Omega}^\mu$ of $\hat{D} = \tfrac12(\hat{I} + r^\mu\hat{\sigma}_\mu)$ are the components $r^\mu$ of the vector $\mathbf{r}$ (the unit
operator $\hat{I}$ is kept aside since $\hat{D}$ is normalised and since constants added to $\hat{X}$ are irrelevant). The function
$F\{X\} = \ln\operatorname{Tr} e^{\hat{X}}$ of the coordinates $X_\mu$ of the observable $\hat{X}$, and the von Neumann entropy $S\{D\}$ as
function of the coordinates $D^\mu$ of the state $\hat{D}$, are related by the Legendre transformation $F = S + D^\mu X_\mu$,
and the relations (10) are expressed by $D^\mu = \partial F/\partial X_\mu$, $X_\mu = -\partial S/\partial D^\mu$. The metric tensor is given by

$$g^{\mu\nu} = \frac{\partial^2 F}{\partial X_\mu\,\partial X_\nu} , \qquad g_{\mu\nu} = -\frac{\partial^2 S}{\partial D^\mu\,\partial D^\nu} , \tag{14}$$

and the correspondence issued from (7) between covariant and contravariant infinitesimal variations
of $\hat{X}$ and $\hat{D}$ is implemented as $dD^\mu = g^{\mu\nu}\, dX_\nu$, $dX_\mu = g_{\mu\nu}\, dD^\nu$.
These expressions exhibit the Hessian nature of the metric. This property simplifies the expression
of the Christoffel symbol, which reduces to

$$\Gamma_{\mu\nu\rho} = -\frac{1}{2}\,\frac{\partial^3 S}{\partial D^\mu\,\partial D^\nu\,\partial D^\rho} , \tag{15}$$

and which provides a parametric representation $\hat{D}(t)$ of the geodesics in the space of states through

$$\frac{d^2D^\mu}{dt^2} + g^{\mu\sigma}\,\Gamma_{\sigma\nu\rho}\,\frac{dD^\nu}{dt}\,\frac{dD^\rho}{dt} = 0 . \tag{16}$$
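The reduction of the Christoffel symbol to the form (15) follows in one step from the Hessian form (14) of the metric. The short derivation below is a reminder under the assumption that the coordinates $D^\mu$ are used throughout, with the shorthand $\partial_\mu \equiv \partial/\partial D^\mu$ introduced here:

```latex
\begin{aligned}
\Gamma_{\mu\nu\rho}
  &= \tfrac12\left(\partial_\nu g_{\mu\rho} + \partial_\rho g_{\mu\nu}
     - \partial_\mu g_{\nu\rho}\right), \\
g_{\mu\nu} = -\partial_\mu\partial_\nu S
  \;&\Longrightarrow\;
  \partial_\rho g_{\mu\nu} = -\partial_\mu\partial_\nu\partial_\rho S
  \quad \text{(fully symmetric in } \mu, \nu, \rho\text{)}, \\
\Gamma_{\mu\nu\rho}
  &= \tfrac12\,(1 + 1 - 1)\left(-\partial_\mu\partial_\nu\partial_\rho S\right)
   = -\tfrac12\,\frac{\partial^3 S}{\partial D^\mu\,\partial D^\nu\,\partial D^\rho}.
\end{aligned}
```

Because the three derivative terms are equal, two of them cancel and a single third derivative of $S$ remains.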

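Then, the Riemann curvature tensor comes out as

$$R_{\mu\rho\nu\sigma} = g^{\xi\zeta}\left(\Gamma_{\mu\sigma\xi}\Gamma_{\nu\rho\zeta} - \Gamma_{\mu\nu\xi}\Gamma_{\rho\sigma\zeta}\right) , \tag{17}$$

the Ricci tensor and the scalar curvature as

$$R_{\mu\nu} = g^{\rho\sigma}R_{\mu\rho\nu\sigma} , \qquad R = g^{\mu\nu}R_{\mu\nu} . \tag{18}$$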

We have noted that the classical equivalent of the entropy-based metric $ds^2 = -d^2S$ is the Fisher
metric $\sum_i dp_i^2/p_i$, which as regards the curvature is equivalent to a Euclidean metric. While the space of
classical probabilities is thus flat, the above equations show that the space of quantum states is curved.
This curvature arises from the non-commutation of the observables; it vanishes for the completely
disordered state $\hat{D} = \hat{I}/n$. Curvature can thus be used as a measure of the degree of classicality of a
state.

7. Geometry of the Space of q-Bits

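In the illustrative example of a q-bit, the operator $\hat{X} = \chi_\mu\hat{\sigma}^\mu$ associated with $\hat{D}$ is parameterised
by the 3 components of the vector $\chi_\mu$ ($\mu = 1, 2, 3$), related to $\mathbf{r}$ by $\chi = \tanh^{-1} r$ and $\chi_\mu/\chi = r_\mu/r$. The
metric tensor given by (5) is expressed as

$$g_{\mu\nu} = K r_\mu r_\nu + \frac{\chi}{r}\,\delta_{\mu\nu} , \qquad K \equiv \frac{1}{r}\frac{d}{dr}\frac{\chi}{r} = \frac{1}{r^2}\left(\frac{1}{1-r^2} - \frac{\chi}{r}\right) , \tag{19}$$

$$g^{\mu\nu} = (1 - r^2)\, p^{\mu\nu} + \frac{r}{\chi}\, q^{\mu\nu} .$$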

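(We have defined $r^\mu = r_\mu$, $\delta^{\mu\nu} = \delta^\mu_\nu = \delta_{\mu\nu}$ so as to introduce the projectors
$r^\mu r^\nu/r^2 \equiv p^{\mu\nu} \equiv \delta^{\mu\nu} - q^{\mu\nu}$ in the Euclidean 3-dimensional space, and thus to simplify the
subsequent calculations.) In polar coordinates $\mathbf{r} = (r, \theta, \varphi)$, the infinitesimal distance takes the form

$$ds^2 = dr\, d\chi + r\chi\left(d\theta^2 + \sin^2\theta\, d\varphi^2\right) . \tag{20}$$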

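We determine from (15) and (19) the explicit form

$$\Gamma_{\mu\nu\rho} = \frac{K}{2}\left(r_\mu\delta_{\nu\rho} + r_\nu\delta_{\mu\rho} + r_\rho\delta_{\mu\nu}\right) + \frac{1}{2r}\frac{dK}{dr}\, r_\mu r_\nu r_\rho \tag{21}$$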

of the Christoffel symbol. By raising its first index with $g^{\mu\nu}$ and using polar coordinates, we obtain
from (16) the equations of geodesics for $n = 2$. Within the Poincaré–Bloch sphere the geodesics are
deduced by rotations from a one-parameter family of curves which lie in the $\theta = \tfrac12\pi$, $|\varphi| \le \tfrac12\pi$
half-plane and which are symmetric with respect to the $\varphi = 0$ axis. This family is characterized by the
equations (where $\chi = \tanh^{-1} r$):

$$\frac{d^2r}{dt^2} + \frac{r}{1-r^2}\left(\frac{dr}{dt}\right)^2 - \frac{r}{2}\left[1 + \frac{\chi}{r}\left(1 - r^2\right)\right]\left(\frac{d\varphi}{dt}\right)^2 = 0 , \tag{22}$$

$$\frac{d^2\varphi}{dt^2} + \frac{1}{r}\frac{dr}{dt}\frac{d\varphi}{dt} + \frac{1}{\chi}\frac{d\chi}{dt}\frac{d\varphi}{dt} = 0 , \tag{23}$$

and the boundary conditions at $t = 0$:

$$r(0) = a , \qquad \varphi(0) = 0 , \qquad \frac{dr(0)}{dt} = 0 , \qquad \frac{d\varphi(0)}{dt} = \frac{1}{k} , \qquad k^2 = a\tanh^{-1} a . \tag{24}$$

Equation (23) provides, using the boundary conditions (24):

$$\frac{d\varphi}{dt} = \frac{k}{r\chi} . \tag{25}$$


Insertion of (25) into (22) gives rise to an equation for $r(t)$, which can be integrated by regarding $t$ as a
function of $\zeta = \arcsin r$. One obtains:

$$\left(\frac{dr}{dt}\right)^2 = \left(1 - r^2\right)\left(1 - \frac{k^2}{r\chi}\right) . \tag{26}$$

The scale of $t$ has been fixed by relating to $r(0)$ the boundary condition (24) for $d\varphi(0)/dt$, a choice
which ensures that $ds^2 = dr\, d\chi + r\chi\, d\varphi^2 = dt^2$, and hence that the parameter $t$ measures the distance
along geodesics.
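The first integrals (25) and (26) can be checked against a direct numerical integration of the geodesic Equations (22) and (23). The sketch below uses a classical fourth-order Runge–Kutta scheme, which is a choice made here for illustration; the starting radius `a`, the step `h` and the number of steps are arbitrary, and the integration is stopped well before the geodesic reaches the surface $r = 1$.

```python
import math

def chi(r):
    return math.atanh(r)

def rhs(s):
    """Geodesic Equations (22)-(23) written as a first-order system."""
    r, phi, vr, vphi = s
    c = chi(r)
    acc_r = (-r / (1 - r * r) * vr ** 2
             + 0.5 * r * (1 + c / r * (1 - r * r)) * vphi ** 2)
    dchi_dt = vr / (1 - r * r)               # d(chi)/dt, since chi = atanh(r)
    acc_phi = -(vr / r + dchi_dt / c) * vphi
    return (vr, vphi, acc_r, acc_phi)

def rk4_step(s, h):
    """One classical Runge-Kutta step of size h."""
    k1 = rhs(s)
    k2 = rhs(tuple(x + 0.5 * h * q for x, q in zip(s, k1)))
    k3 = rhs(tuple(x + 0.5 * h * q for x, q in zip(s, k2)))
    k4 = rhs(tuple(x + h * q for x, q in zip(s, k3)))
    return tuple(x + h / 6 * (q1 + 2 * q2 + 2 * q3 + q4)
                 for x, q1, q2, q3, q4 in zip(s, k1, k2, k3, k4))

a = 0.5                                  # radius of the middle point of the geodesic
k = math.sqrt(a * math.atanh(a))         # k^2 = a atanh(a), Equation (24)
state = (a, 0.0, 0.0, 1.0 / k)           # (r, phi, dr/dt, dphi/dt) at t = 0

h, steps = 1e-3, 500
max_dev_25 = max_dev_26 = 0.0
for _ in range(steps):
    state = rk4_step(state, h)
    r, phi, vr, vphi = state
    rc = r * chi(r)
    max_dev_25 = max(max_dev_25, abs(vphi - k / rc))
    max_dev_26 = max(max_dev_26,
                     abs(vr ** 2 - (1 - r * r) * (1 - k * k / rc)))

print(state[0], state[1], max_dev_25, max_dev_26)
```

Both deviations should remain at the level of the integration error, which also confirms that $t$ is the arc length along the computed curve.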
For $k = 0$, we obtain $r = |\sin t|$, $\varphi = \pm\pi/2$. Thus, the longest geodesics are the diameters of
the Poincaré–Bloch sphere. We find the value $\pi$ for their “length”, that is, for the geodesic distance
between two orthogonal pure states. At the other extreme, when the middle point $r = a$, $\varphi = 0$ of
a geodesic lies close to the surface $r = 1$ of the sphere, the asymptotic form of the Equation (26) is
solved as

$$t = \pm 2k\sqrt{\pi}\, e^{-k^2}\operatorname{erf}\xi , \qquad \xi = \sqrt{\frac{1}{2}\ln\frac{1-a}{1-r}} , \qquad k^2 = \frac{1}{2}\ln\frac{2}{1-a} \tag{27}$$

(by taking $\xi$ as variable instead of $r$). The determination of the explicit equations of such short geodesic
curves is achieved by integrating (25) into

$$\varphi = \frac{t}{k} = \pm 2\sqrt{\pi}\, e^{-k^2}\operatorname{erf}\xi . \tag{28}$$

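From (27) and (28) we can determine the geodesic distance between two neighbouring pure states
$\hat{D}' = |\psi'\rangle\langle\psi'|$ and $\hat{D}'' = |\psi''\rangle\langle\psi''|$ represented by the points $r_{\max} = 1$, $\varphi_{\max} = \pm\tfrac12\delta\varphi$ with $\delta\varphi$ small.
At these two points, we have $\xi \to \infty$, $\operatorname{erf}\xi = 1$, and this determines $k$ in terms of $\tfrac12\delta\varphi$ through (28).
The length of the geodesic that joins them, given by (27), is:

$$\delta s^2 = \delta\varphi^2\ln\frac{4\sqrt{\pi}}{\delta\varphi} , \qquad \delta\varphi = \arccos\left|\langle\psi'|\psi''\rangle\right| . \tag{29}$$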

Thus, in spite of its singularity for $r = 1$, the present 3-dimensional metric (5) in the space $r, \theta, \varphi$ defines
distances between pure states represented by points on the surface $r = 1$ of the Poincaré–Bloch sphere.
However, it should be noted that the presence of the logarithmic factor in (29) forbids such distances
to be generated by a 2-dimensional metric in the space $\theta, \varphi$. In fact, the distance (29) is measured along
a geodesic that penetrates the sphere $r = 1$, because no geodesic is tangent to the surface of this sphere
nor lies on its surface.
In contrast, all geodesics produced by the Bures–Helstrom metric are tangent to the surface of the
sphere, or are its great circles. They are given by Equations (25) and (26), where $\chi$ is replaced by $r$ and
$k$ by $a$; the solution of these equations provides the ellipses

$$r\cos\varphi = a\cos t , \qquad r\sin\varphi = \sin t . \tag{30}$$

Here as above, the largest distance $\pi$ is reached for orthogonal pure states represented by opposite
points on the sphere, but now a peculiarity occurs. Whereas the metric $ds^2 = -d^2S$ produces a single
geodesic, the diameter joining these two points (with “length” $\pi$), the Bures metric produces a double
infinity of geodesics, the half-ellipses (30) having as long axis this diameter, and all having the same
“length” $\pi$. Other pairs of pure states are joined by geodesics which are arcs of great circles, and their
Bures distance $\delta s_{\rm BH} = \delta\varphi$ is identified with the ordinary length of the arc. Here for $n = 2$ as in the
general case, the 3-dimensional Bures–Helstrom metric admits a restriction to pure states generated by
a 2-dimensional metric, which is identified with the quantum Fubini–Study metric, itself defined only
for pure states by $s_{\rm FS} = \arccos\left|\langle\psi'|\psi''\rangle\right| = \tfrac12 s_{\rm BH}$.


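Returning to the metric $ds^2 = -d^2S$, the Riemann curvature is obtained from (17) as

$$R^\mu{}_{\rho\nu\sigma} = \frac{K}{4}\left[\left(r^2 + \frac{r}{\chi} - 1\right)\left(q^\mu_\sigma q_{\nu\rho} - q^\mu_\nu q_{\rho\sigma}\right) + \left(r^2 - \frac{r}{\chi} + 1\right)\left(p^\mu_\sigma q_{\nu\rho} - p^\mu_\nu q_{\rho\sigma}\right)\right.$$

$$\left. {} + \frac{r}{\chi}\,\frac{1}{1-r^2}\left(r^2 - \frac{r}{\chi} + 1\right)\left(q^\mu_\sigma p_{\nu\rho} - q^\mu_\nu p_{\rho\sigma}\right)\right] . \tag{31}$$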

Contracting with $g^{\rho\sigma}$ the indices of (31) as in (18), we finally derive the Ricci curvature

$$R^\mu_\nu = -\frac{Kr}{2\chi}\left(r^2\delta^\mu_\nu + \frac{\chi - r}{\chi}\, p^\mu_\nu\right) , \tag{32}$$

and the scalar curvature

$$R = -\frac{Kr}{2\chi}\left(3r^2 + \frac{\chi - r}{\chi}\right) . \tag{33}$$

Both are negative in the whole Poincaré sphere. In the limit $r \to 0$, the curvature $R$ vanishes as
$R \sim -\tfrac{10}{9} r^2$, as expected from the general argument of Section 2: a weakly polarised spin behaves
classically. At the other extreme $r \to 1$, $R$ behaves as $R \sim -2\left[(1 - r)\,|\ln(1 - r)|\right]^{-1}$; it diverges, again
as expected: pure states have the largest quantum nature.

The metric $ds^2 = -d^2S$, introduced above in the context of quantum mechanics for mixed states
(and their pure limit) and information theory, might more generally be useful to characterise distances
in spaces of positive matrices.

Conflicts of Interest: The author declares no conflict of interest.

References

1. Thirring, W. Quantum Mechanics of Large Systems. In A Course of Mathematical Physics; Volume 4; Springer-Verlag: New York, NY, USA, 1983.
2. Balian, R.; Alhassid, Y.; Reinhardt, H. Dissipation in many-body systems: A geometric approach based on information theory. Phys. Rep. 1986, 131, 1–146.
3. Balian, R. Incomplete descriptions and relevant entropies. Am. J. Phys. 1999, 67, 1078–1090.
4. Balian, R. Information in statistical physics. Stud. Hist. Philos. Mod. Phys. 2005, 36, 323–353.
5. Bures, D. An extension of Kakutani’s theorem. Trans. Am. Math. Soc. 1969, 135, 199–212.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).


entropy

Article
Extending the Extreme Physical Information to
Universal Cognitive Models via a Confident
Information First Principle

Xiaozhao Zhao 1, Yuexian Hou 1,2,*, Dawei Song 1,3 and Wenjie Li 2

1 School of Computer Science and Technology, Tianjin University, Tianjin 300072, China; E-Mails:
0.25eye@gmail.com (X.Z.); dawei.song2010@gmail.com (D.S.)
2 Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China;
E-Mail: cswjli@comp.polyu.edu.hk
3 Department of Computing and Communications, The Open University, Milton Keynes MK76AA, UK
*
E-Mail: yxhou@tju.edu.cn; Tel.: +86-022-27406538.

Received: 25 March 2014; in revised form: 6 June 2014 / Accepted: 20 June 2014 /
Published: 1 July 2014

Abstract: The principle of extreme physical information (EPI) can be used to derive many known
laws and distributions in theoretical physics by extremizing the physical information loss K, i.e.,
the difference between the observed Fisher information I and the intrinsic information bound J
of the physical phenomenon being measured. However, for complex cognitive systems of high
dimensionality (e.g., human language processing and image recognition), the information bound
J could be excessively larger than I (J ≫ I), due to insufficient observation, which would lead
to serious over-fitting problems in the derivation of cognitive models. Moreover, there is a lack
of an established exact invariance principle that gives rise to the bound information in universal
cognitive systems. This limits the direct application of EPI. To narrow down the gap between I and
J, in this paper, we propose a confident-information-first (CIF) principle to lower the information
bound J by preserving confident parameters and ruling out unreliable or noisy parameters in the
probability density function being measured. The confidence of each parameter can be assessed
by its contribution to the expected Fisher information distance between the physical phenomenon
and its observations. In addition, given a specific parametric representation, this contribution can
often be directly assessed by the Fisher information, which establishes a connection with the inverse
variance of any unbiased estimate for the parameter via the Cramér–Rao bound. We then consider
the dimensionality reduction in the parameter spaces of binary multivariate distributions. We show
that the single-layer Boltzmann machine without hidden units (SBM) can be derived using the CIF
principle. An illustrative experiment is conducted to show how the CIF principle improves the
density estimation performance.

Keywords: information geometry; Boltzmann machine; Fisher information; parametric reduction

1. Introduction

Information has been found to play an increasingly important role in physics. As stated in
Wheeler [1]: “All things physical are information-theoretic in origin and this is a participatory
universe...Observer participancy gives rise to information; and information gives rise to physics”.
Following this viewpoint, Frieden [2] unifies the derivation of physical laws in major fields of physics,
from the Dirac equation to the Maxwell-Boltzmann velocity dispersion law, using the extreme physical
information principle (EPI). More specifically, a variety of equations and distributions can be derived by
extremizing the physical information loss K, i.e., the difference between the observed Fisher information
I and the intrinsic information bound J of the physical phenomenon being measured.

Entropy 2014, 16, 3670–3688; doi:10.3390/e16073670
www.mdpi.com/journal/entropy

The first quantity, I, measures the amount of information as a finite scalar implied by the data
with some suitable measure [2]. It is formally defined as the trace of the Fisher information matrix [3].
In addition to I, the second quantity, the information bound J, is an invariant that characterizes the
information that is intrinsic to the physical phenomenon [2]. During the measurement procedure, there
may be some loss of information, which entails I = κJ, where κ ≤ 1 is called the efficiency coefficient
of the EPI process in transferring the Fisher information from the phenomenon (specified by J) to
the output (specified by I). For closed physical systems, in particular, any solution for I attains some
fraction of J between 1/2 (for classical physics) and one (for quantum physics) [4].
However, it is usually not the case in cognitive science. For complex cognitive systems (e.g.,
human language processing and image recognition), the target probability density function (pdf) being
measured is often of high dimensionality (e.g., thousands of words in a human language vocabulary
and millions of pixels in an observed image). Thus, it is infeasible for us to obtain a sufficient collection
of observations, leading to excessive information loss between the observer and nature. Moreover,
there is a lack of an established exact invariance principle that gives rise to the bound information in
universal cognitive systems. This limits the direct application of EPI in cognitive systems.
In terms of statistics and machine learning, the excessive information loss between the observer
and nature will lead to serious over-fitting problems, since the insufficient observations may not
provide necessary information to reasonably identify the model and support the estimation of the
target pdf in complex cognitive systems. Actually, a similar problem is also recognized in statistics and
machine learning, known as the model selection problem [5]. In general, we would require a complex
model with a high-dimensional parameter space to sufficiently depict the original high-dimensional
observations. However, over-fitting usually occurs when the model is excessively complex with
respect to the given observations. To avoid over-fitting, we would need to adjust the complexity of the
models to the available amount of observations and, equivalently, to adjust the information bound J
corresponding to the observed information I.
In order to derive feasible computational models for cognitive phenomena, we propose a
confident-information-first (CIF) principle in addition to EPI to narrow down the gap between I and J
(thus, a reasonable efficiency coefficient κ is implied), as illustrated in Figure 1. However, we do not
intend to actually derive the distribution laws by solving the differential equations of the extremization
of the new information loss K′. Instead, we assume that the target distribution belongs to some general
multivariate binary distribution family and focus on the problem of seeking a proper information
bound with respect to the constraint of the parametric number and the given observations.

Figure 1. (a) The paradigm of the extreme physical information principle (EPI) to derive physical laws
by the extremization of the information loss K∗ (K∗ = J/2 for classical physics and K∗ = 0 for quantum
physics); (b) the paradigm of confident-information-first (CIF) to derive computational models by
reducing the information loss K′ using a new physical bound J′.

The key to the CIF approach is how to systematically reduce the physical information bound for
high-dimensional complex systems. As stated in Frieden [2], the information bound J is a functional
form that depends upon the physical parameters of the system. The information is contained in
the variations of the observations (often imperfect, due to insufficient sampling, noise and intrinsic
limitations of the “observer”), and can be further quantified using the Fisher information of system
parameters (or coordinates) [3] from the estimation theory. Therefore, the physical information bound
J of a complex system can be reduced by transforming it to a simpler system using some parametric
reduction approach. Assuming there exists an ideal parametric model S that is general enough to
represent all system phenomena (which gives the ultimate information bound in Figure 1), our goal is
to adopt a parametric reduction procedure to derive a lower-dimensional sub-model M (which gives
the reduced information bound in Figure 1) for a given dataset (usually insufficient or perturbed by
noise) by reducing the number of free parameters in S.
Formally speaking, let q(ξ) be the ideal distribution with parameters ξ that describes the physical
system and q(ξ + Δξ) be the observations of the system with some small fluctuation Δξ in parameters.
In [6], the averaged information distance I(Δξ) between the distribution and its observations, the
so-called shift information, is used as a disorder measure of the fluctuated observations to reinterpret
the EPI principle. More specifically, in the framework of information geometry, this information
distance could also be assessed using the Fisher information distance induced by the Fisher–Rao
metric, which can be decomposed into the variation in the direction of each system parameter [7].
In principle, it is possible to divide system parameters into two categories, i.e., the parameters with
notable variations and the parameters with negligible variations, according to their contributions to the
whole information distance. Additionally, the parameters with notable contributions are considered
to be confident, since they are important for reliably distinguishing the ideal distribution from its
observation distributions. On the other hand, the parameters with negligible contributions can be
considered to be unreliable or noisy. Then, the CIF principle can be stated as the parameter selection
criterion that maximally preserves the Fisher information distance in an expected sense with respect
to the constraint of the parametric number and the given observations (if available), when projecting
distributions from the parameter space of S into that of the reduced sub-model M. We call it the
distance-based CIF. As a result, we could manipulate the information bound of the underlying system
by preserving the information of confident parameters and ruling out noisy parameters.
In this paper, the CIF principle is analyzed in the multivariate binary distribution family in the
mixed-coordinate system [8]. It turns out that, in this setting, the confidence of
a parameter can be directly evaluated by its Fisher information, which also establishes a connection
with the inverse variance of any unbiased estimate for the parameter via the Cramér–Rao bound [3].
Hence, the CIF principle can also be interpreted as the parameter selection procedure that keeps the
parameters with reliable estimates and rules out unreliable or noisy parameters. This CIF is called
the information-based CIF. Note that the definition of confidence in distance-based CIF depends on
both Fisher information and the scale of fluctuation, and the confidence in the information-based CIF
(i.e., Fisher information) can be seen as a special case of confidence measure with respect to certain
coordinate systems. This simplification allows us to further apply the CIF principle to improve existing
learning algorithms for the Boltzmann machine.
The paper is organized as follows. In Section 2, we introduce the parametric formulation for
the general multivariate binary distributions in terms of information geometry (IG) framework [7].
Then, Section 3 describes the implementation details of the CIF principle. We also give a geometric
interpretation of CIF by showing that it can maximally preserve the expected information distance (in
Section 3.2.1), as well as the analysis on the scale of the information distance in each individual system
parameter (in Section 3.2.2). In Section 4, we demonstrate that a widely used cognitive model, i.e., the
Boltzmann machine, can be derived using the CIF principle. Additionally, an illustrative experiment is
conducted to show how the CIF principle can be utilized to improve the density estimation performance
of the Boltzmann machine in Section 5.


2. The Multivariate Binary Distributions

Similar to EPI, the derivation of CIF depends on the analysis of the physical information bound,
where the choice of system parameters, also called “Fisher coordinates” in Frieden [2], is crucial.
Based on information geometry (IG) [7], we introduce some choices of parameterizations for binary
multivariate distributions (denoted as statistical manifold S) with a given number of variables n, i.e.,
the open simplex of all probability distributions over binary vectors $x \in \{0, 1\}^n$.

2.1. Notations for Manifold S

In IG, a family of probability distributions is considered as a differentiable manifold with certain
parametric coordinate systems. In the case of binary multivariate distributions, four basic coordinate
systems are often used: p-coordinates, η-coordinates, θ-coordinates and mixed-coordinates [7,9].
Mixed-coordinates are of vital importance for our analysis.
For the p-coordinates [p] with n binary variables, the probability distribution over the $2^n$ states
of x can be completely specified by any $2^n - 1$ positive numbers indicating the probability of the
corresponding exclusive states on n binary variables. For example, the p-coordinates of n = 2 variables
could be $[p] = (p_{01}, p_{10}, p_{11})$. Note that IG requires all probability terms to be positive [7].
For simplicity, we use the capital letters I, J, . . . to index the coordinate parameters of probabilistic
distribution. To distinguish the notation of Fisher information (conventionally used in literature,
e.g., data information I and information bound J in Section 1) from the coordinate indexes, we
make explicit explanations when necessary from now on. An index I can be regarded as a subset
of {1, 2, . . . , n}. Additionally, $p_I$ stands for the probability that all variables indicated by I equal
one and the complementary variables are zero. For example, if I = {1, 2, 4} and n = 4, then
$p_I = p_{1101} = \mathrm{Prob}(x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1)$. Note that the null set can also be a legal index of
the p-coordinates, which indicates the probability that all variables are zero, denoted as $p_{0 \ldots 0}$.
Another coordinate system often used in IG is η-coordinates, which is defined by:

$$ \eta_I = E[X_I] = \mathrm{Prob}\Big\{ \prod_{i \in I} x_i = 1 \Big\} \tag{1} $$

where the value of $X_I$ is given by $\prod_{i \in I} x_i$ and the expectation is taken with respect to the probability
distribution over x. Grouping the coordinates by their orders, the η-coordinate system is denoted
as $[\eta] = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^n_{1,2,\ldots,n})$, where the superscript indicates the order number of the corresponding
parameter. For example, $\eta^2_{ij}$ denotes the set of all η parameters with the order number two.
The θ-coordinates (natural coordinates) are defined by:

$$ \log p(x) = \sum_{I \subseteq \{1, 2, \ldots, n\},\, I \neq \emptyset} \theta^I X_I - \psi(\theta) \tag{2} $$

where $\psi(\theta) = \log\big(\sum_x \exp\{\sum_I \theta^I X_I(x)\}\big)$ is the cumulant generating function and its value equals
$-\log \mathrm{Prob}\{x_i = 0, \forall i \in \{1, 2, \ldots, n\}\}$. The θ-coordinate system is denoted as
$[\theta] = (\theta^i_1, \theta^{ij}_2, \ldots, \theta^{1,\ldots,n}_n)$, where
the subscript indicates the order number of the corresponding parameter. Note that the order indices
are located at different positions in [η] and [θ], following the convention in Amari et al. [8].
The relation between coordinate systems [η] and [θ] is bijective. More formally, they are connected
by the Legendre transformation:

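$$ \theta^I = \frac{\partial \varphi(\eta)}{\partial \eta_I}, \qquad \eta_I = \frac{\partial \psi(\theta)}{\partial \theta^I} \tag{3} $$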

where ψ(θ) is given in Equation (2) and φ(η) = ∑x p(x; η) log p(x; η) is the negative of entropy. It can
be shown that ψ(θ) and φ(η) meet the following identity [7]:

$$ \psi(\theta) + \varphi(\eta) - \sum_I \theta^I \eta_I = 0 \tag{4} $$
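
The coordinate systems above can be illustrated numerically. The following is a minimal Python sketch, not part of the original derivation: it computes the η- and θ-coordinates of a distribution over n = 3 binary variables and evaluates the left-hand side of identity (4). The helper names `subsets` and `from_bits` are hypothetical, the probabilities are those of the example in Section 3.1, and the θ-coordinates are obtained by Möbius inversion of log p, which follows from Equation (2) under the assumption that all probabilities are positive.

```python
from itertools import combinations
from math import log

def subsets(items):
    """All subsets of `items` as frozensets, the empty set included."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def from_bits(table):
    """Map bit strings such as '101' (x1 x2 x3) to index sets such as {1, 3}."""
    return {frozenset(i + 1 for i, b in enumerate(bits) if b == "1"): v
            for bits, v in table.items()}

# Example distribution over n = 3 binary variables (p-coordinates).
n = 3
p = from_bits({"001": 0.15, "010": 0.10, "011": 0.05, "100": 0.20,
               "101": 0.10, "110": 0.05, "111": 0.30})
p[frozenset()] = 1.0 - sum(p.values())   # p_000 is fixed by normalisation

indices = [I for I in subsets(range(1, n + 1)) if I]   # non-empty index sets

# Equation (1): eta_I sums the probabilities of all states whose ones contain I.
eta = {I: sum(pK for K, pK in p.items() if I <= K) for I in indices}

# Moebius inversion of log p_K = sum_{I subset of K} theta^I - psi (Equation (2)).
theta = {I: sum((-1) ** (len(I) - len(K)) * log(p[K]) for K in subsets(I))
         for I in indices}

psi = -log(p[frozenset()])                     # cumulant generating function
phi = sum(pK * log(pK) for pK in p.values())   # negative entropy

# Left-hand side of identity (4); it should vanish up to rounding error.
legendre_gap = psi + phi - sum(theta[I] * eta[I] for I in indices)
print(eta[frozenset({1})], theta[frozenset({1, 2, 3})], legendre_gap)
```

Index sets are stored as frozensets so that the subset test `I <= K` and the union `I | J` map directly onto the set notation used in the text.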


Next, we introduce mixed-coordinates, which is important for our derivation of CIF. In general,
the manifold S of probability distributions could be represented by the l-mixed-coordinates [8]:

$$ [\zeta]_l = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^l_{i,j,\ldots,k}, \theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n) \tag{5} $$

where the first part consists of η-coordinates with order less than or equal to l (denoted by [ηl−]) and the
second part consists of θ-coordinates with order greater than l (denoted by [θl+]), l ∈ {1, ..., n − 1}.

2.2. Fisher Information Matrix for Parametric Coordinates

For a general coordinate system [ξ], the i-th row and j-th column element of the Fisher information
matrix for [ξ] (denoted by Gξ) is defined as the covariance of the scores of [ξi] and [ξj] [3], i.e.,

$$ g_{ij} = E\Big[ \frac{\partial \log p(x; \xi)}{\partial \xi_i} \cdot \frac{\partial \log p(x; \xi)}{\partial \xi_j} \Big] $$

under the regularity condition for the pdf that the partial derivatives exist. The Fisher information
measures the amount of information in the data that a statistic carries about the unknown
parameters [10]. The Fisher information matrix is of vital importance to our analysis, because the
inverse of Fisher information matrix gives an asymptotically tight lower bound to the covariance
matrix of any unbiased estimate for the considered parameters [3]. Another important concept related
to our analysis is the orthogonality defined by Fisher information. Two coordinate parameters ξi and
ξj are called orthogonal if and only if their Fisher information vanishes, i.e., gij = 0, meaning that their
influences on the log likelihood function are uncorrelated.

The Fisher information for [θ] can be rewritten as $g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \partial \theta^J}$, and for [η], it is
$g^{IJ} = \frac{\partial^2 \varphi(\eta)}{\partial \eta_I \partial \eta_J}$ [7].
Let $G_\theta = (g_{IJ})$ and $G_\eta = (g^{IJ})$ be the Fisher information matrices for [θ] and [η], respectively. It can be
shown that $G_\theta$ and $G_\eta$ are mutually inverse matrices, i.e., $\sum_J g^{IJ} g_{JK} = \delta^I_K$, where $\delta^I_K = 1$ if I = K and
zero otherwise [7]. In order to generally compute $G_\theta$ and $G_\eta$, we develop the following Propositions 1
and 2. Note that Proposition 1 is a generalization of Theorem 2 in Amari et al. [8].

Proposition 1. The Fisher information between two parameters θI and θJ in [θ], is given by:

$$ g_{IJ}(\theta) = \eta_{I \cup J} - \eta_I \eta_J \tag{6} $$

Proof. in Appendix A.

Proposition 2. The Fisher information between two parameters ηI and ηJ in [η], is given by:

$$ g^{IJ}(\eta) = \sum_{K \subseteq I \cap J} (-1)^{|I - K| + |J - K|} \cdot \frac{1}{p_K} \tag{7} $$

where | · | denotes the cardinality operator.

Proof. in Appendix B.
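
Propositions 1 and 2 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions rather than the authors' code: it builds both matrices for the three-variable distribution used in Section 3.1 and measures how far their product deviates from the identity, which the mutual-inverse property predicts to be zero up to rounding. The helper `subsets` and all variable names are hypothetical.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, the empty set included."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# p-coordinates of the example, keyed by bit string x1 x2 x3 (p_000 included).
bits = {"000": 0.05, "001": 0.15, "010": 0.10, "011": 0.05,
        "100": 0.20, "101": 0.10, "110": 0.05, "111": 0.30}
p = {frozenset(i + 1 for i, b in enumerate(k) if b == "1"): v
     for k, v in bits.items()}

n = 3
idx = [I for I in subsets(range(1, n + 1)) if I]   # non-empty index sets
eta = {I: sum(pK for K, pK in p.items() if I <= K) for I in idx}

# Proposition 1: g_IJ(theta) = eta_{I union J} - eta_I * eta_J
G_theta = [[eta[I | J] - eta[I] * eta[J] for J in idx] for I in idx]

# Proposition 2: g^IJ(eta) = sum over K subset of (I intersect J)
#                of (-1)^(|I-K| + |J-K|) / p_K
G_eta = [[sum((-1) ** (len(I - K) + len(J - K)) / p[K]
              for K in subsets(I & J))
          for J in idx] for I in idx]

# The two matrices should be mutually inverse: G_theta * G_eta = identity.
m = len(idx)
prod = [[sum(G_theta[i][k] * G_eta[k][j] for k in range(m))
         for j in range(m)] for i in range(m)]
max_dev = max(abs(prod[i][j] - (1.0 if i == j else 0.0))
              for i in range(m) for j in range(m))
print(max_dev)
```

Plain nested lists are used instead of a numerical library so that the sketch depends only on the standard library; for larger n a dedicated linear-algebra package would be the more natural choice.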

Based on the Fisher information matrices Gη and Gθ, we can calculate the Fisher information
matrix Gζ for the l-mixed-coordinate system [ζ]l, as follows:

Proposition 3. The Fisher information matrix Gζ of the l-mixed-coordinates [ζ]l is given by:

$$ G_\zeta = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} \tag{8} $$

where $A = ((G_\eta^{-1})_{I_\eta})^{-1}$, $B = ((G_\theta^{-1})_{J_\theta})^{-1}$, $G_\eta$ and $G_\theta$ are the Fisher information matrices of [η] and [θ],
respectively, $I_\eta$ is the index set of the parameters shared by [η] and $[\zeta]_l$, i.e., $\{\eta^1_i, \ldots, \eta^l_{i,j,\ldots,k}\}$, and $J_\theta$ is the index
set of the parameters shared by [θ] and $[\zeta]_l$, i.e., $\{\theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n\}$.

Proof. in Appendix C.
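
As an illustration of Proposition 3, the following Python sketch assembles the two diagonal blocks A and B for n = 3 and l = 2. It is a sketch under assumptions, not the authors' implementation: it uses the mutual-inverse property stated above to write the restricted inverse of the η-matrix as a restricted block of the θ-matrix (and vice versa), so that only the small blocks need to be inverted. The helpers `subsets` and `inverse` (a small Gauss-Jordan routine) are hypothetical, and the distribution is the one from the example in Section 3.1.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, the empty set included."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    m = len(M)
    A = [list(row) + [float(i == j) for j in range(m)]
         for i, row in enumerate(M)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(m):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[m:] for row in A]

bits = {"000": 0.05, "001": 0.15, "010": 0.10, "011": 0.05,
        "100": 0.20, "101": 0.10, "110": 0.05, "111": 0.30}
p = {frozenset(i + 1 for i, b in enumerate(k) if b == "1"): v
     for k, v in bits.items()}
n, l = 3, 2
idx = [I for I in subsets(range(1, n + 1)) if I]
eta = {I: sum(pK for K, pK in p.items() if I <= K) for I in idx}

def g_theta(I, J):   # Proposition 1
    return eta[I | J] - eta[I] * eta[J]

def g_eta(I, J):     # Proposition 2
    return sum((-1) ** (len(I - K) + len(J - K)) / p[K]
               for K in subsets(I & J))

low = [I for I in idx if len(I) <= l]    # eta part of the mixed coordinates
high = [I for I in idx if len(I) > l]    # theta part of the mixed coordinates

# The inverse of G_eta is G_theta, so A inverts the low-order block of G_theta;
# likewise B inverts the high-order block of G_eta.
A = inverse([[g_theta(I, J) for J in low] for I in low])
B = inverse([[g_eta(I, J) for J in high] for I in high])
diag_A = [A[i][i] for i in range(len(low))]
diag_B = [B[i][i] for i in range(len(high))]
print(diag_A, diag_B)
```

For this example the θ-part contains a single parameter, so B reduces to the reciprocal of the sum of the inverse probabilities of all eight states.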

3. The General CIF Principle

In this section, we propose the CIF principle to reduce the physical information bound for
high-dimensional systems. Given a target distribution q(x) ∈ S, we consider the problem of
realizing it by a lower-dimensional submanifold. This is defined as the problem of parametric
reduction for multivariate binary distributions. The family of multivariate binary distributions has
been proven to be useful when we deal with discrete data in a variety of applications in statistical
machine learning and artificial intelligence, such as the Boltzmann machine in neural networks [11,12]
and the Rasch model in human sciences [13,14].
Intuitively, if we can construct a coordinate system so that the confidences of its parameters
entail a natural hierarchy, in which highly confident parameters are significantly distinguished from
and orthogonal to less confident ones, then we can conveniently implement CIF by keeping
the highly confident parameters unchanged and setting the less confident parameters to neutral
values. Therefore, the choice of coordinates (or parametric representations) in CIF is crucial to its
usage. This strategy is infeasible in terms of p-coordinates, η-coordinates or θ-coordinates, since the
orthogonality condition cannot hold in these coordinate systems. In this section, we will show that the
l-mixed-coordinates [ζ]l meets the requirement of CIF.
In principle, the confidence of parameters should be assessed according to their contributions to
the expected information distance between the ideal distribution and its fluctuated observations. This is
called the distance-based CIF (see Section 1). For some coordinate systems, e.g., the mixed-coordinate
system [ζ]l, the confidence of a parameter can also be directly evaluated by its Fisher information. This
is called the information-based CIF (see Section 1). The information-based CIF (i.e., Fisher information)
can be seen as an approximation to distance-based CIF, since it neglects the influence of parameter
scaling to the expected information distance. However, considering the standard mixed-coordinates
[ζ]l for the manifold of multivariate binary distributions, it turns out that both distance-based CIF and
information-based CIF entail the same submanifold M (refer to Section 3.2 for detailed reasons).
For the purpose of legibility, we will start with the information-based CIF, where the parameter’s
confidence is simply measured using its Fisher information. After that, we show that the
information-based CIF leads to an optimal submanifold M, which is also optimal in terms of the
more rigorous distance-based CIF.

3.1. The Information-Based CIF Principle

In this section, we will show that the l-mixed-coordinates [ζ]l meet the requirement of the
information-based CIF. According to Proposition 3 and the following Proposition 4, the confidences of
coordinate parameters (measured by Fisher information) in [ζ]l entail a natural hierarchy: the first part
of highly confident parameters [ηl−] is separated from the second part of less confident parameters
[θl+]. Additionally, those less confident parameters [θl+] have the neutral value of zero.

Proposition 4. The diagonal elements of A are lower bounded by one, and those of B are upper bounded by one.

Proof. in Appendix D.

Moreover, the parameters in [ηl−] are orthogonal to the ones in [θl+], indicating that we could
estimate these two parts independently [9]. Hence, we can implement the information-based
CIF for parametric reduction in [ζ]l by replacing less confident parameters with the neutral value
zero and reconstructing the resulting distribution. It turns out that the submanifold of S
tailored by information-based CIF becomes $[\zeta]_{lt} = (\eta^1_i, \ldots, \eta^l_{ij \ldots k}, 0, \ldots, 0)$. We call $[\zeta]_{lt}$ the
l-tailored-mixed-coordinates.
To grasp an intuitive picture for the CIF strategy and its significance with respect to mixed-coordinates,
let us consider an example with $[p] = (p_{001} = 0.15, p_{010} = 0.1, p_{011} = 0.05, p_{100} = 0.2, p_{101} =
0.1, p_{110} = 0.05, p_{111} = 0.3)$. Then, the confidences for coordinates in [η], [θ] and $[\zeta]_2$ are given by the
diagonal elements of the corresponding Fisher information matrices. Applying the two-tailored CIF in
mixed-coordinates, the loss ratio of Fisher information is 0.001%, and the ratio of the Fisher information
of the tailored parameter ($\theta^{123}_3$) to the remaining η parameter with the smallest Fisher information is
0.06%. On the other hand, the above two ratios become 7.58% and 94.45% (in η-coordinates) or 12.94%
and 92.31% (in θ-coordinates), respectively. We can see that $[\zeta]_2$ gives us a much better way to tell
apart confident parameters from noisy ones.
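
The tailoring step itself (keep the η-coordinates up to order two, set the third-order θ-coordinate to zero, and reconstruct the distribution) can be sketched for this example as follows. The reconstruction method is an assumption of this sketch, not a procedure given in the text: it uses iterative proportional fitting (IPF), a standard technique that starts from the uniform distribution and repeatedly rescales it to match the pairwise marginals of the target. All function and variable names are hypothetical, and variables are indexed from zero in the code.

```python
from itertools import product
from math import log

states = list(product((0, 1), repeat=3))   # (x1, x2, x3), 000 first
q = dict(zip(states, [0.05, 0.15, 0.10, 0.05, 0.20, 0.10, 0.05, 0.30]))

def pair_marginal(dist, i, j):
    """Joint distribution of (x_i, x_j) under `dist`."""
    m = {}
    for x, v in dist.items():
        m[(x[i], x[j])] = m.get((x[i], x[j]), 0.0) + v
    return m

def theta_top(dist):
    """Third-order theta-coordinate via Moebius inversion of log-probabilities."""
    return sum((-1) ** (3 - sum(x)) * log(v) for x, v in dist.items())

pairs = [(0, 1), (0, 2), (1, 2)]
target = {ij: pair_marginal(q, *ij) for ij in pairs}

# The uniform start has all theta-coordinates equal to zero. Each IPF update
# multiplies p by a function of one pair of variables only, so the
# third-order theta stays zero while the pairwise marginals (and hence all
# eta-coordinates of order one and two) are driven towards those of q.
p = {x: 1.0 / len(states) for x in states}
for _ in range(500):
    for i, j in pairs:
        current = pair_marginal(p, i, j)
        p = {x: v * target[(i, j)][(x[i], x[j])] / current[(x[i], x[j])]
             for x, v in p.items()}

print(theta_top(q), theta_top(p))
print({x: round(v, 4) for x, v in p.items()})
```

The fixed number of 500 sweeps is a simple, conservative choice for this small example; a convergence check on the marginals would be the more careful stopping rule.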

3.2. The Distance-Based CIF: A Geometric Point-of-View

In the previous section, the information-based CIF entails a submanifold of S determined by the
l-tailored-mixed-coordinates [ζ]lt. A more rigorous definition for the confidence of coordinates is the
distance-based confidence used in the distance-based CIF, which relies on both of the coordinate’s
Fisher information and its fluctuation scaling. In this section, we will show that the submanifold
M determined by [ζ]lt is also an optimal submanifold in terms of the distance-based CIF. Note that,
for other coordinate systems (e.g., arbitrarily rescaling coordinates), the information-based CIF may
not entail the same submanifold as the distance-based CIF.
Let q(x), with coordinate ζq, denote the exact solution to the physical phenomenon being
measured. Additionally, the act of observation would cause small random perturbations to q(x),
leading to some observation q′(x) with coordinate ζq + Δζq. When two distributions q(x) and q′(x) are
close, the divergence between q(x) and q′(x) on manifold S could be assessed by the Fisher information
distance: $D(q, q') = (\Delta\zeta_q \cdot G_\zeta \cdot \Delta\zeta_q)^{1/2}$, where $G_\zeta$ is the Fisher information matrix and the perturbation
Δζq is small. The Fisher information distance between two close distributions q(x) and q′(x) on
manifold S is the Riemannian distance under the Fisher–Rao metric, which is shown to be the square
root of twice the Kullback–Leibler divergence from q(x) to q′(x) [8]. Note that we adopt the
Fisher information distance as the distance measure between two close distributions, since it is shown
to be the unique metric meeting a set of natural axioms for the distribution metrics [7,15,16], e.g., the
invariant property with respect to reparametrizations and the monotonicity with respect to the random
maps on variables.
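
The relation between the Fisher information distance and the K-L divergence for nearby distributions can be checked numerically. The sketch below is illustrative only and makes two assumptions: the perturbation values are hypothetical, and the squared distance is evaluated in p-coordinates, where the Fisher–Rao quadratic form on the probability simplex takes the form of a sum of squared perturbations divided by the probabilities (by invariance under reparametrization, the same value would be obtained to second order in any other coordinate system).

```python
from math import log, sqrt

# Distribution over the 8 states of three binary variables (order 000 ... 111).
q = [0.05, 0.15, 0.10, 0.05, 0.20, 0.10, 0.05, 0.30]

# Hypothetical small perturbation; the pattern sums to zero so that q' stays
# normalised.
delta = [1e-4 * s for s in (1, -1, 2, -2, 1, -1, 3, -3)]
q_prime = [a + d for a, d in zip(q, delta)]

# Squared Fisher information distance in p-coordinates.
fisher_sq = sum(d * d / a for a, d in zip(q, delta))

# K-L divergence from q to q'.
kl = sum(a * log(a / b) for a, b in zip(q, q_prime))

# For small perturbations, the distance is close to sqrt(2 * KL).
print(sqrt(fisher_sq), sqrt(2.0 * kl))
```

The two printed values agree only to leading order; the gap shrinks as the perturbation is made smaller.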
Let M be a smooth k-dimensional submanifold in S ($k < 2^n - 1$). Given the point q(x) ∈ S,
the projection [8] of q(x) on M is the point p(x) that belongs to M and is closest to q(x) with respect
to the Kullback–Leibler divergence (K-L divergence) [17] from the distribution q(x) to p(x). On the
submanifold M, the projections of q(x) and q′(x) are p(x) and p′(x), with coordinates ζp and ζp + Δζp,
respectively, shown in Figure 2.
Let the preserved Fisher information distance be D(p, p′) after projecting on M. In order to retain
the information contained in observations, we need the ratio $\frac{D(p, p')}{D(q, q')}$ to be as large as possible in the
expected sense, with respect to the given dimensionality k of M. The next two sections will illustrate
that CIF leads to an optimal submanifold M based on different assumptions on the perturbations Δζq.


Figure 2. By projecting a point q(x) on S to a submanifold M, the l-tailored mixed-coordinates [ζ]lt
gives a desirable M that maximally preserves the expected Fisher information distance when projecting
a ε-neighborhood centered at q(x) onto M.

3.2.1. Perturbations in Uniform Neighborhood

Let Bq be an ε-sphere surface centered at q(x) on manifold S, i.e., $B_q = \{q' \in S \mid KL(q, q') = \varepsilon\}$,
where KL(·, ·) denotes the K-L divergence and ε is small. Additionally, q′(x) is a neighbor of q(x)
uniformly sampled on Bq, as illustrated in Figure 2. Recall that, for a small ε, the K-L divergence can be
approximated by half of the squared Fisher information distance. Thus, in the parameterization
of [ζ]l, Bq is indeed the surface of a hyper-ellipsoid (centered at q(x)) determined by Gζ. The
following proposition shows that the general CIF would lead to an optimal submanifold M that
maximally preserves the expected information distance, where the expectation is taken upon the
uniform neighborhood, Bq.

Proposition 5. Consider the manifold S in l-mixed-coordinates [ζ]l. Let k be the number of free parameters
in the l-tailored-mixed-coordinates [ζ]lt. Then, among all k-dimensional submanifolds of S, the submanifold
determined by [ζ]lt can maximally preserve the expected information distance induced by the Fisher–Rao metric.

Proof. in Appendix E.

3.2.2. Perturbations in Typical Distributions

To facilitate our analysis, we make a basic assumption on the underlying distributions q(x)
that at least $(2^n - 2^{n/2})$ p-coordinates are of the scale ϵ, where ϵ is a sufficiently small value. Thus,
residual p-coordinates (at most $2^{n/2}$) are all significantly larger than zero (of scale $\Theta(1/2^{n/2})$), and
their sum approximates one. Note that these assumptions are common situations in real-world data
collections [18], since the frequent (or meaningful) patterns are only a small fraction of all of the
system states.
Next, we introduce a small perturbation Δp to the p-coordinates [p] for the ideal distribution
q(x). The scale of each fluctuation ΔpI is assumed to be proportional to the standard deviation of
the corresponding p-coordinate pI by some small coefficients (upper bounded by a constant a), which can
be approximated by the inverse of the square root of its Fisher information via the Cramér–Rao bound.
It turns out that we can assume the perturbation $\Delta p_I$ to be $a\sqrt{p_I}$.
In this section, we adopt the l-mixed-coordinates [ζ]l = (ηl−; θl+), where l = 2 is used in
the following analysis. Let Δζq = (Δη2−; Δθ2+) be the increment of mixed-coordinates after the
perturbation. The squared Fisher information distance $D^2(p, p') = \Delta\zeta_q \cdot G_\zeta \cdot \Delta\zeta_q$ could be decomposed
into the direction of each coordinate in [ζ]l. We will clarify that, under typical cases, the scale of the
Fisher information distance in each coordinate of θl+ (reduced by CIF) is asymptotically negligible,
compared to that in each coordinate of ηl− (preserved by CIF).
The scale of squared Fisher information distance in the direction of ηI is proportional to
$\Delta\eta_I \cdot (G_\zeta)_{I,I} \cdot \Delta\eta_I$, where $(G_\zeta)_{I,I}$ is the Fisher information of ηI in terms of the mixed-coordinates $[\zeta]_2$. From
Equation (1), for any I of order one (or two), ηI is the sum of $2^{n-1}$ (or $2^{n-2}$) p-coordinates, and the scale
is Θ(1). Hence, the increment Δη2− is proportional to Θ(1), denoted as a · Θ(1). It is difficult to give
an explicit expression of $(G_\zeta)_{I,I}$ analytically. However, the Fisher information $(G_\zeta)_{I,I}$ of ηI is bounded
by the (I, I)-th element of the inverse covariance matrix [19], which is exactly
$1/g_{I,I}(\theta) = \frac{1}{\eta_I - \eta_I^2}$ (see
Proposition 3). Hence, the scale of $(G_\zeta)_{I,I}$ is also Θ(1). It turns out that the scale of squared Fisher
information distance in the direction of ηI is $a^2 \cdot \Theta(1)$.
Similarly, for the part θ2+, the scale of squared Fisher information distance in the direction of
θJ is proportional to $\Delta\theta_J \cdot (G_\zeta)_{J,J} \cdot \Delta\theta_J$, where $(G_\zeta)_{J,J}$ is the Fisher information of θJ in terms of the
mixed-coordinates $[\zeta]_2$. The scale of θJ is maximally $f(k)\,|\log\sqrt{\epsilon}|$ based on Equation (2), where k is
the order of θJ and f(k) is the number of p-coordinates of scale $\Theta(1/2^{n/2})$ that are involved in the
calculation of θJ. Since we assume that $f(k) \le 2^{n/2}$, the maximum scale of θJ is $2^{n/2}\,|\log\sqrt{\epsilon}|$. Thus,
the increment ΔθJ is of a scale bounded by $a \cdot 2^{n/2}\,|\log\sqrt{\epsilon}|$. Similar to our previous derivation, the
Fisher information $(G_\zeta)_{J,J}$ of θJ is bounded by the (J, J)-th element of the inverse covariance matrix,
which is exactly $1/g^{J,J}(\eta)$ (see Proposition 3). Hence, the scale of $(G_\zeta)_{J,J}$ is $(2^k - f(k))^{-1}\epsilon$. In summary,
the scale of squared Fisher information distance in the direction of θJ is bounded by the scale of
$a^2 \cdot \Theta\Big(\frac{2^n \epsilon\, |\log\sqrt{\epsilon}|^2}{2^k - f(k)}\Big)$. Since ϵ is a sufficiently small value and a is constant, the scale of squared Fisher
information distance in the direction of θJ is asymptotically zero.
In summary, in terms of modeling the fluctuated observations of typical cognitive systems, the
original Fisher information distance between the physical phenomenon (q(x)) and observations (q′(x))
is systematically reduced using CIF by projecting them on an optimal submanifold M. Based on our
above analysis, the scale of Fisher information distance in the directions of [ηl−] preserved by CIF is
significantly larger than that of the directions [θl+] reduced by CIF.

4. Derivation of Boltzmann Machine by CIF

In the previous section, the CIF principle was uncovered in the [ζ]l coordinates. Now, we consider
an implementation of CIF when l equals two, which gives rise to the single-layer Boltzmann machine
without hidden units (SBM).

4.1. Notations for SBM

The energy function for SBM is given by:

$$E_{SBM}(x; \xi) = -\frac{1}{2} x^T U x - b^T x \tag{9}$$

where ξ = {U, b} are the parameters and the diagonals of U are set to zero. The Boltzmann
distribution over x is $p(x; \xi) = \frac{1}{Z}\exp\{-E_{SBM}(x; \xi)\}$, where Z is a normalization factor. Actually,
the parametrization for SBM could be naturally expressed by the coordinate systems in IG (e.g.,
$[\theta] = (\theta^{i}_{1} = b_i,\ \theta^{ij}_{2} = U_{ij},\ \theta^{ijk}_{3} = 0,\ \ldots,\ \theta^{1,2,\ldots,n}_{n} = 0)$).
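
As a minimal illustration of Equation (9) and the Boltzmann distribution, the following Python sketch enumerates all 2^n binary states of a small SBM. The function names and the example values of U and b are hypothetical and chosen only for illustration; brute-force enumeration is feasible only for small n.

```python
import itertools
import math

def sbm_energy(x, U, b):
    """E(x) = -1/2 x^T U x - b^T x for a symmetric U with zero diagonal."""
    n = len(x)
    quad = sum(U[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * x[i] for i in range(n))
    return -0.5 * quad - lin

def sbm_distribution(U, b):
    """Return {state: probability} by enumerating every state in {0,1}^n."""
    n = len(b)
    states = list(itertools.product((0, 1), repeat=n))
    weights = [math.exp(-sbm_energy(x, U, b)) for x in states]
    Z = sum(weights)  # normalization factor
    return {x: w / Z for x, w in zip(states, weights)}

# Hypothetical parameters for a 3-unit SBM (illustrative values only).
U = [[0.0, 0.8, -0.3],
     [0.8, 0.0, 0.5],
     [-0.3, 0.5, 0.0]]
b = [0.1, -0.2, 0.3]
p = sbm_distribution(U, b)
```

Under this parametrization, the log-odds between the state (1, 1, 0) and the all-zero state equals U[0][1] + b[0] + b[1], which reflects the role of b and U as first- and second-order θ-coordinates.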

4.2. The Derivation of SBM using CIF

Given any underlying probability distribution q(x) on the general manifold S over {x}, the
logarithm of q(x) can be represented by a linear decomposition of θ-coordinates, as shown in
Equation (2). Since it is impractical to recognize all coordinates for the target distribution, we would
like to only approximate part of them and end up with a k-dimensional submanifold M of S, where k
($\ll 2^n - 1$) is the number of free parameters. Here, we set k to be the same dimensionality as SBM,
i.e., $k = \frac{n(n+1)}{2}$, so that all candidate submanifolds are comparable to the submanifold endowed by
SBM (denoted as Msbm). Next, the rationale underlying the design of Msbm can be illustrated using the
general CIF.
Let the two-mixed-coordinates of q(x) on S be
$[\zeta]_2 = (\eta^{1}_{i},\ \eta^{2}_{ij},\ \theta^{i,j,k}_{3},\ \ldots,\ \theta^{1,\ldots,n}_{n})$. Applying the
general CIF on [ζ]2, our parametric reduction rule is to preserve the highly confident parameters
[η2−] and replace the less confident parameters [θ2+] by a fixed neutral value of zero. Thus, we derive
the two-tailored-mixed-coordinates: $[\zeta]_{2t} = (\eta^{1}_{i},\ \eta^{2}_{ij},\ 0,\ \ldots,\ 0)$, as the optimal approximation of q(x)
by the k-dimensional submanifolds. On the other hand, given the two-mixed-coordinates of q(x),
the projection p(x) ∈ Msbm of q(x) is proven to be $[\zeta]_p = (\eta^{1}_{i},\ \eta^{2}_{ij},\ 0,\ \ldots,\ 0)$ [8]. Thus, SBM defines a
probabilistic parameter space that is derived from CIF.

4.3. The Learning Algorithms for SBM

Let q(x) be the underlying probability distribution from which samples D = {d1, d2, . . . , dN} are
generated independently. Then, our goal is to train an SBM (with stationary probability p(x)) based on
D that realizes q(x) as faithfully as possible. Here, we briefly introduce two typical learning algorithms
for SBM: maximum-likelihood and contrastive divergence [11,20,21].
Maximum-likelihood (ML) learning realizes a gradient ascent of the log-likelihood of D:

$$\Delta U_{ij} = \varepsilon \frac{\partial l(\xi; D)}{\partial U_{ij}} = \varepsilon \left(E_q[x_i x_j] - E_p[x_i x_j]\right) \tag{10}$$

where ε is the learning rate and $l(\xi; D) = \frac{1}{N}\sum_{n=1}^{N} \log p(d_n; \xi)$. Eq[·] and Ep[·] are expectations over
q(x) and p(x), respectively. Actually, Eq[xixj] and Ep[xixj] are the coordinates $\eta^{2}_{ij}$ of q(x) and p(x),
respectively. Eq[xixj] can be estimated without bias from the sample. Markov chain Monte Carlo [22]
is often used to approximate Ep[xixj] with an average over samples from p(x).
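
The update in Equation (10) can be sketched in Python for a toy problem. This is only an illustration under stated assumptions: Ep[xixj] is computed exactly by enumeration rather than by Markov chain Monte Carlo, the bias update Δb_i = ε(Eq[xi] − Ep[xi]) is the analogous first-order rule and is not spelled out in the text, and all function names and the toy dataset are hypothetical.

```python
import itertools
import math

def model_distribution(U, b):
    """Exact SBM distribution over {0,1}^n by enumeration (small n only)."""
    n = len(b)
    states = list(itertools.product((0, 1), repeat=n))
    weights = []
    for x in states:
        # -E(x) = sum_{i<j} U_ij x_i x_j + sum_i b_i x_i (U symmetric, zero diagonal)
        neg_e = sum(b[i] * x[i] for i in range(n))
        neg_e += sum(U[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
        weights.append(math.exp(neg_e))
    z = sum(weights)
    return states, [w / z for w in weights]

def second_moments(states, probs):
    """Matrix of E[x_i x_j]; the diagonal holds E[x_i] because x_i is binary."""
    n = len(states[0])
    return [[sum(p * x[i] * x[j] for x, p in zip(states, probs))
             for j in range(n)] for i in range(n)]

def ml_step(U, b, data, lr):
    """One gradient-ascent step of Equation (10), plus the assumed bias update."""
    n = len(b)
    uniform = [1.0 / len(data)] * len(data)
    eq = second_moments(data, uniform)               # E_q[x_i x_j] from the sample
    ep = second_moments(*model_distribution(U, b))   # E_p[x_i x_j], computed exactly
    new_U = [[0.0 if i == j else U[i][j] + lr * (eq[i][j] - ep[i][j])
              for j in range(n)] for i in range(n)]
    new_b = [b[i] + lr * (eq[i][i] - ep[i][i]) for i in range(n)]
    return new_U, new_b

# Toy dataset over two binary variables (hypothetical).
data = [(1, 1), (1, 1), (0, 0), (1, 0), (0, 1)]
U, b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
for _ in range(2000):
    U, b = ml_step(U, b, data, lr=0.5)
fitted = second_moments(*model_distribution(U, b))
```

At convergence, the model moments in `fitted` match the sample moments, i.e., the η-coordinates of p(x) up to second order coincide with those of the empirical distribution.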
Contrastive divergence (CD) learning realizes the gradient descent of a different objective function
to avoid the difficulty of computing the log-likelihood gradient, shown as follows:

$$\Delta U_{ij} = -\varepsilon \frac{\partial \left(KL(q_0\|p) - KL(p_m\|p)\right)}{\partial U_{ij}} = \varepsilon \left(E_{q_0}[x_i x_j] - E_{p_m}[x_i x_j]\right) \tag{11}$$

where $q_0$ is the sample distribution, $p_m$ is the distribution obtained by starting the Markov chain with the data
and running m steps, and KL(·||·) denotes the K-L divergence. Taking samples in D as initial states, we
can generate a set of samples for $p_m(x)$. Those samples can be used to estimate $E_{p_m}[x_i x_j]$.
From the perspective of IG, we can see that ML/CD learning is to update parameters in SBM,
so that its corresponding coordinates [η2−] are getting closer to the data (along with the decreasing
gradient). This is consistent with our theoretical analysis in Section 3 and Section 4.2 that SBM uses
the most confident information (i.e., [η2−]) for approximating an arbitrary distribution in an expected
sense.

5. Experimental Study: Incorporate Data into CIF

In the information-based CIF, the actual values of the data were not used to explicitly affect the
output PDF (e.g., the derivation of SBM in Section 4). The data constrain the state of knowledge about
the unknown PDF. In order to force the estimate of our probabilistic model to obey the data, we need
to further reduce the difference between the data information and the physical information bound. How can
this be done?
In this section, the CIF principle will also be used to modify an existing SBM training algorithm (i.e.,
CD-1) by incorporating data information. Given a particular dataset, the CIF can be used to further
recognize less-confident parameters in SBM and to reduce them properly. Our solution here is to
let CIF act on the learning trajectory with respect to specific samples and, hence, further
confine the parameter space to the region indicated by the most confident information contained in
the samples.

5.1. A Sample-Specific CIF-Based CD Learning for SBM

The main modification of our CIF-based CD algorithm (CD-CIF for short) is that we generate
the samples for pm(x) based on those parameters with confident information, where the confident
information carried by a certain parameter is inherited from the sample and can be assessed using its
Fisher information computed in terms of the sample.
For CD-1 (i.e., m = 1), the firing probability for the i-th neuron after a one-step transition from the
initial state $x^{(0)} = \{x^{(0)}_1, x^{(0)}_2, \ldots, x^{(0)}_n\}$ is:

$$p(x^{(1)}_i = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{-\sum_{j \neq i} U_{ij} x^{(0)}_j - b_i\}} \tag{12}$$

For CD-CIF, the firing probability for the i-th neuron in Equation (12) is modified as follows:

$$p(x^{(1)}_i = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{-\sum_{(j \neq i)\,\&\,(F(U_{ij}) > \tau)} U_{ij} x^{(0)}_j - b_i\}} \tag{13}$$

where τ is a pre-selected threshold, $F(U_{ij}) = E_{q_0}[x_i x_j] - E_{q_0}[x_i x_j]^2$ is the Fisher information of Uij (see
Equation (6)) and the expectations are estimated from the given sample D. We can see that those
weights whose Fisher information is less than τ are considered to be unreliable w.r.t. D. In practice,
we can set τ through the ratio r, which specifies the proportion of the total Fisher information TFI of all
parameters that we would like to retain, i.e., $\sum_{F(U_{ij}) > \tau,\ i < j} F(U_{ij}) = r \cdot TFI$.
In summary, CD-CIF is realized in two phases. In the first phase, we initially “guess” whether
a certain parameter can be faithfully estimated based on the finite sample. In the second phase, we
approximate the gradient using the CD scheme, except that the CIF-based firing function in
Equation (13) is used.
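
The two phases can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the greedy rule in `select_confident` (keep the weights with the largest Fisher information until their sum reaches r · TFI) is an assumed way of realizing the threshold τ.

```python
import math

def fisher_information(data):
    """F(U_ij) = E[x_i x_j] - E[x_i x_j]^2, estimated from the sample, for i < j."""
    n, N = len(data[0]), len(data)
    F = {}
    for i in range(n):
        for j in range(i + 1, n):
            m = sum(d[i] * d[j] for d in data) / N
            F[(i, j)] = m - m * m
    return F

def select_confident(F, r):
    """Phase 1 (assumed greedy rule): keep the largest-F weights until they cover r * TFI."""
    total = sum(F.values())
    kept, acc = set(), 0.0
    for pair, f in sorted(F.items(), key=lambda kv: kv[1], reverse=True):
        if acc >= r * total:
            break
        kept.add(pair)
        acc += f
    return kept

def firing_prob(i, x0, U, b, kept=None):
    """Equation (12) when kept is None; Equation (13) when kept restricts the sum."""
    s = b[i]
    for j in range(len(x0)):
        if j == i:
            continue
        pair = (min(i, j), max(i, j))
        if kept is None or pair in kept:
            s += U[i][j] * x0[j]
    return 1.0 / (1.0 + math.exp(-s))

# Hypothetical sample and parameters for three binary units.
data = [(1, 1, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0)]
F = fisher_information(data)
kept = select_confident(F, r=0.7)
U = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 0.5, 0.0]]
b = [0.0, 0.0, 0.0]
p_full = firing_prob(0, (1, 1, 1), U, b)        # CD-1 firing probability
p_cif = firing_prob(0, (1, 1, 1), U, b, kept)   # CD-CIF firing probability
```

In this toy example, the weight between units 0 and 2 carries the least Fisher information and is excluded, so only the remaining weights contribute to the one-step transition in Phase 2.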

5.2. Experimental Results
In this section, we empirically investigate our justifications for the CIF principle, especially how
the sample-specific CIF-based CD learning (see Section 5.1) works in the context of density estimation.
Experimental Setup and Evaluation Metric: We use random distributions uniformly generated
from the open probability simplex over 10 variables as underlying distributions, whose sample size
N may vary. Three learning algorithms are investigated: ML, CD-1 and our CD-CIF. K-L divergence is
used to evaluate the goodness-of-fit of the SBMs trained by the various algorithms. For each sample size N,
we run 100 instances (20 randomly generated distributions × 5 random runs) and report the
averaged K-L divergences. Note that we focus on the case where the number of variables is relatively small
(n = 10) in order to analytically evaluate the K-L divergence and give a detailed study of the algorithms.
Changing the number of variables has only a minor influence on the experimental results, since we
obtained qualitatively similar observations for various numbers of variables (not reported here).
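
The evaluation setup can be illustrated with a short Python sketch. The following assumes that "uniformly generated from the open probability simplex" is realized by normalizing independent Exp(1) draws (a standard way to sample a flat Dirichlet distribution) and that the divergence is taken as KL(q‖p) with q the underlying distribution; both the function names and the direction of the divergence are assumptions, since the text does not fix them.

```python
import math
import random

def random_distribution(num_states, rng):
    """Sample uniformly from the probability simplex via normalized Exp(1) draws."""
    w = [rng.expovariate(1.0) for _ in range(num_states)]
    s = sum(w)
    return [v / s for v in w]

def kl_divergence(q, p):
    """KL(q || p) over a finite state space; terms with q_i = 0 contribute zero."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

rng = random.Random(0)
num_states = 2 ** 10                      # 10 binary variables
q = random_distribution(num_states, rng)  # underlying distribution
uniform = [1.0 / num_states] * num_states
kl_to_uniform = kl_divergence(q, uniform)
```

With n = 10, the full state space has only 1024 states, which is why the K-L divergence can be evaluated exactly rather than estimated.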
Automatically Adjusting r for Different Sample Sizes: The Fisher information is additive for i.i.d.
sampling. When sample sizes change, it is natural to require that the total amount of Fisher information
contained in all tailored parameters is steady. Hence, we have α = (1 − r)N, where α indicates the
amount of Fisher information and becomes a constant when the learning model and the underlying
distribution family are given. It turns out that we can first identify α using the optimal r w.r.t several
distributions generated from the underlying distribution family and then determine the optimal r’s for
various sample sizes using r = 1 − α/N. In our experiments, we set α = 35.
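As a small worked example of the rule r = 1 − α/N, the following Python sketch computes the retention ratio for the two sample sizes shown in Figure 3b,c. The function name is hypothetical, and clipping r at zero for very small N is an added assumption that the text does not discuss.

```python
def retention_ratio(N, alpha=35.0):
    """r = 1 - alpha / N, clipped at 0 (the clipping is an assumption)."""
    return max(0.0, 1.0 - alpha / N)

r_small = retention_ratio(100)    # 0.65
r_large = retention_ratio(1200)   # about 0.9708
```

For N = 100, about 35% of the total Fisher information is tailored away, whereas for N = 1200, less than 3% is, which matches the observation that the effect of CIF fades as the sample grows.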
Density Estimation Performance: The averaged K-L divergences between SBMs (learned by ML,
CD-1 and CD-CIF with the r automatically determined) and the underlying distribution are shown in
Figure 3a. In the case of relatively small samples (N ≤ 500) in Figure 3a, our CD-CIF method shows
significant improvements over ML (from 10.3% to 16.0%) and CD-1 (from 11.0% to 21.0%). This is
because we could not expect to have reliable identifications for all model parameters from insufficient
samples, and hence, CD-CIF gains its advantages by using parameters that could be confidently
estimated. This result is consistent with our previous theoretical insight that Fisher information gives a
reasonable guidance for parametric reduction via the confidence criterion. As the sample size increases
(N ≥ 600), CD-CIF, ML and CD-1 tend to have similar performances, since, with relatively large
samples, most model parameters can be reasonably estimated; hence, the effect of parameter reduction
using CIF gradually becomes marginal. In Figure 3b and Figure 3c, we show how sample size affects
the interval of r. For N = 100, CD-CIF achieves significantly better performances for a wide range of r,
while, for N = 1200, CD-CIF can only marginally outperform baselines for a narrow range of r.

Figure 3. (a) The performance of CD-CIF on different sample sizes; (b,c) the performances
of CD-CIF with various values of r on two typical sample sizes, i.e., 100 and 1200; (d) one
learning trajectory of the last 100 steps for ML (squares), CD-1 (triangles) and CD-CIF (circles).

Effects on Learning Trajectory: We use the 2D visualization technique SNE [20] to investigate the learning
trajectories and dynamical behaviors of the three compared algorithms. We start the three methods with the
same parameter initialization. Then, each intermediate state is represented by a 55-dimensional vector
formed by its current parameter values. From Figure 3d, we can see that: (1) In the final 100 steps, the
three methods seem to end up staying in different regions of the parameter space, and CD-CIF confines
the parameters to a relatively thinner region compared to ML and CD-1; (2) The true distribution is
usually located on the side of CD-CIF, indicating its potential for converging to the optimal solution.
Note that the above claims are based on general observations, and Figure 3d is shown as an illustration.
Hence, we may conclude that CD-CIF regularizes the learning trajectories in a desired region of the
parameter space using the sample-specific CIF.

6. Conclusions

Different from the traditional EPI, the CIF principle proposed in this paper aims at finding a
way to derive computational models for universal cognitive systems by a dimensionality reduction
approach in parameter spaces: specifically, by preserving the confident parameters and reducing the
less confident parameters. In principle, the confidence of parameters should be assessed according
to their contributions to the expected information distance between the ideal distribution and its
fluctuated observations. This is called the distance-based CIF. For some coordinate systems, e.g.,
the mixed-coordinate system [ζ]l, the confidence of a parameter can also be directly evaluated by
its Fisher information, which establishes a connection with the inverse variance of any unbiased
estimate for the parameter via the Cramér–Rao bound. This is called the information-based CIF.
The criterion of information-based CIF (i.e., Fisher information) can be seen as an approximation to
distance-based CIF, since it neglects the influence of parameter scaling on the expected information
distance. However, considering the standard mixed-coordinates [ζ]l for the manifold of multivariate
binary distributions, it turns out that both distance-based CIF and information-based CIF entail the
same optimal submanifold M.
The CIF provides a strategy for the derivation of probabilistic models. The SBM is a specific
example in this regard.
It has been theoretically shown that the SBM can achieve a reliable
representation in parameter spaces by using the CIF principle.
The CIF principle can also be used to modify existing SBM training algorithms by incorporating
data information, such as CD-CIF. One interesting result shown in our experiments is that: although
CD-CIF is a biased algorithm, it could significantly outperform ML when the sample is insufficient.
This suggests that CIF gives us a reasonable criterion for utilizing confident information from the
underlying data, while ML lacks a mechanism to do so.
In the future, we will further develop the formal justification of CIF w.r.t various contexts (e.g.,
distribution families or models).

Acknowledgments: We would like to thank the anonymous reviewers for their valuable comments. We also
thank Mengjiao Xie and Shuai Mi for their helpful discussions. This work is partially supported by the Chinese
National Program on Key Basic Research Project (973 Program, Grant No. 2013CB329304 and 2014CB744604), the
Natural Science Foundation of China (Grant Nos. 61272265, 61070044, 61272291, 61111130190 and 61105072).

Appendix

Appendix A Proof of Proposition 1

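Proof. By definition, we have:

$$g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \partial \theta^J}$$

where ψ(θ) is defined by Equation (4). Hence, we have:

$$g_{IJ} = \frac{\partial^2 \left(\sum_I \theta^I \eta_I - \phi(\eta)\right)}{\partial \theta^I \partial \theta^J} = \frac{\partial \eta_I}{\partial \theta^J}$$

By differentiating ηI, defined by Equation (1), with respect to θJ, we have:

$$g_{IJ} = \frac{\partial \eta_I}{\partial \theta^J} = \frac{\partial \sum_x X_I(x) \exp\{\sum_I \theta^I X_I(x) - \psi(\theta)\}}{\partial \theta^J} = \sum_x X_I(x)\,[X_J(x) - \eta_J]\,p(x; \theta) = \eta_{I \cup J} - \eta_I \eta_J$$

This completes the proof.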
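
The identity can be checked numerically on a small example. The Python sketch below compares the closed form η_{I∪J} − η_I η_J with a central finite difference of η_I with respect to θ^J for two binary variables; the function name, the index encoding by frozensets and the parameter values are hypothetical and serve only as an illustration.

```python
import itertools
import math

def eta(theta, n):
    """Return {I: E[X_I]} for log p(x) = sum_I theta[I] * X_I(x) - psi(theta)."""
    states = list(itertools.product((0, 1), repeat=n))

    def X(I, x):
        # X_I(x) = 1 if all variables indexed by I are on, else 0
        return 1 if all(x[i] for i in I) else 0

    weights = [math.exp(sum(t * X(I, x) for I, t in theta.items())) for x in states]
    z = sum(weights)
    return {I: sum(w * X(I, x) for w, x in zip(weights, states)) / z for I in theta}

n = 2
I, J = frozenset({0}), frozenset({1})
theta = {I: 0.3, J: -0.5, I | J: 0.7}   # illustrative theta-coordinates

e = eta(theta, n)
analytic = e[I | J] - e[I] * e[J]        # closed form: eta_{I∪J} - eta_I * eta_J

h = 1e-5
plus = dict(theta)
plus[J] += h
minus = dict(theta)
minus[J] -= h
numeric = (eta(plus, n)[I] - eta(minus, n)[I]) / (2 * h)  # d eta_I / d theta^J
```

The two values agree up to the discretization error of the finite difference.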

Appendix B Proof of Proposition 2

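Proof. By definition, we have:

$$g^{IJ} = \frac{\partial^2 \phi(\eta)}{\partial \eta_I \partial \eta_J}$$

where φ(η) is defined by Equation (4). Hence, we have:

$$g^{IJ} = \frac{\partial^2 \left(\sum_J \theta^J \eta_J - \psi(\theta)\right)}{\partial \eta_I \partial \eta_J} = \frac{\partial \theta^I}{\partial \eta_J}$$

Based on Equations (2) and (1), θI and pK can be calculated by solving a linear equation of [p]
and [η], respectively. Hence, we have:

$$\theta^I = \sum_{K \subseteq I} (-1)^{|I-K|} \log(p_K); \qquad p_K = \sum_{K \subseteq J} (-1)^{|J-K|} \eta_J$$

Therefore, the partial derivative of θI with respect to ηJ is:

$$g^{IJ} = \frac{\partial \theta^I}{\partial \eta_J} = \sum_K \frac{\partial \theta^I}{\partial p_K} \cdot \frac{\partial p_K}{\partial \eta_J} = \sum_{K \subseteq I \cap J} (-1)^{|I-K|+|J-K|} \cdot \frac{1}{p_K}$$

This completes the proof.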

Appendix C Proof of Proposition 3

Proof. The Fisher information matrix of [ζ] could be partitioned into four parts:
$G_\zeta = \begin{pmatrix} A & C \\ D & B \end{pmatrix}$.
It can be verified that in the mixed coordinate, the θ-coordinate of order k is orthogonal to any
η-coordinate less than k-order, implying that the corresponding element of the Fisher information matrix is
zero (C = D = 0) [23]. Hence, Gζ is a block diagonal matrix.
According to the Cramér–Rao bound [3], a parameter (or a pair of parameters) has a unique
asymptotically tight lower bound of the variance (or covariance) of the unbiased estimate, which is
given by the corresponding element of the inverse of the Fisher information matrix involving this
parameter (or this pair of parameters). Recall that Iη is the index set of the parameters shared by [η]
and [ζ]l and that Jθ is the index set of the parameters shared by [θ] and [ζ]l; we have
$(G_\zeta^{-1})_{I_\zeta} = (G_\eta^{-1})_{I_\eta}$ and $(G_\zeta^{-1})_{J_\zeta} = (G_\theta^{-1})_{J_\theta}$, i.e.,
$G_\zeta^{-1} = \begin{pmatrix} (G_\eta^{-1})_{I_\eta} & 0 \\ 0 & (G_\theta^{-1})_{J_\theta} \end{pmatrix}$.
Since Gζ is a block diagonal matrix, the proposition follows.

Appendix D Proof of Proposition 4

Proof. Assume the Fisher information matrix of [θ] to be:
$G_\theta = \begin{pmatrix} U & X \\ X^T & V \end{pmatrix}$, which is partitioned
based on Iη and Jθ. Based on Proposition 3, we have $A = U^{-1}$. Obviously, the diagonal elements
of U are all smaller than one. According to the succeeding Lemma A1, we can see that the diagonal
elements of A (i.e., $U^{-1}$) are greater than one.
Next, we need to show that the diagonal elements of B are smaller than one. Using the Schur
complement of Gθ, the bottom-right block of $G_\theta^{-1}$, i.e., $(G_\theta^{-1})_{J_\theta}$, equals $(V - X^T U^{-1} X)^{-1}$. Thus, the
diagonal elements of B satisfy $B_{jj} = (V - X^T U^{-1} X)_{jj} < V_{jj} < 1$. Hence, we complete the proof.

Lemma A1. For an l × l positive definite matrix H, if $H_{ii} < 1$, then $(H^{-1})_{ii} > 1$, ∀i ∈ {1, 2, . . . , l}.

Proof. Since H is positive definite, it is a Gramian matrix of l linearly independent vectors v1, v2, . . . , vl,
i.e., Hij = ⟨vi, vj⟩ (⟨·, ·⟩ denotes the inner product). Similarly, $H^{-1}$ is the Gramian matrix of l linearly
independent vectors w1, w2, . . . , wl and $(H^{-1})_{ij}$ = ⟨wi, wj⟩. It is easy to verify that ⟨wi, vi⟩ = 1, ∀i ∈
{1, 2, . . . , l}. If $H_{ii} < 1$, we can see that the norm ∥vi∥ = $\sqrt{H_{ii}}$ < 1. Since, by the Cauchy–Schwarz
inequality, ∥wi∥ × ∥vi∥ ≥ ⟨wi, vi⟩ = 1, we have ∥wi∥ > 1. Hence, $(H^{-1})_{ii}$ = ⟨wi, wi⟩ = ∥wi∥² > 1.
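
A quick numerical sanity check of Lemma A1 can be written in plain Python. The matrix below is a hypothetical example (symmetric and strictly diagonally dominant with positive diagonal, hence positive definite, with all diagonal entries below one), and the Gauss–Jordan routine is a generic helper, not part of the proof.

```python
def invert(M):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    # Augment M with the identity matrix.
    A = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[n:] for row in A]

# Symmetric, strictly diagonally dominant, positive diagonal => positive definite.
H = [[0.5, 0.2, 0.1],
     [0.2, 0.6, 0.3],
     [0.1, 0.3, 0.9]]
H_inv = invert(H)
diag_inv = [H_inv[i][i] for i in range(3)]
```

Every entry of `diag_inv` exceeds one, and each product H[i][i] * diag_inv[i] is at least one, as the Cauchy–Schwarz step in the proof implies.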

Appendix E Proof of Proposition 5

Proof. Let Bq be an ε-ball surface centered at q(x) on manifold S, i.e., Bq = {q′ ∈ S | KL(q, q′) = ε},
where KL(·, ·) denotes the Kullback–Leibler divergence and ε is small. ζq is the coordinates of q(x). Let
q(x) + dq be a neighbor of q(x) uniformly sampled on Bq and $\zeta_{q(x)+dq}$ be its corresponding coordinates.
For a small ε, we can calculate the expected information distance between q(x) and q(x) + dq as follows:

$$E_{B_q} = \int \left[(\zeta_{q(x)+dq} - \zeta_q)^T G_\zeta (\zeta_{q(x)+dq} - \zeta_q)\right]^{\frac{1}{2}} dB_q \tag{A1}$$

where Gζ is the Fisher information matrix at q(x).
Since the Fisher information matrix Gζ is both positive definite and symmetric, there exists a singular
value decomposition $G_\zeta = U^T \Lambda U$, where U is an orthogonal matrix and Λ is a diagonal matrix with
diagonal entries equal to the eigenvalues of Gζ (all ≥ 0).
Applying the singular value decomposition to Equation (A1), the distance becomes:

$$E_{B_q} = \int \left[(\zeta_{q(x)+dq} - \zeta_q)^T U^T \Lambda U (\zeta_{q(x)+dq} - \zeta_q)\right]^{\frac{1}{2}} dB_q \tag{A2}$$

Note that U is an orthogonal matrix, and the transformation $U(\zeta_{q(x)+dq} - \zeta_q)$ is a norm-preserving rotation.
Now, we need to show that among all tailored k-dimensional submanifolds of S, [ζ]lt is the
one that preserves the maximum information distance. Assume IT = {i1, i2, . . . , ik} is the index set of the k
coordinates that we choose to form the tailored submanifold T in the mixed-coordinates [ζ]. According
to the fundamental analytical properties of the surface of the hyper-ellipsoid and the orthogonality of
the mixed-coordinates, there exists a strict positive monotonicity between the expected information
distance EBq for T and the sum of eigenvalues of the sub-matrix (Gζ)IT, where the sum equals the
trace of (Gζ)IT. That is, the greater the trace of (Gζ)IT, the greater the expected information distance
EBq for T.
Next, we show that the sub-matrix of Gζ specified by [ζ]lt gives a maximum trace. Based on
Proposition 4, the elements on the main diagonal of the sub-matrix A are lower bounded by one and
those of B are upper bounded by one. Therefore, [ζ]lt gives the maximum trace among all sub-matrices of
Gζ. This completes the proof.

Author Contributions: Theoretical study and proof: Yuexian Hou and Xiaozhao Zhao. Conceived and designed
the experiments: Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li. Performed the
experiments: Xiaozhao Zhao. Analyzed the data: Xiaozhao Zhao, Yuexian Hou. Wrote the manuscript:
Xiaozhao Zhao, Dawei Song, Wenjie Li and Yuexian Hou. All authors have read and approved the
final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Wheeler, J.A. Time Today; Cambridge University Press: Cambridge, UK, 1994; pp. 1–29.
2. Frieden, B.R. Science from Fisher Information: A Unification; Cambridge University Press: Cambridge, UK, 2004.
3. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
4. Frieden, B.R.; Gatenby, R.A. Principle of maximum Fisher information from Hardy’s axioms applied to statistical systems. Phys. Rev. E 2013, 88, 042144.
5. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer: Berlin/Heidelberg, Germany, 2002.
6. Vstovsky, G.V. Interpretation of the extreme physical information principle in terms of shift information. Phys. Rev. E 1995, 51, 975–979.
7. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 1993.
8. Amari, S.; Kurata, K.; Nagaoka, H. Information geometry of Boltzmann machines. IEEE Trans. Neural Netw. 1992, 3, 260–271.
9. Hou, Y.; Zhao, X.; Song, D.; Li, W. Mining pure high-order word associations via information geometry for information retrieval. ACM Trans. Inf. Syst. 2013, 31, 12:1–12:32.
10. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–219.
11. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985, 9, 147–169.
12. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
13. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960.
14. Bond, T.; Fox, C. Applying the Rasch Model: Fundamental Measurement in the Human Sciences; Psychology Press: London, UK, 2013.
15. Gibilisco, P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: Cambridge, UK, 2010.
16. Čencov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Washington, D.C., USA, 1982.
17. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
18. Buhlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory And Applications; Springer: Berlin/Heidelberg, Germany, 2011.
19. Bobrovsky, B.; Mayer-Wolf, E.; Zakai, M. Some classes of global Cramér-Rao bounds. Ann. Stat. 1987, 15, 1421–1438.
20. Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002, 14, 1771–1800.
21. Carreira-Perpinan, M.A.; Hinton, G.E. On contrastive divergence learning. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, 6–8 January 2005; pp. 33–40.
22. Gilks, W.R.; Richardson, S.; Spiegelhalter, D. Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice; Chapman and Hall/CRC: London, UK, 1996; pp. 1–19.
23. Nakahara, H.; Amari, S. Information geometric measure for neural spikes. Neural Comput. 2002, 14, 2269–2316.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
