Information Geometry

Edited by Geert Verdoolaege

Printed Edition of the Special Issue Published in Entropy
www.mdpi.com/journal/entropy

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade

Special Issue Editor
Geert Verdoolaege
Ghent University
Belgium

Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal Entropy (ISSN 1099-4300) in 2014 (available at: https://www.mdpi.com/journal/entropy/special_issues/information-geometry).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number, Page Range.

ISBN 978-3-03897-632-5 (Pbk)
ISBN 978-3-03897-633-2 (PDF)

Cover image courtesy of Geert Verdoolaege.

© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications. The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

Contents

About the Special Issue Editor

Preface to “Information Geometry”

Shun-ichi Amari
Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure
Reprinted from: Entropy 2014, 16, 2131–2145, doi:10.3390/e16042131

Harsha K. V. and Subrahamanian Moosath K. S.
F-Geometry and Amari’s α-Geometry on a Statistical Manifold
Reprinted from: Entropy 2014, 16, 2472–2487, doi:10.3390/e16052472

Frank Critchley and Paul Marriott
Computational Information Geometry in Statistics: Theory and Practice
Reprinted from: Entropy 2014, 16, 2454–2471, doi:10.3390/e16052454

Paul Vos and Karim Anaya-Izquierdo
Using Geometry to Select One Dimensional Exponential Families That Are Monotone Likelihood Ratio in the Sample Space, Are Weakly Unimodal and Can Be Parametrized by a Measure of Central Tendency
Reprinted from: Entropy 2014, 16, 4088–4100, doi:10.3390/e16074088

Guido Montúfar, Johannes Rauh and Nihat Ay
On the Fisher Metric of Conditional Probability Polytopes
Reprinted from: Entropy 2014, 16, 3207–3233, doi:10.3390/e16063207

André Klein
Matrix Algebraic Properties of the Fisher Information Matrix of Stationary Processes
Reprinted from: Entropy 2014, 16, 2023–2055, doi:10.3390/e16042023

Keisuke Yano and Fumiyasu Komaki
Asymptotically Constant-Risk Predictive Densities When the Distributions of Data and Target Variables Are Different
Reprinted from: Entropy 2014, 16, 3026–3048, doi:10.3390/e16063026

Samuel Livingstone and Mark Girolami
Information-Geometric Markov Chain Monte Carlo Methods Using Diffusions
Reprinted from: Entropy 2014, 16, 3074–3102, doi:10.3390/e16063074

Hui Zhao and Paul Marriott
Variational Bayes for Regime-Switching Log-Normal Models
Reprinted from: Entropy 2014, 16, 3832–3847, doi:10.3390/e16073832

Frank Nielsen, Richard Nock and Shun-ichi Amari
On Clustering Histograms with k-Means by Using Mixed α-Divergences
Reprinted from: Entropy 2014, 16, 3273–3301, doi:10.3390/e16063273

Salem Said, Lionel Bombrun and Yannick Berthoumieu
New Riemannian Priors on the Univariate Normal Model
Reprinted from: Entropy 2014, 16, 4015–4031, doi:10.3390/e16074015

Luigi Malagò and Giovanni Pistone
Combinatorial Optimization with Information Geometry: The Newton Method
Reprinted from: Entropy 2014, 16, 4260–4289, doi:10.3390/e16084260

Domenico Felice, Carlo Cafaro and Stefano Mancini
Information Geometric Complexity of a Trivariate Gaussian Statistical Model
Reprinted from: Entropy 2014, 16, 2944–2958, doi:10.3390/e16062944

Alexandre Levada
Learning from Complex Systems: On the Roles of Entropy and Fisher Information in Pairwise Isotropic Gaussian Markov Random Fields
Reprinted from: Entropy 2014, 16, 1002–1036, doi:10.3390/e16021002

Masatoshi Funabashi
Network Decomposition and Complexity Measures: An Information Geometrical Approach
Reprinted from: Entropy 2014, 16, 4132–4167, doi:10.3390/e16074132

Roger Balian
The Entropy-Based Quantum Metric
Reprinted from: Entropy 2014, 16, 3878–3888, doi:10.3390/e16073878

Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li
Extending the Extreme Physical Information to Universal Cognitive Models via a Confident Information First Principle
Reprinted from: Entropy 2014, 16, 3670–3688, doi:10.3390/e16073670

About the Special Issue Editor

Geert Verdoolaege obtained an M.Sc. degree in Theoretical Physics in 1999 and the Ph.D. in Engineering Physics in 2006, both at Ghent University (UGent, Belgium). His Ph.D. work concerned applications of Bayesian probability theory to plasma spectroscopy in fusion devices.
He was a postdoctoral researcher in the field of computer vision at the University of Antwerp (2007–2008), working on probabilistic modeling of image textures using information geometry. From 2008 to 2010, he was with the Department of Data Analysis at UGent, where he worked on modeling and estimation of brain activity, based on functional magnetic resonance imaging. In 2010, he returned to the Department of Applied Physics at UGent, first as a postdoctoral assistant and, from 2014 onwards, as a part-time assistant professor. Since 2013, he has held a cross-appointment as a researcher at the Laboratory for Plasma Physics of the Royal Military Academy (LPP-ERM/KMS) in Brussels. His research activities comprise development of data analysis techniques using methods from probability theory, machine learning and information geometry, and their application to nuclear fusion experiments. He also teaches a Master course on Continuum Mechanics at Ghent University. He serves on the editorial board of the multidisciplinary journal Entropy and is a member of the scientific committees of several conferences (IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis; International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering; Conference on Geometric Science of Information). In addition, he is a consulting expert in the International Tokamak Physics Activity (ITPA) Transport and Confinement Topical Group and a member of the General Assembly of the European Fusion Education Network (FuseNet).

Preface to “Information Geometry”

The mathematical field of information geometry originated from the observation that the Fisher information can be used to define a Riemannian metric on manifolds of probability distributions. This led to a geometrical description of probability theory and statistics, which over the years has developed into a rich mathematical field with a broad range of applications in the data sciences.
Moreover, similar to the concept of entropy, there are various connections to and applications of information geometry in statistical mechanics, quantum mechanics, and neuroscience.

It has been a pleasure to act as a guest editor for this first Special Issue on information geometry in the journal Entropy. For me, as a physicist working on the development and application of data science techniques in the context of nuclear fusion experiments, the interdisciplinary character of information geometry has always been one of the main reasons for its appeal. There are, of course, many other domains in physics where geometrical notions play a key role, including classical mechanics, continuum mechanics (which I have been teaching at Ghent University for several years now), general relativity, and much of modern physics. This interplay between the beautiful and elegant formalism of differential geometry on the one hand and physics and data science on the other hand is both fascinating and inspiring. The variety of topics covered by this Special Issue is a reflection of this cross-fertilization between disciplines.

“Information Geometry I” has been a great success, and although the papers were already published several years ago, it was decided that it was worthwhile to reprint the collection in book form. Indeed, even though all papers present original research, many have a strong tutorial character, and we were honored to receive multiple contributions by authorities in the field.

The papers have been structured according to their main subject area, or field of application, and we briefly discuss each of them in the following.

We start with two papers related to the foundations of information geometry. We were very pleased to receive a contribution by one of the founders of the field of information geometry, Prof. Shun-ichi Amari.
In his paper, the dually flat structure of the manifold of positive measures is discussed, derived from a class of Bregman divergences. These so-called (ρ, τ)-divergences, originally proposed by J. Zhang, are defined in terms of two monotone scalar functions (ρ and τ) and form a unique class of dually flat, decomposable divergences. This is extended to the set of positive-definite matrices, additionally requiring invariance of the divergence under matrix transformations. It is well known that such dually flat manifolds have computationally desirable properties in applications to classification and information retrieval.

Harsha K. V. and Subrahamanian Moosath K. S. introduce F-geometry as a generalization of α-geometry, based on a representation of a probability density function through a function F. They then combine this with another function G to define a weighted expectation, from which an (F, G)-metric and connection are deduced. A condition for two such structures to lead to dual connections is also derived. However, it was shown by Zhang (J. Zhang, Entropy 17, pp. 4485–4499, 2015) that this framework is equivalent to the (ρ, τ)-geometry introduced earlier by him. Although the present paper is slightly different in perspective, it should be read with this equivalence in mind.

The next four papers deal with applications of information geometry in statistics. The paper by Frank Critchley and Paul Marriott presents an important research program aimed at rendering some of the most useful results of information geometry more accessible to statisticians in practical applications. Indeed, the formalism of differential geometry and tensor algebra can appear daunting at first sight and may present an obstacle to adoption of many useful results by practitioners.
The paper describes a computational framework that facilitates implementation of results from information geometry, based on an embedding of various important statistical models in a (sufficiently large) simplex. Challenges related to extension of the framework to the infinite-dimensional case are touched upon as well.

In the paper by Paul Vos and Karim Anaya-Izquierdo, the goal is to identify one-dimensional exponential families enjoying a number of properties that are convenient for statistical modeling, i.e., parametrization by a measure of central tendency, unimodality, and monotone likelihood ratio. The basis for the framework is the multinomial distribution, modeled geometrically by the simplex. The selection of exponential families with desirable properties is then based on a partitioning of the natural parameter space of the family of multinomial distributions by means of convex cones.

Guido Montúfar and co-workers consider various possibilities to define natural Riemannian metrics on polytopes of stochastic matrices, which describe the conditional probability distribution of two categorical random variables. Inspired by the classical result regarding the uniqueness of the Fisher metric by requiring invariance under Markov morphisms, they define metrics derived from a natural class of stochastic maps between such polytopes, or, alternatively, through embeddings in various possible model spaces. They provide recommendations as to which metric to use, depending on the application.

André Klein, in his article, provides a survey of several matrix algebraic properties of the Fisher information matrix corresponding to weakly stationary time series. The link with various structured matrices arising from a number of time series models is demonstrated. A statistical distance measure is built using the Fisher information matrix in the context of classical and quantum information.
Finally, conditions are obtained for the Fisher information of a stationary process to obey certain forms of the Stein equation.

We continue with three papers concerning applications of information geometry in Bayesian inference and simulation. Keisuke Yano and Fumiyasu Komaki, in their paper, construct constant-risk Bayesian predictive densities using the Kullback-Leibler loss function when the distributions of the data and the target variable to be predicted are different but have a common unknown parameter. Specifically, the issue of prior selection is investigated, and several applications are given.

Samuel Livingstone and Mark Girolami provide an introduction to recent enhancements of Markov chain Monte Carlo simulation techniques inspired by information geometry. They apply this to the Metropolis–Hastings algorithm driven by Langevin diffusion, gradually transforming the ingredients to the setting of a Riemannian manifold equipped with a metric similar to the Fisher information metric. Pointers to various applications are given. The paper is written in a way that also makes it accessible to practitioners with little background in stochastic processes and geometry.

The paper by Hui Zhao and Paul Marriott concerns Bayesian inference making use of variational methods for approximating the posterior distribution. In the context of inference for time series models that switch between different regimes, variational Bayes is shown to be a computationally attractive alternative to Markov chain Monte Carlo simulations. The geometry related to the projection of the posterior onto a computationally tractable family of distributions is elucidated by means of a simple example. This is followed by an application wherein it is shown that variational Bayes is successful in estimating the regime-switching model, including the number of regimes.

Applications of information geometry in machine learning are represented by the following three papers.
The article by Frank Nielsen and colleagues considers k-means histogram clustering, with applications to, e.g., information retrieval. Based on the α-divergences as similarity measures, clustering is performed using either the sided (asymmetric) or symmetrized divergence, or by means of the interesting notion of a mixed divergence. An important computational advantage is that the centroids based on the sided and mixed divergences have a closed-form expression. Next, the scheme is extended to algorithms with optimized initialization of cluster centroids, as well as soft clustering.

Salem Said and co-workers present a class of distributions on the manifold of the univariate normal model equipped with the Fisher information metric. Expressed in terms of the Fisher-Rao distance, the distributions are used as priors for modeling the classes in Bayesian classification problems of normal distributions. Characteristics of this “Gaussian” distribution on the manifold are discussed, as well as estimation of its parameters and the posterior using the Laplace approximation. In an application to classification of image textures, the improved performance of these priors over conjugate priors is demonstrated.

Luigi Malagò and Giovanni Pistone address optimization on manifolds of exponential distributions on a discrete state space using Newton’s method, which is based on second-order calculus. In particular, the goal is to find maxima of the expectation of a function with respect to the distribution (stochastic relaxation). Details of the computation are provided, including calculation of the Riemannian Hessian. A nonparametric formalism is used, with a view to extension to the infinite-dimensional case.

The next three papers are related to the role of information geometry in complex systems research. Domenico Felice and colleagues consider the time-averaged volume explored by geodesics on a statistical manifold as an indicator of complexity of the entropic dynamics of a system.
The parameters of the model play the role of macrovariables conveying information on the system’s microstate. Examples are given for the case of univariate, bivariate, and trivariate normal distributions, providing interesting results depending on correlations between the microvariables.

Alexandre Levada investigates the role of entropy and Fisher information in pairwise isotropic Gaussian Markov random fields, acting as models for complex systems. Expressions for these quantities are derived, and the evolution of the Fisher information and entropy is studied as a function of the inverse temperature of the system. An interesting interpretation is given of asymmetries between these curves in terms of the arrow of time.

Masatoshi Funabashi presents a framework for measuring statistical dependence between subsystems of a stochastic model, based on the model’s graph representation. A description in terms of the mixed coordinates of the system is used to quantify the complexity loss incurred by cutting an edge of the graph. In addition, a complexity measure is defined as a geometric mean of Kullback–Leibler divergences between decompositions of the system in terms of subsystems with fewer statistical dependencies. This quantifies the degree to which the system can be decomposed.

The following paper concerns an application to physics, specifically quantum mechanics. Roger Balian gives an overview of a geometrical framework for measuring information loss in quantum systems resulting from the mixing of states. A Riemannian metric is defined, based on the von Neumann entropy, generating a mapping between states and observables. The metric is compared to other quantum metrics, as well as the Fisher–Rao metric, and various geometrical properties are derived. Applications are given to quantum information, as well as equilibrium and non-equilibrium quantum statistical mechanics.

The final paper in the Special Issue is situated at the interface between physics and neuroscience.
Xiaozhao Zhao and colleagues consider the principle of extreme physical information based on the Fisher information, which has been used before in an attempt to establish an information-theoretical basis for physical laws. They extend the idea to cognitive systems and aim at narrowing the gap between the information bound and the data information for such complex systems, by transforming the model to a simpler one. This is done by means of a dimensionality reduction technique, also based on the Fisher information. The approach is applied to derive the model for single-layer Boltzmann machines and interpret their learning algorithms.

We are convinced that the varied collection of papers in this Special Issue will be useful for scientists who are new to the field, while providing an excellent reference for the more seasoned researcher. Furthermore, it is worth mentioning that the second Entropy Special Issue in this series, “Information Geometry II”, will also be published as a book, and that a third Special Issue is being prepared. We hope that the reader will enjoy browsing and reading through this collection of papers as much as we enjoyed guest editing this Special Issue “Information Geometry I”.

Finally, I would like to thank the Editor-in-Chief of Entropy, Prof. Dr. Kevin H. Knuth, for suggesting the opportunity to guest-edit a Special Issue on information geometry. Furthermore, I wish to thank the editorial staff at MDPI for their great help with contacting authors, organizing paper reviews, and editing the original Special Issue in Entropy, as well as the reprinted version in the present book.
Geert Verdoolaege
Special Issue Editor

entropy

Article

Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure

Shun-ichi Amari

RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan; E-Mail: amari@brain.riken.jp; Tel.: +81-48-467-9669; Fax: +81-48-467-9687

Received: 14 February 2014; in revised form: 9 April 2014 / Accepted: 10 April 2014 / Published: 14 April 2014

Abstract: Information geometry studies the dually flat structure of a manifold, highlighted by the generalized Pythagorean theorem. The present paper studies a class of Bregman divergences called the (ρ, τ)-divergence. A (ρ, τ)-divergence generates a dually flat structure in the manifold of positive measures, as well as in the manifold of positive-definite matrices. The class is composed of decomposable divergences, which are written as a sum of componentwise divergences. Conversely, a decomposable dually flat divergence is shown to be a (ρ, τ)-divergence. A (ρ, τ)-divergence is determined from two monotone scalar functions, ρ and τ. The class includes the KL-divergence, α-, β- and (α, β)-divergences as special cases. The transformation between an affine parameter and its dual is easily calculated in the case of a decomposable divergence. Therefore, such a divergence is useful for obtaining the center for a cluster of points, which will be applied to classification and information retrieval in vision. For the manifold of positive-definite matrices, in addition to dual flatness and decomposability, we require invariance under linear transformations, in particular under orthogonal transformations. This opens a way to define a new class of divergences, called the (ρ, τ)-structure in the manifold of positive-definite matrices.

Keywords: information geometry; dually flat structure; decomposable divergence; (ρ, τ)-structure

1. Introduction

Information geometry, which originated from the invariant structure of a manifold of probability distributions, consists of a Riemannian metric and dually coupled affine connections with respect to the metric [1]. A manifold having a dually flat structure is particularly interesting and important. In such a manifold, there are two dually coupled affine coordinate systems and a canonical divergence, which is a Bregman divergence. The highlight is given by the generalized Pythagorean theorem and projection theorem. Information geometry is useful not only for statistical inference, but also for machine learning, pattern recognition, optimization and even for neural networks. It is also related to the statistical physics of Tsallis q-entropy [2–4].

The present paper studies a general and unique class of decomposable divergence functions in $\mathbb{R}^n_+$, the manifold of n-dimensional positive measures. These are the (ρ, τ)-divergences, introduced by Zhang [5,6] from the point of view of “representation duality”. They are Bregman divergences generating a dually flat structure. The class includes the well-known Kullback-Leibler divergence, α-divergence, β-divergence and (α, β)-divergence [1,7–9] as special cases. The merit of a decomposable Bregman divergence is that the θ-η Legendre transformation is computationally tractable, where θ and η are two affine coordinate systems coupled by the Legendre transformation. When one uses a dually flat divergence to define the center of a cluster of elements, the center is easily given by the arithmetic mean of the dual coordinates of the elements [10,11]. However, we need to calculate its primal coordinates. This is the θ-η transformation. Hence, our new type of divergence has the advantage that the θ-coordinates are easy to calculate for clustering and related pattern matching problems.

Entropy 2014, 16, 2131–2145; doi:10.3390/e16042131; www.mdpi.com/journal/entropy
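The remark on cluster centers can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes the decomposable potential ψ(θ) = Σ exp(θ^i), whose Legendre dual is ϕ(η) = Σ (η_i log η_i − η_i), and all function names are hypothetical. The center is obtained by averaging the η-coordinates, and the θ-coordinates then follow from a componentwise transformation.

```python
import numpy as np

# Assumption for illustration: psi(theta) = sum_i exp(theta_i), so eta_i = exp(theta_i).
def psi(theta):
    return np.sum(np.exp(theta))

def phi(eta):
    # Legendre dual of psi: phi(eta) = sum_i (eta_i * log(eta_i) - eta_i)
    return np.sum(eta * np.log(eta) - eta)

def eta_to_theta(eta):
    # Componentwise theta-eta transformation, cheap because psi is decomposable
    return np.log(eta)

def divergence(theta_p, eta_q):
    # Bregman form D[P:Q] = psi(theta_P) + phi(eta_Q) - theta_P . eta_Q
    return psi(theta_p) + phi(eta_q) - np.dot(theta_p, eta_q)

def cluster_center(etas):
    # The eta-coordinates of the center are the arithmetic mean of the points'
    # eta-coordinates; the theta-coordinates follow componentwise.
    eta_r = np.mean(etas, axis=0)
    return eta_to_theta(eta_r), eta_r

def total_divergence(theta_q, etas):
    # Objective sum_i D[Q:P_i] minimized by the cluster center
    return sum(divergence(theta_q, eta_i) for eta_i in etas)

rng = np.random.default_rng(0)
etas = rng.uniform(0.5, 2.0, size=(5, 3))  # five points, given by eta-coordinates
theta_r, eta_r = cluster_center(etas)
print("center (eta):  ", eta_r)
print("center (theta):", theta_r)
print("objective at center:", total_divergence(theta_r, etas))
```

Because the objective is convex in the θ-coordinates of the candidate center, any perturbation of `theta_r` should not decrease `total_divergence`, which gives a simple sanity check of the averaging rule.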
The most general class of dually flat divergences, not necessarily decomposable, is further given in Rn+. They are the (ρ, τ) divergence. Positive-definite (PD) matrices appear in many engineering problems, such as convex programming, diffusion tensor analysis and multivariate statistical analysis [12–20]. The manifold, PDn, of n × n PD matrices form a cone, and its geometry is by itself an important subject of research. If we consider the submanifold consisting of only diagonal matrices, it is equivalent to the manifold of positive measures. Hence, PD matrices can be regarded as a generalization of positive measures. There are many studies on geometry and divergences of the manifold of positive-definite matrices. We introduce a general class of dually flat divergences, the (ρ, τ)-divergence. We analyze the cases when a (ρ, τ)-divergence is invariant under the general linear transformations, Gl(n), and invariant under the orthogonal transformations, O(n). They not only include many well-known divergences of PD matrices, but also give new important divergences. The present paper is organized as follows. Section 2 is preliminary, giving a short introduction to a dually flat manifold and the Bregman divergence. It also defines the cluster center due to a divergence. Section 3 defines the (ρ, τ)-structure in Rn+. It gives dually flat decomposable affine coordinates and a related canonical divergence (Bregman divergence). Section 4 is devoted to the (ρ, τ)-structure of the manifold, PDn, of PD matrices. We first study the class of divergences that are invariant under O(n). We further study a decomposable divergence that is invariant under Gl(n). It coincides with the invariant divergence derived from zero-mean Gaussian distributions with PD covariance matrices. They not only include various known divergences, but new remarkable ones. Section 5 discusses a general class of non-decomposable flat divergences and miscellaneous topics. Section 6 is the conclusions. 2. 
Preliminaries to Information Geometry of Divergence 2.1. Dually Flat Manifold A manifold is said to have the dually flat Riemannian structure, when it has two affine coordinate systems θ = � θ1, · · · , θn� and η = (η1, · · · , ηn) (with respect to two flat affine connections) together with two convex functions, ψ(θ) and ϕ(η), such that the two coordinates are connected by the Legendre transformations: η = ∇ψ(θ), θ = ∇ϕ(η), (1) where ∇ is the gradient operator. The Riemannian metric is given by: � gij(θ) � = ∇∇ψ(θ), � gij(η) � = ∇∇ϕ(η) (2) in the respective coordinate systems. A curve that is linear in the θ-coordinates is called a θ-geodesic, and a curve linear in the η-coordinates is called an η-geodesic. A dually flat manifold has a unique canonical divergence, which is the Bregman divergence defined by the convex functions, D[P : Q] = ψ (θP) + ϕ � ηQ � − θP · ηQ, (3) where θP is the θ-coordinates of P, ηQ is the η-coordinates of Q and θP · ηQ = ∑i � θi P � � ηQi � , where θi P and ηQi are components of θp and ηQ, respectively. The Pythagorean and projection theorems hold in a dually flat manifold: Pythagorean Theorem Given three points, P, Q, R, when the η-geodesic connecting P and Q is orthogonal to the θ-geodesic connecting Q and R with respect to the Riemannian metric, D[P : Q] + D[Q : R] = D[P : R]. (4) 2 Entropy 2014, 16, 2131–2145 Projection Theorem Given a smooth submanifold, S, let PS be the minimizer of divergence from P to S, PS = min Q∈S D[P : Q]. (5) Then, PS is the η-geodesic projection of P to S, that is the η-geodesic connecting P and PS is orthogonal to S. We have the dual of the above theorems where θ- and η-geodesics are exchanged and D[P : Q] is replaced by its dual D[Q : P]. 2.2. 
Decomposable Divergence A divergence, D[P : Q], is said to be decomposable, when it is written as a sum of component-wise divergences, D[P : Q] = n ∑ i=1 d � θi P, θi Q � , (6) where θi P and θi Q are the components of θP and θQ and d � θi P, θi Q � is a scalar divergence function. An f-divergence: Df [P : Q] = ∑ pi f � qi pi � (7) is a typical example of decomposable divergence in the manifold of probability distributions, where P = (p) and Q = (q) are two probability vectors with ∑ pi = ∑ qi = 1. A convex function, ψ(θ), is said to be decomposable, when it is written as: ψ(θ) = n ∑ i=1 ˜ψ � θi� (8) by using a scalar convex function, ˜ψ(θ). The Bregman divergence derived from a decomposable convex function is decomposable. When ψ(θ) is a decomposable convex function, its Legendre dual is also decomposable. The Legendre transformation is given componentwise as: ηi = ˜ψ′ (θi) , (9) where ′ is the differentiation of a function, so that it is computationally tractable. Its inverse transformation is also componentwise, θi = ˜ϕ′ (ηi) , (10) where ˜ϕ is the Legendre dual of ˜ψ. 2.3. Cluster Center Consider a cluster of points P1, · · · , Pm of which θ-coordinates are θ1, · · · , θm and η-coordinates are η1, · · · , ηm. The center, R, of the cluster with respect to the divergence, D[P : Q], is defined by: R = arg min Q m ∑ i=1 D [Q : Pi] . (11) By differentiating ∑ D [Q : Pi] by θ (the θ-coordinates of Q), we have: ∇ψ (θR) = 1 m m ∑ i=1 ηi. (12) Hence, the cluster-center theorem due to Banerjee et al. [10] follows; see also [11]: 3 Entropy 2014, 16, 2131–2145 Cluster-Center Theorem The η-coordinates ηR of the cluster center are given by the arithmetic average of the η-coordinates of the points in the cluster: ηR = 1 m m ∑ i=1 ηi. (13) When we need to obtain the θ-coordinates of the cluster center, it is given by the θ-η transformation from ηR, θR = ∇ϕ (ηR) . 
(14) However, in many cases, the transformation is computationally heavy and intractable when the dimensions of a manifold is large. The transformation is easy in the case of a decomposable divergence. This is motivation for considering a general class of decomposable Bregman divergences. 3. (ρ, τ) Dually Flat Structure in Rn + 3.1. (ρ, τ)-Coordinates of Rn + Let Rn+ be the manifold of positive measures over n elements x1, · · · , xn. A measure (or a weight) of xi is given by: ξi = m (xi) > 0 (15) and ξ = (ξ1, · · · , ξn) is a distribution of measures. When ∑ ξi = 1 is satisfied, it is a probability measure. We write: R+ n = {ξ |ξi > 0} (16) and ξ forms a coordinate system of Rn+. Let ρ(ξ) and τ(ξ) be two monotonically increasing differentiable functions. We call: θ = ρ(ξ), η = τ(ξ) (17) the ρ- and τ-representations of positive measure ξ. This is a generalization of the ±α representations [1] and was introduced in [5] for a manifold of probability distributions. See also [6]. By using these functions, we construct new coordinate systems θ and η of Rn+. They are given, for θ = � θi� and η = (ηi), by componentwise relations, θi = ρ (ξi) , ηi = τ (ξi) . (18) They are called the ρ- and τ-representations of ξ ∈ Rn+, respectively. We search for convex functions, ψρ,τ(θ) and ϕρ,τ(η), which are Legendre duals to each other, such that θ and η are two dually coupled affine coordinate systems. 3.2. Convex Functions We introduce two scalar functions of θ and η by: ˜ψρ,τ(θ) = � ρ−1(θ) 0 τ(ξ)ρ′(ξ)dξ, (19) ˜ϕρ,τ(η) = � τ−1(η) 0 ρ(ξ)τ′(ξ)dξ. (20) Then, the first and second derivatives of ˜ψρ,τ are: ˜ψ′ ρ,τ(θ) = τ(ξ), (21) ˜ψ′′ ρ,τ(θ) = τ′(ξ) ρ′(ξ) . (22) 4 Entropy 2014, 16, 2131–2145 Since ρ′(ξ) > 0, τ′(ξ) > 0, we see that ˜ψρ,τ(θ) is a convex function. So is ˜ϕρ,τ(η). Moreover, they are Legendre duals, because: ˜ψρ,τ(θ) + ˜ϕρ,τ(η) − θη = � ξ 0 τ(ξ)ρ′(ξ)dξ + � ξ 0 ρ(ξ)τ′(ξ)dξ − ρ(ξ)τ(ξ) (23) = 0. 
(24)

We then define two decomposable convex functions of θ and η by

$$\psi_{\rho,\tau}(\theta) = \sum_i \tilde\psi_{\rho,\tau}\left(\theta^i\right), \tag{25}$$

$$\varphi_{\rho,\tau}(\eta) = \sum_i \tilde\varphi_{\rho,\tau}\left(\eta_i\right). \tag{26}$$

They are Legendre duals of each other.

3.3. (ρ, τ)-Divergence

The (ρ, τ)-divergence between two points $\xi, \xi' \in \mathbf{R}^n_+$ is defined by

$$D_{\rho,\tau}\left[\xi : \xi'\right] = \psi_{\rho,\tau}(\theta) + \varphi_{\rho,\tau}\left(\eta'\right) - \theta\cdot\eta' \tag{27}$$

$$= \sum_{i=1}^{n}\left\{\int_0^{\xi_i} \tau(\xi)\rho'(\xi)\,d\xi + \int_0^{\xi'_i} \rho(\xi)\tau'(\xi)\,d\xi - \rho\left(\xi_i\right)\tau\left(\xi'_i\right)\right\}, \tag{28}$$

where θ and η′ are the ρ- and τ-representations of ξ and ξ′, respectively.

The (ρ, τ)-divergence gives a dually flat structure having θ and η as affine and dual affine coordinate systems. This is originally due to Zhang [5] and is a generalization of our previous results concerning the q and deformed exponential families [4]. The transformation between θ and η is simple in the (ρ, τ)-structure, because it can be done componentwise,

$$\theta^i = \rho\left\{\tau^{-1}\left(\eta_i\right)\right\}, \tag{29}$$

$$\eta_i = \tau\left\{\rho^{-1}\left(\theta^i\right)\right\}. \tag{30}$$

The Riemannian metric is

$$g_{ij}(\xi) = \tau'\left(\xi_i\right)\rho'\left(\xi_i\right)\delta_{ij}, \tag{31}$$

and hence Euclidean, because the Riemann-Christoffel curvature due to the Levi-Civita connection vanishes, too.

The following theorem is new, characterizing the (ρ, τ)-divergence.

Theorem 1. The (ρ, τ)-divergences form a unique class of divergences in $\mathbf{R}^n_+$ that are dually flat and decomposable.

3.4. Biduality: α-(ρ, τ) Divergence

We have dually flat connections $\left(\nabla_{\rho,\tau}, \nabla^{*}_{\rho,\tau}\right)$, represented in terms of covariant derivatives, which are derived from $D_{\rho,\tau}$. This is called the representation duality by Zhang [5]. We further have the α-(ρ, τ) connections defined by

$$\nabla^{(\alpha)}_{\rho,\tau} = \frac{1+\alpha}{2}\nabla_{\rho,\tau} + \frac{1-\alpha}{2}\nabla^{*}_{\rho,\tau}. \tag{32}$$

The α-(−α) duality is called the reference duality [5]. Therefore, $\nabla^{(\alpha)}_{\rho,\tau}$ possesses the biduality, one concerning α and (−α), and the other with respect to ρ and τ.

The Riemann-Christoffel curvature of $\nabla^{(\alpha)}_{\rho,\tau}$ is

$$R^{(\alpha)}_{\rho,\tau} = \frac{1-\alpha^2}{4}R^{(0)}_{\rho,\tau} = 0 \tag{33}$$

for any α. Hence, there exist a unique canonical divergence $D^{(\alpha)}_{\rho,\tau}$ and α-(ρ, τ) affine coordinate systems.
It is an interesting future problem to obtain their explicit forms.

3.5. Various Examples

As a special case of the (ρ, τ)-divergence, we have the (α, β)-divergence obtained from the following power functions,

$$\rho(\xi) = \frac{1}{\alpha}\xi^{\alpha}, \qquad \tau(\xi) = \frac{1}{\beta}\xi^{\beta}. \tag{34}$$

This was introduced by Cichocki, Cruces and Amari in [7,8]. The affine and dual affine coordinates are

$$\theta^i = \frac{1}{\alpha}\left(\xi_i\right)^{\alpha}, \qquad \eta_i = \frac{1}{\beta}\left(\xi_i\right)^{\beta} \tag{35}$$

and the convex functions are

$$\psi(\theta) = c_{\alpha,\beta}\sum_i \left(\theta^i\right)^{\frac{\alpha+\beta}{\alpha}}, \qquad \varphi(\eta) = c_{\beta,\alpha}\sum_i \left(\eta_i\right)^{\frac{\alpha+\beta}{\beta}}, \tag{36}$$

where

$$c_{\alpha,\beta} = \frac{1}{\beta(\alpha+\beta)}\,\alpha^{\frac{\alpha+\beta}{\alpha}}. \tag{37}$$

The induced (α, β)-divergence has a simple form,

$$D_{\alpha,\beta}\left[\xi : \xi'\right] = \frac{1}{\alpha\beta(\alpha+\beta)}\sum_i\left\{\alpha\,\xi_i^{\alpha+\beta} + \beta\,\xi_i'^{\,\alpha+\beta} - (\alpha+\beta)\,\xi_i^{\alpha}\,\xi_i'^{\,\beta}\right\}, \tag{38}$$

for $\xi, \xi' \in \mathbf{R}^n_+$. It is defined similarly in the manifold $S_n$ of probability distributions, but it is not a Bregman divergence in $S_n$. This is because the total mass constraint $\sum \xi_i = 1$ is not linear in the θ- or η-coordinates in general.

The α-divergence is a special case of the (α, β)-divergence, so that it is a (ρ, τ)-divergence. By putting

$$\rho(\xi) = \frac{2}{1-\alpha}\,\xi^{\frac{1-\alpha}{2}}, \qquad \tau(\xi) = \frac{2}{1+\alpha}\,\xi^{\frac{1+\alpha}{2}}, \tag{39}$$

we have

$$D_{\alpha}\left[\xi : \xi'\right] = \frac{4}{1-\alpha^2}\sum_i\left\{\frac{1-\alpha}{2}\,\xi_i + \frac{1+\alpha}{2}\,\xi'_i - \xi_i^{\frac{1-\alpha}{2}}\left(\xi'_i\right)^{\frac{1+\alpha}{2}}\right\}. \tag{40}$$

The β-divergence [19] is obtained from

$$\rho(\xi) = \xi, \qquad \tau(\xi) = \frac{1}{\beta}\,\xi^{\beta}. \tag{41}$$

Substituting these functions into Equation (28), it is written as

$$D_{\beta}\left[\xi : \xi'\right] = \frac{1}{\beta(\beta+1)}\sum_i\left\{\xi_i^{\beta+1} + \beta\left(\xi'_i\right)^{\beta+1} - (\beta+1)\,\xi_i\left(\xi'_i\right)^{\beta}\right\}. \tag{42}$$

The β-divergence is special in the sense that it gives a dually flat structure, even in $S_n$. This is because ρ(ξ) is linear in ξ.

The classes of α-divergences and β-divergences intersect at the KL-divergence, and their duals are different in general. They are the only intersecting points of the two classes.

When ρ(ξ) = ξ and τ(ξ) = U′(ξ), where U is a convex function, the (ρ, τ)-divergence is Eguchi's U-divergence [21].

Zhang already introduced the (α, β)-divergence in [5], which is not a (ρ, τ)-divergence in $\mathbf{R}^n_+$ and is different from ours. We regret our confusing naming of the (α, β)-divergence.

4.
Invariant, Flat Decomposable Divergences in the Manifold of Positive-Definite Matrices

4.1. Invariant and Decomposable Convex Function

Let P be a positive-definite matrix and ψ(P) be a convex function. Then, a Bregman divergence is defined between two positive-definite matrices, P and Q, by

$$D[P:Q] = \psi(P) - \psi(Q) - \nabla\psi(Q)\cdot(P-Q), \tag{43}$$

where ∇ is the gradient operator with respect to the matrix $P = (P_{ij})$, so that ∇ψ(P) is a matrix, and the inner product of two matrices is defined by

$$\nabla\psi(Q)\cdot P = \operatorname{tr}\left\{\nabla\psi(Q)P\right\}, \tag{44}$$

where tr is the trace of a matrix. It induces a dually flat structure on the manifold of positive-definite matrices, where the affine coordinate system (θ-coordinates) is Θ = P and the dual affine coordinate system (η-coordinates) is

$$H = \nabla\psi(P). \tag{45}$$

A convex function ψ(P) is said to be invariant under the orthogonal group O(n) when

$$\psi(P) = \psi\left(O^{T}PO\right) \tag{46}$$

holds for any orthogonal transformation O, where $O^T$ is the transpose of O. An invariant function is written as a symmetric function of the n eigenvalues $\lambda_1, \cdots, \lambda_n$ of P. See Dhillon and Tropp [12]. When an invariant convex function of P is written, by using a convex function f of one variable, in the additive form

$$\psi(P) = \sum_i f\left(\lambda_i\right), \tag{47}$$

it is said to be decomposable. We have

$$\psi(P) = \operatorname{tr} f(P). \tag{48}$$

4.2. Invariant, Flat and Decomposable Divergence

A divergence D[P : Q] is said to be invariant under O(n) when it satisfies

$$D[P:Q] = D\left[O^{T}PO : O^{T}QO\right]. \tag{49}$$

When it is derived from a decomposable convex function ψ(P), it is invariant, flat and decomposable. We give well-known examples of decomposable convex functions and the divergences derived from them:

(1) For $f(\lambda) = \frac{1}{2}\lambda^2$, we have

$$\psi(P) = \frac{1}{2}\sum_i \lambda_i^2, \tag{50}$$

$$D[P:Q] = \frac{1}{2}\|P-Q\|^2, \tag{51}$$

where $\|P\|^2$ is the squared Frobenius norm:

$$\|P\|^2 = \sum_{i,j} P_{ij}^2. \tag{52}$$

(2) For $f(\lambda) = -\log\lambda$,

$$\psi(P) = -\log\left(\det|P|\right), \tag{53}$$

$$D[P:Q] = \operatorname{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n.$$
(54)

The affine coordinate system is P, and the dual affine coordinate system is $-P^{-1}$. The derived geometry is the same as that of multivariate Gaussian probability distributions with mean zero and covariance matrix P.

(3) For $f(\lambda) = \lambda\log\lambda - \lambda$,

$$\psi(P) = \operatorname{tr}\left(P\log P - P\right), \tag{55}$$

$$D[P:Q] = \operatorname{tr}\left(P\log P - P\log Q - P + Q\right). \tag{56}$$

This divergence is used in quantum information theory. The affine coordinate system is P, and the dual affine coordinate system is log P; ψ(P) is called the negative von Neumann entropy.

4.3. (ρ, τ)-Structure in Positive Definite Matrices

We extend the (ρ, τ)-structure of the previous section to the matrix case and introduce a general dually flat invariant decomposable divergence in the manifold of positive-definite matrices. Let

$$\Theta = \rho(P), \qquad H = \tau(P) \tag{57}$$

be the ρ- and τ-representations of matrices. We use the two functions $\tilde\psi_{\rho,\tau}(\theta)$ and $\tilde\varphi_{\rho,\tau}(\eta)$ defined in Equations (19) and (20) to define a pair of dually coupled invariant and decomposable convex functions,

$$\psi(\Theta) = \operatorname{tr}\tilde\psi_{\rho,\tau}\{\Theta\}, \tag{58}$$

$$\varphi(H) = \operatorname{tr}\tilde\varphi_{\rho,\tau}\{H\}. \tag{59}$$

They are not convex with respect to P, but are convex with respect to Θ and H, respectively. The derived Bregman divergence is

$$D[P:Q] = \psi\{\Theta(P)\} + \varphi\{H(Q)\} - \Theta(P)\cdot H(Q). \tag{60}$$

Theorem 2. The (ρ, τ)-divergences form a unique class of invariant, decomposable and dually flat divergences in the manifold of positive-definite matrices.

The Euclidean, Gaussian and von Neumann divergences given in Equations (51), (54) and (56) are special examples of (ρ, τ)-divergences. They are given, respectively, by:

(1) $\rho(\xi) = \tau(\xi) = \xi$, (61)

(2) $\rho(\xi) = \xi,\ \tau(\xi) = -\dfrac{1}{\xi}$, (62)

(3) $\rho(\xi) = \xi,\ \tau(\xi) = \log\xi$. (63)

When ρ and τ are power functions, we have the (α, β)-structure in the manifold of positive-definite matrices.

(4) (α, β)-divergence.
By using the (α, β) power functions given by Equation (34), with the constant factors 1/α and 1/β dropped, we have

$$\psi(\Theta) = \frac{\alpha}{\alpha+\beta}\operatorname{tr}\Theta^{\frac{\alpha+\beta}{\alpha}} = \frac{\alpha}{\alpha+\beta}\operatorname{tr}P^{\alpha+\beta}, \tag{64}$$

$$\varphi(H) = \frac{\beta}{\alpha+\beta}\operatorname{tr}H^{\frac{\alpha+\beta}{\beta}} = \frac{\beta}{\alpha+\beta}\operatorname{tr}P^{\alpha+\beta}, \tag{65}$$

so that the (α, β)-divergence of matrices is

$$D[P:Q] = \operatorname{tr}\left\{\frac{\alpha}{\alpha+\beta}P^{\alpha+\beta} + \frac{\beta}{\alpha+\beta}Q^{\alpha+\beta} - P^{\alpha}Q^{\beta}\right\}. \tag{66}$$

This is a Bregman divergence, where the affine coordinate system is $\Theta = P^{\alpha}$ and its dual is $H = P^{\beta}$.

(5) The α-divergence is derived as

$$\Theta(P) = \frac{2}{1-\alpha}P^{\frac{1-\alpha}{2}}, \tag{67}$$

$$\psi(\Theta) = \frac{2}{1+\alpha}\operatorname{tr}P, \tag{68}$$

$$D_{\alpha}[P:Q] = \frac{4}{1-\alpha^2}\operatorname{tr}\left\{-P^{\frac{1-\alpha}{2}}Q^{\frac{1+\alpha}{2}} + \frac{1-\alpha}{2}P + \frac{1+\alpha}{2}Q\right\}. \tag{69}$$

The affine coordinate system is $\frac{2}{1-\alpha}P^{\frac{1-\alpha}{2}}$, and its dual is $\frac{2}{1+\alpha}P^{\frac{1+\alpha}{2}}$.

(6) The β-divergence is derived from Equation (41) as

$$D_{\beta}[P:Q] = \frac{1}{\beta(\beta+1)}\operatorname{tr}\left\{P^{\beta+1} + \beta Q^{\beta+1} - (\beta+1)PQ^{\beta}\right\}. \tag{70}$$

4.4. Invariance Under Gl(n)

We extend the concept of invariance under the orthogonal group to invariance under the general linear group Gl(n), that is, the set of invertible matrices L with det |L| ≠ 0. This is a stronger condition. A divergence is said to be invariant under Gl(n) when

$$D[P:Q] = D\left[L^{T}PL : L^{T}QL\right] \tag{71}$$

holds for any L ∈ Gl(n). We identify the matrix P with the zero-mean Gaussian distribution

$$p(x, P) = \exp\left\{-\frac{1}{2}x^{T}P^{-1}x - \frac{1}{2}\log\det|P| - c\right\}, \tag{72}$$

where c is a constant. We know that an invariant divergence belongs to the class of f-divergences in the case of a manifold of probability distributions, where invariance means that the geometry does not change under a one-to-one mapping of x to y. Moreover, the only invariant flat divergence is the KL-divergence [22]. These facts suggest the following conjecture.

Proposition. The invariant, flat and decomposable divergence under Gl(n) is the KL-divergence given by

$$D_{KL}[P:Q] = \operatorname{tr}\left(PQ^{-1}\right) - \log\left(\det\left|PQ^{-1}\right|\right) - n. \tag{73}$$

5. Non-Decomposable Divergence

We have focused on flat and decomposable divergences. There are many interesting non-decomposable divergences.
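As a numerical illustration of the flat, decomposable case treated so far, the following minimal sketch evaluates the (α, β)-divergence of Equation (38) on positive vectors and computes a cluster center by averaging the η-coordinates and inverting τ componentwise, as in the cluster-center theorem. The function names, the sample data and the choice α = 0.5, β = 1.5 are illustrative assumptions and do not come from the original text; positive α and β are assumed so that all powers are well defined.

```python
def ab_divergence(xi, xi_prime, alpha, beta):
    """(alpha, beta)-divergence of Equation (38) between two positive vectors."""
    scale = 1.0 / (alpha * beta * (alpha + beta))
    total = 0.0
    for a, b in zip(xi, xi_prime):
        total += (alpha * a ** (alpha + beta)
                  + beta * b ** (alpha + beta)
                  - (alpha + beta) * a ** alpha * b ** beta)
    return scale * total


def cluster_center(points, beta):
    """Cluster center of Equation (11) for the (alpha, beta)-divergence.

    The eta-coordinates eta_i = xi_i**beta / beta (Equation (35)) are averaged
    over the cluster, and tau is then inverted componentwise.
    """
    m = len(points)
    n = len(points[0])
    eta_mean = [sum(p[i] ** beta / beta for p in points) / m for i in range(n)]
    return [(beta * eta) ** (1.0 / beta) for eta in eta_mean]


def total_divergence(q, points, alpha, beta):
    """Objective of Equation (11): the sum of D[Q : P_k] over the cluster."""
    return sum(ab_divergence(q, p, alpha, beta) for p in points)


# Illustrative data (assumed): three positive measures on n = 3 elements.
ALPHA, BETA = 0.5, 1.5
POINTS = [[1.0, 2.0, 0.5], [0.2, 1.5, 3.0], [2.5, 0.7, 1.1]]
CENTER = cluster_center(POINTS, BETA)
print("center:", CENTER)
print("objective at center:", total_divergence(CENTER, POINTS, ALPHA, BETA))
```

Because the θ-η transformation is componentwise here, the center costs only one pass over the data. Perturbing any component of `CENTER` and re-evaluating `total_divergence` is a quick way to see that the η-average is indeed a minimizer of the objective.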
We first discuss a general class of flat divergences in $\mathbf{R}^n_+$ and then touch upon interesting flat and non-flat divergences in the manifold of positive-definite matrices.

5.1. General Class of Flat Divergences in $\mathbf{R}^n_+$

We can describe a general class of flat divergences in $\mathbf{R}^n_+$, which are not necessarily decomposable. This is introduced in [23], which studies the conformal structure of general total Bregman divergences ([11,13]). When $\mathbf{R}^n_+$ is endowed with a dually flat structure, it has a θ-coordinate system given by

$$\theta = \rho(\xi), \tag{74}$$

which is not necessarily a componentwise function. Any pair of an invertible θ = ρ(ξ) and a convex function ψ(θ) defines a dually flat structure and, hence, a Bregman divergence in $\mathbf{R}^n_+$. The dual coordinates η = τ(ξ) are given by

$$\eta = \nabla\psi(\theta), \tag{75}$$

so that we have

$$\eta = \tau(\xi) = \nabla\psi\{\rho(\xi)\}. \tag{76}$$

This implies that a pair (ρ, τ) of coordinate systems can define dually coupled affine coordinates and, hence, a dually flat structure, when and only when $\eta = \tau\left\{\rho^{-1}(\theta)\right\}$ is the gradient of a convex function. This is different from the case of decomposable divergences, where any monotone pair of ρ(ξ) and τ(ξ) gives a dually flat structure.

5.2. Non-Decomposable Flat Divergence in $PD_n$

Ohara and Eguchi [15,16] introduced the following function:

$$\psi_V(P) = V\left(\det|P|\right), \tag{77}$$

where V(ξ) is a monotonically decreasing scalar function. $\psi_V$ is convex when and only when

$$1 + \frac{V''(\xi)\,\xi}{V'(\xi)} < \frac{1}{n}. \tag{78}$$

In such a case, we can introduce a dually flat structure to $PD_n$, where P is an affine coordinate system with convex $\psi_V(P)$, and the dual affine coordinate system is the gradient of $\psi_V$,

$$H = V'\left(\det|P|\right)\det|P|\;P^{-1}. \tag{79}$$

The derived divergence is

$$D_V[P:Q] = V\left(\det|P|\right) - V\left(\det|Q|\right) \tag{80}$$

$$\qquad + V'\left(\det|Q|\right)\det|Q|\,\operatorname{tr}\left\{Q^{-1}(Q-P)\right\}. \tag{81}$$

When V(ξ) = −log ξ, we have V′(ξ)ξ = −1, and the divergence reduces to the case of Equation (54), which is invariant under Gl(n) and decomposable. However, the divergence $D_V[P:Q]$ is not decomposable in general. It is invariant under O(n) and, more strongly, under SGl(n) ⊂ Gl(n), defined by det |L| = ±1.
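The invariance statements of Sections 4.2 and 4.4 can be checked numerically. The following sketch, restricted to 2 × 2 symmetric positive-definite matrices and written with plain Python lists, evaluates the divergences of Equations (51) and (54) before and after the congruence P ↦ LᵀPL. The helper names, the restriction to n = 2 and the sample matrices are assumptions made for illustration only.

```python
import math


def matmul(a, b):
    """Product of two 2 x 2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]


def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]


def trace(a):
    return a[0][0] + a[1][1]


def congruence(l, p):
    """Transformation P -> L^T P L of Equations (49) and (71)."""
    return matmul(transpose(l), matmul(p, l))


def frobenius_divergence(p, q):
    """Equation (51): half the squared Frobenius norm of P - Q."""
    return 0.5 * sum((p[i][j] - q[i][j]) ** 2
                     for i in range(2) for j in range(2))


def logdet_divergence(p, q):
    """Equation (54) with n = 2: tr(P Q^-1) - log det(P Q^-1) - n."""
    m = matmul(p, inv2(q))
    return trace(m) - math.log(det2(m)) - 2


# Illustrative matrices (assumed): P and Q are positive definite.
P = [[2.0, 0.5], [0.5, 1.0]]
Q = [[1.0, 0.2], [0.2, 3.0]]
L = [[1.0, 2.0], [0.5, 3.0]]          # invertible, not orthogonal
ANGLE = 0.3
ROT = [[math.cos(ANGLE), -math.sin(ANGLE)],
       [math.sin(ANGLE), math.cos(ANGLE)]]  # orthogonal

print("log-det, original vs. L-congruence:",
      logdet_divergence(P, Q),
      logdet_divergence(congruence(L, P), congruence(L, Q)))
print("Frobenius, original vs. rotation vs. L-congruence:",
      frobenius_divergence(P, Q),
      frobenius_divergence(congruence(ROT, P), congruence(ROT, Q)),
      frobenius_divergence(congruence(L, P), congruence(L, Q)))
```

In this example the log-determinant divergence is unchanged by the non-orthogonal L, whereas the Euclidean divergence of Equation (51) is preserved by the rotation but not by L, which illustrates why Gl(n)-invariance is the stronger requirement.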
5.3. Flat Structure Derived from q-Escort Distribution

A dually flat structure is introduced in the manifold of probability distributions [4] as

$$\tilde D_q[p:q] = \frac{1}{1-q}\,\frac{1}{H_q(p)}\left\{1 - \sum_i p_i^{1-q}q_i^{q}\right\}, \tag{82}$$

where

$$H_q(p) = \sum_i p_i^{q}, \tag{83}$$

$$q = \frac{1+\alpha}{2}. \tag{84}$$

The dual affine coordinates are the q-escort distribution [4]:

$$\eta_i = \frac{1}{H_q(p)}\,p_i^{q}. \tag{85}$$

The divergence $\tilde D_q$ is flat, but not decomposable. We can generalize it to the case of $PD_n$,

$$\tilde D_q[P:Q] = \frac{1}{1-q}\,\frac{1}{\operatorname{tr}P^{q}}\left\{(1-q)\operatorname{tr}(P) + q\operatorname{tr}(Q) - \operatorname{tr}\left(P^{1-q}Q^{q}\right)\right\}. \tag{86}$$

This is flat, but not decomposable.

5.4. γ-Divergence in $PD_n$

The γ-divergence was introduced by Fujisawa and Eguchi [24]. It gives a super-robust estimator. It is interesting to generalize it to $PD_n$,

$$D_{\gamma}[P:Q] = \frac{1}{\gamma(\gamma-1)}\left\{\log\operatorname{tr}P^{\gamma} + (\gamma-1)\log\operatorname{tr}Q^{\gamma} - \gamma\log\operatorname{tr}PQ^{\gamma-1}\right\}. \tag{87}$$

This is neither flat nor decomposable. It is a projective divergence in the sense that, for any c, c′ > 0,

$$D_{\gamma}\left[cP : c'Q\right] = D_{\gamma}[P:Q]. \tag{88}$$

Therefore, it can be defined in the submanifold tr P = 1.

6. Concluding Remarks

We have shown that the (ρ, τ)-divergence introduced by Zhang [5] is a general dually flat decomposable structure of the manifold of positive measures. We then extended it to the manifold of positive-definite matrices, where the criterion of invariance under linear transformations (in particular, under orthogonal transformations) was added. Decomposability is useful from the computational point of view, because the θ-η transformation is tractable. This is the motivation for studying decomposable flat divergences. When we treat the manifold of probability distributions, it is a submanifold of the manifold of positive measures, where the total sum of the measures is restricted to one. This is a nonlinear constraint in the θ- or η-coordinates, so that the manifold is not flat, but curved in general. Hence, our arguments hold in this case only when at least one of the ρ and τ functions is linear.
The U-divergence [21] and β-divergence [19] are such cases. However, for clustering, we can take the average of the η-coordinates of the member probability distributions in the larger manifold of positive measures and then project it to the manifold of probability distributions. This is called the exterior average, and the projection is simply a normalization of the result. Therefore, the (ρ, τ)-structure is useful in the case of probability distributions.

The same situation holds in the case of positive-definite matrices. Quantum information theory deals with positive-definite Hermitian matrices of trace one [25,26]. We need to extend our discussions to the case of complex matrices. The trace-one constraint is not linear with respect to the θ- or η-coordinates, just as in the case of probability distributions. Many interesting divergence functions have been introduced in the manifold of positive-definite Hermitian matrices. It is an interesting future problem to apply our theory to quantum information theory.

Conflicts of Interest: The author declares no conflicts of interest.

References

1. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society and Oxford University Press: Providence, RI, USA, 2000.
2. Tsallis, C. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World; Springer: Berlin/Heidelberg, Germany, 2009.
3. Naudts, J. Generalized Thermostatistics; Springer: Berlin/Heidelberg, Germany, 2011.
4. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometries. Physica A 2012, 391, 4308–4319.
5. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
6. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
7. Cichocki, A.; Amari, S.
Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568.
8. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170.
9. Minami, M.; Eguchi, S. Robust blind source separation by beta-divergence. Neural Comput. 2002, 14, 1859–1886.
10. Banerjee, A.; Merugu, S.; Dhillon, I.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
11. Liu, M.; Vemuri, B.C.; Amari, S.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Learn. 2012, 24, 3192–3212.
12. Dhillon, I.S.; Tropp, J.A. Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl. 2007, 29, 1120–1146.
13. Vemuri, B.C.; Liu, M.; Amari, S.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imaging 2011, 30, 475–483.
14. Ohara, A.; Suda, N.; Amari, S. Dualistic differential geometry of positive definite matrices and its applications to related problems. Linear Algebra Appl. 1996, 247, 31–53.
15. Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by beta-divergence. Entropy 2013, 15, 4732–4747.
16. Ohara, A.; Eguchi, S. Geometry on positive definite matrices induced from V-potential functions. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 621–629.
17. Chebbi, Z.; Moakher, M. Means of Hermitian positive-definite matrices based on the log-determinant alpha-divergence function. Linear Algebra Appl. 2012, 436, 1872–1889.
18. Tsuda, K.; Rätsch, G.; Warmuth, M.K. Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 2005, 6, 995–1018.
19. Nock, R.; Magdalou, B.; Briys, E.; Nielsen, F.
Mining matrix data with Bregman matrix divergences for portfolio selection. In Matrix Information Geometry; Nielsen, F., Bhatia, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Chapter 15, pp. 373–402.
20. Nielsen, F., Bhatia, R., Eds. Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013.
21. Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expo. 2006, 19, 197–216.
22. Amari, S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
23. Nock, R.; Nielsen, F.; Amari, S. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2014, submitted for publication.
24. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
25. Petz, D. Monotone metrics on matrix spaces. Linear Algebra Appl. 1996, 244, 81–96.
26. Hasegawa, H. α-divergence of the non-commutative information geometry. Rep. Math. Phys. 1993, 33, 87–93.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

F-Geometry and Amari's α-Geometry on a Statistical Manifold

Harsha K. V. * and Subrahamanian Moosath K S *

Indian Institute of Space Science and Technology, Department of Space, Government of India, Valiamala P.O, Thiruvananthapuram-695547, Kerala, India
* E-Mails: harsha.11@iist.ac.in (K.V.H.); smoosath@iist.ac.in (K.S.S.M.); Tel.: +91-95-6736-0425 (K.V.H.); +91-94-9574-3148 (K.S.S.M.).

Received: 13 December 2013; in revised form: 21 April 2014 / Accepted: 25 April 2014 / Published: 6 May 2014

Abstract: In this paper, we introduce a geometry called F-geometry on a statistical manifold S using an embedding F of S into the space $\mathbf{R}^X$ of random variables.
Amari's α-geometry is a special case of F-geometry. Then, using the embedding F and a positive smooth function G, we introduce the (F, G)-metric and (F, G)-connections, which enable one to consider a weighted Fisher information metric and weighted connections. The necessary and sufficient condition for two (F, G)-connections to be dual with respect to the (F, G)-metric is obtained. Then we show that Amari's 0-connection is the only self-dual F-connection with respect to the Fisher information metric. Invariance properties of the geometric structures are discussed, and we prove that Amari's α-connections are the only F-connections that are invariant under smooth one-to-one transformations of the random variables.

Keywords: embedding; Amari's α-connections; F-metric; F-connections; (F, G)-metric; (F, G)-connections; invariance

1. Introduction

The geometric study of statistical estimation has opened up an interesting new area called information geometry. Information geometry achieved remarkable progress through the works of Amari [1,2] and his colleagues [3,4]. In the last few years, many authors have contributed considerably to this area [5–9]. Information geometry has a wide variety of applications in other areas of engineering and science, such as neural networks, machine learning, biology, mathematical finance, control system theory, quantum systems, statistical mechanics, etc.

A statistical manifold of probability distributions is equipped with a Riemannian metric and a pair of dual affine connections [2,4,9]. It was Rao [10] who introduced the idea of using the Fisher information as a Riemannian metric in the manifold of probability distributions. Chentsov [11] introduced a family of affine connections on a statistical manifold defined on finite sets. Amari [2] introduced a family of affine connections called α-connections using a one-parameter family of functions, the α-embeddings. These α-connections are equivalent to those defined by Chentsov.
The Fisher information metric and these affine connections are characterized by invariance with respect to the sufficient statistic [4,12] and play a vital role in the theory of statistical estimation. Zhang [13] generalized Amari's α-representation and, using this general representation together with a convex function, defined a family of divergence functions from the point of view of representational and referential duality. The Riemannian metric and dual connections are defined using these divergence functions.

In this paper, Amari's idea of using α-embeddings to define geometric structures is extended to a general embedding.

Entropy 2014, 16, 2472–2487; doi:10.3390/e16052472; www.mdpi.com/journal/entropy

This paper is organized as follows. In Section 2, we define an affine connection called the F-connection and a Riemannian metric called the F-metric using a general embedding F of a statistical manifold S into the space of random variables. We show that the F-metric is the Fisher information metric and that Amari's α-geometry is a special case of F-geometry. Further, we introduce the (F, G)-metric and (F, G)-connections using the embedding F and a positive smooth function G. In Section 3, a necessary and sufficient condition for two (F, G)-connections to be dual with respect to the (F, G)-metric is derived, and we prove that Amari's 0-connection is the only self-dual F-connection with respect to the Fisher information metric. Then we prove that the set of all positive finite measures on X, for a finite X, has an F-affine manifold structure for any embedding F. In Section 4, invariance properties of the geometric structures are discussed. We prove that the Fisher information metric and Amari's α-connections are invariant under both the transformation of the parameter and the transformation of the random variable.
Further we show that Amari's α-connections are the only F-connections that are invariant under both the transformation of the parameter and the transformation of the random variable.

Let $(X, \mathcal{B})$ be a measurable space, where X is a non-empty subset of $\mathbf{R}$ and $\mathcal{B}$ is the σ-field of subsets of X. Let $\mathbf{R}^X$ be the space of all real-valued measurable functions defined on $(X, \mathcal{B})$. Consider an n-dimensional statistical manifold $S = \{p(x;\xi) \mid \xi = [\xi^1, \ldots, \xi^n] \in E \subseteq \mathbf{R}^n\}$, with coordinates $\xi = [\xi^1, \ldots, \xi^n]$, defined on X. S is a subset of $\mathcal{P}(X)$, the set of all probability measures on X, given by

$$\mathcal{P}(X) := \left\{p : X \longrightarrow \mathbf{R} \;\middle|\; p(x) > 0\ (\forall x \in X);\ \int_X p(x)\,dx = 1\right\}. \tag{1}$$

The tangent space to S at a point $p_\xi$ is given by

$$T_\xi(S) = \left\{\sum_{i=1}^{n} \alpha^i\partial_i \;\middle|\; \alpha^i \in \mathbf{R}\right\}, \quad \text{where } \partial_i = \frac{\partial}{\partial\xi^i}. \tag{2}$$

Define $\ell(x;\xi) = \log p(x;\xi)$ and consider the partial derivatives $\left\{\frac{\partial\ell}{\partial\xi^i} = \partial_i\ell;\ i = 1, \ldots, n\right\}$, which are called scores. For the statistical manifold S, the $\partial_i\ell$ are linearly independent functions in x for a fixed ξ. Let $T^1_\xi(S)$ be the n-dimensional vector space spanned by the n functions $\{\partial_i\ell;\ i = 1, \ldots, n\}$ in x. So

$$T^1_\xi(S) = \left\{\sum_{i=1}^{n} A^i\partial_i\ell \;\middle|\; A^i \in \mathbf{R}\right\}. \tag{3}$$

Then there is a natural isomorphism between the two vector spaces $T_\xi(S)$ and $T^1_\xi(S)$ given by

$$\partial_i \in T_\xi(S) \longleftrightarrow \partial_i\ell(x;\xi) \in T^1_\xi(S). \tag{4}$$

Obviously, a tangent vector $A = \sum_{i=1}^{n} A^i\partial_i \in T_\xi(S)$ corresponds to a random variable $A(x) = \sum_{i=1}^{n} A^i\partial_i\ell(x;\xi) \in T^1_\xi(S)$ having the same components $A^i$. Note that $T_\xi(S)$ is the differentiation operator representation of the tangent space, while $T^1_\xi(S)$ is the random variable representation of the same tangent space. The space $T^1_\xi(S)$ is called the 1-representation of the tangent space.

Let A and B be two tangent vectors in $T_\xi(S)$, and let A(x) and B(x) be the 1-representations of A and B, respectively. We can define an inner product on each tangent space $T_\xi(S)$ by

$$g_\xi(A, B) = \langle A, B\rangle_\xi = E_\xi[A(x)B(x)] = \int A(x)B(x)\,p(x;\xi)\,dx.$$
(5)

In particular, the inner product of the basis vectors $\partial_i$ and $\partial_j$ is

$$g_{ij}(\xi) = \langle\partial_i, \partial_j\rangle_\xi = E_\xi[\partial_i\ell\,\partial_j\ell] = \int \partial_i\ell(x;\xi)\,\partial_j\ell(x;\xi)\,p(x;\xi)\,dx. \tag{6}$$

Note that $g = \langle\,,\rangle$ defines a Riemannian metric on S called the Fisher information metric. On the Riemannian manifold $(S, g = \langle\,,\rangle)$, define $n^3$ functions $\Gamma_{ijk}$ by

$$\Gamma_{ijk}(\xi) = E_\xi\left[\left(\partial_i\partial_j\ell(x;\xi)\right)\left(\partial_k\ell(x;\xi)\right)\right]. \tag{7}$$

These functions $\Gamma_{ijk}$ uniquely determine an affine connection ∇ on S by

$$\Gamma_{ijk}(\xi) = \langle\nabla_{\partial_i}\partial_j, \partial_k\rangle_\xi. \tag{8}$$

∇ is called the 1-connection or the exponential connection.

Amari [2] defined a one-parameter family of functions called the α-embeddings, given by

$$L_\alpha(p) = \begin{cases} \dfrac{2}{1-\alpha}\,p^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\[2mm] \log p, & \alpha = 1. \end{cases} \tag{9}$$

Using these, we can define $n^3$ functions $\Gamma^{\alpha}_{ijk}$ by

$$\Gamma^{\alpha}_{ijk} = \int \partial_i\partial_j L_\alpha(p(x;\xi))\,\partial_k L_{-\alpha}(p(x;\xi))\,dx. \tag{10}$$

These $\Gamma^{\alpha}_{ijk}$ uniquely determine affine connections $\nabla^{\alpha}$ on the statistical manifold S by

$$\Gamma^{\alpha}_{ijk} = \langle\nabla^{\alpha}_{\partial_i}\partial_j, \partial_k\rangle, \tag{11}$$

which are called α-connections.

2. F-Geometry of a Statistical Manifold

On a statistical manifold S, the Fisher information metric and the exponential connection are defined using the log embedding. In a similar way, α-connections are defined using a one-parameter family of functions, the α-embeddings. In general, we can give other geometric structures on S using different embeddings of the manifold S into the space of random variables $\mathbf{R}^X$.

Let $F : (0, \infty) \longrightarrow \mathbf{R}$ be an injective function that is at least twice differentiable. We also assume that $F'(u) \neq 0$ for all $u \in (0, \infty)$. F is an embedding of S into $\mathbf{R}^X$ that takes each $p(x;\xi) \longmapsto F(p(x;\xi))$. Denote $F(p(x;\xi))$ by $F(x;\xi)$; then $\partial_i F$ can be written as

$$\partial_i F(x;\xi) = p(x;\xi)\,F'(p(x;\xi))\,\partial_i\ell(x;\xi). \tag{12}$$

It is clear that $\partial_i F(x;\xi)$, $i = 1, \ldots, n$, are linearly independent functions in x for fixed ξ, since $\partial_i\ell(x;\xi)$, $i = 1, \ldots, n$, are linearly independent. Let $T_{F(p_\xi)}F(S)$ be the n-dimensional vector space spanned by the n functions $\partial_i F$, $i = 1, \ldots, n$, in x for fixed ξ.
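So

$$T_{F(p_\xi)}F(S) = \left\{\sum_{i=1}^{n} A^i\partial_i F \;\middle|\; A^i \in \mathbf{R}\right\}. \tag{13}$$

Let the tangent space $T_{F(p_\xi)}(F(S))$ to F(S) at the point $F(p_\xi)$ be denoted by $T^F_\xi(S)$. There is a natural isomorphism between the two vector spaces $T_\xi(S)$ and $T^F_\xi(S)$ given by

$$\partial_i \in T_\xi(S) \longleftrightarrow \partial_i F(x;\xi) \in T^F_\xi(S). \tag{14}$$

$T^F_\xi(S)$ is called the F-representation of the tangent space $T_\xi(S)$.

For any $A = \sum_{i=1}^{n} A^i\partial_i \in T_\xi(S)$, the corresponding $A(x) = \sum_{i=1}^{n} A^i\partial_i F \in T^F_\xi(S)$ is called the F-representation of the tangent vector A and is denoted by $A^F(x)$.

Note that $T^F_\xi(S) \subseteq T_{F(p_\xi)}(\mathbf{R}^X)$. Since $\mathbf{R}^X$ is a vector space, its tangent space $T_{F(p_\xi)}(\mathbf{R}^X)$ can be identified with $\mathbf{R}^X$. So $T^F_\xi(S) \subseteq \mathbf{R}^X$.

Definition 1. The F-expectation of a random variable f with respect to the distribution p(x; ξ) is defined as

$$E^F_\xi(f) = \int f(x)\,\frac{1}{p\,(F'(p))^2}\,dx. \tag{15}$$

We can use this F-expectation to define an inner product in $\mathbf{R}^X$ by

$$\langle f, g\rangle^F_\xi = E^F_\xi[f(x)g(x)], \tag{16}$$

which induces an inner product on $T_\xi(S)$ by

$$\langle A, B\rangle^F_\xi = E^F_\xi\left[A^F(x)B^F(x)\right]; \quad A, B \in T_\xi(S). \tag{17}$$

Proposition 1. The induced metric $\langle\,,\rangle^F$ on S is the Fisher information metric $g = \langle\,,\rangle$ on S.

Proof. For any basis vectors $\partial_i, \partial_j \in T_\xi(S)$,

$$\begin{aligned}
\langle\partial_i, \partial_j\rangle^F_\xi &= E^F_\xi[\partial_i F\,\partial_j F] = \int \partial_i F\,\partial_j F\,\frac{1}{p\,(F'(p))^2}\,dx \\
&= \int \left(p\,F'(p)\,\partial_i\ell\right)\left(p\,F'(p)\,\partial_j\ell\right)\frac{1}{p\,(F'(p))^2}\,dx \\
&= \int \partial_i\ell\,\partial_j\ell\,p(x;\xi)\,dx = E_\xi[\partial_i\ell\,\partial_j\ell] = g_{ij}(\xi) = \langle\partial_i, \partial_j\rangle_\xi.
\end{aligned} \tag{18}$$

So the metric $\langle\,,\rangle^F$ on S induced by the embedding F of S into $\mathbf{R}^X$ is the Fisher information metric $g = \langle\,,\rangle$ on S.

We can induce a connection on S using the embedding F. Let $\pi^F|_{p_\xi} : \mathbf{R}^X \longrightarrow T^F_\xi(S)$ be the projection map.

Definition 2. The connection induced by the embedding F on S, the F-connection, is defined as

$$\nabla^F_{\partial_i}\partial_j = \pi^F|_{p_\xi}(\partial_i\partial_j F) = \sum_n\sum_m g^{mn}\,\langle\partial_i\partial_j F, \partial_m F\rangle^F_\xi\,\partial_n, \tag{19}$$

where $[g^{mn}(\xi)]$ is the inverse of the Fisher information matrix $G(\xi) = [g_{mn}(\xi)]$.

Note that the F-connections are symmetric.

Lemma 1.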
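The F-connection and its components can be written in terms of scores as

$$\nabla^F_{\partial_i}\partial_j = \sum_n\sum_m g^{mn}\,E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_m\ell)\right]\partial_n \tag{20}$$

and

$$\Gamma^F_{ijk}(\xi) = E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right]. \tag{21}$$

Proof. From Equation (12), we have

$$\partial_i\partial_j F = pF'(p)\,\partial_i\partial_j\ell + \left[pF'(p) + p^2F''(p)\right]\partial_i\ell\,\partial_j\ell. \tag{22}$$

Therefore

$$\begin{aligned}
\langle\partial_i\partial_j F, \partial_m F\rangle^F_\xi &= \int \partial_i\partial_j F\,\partial_m F\,\frac{1}{p\,(F'(p))^2}\,dx \\
&= \int \left\{pF'(p)\,\partial_i\partial_j\ell + \left[pF'(p) + p^2F''(p)\right]\partial_i\ell\,\partial_j\ell\right\}\frac{\partial_m\ell}{F'(p)}\,dx \\
&= \int \left\{\partial_i\partial_j\ell\,\partial_m\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\,\partial_m\ell\right\}p\,dx \\
&= E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_m\ell)\right].
\end{aligned} \tag{23}$$

Hence we can write

$$\nabla^F_{\partial_i}\partial_j = \pi^F|_{p_\xi}(\partial_i\partial_j F) = \sum_n\sum_m g^{mn}\,E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_m\ell)\right]\partial_n. \tag{24}$$

Then we have the Christoffel symbols of the F-connection,

$$\Gamma^{n}_{ij} = \sum_m g^{mn}\,E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_m\ell)\right], \tag{25}$$

and the components of the F-connection are given by

$$\Gamma^F_{ijk}(\xi) = \langle\nabla^F_{\partial_i}\partial_j, \partial_k\rangle_\xi = E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right]. \tag{26}$$

Theorem 1. Amari's α-geometry is a special case of the F-geometry.

Proof. Let $F(p) = L_\alpha(p)$, where $L_\alpha(p)$ is the α-embedding of Amari. The components $\Gamma^{\alpha}_{ijk}$ of the α-connection are given by

$$\Gamma^{\alpha}_{ijk}(\xi) = \langle\nabla^{\alpha}_{\partial_i}\partial_j, \partial_k\rangle_\xi = E_\xi\left[\left(\partial_i\partial_j\ell + \frac{1-\alpha}{2}\,\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right]. \tag{27}$$

When $F(p) = L_\alpha(p)$, we have

$$F'(p) = L'_\alpha(p) = p^{-\frac{1+\alpha}{2}}, \tag{28}$$

$$F''(p) = L''_\alpha(p) = -\frac{1+\alpha}{2}\,p^{-\frac{3+\alpha}{2}}. \tag{29}$$

Then we get

$$1 + \frac{pF''(p)}{F'(p)} = 1 + \frac{pL''_\alpha(p)}{L'_\alpha(p)} = \frac{1-\alpha}{2}. \tag{30}$$

Hence, from Equation (26),

$$\begin{aligned}
\Gamma^F_{ijk}(\xi) = \langle\nabla^F_{\partial_i}\partial_j, \partial_k\rangle_\xi &= E_\xi\left[\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right] \\
&= E_\xi\left[\left(\partial_i\partial_j\ell + \frac{1-\alpha}{2}\,\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right] \\
&= \Gamma^{\alpha}_{ijk}(\xi),
\end{aligned} \tag{31}$$

which are the components of the α-connection. Hence the F-connection reduces to the α-connection. Thus we obtain that α-geometry is a special case of F-geometry.

Remark 1. Burbea [14] introduced the concept of a weighted Fisher information metric using a positive continuous function. We use this idea to define the weighted F-metric and weighted F-connections.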
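Let $G : (0, \infty) \longrightarrow \mathbf{R}$ be a positive smooth function and F be an embedding. Define the (F, G)-expectation of a random variable f with respect to the distribution $p_\xi$ as

$$E^{F,G}_\xi(f) = \int f(x)\,\frac{G(p)}{p\,(F'(p))^2}\,dx. \tag{32}$$

Define the (F, G)-metric $\langle\,,\rangle^{F,G}_\xi$ in $T_{p_\xi}(S)$ by

$$\begin{aligned}
\langle\partial_i, \partial_j\rangle^{F,G}_\xi &= E^{F,G}_\xi[\partial_i F\,\partial_j F] = \int \partial_i F\,\partial_j F\,\frac{G(p)}{p\,(F'(p))^2}\,dx \\
&= \int \partial_i\ell\,\partial_j\ell\,G(p)\,p\,dx = E_\xi[G(p)\,\partial_i\ell\,\partial_j\ell].
\end{aligned} \tag{33}$$

Define the (F, G)-connection as

$$\Gamma^{F,G}_{ijk} = \langle\nabla^{F,G}_{\partial_i}\partial_j, \partial_k\rangle_\xi = E_\xi\left[\left\{\left(\partial_i\partial_j\ell + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right)(\partial_k\ell)\right\}G(p)\right]. \tag{34}$$

When G(p) = 1, the (F, G)-connection reduces to the F-connection and the metric $\langle\,,\rangle^{F,G}$ reduces to the Fisher information metric. This is a more general way of defining Riemannian metrics and affine connections on a statistical manifold.

3. Dual Affine Connections

Definition 3. Let M be a Riemannian manifold with a Riemannian metric g. Two affine connections ∇ and ∇* on the tangent bundle are said to be dual connections with respect to the metric g if

$$Z\,g(X, Y) = g(\nabla_Z X, Y) + g(X, \nabla^{*}_Z Y) \tag{35}$$

holds for any vector fields X, Y, Z on M.

Theorem 2. Let F, H be two embeddings of the statistical manifold S into the space $\mathbf{R}^X$ of random variables. Let G be a positive smooth function on (0, ∞). Then the (F, G)-connection $\nabla^{F,G}$ and the (H, G)-connection $\nabla^{H,G}$ are dual connections with respect to the (F, G)-metric iff the functions F and H satisfy

$$H'(p) = \frac{G(p)}{p\,F'(p)}. \tag{36}$$

We call such an embedding H a G-dual embedding of F. The components of the dual connection $\nabla^{H,G}$ can be written as

$$\Gamma^{H,G}_{ijk} = \int \left\{\partial_i\partial_j\ell + \left(\frac{pG'(p)}{G(p)} - \frac{pF''(p)}{F'(p)}\right)\partial_i\ell\,\partial_j\ell\right\}\partial_k\ell\;G(p)\,p\,dx. \tag{37}$$

Proof. That $\nabla^{F,G}$ and $\nabla^{H,G}$ are dual connections with respect to the (F, G)-metric means that

$$\partial_k\langle\partial_i, \partial_j\rangle^{F,G} = \langle\nabla^{F,G}_{\partial_k}\partial_i, \partial_j\rangle^{F,G} + \langle\partial_i, \nabla^{H,G}_{\partial_k}\partial_j\rangle^{F,G} \tag{38}$$

for any basis vectors $\partial_i, \partial_j, \partial_k \in T_\xi(S)$. We have

$$\partial_k\langle\partial_i, \partial_j\rangle^{F,G} = \int \partial_k\partial_j\ell\,\partial_i\ell\;pG(p)\,dx + \int \partial_k\partial_i\ell\,\partial_j\ell\;pG(p)\,dx + \int \left(1 + \frac{pG'(p)}{G(p)}\right)\partial_i\ell\,\partial_j\ell\,\partial_k\ell\;pG(p)\,dx.$$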
(39) < ∇F,G ∂k ∂i, ∂j >F,G + < ∂i, ∇H,G ∂k ∂j >F,G = � ∂k∂iℓ ∂jℓ pG(p)dx + � 1 + pF′′(p) F′(p) ∂iℓ ∂jℓ ∂kℓ pG(p)dx + � 1 + pH′′(p) H′(p) ∂iℓ ∂jℓ ∂kℓ pG(p)dx + � ∂k∂jℓ ∂iℓ pG(p)dx (40) Then the condition (38) holds iff � [2 + pF′′(p) F′(p) + pH′′(p) H′(p) ]∂iℓ ∂jℓ ∂kℓ pG(p)dx = � [1 + pG′(p) G(p) ]∂iℓ ∂jℓ ∂kℓ pG(p)dx (41) ⇐⇒ [2 + pF′′(p) F′(p) + pH′′(p) H′(p) ] = 1 + pG′(p) G(p) . (42) ⇐⇒ 1 + pH′′(p) H′(p) = pG′(p) G(p) − pF′′(p) F′(p) (43) ⇐⇒ H′′(p) H′(p) = G′(p) G(p) − F′′(p) F′(p) − 1 p ⇐⇒ H′(p) = G(p) pF′(p). (44) Hence ∇F,G and ∇H,G are dual connections with respect to the (F, G)−metric iff Equation (36) holds. From Equation (43), we can rewrite the components of dual connection ∇H,G as ΓH,G ijk = � � ∂i∂jℓ + ( pG′(p) G(p) − pF′′(p) F′(p) )∂iℓ ∂jℓ � ∂kℓ G(p)p dx. (45) 20 Entropy 2014, 16, 2472–2487 Corollary 1. Amari’s 0−connection is the only self dual F−connection with respect to the Fisher information metric. Proof. From Theorem 2, for G(p) = 1 the F−connection ∇F and the H−connection ∇H are dual connections with respect to the Fisher information metric iff the functions F and H satisfy H′(p) = 1 pF′(p) (46) Thus the F−connection ∇F is self dual iff the embedding F satisfies the condition F′(p) = 1 pF′(p) ⇐⇒ F′(p) = p−( 1 2 ) ⇐⇒ F(p) = 2p 1 2 = L0(p). (47) That is, Amari’s 0−connection is the only self dual F−connection with respect to the Fisher information metric. So far, we have considered the statistical manifold S as a subset of P(X), the set of all probability measures on X. Now we relax the condition � p(x)dx = 1, and consider S as a subset of ˜P(X), which is defined by ˜P(X) := {p : X −→ R / p(x) > 0 (∀ x ∈ X); � X p(x)dx < ∞}. (48) Definition 4. Let M be a Riemannian manifold with a Riemannian metric g. Let ∇ be an affine connection on M. 
If there exists a coordinate system $[\theta^i]$ of $M$ such that $\nabla_{\partial_i}\partial_j = 0$, then we say that $\nabla$ is flat, or alternatively that $M$ is flat with respect to $\nabla$, and we call such a coordinate system $[\theta^i]$ an affine coordinate system for $\nabla$.

Definition 5. Let $S = \{p(x; \xi) \mid \xi = [\xi^1, \ldots, \xi^n] \in E \subseteq \mathbb{R}^n\}$ be an $n$-dimensional statistical manifold. If for some coordinate system $[\theta^i]$, $i = 1, \ldots, n$,

$$\partial_i\partial_j F(p(x; \theta)) = 0, \tag{49}$$

then we can see from Equation (19) that $[\theta^i]$ is an $F$-affine coordinate system and that $S = \{p_\theta\}$ is $F$-flat. We call such an $S$ an $F$-affine manifold. The condition (49) is equivalent to the existence of functions $C, F_1, \ldots, F_n$ on $X$ such that

$$F(p(x; \theta)) = C(x) + \sum_{i=1}^{n} \theta^i F_i(x). \tag{50}$$

Theorem 3. For any embedding $F$, $\tilde{P}(X)$ is an $F$-affine manifold for finite $X$.

Proof. Let $X = \{x_1, \ldots, x_n\}$ be a finite set of $n$ elements. Let $F_i : X \to \mathbb{R}$ be the functions defined by $F_i(x_j) = \delta_{ij}$ for $i, j = 1, \ldots, n$. Let us define $n$ coordinates $[\theta^i]$ by

$$\theta^i = F(p(x_i)). \tag{51}$$

Then we get $F(p(x)) = \sum_{i=1}^{n} \theta^i F_i(x)$. Therefore $\tilde{P}(X)$ is an $F$-affine manifold for any embedding $F(p)$.

Remark 2. Zhang [13] introduced the $\rho$-representation, which is a generalization of the $\alpha$-representation of Amari. Zhang's geometry is defined using this $\rho$-representation together with a convex function. Zhang also defined the $\rho$-affine family of density functions and discussed its dually flat structure. The $F$-geometry defined using a general $F$-representation is different from Zhang's geometry. The metric defined in the $F$-embedding approach is the Fisher information metric, whereas the Riemannian metric defined using the $\rho$-representation is different from the Fisher information metric. The $F$-connections defined here are not in general dually flat and are different from the dual connections defined by Zhang.

Remark 3.
On a statistical manifold $S$, we introduced a dualistic structure $(g, \nabla^{F}, \nabla^{H})$, where $g$ is the Fisher information metric and $\nabla^{F}, \nabla^{H}$ are dual connections with respect to the Fisher information metric. Since $F$-connections are symmetric, the manifold $S$ is flat with respect to $\nabla^{F}$ iff $S$ is flat with respect to $\nabla^{H}$. Thus if $S$ is flat with respect to $\nabla^{F}$, then $(S, g, \nabla^{F}, \nabla^{H})$ is a dually flat space. Dually flat spaces are important in statistical estimation [4].

4. Invariance of the Geometric Structures

For the statistical manifold $S = \{p(x; \xi) \mid \xi \in E \subseteq \mathbb{R}^n\}$, the parameters are merely labels attached to each point $p \in S$; hence the intrinsic geometric properties should be independent of these labels. Consequently, it is natural to consider the invariance properties of the geometric structures under suitable transformations of the variables in a statistical manifold. Here we consider two kinds of invariance of the geometric structures: covariance under re-parametrization of the parameter of the manifold, and invariance under transformations of the random variable [15]. Now let us investigate the invariance properties of the $F$-geometric structures defined in Section 2.

4.1. Covariance under Re-Parametrization

Let $[\theta^i]$ and $[\eta^j]$ be two coordinate systems on $S$, which are related by an invertible transformation $\eta = \eta(\theta)$. Let us denote $\partial_i = \frac{\partial}{\partial\theta^i}$ and $\tilde{\partial}_j = \frac{\partial}{\partial\eta^j}$. Let the coordinate expressions of the metric $g$ be given by $g_{ij} = \langle \partial_i, \partial_j\rangle$ and $\tilde{g}_{ij} = \langle \tilde{\partial}_i, \tilde{\partial}_j\rangle$. Let the components of the connection $\nabla$ with respect to the coordinates $[\theta^i]$ and $[\eta^j]$ be given by $\Gamma_{ijk}$ and $\tilde{\Gamma}_{ijk}$ respectively. Then the covariance of the metric $g$ and the connection $\nabla$ under the re-parametrization means that

$$\tilde{g}_{ij} = \sum_{m}\sum_{n} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\,g_{mn} \tag{52}$$

$$\tilde{\Gamma}_{ijk} = \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\frac{\partial\theta^h}{\partial\eta^k}\,\Gamma_{mnh} + \sum_{m,h} \frac{\partial\theta^h}{\partial\eta^k}\frac{\partial^2\theta^m}{\partial\eta^i\partial\eta^j}\,g_{mh}. \tag{53}$$

Lemma 2. The Fisher information metric $g$ is covariant under re-parametrization.

Proof.
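The components of the Fisher information metric with respect to the coordinate system $[\theta^i]$ are given by

$$g_{ij}(\theta) = \langle \partial_i, \partial_j\rangle_\theta = \int \partial_i p(x; \theta)\,\partial_j p(x; \theta)\,\frac{1}{p(x; \theta)}\,dx. \tag{54}$$

Let $\tilde{p}(x; \eta) = p(x; \theta(\eta))$ and write $\tilde{\partial}_i = \partial/\partial\eta^i$. Then the components of the Fisher information metric with respect to the coordinate system $[\eta^j]$ are given by

$$\tilde{g}_{ij}(\eta) = \langle \tilde{\partial}_i, \tilde{\partial}_j\rangle_\eta = \int \tilde{\partial}_i\tilde{p}(x; \eta)\,\tilde{\partial}_j\tilde{p}(x; \eta)\,\frac{1}{\tilde{p}(x; \eta)}\,dx. \tag{55}$$

Since

$$\tilde{\partial}_i\tilde{p}(x; \eta) = \sum_{m} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial p(x; \theta(\eta))}{\partial\theta^m}, \tag{56}$$

we can write

$$\begin{aligned}
\tilde{g}_{ij}(\eta) &= \int \tilde{\partial}_i\tilde{p}(x; \eta)\,\tilde{\partial}_j\tilde{p}(x; \eta)\,\frac{1}{\tilde{p}(x; \eta)}\,dx \\
&= \int \sum_{m} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial p(x; \theta)}{\partial\theta^m} \sum_{n} \frac{\partial\theta^n}{\partial\eta^j}\frac{\partial p(x; \theta)}{\partial\theta^n}\,\frac{1}{p(x; \theta)}\,dx \\
&= \sum_{m}\sum_{n} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j} \int \partial_m p(x; \theta)\,\partial_n p(x; \theta)\,\frac{1}{p(x; \theta)}\,dx \\
&= \left[\sum_{m}\sum_{n} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\,g_{mn}(\theta)\right]_{\theta = \theta(\eta)}.
\end{aligned} \tag{57}$$

Lemma 3. The $F$-connection $\nabla^{F}$ is covariant under re-parametrization.

Proof. Let the components of $\nabla^{F}$ with respect to the coordinates $[\theta^i]$ and $[\eta^j]$ be given by $\Gamma_{ijk}$ and $\tilde{\Gamma}_{ijk}$ respectively. Let $\tilde{p}(x; \eta) = p(x; \theta(\eta))$. Let us denote $\log p(x; \theta)$ by $\ell(x; \theta)$ and $\log \tilde{p}(x; \eta)$ by $\tilde{\ell}(x; \eta)$. The components of the $F$-connection $\nabla^{F}$ with respect to the coordinate system $[\theta^i]$ are given by

$$\Gamma_{ijk} = \int \left[\partial_i\partial_j\ell(x; \theta) + \left(1 + \frac{pF''(p)}{F'(p)}\right)\partial_i\ell(x; \theta)\,\partial_j\ell(x; \theta)\right]\partial_k\ell(x; \theta)\,p(x; \theta)\,dx. \tag{58}$$

The components of $\nabla^{F}$ with respect to the coordinate system $[\eta^j]$ are given by

$$\tilde{\Gamma}_{ijk} = \int \left[\tilde{\partial}_i\tilde{\partial}_j\tilde{\ell}(x; \eta) + \left(1 + \frac{\tilde{p}F''(\tilde{p})}{F'(\tilde{p})}\right)\tilde{\partial}_i\tilde{\ell}(x; \eta)\,\tilde{\partial}_j\tilde{\ell}(x; \eta)\right]\tilde{\partial}_k\tilde{\ell}(x; \eta)\,\tilde{p}(x; \eta)\,dx. \tag{59}$$

We can write

$$\tilde{\partial}_i\tilde{\ell}(x; \eta) = \sum_{m} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}. \tag{60}$$

Then

$$\tilde{\partial}_i\tilde{\partial}_j\tilde{\ell}(x; \eta) = \sum_{m,n} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\frac{\partial^2\ell(x; \theta(\eta))}{\partial\theta^m\partial\theta^n} + \sum_{m} \frac{\partial^2\theta^m}{\partial\eta^i\partial\eta^j}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m} \tag{61}$$

$$\tilde{\partial}_i\tilde{\ell}(x; \eta)\,\tilde{\partial}_j\tilde{\ell}(x; \eta) = \sum_{m,n} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^n} \tag{62}$$

$$\tilde{\partial}_k\tilde{\ell}(x; \eta) = \sum_{h} \frac{\partial\theta^h}{\partial\eta^k}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^h}. \tag{63}$$

Hence we get

$$\begin{aligned}
\tilde{\Gamma}_{ijk} ={}& \int \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\frac{\partial\theta^h}{\partial\eta^k}\frac{\partial^2\ell(x; \theta(\eta))}{\partial\theta^m\partial\theta^n}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^h}\,p(x; \theta(\eta))\,dx \\
&+ \int \sum_{m,h} \frac{\partial^2\theta^m}{\partial\eta^i\partial\eta^j}\frac{\partial\theta^h}{\partial\eta^k}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^h}\,p(x; \theta(\eta))\,dx \\
&+ \int \left(1 + \frac{pF''(p)}{F'(p)}\right) \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\frac{\partial\theta^h}{\partial\eta^k}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^n}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^h}\,p(x; \theta(\eta))\,dx \\
={}& \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\frac{\partial\theta^h}{\partial\eta^k} \int \frac{\partial^2\ell(x; \theta(\eta))}{\partial\theta^m\partial\theta^n}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^h}\,p(x; \theta(\eta))\,dx \\
&+ \sum_{m,h} \frac{\partial^2\theta^m}{\partial\eta^i\partial\eta^j}\frac{\partial\theta^h}{\partial\eta^k} \int \frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^h}\,p(x; \theta(\eta))\,dx \\
&+ \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\frac{\partial\theta^h}{\partial\eta^k} \int \left(1 + \frac{pF''(p)}{F'(p)}\right)\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^m}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^n}\frac{\partial\ell(x; \theta(\eta))}{\partial\theta^h}\,p(x; \theta(\eta))\,dx \\
={}& \sum_{m,n,h} \frac{\partial\theta^m}{\partial\eta^i}\frac{\partial\theta^n}{\partial\eta^j}\frac{\partial\theta^h}{\partial\eta^k}\,\Gamma_{mnh} + \sum_{m,h} \frac{\partial\theta^h}{\partial\eta^k}\frac{\partial^2\theta^m}{\partial\eta^i\partial\eta^j}\,g_{mh}.
\end{aligned} \tag{64}$$

Hence we have shown that $F$-connections are covariant under re-parametrization of the parameter.

Covariance under re-parametrization means that the metric and connections are coordinate independent. Hence the $F$-geometry is coordinate independent.

4.2. Invariance Under the Transformation of the Random Variable

Amari and Nagaoka [4] defined the invariance of the Riemannian metric and connections on a statistical manifold under a transformation of the random variable as follows.

Definition 6. Let $S = \{p(x; \xi) \mid \xi \in E \subseteq \mathbb{R}^n\}$ be a statistical manifold defined on a sample space $X$. Let $x, y$ be random variables defined on sample spaces $X, Y$ respectively and let $\phi$ be a transformation of $x$ to $y$. Assume that this transformation induces a model $S' = \{q(y; \xi) \mid \xi \in E \subseteq \mathbb{R}^n\}$ on $Y$. Let $\lambda : S \to S'$ be a diffeomorphism defined as

$$\lambda(p_\xi) = q_\xi. \tag{65}$$

Let $g = \langle\cdot,\cdot\rangle$ and $g' = \langle\cdot,\cdot\rangle'$ be two Riemannian metrics defined on $S$ and $S'$ respectively. Let $\nabla, \nabla'$ be two affine connections on $S$ and $S'$ respectively. Then the invariance properties are given by

$$\langle X, Y\rangle_p = \langle \lambda_*(X), \lambda_*(Y)\rangle'_{\lambda(p)} \quad \forall\, X, Y \in T_p(S) \tag{66}$$

$$\lambda_*(\nabla_X Y) = \nabla'_{\lambda_*(X)}\lambda_*(Y) \tag{67}$$

where $\lambda_*$ is the push-forward map associated with the map $\lambda$, which is defined by

$$\lambda_*(X)_{\lambda(p)} = (d\lambda)_p(X). \tag{68}$$

Now we discuss the invariance properties of the $F$-geometry under suitable transformations of the random variable. Let us restrict ourselves to the case of smooth one-to-one transformations of the random variable, which are statistically interesting. Amari and Nagaoka [4] mentioned a transformation, the sufficient statistic of the parameter of the statistical model, which is widely used in statistical estimation.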
In fact, the one-to-one transformations of the random variable are trivial examples of sufficient statistics. Consider a statistical manifold $S = \{p(x; \xi) \mid \xi \in E \subseteq \mathbb{R}^n\}$ defined on a sample space $X$. Let $\phi$ be a smooth one-to-one transformation of the random variable $x$ to $y$. Then the density function $q(y; \xi)$ of the induced model $S'$ takes the form

$$q(y; \xi) = p(w(y); \xi)\,w'(y) \tag{69}$$

where $w$ is a function such that $x = w(y)$ and $\phi'(x) = \frac{1}{w'(\phi(x))}$. Let us denote $\log q(y; \xi)$ by $\ell(q_y)$ and $\log p(x; \xi)$ by $\ell(p_x)$.

Lemma 4. The Fisher information metric and Amari's $\alpha$-connections are invariant under smooth one-to-one transformations of the random variable.

Proof. Let $\phi$ be a smooth one-to-one transformation of the random variable $x$ to $y$. From Equation (69),

$$p(x; \xi) = q(\phi(x); \xi)\,\phi'(x) \tag{70}$$

$$\partial_i\ell(q_y) = \partial_i\ell(p_{w(y)}) \tag{71}$$

$$\partial_i\ell(q_{\phi(x)}) = \partial_i\ell(p_x). \tag{72}$$

The Fisher information metric $g'$ on the induced manifold $S'$ is given by

$$\begin{aligned}
g'_{ij}(q_\xi) &= \int_Y \partial_i\ell(q_y)\,\partial_j\ell(q_y)\,q(y; \xi)\,dy \\
&= \int_X \partial_i\ell(q_{\phi(x)})\,\partial_j\ell(q_{\phi(x)})\,q(\phi(x); \xi)\,\phi'(x)\,dx \\
&= \int_X \partial_i\ell(p_x)\,\partial_j\ell(p_x)\,p(x; \xi)\,dx = g_{ij}(p_\xi),
\end{aligned} \tag{73}$$

which is the Fisher information metric on $S$. The components of Amari's $\alpha$-connections on the induced manifold $S'$ are given by

$$\begin{aligned}
\Gamma'^{\,\alpha}_{ijk}(q_\xi) ={}& \int_Y \partial_i\partial_j\ell(q_y)\,\partial_k\ell(q_y)\,q(y; \xi)\,dy + \int_Y \frac{1-\alpha}{2}\,\partial_i\ell(q_y)\,\partial_j\ell(q_y)\,\partial_k\ell(q_y)\,q(y; \xi)\,dy \\
={}& \int_X \partial_i\partial_j\ell(q_{\phi(x)})\,\partial_k\ell(q_{\phi(x)})\,q(\phi(x); \xi)\,\phi'(x)\,dx \\
&+ \int_X \frac{1-\alpha}{2}\,\partial_i\ell(q_{\phi(x)})\,\partial_j\ell(q_{\phi(x)})\,\partial_k\ell(q_{\phi(x)})\,q(\phi(x); \xi)\,\phi'(x)\,dx \\
={}& \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx + \int_X \frac{1-\alpha}{2}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx \\
={}& \Gamma^{\alpha}_{ijk}(p_\xi),
\end{aligned} \tag{74}$$

which are the components of Amari's $\alpha$-connections on the manifold $S$. Thus the Fisher information metric and Amari's $\alpha$-connections are invariant under smooth one-to-one transformations of the random variable.

Now we prove that $\alpha$-connections are the only $F$-connections that are invariant under smooth one-to-one transformations of the random variable.

Theorem 4.
Amari's $\alpha$-connections are the only $F$-connections that are invariant under smooth one-to-one transformations of the random variable.

Proof. Let $\phi$ be a smooth one-to-one transformation of the random variable $x$ to $y$. The components of the $F$-connection of the induced manifold $S'$ are

$$\begin{aligned}
\Gamma'^{\,F}_{ijk}(q_\xi) ={}& \int_Y \left[\partial_i\partial_j\ell(q_y) + \left(1 + \frac{qF''(q)}{F'(q)}\right)\partial_i\ell(q_y)\,\partial_j\ell(q_y)\right]\partial_k\ell(q_y)\,q(y; \xi)\,dy \\
={}& \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx \\
&+ \int_X \left(1 + \frac{q(\phi(x); \xi)\,F''(q(\phi(x); \xi))}{F'(q(\phi(x); \xi))}\right)\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx,
\end{aligned} \tag{75}$$

and the components of the $F$-connection of the manifold $S$ are

$$\Gamma^{F}_{ijk}(p_\xi) = \int_X \partial_i\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx + \int_X \left(1 + \frac{p(x; \xi)\,F''(p(x; \xi))}{F'(p(x; \xi))}\right)\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx. \tag{76}$$

Then by equating the components $\Gamma'^{\,F}_{ijk}(q_\xi)$ and $\Gamma^{F}_{ijk}(p_\xi)$ of the $F$-connection, we get

$$\int \frac{q(\phi(x); \xi)\,F''(q(\phi(x); \xi))}{F'(q(\phi(x); \xi))}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx = \int \frac{p(x; \xi)\,F''(p(x; \xi))}{F'(p(x; \xi))}\,\partial_i\ell(p_x)\,\partial_j\ell(p_x)\,\partial_k\ell(p_x)\,p(x; \xi)\,dx. \tag{77}$$

Then it follows that the condition for the $F$-connection to be invariant under the transformation $\phi$ is given by

$$\frac{pF''(p)}{F'(p)} = k, \tag{78}$$

where $k$ is a real constant. Hence it follows from Euler's homogeneous function theorem that the function $F'$ is a positive homogeneous function in $p$ of degree $k$. So

$$F'(\lambda p) = \lambda^{k} F'(p) \quad \text{for } \lambda > 0. \tag{79}$$

Since $F'$ is a positive homogeneous function in the single variable $p$, without loss of generality we can take

$$F'(p) = p^{k}. \tag{80}$$

Therefore

$$F(p) = \begin{cases} \dfrac{p^{k+1}}{k+1} & k \neq -1 \\[2mm] \log p & k = -1. \end{cases} \tag{81}$$

Letting

$$k = -\frac{1+\alpha}{2}, \quad \alpha \in \mathbb{R}, \tag{82}$$

we get

$$F(p) = \begin{cases} \dfrac{2}{1-\alpha}\,p^{\frac{1-\alpha}{2}} & \alpha \neq 1 \\[2mm] \log p & \alpha = 1, \end{cases} \tag{83}$$

which is nothing but Amari's $\alpha$-embedding $L_\alpha(p)$. Hence Amari's $\alpha$-connections are the only $F$-connections that are invariant under smooth one-to-one transformations of the random variable.

Remark 4. In Section 2, we defined $(F, G)$-connections using a general embedding function $F$ and a positive smooth function $G$.
We can show that (F, G)-connection is invariant under smooth one-to-one transformation of the random variable when G(p) = c, where c is a real constant and F(p) = Lα(p) (proof is similar to that of Theorem 4). The notion of (F, G)−metric and (F, G)−connection provides a more general way of introducing geometric structures on a manifold. We were able to show that the Fisher information metric (up to a constant) and Amari’s α−connections are the only metric and connections belonging to this class that are invariant under both the transformation of the parameter and the one-to-one transformation of the random variable. 5. Conclusions The Fisher information metric and Amari’s α−connections are widely used in the theory of information geometry and have an important role in the theory of statistical estimation. Amari’s α−connections are defined using a one parameter family of functions, the α−embeddings. We generalized this idea to introduce geometric structures on a statistical manifold S. We considered a general embedding function F of S into RX and obtained a geometric structure on S called the F−geometry. Amari’s α−geometry is a special case of F−geometry. A more general way of defining Riemannian metrics and affine connections on a statistical manifold S is given using a positive continuous function G and the embedding F. Amari’s α−geometry is the only F−geometry that is invariant under both the transformation of the parameter and the random variable or equivalently under the sufficient statistic. We can relax the condition of invariance under the sufficient statistic and can consider other statistically significant transformations as well, which then gives an F−geometry other than α−geometry that is invariant under these statistically significant transformations. We believe that the idea of F−geometry can be used in the further development of the geometric theory of q-exponential families. We look forward to studying these problems in detail later. 
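As a numerical companion to Theorem 1 and to condition (78) of Theorem 4, the following Python sketch estimates the ratio $pF''(p)/F'(p)$ by central finite differences. It is only an illustration and not part of the original derivation: the helper names `alpha_embedding`, `elasticity` and `elasticity_values`, the step size `h`, the evaluation points, and the comparison embedding $F(p) = p + p^2$ are all assumptions made here. For an $\alpha$-embedding the ratio should be close to the constant $k = -(1+\alpha)/2$ of Equation (82) at every $p$, whereas for the comparison embedding it varies with $p$.

```python
import math

def alpha_embedding(alpha):
    """Amari's alpha-embedding L_alpha in the form of Equation (83)."""
    if alpha == 1:
        return math.log
    c = (1.0 - alpha) / 2.0
    return lambda p: p ** c / c          # equals 2/(1-alpha) * p^((1-alpha)/2)

def elasticity(F, p, h=1e-4):
    """Central finite-difference estimate of p F''(p) / F'(p)."""
    d1 = (F(p + h) - F(p - h)) / (2.0 * h)
    d2 = (F(p + h) - 2.0 * F(p) + F(p - h)) / (h * h)
    return p * d2 / d1

def elasticity_values(F, points=(0.3, 1.0, 2.5)):
    """The ratio of Equation (78) evaluated at a few illustrative values of p."""
    return [elasticity(F, p) for p in points]

for alpha in (-1.0, 0.0, 1.0, 2.0):
    k = -(1.0 + alpha) / 2.0             # Equation (82)
    values = elasticity_values(alpha_embedding(alpha))
    print(alpha, k, [round(v, 4) for v in values])

# A hypothetical embedding outside the alpha-family: the ratio depends on p.
print([round(v, 4) for v in elasticity_values(lambda p: p + p * p)])
```

Finite differences are used so that the check does not rely on the closed-form derivatives (28) and (29); a symbolic computation would serve equally well.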
Acknowledgments: We are extremely thankful to Shun-ichi Amari for reading this article and encouraging our learning process. We would like to thank the reviewer who mentioned the references [13,16] that are of great importance in our future work.

Author Contributions: The authors contributed equally to the presented mathematical framework and the writing of the paper.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Amari, S. Differential geometry of curved exponential families-curvature and information loss. Ann. Statist. 1982, 10, 357–385.
2. Amari, S. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics, Volume 28; Springer-Verlag: New York, NY, USA, 1985.
3. Amari, S.; Kumon, M. Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Math. 1983, 35, 1–24.
4. Amari, S.; Nagaoka, H. Methods of Information Geometry, Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 2000.
5. Barndorff-Nielsen, O.E.; Cox, D.R.; Reid, N. The role of differential geometry in statistical theory. Internat. Statist. Rev. 1986, 54, 83–96.
6. Dawid, A.P. A Discussion to Efron's paper. Ann. Statist. 1975, 3, 1231–1234.
7. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Statist. 1975, 3, 1189–1242.
8. Efron, B. The geometry of exponential families. Ann. Statist. 1978, 6, 362–376.
9. Murray, M.K.; Rice, R.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1995.
10. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta. Math. Soc. 1945, 37, 81–91.
11. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; Translated into English, Translations of Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 1982.
12. Corcuera, J.M.; Giummole, F. A characterization of monotone and regular divergences. Ann. Inst. Statist. Math. 1998, 50, 433–450.
13. Zhang, J. Divergence function, duality and convex analysis. Neur. Comput. 2004, 16, 159–195.
14. Burbea, J. Informative geometry of probability spaces. Expo Math. 1986, 4, 347–378.
15. Wagenaar, D.A. Information Geometry for Neural Networks. Available online: http://www.danielwagenaar.net/res/papers/98-Wage2.pdf (accessed on 13 December 2013).
16. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually flat and conformal geometries. Physica A 2012, 391, 4308–4319.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

Computational Information Geometry in Statistics: Theory and Practice

Frank Critchley 1 and Paul Marriott 2,*

1 Department of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes, Buckinghamshire MK7 6AA, UK; E-Mail: f.critchley@open.ac.uk
2 Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada
* E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.

Received: 27 March 2014; in revised form: 25 April 2014 / Accepted: 29 April 2014 / Published: 2 May 2014

Abstract: A broad view of the nature and potential of computational information geometry in statistics is offered. This new area suitably extends the manifold-based approach of classical information geometry to a simplicial setting, in order to obtain an operational universal model space. Additional underlying theory and illustrative real examples are presented. In the infinite-dimensional case, challenges inherent in this ambitious overall agenda are highlighted and promising new methodologies indicated.
Keywords: information geometry; computational geometry; statistical foundations

1. Introduction

The application of geometry to statistical theory and practice has seen a number of different approaches developed. One of the most important can be defined as starting with Efron's seminal paper [?] on statistical curvature and subsequent landmark references, including the book by Kass and Vos [?]. This approach, a major part of which has been called information geometry, continues today, a primary focus being invariant higher-order asymptotic expansions obtained through the use of differential geometry. A somewhat representative example of the type of result it generates is taken from [?], where the notation is defined:

Example 1. The bias correction of a first-order efficient estimator, $\hat{\beta}$, is defined by:

$$b^{a}(\beta) = -\frac{1}{2n}\,g^{aa'}\left(g^{bc}\,\Gamma^{(-1)}_{a'bc} + g^{\kappa\lambda}\,h^{(-1)}_{\kappa\lambda a'}\right),$$

and has the property that if $\hat{\beta}^{*} := \hat{\beta} - b(\beta)$ then:

$$E_{\beta}(\hat{\beta}^{*} - \beta) = O(n^{-3/2}).$$

The strengths usually claimed of such a result are that, for a worker fluent in the language of information geometry, it is explicit, insightful as to the underlying structure and of clear utility in statistical practice. We agree entirely. However, the overwhelming evidence of the literature is that, while the benefits of such inferential improvements are widely acknowledged in principle, in practice, the overhead of first becoming fluent in information geometry prevents their routine use. As a result, a great number of powerful results of practical importance lie severely underused, locked away behind notational and conceptual bars.

This paper proposes that this problem can be addressed computationally by the development of what we call computational information geometry.
This gives a mathematical and numerical computational framework in which the results of information geometry can be encoded as "black-box" numerical algorithms, allowing direct access to their power. Essentially, this works by exploiting the structural properties of information geometry, which are such that all formulae can be expressed in terms of four fundamental building blocks: defined and detailed in Amari [?], these are the +1 and −1 geometries, the way that these are connected via the Fisher information and the foundational duality theorem. Additionally, computational information geometry enables a range of methodologies and insights impossible without it; notably, those deriving from the operational, universal model space, which it affords; see, for example, [? ? ?].

Entropy 2014, 16, 2454–2471; doi:10.3390/e16052454 www.mdpi.com/journal/entropy

The paper is structured as follows. Section 2 looks at the case of distributions on a finite number of categories where the extended multinomial family provides an exhaustive model underlying the corresponding information geometry. Since the aim is to produce a computational theory, a finite representation is the ultimate aim, making the results of this section of central importance. The paper also emphasises how the simplicial structures introduced here are foundational to a theory of computational information geometry. Being intrinsically constructive, a simplicial approach is useful both theoretically and computationally. Section 3 looks at how simplicial structures, defined for finite dimensions, can be extended to the infinite dimensional case.

2. Finite Discrete Case

2.1. Introduction

This section shows how the results of classical information geometry can be applied in a purely computational way.
We emphasise that the framework developed here can be implemented in a purely algorithmic way, allowing direct access to a powerful information geometric theory of practical importance. The key tool, as explained in [?], is the simplex:

$$\Delta^{k} := \left\{\pi = (\pi_0, \pi_1, \ldots, \pi_k)^{\top} : \pi_i \geq 0,\ \sum_{i=0}^{k} \pi_i = 1\right\}, \tag{1}$$

with a label associated with each vertex. Here, $k$ is chosen to be sufficiently large, so that any statistical model (by which we mean a sample space, a set of probability distributions and selected inference problem) can be embedded. The embedding is done in such a way that all the building blocks of information geometry (i.e., manifold, affine connections and metric tensor) can be numerically computed explicitly. Within such a simplex, we can embed a large class of regular exponential families; see [?] for details. This class includes exponential family random graph models, logistic regression, log-linear and other models for categorical data analysis. Furthermore, the multinomial family on $k + 1$ categories is naturally identified with the relative interior of this space, $\mathrm{int}(\Delta^{k})$, while the extended family, Equation (??), is a union of distributions with different support sets.

This paper builds on the theory of information geometry following that introduced by [?] via the affine space construction introduced by [?] and extended by [?]. Since this paper concentrates on categorical random variables, the following definitions are appropriate. Consider a finite set of disjoint categories or bins $B = \{B_i\}_{i \in A}$. Any distribution over this finite set of categories is defined by a set, $\{\pi_i\}_{i \in A}$, which defines the corresponding probabilities. With "mix" connoting mixtures of distributions, we have:

Definition 1. The −1-affine space structure over distributions on $B := \{B_i\}_{i \in A}$ is $(X_{mix}, V_{mix}, +)$ where:

$$X_{mix} = \left\{\{x_i\}_{i \in A} \;\middle|\; \sum_{i \in A} x_i = 1\right\}, \qquad V_{mix} = \left\{\{v_i\}_{i \in A} \;\middle|\; \sum_{i \in A} v_i = 0\right\}$$

and the addition operator, +, is the usual addition of sequences.
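Definition 1 can be made concrete with a few lines of Python. The sketch below is illustrative only: the function names (`is_point`, `is_vector`, `translate`, `difference`, `in_simplex`, `mix`) and the two example distributions are assumptions introduced here, not notation from the paper. It represents points of $X_{mix}$ and vectors of $V_{mix}$ as lists of exact rationals, and shows that the simplex of genuine distributions is a convex subset of the affine space: a mixture of two distributions stays in the simplex, while extrapolating along the same −1 line stays in $X_{mix}$ but may leave the simplex.

```python
from fractions import Fraction

def is_point(x):
    """True if x lies in X_mix: its components sum to one (signs are unrestricted)."""
    return sum(x) == 1

def is_vector(v):
    """True if v lies in V_mix: its components sum to zero."""
    return sum(v) == 0

def translate(x, v):
    """Affine action of V_mix on X_mix: componentwise addition of sequences."""
    return [xi + vi for xi, vi in zip(x, v)]

def difference(x, y):
    """The unique vector v in V_mix with translate(x, v) == y."""
    return [yi - xi for xi, yi in zip(x, y)]

def in_simplex(x):
    """True if x is a genuine distribution: a point of X_mix with no negative entry."""
    return is_point(x) and all(xi >= 0 for xi in x)

def mix(p, q, t):
    """Point p + t (q - p) on the -1 (mixture) line through p and q."""
    return translate(p, [t * d for d in difference(p, q)])

# Two distributions on three bins; exact rationals avoid rounding issues.
p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
q = [Fraction(0), Fraction(1, 3), Fraction(2, 3)]

inside = mix(p, q, Fraction(1, 2))   # a mixture of p and q
outside = mix(p, q, Fraction(2))     # extrapolation beyond q

print(in_simplex(inside))                      # convex combinations stay in the simplex
print(is_point(outside), in_simplex(outside))  # still in X_mix, but not a distribution
```

Note that `q` has a zero entry, so it lies on the boundary of the simplex; the −1 structure treats boundary and interior points uniformly, which is one reason it suits the extended multinomial setting.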
In Definition ??, the space of (discretised) distributions is a −1-convex subspace of the affine space, $(X_{mix}, V_{mix}, +)$. A similar affine structure for the +1-geometry, once the support has been fixed, can be derived from the definitions in [?].

2.2. Examples

Examples ?? and ?? are used for illustration. The second of these is a moderately high dimensional family, where the way that the boundaries of the simplex are attached to the model is of great importance for the behaviour of the likelihood and of the maximum likelihood estimate. In general, working in a simplex, boundary effects mean that standard first order asymptotic results can fail, while the much more flexible higher order methods can be very effective. The other example is a continuous curved exponential family, where both higher order asymptotic sampling theory results and geometrically-based dimension reduction are described.

Example 2. The paper [?] models survival times for leukaemia patients. These times, recorded in days, start at the time of diagnosis, and there are 43 observations; see [?] for details. We further assume that the data is censored at a fixed value. It was observed that a censored exponential distribution gives a reasonable, but not exact, fit. As discussed in [?], this gives a one-dimensional curved exponential family inside a two-dimensional regular exponential family of the form:

$$\exp\left\{\lambda_1 x + \lambda_2 y - \log\left(\frac{1}{\lambda_2}\left(e^{\lambda_2 t} - 1\right) + e^{\lambda_1 + \lambda_2 t}\right)\right\}, \tag{2}$$

where $y = \min(z, t)$ and $x = I(z \geq t)$, and the embedding map is given by $(\lambda_1(\theta), \lambda_2(\theta)) = (-\log\theta, -\theta)$.

As shown in [?], the loss due to discretisation can be made arbitrarily small for all information geometry objects. Thus, for example, using this computational approach, it is straightforward to compute the bias correction described in Example ??.
Each of the terms in the asymptotic bias, i.e., the metric, $g_{ij}$, its inverse, $g^{ij}$, the Christoffel symbols, $\Gamma^{(-1)}_{ijk}$, and curvature term, $h^{(-1)}$, can be directly numerically coded as appropriate finite difference approximations to derivatives. Thus, "black-box" code can directly calculate the numerical value of the asymptotic bias, and this numerical value can then be used by those who are not familiar with information geometry. For example, this calculation establishes the fact that, with this particular data set, the sample size is such that the bias is inferentially unimportant.

[Figure 1: graph with nodes labelled 1, 2, 3, 4.]

Figure 1. Undirected graphical model showing the cyclic graph of order four.

Example 3. The paper [?] discusses an undirected graphical model based on the cyclic graph of order four, shown in Figure ??, with binary random variables at each node. Without any constraints, there are 16 possible values for the graph, so model space can be thought of as a 15-dimensional simplex, including the relative boundary. However, the conditional independence relations encoded by the graph impose linear constraints in the natural parameters of the exponential family. Thus, the resultant model is a lower dimensional full exponential family and its closure. As described in [?], the four cycle model is a seven dimensional exponential family, which is a +1-affine subspace of the +1-affine structure of the 15-dimensional simplex. The model can be written in the form:

$$\left(\frac{\pi_i \exp\left\{\sum_{h=1}^{8} \eta_h v_{hi}\right\}}{\sum_{j=0}^{15} \pi_j \exp\left\{\sum_{h=1}^{8} \eta_h v_{hj}\right\}}\right)_{i=0}^{15} \tag{3}$$

for a given set of linearly independent vectors $\{v_h\}_{h=1}^{8}$. The existence of the maximum likelihood estimate for $\eta = (\eta_h)$ will depend on how the limit points of Model (??) meet the observed face of $\Delta^{15}$; that is, the span of the vertices (bins) having positive counts.
Thus, a key computational task is to learn how a full exponential family, defined by a representation of the form of (??), is attached to boundary sub-simplices of the high-dimensional embedding simplex.

In order to visualise the geometric aspects of this problem, consider a lower dimensional version. Define a two-dimensional full exponential family by the vectors $v_1 = (1, 2, 3, 4)$, $v_2 = (1, 4, 9, -1)$ and the uniform distribution base point, $\pi_i$, embedded in the three-dimensional simplex. The two-dimensional family is defined by the +1-affine space through $(0.25, 0.25, 0.25, 0.25)$ spanned by the space of vectors of the form:

$$\alpha(1, 2, 3, 4) + \beta(1, 4, 9, -1) = (\alpha + \beta,\ 2\alpha + 4\beta,\ 3\alpha + 9\beta,\ 4\alpha - \beta).$$

Consider directions from the origin obtained by writing $\alpha = \theta\beta$, giving, for each $\theta$, a one-dimensional, full exponential family parameterized by $\beta$ in the direction $\beta(\theta + 1,\ 2\theta + 4,\ 3\theta + 9,\ 4\theta - 1)$. The aspect of this vector, which determines the connection to the boundary, is the rank order of its elements. For example, suppose the first component was the maximum and the last the minimum. Then, as $\beta \to \pm\infty$, this one-dimensional family will be connected to the first and fourth vertex of the embedding four simplex, respectively. Note that changing the value of $\theta$ changes the rank structure, as illustrated in Figure ??. This plot shows the four element-wise linear functions of $\theta$ (dashed lines) and the salient overall feature of their rank order; that is, their upper and lower envelopes (solid lines). From this analysis of the envelopes of a set of linear functions, it can be seen that the function $2\theta + 4$ is redundant. The consequence of this is shown in Figure ??, which shows a direct computation of the two-dimensional family. It is clear that, indeed, only three of the four vertices have been connected by the model.

In general, the problem of finding the limit points in full exponential families inside simplex models is a problem of finding redundant linear constraints. As shown in [?], this can be converted, via convex duality, into the problem of finding extremal points in a finite dimensional affine space. In the four-cycle model, this technique can construct all sub-simplices containing limit points of the four-cycle model. For example, it can be shown that all of the 16 vertices are part of the boundary. Once the boundary points have been identified, necessary and sufficient conditions for the existence of the maximum likelihood estimate in the +1-parameters can easily be found computationally [?].

[Figure 2: plot titled "Envelope of linear functions"; axis labels and tick values not recoverable.]

Figure 2. The envelope of a set of linear functions. Functions, dashed lines; envelope, solid lines.

Figure 3. Attaching a two-dimensional example to the boundary of the simplex.

2.3. Tensor Analysis and Numerical Stability

One of the most powerful sets of results from classical information geometry is the way that geometrically-based tensor analysis is perfect for use in multi-dimensional higher order asymptotic analysis; see [?] or [?]. The tensorial formulation does, however, present a couple of problems in practice. For many, its very tight and efficient notational aspects can obscure rather than enlighten, while the resulting formulae tend to have a very large number of terms, making them rather cumbersome to work with explicitly. These are not problems at all for the computational approach described in this paper. Rather, the clarity of the tensorial approach is ideal for coding, where large numbers of additive terms, of course, are easy to deal with.

Two more fundamental issues, which the global geometric approach of this paper highlights, concern numerical stability. The ability to invert the Fisher information matrix is vital in most tensorial formulae, and so understanding its spectrum, discussed in Section ??, is vital.
Secondly, numerical underflow and overflow near boundaries require careful analysis, and so, understanding the way that models are attached to the boundaries of the extended multinomial models is equally important. The four-cycle model, to which we now return, illustrates computational information geometry doing this effectively.

Example 4. The multivariate Edgeworth approximation to the sampling distribution of part of the sufficient statistic for the four-cycle model is shown in Figure ??. Using the techniques described above, a point near the boundary of the 15-simplex has been selected as the data generation process. For illustration, we focus on the marginal distribution of two components of the sufficient statistic, though any number could have been chosen. The boundary forces constraints on the range of the sufficient statistics, shown by the dashed line in the plot. The points, jittered for clarity, show the distribution computed by simulation. It is typical that such boundary constraints prevent standard first order methods from performing well, but the greater flexibility of higher order methods can be seen to work well here. As discussed above, methods such as the multivariate Edgeworth expansion can be strongly exploited in a computational framework such as ours. Note that the discretization that can be observed in the figure is extensively discussed in [?].

Figure 4. Using the Edgeworth expansion near the boundary of four-cycle model.

2.4. Spectrum of Fisher Information

We focus now on the second numerical issue identified above. In any multinomial, the Fisher information matrix and its inverse are explicit. Indeed, the 0-geodesics and the corresponding geodesic distance are also explicit; see [?] or [?].
However, since the simplex glues together multinomial structures with different supports and the computational theory is in high dimensions, the Fisher information matrix can be arbitrarily close to being singular. It is therefore of central interest that the spectral decomposition of the Fisher information itself has a very nice structure, as shown below.

Example 5. Consider a multinomial distribution based on 81 equal width categories on [−5, 5], where the probability associated to a bin is proportional to that of the standard normal distribution for that bin. The Fisher information for this model is an 80 × 80 matrix, whose spectrum is shown in Figure ??. By inspection, it can be seen that there are exponentially small eigenvalues, so that while the matrix is positive definite, it is also extremely close to being singular. Furthermore, it can be seen that the spectrum has the shape of a half-normal density function and that the eigenvalues seem to come in pairs. These facts are direct consequences of the general results below.

With $\pi_{-0}$ denoting the vector of all bin probabilities, except $\pi_0$, we can write the Fisher information matrix (in the +1 form) as N times:

$I(\pi) := \mathrm{diag}(\pi_{-0}) - \pi_{-0}\pi_{-0}^T.$

This has an explicit spectral decomposition, which can be computed by using interlacing eigenvalue results (see, for example, [?], Chapter 4). In particular, if the diagonal matrices $\mathrm{diag}(\pi_1, \ldots, \pi_k)$ and $\mathrm{diag}(\lambda_1 I_{m_1} | \cdots | \lambda_g I_{m_g})$ agree up to a row-and-column permutation, where $g > 1$ and $\lambda_1 > \cdots > \lambda_g > 0$, then $I(\pi)$ has ordered spectrum:

$\lambda_1 > \tilde\lambda_1 > \cdots > \lambda_g > \tilde\lambda_g \ge 0, \quad (4)$

with $\tilde\lambda_g > 0 \iff \pi_0 > 0$, each $\lambda_i$ having multiplicity $m_i - 1$, while each $\tilde\lambda_i$ is simple.

Figure 5. Spectrum of the Fisher information matrix of a discretised normal distribution. (Axes: rank against eigenvalue.)

We give a complete account of the spectral decomposition (SpD) of $I(\pi)$.
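As a small numerical check on the ordered spectrum (4), the following sketch may help. It is not taken from the paper: it assumes a toy vector of three distinct bin probabilities (so every $m_i = 1$), and it locates the simple eigenvalues as zeros of the function $h(\tilde\lambda) = 1 + \sum_i \lambda_i^2/(\tilde\lambda - \lambda_i)$ that the text associates with the generic case, using plain bisection between consecutive $\lambda_i$. The function names and the example vector are illustrative assumptions only.

```python
def fisher_matrix(p):
    """Return diag(p) - p p^T as a list of rows (p excludes the bin-0 probability)."""
    k = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(k)]
            for i in range(k)]


def secular(mu, p):
    """h(mu) = 1 + sum_i p_i^2 / (mu - p_i); assumes all multiplicities are one."""
    return 1.0 + sum(q * q / (mu - q) for q in p)


def simple_eigenvalues(p, iters=200):
    """Bisect h on each gap (lambda_{i+1}, lambda_i), plus the gap (0, lambda_g).

    Assumes distinct positive entries with sum(p) < 1, so that h is strictly
    decreasing on each gap and changes sign exactly once there.
    """
    lam = sorted(p, reverse=True)
    edges = lam + [0.0]
    roots = []
    for hi, lo in zip(edges[:-1], edges[1:]):
        a, b = lo, hi
        for _ in range(iters):
            mid = 0.5 * (a + b)
            if not a < mid < b:       # interval exhausted in floating point
                break
            if secular(mid, lam) > 0.0:
                a = mid               # root lies above mid
            else:
                b = mid               # root lies at or below mid
        roots.append(0.5 * (a + b))
    return roots                      # ordered from largest to smallest


def residual(p, mu):
    """Max-norm of I(p) v - mu v for the candidate eigenvector v_j = p_j / (p_j - mu)."""
    v = [q / (q - mu) for q in p]
    iv = [sum(r * x for r, x in zip(row, v)) for row in fisher_matrix(p)]
    return max(abs(y - mu * x) for y, x in zip(iv, v))


p_minus0 = [0.4, 0.3, 0.2]            # toy example, so pi_0 = 0.1 > 0
eigs = simple_eigenvalues(p_minus0)
print(eigs)
print([residual(p_minus0, mu) for mu in eigs])
```

In this toy case the three roots interlace with 0.4, 0.3 and 0.2 in the manner of (4), and the smallest root is strictly positive because $\pi_0 > 0$. Bisection is used only for transparency; a library eigenvalue routine would normally be preferred for matrices of realistic size.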
There are four cases to consider, the last having the generic spectrum of (??). Without loss, after permutation, assume now $\pi_1 \ge \cdots \ge \pi_k$. The four cases are:

Case 1 For some $l < k$, the last $k - l$ elements of $\pi_{-0}$ vanish: the sub-case $l = 0 \iff \pi_0 = 1 \iff I(\pi) = 0$ is trivial. Otherwise, writing $\pi_+ = (\pi_1, \ldots, \pi_l)^T$ and $\Pi_+ = \mathrm{diag}(\pi_+)$, the SpD of:

$I(\pi) = \begin{pmatrix} \Pi_+ - \pi_+\pi_+^T & 0 \\ 0 & 0 \end{pmatrix}$

follows at once from that of $\Pi_+ - \pi_+\pi_+^T$, given below.

Case 2 $k = 1$: this case is trivial.

Case 3 $k > 1$, $\pi_{-0} = \lambda 1_k$, $\lambda > 0$: the SpD of $I(\pi)$ is:

$\lambda C_k + \lambda(1 - k\lambda) J_k$

where $C_k = I_k - J_k$ and $J_k = k^{-1} 1_k 1_k^T$. Here, $\lambda$ has multiplicity $k - 1$ and eigenspace $[\mathrm{Span}(1_k)]^\perp$, while $\tilde\lambda := \lambda(1 - k\lambda)$ has multiplicity one and eigenspace $\mathrm{Span}(1_k)$. In particular, since $1 - \pi_0 = k\lambda$, it follows that:

$I(\pi)$ is singular $\iff \pi_0 = 0$.

Case 4 $\pi_{-0} = (\lambda_1 1_{m_1}^T | \ldots | \lambda_g 1_{m_g}^T)^T$, $g > 1$ and $\lambda_1 > \cdots > \lambda_g > 0$: This is the generic case, having the spectrum of (??) above. Denoting by $O_m$ the zero matrix of order $m \times m$ and by $P(\nu)$ the rank one orthogonal projector onto $\mathrm{Span}(\nu)$, $(\nu \ne 0)$, the SpD is:

$\sum_{i=1,\, m_i>1}^{g} \lambda_i\, \mathrm{diag}(O_{m_{i-}}, C_{m_i}, O_{m_{i+}}) + \sum_{i=1}^{g} \tilde\lambda_i\, P\left( \left( \frac{\lambda_1}{\tilde\lambda_i - \lambda_1} 1_{m_1}^T, \ldots, \frac{\lambda_g}{\tilde\lambda_i - \lambda_g} 1_{m_g}^T \right)^T \right),$

where:

$m_{i-} = \sum\{m_j \mid j < i\}, \quad m_{i+} = \sum\{m_j \mid j > i\}$

and the $\tilde\lambda_i$ are the zeros of:

$h(\tilde\lambda) := 1 + \sum_{i=1}^{g} \frac{m_i\lambda_i^2}{\tilde\lambda - \lambda_i} = \left(1 - \sum_{i=1}^{g} m_i\lambda_i\right) + \tilde\lambda \left( \sum_{i=1}^{g} \frac{m_i\lambda_i}{\tilde\lambda - \lambda_i} \right).$

In particular, $\{\tilde\lambda_i : i = 1, \cdots, g\}$ are simple eigenvalues satisfying (??) while, whenever $m_i > 1$, $\lambda_i$ is also an eigenvalue having multiplicity $m_i - 1$. Further, expanding $\det(I(\pi))$, we again find:

$I(\pi)$ is singular $\iff \pi_0 = 0$,

so that $\tilde\lambda_g > 0 \iff \pi_0 > 0$, as claimed.

Finally, we note that each $\tilde\lambda_i$ $(i < g)$ is typically (much) closer to $\lambda_i$ than to $\lambda_{i+1}$. For, considering the graph of $x \to 1/x$, $h\left((\lambda_i + \lambda_{i+1})/2 + \delta(\lambda_i - \lambda_{i+1})/2\right)$ $(-1 < \delta < +1)$ is well-approximated by:

$1 - \frac{2m_i\lambda_i^2}{(\lambda_i - \lambda_{i+1})(1 - \delta)} + \frac{2m_{i+1}\lambda_{i+1}^2}{(\lambda_i - \lambda_{i+1})(1 + \delta)}$

whose unique zero $\delta^*$ over $(-1, 1)$ is positive whenever, as will typically be the case, $m_i = m_{i+1}$ (both will usually be one), while $(m_i\lambda_i + m_{i+1}\lambda_{i+1}) < 1/2$. Indeed, a straightforward analysis shows that, for any $m_i$ and $m_{i+1}$, $\delta^* = 1 + O(\lambda_i)$ as $\lambda_i \to 0$.

2.5. Total Positivity and Local Mixing

Mixture modelling is an exemplar of a major area of statistics in which computational information geometry enables distinctive methodological progress. The −1-convex hull of an exponential family is of great interest, mixture models being widely used in many areas of statistical science. In particular, they are explored further in [?]. Here, we simply state the main result, a simple consequence of the total positivity of exponential families [?], that, generically, convex hulls are of maximal dimension. In this result, “generic” means that the +1 tangent vector that defines the exponential family has components that are all distinct.

Theorem 1. The −1-convex hull of an open subset of a generic one-dimensional exponential family is of full dimension.

Proof. For any $(\pi_i) \in \Delta_k$ with each $\pi_i > 0$, $\theta_0 < \cdots < \theta_k$ and $s_0 < \cdots < s_k$, let $B = (\pi(\theta_0), \ldots, \pi(\theta_k))$ have general element:

$\pi_i(\theta_j) := \pi_i \exp[s_i\theta_j - \psi(\theta_j)].$

Further, let $\tilde B = B - \pi(\theta_0) 1_{k+1}^T$, whose general column is $\pi(\theta_j) - \pi(\theta_0)$. Then, it suffices to show that $\tilde B$ has rank $k$. However, using [?] (p. 33), $\mathrm{Rank}(\tilde B) = \mathrm{Rank}(B) - 1$, so that:

$\mathrm{Rank}(\tilde B) = k \iff B$ is nonsingular $\iff B^*$ is nonsingular,

where $B^* = (\exp[s_i\theta_j])$. It suffices, then, to recall [?] that $K(x, y) = \exp(xy)$ is strictly totally positive (of order ∞), so that $\det B^* > 0$.

3.
Infinite Dimensional Structure

This section will start to explore the question of whether the simplex structure, which describes the finite dimensional space of distributions, can extend to the infinite dimensional case. We examine some of the differences with the finite dimensional case, illustrating them with clear, commonly occurring examples.

3.1. Infinite Dimensional Information Geometry: A Review

In the previous sections, the underlying computational space is always finite dimensional. This section looks at issues related to an infinite dimensional extension of the theory in this paper. There is a great deal of literature concerning infinite dimensional statistical models. The discussion here concentrates on information geometric, parametrisation and boundary issues.

The information geometry theory of Amari [?] has a geometric foundation, where statistical models (typically full and curved exponential families) have a finite dimensional manifold structure. When considering the extension to infinite dimensional cases, Amari notes the problem of finding an “adequate topology” [?] (p. 93). There has been very interesting work following up this topological challenge. By concentrating on distributions with a common support, the paper [?] uses the geometry of a Banach manifold, where local patches on the manifold are modelled by Banach spaces, via the concept of an Orlicz space. This gives a structure that is analogous to an infinite dimensional exponential family, with mean and natural parameters and including the ability to define mixed parametrisations. One drawback of this Banach structure, as pointed out in [?], is that the likelihood function with finite samples is not continuous on the manifold. Fukumizu uses a reproducing kernel Hilbert space structure rather than a Banach manifold, which is a stronger topology. There are strong connections between the approach taken in [?] and the material in Section ??; we note two issues here: (1) a focus on the finite nature of the data; and (2) using a Hilbert structure defined by a cumulant generating function. The approaches differ in that [?] uses a manifold approach rather than the simplicial complex as the fundamental geometric object. There is also other work that explicitly uses infinite dimensional Hilbert spaces in statistics, a good reference being [?].

In this paper, in contrast to previous authors, a simplicial, rather than a manifold-based, approach is taken. This allows distributions with varying support, as well as closures of statistical families, to be included in the geometry. Another difference in approach is the way in which geometric structures are induced by infinite dimensional affine spaces rather than by using an intrinsic geometry. This approach was introduced by [?] and extended by [?]. Spaces of distributions are convex subsets of the affine spaces, and their closure within the affine space is key to the geometry.

In exponential families, the −1-affine structure is often called the mean parametrisation, and using moments as parameters is one very important part of modelling. In the infinite dimensional case, the use of moments as a parameter system is related to the classical moment problem (when does there exist a (unique) distribution whose moments agree with a given sequence?), which has generated a vast literature in its own right; see [? ? ?]. In general terms, the existence of a solution to the moment problem is connected to positivity conditions on moment matrices. Such conditions have been used in connection to the infinite dimensional geometry of mixture models [?]. Uniqueness, however, is a much more subtle problem: sufficient conditions can be formulated in terms of the rate of growth of the moments [?]. Counterexamples to general uniqueness results include the log-normal distribution [?].
The geometry of the Fisher information is also much more complex in general spaces of distributions than in exponential families. Simple mixture models, including two-component mixtures of exponential distributions [?], can have “infinite” expected Fisher information, which gives rise to non-standard inference issues. Similar results on infinitely small (and large) eigenvalues of covariance operators are also noted in [?]. Since the Fisher information is a covariance, the fact that it does not exist for certain distributions or that its spectrum can be unbounded above or arbitrarily close to zero is not a surprise. However, these observations do need to be taken into account when considering the information geometry of infinite dimensional spaces.

The rest of this section looks at the topology and geometry of the infinite dimensional simplex and gives some illustrative examples, which, in particular, show the need for specific Hilbert space structures, discussed in the final section.

3.2. Topology

For simplicity and concreteness, in this section, we will be looking at models for real valued random variables. In this paper, we restrict attention to the cases where the sample space is R+ or R and has been discretised to a countably infinite set of bins, Bi, with i ∈ N or Z, respectively. In the finite case, the basic object is the standard simplex, Δk, with k + 1 bins. We generalise this to countable unions of such objects. Of these, one is of central importance, denoted by Δemp or simply Δ, because it is the smallest object that contains all possible empirical distributions.

Definition 2. For any finite subset of bins, indexed by I ⊂ N or Z, denote

$\Delta_I = \left\{ x = (x_i)_{i \in I} : x_i \ge 0,\ \sum_{i \in I} x_i = 1 \right\}.$

We take the union of all such sets, $\bigcup_{|I| < \infty} \Delta_I$, where |I| denotes the number of elements of the index set. This can always be written as:

$\Delta = \left\{ x = (x_i)_{i \in \mathbb{Z}} : \sum_{i \in \mathbb{Z}} x_i = 1,\ x_i \ge 0 \text{ and only finitely many } x_i > 0 \right\}.$
In what follows, it is important to note that for any given statistical inference problem, the sample size, n, is always finite, even if we frequently use asymptotic approximations, where n → ∞. Thus, the data, as represented by the empirical distribution, naturally lie in the space, Δ. However, many models used in the given inference problem will have support over all bins, so the models most naturally lie in the “boundary” constructed using the closures of the set. These objects are subsets of sequence spaces, and the corresponding topologies can be constructed from the Banach spaces, ℓp, p ∈ [1, ∞]. The following results follow directly from explicit calculations, where we note that in this section, since all terms are non-negative, convergence always means absolute convergence. In particular, arbitrary rearrangements of series do not affect the existence of limits or their values.

Example 6. Consider the sequence of “uniform distributions” $x^{(n)} = (\frac{1}{n}, \ldots, \frac{1}{n}, 0, \ldots)$, with n non-zero entries, as elements of Δ. This has an ℓp limit of the zero sequence for p ∈ (1, ∞], since $\|x^{(n)}\|_p = n^{1/p - 1} \to 0$ for such p.

Proposition 1. The ℓp extreme points of Δ, for p ∈ (1, ∞], are the zero sequence and the sequences, $\delta_i$ (i ∈ Z), with one as the i-th element and zero elsewhere.

For p ∈ [1, ∞], let Δp ⊂ ℓp denote the ℓp closure of Δ.

Theorem 2.
(a) $\Delta_1 = \{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0,\ \sum_{i \in \mathbb{Z}} x_i = 1\}$.
(b) $\Delta_\infty = \{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0,\ \sum_{i \in \mathbb{Z}} x_i \le 1\}$.
(c) For p ∈ (1, ∞), $\Delta_p = \Delta_\infty = \{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0,\ \sum_{i \in \mathbb{Z}} x_i \le 1\}$.

Proof. (a) It is immediate that $\{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0,\ \sum_{i \in \mathbb{Z}} x_i = 1\} \subseteq \Delta_1$. Conversely, if $\bar x$ is a limit point, then all its elements must be non-negative. Finally, if $\sum_{i=1}^{\infty} \bar x_i$ is not bounded above by one, then there exists N, such that $\sum_{i=1}^{N} \bar x_i > 1 + \epsilon$ for some $\epsilon > 0$. Hence,

$\sum_{i=1}^{\infty} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{N} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{N} \bar x_i - \sum_{i=1}^{N} x_i^{(n)} > \epsilon$

for all n, which contradicts convergence. If $\sum_{i=1}^{\infty} \bar x_i < 1 - \epsilon$, then

$\sum_{i=1}^{\infty} |\bar x_i - x_i^{(n)}| \ge \sum_{i=1}^{\infty} x_i^{(n)} - \sum_{i=1}^{\infty} \bar x_i > \epsilon,$

which again contradicts convergence.

(b) It is again immediate that $\{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0,\ \sum_{i \in \mathbb{Z}} x_i = 1\} \subseteq \Delta_\infty$. However, by Example ??, the zero sequence is also in $\Delta_\infty$, so that $\{x = (x_i)_{i \in \mathbb{Z}} : x_i \ge 0,\ \sum_{i \in \mathbb{Z}} x_i \le 1\} \subseteq \Delta_\infty$. Conversely, by contradiction, it is easy to see that all elements of the closure must have non-negative elements. Finally, for any $\bar x \in \Delta_\infty$, if $\sum_{i=1}^{\infty} \bar x_i$ is not bounded above by one, there exists N, such that $\sum_{i=1}^{N} \bar x_i > 1 + \epsilon$ for some $\epsilon > 0$. For any sequence of points, $x^{(n)}$ in Δ, we have that $\sum_{i=1}^{N} x_i^{(n)} \le 1$, so that, for i = 1, . . . , N, the maximum value of $|x_i^{(n)} - \bar x_i| > \epsilon/N$. Hence, for all sequences, $x^{(n)}$, we have $\|x^{(n)} - \bar x\|_\infty > \epsilon/N$, which contradicts $\bar x$ being in the closure.

(c) This follows essentially the same argument as (b) by noting in the case where $\sum_{i=1}^{\infty} \bar x_i$ is not bounded above by one, we have:

$\|x^{(n)} - \bar x\|_p^p \ge \sum_{i=1}^{N} |\bar x_i - x_i^{(n)}|^p \ge N \max_{i=1,\ldots,N} |x_i^{(n)} - \bar x_i|^p > N^{1-p}\epsilon^p$

for any sequences, $x^{(n)}$, which contradicts $\bar x$ being in the closure.

It is immediate that the spaces, Δ and Δ1, are convex subsets of ℓ1 and that Δ∞ is a convex set in ℓ∞.

3.3. Geometry

In the same way as for the finite case, the −1-geometry can be defined using an affine space structure using the following definition.

Definition 3. Let I be a countable index set which is a subset of Z. The −1-affine space structure over distributions is $(X_{mix}, V_{mix}, +)$, where:

$X_{mix} = \left\{ x = (x_i) \,\middle|\, \sum_I x_i = 1,\ \sum_I |x_i| < \infty \right\}, \quad V_{mix} = \left\{ v = (v_i) \,\middle|\, \sum_I v_i = 0,\ \sum_I |v_i| < \infty \right\},$

and $x + v = (x_i + v_i)$.

In order to define the +1-geometric structure, we also follow the approach used in the finite case. Initially, to understand the +1-structure, consider the case where all distributions have a common support, i.e., assume πi > 0 for all i. We follow here the approach of [?].

Definition 4.
Consider the set of non-negative measures on N or Z and the equivalence relation defined by:

$\{a_i\} \sim \{b_i\} \iff \exists \lambda > 0 \text{ s.t. } \forall i\ a_i = \lambda b_i.$

The equivalence classes of this relation are the points in the +1 geometry. These points can be further partitioned into sets with the same support, i.e., $\mathrm{supp}(\langle a \rangle) = \{i : a_i > 0\}$, where this is clearly well-defined.

On sets of +1-points with the same support, we can define the +1-geometry in the same way as in the finite case. With “exp” connoting an exponential family distribution, we have:

Definition 5. For a given index set, I, define $X_{exp}$ to be all +1-points whose support equals I, and define the vector space $V_{exp} = \{(v_i),\ i \in I\}$. With the operation, ⊕, defined by:

$\langle x_i \rangle \oplus v_i = \langle x_i \exp(v_i) \rangle,$

this is an affine space. The +1-affine structure is then defined by $(X_{exp}, V_{exp}, \oplus)$.

Theorem 3. If a and b lie in Δ (or Δ1) and have the same support, then $C(\rho) = \sum a_i^{\rho} b_i^{(1-\rho)} < \infty$ for ρ ∈ [0, 1]. Hence, $a_i^{\rho} b_i^{(1-\rho)} / C(\rho) \in \Delta$ (or Δ1).

Proof. Since a, b are absolutely convergent, the sequence, max(ai, bi), is also. Since we have:

$0 \le \min(a_i, b_i) \le a_i^{\rho} b_i^{1-\rho} \le \max(a_i, b_i)$

it follows that C(ρ) < ∞, and we have the result.

This result shows that sets in Δ1 with the same support are +1-convex, just as the faces in the finite case are.

3.4. Examples

In order to get a sense of how the +1-geometry works, let us consider a few illustrative examples.

Example 7. If we denote the discretised standard normal density by a and the discretised Cauchy density by b and consider the path:

$\frac{a_i^{\rho} b_i^{(1-\rho)}}{C(\rho)},$

the normalising constant is shown in Figure ??. We see that at ρ = 0 (the Cauchy distribution), we have that the derivative of the normalising constant (i.e., the mean of the sufficient statistic) is tending to infinity. At the other end (ρ = 1), the model can be extended in the sense that the distribution exists for values greater than one.
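The bound used in the proof of Theorem 3 and the normalising constant of Example 7 can be explored with a short sketch. This is an illustration under stated assumptions rather than the authors' computation: the bins are truncated to [−6, 6] with width 0.25 and both densities are renormalised on that range, so the unbounded-mean behaviour at ρ = 0 cannot appear; only the sandwich $\min(a_i, b_i) \le a_i^{\rho} b_i^{1-\rho} \le \max(a_i, b_i)$ and the resulting finiteness of C(ρ) are illustrated. All function names and grid choices are hypothetical.

```python
import math


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cauchy_cdf(x):
    return 0.5 + math.atan(x) / math.pi


def discretise(cdf, edges):
    """Bin probabilities on the given edges, renormalised to sum to one."""
    raw = [cdf(hi) - cdf(lo) for lo, hi in zip(edges[:-1], edges[1:])]
    total = sum(raw)
    return [r / total for r in raw]


def normalising_constant(a, b, rho):
    """C(rho) = sum_i a_i^rho * b_i^(1 - rho)."""
    return sum(x ** rho * y ** (1.0 - rho) for x, y in zip(a, b))


def path_point(a, b, rho):
    """The normalised point a_i^rho * b_i^(1 - rho) / C(rho) on the +1 path."""
    c = normalising_constant(a, b, rho)
    return [x ** rho * y ** (1.0 - rho) / c for x, y in zip(a, b)]


edges = [-6.0 + 0.25 * j for j in range(49)]    # 48 bins on [-6, 6] (assumed grid)
a = discretise(normal_cdf, edges)               # truncated discretised normal
b = discretise(cauchy_cdf, edges)               # truncated discretised Cauchy
for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(rho, normalising_constant(a, b, rho))
```

On a truncated grid C(ρ) equals one at both end points and is smaller in between, which is consistent with Hölder's inequality; widening the grid would be needed to see the Cauchy tail influence the slope near ρ = 0.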
Figure 6. Normalising constant for normal-Cauchy exponential mixing example.

Thus, in this example, the path joining the two distributions is an extended, rather than natural, exponential family, since we have to include the boundary point where the mean is unbounded.

Example 8. Let us return to Example ??, but now without the censoring. Thus, now, there is a countably infinite set of bins, and so, we can investigate its embedding in the infinite simplex. As discussed in [?], we shall discretise the continuous distribution by computing the probabilities associated to bins $[c_i, c_{i+1}]$, i = 1, 2, · · · . For the exponential model, Exp(θ), the bin probabilities are simply:

$\pi_i(\theta) = \exp(-\theta c_i) - \exp(-\theta c_{i+1}).$

Using this, the model will lie in the infinite simplex on the positive half line with the index set I = N. First, consider the case where we have a uniform choice of discretisation, where $c_n = n \times \epsilon$ for some fixed, ε > 0. In this case, the bin probabilities can be written as an exponential family:

$\pi_n(\theta) = \exp\left\{ -\theta\epsilon n + \log(1 - e^{-\theta\epsilon}) \right\}$

for θ > 0. This gives a +1-geodesic through $\{\pi_i(\theta_0)\}$ in the direction $\{\epsilon \times n\}$ of the form:

$\pi_n(\theta_0) \exp\left\{ -\lambda\epsilon n + \log\left( \frac{1 - e^{-(\lambda + \theta_0)\epsilon}}{1 - e^{-\theta_0\epsilon}} \right) \right\} \quad (5)$

for λ > −θ0. In the case where λ → −θ0, the limiting distribution is the zero measure in Δ∞, and at the other extreme, where λ → ∞, the limiting distribution is the atomic distribution in the first bin, a distribution with a different support than πi(θ0). However, unlike the finite case, there is no guarantee that, for a given “direction”, {ti}, there exists a +1-geodesic starting at {πi(θ0)}, since we require the convergence of the normalising constant:

$\sum_{i=0}^{\infty} \pi_i(\theta_0) \exp(\lambda t_i) < \infty.$

From this example, we see that the limit points of exponential families can lie in the space, Δ∞, but not in Δ1. The next example shows that limits do not have to exist at all.
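The limit λ → −θ0 in Example 8 can be made concrete with a short sketch. Points on the geodesic (5) are the bin probabilities $\pi_n(\theta_0 + \lambda)$, so the limit corresponds to θ → 0 from above. The code below assumes ε = 1, indexes bins from n = 0 and truncates the sum at a finite number of bins; these choices, and the function names, are illustrative assumptions. It reports the ℓ1 norm (which stays at one) and the ℓ∞ norm (which shrinks towards zero), in line with the limit being the zero measure in Δ∞ but not a point of Δ1.

```python
import math


def exp_bin_probs(theta, eps, n_bins):
    """pi_n(theta) = exp(-theta*eps*n) * (1 - exp(-theta*eps)), n = 0, 1, ..."""
    scale = -math.expm1(-theta * eps)    # 1 - exp(-theta*eps), computed stably
    return [scale * math.exp(-theta * eps * n) for n in range(n_bins)]


def l1_and_sup(theta, eps=1.0, n_bins=20000):
    """Return the (truncated) l1 norm and the sup norm of the bin probabilities."""
    probs = exp_bin_probs(theta, eps, n_bins)
    return sum(probs), max(probs)


for theta in (1.0, 0.1, 0.01):
    total, sup = l1_and_sup(theta)
    print(theta, total, sup)
```

The truncation at 20000 bins is harmless for the values of θ shown, but a much smaller θ would need more bins before the truncated sum is close to one.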
Example 9. Consider the family whose bin probabilities, πi ∈ Δ∞, are proportional to a discretised standard normal with bins of constant width. The exponential family, which is proportional to πi exp(θi), does not have an ℓ∞ limit, as it is discretised normal with mean θ. The natural parameter space here is (−∞, ∞).

The last illustrative example is from [?] and shows that even for simple models, the Fisher information for the parameters of interest need not be finite.

Example 10. Let us consider a simple example of a two-component mixture of (discretised) exponential distributions:

$(1 - \rho)\pi_i(\theta_0 + \lambda) + \rho\pi_i(\theta_0). \quad (6)$

The tangent vector in the ρ-direction has n-th component:

$\pi_n(\theta_0) - \pi_n(\theta_0 + \lambda) = \pi_n(\theta_0)\left( 1 - e^{-\lambda\epsilon n} C \right)$

for a positive constant, C. The corresponding squared length, with respect to the Fisher information, is:

$\sum_{n=0}^{\infty} \left( 1 - e^{-\lambda\epsilon n} C \right)^2 \pi_n(\theta_0).$

As an example, consider θ0 = 1; then, this term will be infinite for λ ≤ −0.5.

3.5. Hilbert Space Structures

Following these examples, we can consider the Hilbert space structure of exponential families inside the infinite simplex with the following results.

Definition 6. Define the function, S(·), by $S(\{v_i\}, \pi) = \sup_\theta \{\theta \mid \sum_I \pi_i \exp(\theta v_i) < \infty\}$, the function being set to ∞ when the set is unbounded. Furthermore, define for a given $\{\pi_i\} \in \bar\Delta_\infty$, the sets:

$V(\pi) = \{\{v_i\} \mid S(\{v_i\}, \pi) > 0\},$

and

$V_c(\pi) = \{\{v_i\} \mid \pm\{v_i\} \in V(\pi)\}.$

The spaces, $V_c(\pi)$, correspond to the directions in which the +1-geodesic and, so, the corresponding exponential families are well-defined and have particularly “nice” geometric structures.

Theorem 4. For π, define a Hilbert space by:

$H(\pi) := \left\{ \{v_i\} \,\middle|\, \sum v_i^2 \pi_i < \infty \right\}$

with inner product:

$\langle \{v_i\}, \{w_i\} \rangle_\pi = \sum v_i w_i \pi_i,$

and corresponding norm $\|\cdot\|_\pi$. Under these conditions: (i) $V_c(\pi)$ is a subspace of $H(\pi)$, and (ii) the set $V(\pi)$ is a convex cone.

Proof. (i) First, if $\{v_i\} \in V_c(\pi)$, then by definition, the moment generating function:

$\sum \exp(\theta v_i)\pi_i,$

is finite for θ in an open set containing θ = 0.
Hence, we have both:

$\sum v_i \pi_i < \infty, \quad \text{and} \quad \sum v_i^2 \pi_i < \infty.$

Thus, $\{v_i\} \in H(\pi)$. The fact that it is a subspace follows from (ii) below.

(ii) It is immediate that $V(\pi)$ is a cone. Convexity follows from the Cauchy–Schwarz inequality, since for all $\{v_i\}, \{v_i^*\} \in V(\pi)$ and λ ∈ [0, 1], it follows that:

$\left( \sum \pi_i e^{\frac{\theta}{2}(\lambda v_i + (1-\lambda)v_i^*)} \right)^2 = \left( \sum \left( \sqrt{\pi_i}\, e^{\frac{\theta}{2}\lambda v_i} \right) \left( \sqrt{\pi_i}\, e^{\frac{\theta}{2}(1-\lambda)v_i^*} \right) \right)^2 \le \left( \sum \pi_i e^{\theta\lambda v_i} \right) \left( \sum \pi_i e^{\theta(1-\lambda)v_i^*} \right),$

and, so, is finite for a strictly positive value of θ, hence $\{\lambda v_i + (1 - \lambda)v_i^*\} \in V(\pi)$.

Hence, this result illustrates the point above regarding the existence of “nice” geometric structure in the sense of Amari’s information geometry developed for finite dimensional exponential families. Infinite dimensional families have a richer structure; for example, they include the possibility of having an infinite Fisher information; see Examples ?? and ??.

Acknowledgments: The authors would like to thank Karim Anaya-Izquierdo and Paul Vos for many helpful discussions and the UK’s Engineering and Physical Sciences Research Council (EPSRC) for the support of grant number EP/E017878/.

Author Contributions: All authors contributed to the conception and design of the study, the collection and analysis of the data and the discussion of the results. All authors read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Stat. 1975, 3, 1189–1242.
2. Kass, R.E.; Vos, P.W. Geometrical Foundations of Asymptotic Inference; John Wiley & Sons: London, UK, 1997.
3. Amari, S.-I. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics; Springer-Verlag Inc.: New York, NY, USA, 1985; Volume 28.
4. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P.; Vos, P. Computational Information Geometry: Foundations.
In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 311–318.
5. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P.; Vos, P. Computational Information Geometry: Mixture Modelling. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 319–326.
6. Anaya-Izquierdo, K.; Critchley, F.; Marriott, P. When are first order asymptotics adequate? A diagnostic. Stat 2014, 3, 17–22.
7. Murray, M.K.; Rice, J.W. Differential Geometry and Statistics; Chapman & Hall: London, UK, 1993.
8. Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 95–97.
9. Hand, D.J.; Daly, F.; Lunn, A.D.; McConway, K.J.; Ostrowski, E. A Handbook of Small Data Sets; Chapman and Hall: London, UK, 1994.
10. Bryson, M.C.; Siddiqui, M.M. Survival times: Some criteria for aging. J. Am. Stat. Assoc. 1969, 64, 1472–1483.
11. Marriott, P.; West, S. On the geometry of censored models. Calcutta Stat. Assoc. Bull. 2002, 52, 567–576.
12. Geiger, D.; Heckerman, D.; King, H.; Meek, C. Stratified exponential families: Graphical models and model selection. Ann. Stat. 2001, 29, 505–529.
13. Edelsbrunner, H. Algorithms in Combinatorial Geometry; Springer-Verlag: New York, NY, USA, 1987.
14. Barndorff-Nielsen, O.E.; Cox, D.R. Asymptotic Techniques for Use in Statistics; Chapman & Hall: London, UK, 1989.
15. McCullagh, P. Tensor Methods in Statistics; Chapman & Hall: London, UK, 1987.
16. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
17. Karlin, S. Total Positivity; Stanford University Press: Stanford, CA, USA, 1968; Volume I.
18. Householder, A.S. The Theory of Matrices in Numerical Analysis; Dover Publications: Dover, DE, USA, 1975.
19. Pistone, G.; Rogantin, M.P.
The exponential statistical manifold: Mean parameters, orthogonality and space transformations. Bernoulli 1999, 5, 721–760.
20. Fukumizu, K. Infinite dimensional exponential families by reproducing kernel Hilbert spaces. In Proceedings of the 2nd International Symposium on Information Geometry and its Applications, Tokyo, Japan, 12–16 December 2005.
21. Small, C.G.; McLeish, D.L. Hilbert Space Methods in Probability and Statistical Inference; John Wiley & Sons: London, UK, 1994.
22. Akhiezer, N.I. The Classical Moment Problem; Hafner: New York, NY, USA, 1965.
23. Stoyanov, J.M. Counter Examples in Probability; John Wiley & Sons: London, UK, 1987.
24. Gut, A. On the moment problem. Bernoulli 2002, 8, 407–421.
25. Lindsay, B.G. Moment matrices: Applications in mixtures. Ann. Stat. 1989, 17, 722–740.
26. Li, P.; Chen, J.; Marriott, P. Non-finite Fisher information and homogeneity: An EM approach. Biometrika 2009, 96, 411–426.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

Using Geometry to Select One Dimensional Exponential Families That Are Monotone Likelihood Ratio in the Sample Space, Are Weakly Unimodal and Can Be Parametrized by a Measure of Central Tendency

Paul Vos 1 and Karim Anaya-Izquierdo 2,*

1 Department of Biostatistics, East Carolina University, Greenville, NC 27858, USA; E-Mail: vosp@ecu.edu
2 Department of Mathematical Sciences, University of Bath, Bath BA27AY, UK
* E-Mail: kai21@bath.ac.uk; Tel: +44-1225-384644

Received: 30 April 2014; in revised form: 30 June 2014 / Accepted: 14 July 2014 / Published: 18 July 2014

Abstract: One dimensional exponential families on finite sample spaces are studied using the geometry of the simplex $\Delta^{\circ}_{n-1}$ and that of a transformation $V_{n-1}$ of its interior.
This transformation is the natural parameter space associated with the family of multinomial distributions. The space $V_{n-1}$ is partitioned into cones that are used to find one dimensional families with desirable properties for modeling and inference. These properties include the availability of uniformly most powerful tests and estimators that exhibit optimal properties in terms of variability and unbiasedness.

Keywords: simplex; cone; exponential family; monotone likelihood ratio; unimodal; duality

1. Introduction

The motivation for the constructions in this paper begins with a sample from a one dimensional space that is discrete. We allow for a continuous sample space but assume that this has been suitably discretized into n bins. The simplest underlying structure for the probability assigned to these bins is given by the multinomial distribution. The collection of all multinomial distributions can be identified with the n − 1 simplex $\Delta_{n-1}$. We use the geometry of the simplex along with a transformation of its interior $\Delta^{\circ}_{n-1}$ to search for one dimensional subspaces that have good properties for modeling and for inference. In particular, we want families that can be parameterized by the mean, have only unimodal distributions, have desirable test characteristics (such as providing uniformly most powerful unbiased tests) and estimation properties (such as unbiasedness and small variability).

The boundary of the (n − 1) dimensional simplex $\Delta_{n-1}$ can be written as the union of simplexes of dimension (n − 2). This process can be repeated on the simplexes of lower dimension until the boundary consists of the vertices of the original simplex. This construction has statistical relevance to the possible supports for the probability distributions considered on the n bins. We obtain a dual decomposition for a transformation $V_{n-1}$ (defined in Equation (5) in Section 5) of $\Delta^{\circ}_{n-1}$; it is dual in that the result can be obtained by replacing simplexes with cones.
The statistical relevance of the conical decomposition is to the possible modes for all the distributions on the n bins. Since $V_{n-1}$ is the natural parameter space for the distributions in $\Delta^{\circ}_{n-1}$, one dimensional exponential families are lines in $V_{n-1}$ and these can be related to the cones that partition $V_{n-1}$. One result is that the limiting distribution for any one dimensional exponential family in $\Delta^{\circ}_{n-1}$ is the uniform distribution whose support is determined by the cone that contains the limiting values of the line corresponding to the exponential family.

Entropy 2014, 16, 4088–4100; doi:10.3390/e16074088 www.mdpi.com/journal/entropy

While one parameter exponential families can be defined quite generally by choosing a sufficient statistic, it can be useful to start with the sufficient statistics from well-known families such as the binomial, Poisson, negative binomial, normal, inverse Gaussian, and Gamma distribution. These exponential families have good modeling and inferential properties that we try to maintain by limiting the extent to which the sufficient statistic is modified. These restrictions lead to considering vectors in $V_{n-1}$ that lie in a cone. Examples of how to construct these cones are given.

2. Motivating Examples

One dimensional exponential families such as the binomial or Poisson are the workhorse of parametric inference because of their excellent statistical properties. However, being one dimensional means they do not always fit data very well, so an extension to a two (or higher) dimensional exponential family can be pursued in order to preserve the nice inferential structure. An issue with such an extension is that, for each extra natural parameter added, we need to choose a new sufficient statistic and this choice can substantially change the shape of the corresponding density functions. For example, densities can pass from being unimodal to having multiple modes for some parameter values.
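A brief computational sketch of this loss of unimodality, under stated assumptions: it uses the multiplicative binomial family that the text's first example introduces, with weights proportional to $\binom{n}{x}\exp(\theta_1 x + \theta_2 x(x - n))$, and the parameter values n = 127, (θ1, θ2) = (−0.0122, 0.018) quoted there. The function names are hypothetical, and the code simply counts strict interior local maxima of the probability mass function; it is not the authors' own computation.

```python
import math


def mult_binomial_pmf(n, theta1, theta2):
    """Probabilities proportional to C(n, x) * exp(theta1*x + theta2*x*(x - n))."""
    logw = [math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + theta1 * x + theta2 * x * (x - n) for x in range(n + 1)]
    top = max(logw)                       # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    return [v / total for v in w]


def interior_modes(pmf):
    """Indices x with pmf[x-1] < pmf[x] > pmf[x+1] (strict interior local maxima)."""
    return [x for x in range(1, len(pmf) - 1)
            if pmf[x - 1] < pmf[x] > pmf[x + 1]]


n = 127
binom = mult_binomial_pmf(n, math.log(50 / 77), 0.0)   # ordinary binomial, mean 50
multi = mult_binomial_pmf(n, -0.0122, 0.018)           # values quoted in the text
print(interior_modes(binom), interior_modes(multi))
print(sum(x * p for x, p in enumerate(multi)))         # mean of the second density
```

Setting θ2 = 0 recovers the binomial, which has a single mode, whereas the second parameter choice yields two separated modes, which is the behaviour the paragraph above warns about.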
To see this, consider the following examples.

Example 1. Altham [1] considered the so-called multiplicative generalization of the binomial distribution with corresponding density

$$f(x; p, \phi) = \binom{n}{x} p^{x} (1-p)^{n-x} \phi^{x(x-n)} / C(p, \phi) \qquad (1)$$

where $C$ is the normalizing constant and where clearly the binomial is recovered when $\phi = 1$. By reparametrizing using $\theta_1 = \log(p/(1-p))$ and $\theta_2 = \log(\phi)$ this density can be expressed in exponential form as

$$f(x; \theta_1, \theta_2) = h(x) \exp\left(\theta_1 x + \theta_2 T(x) - K(\theta_1, \theta_2)\right) \qquad (2)$$

where $T(x) = x(x-n)$ is the added sufficient statistic and $h(x) = \binom{n}{x}$, where dependence on $n$ has been ignored. Note that the same family is obtained if $T(x) = x^2$ is added as a sufficient statistic instead of $x(x-n)$. If $n = 127$ and $(\theta_1, \theta_2) = (-0.0122, 0.018)$ then density (2) is bimodal as shown in the left panel of Figure 1. The mean $\mu$ of this distribution is 50. Also plotted is the corresponding binomial density with the same mean or equivalently with $\theta_1 = \log(50/(127-50)) = -0.4318$ and $\theta_2 = 0$.

Figure 1. Binomial density (thick in both panels). Multiplicative binomial density (left panel and thin) and double binomial density (right panel and thin). All densities have the same mean $\mu = 50$ and $n = 127$. Variance of the multiplicative and double binomial densities is equal.

As explained by Lovison [2], this distribution has the feature of being under- or over-dispersed with respect to the binomial depending on $\theta_2$ being negative or positive, respectively. Furthermore, using the mixed parametrization $(\mu, \theta_2)$ (see [3] for details) it is easy to see that this distribution can be parametrized so that one parameter controls dispersion independently of the mean. In fact, for a fixed mean $\mu$, $f(x; \theta_1, \theta_2)$ tends to a two point distribution (with support points at the extremes $x = 0$ and $x = n$) as $\theta_2 \to \infty$, since $T(x) = x(x-n)$ is zero at the extremes and negative in between, and to a degenerate distribution on $x = \mu$ as $\theta_2 \to -\infty$.

Example 2.
Double exponential families [4] are two parameter exponential families that extend standard unidimensional exponential families such as the binomial and the Poisson. Similar to the multiplicative binomial in Example 1, the extra parameter involved in double exponential families controls the variance independently of the mean. The density for the so-called double binomial family can be written in the form (2) with

$$T(x) = x \log\left(\frac{x}{n}\right) + (n-x) \log\left(1 - \frac{x}{n}\right), \qquad h(x) = \binom{n}{x},$$

and with the particular restriction that $\theta_2 < 1$ (see [4] for details). The range $\theta_2 < 0$ generates underdispersion and $\theta_2 \in [0, 1)$ generates overdispersion with respect to the binomial. As shown in the right panel of Figure 1, the double binomial density can also be multimodal, where the double binomial density shown has the same mean and variance as the multiplicative binomial shown in the left panel.

These examples show that while extending exponential families can lead to useful modeling properties such as overdispersion, the extension can also result in distributions that are not suitable for modeling. We are interested in the relationship between geometric properties of one dimensional families and the modeling properties of their distributions.

3. Sample Space and Distribution-valued Random Variables

We consider first the general case where the sample space for a single observation $X_1$ consists of $n$ bins

$$S_n = \{B_1, B_2, \ldots, B_{n-1}, B_n\}.$$

We consider the space of all probability distributions $P$ on this sample space $S_n$. Each probability distribution in $P$ is defined by the $n$-tuple $p$ whose $i$th component is $p_i = \Pr(B_i)$ so that $P$ can be identified with the $(n-1)$ simplex

$$\Delta_{n-1} = \{p \in \mathbb{R}^n : p_i \ge 0 \ \forall i, \ 1'p = 1\}$$

where $1$ in $1'p$ is the vector $1 \in \mathbb{R}^n$ each of whose components is 1. We will slightly abuse the notation by using $p$ to name a point in $\Delta_{n-1}$, and hence in $\mathbb{R}^n$, as well as the corresponding distribution in $P$.
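Looking back at Example 1, the bimodality of the multiplicative binomial can be checked numerically from the exponential form (2). The following is a minimal sketch using only the parameter values quoted in Example 1; the function names (`log_binom`, `two_param_density`, `interior_modes`) are hypothetical helpers introduced for illustration and are not taken from the paper.

```python
import math


def log_binom(n, x):
    """Logarithm of the binomial coefficient C(n, x), i.e. log h(x)."""
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)


def two_param_density(n, theta1, theta2, T):
    """Density of form (2) on x = 0..n, normalized numerically.

    The normalizer K(theta1, theta2) is not computed in closed form;
    the weights are simply rescaled to sum to one.
    """
    logw = [log_binom(n, x) + theta1 * x + theta2 * T(x) for x in range(n + 1)]
    top = max(logw)  # subtract the maximum for numerical stability
    w = [math.exp(l - top) for l in logw]
    total = sum(w)
    return [wi / total for wi in w]


def interior_modes(pmf):
    """Indices x whose probability strictly exceeds both neighbours."""
    return [x for x in range(1, len(pmf) - 1)
            if pmf[x] > pmf[x - 1] and pmf[x] > pmf[x + 1]]


n = 127
# Multiplicative binomial of Example 1: T(x) = x(x - n).
mult = two_param_density(n, -0.0122, 0.018, lambda x: x * (x - n))
# Binomial with the same mean 50: theta1 = log(50/77), theta2 = 0.
binom = two_param_density(n, math.log(50 / 77), 0.0, lambda x: 0.0)

modes_mult = interior_modes(mult)
modes_binom = interior_modes(binom)
mean_binom = sum(x * p for x, p in enumerate(binom))
```

Under these assumptions `modes_mult` contains two values, one on each side of $n/2$, while the binomial has a single mode. Because the quoted parameters are rounded, the computed mean of the multiplicative binomial should only be expected to be close to 50, not exactly equal to it.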
The sample space for a random sample of size $N$ from a distribution $p_0 \in \Delta_{n-1}$ is

$$X_n^N = \{x : x \text{ is an } n \text{ vector of nonnegative integers that sum to } N\}.$$

There is a simple relationship between $X_n^N$ and the simplex that we obtain by dividing each component of $x$ by $N$. Although the sample space $X_n^N$ can be viewed as formed by compositional data, we will follow a different approach to handle this kind of data compared with the classical approach described by Aitchison [5] because the data we consider have additional structure.

In Figure 2 the sample space for the sample of size $N = 10$ is displayed using open circles. The vertices correspond to the case where all 10 values fall in a single bin. The other points correspond to the less extreme cases. Let $p_0$ be any point in $\Delta_{n-1}$. By mapping the multinomial random variable of counts $X$ to $\Delta_{n-1}$, we obtain the random distribution $\hat{P} = X/N$ whose values are multinomial distributions each having number of cases $N$ and probability vector $X/N$. Identifying $X_n^N$-valued random variables with distribution-valued random variables provides a natural means for comparing data with probability models using the Kullback–Leibler (KL) divergence.

We can compare distributions in $\Delta_{n-1}$ using the KL divergence $D : P \times P \to \mathbb{R}$,

$$D(p_1, p_2) = \sum p_1 \log(p_1/p_2) = H(p_1, p_2) - H(p_1)$$

where $H(p_1, p_2) = -\sum p_1 \log(p_2)$ and $H(p_1) = H(p_1, p_1)$ is the entropy of $p_1$. Note that the arguments to $D$ and $H$ are distributions while the logarithm and ratios are defined on points in $\mathbb{R}^n$. Following Wu and Vos [6], the variance of the random distribution $\hat{P}$ is defined to be

$$\mathrm{Var}_{p_0}(\hat{P}) = \min_{p \in \Delta_{n-1}} E_{p_0} D(\hat{P}, p)$$

and its mean is defined to be

$$E_{p_0}(\hat{P}) = \arg\min_{p \in \Delta_{n-1}} E_{p_0} D(\hat{P}, p).$$

Note that the expectations on the right hand side of the equations above are for real-valued random variables while the expectation on the left hand side of the second equation is for a distribution-valued random variable.

Figure 2.
Simplex for $n = 3$ bins and sample space for $N = 10$ observations.

It is not difficult to show that $E_{p_0} \hat{P} = p_0$ so that $\hat{P}$ can be considered an unbiased estimator for $p_0$. Details are in [6], which also shows that the KL risk can be decomposed into bias-squared and variance terms:

$$E_{p_0} D(\hat{P}, q) = D(p_0, q) + \mathrm{Var}_{p_0}(\hat{P}).$$

The distributional variance is related to the entropy

$$\mathrm{Var}_{p_0}(\hat{P}) = E_{p_0} D(\hat{P}, p_0) = H(p_0) - E_{p_0} H(\hat{P}).$$

Note that for $N = 1$, $H(\hat{P}) = 0$ so that for a single observation the random distribution $\hat{P}$ taking values on the vertices of $\Delta_{n-1}$ has variance equal to the entropy of $p_0$.

For inference, $p_0$ is unknown but we specify a subspace $M \subset \Delta_{n-1}$ that contains $p_0$, or at least has distributions that are not too different from $p_0$. Estimates can be obtained by choosing a parameterization for $M$, say $\theta$, and then considering real-valued functions $\hat{\theta}$ and evaluating these in terms of bias and variance. Bias and variance are useful descriptions when $\theta$ describes a feature of the distribution that is of inherent interest. However, if $\theta$ is simply a parameterization, or if there are other features that are also of interest, then these quantities are less useful. For inference regarding the distribution $p_0$ we can use a distribution-valued estimator $\hat{P}_M$ where the subscript indicates that the estimator is defined to account for the fact that $p_0 \in M$. We will not pursue the details of distribution-valued estimators here; we mention these only because all the subspaces we consider will be exponential families and in this case the maximum likelihood estimator has important properties in terms of distribution variance and distribution bias: when $M$ is an exponential family, the maximum likelihood estimator is distribution unbiased, and it uniquely minimizes the distribution variance among the class of all distribution unbiased estimators.
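The decomposition of the KL risk and the entropy identity for the distributional variance stated earlier in this section can be verified by brute-force enumeration for small $n$ and $N$. The sketch below is only an illustration under stated assumptions: the values of `p0`, `q` and `N` are arbitrary choices, the helper names are hypothetical, and the convention $0 \log 0 = 0$ is used for empty bins.

```python
import math


def multinomial_outcomes(n, N):
    """Yield every count vector of length n whose entries sum to N."""
    if n == 1:
        yield (N,)
        return
    for first in range(N + 1):
        for rest in multinomial_outcomes(n - 1, N - first):
            yield (first,) + rest


def multinomial_pmf(x, p):
    """Probability of the count vector x under probabilities p (all p_i > 0)."""
    logp = math.lgamma(sum(x) + 1)
    for xi, pi in zip(x, p):
        logp += xi * math.log(pi) - math.lgamma(xi + 1)
    return math.exp(logp)


def kl(p1, p2):
    """KL divergence D(p1, p2), with the convention 0 log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)


def entropy(p):
    """Entropy H(p), with the convention 0 log 0 = 0."""
    return -sum(a * math.log(a) for a in p if a > 0)


def expected(fn, p0, N):
    """E_{p0} fn(P_hat), where P_hat = X / N and X is multinomial(N, p0)."""
    return sum(multinomial_pmf(x, p0) * fn([xi / N for xi in x])
               for x in multinomial_outcomes(len(p0), N))


p0 = [0.5, 0.3, 0.2]   # true distribution (arbitrary illustrative choice)
q = [0.2, 0.2, 0.6]    # comparison distribution (arbitrary illustrative choice)
N = 10

risk = expected(lambda ph: kl(ph, q), p0, N)       # E D(P_hat, q)
variance = expected(lambda ph: kl(ph, p0), p0, N)  # E D(P_hat, p0)
mean_entropy = expected(entropy, p0, N)            # E H(P_hat)
```

With these quantities, `risk` agrees with `kl(p0, q) + variance` and `variance` agrees with `entropy(p0) - mean_entropy` up to floating-point error.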
Furthermore, when $p_0 \notin M$ then the maximum likelihood estimator is the unique unbiased minimum distribution variance estimator of the distribution in $M$ that is closest (in terms of KL) to $p_0$. Extensions of one dimensional exponential families that do not result in exponential families will not enjoy these properties of maximum likelihood estimation. Details of these results, which hold for sample spaces more general than $S_n$, are in [7].

4. Simplices $\Delta_s$

One dimensional exponential families on $S_n$ are curves in $\Delta_{n-1}$ whose properties will depend on their location within various subspaces of $\Delta_{n-1}$. An important collection of subspaces will be indexed by the subsets of $S_n$. For notational convenience we take $B_i$ to be the integer $i$. Using integers is suggestive of an ordering and a scale structure but at this point these are only being used to indicate distinct bins. For each $s \subset S_n$,

$$\Delta_s = \{p \in \mathbb{R}^n : p_i \ge 0 \ \forall i \in s, \ p_i = 0 \ \forall i \in s^c, \ 1'p = 1\}$$

where $s^c = \{i \in S_n : i \notin s\}$. Note that $\Delta_{S_n} = \Delta_{n-1}$. The interior of $\Delta_s$ is

$$\Delta^\circ_s = \{p \in \Delta_s : p_i > 0 \ \forall i \in s\}.$$

As probability distributions in $P$, $\Delta^\circ_s$ corresponds to the set of all distributions having support $s$. There is a simple and obvious relationship between the dimension of $\Delta_s$, $|\Delta_s|$, and the cardinality of $s$, $|s|$, which holds for all nonempty $s \subset S_n$:

$$|\Delta_s| + 1 = |s|.$$

The boundary of $\Delta_s$ is defined as

$$\partial\Delta_s = \{p \in \Delta_s : p \notin \Delta^\circ_s\}$$

so that $\Delta_s = \Delta^\circ_s \uplus \partial\Delta_s$ where $\uplus$ indicates the sets in the union are disjoint. The boundary $\partial\Delta_s$ can be written as the union of all simplices of dimension one less than that of $\Delta_s$:

$$\partial\Delta_s = \bigcup_{s' : \, s' \subset s, \ |s'| = |s| - 1} \Delta_{s'} \qquad (3)$$

This boundary property for $\Delta_s$ holds because the simplex $S_n$ consists of all possible subsets. Each nonempty $s \subset S_n$ specifies one of the possible supports for distribution $P \in P_n$:

$$\Delta_s = \bigcup_{s' : \, s' \subset s} \Delta^\circ_{s'} \qquad (4)$$

where we set $\Delta_\emptyset = \emptyset$.

5. Cones $\Lambda_s$

The set of all nonempty subsets of the sample space provides a partition of $\Delta_{n-1}$ based on the support of the distributions in $P$.
The elements in the partition are simplices whose dimension is one less than the cardinality of the indexing set. In most cases we will consider models having support $S_n$, that is, models corresponding to $\Delta^\circ_{n-1}$. If we use subsets $s$ to define the mode rather than support, we obtain a partition of $P^\circ$, the distributions in $P$ having support $S_n$. This partition can be expressed using convex cones in an $n-1$ dimensional plane $V_{n-1}$. The dimension of each cone is $n$ minus the cardinality of the indexing set, and the relationship between interiors of cones and their boundaries is analogous to that for simplices expressed in Equations (3) and (4). Let

$$V_{n-1} = \{v \in \mathbb{R}^n : 1'v = 0\} \qquad (5)$$

be the subspace of $\mathbb{R}^n$ of dimension $n-1$ of all vectors that sum to zero. For each nonempty $s \subset S_n$ define

$$\Lambda_s = \{v \in V_{n-1} : v_i \ge v_j \ \forall i \in s, \ \forall j \in S_n\}.$$

It is easily checked that $\Lambda_s$ is a convex cone:

$$v_1, v_2 \in \Lambda_s \implies a_1 v_1 + a_2 v_2 \in \Lambda_s \quad \forall a_1, a_2 \in [0, \infty).$$

The dimension of $\Lambda_s$ is $|\Lambda_s| = n - |s|$ since each point $j \in s^c$ provides a basis vector $b_j$ whose $i$th component is 1 if $i \in s$ or $i = j$ and is zero otherwise, and $|s^c| = n - |s|$. The interior of $\Lambda_s$ is

$$\Lambda^\circ_s = \{v \in \Lambda_s : v_i > v_j \ \forall i \in s, \ \forall j \in s^c\},$$

the boundary is

$$\partial\Lambda_s = \{v \in \Lambda_s : v \notin \Lambda^\circ_s\},$$

so that $\Lambda_s = \Lambda^\circ_s \uplus \partial\Lambda_s$ by definition. Note

$$\Lambda_{S_n} = \Lambda^\circ_{S_n} = 0 \in V_{n-1} \subset \mathbb{R}^n$$

where the first equality holds because the conditions in the definition of $\Lambda^\circ_s$ hold vacuously since $j \in S_n^c = \emptyset$ adds no restriction. Likewise, we can extend the definition of $\Lambda_s$ to include $s = \emptyset$ and since $i \in \emptyset$ adds no restriction

$$\Lambda_\emptyset = \Lambda^\circ_\emptyset = V_{n-1}.$$

Note that $\Lambda_\emptyset$ depends on the cardinality of the set $S_n$. Since we are considering $n$ fixed, we will not show this dependence in the notation.

Corresponding to Equation (3) we have for all nonempty $s$ that the boundary of the cone $\Lambda_s$ is the union of all cones having dimension one less than the dimension of $\Lambda_s$:

$$\partial\Lambda_s = \bigcup_{s' : \, s \subset s', \ |s'| = |s| + 1} \Lambda_{s'}. \qquad (6)$$

Corresponding to Equation (4) we have

$$\Lambda_s = \bigcup_{s' : \, s \subset s'} \Lambda^\circ_{s'} \qquad (7)$$

The relationship between the simplices $\Delta$ and cones $\Lambda$ is more easily seen if we suppress the sets that index these objects. Let $\Delta$ and $\Delta_*$ be any two simplices and let $\Lambda$ and $\Lambda_*$ be any two convex cones. We only consider cones and simplices that correspond to a nonempty subset of $S_n$. Then Equations (6) and (7) for the convex cones are obtained by simply replacing $\Delta$ in Equations (3) and (4) with $\Lambda$:

$$\partial\Delta = \bigcup_{\Delta_* : \, |\Delta_*| = |\Delta| - 1} \Delta_*, \qquad \partial\Lambda = \bigcup_{\Lambda_* : \, |\Lambda_*| = |\Lambda| - 1} \Lambda_* \qquad (8)$$

$$\Delta = \bigcup_{\Delta_* \subset \Delta} \Delta^\circ_*, \qquad \Lambda = \bigcup_{\Lambda_* \subset \Lambda} \Lambda^\circ_* \qquad (9)$$

Equation (9) also holds for the empty set since $\Delta_\emptyset = \emptyset$ and $\Lambda_\emptyset = V_{n-1}$.

6. $V_{n-1}$ and $P^\circ$

There is a natural bijection $\phi$ between $V_{n-1}$ and $\Delta^\circ_{n-1}$ defined by

$$\phi(p) = \log(p) - m(p) 1$$

where $\log(p)$ is the vector with $i$th component $\log(p_i)$ and $m(p)$ is defined so that $1'\phi(p) = 0$. The inverse is

$$\varphi(v) = k^{-1}(v) \exp(v)$$

where $\exp(v)$ is the vector with $i$th component $\exp(v_i)$ and $k(v)$ is defined so that $1'\varphi(v) = 1$, that is, $k(v) = 1'\exp(v)$. Each cone $\Lambda^\circ_s$ in the partition $V_{n-1} = \bigcup \Lambda^\circ_s$ corresponds to one of the $2^n - 1$ possible modes for any distribution having support $S_n$ since $v_i > v_j$ if and only if $\varphi_i(v) > \varphi_j(v)$.

7. $V_{n-1}$ and Exponential Families in $P^\circ$

We define a line by a pair of vectors $v_0, v_1 \in V_{n-1}$ with $v_1 \ne 0$:

$$\ell = \ell(t) = \{v \in V_{n-1} : v = v_0 + t v_1, \ t \in \mathbb{R}\}$$

Note that $v_0$ and $v_1$ are not unique. Applying the inverse transformation $\varphi$ to points in $\ell$ gives probability densities

$$\varphi(v_0 + t v_1) = \frac{\exp(v_0 + t v_1)}{1'\exp(v_0 + t v_1)} \qquad (10)$$

which have the exponential family form with $t$ playing the role of the natural parameter. Therefore, the space $V_{n-1}$ is easily recognized as the natural parameter space for the distributions $\Delta^\circ_{n-1}$ so that each line $\ell$ in $V_{n-1}$ corresponds to a one dimensional exponential family.

For each line $\ell(t)$ there is a value $t_{\max}$ such that $\{\ell(t) : t \ge t_{\max}\}$ is contained in one of the cones $\Lambda^\circ_s$ where $s$ is the subset of $S_n$ with the property that $v_1^i \ge v_1^j$ for all $i \in s$ for vectors $v_1 \in \Lambda^\circ_x$.
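The maps of Section 6 and the line construction of Equation (10) can be sketched in a few lines. This is only an illustration: the function names (`to_v`, `to_p`, `mode_set`, `line`) are hypothetical, bins are labeled 1 to $n$ as in the text, and the tolerance used to detect ties between components is an arbitrary assumption.

```python
import math


def to_v(p):
    """Map p in the open simplex to V: log(p) minus the mean of log(p)."""
    logs = [math.log(pi) for pi in p]
    m = sum(logs) / len(logs)  # m(p) chosen so the components sum to zero
    return [l - m for l in logs]


def to_p(v):
    """Inverse map: exp(v) divided by k(v) = sum of exp(v)."""
    top = max(v)  # shifting by a constant leaves the normalized result unchanged
    w = [math.exp(vi - top) for vi in v]
    k = sum(w)
    return [wi / k for wi in w]


def mode_set(v, tol=1e-12):
    """Set s of bin labels (1-based) at which v attains its maximum.

    v then lies in the interior of the cone indexed by s.
    """
    top = max(v)
    return {i + 1 for i, vi in enumerate(v) if top - vi <= tol}


def line(v0, v1, t):
    """Point v0 + t * v1 on the line through v0 with direction v1."""
    return [a + t * b for a, b in zip(v0, v1)]


p = [0.5, 0.3, 0.2]
v = to_v(p)
roundtrip = to_p(v)

# A one dimensional exponential family as in Equation (10).
v0 = to_v([0.2, 0.5, 0.3])
v1 = [-1.0, 0.0, 1.0]  # direction in V (components sum to zero)
near = mode_set(line(v0, v1, 0.0))
far = to_p(line(v0, v1, 50.0))
```

Here `v` sums to zero, `roundtrip` recovers `p`, and the mode of the family moves from bin 2 at $t = 0$ to bin 3 for large $t$, where almost all of the probability mass sits.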
For each line $\ell(t)$ there is a value $t_{\min}$ such that $\{\ell(t) : t \le t_{\min}\}$ is contained in one of the cones $\Lambda^\circ_{s'}$ where $s'$ is the subset of $S_n$ with the property that $v_1^i \le v_1^j$ for all $i \in s'$ for vectors $v_1 \in \Lambda^\circ_x$. The cones $\Lambda^\circ_s$ and $\Lambda^\circ_{s'}$ are disjoint and will be called the extremal cones for $\ell$. There is at least one other cone $\Lambda^\circ_{s''}$ such that $\ell \cap \Lambda^\circ_{s''} \ne \emptyset$.

Any one dimensional exponential family $\ell(t)$ can be described by an ordered sequence of disjoint cones

$$\left(\Lambda^\circ_{s_1}, \Lambda^\circ_{s_2}, \ldots, \Lambda^\circ_{s_k}\right)$$

where $k = k(\ell)$ will depend on the family. These are simply the cones that are traversed by $\ell(t)$ between its extremal cones. We take $\Lambda^\circ_{s_k}$ to be the cone that contains $\ell(t)$ for all sufficiently large $t$. Equation (6) for cones means that

$$\partial\Lambda_{s_i} \subset \Lambda_{s_j} \quad \text{for } j = i + 1 \text{ or } j = i - 1.$$

The ordered sequence of cones provides an ordered sequence of unique subsets of $S_n$

$$(s_1, s_2, \ldots, s_k)$$

that we call the modal profile for $\ell$, as these are the modes realized by the exponential family $\ell(t)$ between its extremal cones that have modes $s_1$ and $s_k$.

Each point on a line $\ell(t)$ in $V_{n-1}$ corresponds to a distribution having support $S_n$. As $t$ goes to $-\infty$ ($+\infty$), $\varphi(\ell(t))$ goes to a distribution having support $s_1$ ($s_k$). In fact, these are the uniform distributions on these supports. For every $s \subset S_n$ other than $\emptyset$ and $S_n$, the uniform distribution on $s$ is a limiting distribution for some one dimensional exponential family in $P^\circ$.

Figure 3 shows $V_{n-1}$ for the two dimensional simplex shown in Figure 2. The three rays are the one dimensional cones and the spaces between these cones are the two dimensional cones. The origin is the zero dimensional cone. The sample values on the boundary of $\Delta_2$ are not in $V_2$. Note that the one dimensional cones are line segments in $\Delta_2$.

Figure 3. $V_2$ for $n = 3$ bins and sample space for $N = 10$ observations that are in the interior of $\Delta_2$.

8.
Ordered Bins and the Monotone Likelihood Ratio Property

Let the bins be ordered and assign the first $n$ integers to the bins to reflect this ordering. We seek to define exponential families that have a modal profile of the form

$$(\{1\}, \{1, 2\}, \{2\}, \{2, 3\}, \ldots, \{n-1, n\}, \{n\}) \qquad (11)$$

or a contiguous sub-collection of this profile. Extensions to three or more contiguous modes are clearly possible but not discussed here.

From the definition of modal profile, it follows that a family with modal profile (11) will have the property that the mode is a non-decreasing function of $t$. In addition to this property for the mode, we want the likelihood ratio for any two members of the family to provide the same ordering structure as that of the bins. A family that satisfies this condition is said to have the monotone likelihood ratio property with respect to $x$, where $x$ takes the values of the bin labels: $1, 2, \ldots, n$.

Let $p_{\theta_1}$ and $p_{\theta_2}$ be two distributions in a one dimensional family parameterized by $\theta$ and let $p_{\theta_2}/p_{\theta_1}$ be the $n$-vector with components $p^j_{\theta_2}/p^j_{\theta_1}$ for $1 \le j \le n$. This family has monotone likelihood ratio if for all $\theta_1 < \theta_2$ and $j < j'$

$$\frac{p^j_{\theta_2}}{p^j_{\theta_1}} < \frac{p^{j'}_{\theta_2}}{p^{j'}_{\theta_1}}.$$

A family with this property avoids the problem situation where in general the data in the higher numbered bins are evidence for $p_{\theta_2}$ but in going from a particular bin, say $j_0$, to $j_0 + 1$, the likelihood ratio actually decreases. Exponential families such as the binomial and Poisson have this monotone likelihood ratio property for the bin labels. The monotone likelihood ratio property can be extended to allow for likelihood ratios that are monotone in some function of $x$. An important advantage of families with the monotone likelihood ratio property is the existence of uniformly most powerful tests.

To ensure that our exponential families have the monotone likelihood ratio property we consider vectors in the cone $\Lambda^\uparrow \subset \Lambda_n$

$$\Lambda^\uparrow = \{v : v_i < v_j, \ i < j\}.$$
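The monotone likelihood ratio condition is a finite set of inequalities and is easy to test directly. The sketch below checks it for two binomial distributions, which the text names as having the property, and for a made-up pair that violates it. The function names are hypothetical, and the second pair of distributions is an arbitrary illustration rather than anything taken from the paper.

```python
import math


def binomial_pmf(m, prob):
    """Binomial(m, prob) probabilities on x = 0..m (m + 1 ordered bins)."""
    return [math.comb(m, x) * prob ** x * (1 - prob) ** (m - x)
            for x in range(m + 1)]


def has_monotone_likelihood_ratio(p_low, p_high):
    """True if the ratio p_high[j] / p_low[j] is strictly increasing in j."""
    ratios = [hi / lo for lo, hi in zip(p_low, p_high)]
    return all(a < b for a, b in zip(ratios, ratios[1:]))


# Two members of the binomial family with theta_1 < theta_2.
low = binomial_pmf(10, 0.3)
high = binomial_pmf(10, 0.6)
mlr_binomial = has_monotone_likelihood_ratio(low, high)

# Hypothetical pair in which the ratio drops between bins 2 and 3.
p_a = [0.25, 0.25, 0.25, 0.25]
p_b = [0.10, 0.40, 0.20, 0.30]
mlr_counterexample = has_monotone_likelihood_ratio(p_a, p_b)
```

For the binomial pair the ratio grows by the constant factor $(0.6/0.4)/(0.3/0.7) = 3.5$ from one bin to the next, so the check succeeds; for the second pair the ratios are 0.4, 1.6, 0.8, 1.2 and the check fails.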
From Equation (10), the exponential family indexed by $\theta$ is $k(\theta) \exp(v_0 + \theta v_1)$ and

$$\frac{p^j_{\theta_2}}{p^j_{\theta_1}} = \frac{k(\theta_2)}{k(\theta_1)} \exp\left((\theta_2 - \theta_1) v_1^j\right)$$

so that the likelihood ratio is monotone in $j$ if $v_1 \in \Lambda^\uparrow$.

9. Selecting Vectors in $\Lambda^\uparrow$

In order to choose $n$-dimensional vectors $v \in \Lambda^\uparrow$ we will consider a set of infinite dimensional vectors $f$. Let $\bar{f} : \mathbb{R} \to \mathbb{R}$ and consider $f = \bar{f}|_{\mathbb{Z}}$ where $\mathbb{Z}$ is the set of integers. The function $f$ is represented by a doubly infinite sequence

$$f = \ldots, f_{j-1}, f_j, f_{j+1}, \ldots$$

and we denote the set of all such functions as

$$F = \{f : f_j \in \mathbb{R} \ \forall j \in \mathbb{Z}\}.$$

While it is not necessary to consider functions $\bar{f}$ to define $f$, these functions are useful to describe properties of $f$, which can be thought of as a discretized version of $\bar{f}$. Define the gradient of $f$ as the function $\nabla f$ whose $j$th component is

$$(\nabla f)_j = f_j - f_{j-1}$$

The simplest functions in $F$ are the constant functions

$$F_0 = \{f \in F : f_j = f_{j'} \ \forall j, j' \in \mathbb{Z}\}.$$

The next simplest functions are those whose gradient is constant. We call these first order functions and denote the set of these as

$$F_1 = \{f \in F : \nabla f \in F_0\}.$$

Functions in $F_1$ are such that the change from one bin to the next is the same for all bins. That is, these functions describe constant change. We can write the functions in $F_1$ explicitly as

$$F_1 = \{f \in F : f_j = aj + b, \ a, b \in \mathbb{R}\}$$

which shows that each $f \in F_1$ is the discretized version of a function $\bar{f}$ whose graph is a line in $\mathbb{R} \times \mathbb{R}$. We obtain a vector $v$ from $f$ by defining the $j$th component of $v$ as

$$v_j = f_j - \frac{1}{n} \sum_{i=1}^{n} f_i,$$

so that the components of $v$ sum to zero. From this definition we see that the intercept $b$ of $f$ does not affect $v$ and that the slope is a scaling factor so that the restriction to first order functions results in a single direction in $\Lambda^\uparrow$. This direction defines the one dimensional cone defined by the vector with $v_j = j - (n+1)/2$.

Additional directions can be obtained from the second order functions

$$F_2 = \{f \in F : \nabla f \in F_1\}.$$

If $f \in F_2$ then $(\nabla^2 f)_j = a$ for some $a \in \mathbb{R}$ and for all $j \in \mathbb{Z}$.
Using the fact that

$$(\nabla^2 f)_j = (\nabla(\nabla f))_j = (f_j - f_{j-1}) - (f_{j-1} - f_{j-2}) = f_j + f_{j-2} - 2 f_{j-1}$$

the second order functions can be written explicitly as

$$F_2 = \left\{f \in F : f_j = \frac{a}{2} j(j+1) + bj + c, \ a, b, c \in \mathbb{R}\right\}.$$

In order for the vector $v$ obtained from $f \in F_2$ to be in $\Lambda^\uparrow$ we need $(\nabla f)_j \ge 0$ for $j = 1, 2, \ldots, n$. With $f_j = (a/2) j(j+1) + bj + c$ we have $(\nabla f)_j = aj + b$ so that for $a > 0$ we require $b \ge -a$ and for $a < 0$ we require $b \ge -an$. Since we are concerned with the direction rather than the magnitude we can take $a = \pm 1$, and the value of $c$ is chosen so the sum of the components is zero.

The second order vectors in $\Lambda^\uparrow$ consist of the cone defined by the vectors $v_{20}$ and $v_{21}$ having components defined by

$$(n-1)(v_{20})_j = \frac{1}{2} j(j+1) - j - c_{20}$$

$$(n-1)(v_{21})_j = -\frac{1}{2} j(j+1) + nj - c_{21}$$

Notice that this cone contains $v_1$ since $v_1$ is proportional to $v_{20} + v_{21}$. Many discrete one dimensional exponential families (e.g., binomial, negative binomial, and Poisson) use the vector $v_1$. Furthermore, many continuous one dimensional exponential families use the continuous function $\bar{f}$ used to define $v_1$: normal with $\sigma$ known, and the gamma and inverse Gaussian distributions with known shape parameter (the shape parameter is the non-scale parameter). The cone defined by $v_{20}$ and $v_{21}$ allows us to perturb the $v_1$ direction to obtain related exponential families that we would expect to have similar properties. Figure 4 shows $v_{20}$ and $v_{21}$ as well as $v_1 = 0.5 v_{20} + 0.5 v_{21}$.

Other vectors can be used to define cones around $v_1$. Looking at common exponential families we see that $\log(x)$ and $x^{-1}$ are sufficient statistics so that these suggest taking $\bar{f}(x) = \log(x)$ or $\bar{f}(x) = 1/x$. These can be further generalized to $\bar{f}(x; \lambda)$, which can be the power family or some other family of transformations. The vectors $v_{f0}$ and $v_{f1}$ are defined using the discretized $f$ with the constraints that $v_{f0}, v_{f1} \in \Lambda^\uparrow$ and $0.5 v_{f0} + 0.5 v_{f1} = v_1$.
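The first and second order directions described above can be built explicitly for a small number of bins. The sketch below assumes that a vector $v$ is obtained from $f$ by subtracting the average of $f_1, \ldots, f_n$ (so that the components sum to zero), and takes the boundary choices $b = -a$ for $a = 1$ and $b = -an$ for $a = -1$ with the $(n-1)$ scaling shown in the displayed formulas. The function names and the choice $n = 8$ are hypothetical.

```python
def center(f):
    """Subtract the average so the components sum to zero (a vector in V)."""
    mean = sum(f) / len(f)
    return [fj - mean for fj in f]


def first_order(n):
    """Direction from f_j = j: components j - (n + 1) / 2."""
    return center([float(j) for j in range(1, n + 1)])


def second_order_pair(n):
    """Vectors v20 (a = 1, b = -1) and v21 (a = -1, b = n), scaled by 1/(n-1)."""
    f20 = [0.5 * j * (j + 1) - j for j in range(1, n + 1)]
    f21 = [-0.5 * j * (j + 1) + n * j for j in range(1, n + 1)]
    v20 = [x / (n - 1) for x in center(f20)]
    v21 = [x / (n - 1) for x in center(f21)]
    return v20, v21


def is_nondecreasing(v):
    """Weak monotonicity: v_i <= v_j whenever i < j."""
    return all(a <= b for a, b in zip(v, v[1:]))


n = 8
v1 = first_order(n)
v20, v21 = second_order_pair(n)
combo = [a + b for a, b in zip(v20, v21)]  # sum of the two edge vectors
```

With this scaling `combo` coincides with `v1`, which is one way to see that $v_1$ is proportional to $v_{20} + v_{21}$. Note that in this sketch `v21` is only weakly increasing: its last two components are equal, so it sits on the boundary of $\Lambda^\uparrow$ rather than in the set defined by strict inequalities.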
An exponential family with sufficient statistic $x$ can be modified by choosing a function $\bar{f}(x)$ and $0 \le \alpha \le 1$ where $\alpha = 0.5$ corresponds to the original exponential family and other values perturb this direction. We denote this vector as $v_{f\alpha}$ so that $v_0 + t v_{f\alpha}$ is the natural parameter of the modified family. Figure 4 shows the components of the vectors $v_{20}$ and $v_{21}$.

Figure 4. Components of the vectors $v_{20}$ and $v_{21}$ for $n = 128$ bins.

Since $v_0$ is common to each exponential family with natural parameter $\ell(t) = v_0 + t v_{f\alpha}$, the monotone likelihood ratio property will hold even if $v_0 \notin \Lambda^\uparrow$. Initial choices for $v_0$ are suggested by the Poisson, binomial, and negative binomial distributions:

$$(v_{\text{Poisson}})_j = -\log\Gamma(j) + c \notin \Lambda^\uparrow$$

$$(v_{\text{binomial}})_j = \log\Gamma(n) - \log\Gamma(j) - \log\Gamma(n-j) + c \notin \Lambda^\uparrow$$

$$(v_{\text{neg.bin.}})_j = \log\Gamma(j+r) - \log\Gamma(j) + c \in \Lambda^\uparrow$$

where $c$ is a constant chosen so that the components sum to 1, $n$ is the number of bins, and $r$ is a positive real constant.

Author Contributions: This paper was initiated by the first author but all sections reflect a collaborative effort. Both authors have read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Altham, P.M.E. Two Generalizations of the Binomial Distribution. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1978, 27, 162–167.
2. Lovison, G. An alternative representation of Altham’s multiplicative-binomial distribution. Stat. Probab. Lett. 1998, 36, 415–420.
3. Brown, L. Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory; IMS Lecture Notes; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
4. Efron, B. Double Exponential Families and Their Use in Generalized Linear Regression. J. Am. Stat. Assoc. 1986, 81, 709–721.
5. Aitchison, J.
The Statistical Analysis of Compositional Data; Chapman and Hall: London, UK, 1986.
6. Wu, Q.; Vos, P. Decomposition of Kullback–Leibler risk and unbiasedness for parameter-free estimators. J. Stat. Plan. Inference 2012, 142, 1525–1536.
7. Vos, P.; Wu, Q. Maximum Likelihood Estimators Uniformly Minimize Distribution Variance among Distribution Unbiased Estimators in Exponential Families. Bernoulli 2014, submitted.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

On the Fisher Metric of Conditional Probability Polytopes

Guido Montúfar 1,*, Johannes Rauh 1 and Nihat Ay 1,2,3

1 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig 04103, Germany; E-Mails: jrauh@mis.mpg.de (J.R.); nay@mis.mpg.de (N.A.)
2 Department of Mathematics and Computer Science, Leipzig University, PF 10 09 20, Leipzig 04009, Germany
3 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
* E-Mail: montufar@mis.mpg.de; Tel.: +49-341-9959-521.

Received: 31 March 2014; in revised form: 18 May 2014 / Accepted: 29 May 2014 / Published: 6 June 2014

Abstract: We consider three different approaches to define natural Riemannian metrics on polytopes of stochastic matrices. First, we define a natural class of stochastic maps between these polytopes and give a metric characterization of Chentsov type in terms of invariance with respect to these maps. Second, we consider the Fisher metric defined on arbitrary polytopes through their embeddings as exponential families in the probability simplex. We show that these metrics can also be characterized by an invariance principle with respect to morphisms of exponential families.
Third, we consider the Fisher metric resulting from embedding the polytope of stochastic matrices in a simplex of joint distributions by specifying a marginal distribution. All three approaches result in slight variations of products of Fisher metrics. This is consistent with the nature of polytopes of stochastic matrices, which are Cartesian products of probability simplices. The first approach yields a scaled product of Fisher metrics; the second, a product of Fisher metrics; and the third, a product of Fisher metrics scaled by the marginal distribution. Keywords: Fisher information metric; information geometry; convex support polytope; conditional model; Markov morphism; isometric embedding; natural gradient 1. Introduction The Riemannian structure of a function’s domain has a crucial impact on the performance of gradient optimization methods, especially in the presence of plateaus and local maxima. The natural gradient [1] gives the steepest increase direction of functions on a Riemannian space. For example, artificial neural networks can often be trained by following some function’s gradient on a space of probabilities. In this context, it has been observed that following the natural gradient with respect to the Fisher information metric, instead of the Euclidean metric, can significantly alleviate the plateau problem [1,2]. The Fisher information metric, which is also called Shahshahani metric [3] in biological contexts, is broadly recognized as the natural metric of probability spaces. An important argument was given by Chentsov [4], who showed that the Fisher information metric is the only metric on probability spaces for which certain natural statistical embeddings, called Markov morphisms, are isometries. More generally, Chentsov’s Theorem characterizes the Fisher metric and α-connections of statistical manifolds uniquely (up to a multiplicative constant) by requiring invariance with respect to Markov morphisms. 
Campbell [5] gave another proof that characterizes invariant metrics on the set of non-normalized positive measures, which restrict to the Fisher metric in the case of probability measures (up to a multiplicative constant). In this paper, we explore ways of defining distinguished Riemannian metrics on spaces of stochastic matrices.

Entropy 2014, 16, 3207–3233; doi:10.3390/e16063207 www.mdpi.com/journal/entropy

In learning theory, when modeling the policy of a system, it is often preferred to consider stochastic matrices instead of joint probability distributions. For example, in robotics applications, policies are optimized over a parametric set of stochastic matrices by following the gradient of a reward function [6,7]. The set of stochastic matrices can be parametrized in many ways, e.g., in terms of feedforward neural networks, Boltzmann machines [8] or projections of exponential families [9]. The information geometry of policy models plays an important role in these applications and has been studied by Kakade [2], Peters and co-workers [10–12], and Bagnell and Schneider [13], among others.

A stochastic matrix is a tuple of probability distributions, and therefore, the space of stochastic matrices is a Cartesian product of probability simplices. Accordingly, in applications, usually a product metric is considered, with the usual Fisher metric on each factor. On the other hand, Lebanon [14] takes an axiomatic approach, following the ideas of Chentsov and Campbell, and characterizes a class of invariant metrics of positive matrices that restricts to the product of Fisher metrics in the case of stochastic matrices. We will consider three different approaches discussed in the following.

In the first part, we take another look at Lebanon’s approach for characterizing a distinguished metric on polytopes of stochastic matrices.
However, since the maps considered by Lebanon do not map stochastic matrices to stochastic matrices, we will use different maps. We show that the product of Fisher metrics can be characterized by an invariance principle with respect to natural maps between stochastic matrices. In the second part, we consider an approach that allows us to define Riemannian structures on arbitrary polytopes. Any polytope can be identified with an exponential family by using the coordinates of the polytope vertices as observables. The inverse of the moment map then defines an embedding of the polytope in a probability simplex. This embedding can be used to pull back geometric structures from the probability simplex to the polytope, including Riemannian metrics, affine connections, divergences, etc. This approach has been considered in [9] as a way to define low-dimensional families of conditional probability distributions. More general embeddings can be defined by identifying each exponential family with a point configuration, B, together with a weight function, ν. Given B and ν, the corresponding exponential family defines geometric structures on the set (conv B)◦, which is the relative interior of the convex support of the exponential family. Moreover, we can define natural morphisms between weighted point configurations as surjective maps between the point sets, which are compatible with the weight functions. As it turns out, the Fisher metric on (conv B)◦ can be characterized by invariance under these maps. In the third part, we return to stochastic matrices. We study natural embeddings of conditional distributions in probability simplices as joint distributions with a fixed marginal. These embeddings define a Fisher metric equal to a weighted product of Fisher metrics. This result corresponds to the Definitions commonly used in robotics applications. All three approaches give very similar results. In all cases, the identified metric is a product metric. 
This is a sensible result, since the set of $k \times m$ stochastic matrices is a Cartesian product of probability simplices $\Delta_{m-1} \times \cdots \times \Delta_{m-1} = \Delta^k_{m-1}$, which suggests using the product metric of the Fisher metrics defined on the factor simplices, $\Delta_{m-1}$. Indeed, this is the result obtained from our second approach. The first approach yields that same result with an additional scaling factor of $1/k$. The two approaches differ only when stochastic matrices of different sizes are compared. The third approach yields a product of Fisher metrics scaled by the marginal distribution that defines the embedding.

Which metric to use depends on the concrete problem and whether a natural marginal distribution is defined and known. In Section 7, we do a case study using a reward function that is given as an expectation value over a joint distribution. In this simple example, the weighted product metric gives the best asymptotic rate of convergence, under the assumption that the weights are optimally chosen. In Section 8, we sum up our findings.

The paper is organized as follows. Section 2 contains basic definitions around the Fisher metric and concepts of differential geometry. In Section 3, we discuss the theorems of Chentsov, Campbell and Lebanon, which characterize natural geometric structures on the probability simplex, on the set of positive measures and on the cone of positive matrices, respectively. In Section 4, we study metrics on polytopes of stochastic matrices, which are invariant under natural embeddings. In Section 5, we define a Riemannian structure for polytopes, which generalizes the Fisher information metric of probability simplices and conditional models in a natural way. In Section 6, we study a class of weighted product metrics. In Section 7, we study the gradient flow with respect to an expectation value. Section 8 contains concluding remarks.
In Appendix A, we investigate restrictions on the parameters of the metrics characterized in Sections 3 and 4 that make them positive definite. Appendix B contains the proofs of the results from Section 4.

2. Preliminaries

We will consider the simplex of probability distributions on $[m] := \{1, \dots, m\}$, $m \ge 2$, which is given by $\Delta_{m-1} := \{(p_i)_i \in \mathbb{R}^m : p_i \ge 0, \sum_i p_i = 1\}$. The relative interior of $\Delta_{m-1}$ consists of all strictly positive probability distributions on $[m]$ and will be denoted $\Delta^\circ_{m-1}$. This is a subset of $\mathbb{R}^m_+$, the cone of strictly positive vectors. The set of $k \times m$ row-stochastic matrices is given by $\Delta^k_{m-1} := \{(K_{ij})_{ij} \in \mathbb{R}^{k \times m} : (K_{ij})_j \in \Delta_{m-1} \text{ for all } i \in [k]\}$ and is equal to the Cartesian product $\times_{i \in [k]} \Delta_{m-1}$. The relative interior $(\Delta^k_{m-1})^\circ$ is a subset of $\mathbb{R}^{k \times m}_+$, the cone of strictly positive matrices.

Given two random variables $X$ and $Y$ taking values in the finite sets $[k]$ and $[m]$, respectively, the conditional probability distribution of $Y$ given $X$ is the stochastic matrix $K = (P(y|x))_{x \in [k], y \in [m]}$ with rows $(P(y|x))_{y \in [m]} \in \Delta_{m-1}$ for all $x \in [k]$. Therefore, the polytope of stochastic matrices $\Delta^k_{m-1}$ is called a conditional polytope.

The tangent space of $\mathbb{R}^n_+$ at a point $p \in \mathbb{R}^n_+$, denoted by $T_p\mathbb{R}^n_+$, is the real vector space spanned by the vectors $\partial_1, \dots, \partial_n$ of partial derivatives with respect to the $n$ components. The tangent space of $\Delta^\circ_{n-1}$ at a point $p \in \Delta^\circ_{n-1} \subset \mathbb{R}^n_+$ is the subspace $T_p\Delta^\circ_{n-1} \subset T_p\mathbb{R}^n_+$ consisting of the vectors:

$$u = \sum_i u_i \partial_i \in T_p\mathbb{R}^n_+ \quad \text{with} \quad \sum_i u_i = 0. \tag{1}$$

The Fisher metric on the positive probability simplex $\Delta^\circ_{n-1}$ is the Riemannian metric given by:

$$g^{(n)}_p(u, v) = \sum_{i=1}^n \frac{u_i v_i}{p_i}, \quad \text{for all } u, v \in T_p\Delta^\circ_{n-1}. \tag{2}$$

The same formula (2) also defines a Riemannian metric on $\mathbb{R}^n_+$, which we will denote by the same symbol. This, however, is not the only way in which the Fisher metric can be extended from $\Delta^\circ_{n-1}$ to $\mathbb{R}^n_+$. We will discuss other extensions in the next section (see Campbell’s Theorem, Theorem 2).
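To make Equation (2) concrete, the following minimal Python sketch evaluates the Fisher metric at a point of the open simplex. The function name `fisher_metric` and the tolerance argument `tol` are illustrative choices and not notation from the paper; the tangent-space condition from Equation (1) is checked explicitly.

```python
def fisher_metric(p, u, v, tol=1e-12):
    """Evaluate g_p(u, v) = sum_i u_i * v_i / p_i, as in Equation (2).

    p must be a strictly positive probability vector; u and v must be
    tangent vectors, i.e. their components sum to zero (Equation (1)).
    The name and the tolerance are assumptions made for this sketch.
    """
    if any(pi <= 0 for pi in p) or abs(sum(p) - 1.0) > tol:
        raise ValueError("p must lie in the open probability simplex")
    if abs(sum(u)) > tol or abs(sum(v)) > tol:
        raise ValueError("tangent vectors must have components summing to zero")
    return sum(ui * vi / pi for ui, vi, pi in zip(u, v, p))


p = [0.5, 0.25, 0.25]
u = [1.0, -1.0, 0.0]
v = [1.0, 0.0, -1.0]
# Only the first coordinate contributes here: 1 * 1 / 0.5.
print(fisher_metric(p, u, v))
```

The same function applied to vectors that do not sum to zero would correspond to the extension of formula (2) to the positive cone; the sketch rejects such inputs only to stay on the simplex.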
Consider a smoothly parametrized family of probability distributions $M = \{(p(x; \theta))_{x \in [n]} : \theta \in \Omega\} \subseteq \Delta^\circ_{n-1}$, where $\Omega \subseteq \mathbb{R}^d$ is open. Then, $g^{(n)}$ induces a Riemannian metric on $M$. Denote by $\partial_{\theta_i} = \frac{\partial}{\partial \theta_i}$ the tangent vector corresponding to the partial derivative with respect to $\theta_i$, for all $i \in [d]$. Then, the Fisher matrix has coordinates:

$$g^M_\theta(\partial_{\theta_i}, \partial_{\theta_j}) = \sum_{x \in [n]} p(x; \theta) \frac{\partial \log p(x; \theta)}{\partial \theta_i} \frac{\partial \log p(x; \theta)}{\partial \theta_j}, \quad \text{for all } i, j \in [d], \text{ for all } \theta \in \Omega. \tag{3}$$

Here, it is not necessary to assume that the parameters $\theta_i$ are independent. In particular, the dimension of $M$ may be smaller than $d$, in which case the matrix is not positive definite. If the map $\Omega \to M$, $\theta \mapsto p(\cdot; \theta)$ is an embedding (i.e., a smooth injective map that is a diffeomorphism onto its image), then $g^M_\theta$ defines a Riemannian metric on $\Omega$, which corresponds to the pull-back of $g^{(n)}$.

Consider an embedding $f : E \to E'$. The pull-back of a metric $g'$ on $E'$ through $f$ is defined as:

$$(f^* g')_p(u, v) := g'_{f(p)}(f_* u, f_* v), \quad \text{for all } u, v \in T_pE, \tag{4}$$

where $f_*$ denotes the push-forward of $T_pE$ through $f$, which in coordinates is given by:

$$f_* : T_pE \to T_{f(p)}E'; \quad \sum_i u_i \partial_{\theta_i} \mapsto \sum_j \sum_i u_i \frac{\partial f_j(p)}{\partial \theta_i} \partial_{\theta'_j}, \tag{5}$$

where $\{\partial_{\theta_i}\}_i$ spans $T_pE$ and $\{\partial_{\theta'_j}\}_j$ spans $T_{f(p)}E'$. An embedding $f : E \to E'$ of two Riemannian manifolds $(E, g)$ and $(E', g')$ is an isometry if and only if:

$$g_p(u, v) = (f^* g')_p(u, v), \quad \text{for all } p \in E \text{ and } u, v \in T_pE. \tag{6}$$

In this case, we say that the metric $g$ is invariant with respect to $f$ (and $g'$).

3. The Results of Campbell and Lebanon

One of the theoretical motivations for using the Fisher metric is provided by Chentsov’s characterization [4], which states that the Fisher metric is uniquely specified, up to a multiplicative constant, by an invariance principle under a class of stochastic maps, called Markov morphisms. Later, Campbell [5] considered the characterization problem on the space $\mathbb{R}^n_+$ instead of $\Delta^\circ_{n-1}$. This simplifies the computations, since $\mathbb{R}^n_+$ has a more symmetric parametrization.

Definition 1. Let $2 \le m \le n$.
A (row) stochastic partition matrix (or just row-partition matrix) is a matrix $Q \in \mathbb{R}^{m \times n}$ of non-negative entries, which satisfies $\sum_{j \in A_{i'}} Q_{ij} = \delta_{ii'}$ for an $m$-block partition $\{A_1, \dots, A_m\}$ of $[n]$. The linear map defined by:

$$\mathbb{R}^m_+ \to \mathbb{R}^n_+; \quad p \mapsto p \cdot Q \tag{7}$$

is called a congruent embedding by a Markov mapping of $\mathbb{R}^m_+$ to $\mathbb{R}^n_+$ or just a Markov map, for short.

An example of a $3 \times 5$ row-partition matrix is:

$$Q = \begin{pmatrix} 1/2 & 0 & 1/2 & 0 & 0 \\ 0 & 1/3 & 0 & 2/3 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{8}$$

Markov maps preserve the 1-norm and restrict to embeddings $\Delta^\circ_{m-1} \to \Delta^\circ_{n-1}$.

Theorem 1 (Chentsov’s Theorem).
• Let $g^{(m)}$ be a Riemannian metric on $\Delta^\circ_{m-1}$ for $m \in \{2, 3, \dots\}$. Let this sequence of metrics have the property that every congruent embedding by a Markov mapping is an isometry. Then, there is a constant $C > 0$ that satisfies:

$$g^{(m)}_p(u, v) = C \sum_i \frac{u_i v_i}{p_i}. \tag{9}$$

• Conversely, for any $C > 0$, the metrics given by Equation (9) define a sequence of Riemannian metrics under which every congruent embedding by a Markov mapping is an isometry.

The main result in Campbell’s work [5] is the following variant of Chentsov’s Theorem.

Theorem 2 (Campbell’s Theorem).
• Let $g^{(m)}$ be a Riemannian metric on $\mathbb{R}^m_+$ for $m \in \{2, 3, \dots\}$. Let this sequence of metrics have the property that every embedding by a Markov mapping is an isometry. Then:

$$g^{(m)}_p(\partial_i, \partial_j) = A(|p|) + \delta_{ij} C(|p|) \frac{|p|}{p_i}, \tag{10}$$

where $|p| = \sum_{i=1}^m p_i$, $\delta_{ij}$ is the Kronecker delta, and $A$ and $C$ are $C^\infty$ functions on $\mathbb{R}_+$ satisfying $C(\alpha) > 0$ and $A(\alpha) + C(\alpha) > 0$ for all $\alpha > 0$.
• Conversely, if $A$ and $C$ are $C^\infty$ functions on $\mathbb{R}_+$ satisfying $C(\alpha) > 0$, $A(\alpha) + C(\alpha) > 0$ for all $\alpha > 0$, then Equation (10) defines a sequence of Riemannian metrics under which every embedding by a Markov mapping is an isometry.

The metrics from Campbell’s Theorem also define metrics on the probability simplices $\Delta^\circ_{m-1}$ for $m = 2, 3, \dots$. Since the tangent vectors $v = \sum_i v_i \partial_i \in T_p\Delta^\circ_{m-1}$ satisfy $\sum_i v_i = 0$, for any two vectors $u, v \in T_p\Delta^\circ_{m-1}$, also $\sum_i \sum_j A u_i v_j = 0$ for any $A$.
In this case, the choice of $A$ is immaterial, and the metric becomes Chentsov’s metric.

Remark 1. Observe that Chentsov’s Theorem is not a direct implication of Campbell’s Theorem. However, it can be deduced from it by the following arguments. Suppose that we have a family of Riemannian simplices $(\Delta^\circ_{m-1}, g^{(m)})$ for $m \in \{2, 3, \dots\}$, and suppose that they are isometric with respect to Markov maps. If we can extend every $g^{(m)}$ to a Riemannian metric $\tilde{g}^{(m)}$ on $\mathbb{R}^m_+$ in such a way that the resulting spaces $(\mathbb{R}^m_+, \tilde{g}^{(m)})$ are still isometric with respect to Markov maps, then Campbell’s Theorem implies that $g^{(m)}$ is a multiple of the Fisher metric. Such metric extensions can be defined as follows. Consider the diffeomorphism:

$$\Delta^\circ_{m-1} \times \mathbb{R}_+ \cong \mathbb{R}^m_+, \quad (p, r) \mapsto r \cdot p. \tag{11}$$

Any tangent vector $u \in T_{(p,r)}\mathbb{R}^m_+$ can be written uniquely as $u = u_p + u_r \partial_r$, where $u_p$ is tangent to $r\Delta^\circ_{m-1}$. Since each Markov map $f$ preserves the one-norm $|\cdot|$, its push-forward $f_*$ maps the tangent vector $\partial_r \in T_{(p,r)}\mathbb{R}^m_+$ to the corresponding tangent vector $\partial_r \in T_{f(p,r)}\mathbb{R}^m_+$; that is, $f_* u = f_* u_p + u_r \partial_r$. Therefore,

$$\tilde{g}^{(m)}_{(p,r)}(u, v) := g^{(m)}_p(u_p, v_p) + u_r v_r \tag{12}$$

is a metric on $\mathbb{R}^m_+$ that is invariant under $f$.

In what follows, we will focus on positive matrices. In order to define a natural Riemannian metric, we can use the identification $\mathbb{R}^{k \times m}_+ \cong \mathbb{R}^{km}_+$ and apply Campbell’s Theorem. This leads to metrics of the form:

$$g^{(k,m)}_M(\partial_{ij}, \partial_{kl}) = A(|M|) + \delta_{ik}\delta_{jl} \frac{C(|M|)}{M_{ij}}, \tag{13}$$

where $\partial_{ij} = \frac{\partial}{\partial M_{ij}}$ and $|M| = \sum_{ij} M_{ij}$. However, a disadvantage of this approach is that the action of general Markov maps on $\mathbb{R}^{km}_+$ has no natural interpretation in terms of the matrix structure. Therefore, Lebanon [14] considered a special class of Markov maps defined as follows.

Definition 2. Consider a $k \times l$ row-partition matrix $R$ and a collection of $m \times n$ row-partition matrices $Q = \{Q^{(1)}, \dots, Q^{(k)}\}$. The map:

$$\mathbb{R}^{k \times m}_+ \to \mathbb{R}^{l \times n}_+; \quad M \mapsto R^\top (M \otimes Q) \tag{14}$$

is called a congruent embedding by a Markov morphism of $\mathbb{R}^{k \times m}_+$ to $\mathbb{R}^{l \times n}_+$ in [15].
We will refer to such an embedding as a Lebanon map. Here, the row product $M \otimes Q$ is defined by:

$$(M \otimes Q)_{ab} = (M \cdot Q^{(a)})_{ab}, \quad \text{for all } a \in [k], b \in [n]; \tag{15}$$

that is, the $a$-th row of $M$ is multiplied by the matrix $Q^{(a)}$.

In a Lebanon map, each row of the input matrix $M$ is mapped by an individual Markov mapping $Q^{(i)}$, and each resulting row is copied and scaled by an entry of $R$. This kind of map preserves the sum of all matrix entries. Therefore, with the identification $\mathbb{R}^{k \times m}_+ \cong \mathbb{R}^{km}_+$, each Lebanon map restricts to a map $\Delta^\circ_{mk-1} \to \Delta^\circ_{nl-1}$. The set $\Delta^\circ_{mk-1}$ can be identified with the set of joint distributions of two random variables. Lebanon maps can be regarded as special Markov maps that incorporate the product structure present in the set of joint probability distributions of a pair of random variables. In Section 4, we will give an interpretation of these maps.

Contrary to what is stated in [15], a Lebanon map does not map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$, unless $k = l$. Therefore, later, we will provide a characterization for the metrics on $(\Delta^k_{m-1})^\circ$ in terms of invariance under other maps (which are neither Markov nor Lebanon maps).

The main result in Lebanon’s work [15, Theorems 1 and 2] is the following.

Theorem 3 (Lebanon’s Theorem).
• For each $k \ge 1$, $m \ge 2$, let $g^{(k,m)}$ be a Riemannian metric on $\mathbb{R}^{k \times m}_+$ in such a way that every Lebanon map is an isometry. Then:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac} \left( \frac{B(|M|)}{|M_a|} + \delta_{bd} \frac{C(|M|)}{M_{ab}} \right) \tag{16}$$

for some functions $A, B, C \in C^\infty(\mathbb{R}_+)$, where $|M_a| = \sum_b M_{ab}$ denotes the sum of the $a$-th row of $M$.
• Conversely, let $\{(\mathbb{R}^{k \times m}_+, g^{(k,m)})\}$ be a sequence of Riemannian manifolds, with metrics $g^{(k,m)}$ of the form (16) for some $A, B, C \in C^\infty(\mathbb{R}_+)$. Then, every Lebanon map is an isometry.

Lebanon does not study the question of under which assumptions on $A, B, C \in C^\infty(\mathbb{R}_+)$ the formula (16) does indeed define a Riemannian metric. This question has the following simple answer, which we will prove in Appendix A:

Proposition 1.
The matrix (16) is positive definite if and only if $C(|M|) > 0$, $B(|M|) + C(|M|) > 0$ and $A(|M|) + B(|M|) + C(|M|) > 0$.

The class of metrics (16) is larger than the class of metrics (13) derived in Campbell’s Theorem. The reason is that Campbell’s metrics are invariant with respect to a larger class of embeddings.

The special case with $A(|M|) = 0$, $B(|M|) = 0$ and $C(|M|) = 1$ is called the product Fisher metric,

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac}\delta_{bd} \frac{1}{M_{ab}}. \tag{17}$$

Furthermore, if we restrict to $(\Delta^k_{m-1})^\circ$, the functions $A$ and $B$ do not play any role. In this case $|M| = k$, and we obtain the scaled product Fisher metric:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \delta_{ac}\delta_{bd} \frac{C(k)}{M_{ab}}, \tag{18}$$

where $C : \mathbb{N} \to \mathbb{R}_+$ is a positive function.

As mentioned before, Lebanon’s Theorem does not give a characterization of invariant metrics of stochastic matrices, since Lebanon maps do not preserve the stochasticity of the matrices. However, Lebanon maps are natural maps on the set $\Delta^\circ_{mk-1}$ of positive joint distributions. In the same way as Chentsov’s Theorem can be derived from Campbell’s Theorem (see Remark 1), we obtain the following Corollary:

Corollary 1.
• Let $\{(\Delta^\circ_{km-1}, g^{(k,m)}) : k \ge 1, m \ge 2\}$ be a double sequence of Riemannian manifolds with the property that every Lebanon map is an isometry. Then:

$$g^{(k,m)}_P(u, v) = B \sum_a \sum_{b,c} \frac{u_{ab} v_{ac}}{|P_a|} + C \sum_a \sum_b \frac{u_{ab} v_{ab}}{P_{ab}}, \quad \text{for each } P \in \Delta^\circ_{km-1}, \tag{19}$$

for some constants $B, C \in \mathbb{R}$ with $C > 0$ and $B + C > 0$, where $|P_a| = \sum_b P_{ab}$.
• Conversely, let $\{(\Delta^\circ_{km-1}, g^{(k,m)})\}$ be a sequence of Riemannian manifolds with metrics $g^{(k,m)}$ of the form of Equation (19) for some $B, C \in \mathbb{R}$. Then, every Lebanon map is an isometry.

Observe that these metrics agree with (a multiple of) the Fisher metric only if $B = 0$. The case $B = 0$ can also be characterized; note that Lebanon maps do not treat the two random variables symmetrically. Switching the two random variables corresponds to transposing the joint distribution matrix $P$.
When exchanging the role of the two random variables, the Lebanon map becomes $P \mapsto (P^\top \otimes Q)^\top R$. We call such a map a dual Lebanon map. If we require invariance under both Lebanon maps and their duals in Theorem 3 or Corollary 1, the statements remain true with the additional restriction that $B = 0$ (as a function or constant, respectively).

4. Invariance Metric Characterizations for Conditional Polytopes

According to Chentsov’s Theorem (Theorem 1), a natural metric on the probability simplex can be characterized by requiring the isometry of natural embeddings. Lebanon follows this axiomatic approach to characterize metrics on products of positive measures (Theorem 3). However, the maps considered by Lebanon dissolve the row-normalization of conditional distributions. In general, they do not map conditional polytopes to conditional polytopes. Therefore, we will consider a slight modification of Lebanon maps, in order to obtain maps between conditional polytopes.

4.1. Stochastic Embeddings of Conditional Polytopes

A matrix of conditional distributions $P(Y|X)$ in $\Delta^k_{m-1}$ can be regarded as the equivalence class of all joint probability distributions $P(X, Y) \in \Delta_{km-1}$ with conditional distribution $P(Y|X)$. Which Markov maps of probability simplices are compatible with this equivalence relation? The most obvious examples are permutations (relabelings) of the state spaces of $X$ and $Y$.

In information theory, stochastic matrices are also viewed as channels. For any distribution of $X$, the stochastic matrix gives us a joint distribution of the pair $(X, Y)$ and, hence, a marginal distribution of $Y$. If we input a distribution of $X$ into the channel, the stochastic matrix determines what the distribution of the output $Y$ will be.

Channels can be combined, provided the cardinalities of the state spaces fit together. If we take the output $Y$ of the first channel $P(Y|X)$ and feed it into another channel $P(Y'|Y)$, then we obtain a combined channel $P(Y'|X)$.
The composition of channels corresponds to ordinary matrix multiplication. If the first channel is described by the stochastic matrix $K$ and the second channel by $Q$, then the combined channel is described by $K \cdot Q$. Observe that in this case, the joint distribution $P$ (considered as a normalized matrix $P \in \Delta_{km-1}$) is transformed similarly; that is, the joint distribution of the pair $(X, Y')$ is given by $P \cdot Q$.

More general maps result from compositions where the choice of the second channel depends on the input of the first channel. In other words, we have a first channel that takes as input $X$ and gives as output $Y$, and we have another channel that takes as input $(X, Y)$ and gives as output $Y'$; we are interested in the resulting channel from $X$ to $Y'$. The second channel can be described by a collection of stochastic matrices $Q = \{Q^{(i)}\}_i$. If $K$ describes the first channel, then the combined channel is described by the row product $K \otimes Q$ (see Definition 2). Again, the joint distribution of $(X, Y')$ arises in a similar way as $P \otimes Q$.

We can also consider transformations of the first random variable $X$. Suppose that we use $X$ as the input to a channel described by a stochastic matrix $R$. In this case, the joint distribution of the output $X'$ of the channel and $Y$ is described by $R^\top P$. However, in general, there is not much that we can say about the conditional distribution of $Y$ given $X'$. The result depends in an essential way on the original distribution of $X$. However, this is not true in the special case that the channel is “not mixing”, that is, in the case that $R$ is a stochastic partition matrix. In this case, the conditional distribution $P(Y|X')$ is described by $\bar{R}^\top K$, where $\bar{R}$ is the corresponding partition indicator matrix, in which all non-zero entries of $R$ are replaced by one. In other words, each state of $X$ corresponds to several states of $X'$, and the corresponding row of $K$ is copied a corresponding number of times.
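The three operations just described (composition with a fixed second channel, the row product with an input-dependent second channel, and the copying of rows by a partition indicator matrix) can be sketched in a few lines of Python. The helper names `matmul`, `row_product` and `transpose`, as well as the example matrices, are assumptions made for illustration only; matrices are stored as nested lists with rows indexed by the input state.

```python
def matmul(A, B):
    """Ordinary matrix product of two nested-list matrices."""
    return [[sum(A[i][c] * B[c][j] for c in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def row_product(K, Qs):
    """Row product: row a of the result is row a of K times the matrix Qs[a]."""
    return [matmul([K[a]], Qs[a])[0] for a in range(len(K))]


def transpose(A):
    return [list(col) for col in zip(*A)]


# First channel K (2 inputs, 2 outputs) and two stochastic partition matrices.
K = [[0.5, 0.5], [0.25, 0.75]]
Q0 = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
Q1 = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

composed = matmul(K, Q0)               # same second channel for every input x
dependent = row_product(K, [Q0, Q1])   # second channel chosen by the input x

# Partition indicator matrix for the partition {1, 2}, {3} of three new states.
R_ind = [[1, 1, 0], [0, 0, 1]]
copied = matmul(transpose(R_ind), dependent)  # rows of the conditional are copied

for row in composed + dependent + copied:
    assert abs(sum(row) - 1.0) < 1e-12  # every result is again row-stochastic
```

In this sketch the first row of `dependent` appears twice in `copied`, which is the "copying" of rows mentioned above.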
To sum up, if we combine the transformations due to $Q$ and $R$, then the joint probability distribution transforms as $P \mapsto R^\top (P \otimes Q)$ and the conditional transforms as $K \mapsto \bar{R}^\top (K \otimes Q)$. In particular, for the joint distribution, we obtain the definition of a Lebanon map. Figure 1 illustrates the situation.

[Figure 1: diagram with nodes $X$, $Y$, $X'$, $Y'$ and arrows labeled $K$, $K'$, $R$, $Q$. Joint distributions: $P' = R^\top (P \otimes Q)$; conditional distributions: $K' = \bar{R}^\top (K \otimes Q)$.]

Figure 1. An interpretation for Lebanon maps and conditional embeddings. The variable $X'$ is computed from $X$ by $R$, and $Y'$ is computed from $X$ and $Y$ by $Q$.

Finally, we will also consider the special case where the partition of $R$ (and $\bar{R}$) is homogeneous, i.e., such that all blocks have the same size. For example, this describes the case where there is a third random variable $Z$ that is independent of $Y$ given $X$. In this case, the conditional distribution satisfies $P(Y|X) = P(Y|X, Z)$, and $R$ describes the conditional distribution of $(X, Z)$ given $X$.

Definition 3. A (row) partition indicator matrix is a matrix $\bar{R} \in \{0, 1\}^{k \times l}$ that satisfies:

$$\bar{R}_{ij} = \begin{cases} 1, & \text{if } j \in A_i, \\ 0, & \text{else}, \end{cases} \tag{20}$$

for a $k$-block partition $\{A_1, \dots, A_k\}$ of $[l]$.

For example, the $3 \times 5$ partition indicator matrix corresponding to Equation (8) is:

$$\bar{R} = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{21}$$

Definition 4. Consider a $k \times l$ partition indicator matrix $\bar{R}$ and a collection of $m \times n$ stochastic partition matrices $Q = \{Q^{(i)}\}_{i=1}^k$. We call the map:

$$f : \mathbb{R}^{k \times m}_+ \to \mathbb{R}^{l \times n}_+; \quad M \mapsto \bar{R}^\top (M \otimes Q) \tag{22}$$

a conditional embedding of $\mathbb{R}^{k \times m}_+$ in $\mathbb{R}^{l \times n}_+$. We denote the set of all such maps by $\hat{\mathcal{F}}^{l,n}_{k,m}$. If $\bar{R}$ is the partition indicator matrix of a homogeneous partition (with partition blocks of equal cardinality), then we call $f$ a homogeneous conditional embedding. We denote the set of all such homogeneous conditional embeddings by $\mathcal{F}^{l,n}_{k,m}$ and assume that $l$ is a multiple of $k$.

Conditional embeddings preserve the 1-norm of the matrix rows; that is, the elements of $\hat{\mathcal{F}}^{l,n}_{k,m}$ map $(\Delta^k_{m-1})^\circ$ to $(\Delta^l_{n-1})^\circ$.
On the other hand, they do not preserve the 1-norm of the entire matrix. Conditional embeddings are Markov maps only when $k = l$, in which case they are also Lebanon maps.

4.2. Invariance Characterization

Considering the conditional embeddings discussed in the previous section, we obtain the following metric characterization.

Theorem 4.
• Let $g^{(k,m)}$ denote a metric on $\mathbb{R}^{k \times m}_+$ for each $k \ge 1$ and $m \ge 2$. If every homogeneous conditional embedding $f \in \mathcal{F}^{l,n}_{k,m}$ is an isometry with respect to these metrics, then:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac} \left( k\,\frac{B}{k^2} + \delta_{bd} \frac{|M|}{M_{ab}} \frac{C}{k^2} \right), \quad \text{for all } M \in \mathbb{R}^{k \times m}_+, \tag{23}$$

for some constants $A, B, C \in \mathbb{R}$, where $\partial_{ab} = \frac{\partial}{\partial M_{ab}}$ and $|M| = \sum_{ab} M_{ab}$.
• Conversely, given the metrics defined by Equation (23) for any non-degenerate choice of constants $A, B, C \in \mathbb{R}$, each homogeneous conditional embedding $f \in \mathcal{F}^{l,n}_{k,m}$, $k \le l$, $m \le n$, is an isometry.
• Moreover, the tensors $g^{(k,m)}$ from Equation (23) are positive-definite for all $k \ge 1$ and $m \ge 2$ if and only if $C > 0$, $B + C > 0$ and $A + B + C > 0$.

The proof of Theorem 4 is similar to the proof of the theorems of Chentsov, Campbell and Lebanon. Due to its technical nature, we defer it to Appendix B.

Now, consider the restriction of the metric $g^{(k,m)}$ to $(\Delta^k_{m-1})^\circ$. In this case, $|M| = k$. Since tangent vectors $v = \sum_{ab} v_{ab} \partial_{ab} \in T_M(\Delta^k_{m-1})^\circ$ satisfy $\sum_b v_{ab} = 0$ for all $a$, the constants $A$ and $B$ become immaterial, and the metric can be written as:

$$g^{(k,m)}_M(u, v) = \sum_{ab} |M| \frac{u_{ab} v_{ab}}{M_{ab}} \frac{C}{k^2} = \sum_{ab} \frac{u_{ab} v_{ab}}{M_{ab}} \frac{C}{k}, \quad \text{for all } u, v \in T_M(\Delta^k_{m-1})^\circ. \tag{24}$$

This metric is a specialization of the metric (18) derived by Lebanon (Theorem 3).

The statement of Theorem 4 becomes false if we consider general conditional embeddings instead of homogeneous ones:

Theorem 5. There is no family of metrics $g^{(k,m)}$ on $\mathbb{R}^{k \times m}_+$ (or on $(\Delta^k_{m-1})^\circ$) for each $k \ge 1$ and $m \ge 2$, for which every conditional embedding $f \in \hat{\mathcal{F}}^{l,n}_{k,m}$ is an isometry.
This negative result will become clearer from the perspective of Section 6: as we will show in Theorem 7, although there are no metrics that are invariant under all conditional embeddings, there are families of metrics (depending on a parameter, $\rho$) that transform covariantly (that is, in a well-defined manner) with respect to the conditional embeddings. We defer the proof of Theorem 5 to Appendix B.

5. The Fisher Metric on Polytopes and Point Configurations

In the previous section, we obtained distinguished Riemannian metrics on $\mathbb{R}^{k \times m}_+$ and $(\Delta^k_{m-1})^\circ$ by postulating invariance under natural maps. In this section, we take another viewpoint based on general considerations about Riemannian metrics on arbitrary polytopes. This is achieved by embedding each polytope in a probability simplex as an exponential family. We first recall the necessary background. In Section 5.2, we then present our general results, and in Section 5.3, we discuss the special case of conditional polytopes.

5.1. Exponential Families and Polytopes

Let $\mathcal{X}$ be a finite set and $A \in \mathbb{R}^{d \times \mathcal{X}}$ a matrix with columns $a_x$ indexed by $x \in \mathcal{X}$. It will be convenient to consider the rows $A_i$, $i \in [d]$, of $A$ as functions $A_i : \mathcal{X} \to \mathbb{R}$. Finally, let $\nu : \mathcal{X} \to \mathbb{R}_+$. The exponential family $\mathcal{E}_{A,\nu}$ is the set of probability distributions on $\mathcal{X}$ given by:

$$p(x; \theta) = \exp\left(\theta^\top a_x + \log(\nu(x)) - \log(Z(\theta))\right), \quad \text{for all } x \in \mathcal{X}, \text{ for all } \theta \in \mathbb{R}^d, \tag{25}$$

with the normalization function $Z(\theta) = \sum_{x' \in \mathcal{X}} \exp(\theta^\top a_{x'} + \log(\nu(x')))$. The functions $A_i$ are called the observables and $\nu$ the reference measure of the exponential family. When the reference measure $\nu$ is constant, $\nu(x) = 1$ for all $x \in \mathcal{X}$, we omit the subscript and write $\mathcal{E}_A$.

A direct calculation shows that the Fisher information matrix of $\mathcal{E}_{A,\nu}$ at a point $\theta \in \mathbb{R}^d$ has coordinates:

$$g^{\mathcal{E}_{A,\nu}}_\theta(\partial_{\theta_i}, \partial_{\theta_j}) = \operatorname{cov}_\theta(A_i, A_j), \quad \text{for all } i, j \in [d]. \tag{26}$$

Here, $\operatorname{cov}_\theta$ denotes the covariance computed with respect to the probability distribution $p(\cdot; \theta)$.
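The identity in Equation (26) can be checked numerically. The sketch below, in plain Python, builds a small exponential family following Equation (25), computes the Fisher matrix from Equation (3) using central finite differences of $\log p$, and compares it with the covariance matrix of the observables. The function names, the step size `h` and the example values of `A`, `nu` and `theta` are assumptions chosen for illustration.

```python
import math


def exp_family(theta, A, nu):
    """p(x; theta) proportional to nu(x) * exp(theta^T a_x), as in Equation (25)."""
    d, n = len(A), len(nu)
    w = [nu[x] * math.exp(sum(theta[i] * A[i][x] for i in range(d)))
         for x in range(n)]
    Z = sum(w)
    return [wx / Z for wx in w]


def covariance_matrix(theta, A, nu):
    """cov_theta(A_i, A_j), the right-hand side of Equation (26)."""
    p = exp_family(theta, A, nu)
    d, n = len(A), len(p)
    mean = [sum(p[x] * A[i][x] for x in range(n)) for i in range(d)]
    return [[sum(p[x] * (A[i][x] - mean[i]) * (A[j][x] - mean[j])
                 for x in range(n)) for j in range(d)] for i in range(d)]


def fisher_matrix_numeric(theta, A, nu, h=1e-5):
    """Equation (3), with the score approximated by central differences."""
    d = len(theta)
    p = exp_family(theta, A, nu)
    n = len(p)
    score = []
    for i in range(d):
        plus, minus = list(theta), list(theta)
        plus[i] += h
        minus[i] -= h
        pp, pm = exp_family(plus, A, nu), exp_family(minus, A, nu)
        score.append([(math.log(pp[x]) - math.log(pm[x])) / (2 * h)
                      for x in range(n)])
    return [[sum(p[x] * score[i][x] * score[j][x] for x in range(n))
             for j in range(d)] for i in range(d)]


# Hypothetical example: four states, two observables, non-constant reference measure.
A = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 1.0]]
nu = [1.0, 3.0, 3.0, 1.0]
theta = [0.2, -0.4]

fisher = fisher_matrix_numeric(theta, A, nu)
cov = covariance_matrix(theta, A, nu)
max_gap = max(abs(fisher[i][j] - cov[i][j]) for i in range(2) for j in range(2))
print(max_gap)  # expected to be small, limited by the finite-difference error
```

The finite-difference score is used here only to keep the check independent of the closed-form derivative; it is not meant as a recommended way to compute Fisher matrices.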
The convex support of $\mathcal{E}_{A,\nu}$ is defined as:

$$\operatorname{conv} A := \operatorname{conv}\{a_x : x \in \mathcal{X}\} = \left\{ \mathbb{E}_p[A] : p \in \Delta_{|\mathcal{X}|-1} \right\} = \left\{ \mathbb{E}_p[A] : p \in \overline{\mathcal{E}_{A,\nu}} \right\}, \tag{27}$$

where $\operatorname{conv} S$ is the set of all convex combinations of points in $S$. The moment map $\mu : p \in \Delta_{n-1} \mapsto A \cdot p \in \mathbb{R}^d$ restricts to a homeomorphism $\overline{\mathcal{E}_{A,\nu}} \to \operatorname{conv} A$; see [16]. Here, $\overline{\mathcal{E}_{A,\nu}}$ denotes the Euclidean closure of $\mathcal{E}_{A,\nu}$. The inverse of $\mu$ will be denoted by $\mu^{-1} : \operatorname{conv} A \to \overline{\mathcal{E}_{A,\nu}} \subseteq \Delta_{n-1}$. This gives a natural embedding of the polytope $\operatorname{conv} A$ in the probability simplex $\Delta_{|\mathcal{X}|-1}$. Note that the convex support is independent of the reference measure $\nu$. See [17] for more details.

5.2. Invariance Fisher Metric Characterizations for Polytopes

Let $P \subseteq \mathbb{R}^d$ be a polytope with $n$ vertices $a_1, \dots, a_n$. Let $A = (a_1, \dots, a_n)$ be the matrix with columns $a_i \in \mathbb{R}^d$ for all $i \in [n]$. Then, $\mathcal{E}_A \subseteq \Delta^\circ_{n-1}$ is an exponential family with convex support $P$. We will also denote this exponential family by $\mathcal{E}_P$. We can use the inverse of the moment map, $\mu^{-1}$, to pull back geometric structures on $\Delta^\circ_{n-1}$ to the relative interior $P^\circ$ of $P$.

Definition 5. The Fisher metric on $P^\circ$ is the pull-back of the Fisher metric on $\mathcal{E}_A \subseteq \Delta^\circ_{n-1}$ by $\mu^{-1}$.

Some obvious questions are: Why is this a natural construction? Which maps between polytopes are isometries between their Fisher metrics? Can we find a characterization of Chentsov type for this metric?

Affine maps are natural maps between polytopes. However, in order to obtain isometries, we need to impose some additional constraints. Consider two polytopes $P \subseteq \mathbb{R}^d$, $P' \subseteq \mathbb{R}^{d'}$ and an affine map $\varphi : \mathbb{R}^d \to \mathbb{R}^{d'}$ that satisfies $\varphi(P) \subseteq P'$. A natural condition in the context of exponential families is that $\varphi$ restricts to a bijection between the set $\operatorname{vert}(P)$ of vertices of $P$ and the set $\operatorname{vert}(P')$ of vertices of $P'$. In this case, $\mathcal{E}_{P'} \subseteq \mathcal{E}_P \subseteq \Delta^\circ_{n-1}$. Moreover, the moment map $\mu'$ of $P'$ factorizes through the moment map $\mu$ of $P$: $\mu' = \varphi \circ \mu$. Let $\varphi^{-1} = \mu \circ \mu'^{-1}$.
Then, the following diagram commutes:

$$\begin{array}{ccccc} P^\circ & \xrightarrow{\ \mu^{-1}\ } & \mathcal{E}_P & \subseteq & \Delta^\circ_{n-1} \\ \big\uparrow {\scriptstyle \varphi^{-1}} & & \cup & & \\ P'^\circ & \xrightarrow{\ \mu'^{-1}\ } & \mathcal{E}_{P'} & & \end{array} \tag{28}$$

It follows that $\varphi^{-1}$ is an isometry from $P'^\circ$ to its image in $P^\circ$. Observe that the inverse moment map itself arises in this way: In the diagram (28), if $P$ is equal to $\Delta_{n-1}$, then the upper moment map $\mu^{-1}$ is the identity map, and $\varphi^{-1}$ equals the inverse moment map $\mu'^{-1}$ of $P'$.

The constraint of mapping vertices to vertices bijectively is very restrictive. In order to consider a larger class of affine maps, we need to generalize our construction from polytopes to weighted point configurations.

Definition 6. A weighted point configuration is a pair $(A, \nu)$ consisting of a matrix $A \in \mathbb{R}^{d \times n}$ with columns $a_1, \dots, a_n$ and a positive weight function $\nu : \{1, \dots, n\} \to \mathbb{R}_+$ assigning a weight to each column $a_i$. The pair $(A, \nu)$ defines the exponential family $\mathcal{E}_{A,\nu}$. The $(A, \nu)$-Fisher metric on $(\operatorname{conv} A)^\circ$ is the pull-back of the Fisher metric on $\Delta^\circ_{n-1}$ through the inverse of the moment map.

We recover Definition 5 as follows. For a polytope $P$, let $A$ be the point configuration consisting of the vertices of $P$. Moreover, let $\nu$ be a constant function. Then, $\mathcal{E}_P = \mathcal{E}_{A,\nu}$, and the two definitions of the Fisher metric on $P^\circ$ coincide.

The following are natural maps between weighted point configurations:

Definition 7. Let $(A, \nu)$, $(A', \nu')$ be two weighted point configurations with $A = (a_i)_i \in \mathbb{R}^{d \times n}$ and $A' = (a'_j)_j \in \mathbb{R}^{d' \times n'}$. A morphism $(A, \nu) \to (A', \nu')$ is a pair $(\varphi, \sigma)$ consisting of an affine map $\varphi : \mathbb{R}^d \to \mathbb{R}^{d'}$ and a surjective map $\sigma : \{1, \dots, n\} \to \{1, \dots, n'\}$ with $\varphi(a_i) = a'_{\sigma(i)}$ and $\nu'(a'_j) = \alpha \sum_{i : \sigma(i) = j} \nu(a_i)$, where $\alpha > 0$ is a constant that does not depend on $j$.

Consider a morphism $(\varphi, \sigma) : (A, \nu) \to (A', \nu')$. For each $j \in [n']$, let $A_j = \{i : \varphi(a_i) = a'_j\}$. Then, $(A_1, \dots, A_{n'})$ is a partition of $[n]$. Define a matrix $Q \in \mathbb{R}^{n' \times n}$ by:

$$Q_{ji} = \begin{cases} \dfrac{\nu(i)}{\sum_{i' \in A_j} \nu(i')}, & \text{if } i \in A_j, \\ 0, & \text{else}. \end{cases} \tag{29}$$

Then, $Q$ is a Markov mapping, and the following diagram commutes:

$$\begin{array}{ccccc} (\operatorname{conv} A)^\circ & \xrightarrow{\ \mu^{-1}\ } & \mathcal{E}_{A,\nu} & \subseteq & \Delta^\circ_{n-1} \\ \big\uparrow {\scriptstyle \varphi^{-1}} & & & & \big\uparrow {\scriptstyle Q} \\ (\operatorname{conv} A')^\circ & \xrightarrow{\ \mu'^{-1}\ } & \mathcal{E}_{A',\nu'} & \subseteq & \Delta^\circ_{n'-1} \end{array} \tag{30}$$

By Chentsov’s Theorem (Theorem 1), $Q$ is an isometric embedding. It follows that $\varphi^{-1}$ also induces an isometric embedding. This shows the first part of the following theorem:

Theorem 6.
• Let $(\varphi, \sigma) : (A, \nu) \to (A', \nu')$ be a morphism of weighted point configurations. Then, $\varphi^{-1} : (\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$ is an isometric embedding with respect to the Fisher metrics on $(\operatorname{conv} A)^\circ$ and $(\operatorname{conv} A')^\circ$.
• Let $g^{A,\nu}$ be a Riemannian metric on $(\operatorname{conv} A)^\circ$ for each weighted point configuration $(A, \nu)$. If every morphism $(\varphi, \sigma) : (A, \nu) \to (A', \nu')$ of weighted point configurations induces an isometric embedding $\varphi^{-1} : (\operatorname{conv} A')^\circ \to (\operatorname{conv} A)^\circ$, then there exists a constant $\alpha \in \mathbb{R}_+$ such that $g^{A,\nu}$ is equal to $\alpha$ times the $(A, \nu)$-Fisher metric.

Proof. The first statement follows from the discussion before the theorem. For the second statement, we show that under the given assumptions, all Markov maps are isometric embeddings. By Chentsov’s Theorem (Theorem 1), this implies that the metrics $g^P$ agree with the Fisher metric whenever $P$ is a simplex. The statement then follows from the two facts that the metric on $P^\circ$ or $(\operatorname{conv} A)^\circ$ is the pull-back of the Fisher metric through the inverse of the moment map and that $\mu^{-1}$ is itself a morphism.

Observe that $\Delta_{n-1} = \operatorname{conv} I_n = \operatorname{conv}\{e_1, \dots, e_n\}$ is a polytope, and $\Delta^\circ_{n-1}$ is the corresponding exponential family. Consider a Markov embedding $Q : \Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$, $p \mapsto p \cdot Q$. Let $\nu(i) = \sum_j Q_{ji}$ be the value of the unique non-zero entry of $Q$ in the $i$-th column. This defines a morphism and an embedding as follows: Let $A$ be the matrix that arises from $Q$ by replacing each non-zero entry by one. We define $\varphi$ as the linear map represented by the matrix $A$, and define $\sigma : [n] \to [n']$ by $\sigma(j) = i$ if and only if $a_j = e_i$, that is, $\sigma(j)$ indicates the row $i$ in which the $j$-th column of $A$ is non-zero.
Then, $(\varphi, \sigma)$ is a morphism $(I_n, \nu) \to (I_{n'}, 1)$, and by assumption, the inverse $\varphi^{-1}$ is an isometric embedding $\Delta^\circ_{n'-1} \to \Delta^\circ_{n-1}$. However, $\varphi^{-1}$ is equal to the Markov map $Q$. This shows that all Markov maps are isometric embeddings, and so, by Chentsov’s Theorem, the statement holds true on the simplices.

Theorem 6 defines a natural metric on $(\Delta^k_{m-1})^\circ$ that we want to discuss in more detail next.

5.3. Independence Models and Conditional Polytopes

Consider $k$ random variables with finite state spaces $[n_1], \dots, [n_k]$. The independence model consists of all joint distributions $p \in \Delta_{\prod_{i \in [k]} n_i - 1}$ of these variables that factorize as:

$$p(x_1, \dots, x_k) = \prod_{i \in [k]} p_i(x_i), \quad \text{for all } x_1 \in [n_1], \dots, x_k \in [n_k], \tag{31}$$

where $p_i \in \Delta_{n_i - 1}$ for all $i \in [k]$. Assuming fixed $n_1, \dots, n_k$, we denote the independence model by $\mathcal{E}_k$. It is the Euclidean closure of an exponential family (with observables of the form $\delta_{i y_i}$). The convex support of $\mathcal{E}_k$ is equal to the product of simplices $P_k := \Delta_{n_1 - 1} \times \cdots \times \Delta_{n_k - 1}$. The parametrization (31) corresponds to the inverse of the moment map.

We can write any tangent vector $u \in T_{(p_1, \dots, p_k)} P_k^\circ$ of this open product of simplices as a linear combination $u = \sum_{i \in [k]} \sum_{x_i \in [n_i]} u_{i x_i} \partial_{i, x_i}$, where $\sum_{x_i \in [n_i]} u_{i x_i} = 0$ for all $i \in [k]$. Given two such tangent vectors, the Fisher metric is given by:

$$g^{P_k}_{(p_1, \dots, p_k)}(u, v) = \sum_{i \in [k]} \sum_{x_i \in [n_i]} \frac{u_{i x_i} v_{i x_i}}{p_i(x_i)}. \tag{32}$$

Just as the convex support of the independence model is the Cartesian product of probability simplices, the Fisher metric on the independence model is the product metric of the Fisher metrics on the probability simplices of the individual variables. If $n_1 = \cdots = n_k =: n$, then $P_k = \Delta^k_{n-1}$ can be identified with the set of $k \times n$ stochastic matrices.

The Fisher metric on the product of simplices is equal to the product of the Fisher metrics on the factors. More generally, if $P = Q_1 \times Q_2$ is a Cartesian product, then the Fisher metric on $P^\circ$ is equal to the product of the Fisher metrics on $Q_1^\circ$ and $Q_2^\circ$.
In fact, in this case, the inverse of the moment map of $P$ can be expressed in terms of the two moment map inverses $\mu_1^{-1} : Q_1 \to \overline{\mathcal{E}_{Q_1}} \subseteq \Delta_{m_1 - 1}$ and $\mu_2^{-1} : Q_2 \to \overline{\mathcal{E}_{Q_2}} \subseteq \Delta_{m_2 - 1}$ and the moment map $\tilde{\mu}$ of the independence model $\Delta_{m_1 - 1} \times \Delta_{m_2 - 1}$, by:

$$\mu^{-1}(q_1, q_2) = \tilde{\mu}^{-1}\left(\mu_1^{-1}(q_1), \mu_2^{-1}(q_2)\right). \tag{33}$$

Therefore, the pull-back by $\mu^{-1}$ factorizes through the pull-back by $\tilde{\mu}^{-1}$, and since the independence model carries a product metric, the product of polytopes also carries a product metric.

Let us compare the metric $g^{(k,m)}_K$ from Equation (24) with the Fisher metric $g^{P_k}_{(K_1, \dots, K_k)}$ from Equation (32) on the product of simplices $P^\circ = (\Delta^k_{m-1})^\circ$. In both cases, the metric is a product metric; that is, it has the form:

$$g = g_1 + \cdots + g_k, \tag{34}$$

where $g_i$ is a metric on the $i$-th factor $\Delta^\circ_{m-1}$. For the Fisher metric $g^{\Delta^k_{m-1}}_K$, $g_i$ is equal to the Fisher metric on $\Delta^\circ_{m-1}$. However, for $g^{(k,m)}_K$, $g_i$ is equal to $1/k$ times the Fisher metric on $\Delta^\circ_{m-1}$. Since this factor only depends on $k$, it only plays a role if stochastic matrices of different sizes are compared.

The additional factor of $1/k$ can be interpreted as the uniform distribution on $k$ elements. This is related to another more general class of Riemannian metrics that are used in applications; namely, given a function $K \in \Delta^k_{m-1} \mapsto \rho_K \in \mathbb{R}^k_+$, it is common to use product metrics with $g_i$ equal to $\rho_K(i)$ times the Fisher metric on $\Delta^\circ_{m-1}$. When $K$ has the interpretation of a channel or when $K$ describes the policy by which a system reacts to some sensor values, a natural possibility is to let $\rho_K$ be the stationary distribution of the channel input or of the sensor values, respectively. We will discuss this approach in Section 6.

6. Weighted Product Metrics for Conditional Models

In this section, we consider metrics on spaces of stochastic matrices defined as weighted sums of the Fisher metrics on the spaces of the matrix rows, similar to Equation (34).
This kind of metric was used initially by Amari [1] in order to define a natural gradient in the supervised learning context. Later, in the context of reinforcement learning, Kakade [2] defined a natural policy gradient based on this kind of metric, which has been further developed by Peters et al. [10]. Related applications within unsupervised learning have been pursued by Zahedi et al. [18].

Consider the following weighted product Fisher metric:

$$g^{\rho,m}_K = \sum_a \rho_K(a)\, g^{(m),a}_{K_a}, \quad \text{for all } K \in (\Delta^k_{m-1})^\circ, \tag{35}$$

where $g^{(m),a}_{K_a}$ denotes the Fisher metric of $\Delta^\circ_{m-1}$ at the $a$-th row of $K$ and $\rho_K \in \Delta^\circ_{k-1}$ is a probability distribution over $a$ associated with each $K \in (\Delta^k_{m-1})^\circ$. For example, the distribution $\rho_K$ could be the stationary distribution of sensor values observed by an agent when operating under a policy described by $K$.

In the following, we will try to illuminate the properties of polytope embeddings that yield the metric (35) as the pull-back of the Fisher information metric on a probability simplex. We will focus on the case that $\rho_K = \rho$ is independent of $K$.

There are two direct ways of embedding $\Delta^k_{n-1}$ in a probability simplex. In Section 5, we used the inverse of the moment map of an exponential family, possibly with some reference measure. This embedding is illustrated in the left panel of Figure 2. Given a fixed probability distribution $\rho \in \Delta^\circ_{k-1}$, there is a second natural embedding $\psi_\rho : \Delta^k_{m-1} \to \Delta_{k \cdot m - 1}$ defined as follows:

$$\psi_\rho(K)(x, y) = \rho(x) K_{x,y}, \quad \text{for all } x \in [k], y \in [m]. \tag{36}$$

If $\rho$ is the distribution of a random variable $X$ and $K \in \Delta^k_{m-1}$ is the stochastic matrix describing the conditional distribution of another variable $Y$ given $X$, then $\psi_\rho(K)$ is the joint distribution of $X$ and $Y$. Note that $\psi_\rho$ is an affine embedding. See the right panel of Figure 2 for an illustration.
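As a numerical sanity check, the embedding $\psi_\rho$ from Equation (36) can be sketched in Python and used to compare the two metrics involved. Because $\psi_\rho$ is linear in $K$, a tangent vector $u$ is pushed forward to the matrix with entries $\rho(x) u_{xy}$; evaluating the Fisher metric of the joint simplex on these push-forwards should then agree with the weighted product metric (35) for a fixed $\rho$. All function names and example values below are assumptions made for this sketch.

```python
def psi(rho, K):
    """Joint distribution (or push-forward of a tangent vector): rho(x) * K[x][y]."""
    return [[rho[x] * K[x][y] for y in range(len(K[x]))] for x in range(len(K))]


def fisher_metric_joint(P, U, V):
    """Fisher metric on the simplex of joint distributions, entries indexed by (x, y)."""
    return sum(U[x][y] * V[x][y] / P[x][y]
               for x in range(len(P)) for y in range(len(P[x])))


def weighted_product_metric(rho, K, u, v):
    """Weighted sum over rows a of rho(a) times the Fisher metric at row a of K."""
    return sum(rho[a] * sum(u[a][b] * v[a][b] / K[a][b] for b in range(len(K[a])))
               for a in range(len(K)))


# Hypothetical example with k = 2 input states and m = 2 output states.
rho = [0.25, 0.75]
K = [[0.5, 0.5], [0.2, 0.8]]
u = [[1.0, -1.0], [0.5, -0.5]]   # each row sums to zero (tangent to the rows of K)
v = [[-1.0, 1.0], [2.0, -2.0]]

pulled_back = fisher_metric_joint(psi(rho, K), psi(rho, u), psi(rho, v))
weighted = weighted_product_metric(rho, K, u, v)
print(abs(pulled_back - weighted))  # the two values should agree up to rounding
```

The agreement reflects the cancellation $\rho(x)^2 / \rho(x) = \rho(x)$ in each term of the joint Fisher metric.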
The pull-back of the Fisher metric on $\Delta^\circ_{km-1}$ through $\psi_\rho$ is given by:

$$\begin{aligned}
g^{(km)}_{\psi_\rho(K)}(\psi_{\rho*}u, \psi_{\rho*}v) &= \sum_{a,b}\sum_{c,d}\sum_{i,j} \rho(i)K_{ij}\, u_{ab}\, \frac{\partial \log \rho(i)K_{ij}}{\partial K_{ab}}\, v_{cd}\, \frac{\partial \log \rho(i)K_{ij}}{\partial K_{cd}} \\
&= \sum_i \rho(i) \sum_j \frac{u_{ij}v_{ij}}{K_{ij}} = \sum_i \rho(i)\, g^{i}_{K_i}(u_i, v_i) = g^{\rho,m}_K(u, v).
\end{aligned} \tag{37}$$

This recovers the weighted sum of Fisher metrics from Equation (35).

Figure 2. An illustration of different embeddings of the conditional polytope $\Delta^k_{m-1}$ in a probability simplex. The left panel shows an embedding in $\Delta_{m^k-1}$ by the inverse of the moment map $\mu$ of the independence model. The right panel shows an affine embedding in $\Delta_{k\cdot m-1}$ as a set of joint probability distributions for two different specifications of marginals, $\rho = (\tfrac14, \tfrac34)$ and $\rho = (\tfrac12, \tfrac12)$.

Are there natural maps that leave the metrics $g^{\rho,m}$ invariant? Let us reconsider the stochastic embeddings from Definition 4. Let $R$ be a $k \times l$ indicator partition matrix and $\bar R$ a stochastic partition matrix with the same block structure as $R$. Observe that to each indicator partition matrix $R$ there are many compatible stochastic partition matrices $\bar R$, but the indicator partition matrix $R$ for any stochastic partition matrix $\bar R$ is unique. Furthermore, let $Q = \{Q^{(a)}\}_{a\in[k]}$ be a collection of stochastic partition matrices. The corresponding conditional embedding $f$ maps $K \in \Delta^k_{m-1}$ to $f(K) := R^\top(K \otimes Q) \in \Delta^l_{n-1}$.

Let $\rho \in \Delta^\circ_{k-1}$. Suppose that $K$ describes the conditional distribution of $Y$ given $X$ and that $P = \psi_\rho(K)$ describes the joint distribution of $Y$ and $X$. As explained in Section 4.1, the matrix $f(P) := \bar R^\top(P \otimes Q)$ describes the joint distribution of a pair of random variables $(X', Y')$, and the conditional distribution of $Y'$ given $X'$ is given by $f(K)$. In this situation, the marginal distribution of $X'$ is given by $\rho' = \rho \bar R$. Therefore, the following diagram commutes:

$$\begin{array}{ccc}
(\Delta^k_{m-1})^\circ & \xrightarrow{\ \psi_\rho\ } & \Delta^\circ_{mk-1} \\
\downarrow f & & \downarrow f \\
(\Delta^l_{n-1})^\circ & \xrightarrow{\ \psi_{\rho'}\ } & \Delta^\circ_{nl-1}
\end{array} \tag{38}$$

The preceding discussion implies the first statement of the following result:

Theorem 7.
• For any $k \ge 1$ and $m \ge 2$ and any $\rho \in \Delta^\circ_{k-1}$, the Riemannian metric $g^{\rho,m}$ on $(\Delta^k_{m-1})^\circ$ satisfies:

$$g^{\rho,m} = f^*(g^{\rho',n}), \quad \text{for } \rho' = \rho \bar R, \tag{39}$$

for any conditional embedding $f : K \mapsto R^\top(K \otimes Q)$.
• Conversely, suppose that for any $k \ge 1$ and $m \ge 2$ and any $\rho \in \Delta^\circ_{k-1}$, there is a Riemannian metric $g^{(\rho,m)}$ on $(\Delta^k_{m-1})^\circ$, such that Equation (39) holds for all conditional embeddings, and suppose that $g^{(\rho,m)}$ depends continuously on $\rho$. Then, there is a constant $A > 0$ that satisfies $g^{(\rho,m)} = A\, g^{\rho,m}$.

Proof. The first statement follows from the commutative diagram (38). For the second statement, denote by $\rho_k$ the uniform distribution on a set of $k$ elements. If $f : K \mapsto R^\top(K \otimes Q)$ is a homogeneous conditional embedding of $\Delta^k_{m-1}$ in $\Delta^l_{n-1}$, then $\bar R = \frac{k}{l} R$ is a stochastic partition matrix corresponding to the partition indicator matrix $R$. Observe that $\rho_l = \rho_k \bar R$. Therefore, the family of Riemannian metrics $g^{(\rho_k,m)}$ on $\Delta^k_{m-1}$ satisfies the assumptions of Theorem 4. Therefore, there is a constant $A > 0$ for which $g^{(\rho_k,m)}$ equals $A/k$ times the product Fisher metric. This proves the statement for uniform distributions $\rho$.

A general distribution $\rho \in \Delta^\circ_{k-1}$ can be approximated by a distribution with rational probabilities. Since $g^{(\rho,m)}$ is assumed to be continuous, it suffices to prove the statement for rational $\rho$. In this case, there exists a stochastic partition matrix $\bar R$ for which $\rho' := \rho \bar R$ is a uniform distribution, and so, $g^{(\rho',n)}$ is of the desired form. Equation (39) shows that $g^{(\rho,m)}$ is also of the desired form.

7. Gradient Fields and Replicator Equations

In this section, we use gradient fields in order to compare Riemannian metrics on the space $(\Delta^k_{n-1})^\circ$.

7.1. Replicator Equations

We start with gradient fields on the simplex $\Delta^\circ_{n-1}$. A Riemannian metric $g$ on $\Delta^\circ_{n-1}$ allows us to consider gradient fields of differentiable functions $F : \Delta^\circ_{n-1} \to \mathbb{R}$. To be more precise, consider the differential $d_pF : T_p\Delta^\circ_{n-1} \to \mathbb{R}$ of $F$ in $p$. It is a linear form on $T_p\Delta^\circ_{n-1}$, which maps each tangent vector $u$ to $d_pF(u) = \frac{\partial F}{\partial u}(p) \in \mathbb{R}$.
Using the map $u \mapsto g_p(u, \cdot)$, this linear form can be identified with a tangent vector in $T_p\Delta^\circ_{n-1}$, which we denote by $\operatorname{grad}_pF$. If we choose the Fisher metric $g^{(n)}$ as the Riemannian metric, we obtain the gradient in the following way. First consider a differentiable extension of $F$ to the positive cone $\mathbb{R}^n_+$, which we will denote by the same symbol $F$. With the partial derivatives $\partial_iF$ of $F$, the Fisher gradient of $F$ on the simplex $\Delta^\circ_{n-1}$ is given as:

$$(\operatorname{grad}_pF)_i = p_i\Big(\partial_iF(p) - \sum_{j=1}^n p_j\,\partial_jF(p)\Big), \quad i \in [n]. \tag{40}$$

Note that the expression on the right-hand side of Equation (40) does not depend on the particular differentiable extension of $F$ to $\mathbb{R}^n_+$. The corresponding differential equation is well known in theoretical biology as the replicator equation; see [19,20]:

$$\dot p_i = p_i\Big(\partial_iF(p) - \sum_{j=1}^n p_j\,\partial_jF(p)\Big), \quad i \in [n]. \tag{41}$$

We now apply this gradient formula to functions that have the structure of an expectation value. Given real numbers $F_i$, $i \in [n]$, referred to as fitness values, we consider the mean fitness:

$$\bar F(p) := \sum_{i=1}^n p_iF_i. \tag{42}$$

Replacing the $p_i$ by any positive real numbers leads to a differentiable extension of $\bar F$, also denoted by $\bar F$. Obviously, we have $\partial_i\bar F = F_i$, which leads to the following replicator equation:

$$\dot p_i = p_i\big(F_i - \bar F(p)\big), \quad i \in [n]. \tag{43}$$

This equation has the solution:

$$p_i(t) = \frac{p_i(0)\,e^{tF_i}}{\sum_{j=1}^n p_j(0)\,e^{tF_j}}, \quad i \in [n]. \tag{44}$$

Clearly, the mean fitness will increase along this solution of the gradient field. The rate of increase can be easily calculated:

$$\frac{d}{dt}\bar F\big(p(t)\big) = \sum_{i=1}^n \dot p_i(t)\,F_i = \sum_{i=1}^n p_i\big(F_i - \bar F(p)\big)F_i = \sum_{i=1}^n p_i\big(F_i - \bar F(p)\big)^2 = \operatorname{var}_p(F) \ge 0, \tag{45}$$

with strict inequality unless all fitness values $F_i$ coincide. As limit points of this solution, we obtain:

$$\lim_{t\to-\infty} p_i(t) = \begin{cases} \dfrac{p_i(0)}{\sum_{j\in\operatorname{argmin}F} p_j(0)}, & \text{if } i \in \operatorname{argmin}F, \\ 0, & \text{otherwise}, \end{cases} \quad i \in [n], \tag{46}$$

and:

$$\lim_{t\to+\infty} p_i(t) = \begin{cases} \dfrac{p_i(0)}{\sum_{j\in\operatorname{argmax}F} p_j(0)}, & \text{if } i \in \operatorname{argmax}F, \\ 0, & \text{otherwise}, \end{cases} \quad i \in [n]. \tag{47}$$

7.2. Extension of the Replicator Equations to Stochastic Matrices

Now, we come to the corresponding considerations of gradient fields in the context of stochastic matrices $K \in (\Delta^k_{n-1})^\circ$. We consider a function:

$$K \mapsto F(K) = F(K_{11}, \ldots, K_{1n};\ K_{21}, \ldots, K_{2n};\ \ldots;\ K_{k1}, \ldots, K_{kn}). \tag{48}$$

One way to deal with this is to consider for each $i \in [k]$ the corresponding replicator equation:

$$\dot K_{ij} = K_{ij}\Big(\partial_{ij}F(K) - \sum_{j'=1}^n K_{ij'}\,\partial_{ij'}F(K)\Big), \quad j \in [n]. \tag{49}$$

Obviously, this is the gradient field that one obtains by using the product Fisher metric on $(\Delta^k_{n-1})^\circ$ (Equation (17)):

$$g^{(k,m)}_K(u, v) = \sum_{ij} \frac{1}{K_{ij}}\,u_{ij}v_{ij}. \tag{50}$$

If we replace the metric by the weighted product Fisher metric considered by Kakade (Equation (35)),

$$g^{\rho,m}_K(u, v) = \sum_{ij} \frac{\rho_i}{K_{ij}}\,u_{ij}v_{ij}, \tag{51}$$

then we obtain:

$$\dot K_{ij} = \frac{K_{ij}}{\rho_i}\Big(\partial_{ij}F(K) - \sum_{j'=1}^n K_{ij'}\,\partial_{ij'}F(K)\Big), \quad j \in [n]. \tag{52}$$

7.3. The Example of Mean Fitness

Next, we want to study how the gradient flows with respect to different metrics compare. We restrict to the class of metrics $g^{\rho,m}$ (Equation (35)), where $\rho \in \Delta^\circ_{k-1}$ is a probability distribution. In principle, one could drop the normalization condition $\sum_i \rho_i = 1$ and allow arbitrary coefficients $\rho_i$. However, it is clear that the rate of convergence can always be increased by scaling all values $\rho_i$ with a common positive factor. Therefore, some normalization condition is needed for $\rho$.

With a probability distribution $p \in \Delta^\circ_{k-1}$ and fitness values $F_{ij}$, let us consider again the example of an expectation value function:

$$\bar F(K) = \sum_{i=1}^k p_i \sum_{j=1}^n K_{ij}F_{ij}. \tag{53}$$

With $\partial_{ij}\bar F(K) = p_iF_{ij}$, this leads to:

$$\dot K_{ij} = \frac{p_i}{\rho_i}\,K_{ij}\Big(F_{ij} - \sum_{j'=1}^n K_{ij'}F_{ij'}\Big), \quad j \in [n]. \tag{54}$$

The corresponding solutions are given by:

$$K_{ij}(t) = \frac{K_{ij}(0)\,e^{t\frac{p_i}{\rho_i}F_{ij}}}{\sum_{j'=1}^n K_{ij'}(0)\,e^{t\frac{p_i}{\rho_i}F_{ij'}}}, \quad i \in [k],\ j \in [n]. \tag{55}$$

Since $\operatorname{argmax}\big(\frac{p_i}{\rho_i}F_{i\cdot}\big)$ and $\operatorname{argmin}\big(\frac{p_i}{\rho_i}F_{i\cdot}\big)$ are independent of $\rho_i > 0$, the limit points are given independently of the chosen $\rho$ as:

$$\lim_{t\to-\infty} K_{ij}(t) = \begin{cases} \dfrac{K_{ij}(0)}{\sum_{j'\in\operatorname{argmin}F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmin}F_{i\cdot}, \\ 0, & \text{otherwise}, \end{cases} \quad i \in [k], \tag{56}$$

and:

$$\lim_{t\to+\infty} K_{ij}(t) = \begin{cases} \dfrac{K_{ij}(0)}{\sum_{j'\in\operatorname{argmax}F_{i\cdot}} K_{ij'}(0)}, & \text{if } j \in \operatorname{argmax}F_{i\cdot}, \\ 0, & \text{otherwise}, \end{cases} \quad i \in [k]. \tag{57}$$

This is consistent with the fact that the critical points of gradient fields are independent of the chosen Riemannian metric. However, the speed of convergence does depend on the metric. For each $i$, let $G_i = \max_j F_{ij}$ and $g_i = \max_{j\notin\operatorname{argmax}(F_{ij})} F_{ij}$ be the largest and second-largest values in the $i$-th row of $F_{ij}$, respectively. Then, as $t \to \infty$:

$$K_{ij}(t) = \begin{cases} 1 - O\big(\exp(-\tfrac{p_i}{\rho_i}(G_i - g_i)t)\big), & \text{if } j \in \operatorname{argmax}F_{i\cdot}, \\ O\big(\exp(-\tfrac{p_i}{\rho_i}(G_i - F_{ij})t)\big), & \text{otherwise}. \end{cases} \tag{58}$$

Therefore,

$$\begin{aligned}
\bar F(K(t)) &= \sum_i p_i \sum_{j\in\operatorname{argmax}F_{i\cdot}} F_{ij} + O\Big(\exp\big(-\tfrac{p_i}{\rho_i}(G_i - g_i)t\big)\Big) \\
&= \sum_i p_i \sum_{j\in\operatorname{argmax}F_{i\cdot}} F_{ij} + O\Big(\exp\big(-\inf_i\big\{\tfrac{p_i}{\rho_i}(G_i - g_i)\big\}\,t\big)\Big).
\end{aligned} \tag{59}$$

Thus, in the long run, the rate of convergence is given by $\inf_i\{\frac{p_i}{\rho_i}(G_i - g_i)\}$, which depends on the parameter $\rho$ of the metric. As a result, in this case study, the optimal choice of $\rho_i$, i.e., the one with the largest convergence rate, can be computed if the numbers $G_i$ and $g_i$ are known. Consider, for example, the case that the differences $G_i - g_i$ are of comparable sizes for all $i$. Then, we need to find the choice of $\rho$ that maximizes $\inf_i\{\frac{p_i}{\rho_i}\}$. Clearly, $\inf_i\{\frac{p_i}{\rho_i}\} \le 1$ (since there is always an index $i$ with $p_i \le \rho_i$). Equality is attained for the choice $\rho_i = p_i$. Thus, we recover the choice of Kakade.

8. Conclusions

So, which Riemannian metric should one use in practice on the set of stochastic matrices, $(\Delta^k_{m-1})^\circ$? The results provided in this manuscript give different answers, depending on the approach. In all cases, the characterized Riemannian metrics are products of Fisher metrics with suitable factor weights.
Theorem 4 suggests using a factor weight proportional to $1/k$, and Theorem 6 suggests using a constant weight independent of $k$. In many cases, it is possible to work within a single conditional polytope $(\Delta^k_{m-1})^\circ$ and a fixed $k$, and then, these two results are basically equivalent. On the other hand, Theorem 7 gives an answer that allows arbitrary factor weights $\rho$. Which metric performs best obviously depends on the concrete application.

The first observation is that in order to use the metric $g^{\rho,m}$ of Theorem 7, it is necessary to know $\rho$. If the problem at hand suggests a natural marginal distribution $\rho$, then it is natural to make use of it and choose the metric $g^{\rho,m}$. Even if $\rho$ is not known at the beginning, a learning system might try to learn it to improve its performance.

On the other hand, there may be situations where there is no natural choice of the weights $\rho$. Observe that $\rho$ breaks the symmetry of permuting the rows of a stochastic matrix. This is also expressed by the structural difference between Theorems 4 and 6 on the one side and Theorem 7 on the other. While the first two theorems provide an invariance metric characterization, Theorem 7 provides a “covariance” classification; that is, the metrics $g^{\rho,m}$ are not invariant under conditional embeddings, but they transform in a controlled manner. This again illustrates that the choice of a metric should depend on which mappings are natural to consider, e.g., which mappings describe the symmetries of a given problem.

For example, consider a utility function of the form $F = \sum_i \rho_i \sum_j K_{ij}F_{ij}$. Row permutations do not leave $g^{\rho,m}$ invariant (for a general $\rho$), but they are not symmetries of the utility function $F$ either, and hence, they are not very natural mappings to consider. However, row permutations transform the metric $g^{\rho,m}$ and the utility function in a controlled manner, in such a way that the two transformations match. Therefore, in this case, it is natural to use $g^{\rho,m}$.
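How strongly the weights $\rho$ influence the gradient flow of such a utility function can be explored with the closed-form solution (55) and the asymptotic rate from Equation (59). The sketch below is illustrative only: the function names and the values of `p`, `F` and `K0` are assumptions, and the helper `rate` assumes that every row of `F` contains at least two distinct values.

```python
import math

def flow(K0, F, p, rho, t):
    """Closed-form solution (55): row i is reweighted by exp(t * p_i / rho_i * F_ij)."""
    K = []
    for i, row in enumerate(K0):
        c = p[i] / rho[i]
        fmax = max(F[i])  # shift for numerical stability; cancels after normalisation
        w = [k * math.exp(t * c * (f - fmax)) for k, f in zip(row, F[i])]
        s = sum(w)
        K.append([x / s for x in w])
    return K

def mean_fitness(K, F, p):
    """Expectation value (53): sum_i p_i sum_j K_ij F_ij."""
    return sum(pi * sum(k * f for k, f in zip(Ki, Fi)) for pi, Ki, Fi in zip(p, K, F))

def rate(F, p, rho):
    """Asymptotic rate inf_i p_i / rho_i * (G_i - g_i) from Equation (59)."""
    rates = []
    for pi, ri, row in zip(p, rho, F):
        G = max(row)
        g = max(f for f in row if f < G)  # second-largest distinct value in the row
        rates.append(pi / ri * (G - g))
    return min(rates)

# Illustrative data (assumptions): two sensor states, two actions.
p = [0.1, 0.9]
F = [[1.0, 0.0], [0.0, 1.0]]
K0 = [[0.5, 0.5], [0.5, 0.5]]
best = sum(pi * max(row) for pi, row in zip(p, F))

for name, rho in [("uniform rho", [0.5, 0.5]), ("rho = p", p)]:
    gap = best - mean_fitness(flow(K0, F, p, rho, 10.0), F, p)
    print(name, "rate:", rate(F, p, rho), "remaining gap at t = 10:", gap)
```

With these particular numbers, the choice $\rho = p$ gives the larger value of $\inf_i\{\frac{p_i}{\rho_i}(G_i - g_i)\}$ and the smaller remaining gap to the maximal mean fitness, in line with the discussion at the end of Section 7.3.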
On the other hand, when studying problems that are symmetric under all row permutations, it is more natural to use the invariant metric $g^{(k,m)}$.

Appendix A. Conditions for Positive Definiteness

Equation (16) in Lebanon's Theorem 3 defines a Riemannian metric whenever it defines a positive-definite quadratic form. The next proposition gives necessary and sufficient conditions for this to be the case.

Proposition A1. For each pair $k \ge 1$ and $m \ge 2$, consider the tensor on $\mathbb{R}^{k\times m}_+$ defined by:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = A(|M|) + \delta_{ac}\Big(\frac{|M|}{|M_a|}B(|M|) + \delta_{bd}\,\frac{|M|}{M_{ab}}C(|M|)\Big) \tag{A1}$$

for some differentiable functions $A, B, C \in C^\infty(\mathbb{R}_+)$. The tensor $g^{(k,m)}$ defines a Riemannian metric for all $k$ and $m$ if and only if $C(\alpha) > 0$, $B(\alpha) + C(\alpha) > 0$ and $A(\alpha) + B(\alpha) + C(\alpha) > 0$ for all $\alpha \in \mathbb{R}_+$.

Proof. The tensors are Riemannian metrics when:

$$g^{(k,m)}_M(V) = A(|M|)\Big(\sum_{ab}V_{ab}\Big)^2 + B(|M|)\sum_a \frac{|M|}{|M_a|}\Big(\sum_b V_{ab}\Big)^2 + C(|M|)\sum_{ab}\frac{|M|}{M_{ab}}V_{ab}^2 \tag{A2}$$

is strictly positive for all non-zero $V \in \mathbb{R}^{k\times m}$, for all $M \in \mathbb{R}^{k\times m}_+$.

We can derive necessary conditions on the functions $A$, $B$, $C$ from some basic observations. Choosing $V = \partial_{ab}$ in Equation (A2) shows that $A(|M|) + \frac{|M|}{|M_a|}B(|M|) + \frac{|M|}{M_{ab}}C(|M|)$ has to be positive for all $a \in [k]$, $b \in [m]$, for all $M \in \mathbb{R}^{k\times m}_+$. Since $M_{ab}$ can be arbitrarily small for fixed $|M|$ and $|M_a|$, we see that $C$ has to be non-negative. Since we can choose $|M_a| \approx M_{ab} \ll |M|$ for a fixed $|M|$, we find that $B + C$ has to be non-negative. Further, since we can choose $M_{ab} \approx |M_a| \approx |M|$ for a given $|M|$, we find that $A + B + C$ has to be non-negative. This shows that the quadratic form is positive definite only if $C \ge 0$, $B + C \ge 0$, $A + B + C \ge 0$. Since the cone of positive definite matrices is open, these inequalities have to be strictly satisfied. In the following, we study sufficient conditions.
For any given M ∈ Rk×m + , we can write Equation (A2) as a product V⊤GV, for all V ∈ Rkm, where G = GA + GB + GC ∈ Rkm×km is the sum of a matrix GA with all entries equal to A(|M|), a block diagonal matrix GB whose a-th block has all entries equal to |M| |Ma| B(|M|), and a diagonal matrix GC with diagonal entries equal to |M| Mab C(|M|). The matrix G is obviously symmetric, and by Sylvester’s criterion, it is positive definite iff all its leading principal minors are positive. We can evaluate the minors using Sylvester’s determinant Theorem. That Theorem states that for any invertible m × m matrix X, an m × n matrix Y and an n × m matrix Z, one has the equality det(X + YZ) = det(X) det(In + ZX−1Y). Let us consider a leading square block G′, consisting of all entries Gab,cd of G with row-index pairs (a, b) satisfying b ∈ [m] for all a < a′ and b ≤ b′ for a = a′ for some a′ ≤ k and b′ ≤ m; and the same restriction for the column index pairs. The corresponding block G′ A + G′ B can be written as the rank-a′ matrix YZ, with Y consisting of columns 1a for all a ≤ a′ and Z consisting of rows A + 1a |M| |Ma| B for all a ≤ a′. Hence, the determinant of G′ is equal to: det(G′) = det(G′ C) · det(Ia′ + ZG′−1 C Y). (A3) Since G′C is diagonal, the first term is just: det(G′ C) = � ∏ a 0, C + B > 0 and � 1 + ∑a≤a′ Aa C+Ba � > 0 for all a′ and b′. The latter inequality is satisfied whenever A + B + C > 0. This completes the proof. Appendix B Proofs of the Invariance Characterization The following Lemma follows directly from the Definition and contains all the technical details we need for the proofs. Lemma A1. The push-forward f∗ : TMRk×m + → Tf (M)Rl×n + of a map f ∈ ˆF l,n k,m is given by: f∗(∂ab) = l ∑ i=1 n ∑ j=1 RaiQ(a) bj ∂′ ij, (A7) and the pull-back of a metric g(l,n) on Rl×n + through f is given by: ( f ∗g(l,n))M(∂ab, ∂cd) = g(l,n) f (M)( f∗∂ab, f∗∂cd) = l ∑ i=1 n ∑ j=1 l ∑ s=1 n ∑ t=1 RaiRcsQ(a) bj Q(c) dt g(l,n) f (M)(∂′ ij, ∂′ st). (A8) Proof of Theorem 4. 
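We follow the strategy of [5,14]. The idea is to consider subclasses of maps from the class $F^{l,n}_{k,m}$ and to evaluate their push-forward and pull-back maps together with the isometry requirement. This yields restrictions on the possible metrics, eventually fully characterizing them.

First. Consider the maps $h_{\pi,\sigma} \in F^{k,m}_{k,m}$, resulting from permutation matrices $Q^{(a)} = P_{\pi_a}$, $\pi_a : [m] \to [m]$ for all $a \in [k]$, and $R = P_\sigma$, $\sigma : [k] \to [k]$. Requiring isometry yields:

$$(h_{\pi,\sigma})_*(\partial_{ab}) = \partial'_{\sigma(a)\,\pi_a(b)} \tag{A9}$$

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_{h_{\pi,\sigma}(M)}\big(\partial_{\sigma(a)\,\pi_a(b)}, \partial_{\sigma(c)\,\pi_c(d)}\big). \tag{A10}$$

Second. Consider the maps $r_{zw} \in F^{kz,mw}_{k,m}$ defined by $Q^{(1)} = \cdots = Q^{(k)} \in \mathbb{R}^{m\times mw}$ and $R \in \mathbb{R}^{k\times kz}$ being uniform. In this case, for some permutations $\pi$ and $\sigma$,

$$(r_{zw})_*(\partial_{ab}) = \frac{1}{w}\sum_{i=1}^z\sum_{j=1}^w \partial'_{\sigma^{(a)}(i)\,\pi^{(b)}(j)} \tag{A11}$$

$$(r_{zw}^*\,g^{(kz,mw)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{w^2}\sum_{i=1}^z\sum_{j=1}^w\sum_{s=1}^z\sum_{t=1}^w g^{(kz,mw)}_{r_{zw}(M)}\big(\partial'_{\sigma^{(a)}(i)\,\pi^{(b)}(j)}, \partial'_{\sigma^{(c)}(s)\,\pi^{(d)}(t)}\big). \tag{A12}$$

Third. For a rational matrix $M = \frac{1}{Z}\tilde M$ with $\tilde M \in \mathbb{N}^{k\times m}$ and row-sum $|\tilde M_a| = N \in \mathbb{N}$ for all $a \in [k]$, consider the map $v_M \in F^{zk,N}_{k,m}$ that maps $M$ to a constant matrix. In this case, $R \in \mathbb{R}^{k\times kz}$ and $Q^{(a)}$ has the $b$-th row with $|\tilde M_{ab}|$ entries with value $\frac{1}{|\tilde M_{ab}|}$ at positions $\pi^{(ab)}([\tilde M_{ab}]) \subseteq [N]$, and:

$$(v_M)_*(\partial_{ab}) = \frac{1}{\tilde M_{ab}}\sum_{i=1}^z\sum_{j=1}^{\tilde M_{ab}} \partial'_{\sigma^{(a)}(i)\,\pi^{(ab)}(j)} \tag{A13}$$

$$(v_M^*\,g^{(kz,N)})_M(\partial_{ab}, \partial_{cd}) = \frac{1}{\tilde M_{ab}}\frac{1}{\tilde M_{cd}}\sum_{i=1}^z\sum_{j=1}^{\tilde M_{ab}}\sum_{s=1}^z\sum_{t=1}^{\tilde M_{cd}} g^{(kz,N)}_{v_M(M)}\big(\partial'_{\sigma^{(a)}(i)\,\pi^{(ab)}(j)}, \partial'_{\sigma^{(c)}(s)\,\pi^{(cd)}(t)}\big). \tag{A14}$$

Step 1: $a \ne c$. Consider a constant matrix $M = U$. Then:

$$g^{(k,m)}_U(\partial_{a_1b_1}, \partial_{c_1d_1}) = g^{(k,m)}_{h_{\pi,\sigma}(U)}(\partial_{a_2b_2}, \partial_{c_2d_2}) = g^{(k,m)}_U(\partial_{a_2b_2}, \partial_{c_2d_2}). \tag{A15}$$

This implies that $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \hat A(k, m)$ when $a \ne c$. Using the second type of map, we get:

$$\hat A(k, m) = \frac{z^2w^2}{w^2}\,\hat A(kz, mw), \tag{A16}$$

which implies $g^{(k,m)}_U(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}$ when $a \ne c$. Considering a rational matrix $M$ and the map $v_M$ yields:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2}. \tag{A17}$$

Step 2: $a = c$ and $b \ne d$. By similar arguments as in Step 1, $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \hat B(k, m)$.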
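Evaluating the map $r_{zw}$ yields:

$$\hat B(k, m) = \frac{zw^2}{w^2}\,\hat B(kz, mw) + \frac{z(z-1)w^2}{w^2}\,\frac{A}{(kz)^2} = z\,\hat B(kz, mw) + \frac{z-1}{z}\,\frac{A}{k^2}, \tag{A18}$$

and therefore,

$$\frac{1}{z}\Big(\hat B(k, m) - \frac{A}{k^2}\Big) = \hat B(kz, mw) - \frac{A}{(kz)^2}, \tag{A19}$$

which implies that $\big(\hat B(k, m) - \frac{A}{k^2}\big)$ is independent of $m$ and scales with the inverse of $k$, such that it can be written as $\frac{B}{k}$. Rearranging the terms yields $g^{(k,m)}_U(\partial_{ab}, \partial_{ad}) = \frac{A}{k^2} + \frac{B}{k}$, for $b \ne d$. For a rational matrix $M$, the pull-back through $v_M$ then shows:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{ad}) = z\,\frac{\tilde M_{ab}\tilde M_{ad}}{\tilde M_{ab}\tilde M_{ad}}\Big(\frac{A}{(kz)^2} + \frac{B}{kz}\Big) + z(z-1)\,\frac{\tilde M_{ab}\tilde M_{ad}}{\tilde M_{ab}\tilde M_{ad}}\,\frac{A}{(kz)^2} = \frac{A}{k^2} + \frac{B}{k}. \tag{A20}$$

Step 3: $a = c$ and $b = d$. In this case, $g^{(k,m)}_U(\partial_{a_1b_1}, \partial_{a_1b_1}) = g^{(k,m)}_U(\partial_{a_2b_2}, \partial_{a_2b_2}) = \hat C(k, m)$, and:

$$\begin{aligned}
\hat C(k, m) &= \frac{1}{w^2}\,zw\,\hat C(kz, mw) + \frac{1}{w^2}\,zw(w-1)\Big(\frac{A}{(kz)^2} + \frac{B}{kz}\Big) + \frac{1}{w^2}\,w^2z(z-1)\,\frac{A}{(kz)^2} \\
&= \frac{z}{w}\,\hat C(kz, mw) + \Big(1 - \frac{1}{zw}\Big)\frac{A}{k^2} + \Big(1 - \frac{z}{zw}\Big)\frac{B}{k},
\end{aligned} \tag{A21}$$

which implies:

$$\frac{k}{m}\Big(\hat C(k, m) - \frac{A}{k^2} - \frac{B}{k}\Big) = \frac{kz}{mw}\Big(\hat C(kz, mw) - \frac{A}{(kz)^2} - \frac{B}{kz}\Big), \tag{A22}$$

such that the left-hand side is a constant $C$, and $g^{(k,m)}_U(\partial_{ab}, \partial_{ab}) = \frac{A}{k^2} + \frac{B}{k} + \frac{m}{k}C$. Now, for a rational matrix $M$, pulling back through $v_M$ gives:

$$\begin{aligned}
g^{(k,m)}_M(\partial_{ab}, \partial_{ab}) &= \frac{1}{\tilde M_{ab}^2}\,\tilde M_{ab}\Big(\frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde M_a|}{k}C\Big) + \frac{1}{\tilde M_{ab}^2}\,\tilde M_{ab}(\tilde M_{ab} - 1)\Big(\frac{A}{k^2} + \frac{B}{k}\Big) + 0 \\
&= \frac{A}{k^2} + \frac{B}{k} + \frac{|\tilde M_a|}{\tilde M_{ab}\,k}C = \frac{A}{k^2} + k\,\frac{B}{k^2} + \frac{|M|}{M_{ab}}\,\frac{C}{k^2}.
\end{aligned} \tag{A23}$$

Summarizing, we found:

$$g^{(k,m)}_M(\partial_{ab}, \partial_{cd}) = \frac{A}{k^2} + \delta_{ac}\Big(k\,\frac{B}{k^2} + \delta_{bd}\,\frac{|M|}{M_{ab}}\,\frac{C}{k^2}\Big), \tag{A24}$$

which proves the first statement. The second statement follows by plugging Equation (23) into Equation (A8). Finally, the statement about the positive-definiteness is a direct consequence of Proposition A1.

Proof of Theorem 5. Suppose, contrary to the claim, that a family of metrics $g^{(k,m)}_M$ exists, which is invariant with respect to any conditional embedding. By Theorem 4, these metrics are of the form of Equation (23). To prove the claim, we only need to show that $A$, $B$ and $C$ vanish.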
In the following, we study conditional embeddings where $Q$ consists of identity matrices and evaluate the isometry requirement $(f^*g^{(l,n)})_M(\partial_{ab}, \partial_{cd}) = g^{(k,m)}_M(\partial_{ab}, \partial_{cd})$.

Step 1: In the case $a \ne c$, we obtain from the invariance requirement and Equation (A8) that:

$$\frac{A}{k^2} = |R_a||R_c|\,\frac{A}{l^2}. \tag{A25}$$

Observe that:

$$\frac{1}{k}\sum_{i=1}^k |R_i| = \frac{1}{k}|R| = \frac{l}{k}. \tag{A26}$$

In fact, $|R_i|$ is the cardinality of the $i$-th block of the partition belonging to $R$. Therefore, if we choose $R$ to be the partition indicator matrix of a partition that is not homogeneous and in which $|R_a| > l/k$ and $|R_c| > l/k$, then Equation (A25) implies that $A = 0$.

Step 2: In the case $a = c$ and $b \ne d$, we obtain from invariance and Equation (A8) that:

$$\frac{B}{k} = \sum_{i=1}^l\sum_{s=1}^l R_{ai}R_{as}\,\delta_{is}\,\frac{B}{l} = |R_a|\,\frac{B}{l}. \tag{A27}$$

Again, we may choose $R_a$ in such a way that $|R_a| \ne \frac{l}{k}$ and find that $B = 0$.

Step 3: Finally, in the case $a = c$ and $b = d$, we obtain from invariance and Equation (A8) that:

$$\frac{C|M|}{k^2M_{ab}} = \sum_{i=1}^l\sum_{s=1}^l R_{ai}R_{as}\,\delta_{i,s}\,\frac{C|R^\top M|}{l^2(R^\top M)_{ib}} = \frac{|R_a|\,C\,|R^\top M|}{l^2M_{ab}}. \tag{A28}$$

If we choose $R_a$ such that $|R_a| \ne \frac{|M|}{|R^\top M|}$, then we see that $C = 0$. Therefore, $g^{(k,m)}$ is the zero-tensor, which is not a metric.

Acknowledgments

The authors are grateful to Keyan Zahedi for discussions related to policy gradient methods in robotics applications. Guido Montúfar thanks the Santa Fe Institute for hosting him during the initial work on this article. Johannes Rauh acknowledges support by the VW Foundation. This work was supported in part by the DFG Priority Program, Autonomous Learning (DFG-SPP 1527).

Author Contributions

All authors contributed to the design of the research. The research was carried out by all authors, with main contributions by Guido Montúfar and Johannes Rauh. The manuscript was written by Guido Montúfar, Johannes Rauh and Nihat Ay. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Amari, S. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
2. Kakade, S. A Natural Policy Gradient. In Advances in Neural Information Processing Systems 14; MIT Press: Cambridge, MA, USA, 2001; pp. 1531–1538.
3. Shahshahani, S. A New Mathematical Framework for the Study of Linkage and Selection; American Mathematical Society: Providence, RI, USA, 1979.
4. Chentsov, N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Providence, RI, USA, 1982.
5. Campbell, L. An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 1986, 98, 135–141.
6. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12; MIT Press: Cambridge, MA, USA, 2000; pp. 1057–1063.
7. Marbach, P.; Tsitsiklis, J. Simulation-based optimization of Markov reward processes. IEEE Trans. Autom. Control 2001, 46, 191–209.
8. Montúfar, G.; Ay, N.; Zahedi, K. Expressive power of conditional restricted Boltzmann machines for sensorimotor control. 2014, arXiv:1402.3346.
9. Ay, N.; Montúfar, G.; Rauh, J. Selection Criteria for Neuromanifolds of Stochastic Dynamics. In Advances in Cognitive Neurodynamics (III); Yamaguchi, Y., Ed.; Springer-Verlag: Dordrecht, The Netherlands, 2013; pp. 147–154.
10. Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190.
11. Peters, J.; Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the IEEE International Conference on Intelligent Robotics Systems (IROS 2006), Beijing, China, 9–15 October 2006.
12. Peters, J.; Vijayakumar, S.; Schaal, S. Reinforcement learning for humanoid robotics. In Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots, Karlsruhe, Germany, 29–30 September 2003; pp. 1–20.
13. Bagnell, J.A.; Schneider, J. Covariant policy search. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 9–15 August 2003; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2003; pp. 1019–1024.
14. Lebanon, G. Axiomatic geometry of conditional models. IEEE Trans. Inform. Theor. 2005, 51, 1283–1294.
15. Lebanon, G. An Extended Čencov-Campbell Characterization of Conditional Information Geometry. In Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence (UAI 04), Banff, AB, Canada, 7–11 July 2004; Chickering, D.M., Halpern, J.Y., Eds.; AUAI Press: Arlington, VA, USA, 2004; pp. 341–345.
16. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1978.
17. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
18. Zahedi, K.; Ay, N.; Der, R. Higher coordination with less control - A result of information maximization in the sensorimotor loop. Adapt. Behav. 2010, 18.
19. Hofbauer, J.; Sigmund, K. Evolutionary Games and Population Dynamics; Cambridge University Press: Cambridge, United Kingdom, 1998.
20. Ay, N.; Erb, I. On a notion of linear replicator equations. J. Dyn. Differ. Equ. 2005, 17, 427–451.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

Matrix Algebraic Properties of the Fisher Information Matrix of Stationary Processes

André Klein

Rothschild Blv. 123 Apt.7, 65271 Tel Aviv, Israel; A.A.B.Klein@uva.nl or klein@contact.uva.nl; Tel.: 972.5.25594723

Received: 12 February 2014; in revised form: 11 March 2014 / Accepted: 24 March 2014 / Published: 8 April 2014

Abstract: This survey paper summarizes results that are to be found in a series of papers. The subject of interest is the matrix algebraic properties of the Fisher information matrix (FIM) of stationary processes. The FIM is an ingredient of the Cramér-Rao inequality and belongs to the basics of asymptotic estimation theory in mathematical statistics. The FIM is interconnected with the Sylvester, Bezout and tensor Sylvester matrices. Through these interconnections it is shown that the FIM of scalar and multiple stationary processes fulfills the resultant matrix property. A statistical distance measure involving entries of the FIM is presented. In quantum information, a different statistical distance measure is set forth. It is related to the Fisher information, but considers the information about one parameter in a particular measurement procedure. The FIM of scalar stationary processes is also interconnected to the solutions of appropriate Stein equations; conditions for the FIM to verify certain Stein equations are formulated. The presence of Vandermonde matrices is also emphasized.

MSC Classification: 15A23, 15A24, 15B99, 60G10, 62B10, 62M20.

Keywords: Bezout matrix; Sylvester matrix; tensor Sylvester matrix; Stein equation; Vandermonde matrix; stationary process; matrix resultant; Fisher information matrix

1. Introduction

This survey paper summarizes results derived and described in a series of papers. It concerns some matrix algebraic properties of the Fisher information matrix (abbreviated as FIM) of stationary processes. An essential property emphasized in this paper concerns the matrix resultant property of the FIM of stationary processes.
To be more explicit, consider the coefficients of two monic polynomials $p(z)$ and $q(z)$ of finite degree as the entries of a matrix, arranged such that the matrix becomes singular if and only if the polynomials $p(z)$ and $q(z)$ have at least one common root. Such a matrix is called a resultant matrix and its determinant is called the resultant. The Sylvester, Bezout and tensor Sylvester matrices have such a property and are extensively studied in the literature, see e.g., [1,3]. The FIM associated with various stationary processes will be expressed by these matrices. The derived interconnections are obtained by developing the necessary factorizations of the FIM in terms of the Sylvester, Bezout and tensor Sylvester matrices. These factored forms of the FIM enable us to show that the FIM of scalar and multiple stationary processes fulfills the resultant matrix property. Consequently, the singularity conditions of the appropriate Fisher information matrices and Sylvester, Bezout and tensor Sylvester matrices coincide; these results are described in [4,6].

A statistical distance measure involving entries of the FIM is presented and is based on [7]. In quantum information, a statistical distance measure is set forth, see [8,10]; it is related to the Fisher information, but considers the information about one parameter in a particular measurement procedure. This leads to a challenging question: can the existing distance measure in quantum information be developed at the matrix level?

Entropy 2014, 16, 2023–2055; doi:10.3390/e16042023 www.mdpi.com/journal/entropy

The matrix Stein equation, see e.g., [11], is associated with the Fisher information matrices of scalar stationary processes through the solutions of the appropriate Stein equations. Conditions for the Fisher information matrices or associated matrices to verify certain Stein equations are formulated and proved in this paper.
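To make the resultant matrix property described at the start of this Introduction concrete, the following sketch builds the classical Sylvester matrix of two polynomials and evaluates its determinant exactly. It is an illustration only and is not taken from the papers surveyed here: the function names, the coefficient ordering (highest degree first) and the example polynomials are assumptions of this sketch.

```python
from fractions import Fraction

def sylvester(p, q):
    """Sylvester matrix of p and q, given as coefficient lists (highest degree first)."""
    m, n = len(p) - 1, len(q) - 1          # degrees of p and q
    size = m + n
    rows = []
    for i in range(n):                      # n shifted copies of the coefficients of p
        rows.append([0] * i + list(p) + [0] * (size - m - 1 - i))
    for i in range(m):                      # m shifted copies of the coefficients of q
        rows.append([0] * i + list(q) + [0] * (size - n - 1 - i))
    return rows

def det(M):
    """Exact determinant by Gaussian elimination over the rationals."""
    A = [[Fraction(x) for x in row] for row in M]
    n = len(A)
    d = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)              # no pivot in this column: singular matrix
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            d = -d                          # a row swap flips the sign
        d *= A[c][c]
        for r in range(c + 1, n):
            factor = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= factor * A[c][k]
    return d

p = [1, -3, 2]     # (z - 1)(z - 2)
q = [1, -5, 6]     # (z - 2)(z - 3), shares the root z = 2 with p
r = [1, -7, 12]    # (z - 3)(z - 4), no common root with p

print(det(sylvester(p, q)))   # singular: common root
print(det(sylvester(p, r)))   # non-singular: no common root
```

In this example the Sylvester matrix of `p` and `q` is singular because the two polynomials share the root $z = 2$, whereas the Sylvester matrix of `p` and `r` has a non-zero determinant.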
The presence of Vandermonde matrices is also emphasized. The general and more detailed results are set forth in [12] and [13].

In this survey paper it is shown that the FIM of linear stationary processes forms a class of structured matrices. Note that in [14], the authors emphasize that statistical problems related to stationary processes have been treated successfully with the aid of Toeplitz forms. This paper is organized as follows. The various stationary processes considered in this paper are presented in Section 2, and the Fisher information matrices of the stationary processes are displayed in Section 3. Section 3 sets forth the interconnections between the Fisher information matrices and the Sylvester, Bezout, tensor Sylvester matrices, and solutions to Stein equations. A statistical distance measure is expressed in terms of entries of a FIM.

2. The Linear Stationary Processes

In this section we display the class of linear stationary processes whose corresponding Fisher information matrix shall be investigated in a matrix algebraic context. But first some basic definitions are set forth, see e.g., [15]. If a random variable $X$ is indexed to time, usually denoted by $t$, the observations $\{X_t, t \in T\}$ are called a time series, where $T$ is a time index set (for example, $T = \mathbb{Z}$, the integer set).

2.1. Definition 2.1

A stochastic process is a family of random variables $\{X_t, t \in T\}$ defined on a probability space $\{\Omega, \mathcal{F}, \wp\}$.

2.2. Definition 2.2

The autocovariance function. If $\{X_t, t \in T\}$ is a process such that $\operatorname{Var}(X_t) < \infty$ (variance) for each $t$, then the autocovariance function $\gamma_X(\cdot, \cdot)$ of $\{X_t\}$ is defined by $\gamma_X(r, s) = \operatorname{Cov}(X_r, X_s) = E\,[(X_r - E X_r)(X_s - E X_s)]$, $r, s \in \mathbb{Z}$, where $E$ represents the expected value.

2.3. Definition 2.3

Stationarity. The time series $\{X_t, t \in \mathbb{Z}\}$, with the index set $\mathbb{Z} = \{0, \pm 1, \pm 2, \ldots\}$, is said to be stationary if

(i) $E\,|X_t|^2 < \infty$,
(ii) $E(X_t) = m$ for all $t \in \mathbb{Z}$, where $m$ is the constant average or mean,
(iii) $\gamma_X(r, s) = \gamma_X(r + t, s + t)$ for all $r, s, t \in \mathbb{Z}$.

Under Definition 2.3, the mean and the autocovariances are invariant with respect to shifts in time. When the process is Gaussian, it can further be concluded that the joint probability distributions of the random variables $\{X_{t_1}, X_{t_2}, \ldots, X_{t_n}\}$ and $\{X_{t_1+k}, X_{t_2+k}, \ldots, X_{t_n+k}\}$ are the same for arbitrary times $t_1, t_2, \ldots, t_n$, for all $n$ and all lags or leads $k = 0, \pm 1, \pm 2, \ldots$; in that case the probability distribution of observations of a stationary process is invariant with respect to shifts in time. In the next section the linear stationary processes that will be considered throughout this paper are presented.

2.4. The Vector ARMAX or VARMAX Process

We display one of the most general linear stationary processes, called the multivariate autoregressive, moving average and exogenous process, the VARMAX process. To be more specific, consider the vector difference equation representation of a linear system $\{y(t), t \in \mathbb{Z}\}$ of order $(p, r, q)$,

$$\sum_{j=0}^p A_j\,y(t - j) = \sum_{j=0}^r C_j\,x(t - j) + \sum_{j=0}^q B_j\,\varepsilon(t - j), \quad t \in \mathbb{Z}, \tag{1}$$

where $y(t)$ are the observable outputs, $x(t)$ the observable inputs and $\varepsilon(t)$ the unobservable errors; all are $n$-dimensional. The acronym VARMAX stands for vector autoregressive-moving average with exogenous variables. The left side of (1) is the autoregressive part, the second term on the right is the moving average part, and $x(t)$ is exogenous. If $x(t)$ does not occur, the system is said to be (V)ARMA. Depending on the field of application, the input $x(t)$ is named the exogenous variable, as in econometrics and time series analysis, e.g., [15], or the control variable, as in signal processing and control, e.g., [16,17]. The matrix coefficients $A_j \in \mathbb{R}^{n\times n}$, $C_j \in \mathbb{R}^{n\times n}$ and $B_j \in \mathbb{R}^{n\times n}$ are the associated parameter matrices. We have the property $A_0 \equiv B_0 \equiv C_0 \equiv I_n$.
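Since $A_0 = I_n$, Equation (1) can be read as a recursion that produces $y(t)$ from past outputs and from current and past inputs and errors. The sketch below illustrates this reading. It is not part of the original survey: the function names, the use of plain Python lists, the zero values assumed before $t = 0$ and the example coefficient matrices are all assumptions made for illustration.

```python
import random

def matvec(M, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def simulate_varmax(A, C, B, x, e):
    """Outputs y(t) of Equation (1) with A_0 = C_0 = B_0 = I_n.

    A, C, B: lists of n x n matrices for lags 1..p, 1..r and 1..q.
    x, e: lists of n-vectors (inputs and errors) of equal length.
    Values before t = 0 are taken to be zero (an assumption of this sketch).
    """
    y = []
    for t in range(len(x)):
        yt = [xi + ei for xi, ei in zip(x[t], e[t])]          # lag-0 terms
        for j, Cj in enumerate(C, start=1):
            if t - j >= 0:
                yt = [a + b for a, b in zip(yt, matvec(Cj, x[t - j]))]
        for j, Bj in enumerate(B, start=1):
            if t - j >= 0:
                yt = [a + b for a, b in zip(yt, matvec(Bj, e[t - j]))]
        for j, Aj in enumerate(A, start=1):                    # move A_j y(t-j) to the right
            if t - j >= 0:
                yt = [a - b for a, b in zip(yt, matvec(Aj, y[t - j]))]
        y.append(yt)
    return y

# Illustrative VARMAX(1, 1, 1) example with n = 2 (all numbers are assumptions).
A1 = [[-0.5, 0.1], [0.0, -0.3]]
C1 = [[0.2, 0.0], [0.0, 0.2]]
B1 = [[0.4, 0.0], [0.1, 0.4]]
rng = random.Random(0)
x = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
e = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
y = simulate_varmax([A1], [C1], [B1], x, e)
print(y[:3])
```

For the example matrices, $\det(I_2 + A_1 z)$ and $\det(I_2 + B_1 z)$ have all their roots outside the unit circle, so the example stays within the kind of process considered in this section.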
Equation (1) can compactly be written as

$$A(z)\, y(t) = C(z)\, x(t) + B(z)\, \epsilon(t) \tag{2}$$

where

$$A(z) = \sum_{j=0}^{p} A_j z^j; \quad C(z) = \sum_{j=0}^{r} C_j z^j; \quad B(z) = \sum_{j=0}^{q} B_j z^j$$

and we use $z$ to denote the backward shift operator, for example $z\, x_t = x_{t-1}$. The matrix polynomials $A(z)$, $B(z)$ and $C(z)$ are the associated autoregressive, moving average and exogenous matrix polynomials, of order $p$, $q$ and $r$ respectively. Hence the process described by (2) is denoted as a VARMAX($p, r, q$) process. Here $z \in \mathbb{C}$, with a duplicate use of $z$ as an operator and as a complex variable, which is usual in the signal processing and time series literature, e.g., [15,16,18]. The assumptions $\mathrm{Det}(A(z)) \neq 0$ for $|z| \leq 1$ and $\mathrm{Det}(B(z)) \neq 0$ for $|z| < 1$, $z \in \mathbb{C}$, are imposed so that the VARMAX($p, r, q$) process (2) has exactly one stationary solution, and the condition $\mathrm{Det}(B(z)) \neq 0$ implies the invertibility condition, see e.g., [15] for more details. Under these assumptions, the eigenvalues of the matrix polynomials $A(z)$ and $B(z)$ lie outside the unit circle. The eigenvalues of a matrix polynomial $Y(z)$ are the roots of the equation $\mathrm{Det}(Y(z)) = 0$, where $\mathrm{Det}(X)$ is the determinant of $X$. The VARMAX($p, r, q$) stationary process (2) is thoroughly discussed in [15,18,19].

The error $\{\epsilon(t), t \in \mathbb{Z}\}$ is a collection of uncorrelated zero mean $n$-dimensional random variables, each having positive definite covariance matrix $\Sigma$, and we assume, for all $s, t$, $E_\vartheta\{x(s)\, \epsilon^\top(t)\} = 0$, where $X^\top$ denotes the transpose of the matrix $X$ and $E_\vartheta$ represents the expected value under the parameter $\vartheta$. The matrix $\vartheta$ represents all the VARMAX($p, r, q$) parameters, with the total number of parameters being $n^2(p+q+r)$. For different purposes, which will be specified in the next sections, two choices of the parameter structure are considered. First, the parameter vector $\vartheta \in \mathbb{R}^{n^2(p+q+r) \times 1}$ is defined by

$$\vartheta = \mathrm{vec}\,\{A_1, A_2, \ldots, A_p, C_1, C_2, \ldots, C_r, B_1, B_2, \ldots, B_q\} \tag{3}$$

The vec operator transforms a matrix into a vector by stacking the columns of the matrix one underneath the other according to $\mathrm{vec}\, X = \mathrm{col}(\mathrm{col}(X_{ij})_{i=1}^{n})_{j=1}^{n}$, see e.g., [2,20]. A different choice is set forth when the parameter matrix $\vartheta \in \mathbb{R}^{n \times n(p+q+r)}$ is of the form

$$\vartheta = (\vartheta_1\ \vartheta_2\ \ldots\ \vartheta_p\ \ \vartheta_{p+1}\ \vartheta_{p+2}\ \ldots\ \vartheta_{p+r}\ \ \vartheta_{p+r+1}\ \vartheta_{p+r+2}\ \ldots\ \vartheta_{p+r+q}) \tag{4}$$

$$\phantom{\vartheta} = (A_1\ A_2\ \ldots\ A_p\ \ C_1\ C_2\ \ldots\ C_r\ \ B_1\ B_2\ \ldots\ B_q) \tag{5}$$

Representation (5) of the parameter matrix has been used in [21]. The estimation of the matrices $A_1, A_2, \ldots, A_p$, $C_1, C_2, \ldots, C_r$, $B_1, B_2, \ldots, B_q$ and $\Sigma$ has received considerable attention in the time series and statistical signal processing literature, see e.g., [15,17,19]. In [19], the authors study the asymptotic properties of maximum likelihood estimates of the coefficients of VARMAX($p, r, q$) processes, stored in an $(\ell \times 1)$ vector $\vartheta$, where $\ell = n^2(p+q+r)$.

Before describing the control-exogenous variable $x(t)$ used in this survey paper, we shall present the different special cases of the model described in (1) and (2).

2.5. The Vector ARMA or VARMA Process

When the process (2) does not contain the control process $x(t)$, it yields

$$A(z)\, y(t) = B(z)\, \epsilon(t) \tag{6}$$

which is a vector autoregressive and moving average process, the VARMA($p, q$) process, see e.g., [15]. The matrix $\vartheta$ now represents all the VARMA parameters, with the total number of parameters being $n^2(p+q)$. The VARMA($p, q$) version of the parameter vector $\vartheta$ defined in (3) is then given by

$$\vartheta = \mathrm{vec}\,\{A_1, A_2, \ldots, A_p, B_1, B_2, \ldots, B_q\} \tag{7}$$

The VARMA equivalent of the parameter matrix (4) is then the $n \times n(p+q)$ parameter matrix

$$\vartheta = (\vartheta_1\ \vartheta_2\ \ldots\ \vartheta_p\ \ \vartheta_{p+1}\ \vartheta_{p+2}\ \ldots\ \vartheta_{p+q}) = (A_1\ A_2\ \ldots\ A_p\ \ B_1\ B_2\ \ldots\ B_q) \tag{8}$$

A description of the input variable $x(t)$ in (2) follows. Generally, one can assume either that $x(t)$ is non-stochastic or that $x(t)$ is stochastic.
In the latter case, we assume $E_\vartheta\{x(s)\, \epsilon^\top(t)\} = 0$ for all $s, t$, and that statistical inference is performed conditionally on the values taken by $x(t)$. In this case it can be interpreted as constant, see [22] for a detailed exposition. However, in the papers referred to in this survey, like [21] and [23], the observed input variable $x(t)$ is assumed to be a stationary VARMA process of the form

$$\alpha(z)\, x(t) = \beta(z)\, \eta(t) \tag{9}$$

where $\alpha(z)$ and $\beta(z)$ are the autoregressive and moving average polynomials of appropriate degree and $\{\eta(t), t \in \mathbb{Z}\}$ is a collection of uncorrelated zero mean $n$-dimensional random variables, each having positive definite covariance matrix $\Omega$. The spectral density of the VARMA process $x(t)$ is $R_x(\cdot)/2\pi$, for a definition see e.g., [15,16], with

$$R_x(e^{i\omega}) = \alpha^{-1}(e^{i\omega})\, \beta(e^{i\omega})\, \Omega\, \beta^{*}(e^{i\omega})\, \alpha^{-*}(e^{i\omega}), \quad \omega \in [-\pi, \pi] \tag{10}$$

where $i$ is the imaginary unit with the property $i^2 = -1$, $\omega$ is the frequency, the spectral density $R_x(e^{i\omega})$ is Hermitian, and we further have $R_x(e^{i\omega}) \geq 0$ and $\int_{-\pi}^{\pi} R_x(e^{i\omega})\, d\omega < \infty$. As mentioned above, the basic assumption holds that $x(t)$ and $\epsilon(t)$ are independent or at least uncorrelated processes, which corresponds geometrically to orthogonal processes. Here $X^{*}$ is the complex conjugate transpose of the matrix $X$.

2.6. The ARMAX and ARMA Processes

The scalar equivalents of the VARMAX($p, r, q$) and VARMA($p, q$) processes, given by (2) and (6) respectively, shall now be displayed. For the ARMAX($p, r, q$) process we obtain

$$a(z)\, y(t) = c(z)\, x(t) + b(z)\, \epsilon(t) \tag{11}$$

and for the ARMA($p, q$) process

$$a(z)\, y(t) = b(z)\, \epsilon(t) \tag{12}$$

popularized in, among others, the Box-Jenkins type of time series analysis, see e.g., [15]. Here $a(z)$, $b(z)$ and $c(z)$ are respectively the scalar autoregressive, moving average and exogenous polynomials, with corresponding scalar coefficients $a_j$, $b_j$ and $c_j$,

$$a(z) = \sum_{j=0}^{p} a_j z^j; \quad c(z) = \sum_{j=0}^{r} c_j z^j; \quad b(z) = \sum_{j=0}^{q} b_j z^j \tag{13}$$

Note that, as in the multiple case, $a_0 = b_0 = 1$.
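The stationarity and invertibility assumptions of Section 2.4 reduce, in the scalar case, to the requirement that the zeros of $a(z)$ and $b(z)$ lie outside the unit circle. The following is a minimal sketch of such a check, assuming NumPy is available; the function names `roots_outside_unit_circle` and `check_arma` are hypothetical helpers introduced here for illustration and do not come from the paper.

```python
import numpy as np

def roots_outside_unit_circle(coeffs):
    """Check whether all zeros of c(z) = 1 + c_1 z + ... + c_k z^k
    satisfy |z| > 1. `coeffs` holds [c_1, ..., c_k] (c_0 = 1 is implied)."""
    coeffs = np.asarray(coeffs, dtype=float)
    # np.roots expects the highest-degree coefficient first.
    poly = np.r_[coeffs[::-1], 1.0]
    return bool(np.all(np.abs(np.roots(poly)) > 1.0))

def check_arma(a, b):
    """Return (stationary, invertible) for a(z) y(t) = b(z) e(t)."""
    return roots_outside_unit_circle(a), roots_outside_unit_circle(b)

# a(z) = 1 - 0.5 z has its zero at z = 2, b(z) = 1 + 2 z at z = -0.5.
print(check_arma([-0.5], [2.0]))
```

In this example the autoregressive polynomial passes the check while the moving average polynomial does not, so the illustrative process is stationary but not invertible.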
The parameter vector $\vartheta$ for the processes (11) and (12) is then

$$\vartheta = \{a_1, a_2, \ldots, a_p, c_1, c_2, \ldots, c_r, b_1, b_2, \ldots, b_q\} \tag{14}$$

and

$$\vartheta = \{a_1, a_2, \ldots, a_p, b_1, b_2, \ldots, b_q\} \tag{15}$$

respectively. In the next section the matrix algebraic properties of the Fisher information matrix of the stationary processes (2), (6), (11) and (12) will be verified. Interconnections with various known structured matrices, like the Sylvester resultant matrix, the Bezout matrix and the Vandermonde matrix, are set forth. The Fisher information matrix of the various stationary processes is also expressed in terms of the unique solutions to the appropriate Stein equations.

3. Structured Matrix Properties of the Asymptotic Fisher Information Matrix of Stationary Processes

The Fisher information is an ingredient of the Cramér-Rao inequality, also called by some the Cauchy-Schwarz inequality in mathematical statistics, and belongs to the basics of asymptotic estimation theory in mathematical statistics. The Cramér-Rao theorem [24] is therefore considered. When assuming that the estimators of $\vartheta$, defined in the previous sections, are asymptotically unbiased, the inverse of the asymptotic information matrix yields the Cramér-Rao bound, and provided that the estimators are asymptotically efficient, the asymptotic covariance matrix then verifies the inequality

$$\mathrm{Cov}(\hat{\vartheta}) \succeq I^{-1}(\hat{\vartheta})$$

where $I(\hat{\vartheta})$ is the FIM and $\mathrm{Cov}(\hat{\vartheta})$ is the covariance of $\hat{\vartheta}$, the unbiased estimator of $\vartheta$, for a detailed fundamental statistical analysis, see [25,26]. The inverse of the FIM equals the Cramér-Rao lower bound, and the subject of the FIM is also of interest in the control theory and signal processing literature, see e.g., [27]. Its quantum analog was introduced immediately after the foundation of mathematical quantum estimation theory in the 1960s, see [28,29] for a rigorous exposition of the subject. More specifically, the Fisher information is also emphasized in the context of quantum information theory, see e.g., [30,31].
The Cramér-Rao inequality receives a lot of attention because it is located on the boundary of statistics, information and quantum theory and, more recently, matrix theory. In the next sections, the Fisher information matrices of linear stationary processes will be presented and their role as a new class of structured matrices will be the subject of study.

When time series models are the subject, using (2) for all $t \in \mathbb{Z}$ to determine the residual $\epsilon(t)$, or $\epsilon_t(\vartheta)$ to emphasize the dependency on the parameter vector $\vartheta$, and assuming that $x(t)$ is stochastic and that $(y(t), x(t))$ is a Gaussian stationary process, the asymptotic FIM $F(\vartheta)$ is defined by the following $(\ell \times \ell)$ matrix, which does not depend on $t$,

$$F(\vartheta) = E\left\{ \left( \frac{\partial \epsilon_t(\vartheta)}{\partial \vartheta^\top} \right)^{\!\top} \Sigma^{-1} \left( \frac{\partial \epsilon_t(\vartheta)}{\partial \vartheta^\top} \right) \right\} \tag{16}$$

where the $(v \times \ell)$ matrix $\partial(\cdot)/\partial \vartheta^\top$ is the derivative with respect to $\vartheta^\top$ of any $(v \times 1)$ column vector $(\cdot)$, and $\ell$ is the total number of parameters. The derivative with respect to $\vartheta^\top$ is used for obtaining the appropriate dimensions. Equality (16) is used for computing the FIM of the various time series processes presented in the previous sections, and appropriate definitions of the derivatives are used, especially for the multivariate processes (2) and (6), see [21,22].

3.1. The Fisher Information Matrix of an ARMA(p, q) Process

In this section, the focus is on the FIM of the ARMA process (12). When $\vartheta$ is given by (15), the derivatives in (16) are, at the scalar level,

$$\frac{\partial \epsilon_t(\vartheta)}{\partial a_j} = \frac{1}{a(z)}\, \epsilon_{t-j} \ \text{ for } j = 1, \ldots, p \quad \text{and} \quad \frac{\partial \epsilon_t(\vartheta)}{\partial b_k} = -\frac{1}{b(z)}\, \epsilon_{t-k} \ \text{ for } k = 1, \ldots, q.$$

When combined for all $j$ and $k$, the FIM of the ARMA process (12), with the variance of the noise process $\epsilon_t(\vartheta)$ equal to one, yields the block decomposition, see [32],

$$F(\vartheta) = \begin{pmatrix} F_{aa}(\vartheta) & F_{ab}(\vartheta) \\ F_{ba}(\vartheta) & F_{bb}(\vartheta) \end{pmatrix} \tag{17}$$

The expressions of the different blocks of the matrix $F(\vartheta)$ are

$$F_{aa}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_p^\top(z^{-1})}{a(z)\, a(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_p^\top(z)}{a(z)\, \hat{a}(z)}\, dz \tag{18}$$

$$F_{ab}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, u_q^\top(z^{-1})}{a(z)\, b(z^{-1})} \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_p(z)\, v_q^\top(z)}{a(z)\, \hat{b}(z)}\, dz \tag{19}$$

$$F_{ba}(\vartheta) = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_p^\top(z^{-1})}{a(z^{-1})\, b(z)} \frac{dz}{z} = -\frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_p^\top(z)}{\hat{a}(z)\, b(z)}\, dz \tag{20}$$

$$F_{bb}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, u_q^\top(z^{-1})}{b(z)\, b(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_q(z)\, v_q^\top(z)}{b(z)\, \hat{b}(z)}\, dz \tag{21}$$

where the integration above and everywhere below is counterclockwise around the unit circle. The reciprocal monic polynomials $\hat{a}(z)$ and $\hat{b}(z)$ are defined as $\hat{a}(z) = z^p a(z^{-1})$ and $\hat{b}(z) = z^q b(z^{-1})$, and $\vartheta = (a_1, \ldots, a_p, b_1, \ldots, b_q)^\top$ was introduced in (15). For each positive integer $k$ we have $u_k(z) = (1, z, z^2, \ldots, z^{k-1})^\top$ and $v_k(z) = z^{k-1} u_k(z^{-1})$. The stability condition of the ARMA($p, q$) process implies that all the roots of the monic polynomials $a(z)$ and $b(z)$ lie outside the unit circle. Consequently, the roots of the polynomials $\hat{a}(z)$ and $\hat{b}(z)$ lie within the unit circle and will be used as the poles for computing the integrals (18)–(21) when Cauchy's residue theorem is applied. Notice that the FIM $F(\vartheta)$ is symmetric block Toeplitz, so that $F_{ab}(\vartheta) = F_{ba}^\top(\vartheta)$, and the integrands in (18)–(21) are Hermitian. The integral expressions (18)–(21) are easily computed by using the standard residue theorem. The algorithms displayed in [33] and [22] are suited for numerical computation of, among others, the FIM of an ARMA($p, q$) process.

3.2. The Sylvester Resultant Matrix - The Fisher Information Matrix

The resultant property of a matrix is considered. To show that the FIM $F(\vartheta)$ has the matrix resultant property means to show that the matrix $F(\vartheta)$ becomes singular if and only if the appropriate scalar monic polynomials $\hat{a}(z)$ and $\hat{b}(z)$ have at least one common zero. To illustrate the subject, the following known property of two polynomials is set forth. The greatest common divisor (frequently abbreviated as GCD) of two polynomials is a polynomial, of the highest possible degree, that is a factor of both original polynomials. The roots of the GCD of two polynomials are the common roots of the two polynomials. Consider the coefficients of two monic polynomials $p(z)$ and $q(z)$ of finite degree as the entries of a matrix, such that the matrix becomes singular if and only if the polynomials $p(z)$ and $q(z)$ have at least one common root. Such a matrix is called a resultant matrix and its determinant is called the resultant. Therefore we present the known $(p+q) \times (p+q)$ Sylvester resultant matrix of the polynomials $a$ and $b$, see e.g., [2],

$$S(a, b) = \begin{pmatrix}
1 & a_1 & \cdots & a_p & 0 & \cdots & 0 \\
0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
0 & \cdots & 0 & 1 & a_1 & \cdots & a_p \\
1 & b_1 & \cdots & b_q & 0 & \cdots & 0 \\
0 & \ddots & \ddots & & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
0 & \cdots & 0 & 1 & b_1 & \cdots & b_q
\end{pmatrix} \tag{22}$$

Consider the $p \times (p+q)$ and $q \times (p+q)$ upper and lower submatrices $S_p(b)$ and $S_q(-a)$ of the Sylvester resultant matrix $S(b, -a)$, such that

$$S(b, -a) = \begin{pmatrix} S_p(b) \\ -S_q(a) \end{pmatrix} \tag{23}$$

The matrix $S(a, b)$ becomes singular in the presence of one or more common zeros of the monic polynomials $\hat{a}(z)$ and $\hat{b}(z)$. This property is assessed by the following equalities

$$R(a, b) = \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\alpha_i - \beta_j), \qquad R(b, a) = (-1)^{pq} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\alpha_i - \beta_j) \tag{24}$$

and

$$R(b, -a) = (-1)^{q} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\beta_j - \alpha_i), \qquad R(-b, a) = (-1)^{p} \prod_{\substack{i = 1, \ldots, p \\ j = 1, \ldots, q}} (\beta_j - \alpha_i) \tag{25}$$

where $R(a, b)$ is the resultant of $\hat{a}(z)$ and $\hat{b}(z)$, and is equal to $\mathrm{Det}\, S(a, b)$. The string of equalities in (24) and (25) holds since $R(b, a) = (-1)^{pq} R(a, b)$, $R(b, -a) = (-1)^{q} R(b, a)$ and $R(-b, a) = (-1)^{p} R(b, a)$, see [34]. The zeros of the scalar monic polynomials $\hat{a}(z)$ and $\hat{b}(z)$ are $\alpha_i$ and $\beta_j$ respectively and are assumed to be distinct. By this is meant that, when factors $(z - \alpha_i)^{n_{\alpha_i}}$ and $(z - \beta_j)^{n_{\beta_j}}$ occur with the powers $n_{\alpha_i}$ and $n_{\beta_j}$ both greater than one, only the distinct roots are considered, free from the corresponding powers. The key property of the classical Sylvester resultant matrix $S(a, b)$ is that its null space provides a complete description of the common zeros of the polynomials involved. In particular, in the scalar case the polynomials $\hat{a}(z)$ and $\hat{b}(z)$ are coprime if and only if $S(a, b)$ is non-singular. This key property of the classical Sylvester resultant matrix $S(a, b)$ is given by the well-known theorem on resultants,

$$\dim \mathrm{Ker}\, S(a, b) = \nu(a, b) \tag{26}$$

where $\nu(a, b)$ is the number of common roots of the polynomials $\hat{a}(z)$ and $\hat{b}(z)$, counting multiplicities, see e.g., [3]. The dimension of a subspace $V$ is represented by $\dim(V)$, and $\mathrm{Ker}(X)$ is the null space or kernel of the matrix $X$, denoted by Null or Ker. The null space of an $n \times n$ matrix $A$ with coefficients in a field $K$ (typically the field of the real numbers or of the complex numbers) is the set $\mathrm{Ker}\, A = \{x \in K^n : Ax = 0\}$, see e.g., [1,2,20].
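The resultant property (24) and the kernel property (26) can be checked numerically on a small example. The sketch below assumes NumPy; the helper names `sylvester` and `nullity` and the chosen zeros are illustrative assumptions, not part of the paper. The rows are laid out as in (22), with $q$ shifted copies of $(1, a_1, \ldots, a_p)$ followed by $p$ shifted copies of $(1, b_1, \ldots, b_q)$.

```python
import numpy as np

def sylvester(a_hat, b_hat):
    """Sylvester resultant matrix of two monic polynomials.

    a_hat = [1, a_1, ..., a_p] and b_hat = [1, b_1, ..., b_q] are the
    coefficient lists as they appear in the rows of (22)."""
    a_hat = np.asarray(a_hat, dtype=float)
    b_hat = np.asarray(b_hat, dtype=float)
    p, q = len(a_hat) - 1, len(b_hat) - 1
    S = np.zeros((p + q, p + q))
    for i in range(q):            # q shifted copies of the a-coefficients
        S[i, i:i + p + 1] = a_hat
    for i in range(p):            # p shifted copies of the b-coefficients
        S[q + i, i:i + q + 1] = b_hat
    return S

def nullity(M, tol=1e-10):
    """Dimension of the null space, computed via the numerical rank."""
    return M.shape[1] - np.linalg.matrix_rank(M, tol=tol)

# np.poly returns monic coefficients [1, c_1, ...] from prescribed zeros.
a_hat = np.poly([0.5, 0.2])        # zeros alpha = 0.5, 0.2
b_coprime = np.poly([0.4, -0.3])   # no zero shared with a_hat
b_common = np.poly([0.5, -0.3])    # shares the zero 0.5 with a_hat

print(np.linalg.det(sylvester(a_hat, b_coprime)))  # product of (alpha_i - beta_j)
print(nullity(sylvester(a_hat, b_common)))         # one common zero
```

With one shared zero the numerical nullity is one, in line with (26), while for coprime polynomials the determinant agrees with the product in (24).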
In order to prove that the FIM $F(\vartheta)$ fulfills the resultant matrix property, the following factorization is derived, Lemma 2.1 in [5],

$$F(\vartheta) = S(b, -a)\, \mathcal{P}(\vartheta)\, S^\top(b, -a) \tag{27}$$

where the matrix $\mathcal{P}(\vartheta) \in \mathbb{R}^{(p+q) \times (p+q)}$ admits the form

$$\mathcal{P}(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, u_{p+q}^\top(z^{-1})}{a(z)\, b(z)\, a(z^{-1})\, b(z^{-1})} \frac{dz}{z} = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+q}(z)\, v_{p+q}^\top(z)}{a(z)\, b(z)\, \hat{a}(z)\, \hat{b}(z)}\, dz \tag{28}$$

It is proved in [5] that the symmetric matrix $\mathcal{P}(\vartheta)$ fulfills the property $\mathcal{P}(\vartheta) \succ O$. The factorization (27) allows us to show the matrix resultant property of the FIM. Corollary 2.2 in [5] states: the FIM of an ARMA($p, q$) process with polynomials $a(z)$ and $b(z)$ of order $p$ and $q$ respectively becomes singular if and only if the polynomials $\hat{a}(z)$ and $\hat{b}(z)$ have at least one common root. From Corollary 2.2 in [5] it can be concluded that the FIM of an ARMA($p, q$) process and the Sylvester resultant matrix $S(-b, a)$ have the same singularity property. By virtue of (26) and (27) we will specify the dimension of the null space of the FIM $F(\vartheta)$. This is set forth in the following lemma.

3.2.1. Lemma 3.1

Assume that the polynomials $\hat{a}(z)$ and $\hat{b}(z)$ have $\nu(a, b)$ common roots, counting multiplicities. The factorization (27) of the FIM and the property (26) enable us to prove the equality

$$\dim(\mathrm{Ker}\, F(\vartheta)) = \dim(\mathrm{Ker}\, S(b, -a)) = \nu(a, b) \tag{29}$$

Proof

The matrix $\mathcal{P}(\vartheta) \in \mathbb{R}^{(p+q) \times (p+q)}$, given in (28), fulfills the property of positive definiteness, as proved in [5]. This implies that a Cholesky decomposition can be applied to $\mathcal{P}(\vartheta)$, see [35] for more details, to obtain $\mathcal{P}(\vartheta) = L^\top(\vartheta) L(\vartheta)$, where $L(\vartheta)$ is a $(p+q) \times (p+q)$ upper triangular matrix that is unique if its diagonal elements are all positive. Consequently, all its eigenvalues are then positive, so that the matrix $L(\vartheta)$ is also positive definite and, in particular, non-singular.
The factorization (27) now admits the representation

$$F(\vartheta) = S(b, -a)\, L^\top(\vartheta)\, L(\vartheta)\, S^\top(b, -a) \tag{30}$$

Taking into account the property that, if $A$ is an $m \times n$ matrix, then $\mathrm{Ker}(A) = \mathrm{Ker}(A^\top A)$, yields when applied to (30)

$$\mathrm{Ker}\, F(\vartheta) = \mathrm{Ker}\, S(b, -a)\, L^\top(\vartheta)\, L(\vartheta)\, S^\top(b, -a) = \mathrm{Ker}\, L(\vartheta)\, S^\top(b, -a)$$

Assume the vector $u \in \mathrm{Ker}\, L(\vartheta) S^\top(b, -a)$, such that $L(\vartheta) S^\top(b, -a) u = 0$, and set $S^\top(b, -a) u = v$, so that $L(\vartheta) v = 0$. Since the matrix $L(\vartheta) \succ O$, it follows that $v = 0$, which implies $S^\top(b, -a) u = 0$ and hence $u \in \mathrm{Ker}\, S^\top(b, -a)$. Consequently,

$$\mathrm{Ker}\, F(\vartheta) = \mathrm{Ker}\, S^\top(b, -a) \tag{31}$$

We will now consider the Rank-Nullity Theorem, see e.g., [1]: if $A$ is an $m \times n$ matrix, then $\dim(\mathrm{Ker}\, A) + \dim(\mathrm{Im}\, A) = n$, together with the property $\dim(\mathrm{Im}\, A) = \dim(\mathrm{Im}\, A^\top)$. When applied to the $(p+q) \times (p+q)$ matrix $S(b, -a)$, it yields

$$\dim(\mathrm{Ker}\, S(b, -a)) = \dim(\mathrm{Ker}\, S^\top(b, -a)) \ \Rightarrow\ \dim(\mathrm{Ker}\, F(\vartheta)) = \dim(\mathrm{Ker}\, S(b, -a))$$

which completes the proof.

Notice that the dimension of the null space of a matrix $A$ is called the nullity of $A$, and the dimension of the image of $A$, $\dim(\mathrm{Im}\, A)$, is termed the rank of $A$. An alternative proof to the one developed in Corollary 2.2 in [5] is given in a corollary to Lemma 3.1, reconfirming the resultant matrix property of the FIM $F(\vartheta)$.

3.2.2. Corollary 3.2

The FIM $F(\vartheta)$ of an ARMA($p, q$) process becomes singular if and only if the autoregressive and moving average polynomials $\hat{a}(z)$ and $\hat{b}(z)$ have at least one common root.

Proof

By virtue of the equality (31), combined with the property $\mathrm{Det}\, S^\top(b, -a) = \mathrm{Det}\, S(b, -a)$ and the matrix resultant property of the Sylvester matrix $S(b, -a)$, we have $\mathrm{Det}\, S^\top(b, -a) = 0 \Leftrightarrow \mathrm{Ker}\, S^\top(b, -a) \neq \{0\}$ if and only if the ARMA($p, q$) polynomials $\hat{a}(z)$ and $\hat{b}(z)$ have at least one common root. Equivalently, $\mathrm{Det}\, S^\top(b, -a) \neq 0 \Leftrightarrow \mathrm{Ker}\, S^\top(b, -a) = \{0\}$ if and only if the ARMA($p, q$) polynomials $\hat{a}(z)$ and $\hat{b}(z)$ have no common roots.
Consequently, by virtue of the equality $\mathrm{Ker}\, F(\vartheta) = \mathrm{Ker}\, S^\top(b, -a)$, it can be concluded that the FIM $F(\vartheta)$ becomes singular if and only if the ARMA($p, q$) polynomials $\hat{a}(z)$ and $\hat{b}(z)$ have at least one common root. This completes the proof.

3.3. The Statistical Distance Measure and the Fisher Information Matrix

In [7] statistical distance measures are studied. Most multivariate statistical techniques are based upon the concept of distance. For that purpose a statistical distance measure is considered that is a normalized Euclidean distance measure with entries of the FIM as weighting coefficients. The measurements $x_1, x_2, \ldots, x_n$ are subject to random fluctuations of different magnitudes and therefore have different variabilities. It is then important to consider a distance that takes the variability of these variables or measurements into account when determining the distance from a fixed point. A rotation of the coordinate system through a chosen angle, while keeping the scatter of points given by the data fixed, is also applied, see [7] for more details. It is shown that when the FIM is positive definite, the appropriate statistical distance measure is a metric. In case of a singular FIM of an ARMA stationary process, the metric property depends on the rotation angle. The statistical distance measure is based on $m$ parameters, unlike a statistical distance measure introduced in quantum information, see e.g., [8,9], which is also related to the Fisher information but where the information about one parameter in a particular measurement procedure is considered.

The straight-line or Euclidean distance between the stochastic vector $x = (x_1\ x_2\ \ldots\ x_n)^\top$ and the fixed vector $y = (y_1\ y_2\ \ldots\ y_n)^\top$, where $x, y \in \mathbb{R}^n$, is given by

$$d(x, y) = \|x - y\| = \left( \sum_{j=1}^{n} (x_j - y_j)^2 \right)^{1/2} \tag{32}$$

where the metric $d(x, y) := \|x - y\|$ is induced by the standard Euclidean norm $\|\cdot\|$ on $\mathbb{R}^n$, see e.g., [2] for the metric conditions.

The observations $x_1, x_2, \ldots, x_n$ are used to compute maximum likelihood estimates of the parameters $\vartheta_1, \vartheta_2, \ldots, \vartheta_m$, where $m < n$. These estimated parameters are random variables, see e.g., [15]. The distance of the estimated vector $\vartheta \in \mathbb{R}^m$, given in (15), is studied. Entries of the FIM are inserted in the distance measure as weighting coefficients. The linear transformation

$$\tilde{\vartheta} = L_i(\phi)\, \vartheta \tag{33}$$

is applied, where $L_i(\phi) \in \mathbb{R}^{m \times m}$ is the Givens rotation matrix with rotation angle $\phi$, with $0 \leq \phi \leq 2\pi$ and $i \in \{1, \ldots, m-1\}$, see e.g., [36], and is given by

$$L_i(\phi) = \begin{pmatrix}
I_{i-1} & 0 & 0 & 0 \\
0 & (\cos\phi)_{i,i} & (-\sin\phi)_{i,i+1} & 0 \\
0 & (\sin\phi)_{i+1,i} & (\cos\phi)_{i+1,i+1} & 0 \\
0 & 0 & 0 & I_{m-i-1}
\end{pmatrix}, \quad 0 \leq \phi \leq 2\pi \tag{34}$$

The following matrix decomposition is applied in order to obtain a transformed FIM

$$F_\phi(\vartheta) = L_i(\phi)\, F(\vartheta)\, L_i^\top(\phi) \tag{35}$$

where $F_\phi(\vartheta)$ and $F(\vartheta)$ are respectively the transformed and untransformed Fisher information matrices. By virtue of (35), the transformed and untransformed Fisher information matrices $F_\phi(\vartheta)$ and $F(\vartheta)$ are similar, since the rotation matrix $L_i(\phi)$ is orthogonal. Two matrices $A$ and $B$ are similar if there exists an invertible matrix $X$ such that the equality $AX = XB$ holds. As can be seen, the Givens matrix $L_i(\phi)$ involves only two coordinates that are affected by the rotation angle $\phi$, whereas the other directions, which correspond to eigenvalues of one, are unaffected by the rotation matrix.

By virtue of (35) it can be concluded that a positive definite FIM, $F(\vartheta) \succ 0$, implies a positive definite transformed FIM, $F_\phi(\vartheta) \succ 0$. Consequently, the elements on the main diagonal of $F(\vartheta)$, $f_{1,1}, f_{2,2}, \ldots, f_{m,m}$, as well as the elements on the main diagonal of $F_\phi(\vartheta)$, $\tilde{f}_{1,1}, \tilde{f}_{2,2}, \ldots, \tilde{f}_{m,m}$, are all positive. However, the elements on the main diagonal of a singular FIM of a stationary ARMA process are also positive. As developed in [7], combining (33) and (35) yields the distance measure of the estimated parameters $\vartheta_1, \vartheta_2, \ldots, \vartheta_m$ accordingly, to obtain

$$d^2_{F_\phi}(\vartheta) = \sum_{\substack{j=1 \\ j \neq i,\, i+1}}^{m} \left\{ \vartheta_j^2\, f_{j,j} \right\} + \{\vartheta_i \cos(\phi) - \vartheta_{i+1} \sin(\phi)\}^2\, \tilde{f}_{i,i}(\phi) + \{\vartheta_{i+1} \cos(\phi) + \vartheta_i \sin(\phi)\}^2\, \tilde{f}_{i+1,i+1}(\phi) \tag{36}$$

where

$$\tilde{f}_{i,i}(\phi) = f_{i,i} \cos^2(\phi) - f_{i,i+1} \sin(2\phi) + f_{i+1,i+1} \sin^2(\phi) \tag{37}$$

$$\tilde{f}_{i+1,i+1}(\phi) = f_{i+1,i+1} \cos^2(\phi) + f_{i,i+1} \sin(2\phi) + f_{i,i} \sin^2(\phi) \tag{38}$$

and $f_{j,l}$ are entries of the FIM $F(\vartheta)$, whereas $\tilde{f}_{i,i}(\phi)$ and $\tilde{f}_{i+1,i+1}(\phi)$ are the transformed components, since the rotation affects only the entries $i$ and $i+1$, as can be seen in the matrix $L_i(\phi)$. In [7], the following inequalities are proved

$$\tilde{f}_{i,i}(\phi) > 0 \quad \text{and} \quad \tilde{f}_{i+1,i+1}(\phi) > 0$$

which guarantees the metric property of (36). When the FIM of an ARMA($p, q$) process is the case, a combination of (27) and (35) for the ARMA($p, q$) parameters given in (15) yields for the transformed FIM

$$F_\phi(\vartheta) = S_\phi(-b, a)\, \mathcal{P}(\vartheta)\, S_\phi^\top(-b, a) \tag{39}$$

where $\mathcal{P}(\vartheta)$ is given by (28) and the transformed Sylvester resultant matrix is of the form

$$S_\phi(-b, a) = L_i(\phi)\, S(-b, a) \tag{40}$$

Proposition 3.5 in [7] proves that the transformed FIM $F_\phi(\vartheta)$ and the transformed Sylvester matrix $S_\phi(-b, a)$ fulfill the resultant matrix property by using the equalities (40) and (39). The following property is then set forth.

3.3.1. Proposition 3.3

The properties $\mathrm{Ker}\, F_\phi(\vartheta) = \mathrm{Ker}\, S_\phi^\top(-b, a)$ and $\mathrm{Ker}\, S_\phi(-b, a) = \mathrm{Ker}\, S(-b, a)$ hold true.

Proof

The equalities (39) and (40), and the orthogonality property of the rotation matrix $L_i(\phi)$, which implies that $\mathrm{Ker}\, L_i(\phi) = \{0\}$, combined with the same approach as in Lemma 3.1, complete the proof.

A straightforward conclusion from Proposition 3.3 is then

$$\dim \mathrm{Ker}\, F_\phi(\vartheta) = \dim \mathrm{Ker}\, S_\phi(-b, a), \qquad \dim \mathrm{Ker}\, S_\phi(-b, a) = \dim \mathrm{Ker}\, S(-b, a)$$

In the next section a distance measure introduced in quantum information is discussed.
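The behaviour of the FIM under the rotation (35) can be illustrated numerically. The following sketch, assuming NumPy, approximates the blocks (18)–(21) by a uniform quadrature on the unit circle (substituting $z = e^{i\omega}$, so that $dz/z = i\, d\omega$) and applies a Givens rotation as in (34). The function names `arma_fim` and `givens`, the grid size and the parameter values are illustrative assumptions, not the algorithms of [22] or [33].

```python
import numpy as np

def arma_fim(a, b, n_grid=4096):
    """Approximate the ARMA(p, q) FIM (17) with unit noise variance.

    a = [a_1, ..., a_p], b = [b_1, ..., b_q], with a_0 = b_0 = 1.
    The contour integrals (18)-(21) become averages over a uniform grid
    on |z| = 1, where z^{-1} equals the complex conjugate of z."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    p, q = len(a), len(b)
    z = np.exp(2j * np.pi * np.arange(n_grid) / n_grid)
    a_z = np.polyval(np.r_[a[::-1], 1.0], z)      # a(z) on the grid
    b_z = np.polyval(np.r_[b[::-1], 1.0], z)      # b(z) on the grid
    wa = z[None, :] ** np.arange(p)[:, None] / a_z   # u_p(z) / a(z)
    wb = z[None, :] ** np.arange(q)[:, None] / b_z   # u_q(z) / b(z)
    F = np.block([[wa @ wa.conj().T, -wa @ wb.conj().T],
                  [-wb @ wa.conj().T, wb @ wb.conj().T]]) / n_grid
    return F.real

def givens(m, i, phi):
    """m x m Givens rotation (34) acting on coordinates i and i+1 (1-based)."""
    L = np.eye(m)
    c, s = np.cos(phi), np.sin(phi)
    L[i - 1, i - 1], L[i - 1, i] = c, -s
    L[i, i - 1], L[i, i] = s, c
    return L

L = givens(2, 1, 0.7)
for a, b in ([0.5], [0.3]), ([0.5], [0.5]):   # coprime, then common zero
    F = arma_fim(a, b)
    F_phi = L @ F @ L.T                        # transformed FIM (35)
    print(np.linalg.eigvalsh(F), np.linalg.eigvalsh(F_phi))
```

In both cases the eigenvalues of $F(\vartheta)$ and $F_\phi(\vartheta)$ coincide, and in the second case, where $a(z) = b(z)$, one eigenvalue is numerically zero before and after the rotation.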
Statistical Distance Measure - Fisher Information and Quantum Information

In quantum information, the Fisher information, the information about a parameter $\theta$ in a particular measurement procedure, is expressed in terms of the statistical distance $s$, see [8,10]. The statistical distance used is defined as a measure to distinguish two probability distributions on the basis of measurement outcomes, see [37]. The Fisher information and the statistical distance are statistical quantities, and generally refer to many measurements, as is the case in this survey. However, in the quantum information theory and quantum statistics context, the problem setup is presented as follows. There may or may not be a small phase change $\theta$, and the question is whether it is there. In that case, quantum experiments can be designed that give the answer unambiguously in a single measurement. The equality derived is of the form

$$F(\theta) = \left( \frac{ds}{d\theta} \right)^2 \tag{41}$$

that is, the Fisher information is the square of the derivative of the statistical distance $s$ with respect to $\theta$. This contrasts with (36), where the square of the statistical distance measure is expressed in terms of entries of a FIM $F(\vartheta)$, which is based on information about $m$ parameters estimated from $n$ measurements, for $m < n$. A challenging question could therefore be formulated as follows: can a generalization of equality (41) be developed in a quantum information context but at the matrix level? To be more specific, consider many observations or measurements that lead to more than one parameter, such that the corresponding Fisher information matrix is interconnected to an appropriate statistical distance matrix, a matrix whose entries are scalar distance measures. This question could equally be a challenge to algebraic matrix theory and to quantum information.

3.4. The Bezoutian - The Fisher Information Matrix

In this section an additional resultant matrix is presented; it concerns the Bezout matrix or Bezoutian.
The notation of Lancaster and Tismenetsky [2] shall be used and the results presented are extracted from [38]. Assume the polynomials $a$ and $b$ given by $a(z) = \sum_{j=0}^{n} a_j z^j$ and $b(z) = \sum_{j=0}^{n} b_j z^j$, cf. (13) but where $p = q = n$, and we further assume $a_0 = b_0 = 1$. The Bezout matrix $B(a, b)$ of the polynomials $a$ and $b$ is defined by the relation

$$a(z)\, b(w) - a(w)\, b(z) = (z - w)\, u_n^\top(z)\, B(a, b)\, u_n(w)$$

This matrix is often referred to as the Bezoutian. We will display a decomposition of the Bezout matrix $B(a, b)$ developed in [38]. For that purpose the matrix $U_\phi$ and its inverse $T_\phi$ are presented, where $\phi$ is a given complex number,

$$U_\phi = \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
-\phi & 1 & & & \vdots \\
0 & \ddots & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & -\phi & 1
\end{pmatrix}, \qquad
T_\phi = \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
\phi & 1 & & & \vdots \\
\phi^2 & \ddots & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
\phi^{n-1} & \cdots & \phi^2 & \phi & 1
\end{pmatrix}$$

Let $(1 - \alpha_1 z)$ and $(1 - \beta_1 z)$ be factors of $a(z)$ and $b(z)$ respectively, where $\alpha_1$ and $\beta_1$ are zeros of $\hat{a}(z)$ and $\hat{b}(z)$. Consider the factored form of the $n$th order polynomials $a(z)$ and $b(z)$, $a(z) = (1 - \alpha_1 z)\, a_{-1}(z)$ and $b(z) = (1 - \beta_1 z)\, b_{-1}(z)$ respectively. Proceeding this way for $\alpha_2, \ldots, \alpha_n$ yields the recursion $a_{-(k-1)}(z) = (1 - \alpha_k z)\, a_{-k}(z)$, and equivalently for the polynomials $b_{-k}(z)$, with $a_0(z) = a(z)$ and $b_0(z) = b(z)$. Proposition 3.1 in [38] is now presented. The following non-symmetric decomposition of the Bezoutian is derived, considering the notations above,

$$B(a, b) = U_{\alpha_1} \begin{pmatrix} B(a_{-1}, b_{-1}) & 0 \\ 0 & 0 \end{pmatrix} U_{\beta_1}^\top + (\beta_1 - \alpha_1)\, b_{\beta_1} a_{\alpha_1}^\top \tag{42}$$

with $a_{\alpha_1}$ such that $a_{\alpha_1}^\top u_n(z) = a_{-1}(z)$, and similarly for $b_{\beta_1}$. Iteration gives the following expansion for the Bezout matrix

$$B(a, b) = \sum_{k=1}^{n} (\beta_k - \alpha_k)\, U_{\alpha_1} \cdots U_{\alpha_{k-1}} U_{\beta_{k+1}} \cdots U_{\beta_n}\, e_1^n (e_1^n)^\top\, U_{\beta_1}^\top \cdots U_{\beta_{k-1}}^\top U_{\alpha_{k+1}}^\top \cdots U_{\alpha_n}^\top$$

where $e_1^n$ is the first standard basis column vector in $\mathbb{R}^n$. By $e_j$ we denote the $j$th coordinate vector, $e_j = (0, \ldots, 1, \ldots, 0)^\top$, with all its components equal to 0 except the $j$th component, which equals 1.
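The Bezoutian can be built directly from its defining relation, $a(z) b(w) - a(w) b(z) = (z - w)\, u_n^\top(z) B(a, b) u_n(w)$. Comparing the coefficients of $z^{i+1} w^{j}$ on both sides gives the recursion $B_{i,j} = B_{i+1,j-1} + c_{i+1,j}$ with $c_{i,j} = a_i b_j - a_j b_i$, where entries outside the matrix are taken as zero. The sketch below, assuming NumPy, implements this recursion; the function name `bezout_matrix` and the example polynomials are illustrative assumptions and this is not the decomposition (42) of [38].

```python
import numpy as np

def bezout_matrix(a, b):
    """Bezout matrix of a(z) = sum a_j z^j and b(z) = sum b_j z^j.

    a and b are ascending coefficient lists [a_0, ..., a_n] of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a) - 1
    c = np.outer(a, b) - np.outer(b, a)      # c[i, j] = a_i b_j - a_j b_i
    B = np.zeros((n, n))
    for j in range(n):
        for i in range(n - 1, -1, -1):
            prev = B[i + 1, j - 1] if (i + 1 < n and j >= 1) else 0.0
            B[i, j] = prev + c[i + 1, j]
    return B

# a(z) = (1 - 0.5 z)(1 - 0.2 z); b shares or does not share the factor (1 - 0.5 z).
a = [1.0, -0.7, 0.1]
b_common = [1.0, -0.2, -0.15]    # (1 - 0.5 z)(1 + 0.3 z)
b_coprime = [1.0, -0.1, -0.12]   # (1 - 0.4 z)(1 + 0.3 z)

print(np.linalg.matrix_rank(bezout_matrix(a, b_common), tol=1e-10))
print(np.linalg.matrix_rank(bezout_matrix(a, b_coprime), tol=1e-10))
```

For the pair with a common factor the Bezoutian loses rank, while for the coprime pair it has full rank, which is the singularity behaviour discussed in this section.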
The following corollaries to Proposition 3.1 in [38] are now presented. Corollary 3.2 in [38] states the following. Let $\phi$ be a common zero of the polynomials $\hat{a}(z)$ and $\hat{b}(z)$. Then $a(z) = (1 - \phi z)\, a_{-1}(z)$ and $b(z) = (1 - \phi z)\, b_{-1}(z)$ and

$$B(a, b) = U_\phi \begin{pmatrix} B(a_{-1}, b_{-1}) & 0 \\ 0 & 0 \end{pmatrix} U_\phi^\top$$

This is a direct consequence of (42), from which it can be concluded that the Bezoutian $B(a, b)$ is non-singular if and only if the polynomials $a(z)$ and $b(z)$ have no common factors. A similar conclusion is drawn for the FIM in (27), so that the matrices $F(\vartheta)$ and $B(a, b)$ have the same singularity property.

Related to Corollary 3.2 in [38], we now give a description of the kernel or null space of the Bezout matrix. Corollary 3.3 in [38] is presented. Let $\phi_1, \ldots, \phi_m$ be all the common zeros of the polynomials $\hat{a}(z)$ and $\hat{b}(z)$, with multiplicities $n_1, \ldots, n_m$. Let $\ell$ be the last standard basis column vector in $\mathbb{R}^n$ and put

$$w_k^j = \left( T_{\phi_k}^{j} J^{j-1} \right)^{\!\top} \ell$$

for $k = 1, \ldots, m$ and $j = 1, \ldots, n_k$, where $J$ denotes the forward $n \times n$ shift matrix, $J_{ij} = 1$ if $i = j + 1$ and $J_{ij} = 0$ otherwise. Consequently, the subspace $\mathrm{Ker}\, B(a, b)$ is the linear span of the vectors $w_k^j$.

An alternative representation to (27), involving the Bezoutian $B(b, a)$ and derived in Proposition 5.1 in [38], is of the form

$$F(\vartheta) = M^{-1}(b, a)\, H(\vartheta)\, M^{-\top}(b, a) \tag{43}$$

where

$$H(\vartheta) = \begin{pmatrix} I & 0 \\ 0 & B(b, a) \end{pmatrix} Q(\vartheta) \begin{pmatrix} I & 0 \\ 0 & B(b, a) \end{pmatrix} \quad \text{and} \quad M(b, a) = \begin{pmatrix} P & 0 \\ P S(\hat{a}) P & P S(\hat{b}) P \end{pmatrix} \tag{44}$$

and

$$P = \begin{pmatrix}
0 & \cdots & 0 & 1 \\
\vdots & & 1 & 0 \\
0 & 1 & & \vdots \\
1 & 0 & \cdots & 0
\end{pmatrix}, \qquad
S(\hat{a}) = \begin{pmatrix}
a_{n-1} & a_{n-2} & \cdots & a_0 \\
a_{n-2} & & a_0 & 0 \\
\vdots & & & \vdots \\
a_0 & 0 & \cdots & 0
\end{pmatrix} \quad \text{and} \quad Q(\vartheta) \succ 0$$

The matrix $S(\hat{a})$ is the symmetrizer of the polynomial $\hat{a}(z)$, where in this paper $a_0 = 1$, see [2], and $P$ is a permutation matrix. In [38] it is shown that the matrix $Q(\vartheta)$ is the unique solution to an appropriate Stein equation and is strictly positive definite. However, in the next section an explicit form of the Stein solution $Q(\vartheta)$ is developed. Some comments concerning the property summarized in Corollary 5.2 in [38] follow.
The matrix $H(\vartheta)$ is non-singular if and only if the polynomials $a(z)$ and $b(z)$ have no common factors. The proof is straightforward, since the matrix $Q(\vartheta)$ is non-singular, which implies that the matrix $H(\vartheta)$ is non-singular only when the Bezoutian $B(b, a)$ is non-singular, and this is fulfilled if and only if the polynomials $a(z)$ and $b(z)$ have no common factors. The matrix $M(b, a)$ is non-singular if $a_0 \neq 0$ and $b_0 \neq 0$, which is the case since we have $a_0 = b_0 = 1$. From (43) it can be concluded that the FIM $F(\vartheta)$ is non-singular only when the matrix $H(\vartheta)$ is non-singular or, by virtue of (44), when the Bezoutian $B(b, a)$ is non-singular. Consequently, the singularity conditions of the Bezoutian $B(b, a)$, the FIM $F(\vartheta)$ and the Sylvester resultant matrix $S(b, -a)$ are equivalent. By virtue of (29), proved in Lemma 3.1, and the equality $\dim(\mathrm{Ker}\, S(a, b)) = \dim(\mathrm{Ker}\, B(a, b))$, proved in Theorem 21.11 in [1], it can be concluded that

$$\dim(\mathrm{Ker}\, S(b, -a)) = \dim(\mathrm{Ker}\, F(\vartheta)) = \dim(\mathrm{Ker}\, B(b, a)) = \nu(a, b)$$

3.5. The Stein Equation - The Fisher Information Matrix of an ARMA(p, q) Process

In [12], a link between the FIM of an ARMA process and an appropriate solution of a Stein equation is set forth. In this survey paper we shall present some of these results and compare them with some results displayed in the previous sections. In addition, alternative proofs will be given to some results obtained in [12,38].

The Stein matrix equation is now set forth. Let $A \in \mathbb{C}^{m \times m}$, $B \in \mathbb{C}^{n \times n}$ and $\Gamma \in \mathbb{C}^{n \times m}$ and consider the Stein equation

$$S - B S A^\top = \Gamma \tag{45}$$

It has a unique solution if and only if $\lambda \mu \neq 1$ for any $\lambda \in \sigma(A)$ and $\mu \in \sigma(B)$, where the spectrum of $D$ is $\sigma(D) = \{\lambda \in \mathbb{C} : \det(\lambda I_m - D) = 0\}$, the set of eigenvalues of $D$. The unique solution will be given in the next theorem [11].

3.5.1. Theorem 3.4

Let $A$ and $B$ be such that there is a single closed contour $C$ with $\sigma(B)$ inside $C$ and, for each non-zero $w \in \sigma(A)$, $w^{-1}$ is outside $C$.
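Then for an arbitrary $\Gamma$ the Stein equation (45) has a unique solution $S$,

$$S = \frac{1}{2\pi i} \oint_{C} (\lambda I_n - B)^{-1}\, \Gamma\, (I_m - \lambda A)^{-\top}\, d\lambda \tag{46}$$

In this section an interconnection between the representation (27) of the FIM $F(\vartheta)$ and an appropriate solution to a Stein equation of the form (45), as developed in [12], is set forth. The distinct roots of the polynomials $\hat{a}(z)$ and $\hat{b}(z)$ are denoted by $\alpha_1, \alpha_2, \ldots, \alpha_p$ and $\beta_1, \beta_2, \ldots, \beta_q$ respectively, such that the non-singularity of the FIM $F(\vartheta)$ is guaranteed. The following representation of the integral expression (28) is obtained when Cauchy's residue theorem is applied, equation (4.8) in [12],

$$\mathcal{P}(\vartheta) = U(\vartheta)\, D(\vartheta)\, \hat{U}(\vartheta) \tag{47}$$

where

$$U(\vartheta) = \{u_{p+q}(\alpha_1), u_{p+q}(\alpha_2), \ldots, u_{p+q}(\alpha_p), u_{p+q}(\beta_1), u_{p+q}(\beta_2), \ldots, u_{p+q}(\beta_q)\}$$

$$D(\vartheta) = \mathrm{diag}\left\{ \left\{ \frac{1}{\hat{a}(z; \alpha_i)\, \hat{b}(\alpha_i)\, a(\alpha_i)\, b(\alpha_i)} \right\}, \left\{ \frac{1}{\hat{a}(\beta_j)\, \hat{b}(z; \beta_j)\, a(\beta_j)\, b(\beta_j)} \right\} \right\}, \quad i = 1, \ldots, p \ \text{ and } \ j = 1, \ldots, q$$

and

$$\hat{U}(\vartheta) = \{v_{p+q}(\alpha_1), v_{p+q}(\alpha_2), \ldots, v_{p+q}(\alpha_p), v_{p+q}(\beta_1), v_{p+q}(\beta_2), \ldots, v_{p+q}(\beta_q)\}^\top$$

The polynomial $p(\cdot\,; \beta)$ is defined accordingly, $p(z; \beta) = \frac{p(z)}{(z - \beta)}$, and $D(\vartheta)$ is a $(p+q) \times (p+q)$ diagonal matrix. The matrices $U(\vartheta)$ and $\hat{U}(\vartheta)$ in (47) are the $(p+q) \times (p+q)$ Vandermonde matrices $V_{\alpha\beta}$ and $\hat{V}_{\alpha\beta}$ respectively, given by

$$V_{\alpha\beta} = \begin{pmatrix}
1 & \alpha_1 & \alpha_1^2 & \cdots & \alpha_1^{p+q-1} \\
1 & \alpha_2 & \alpha_2^2 & \cdots & \alpha_2^{p+q-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \alpha_p & \alpha_p^2 & \cdots & \alpha_p^{p+q-1} \\
1 & \beta_1 & \beta_1^2 & \cdots & \beta_1^{p+q-1} \\
1 & \beta_2 & \beta_2^2 & \cdots & \beta_2^{p+q-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \beta_q & \beta_q^2 & \cdots & \beta_q^{p+q-1}
\end{pmatrix}
\quad \text{and} \quad
\hat{V}_{\alpha\beta} = \begin{pmatrix}
\alpha_1^{p+q-1} & \alpha_1^{p+q-2} & \cdots & \alpha_1 & 1 \\
\alpha_2^{p+q-1} & \alpha_2^{p+q-2} & \cdots & \alpha_2 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\alpha_p^{p+q-1} & \alpha_p^{p+q-2} & \cdots & \alpha_p & 1 \\
\beta_1^{p+q-1} & \beta_1^{p+q-2} & \cdots & \beta_1 & 1 \\
\beta_2^{p+q-1} & \beta_2^{p+q-2} & \cdots & \beta_2 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\beta_q^{p+q-1} & \beta_q^{p+q-2} & \cdots & \beta_q & 1
\end{pmatrix}$$

It is clear that the $(p+q) \times (p+q)$ Vandermonde matrices $V_{\alpha\beta}$ and $\hat{V}_{\alpha\beta}$ are nonsingular when $\alpha_i \neq \alpha_j$, $\beta_k \neq \beta_h$ and $\alpha_i \neq \beta_k$ for all $i, j = 1, \ldots, p$ and $k, h = 1, \ldots, q$.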
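The nonsingularity condition on the Vandermonde matrix can be illustrated with the standard product formula for a Vandermonde determinant, $\prod_{i<j}(x_j - x_i)$ for nodes $x_1, \ldots, x_N$ and rows $(1, x_i, x_i^2, \ldots)$. The sketch below assumes NumPy; the helper names and the node values are illustrative assumptions, with the $\alpha$ nodes listed first and the $\beta$ nodes after them, as in $V_{\alpha\beta}$.

```python
import numpy as np
from itertools import combinations

def vandermonde(nodes):
    """Square Vandermonde matrix with rows (1, x_i, x_i^2, ..., x_i^(N-1))."""
    return np.vander(np.asarray(nodes, dtype=float), increasing=True)

def vandermonde_det(nodes):
    """Product formula prod_{i<j} (x_j - x_i) for the determinant."""
    return float(np.prod([xj - xi for xi, xj in combinations(nodes, 2)]))

alphas, betas = [0.5, 0.2], [0.4, -0.3]
V = vandermonde(alphas + betas)
print(np.linalg.det(V), vandermonde_det(alphas + betas))

# A zero shared by the two sets of nodes makes two rows coincide.
V_shared = vandermonde([0.5, 0.2] + [0.5, -0.3])
print(np.linalg.matrix_rank(V_shared))
```

With pairwise distinct nodes the two determinant values agree and are non-zero, whereas a node shared between the $\alpha$ and $\beta$ sets produces a rank-deficient matrix.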
A rigorous systematic evaluation of the Vandermonde determinants $\det V_{\alpha\beta}$ and $\det \hat{V}_{\alpha\beta}$ yields

$$\det V_{\alpha\beta} = (-1)^{(p+q)(p+q-1)/2}\, \Phi(\alpha_i, \beta_k)$$

where Φ (αi, βk) = ∏ 1≤i q (67)

$P(\vartheta)$ is then of the form

$$W(\vartheta) = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+r}(z)\, v_{p+r}^{\top}(z)}{h(z)\hat{h}(z)a(z)\hat{a}(z)b(z)\hat{b}(z)}\, dz \tag{68}$$

We will formulate a Stein equation for the matrix $\Gamma = e_{p+r} e_{p+r}^{\top}$, which is of the form

$$S - C_f S C_f^{\top} = e_{p+r} e_{p+r}^{\top} \tag{69}$$

where $e_{p+r}$ is the last standard basis column vector in $\mathbb{R}^{p+r}$. The next lemma is formulated.

3.7.1. Lemma 3.8

The matrix $P(\vartheta)$ given in (68) fulfills the Stein equation (69).

Proof

The unique solution of (69) is assured since the products of the eigenvalues of $C_f$ are all different from one. The solution is of the form

$$S = \frac{1}{2\pi i} \oint_{|z|=1} (zI_{p+r} - C_f)^{-1}\, e_{p+r} e_{p+r}^{\top}\, (I_{p+r} - zC_f)^{-\top}\, dz$$

or

$$S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{\operatorname{adj}(zI_{p+r} - C_f)\, e_{p+r} e_{p+r}^{\top}\, \operatorname{adj}(I_{p+r} - zC_f)^{\top}}{\hat{a}(z)\hat{b}(z)\hat{h}(z)a(z)b(z)h(z)}\, dz.$$

Taking the property of the companion matrix $C_f$ into account implies that the last column vector of $\operatorname{adj}(zI_{p+r} - C_f)$ is the basic vector $u_{p+r}(z)$; consequently, the last column of $\operatorname{adj}(I_{p+r} - zC_f)$ is the basic vector $v_{p+r}(z)$. This yields

$$S = \frac{1}{2\pi i} \oint_{|z|=1} \frac{u_{p+r}(z)\, v_{p+r}^{\top}(z)}{\hat{a}(z)\hat{b}(z)\hat{h}(z)a(z)b(z)h(z)}\, dz = W(\vartheta).$$

Consequently, the matrix $P(\vartheta)$ defined in (68) verifies the Stein equation (69). This completes the proof.

The matrices ℘(ϑ) and $P(\vartheta)$, in (65), verify under specific conditions appropriate Stein equations, as has been shown in Lemma 3.5 and Lemma 3.8, respectively. We will now confirm the presence of Vandermonde matrices by applying the standard residue theorem to $P(\vartheta)$ in (68), to obtain

$$W(\vartheta) = V_{\alpha\beta\xi}\, R(\vartheta)\, \hat{V}_{\alpha\beta\xi} \tag{70}$$

The $(p + r) \times (p + r)$ diagonal matrix $R(\vartheta)$ is of the form

$$R(\vartheta) = \operatorname{diag}\left\{ \frac{1}{\hat{a}(z;\alpha_i)\hat{b}(\alpha_i)\hat{h}(\alpha_i)\phi(\alpha_i)},\; \frac{1}{\hat{a}(\beta_j)\hat{b}(z;\beta_j)\hat{h}(\beta_j)\phi(\beta_j)},\; \frac{1}{\hat{a}(\xi_k)\hat{b}(\xi_k)\hat{h}(z;\xi_k)\phi(\xi_k)} \right\}$$

where $\phi(z) = a(z)b(z)h(z)$ and $i = 1, \ldots, p$, $j = 1, \ldots, q$ and $k = 1, \ldots, \ell$.
The $(p + r) \times (p + r)$ matrices $V_{\alpha\beta\xi}$ and $\hat{V}_{\alpha\beta\xi}$ are of the form

$$V_{\alpha\beta\xi} = \begin{pmatrix}
1 & \alpha_1 & \alpha_1^2 & \cdots & \alpha_1^{p+r-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \alpha_p & \alpha_p^2 & \cdots & \alpha_p^{p+r-1} \\
1 & \beta_1 & \beta_1^2 & \cdots & \beta_1^{p+r-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \beta_q & \beta_q^2 & \cdots & \beta_q^{p+r-1} \\
1 & \xi_1 & \xi_1^2 & \cdots & \xi_1^{p+r-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \xi_\ell & \xi_\ell^2 & \cdots & \xi_\ell^{p+r-1}
\end{pmatrix}^{\top},
\quad
\hat{V}_{\alpha\beta\xi} = \begin{pmatrix}
\alpha_1^{p+r-1} & \alpha_1^{p+r-2} & \cdots & \alpha_1 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\alpha_p^{p+r-1} & \alpha_p^{p+r-2} & \cdots & \alpha_p & 1 \\
\beta_1^{p+r-1} & \beta_1^{p+r-2} & \cdots & \beta_1 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\beta_q^{p+r-1} & \beta_q^{p+r-2} & \cdots & \beta_q & 1 \\
\xi_1^{p+r-1} & \xi_1^{p+r-2} & \cdots & \xi_1 & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
\xi_\ell^{p+r-1} & \xi_\ell^{p+r-2} & \cdots & \xi_\ell & 1
\end{pmatrix}$$

The $(p + r) \times (p + r)$ Vandermonde matrices $V_{\alpha\beta\xi}$ and $\hat{V}_{\alpha\beta\xi}$ are nonsingular when $\alpha_i \neq \alpha_j$ for $i \neq j$, $\beta_k \neq \beta_h$ for $k \neq h$, $\xi_m \neq \xi_n$ for $m \neq n$, $\alpha_i \neq \beta_k$, $\alpha_i \neq \xi_m$ and $\beta_k \neq \xi_m$, for all $i, j = 1, \ldots, p$, $k, h = 1, \ldots, q$ and $m, n = 1, \ldots, \ell$. The Vandermonde determinants $\det V_{\alpha\beta\xi}$ and $\det \hat{V}_{\alpha\beta\xi}$ are

$$\det V_{\alpha\beta\xi} = (-1)^{(p+r)(p+r-1)/2}\, \Psi(\alpha_i, \beta_k, \xi_m)$$

where Ψ (αi, βk, ξm) = ∏ 1≤i 0:

$$\lim_{m \to \infty} P^m(x_0, A) = \pi(A) \tag{5}$$

for any $A \in \mathcal{B}$. Certain conditions are required for Equation (5) to hold, but for all Markov chains presented here, these are satisfied (though, see [8]). A useful condition, which is sufficient (though not necessary) for $\pi(\cdot)$ to be an invariant distribution, is reversibility, which can be shown by the relation:

$$\pi(x)\,p(x'|x) = \pi(x')\,p(x|x'). \tag{6}$$

Integrating over both sides with respect to $x$, we recover Equation (4). In other words, a chain is reversible if, at stationarity, the probability that $x_i \in A$ and $x_{i+1} \in B$ is equal to the probability that $x_{i+1} \in A$ and $x_i \in B$. The relation (6) will be the primary tool used to construct Markov chains with a desired invariant distribution in the next section.

2.1.1. Monte Carlo Estimates from Markov Chains

Of most interest here are estimators constructed from a Markov chain.
The Ergodic Theorem states that for any chain, $\{X_m\}_{m \in \mathbb{N}}$, satisfying Equation (5) and any $g \in L^1(\pi)$, we have that:

$$\lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} g(X_i) = \mathbb{E}_\pi[g(X)] \tag{7}$$

with probability one [7]. This is a Markov chain analogue to the Law of large numbers.

The efficiency of estimators of the form $\hat{t}_m = \sum_i g(X_i)/m$ can be assessed through the autocorrelation between elements in the chain. We will assess the efficiency of $\hat{t}_m$ relative to estimators $\bar{t}_m = \sum_i g(Z_i)/m$, where $\{Z_i\}_{i \in \mathbb{N}}$ is a sequence of independent random variables, each having distribution $\pi(\cdot)$. Provided $\mathrm{Var}_\pi[g(Z_i)] < \infty$, then $\mathrm{Var}[\bar{t}_m] = \mathrm{Var}_\pi[g(Z_i)]/m$. We now seek a similar result for estimators of the form $\hat{t}_m$.

It follows directly from the Kipnis–Varadhan Theorem [9] that an estimator, $\hat{t}_m$, from a reversible Markov chain for which $X_0 \sim \pi(\cdot)$ satisfies:

$$\lim_{m \to \infty} \frac{\mathrm{Var}[\hat{t}_m]}{\mathrm{Var}[\bar{t}_m]} = 1 + 2 \sum_{i=1}^{\infty} \rho^{(0,i)} = \tau, \tag{8}$$

provided that $\sum_{i=1}^{\infty} i\,|\rho^{(0,i)}| < \infty$, where $\rho^{(0,i)} = \mathrm{Corr}_\pi[g(X_0), g(X_i)]$. We will refer to the constant, $\tau$, as the autocorrelation time for the chain. Equation (8) implies that for large enough $m$, $\mathrm{Var}[\hat{t}_m] \approx \tau\,\mathrm{Var}[\bar{t}_m]$. In practical applications, the sum in Equation (8) is truncated to the first $p - 1$ realisations of the chain, where $p$ is the first instance at which $|\rho^{(0,p)}| < \epsilon$ for some $\epsilon > 0$. For example, in the Convergence Diagnosis and Output Analysis for MCMC (CODA) package within the R statistical software, $\epsilon = 0.05$ [10,11].

Another commonly used measure of efficiency is the effective sample size $m_{eff} = m/\tau$, which gives the number of independent samples from $\pi(\cdot)$ needed to give an equally efficient estimate for $\mathbb{E}_\pi[g(X)]$. Clearly, minimising $\tau$ is equivalent to maximising $m_{eff}$.

The measures arising from Equation (8) give some intuition for what sort of Markov chain gives rise to efficient estimators. However, in practice, the chain will never be at stationarity. Therefore, we also assess Markov chains according to how far away they are from this point.
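The truncated sum in Equation (8) and the effective sample size can be sketched in a few lines. This is a minimal illustration rather than the CODA implementation: the function name `autocorr_time`, the use of a stationary Gaussian AR(1) process as a stand-in for a reversible chain, and the values of `phi` and `m` are all assumptions made for the example.

```python
import numpy as np

def autocorr_time(g, eps=0.05):
    """Estimate tau = 1 + 2 * sum_i rho_i, stopping at the first |rho_p| < eps."""
    g = np.asarray(g, dtype=float)
    m = len(g)
    c = g - g.mean()
    var = np.dot(c, c) / m
    tau = 1.0
    for lag in range(1, m):
        # Empirical lag-`lag` autocorrelation of the sequence g(X_i).
        rho = np.dot(c[:-lag], c[lag:]) / (m * var)
        if abs(rho) < eps:
            break
        tau += 2.0 * rho
    return tau

# Hypothetical chain: AR(1) with coefficient phi, started at stationarity.
rng = np.random.default_rng(1)
phi, m = 0.9, 50_000
x = np.empty(m)
x[0] = rng.standard_normal() / np.sqrt(1.0 - phi**2)
for i in range(1, m):
    x[i] = phi * x[i - 1] + rng.standard_normal()

tau = autocorr_time(x)   # untruncated theoretical value is (1 + phi) / (1 - phi) = 19
m_eff = m / tau          # effective sample size
```

Because the sum is cut off once the estimated correlation drops below `eps`, the estimate will usually sit a little below the untruncated value of $\tau$.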
To do so, we need to measure how close $P^m(x_0, \cdot)$ is to $\pi(\cdot)$, which requires a notion of distance between probability distributions. Although there are several appropriate choices [12], a common option in the Markov chain literature is the total variation distance:

$$\|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|, \tag{9}$$

which informally gives the largest possible difference between the probabilities of a single event in $\mathcal{B}$ according to $\mu(\cdot)$ and $\nu(\cdot)$. If both distributions admit densities, Equation (9) can be written (see Appendix A):

$$\|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)|\,dx, \tag{10}$$

which is proportional to the $L^1$ distance between $\mu(x)$ and $\nu(x)$. The metric takes values $\|\cdot\|_{TV} \in [0, 1]$, with $\|\cdot\|_{TV} = 1$ for distributions with disjoint supports, and $\|\mu(\cdot) - \nu(\cdot)\|_{TV} = 0$ implies $\mu(\cdot) \equiv \nu(\cdot)$.

Typically, for an unbounded $\mathcal{X}$, the distance $\|P^m(x_0, \cdot) - \pi(\cdot)\|_{TV}$ will depend on $x_0$ for any finite $m$. Therefore, bounds on the distance are often sought via some inequality of the form:

$$\|P^m(x_0, \cdot) - \pi(\cdot)\|_{TV} \leq M V(x_0) f(m), \tag{11}$$

for some $M < \infty$, where $V : \mathcal{X} \to [1, \infty)$ depends on $x_0$ and is called a drift function, and $f : \mathbb{N} \to [0, \infty)$ depends on the number of iterations, $m$ (and is often defined such that $f(0) = 1$).

A Markov chain is called geometrically ergodic if $f(m) = r^m$ in Equation (11) for some $0 < r < 1$. If, in addition to this, $V$ is bounded above, the chain is called uniformly ergodic. Intuitively, if either condition holds, then the distribution of $X_m$ will converge to $\pi(\cdot)$ geometrically quickly as $m$ grows, and in the uniform case, this rate is independent of $x_0$. As well as providing some (often qualitative if $M$ and $r$ are unknown) bounds on the convergence rate of a Markov chain, geometric ergodicity implies that a central limit theorem exists for estimators of the form $\hat{t}_m$. For more detail on this, see [13,14].
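As a concrete instance of Equation (10), the total variation distance between two densities can be approximated on a grid. The sketch below is illustrative only: the helper names, the trapezoidal rule, the grid limits and the pair $N(0, 1)$ and $N(1, 1)$ are assumptions chosen because that pair has the closed form $2\Phi(1/2) - 1$ to compare against.

```python
import numpy as np
from math import erf, sqrt

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def tv_distance(p, q, grid):
    """Approximate 0.5 * integral |p(x) - q(x)| dx by the trapezoidal rule."""
    y = np.abs(p(grid) - q(grid))
    return 0.5 * float(np.sum(0.5 * (y[:-1] + y[1:]) * np.diff(grid)))

grid = np.linspace(-12.0, 13.0, 200_001)
tv = tv_distance(lambda x: normal_pdf(x, 0.0, 1.0),
                 lambda x: normal_pdf(x, 1.0, 1.0), grid)

# Closed form for two unit-variance normals whose means differ by 1:
# 2 * Phi(1/2) - 1 = erf(1 / (2 * sqrt(2))).
exact = erf(1.0 / (2.0 * sqrt(2.0)))
```

Grid integration of this kind is only practical in low dimensions, which is one reason bounds of the form (11) are studied analytically instead.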
In practice, several approximate methods also exist to assess whether a chain is close enough to stationarity for long-run averages to provide suitable estimators (e.g., [15]). The MCMC practitioner also uses a variety of visual aids to judge whether an estimate from the chain will be appropriate for his or her needs.

2.2. Markov Chain Monte Carlo

Now that we have introduced Markov chains, we turn to simulating them. The objective here is to devise a method for generating a Markov chain which has a desired limiting distribution, $\pi(\cdot)$. In addition, we would strive for the convergence rate to be as fast as possible and the effective sample size to be suitably large relative to the number of iterations. Of course, the computational cost of performing an iteration is also an important practical consideration. Ideally, any method would also require limited problem-specific alterations, so that practitioners are able to use it with as little knowledge of the inner workings as is practical.

Although other methods exist for constructing chains with a desired limiting distribution, a popular choice is the Metropolis–Hastings algorithm [7]. At iteration $i$, a sample is drawn from some candidate transition kernel, $Q(x_{i-1}, \cdot)$, and then either accepted or rejected (in which case, the state of the chain remains $x_{i-1}$). We focus here on the case where $Q(x_{i-1}, \cdot)$ admits a density, $q(x'|x_{i-1})$, for all $x_{i-1} \in \mathcal{X}$ (though, see [8]). In this case, a single step is shown below (the wedge notation $a \wedge b$ denotes the minimum of $a$ and $b$). The "acceptance rate", $\alpha(x_{i-1}, x')$, governs the behaviour of the chain: when it is close to one, many proposed moves are accepted, and the current value in the chain is constantly changing. If it is on average close to zero, then many proposals are rejected, so that the chain will remain in the same place for many iterations. However, $\alpha \approx 1$ is typically not ideal, often resulting in a large autocorrelation time (see below).
The challenge in practice is to find the right acceptance rate to balance these two extremes.

Algorithm 1 Metropolis–Hastings, single iteration.
Require: $x_{i-1}$
    Draw $X' \sim Q(x_{i-1}, \cdot)$
    Draw $Z \sim U[0, 1]$
    Set $\alpha(x_{i-1}, x') \leftarrow 1 \wedge \dfrac{\pi(x')\,q(x_{i-1}|x')}{\pi(x_{i-1})\,q(x'|x_{i-1})}$
    if $z < \alpha(x_{i-1}, x')$ then
        Set $x_i \leftarrow x'$
    else
        Set $x_i \leftarrow x_{i-1}$
    end if

Combining the "proposal" and "acceptance" steps, the transition kernel for the resulting Markov chain is:

$$P(x, A) = r(x)\,\delta_x(A) + \int_A \alpha(x, x')\,q(x'|x)\,dx', \tag{12}$$

for any $A \in \mathcal{B}$, where:

$$r(x) = 1 - \int_{\mathcal{X}} \alpha(x, x')\,q(x'|x)\,dx'$$

is the average probability that a draw from $Q(x, \cdot)$ will be rejected, and $\delta_x(A) = 1$ if $x \in A$ and zero otherwise. A Markov chain defined in this way will have $\pi(\cdot)$ as an invariant distribution, since the chain is reversible for $\pi(\cdot)$. We note here that:

$$\pi(x_{i-1})\,q(x_i|x_{i-1})\,\alpha(x_{i-1}, x_i) = \pi(x_{i-1})\,q(x_i|x_{i-1}) \wedge \pi(x_i)\,q(x_{i-1}|x_i) = \alpha(x_i, x_{i-1})\,q(x_{i-1}|x_i)\,\pi(x_i)$$

in the case that the proposed move is accepted, and that if the proposed move is rejected, then $x_i = x_{i-1}$; so the chain is reversible for $\pi(\cdot)$. It can be shown that $\pi(\cdot)$ is also the limiting distribution for the chain [7].

The convergence rate and autocorrelation time of a chain produced by the algorithm are dependent on both the choice of proposal, $Q(x_{i-1}, \cdot)$, and the target distribution, $\pi(\cdot)$. For simple forms of the latter, less consideration is required when choosing the former. A broad objective among researchers in the field is to find classes of proposal kernels that produce chains that converge and mix quickly for a large class of target distributions. We first review a simple choice before discussing one that is more sophisticated and that will be the focus of the rest of the article.

2.3. Random Walk Proposals

An extremely simple choice for $Q(x, \cdot)$ is one for which:

$$q(x'|x) = q(\|x' - x\|) \tag{13}$$

where $\|\cdot\|$ denotes some appropriate norm on $\mathcal{X}$, meaning the proposal is symmetric. In this case, the acceptance rate reduces to:

$$\alpha(x, x') = 1 \wedge \frac{\pi(x')}{\pi(x)}. \tag{14}$$

In addition to simplifying calculations, Equation (14) strengthens the intuition for the method, since proposed moves with higher density under $\pi(\cdot)$ will always be accepted. A typical choice for $Q(x, \cdot)$ is $N(x, \lambda^2 \Sigma)$, where the matrix, $\Sigma$, is often chosen in an attempt to match the correlation structure of $\pi(\cdot)$ or simply taken as the identity [16]. The tuning parameter, $\lambda$, is the only other user-specified input required.

Much research has been conducted into properties of the random walk Metropolis algorithm (RWM). It has been shown that the optimal acceptance rate for proposals tends to 0.234 as the dimension, $n$, of the state space, $\mathcal{X}$, tends to $\infty$ for a wide class of targets (e.g., [17,18]). The intuition for an optimal acceptance rate is to find the right balance between the distance of proposed moves and the chances of acceptance. Increasing the former will reduce the autocorrelation in the chain if the proposal is accepted, but if it is rejected, the chain will not move at all, so autocorrelation will be high. Random walk proposals are sometimes referred to as blind (e.g., [19]), as no information about $\pi(\cdot)$ is used when generating proposals, so typically, very large moves will result in a very low chance of acceptance, while small moves will be accepted, but result in very high autocorrelation for the chain. Figure 1 demonstrates this in the simple case where $\pi(\cdot)$ is a one-dimensional $N(0, 1^2)$ distribution.

[Figure 1: three traceplot panels; panel titles and axis labels not recoverable.]

Figure 1. These traceplots show the evolution of three RWM Markov chains for which $\pi(\cdot)$ is a $N(0, 1^2)$ distribution, with different choices for $\lambda$.
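The behaviour described for Figure 1 can be reproduced with a short random walk Metropolis sketch. The code is an illustration under stated assumptions: the target is taken to be a one-dimensional standard normal, the function name `rwm` is hypothetical, and the three values of $\lambda$ are arbitrary choices for a small, moderate and large step, not the values used in the figure.

```python
import numpy as np

def rwm(log_pi, x0, lam, n_iter, rng):
    """One-dimensional random walk Metropolis with N(x, lam^2) proposals."""
    x = float(x0)
    chain = np.empty(n_iter)
    accepted = 0
    for i in range(n_iter):
        prop = x + lam * rng.standard_normal()
        # Symmetric proposal, so the acceptance rate is 1 ^ pi(prop) / pi(x),
        # evaluated on the log scale for numerical stability.
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(x):
            x = prop
            accepted += 1
        chain[i] = x
    return chain, accepted / n_iter

def log_pi(x):
    # Log density of N(0, 1) up to an additive constant.
    return -0.5 * x**2

rng = np.random.default_rng(2)
# Hypothetical step sizes: too small, moderate, too large.
rates = {lam: rwm(log_pi, 0.0, lam, 5000, rng)[1] for lam in (0.1, 2.4, 50.0)}
```

With the smallest step nearly every proposal is accepted but the chain moves slowly, while with the largest step most proposals are rejected; plotting the three chains gives traceplots of the kind the figure describes.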
Several authors have also shown that for certain classes of $\pi(\cdot)$, the tuning parameter, $\lambda$, should be chosen such that $\lambda^2 \propto n^{-1}$, so that $\alpha \nrightarrow 0$ as $n \to \infty$ [20]. Because of this, we say that algorithm efficiency "scales" $O(n^{-1})$ as the dimension $n$ of $\pi(\cdot)$ increases.

Ergodicity results for a Markov chain constructed using the RWM algorithm also exist [21–23]. At least exponentially light tails are a necessity for $\pi(x)$ for geometric ergodicity, which means that $\pi(x)/e^{-\|x\|} \to c$ as $\|x\| \to \infty$, for some constant, $c$. For super-exponential tails (where $\pi(x) \to 0$ at a faster than exponential rate), additional conditions are required [21,23]. We demonstrate with a simple example why heavy-tailed forms of $\pi(x)$ pose difficulties here (where $\pi(x) \to 0$ at a rate slower than $e^{-\|x\|}$).

Example: Take $\pi(x) \propto 1/(1 + x^2)$, so that $\pi(\cdot)$ is a Cauchy distribution. Then, if $X' \sim N(x, \lambda^2)$, the ratio $\pi(x')/\pi(x) = (1 + x^2)/(1 + (x')^2) \to 1$ as $|x| \to \infty$. Therefore, if $x_0$ is far away from zero, the Markov chain will dissolve into a random walk, with almost every proposal being accepted.

It should be noted that starting the chain at or near zero can also cause problems in the above example, as the tails of the distribution may not be explored. See [7] for more detail here.

Ergodicity results for the RWM also exist for specific classes of statistical model. Conditions for geometric ergodicity in the case of generalised linear mixed models are given in [24], while spherically constrained target densities are discussed in [25]. In [26], the authors provide necessary conditions for the geometric convergence of RWM algorithms, which are related to the existence of exponential moments for $\pi(\cdot)$ and $P(x, \cdot)$. Weaker forms of ergodicity and corresponding conditions are also discussed in the paper.
In the remainder of the article, we will primarily discuss another approach to choosing $Q$, which has been shown empirically [1] and, in some cases, theoretically [20] to be superior to the RWM algorithm, though it should be noted that random walk proposals are still widely used in practice and are often sufficient for more straightforward problems [16].

3. Diffusions

In MCMC, we are concerned with discrete time processes. However, often, there are benefits to first considering a continuous time process with the properties we desire. For example, some continuous time processes can be specified via a form of differential equation. In this section, we derive a choice for a Metropolis–Hastings proposal kernel based on approximations to diffusions, those continuous-time $n$-dimensional Markov processes $(X_t)_{t \geq 0}$ for which any sample path $t \mapsto X_t(\omega)$ is a continuous function with probability one. For any fixed $t$, we assume $X_t$ is a random variable taking values on the measurable space $(\mathcal{X}, \mathcal{B})$ as before. The motivation for this section is to define a class of diffusions for which $\pi(\cdot)$ is the invariant distribution. First, we provide some preliminaries, followed by an introduction to our main object of study, the Langevin diffusion.

3.1. Preliminaries

We focus on the class of time-homogeneous Itô diffusions, whose dynamics are governed by a stochastic differential equation of the form:

$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dB_t, \quad X_0 = x_0, \tag{15}$$

where $(B_t)_{t \geq 0}$ is a standard Brownian motion and the drift vector, $b$, and volatility matrix, $\sigma$, are Lipschitz continuous [27].
Since $\mathbb{E}[B_{t+\Delta t} - B_t \mid B_t = b_t] = 0$ for any $\Delta t \geq 0$, informally, we can see that:

$$\mathbb{E}[X_{t+\Delta t} - X_t \mid X_t = x_t] = b(x_t)\,\Delta t + o(\Delta t), \tag{16}$$

implying that the drift dictates how the mean of the process changes over a small time interval, and if we define the process $(M_t)_{t \geq 0}$ through the relation:

$$M_t = X_t - \int_0^t b(X_s)\,ds \tag{17}$$

then we have:

$$\mathbb{E}[(M_{t+\Delta t} - M_t)(M_{t+\Delta t} - M_t)^T \mid M_t = m_t, X_t = x_t] = \sigma(x_t)\sigma(x_t)^T \Delta t + o(\Delta t), \tag{18}$$

giving the stochastic part of the relationship between $X_{t+\Delta t}$ and $X_t$ for small enough $\Delta t$; see, e.g., [28].

While Equation (15) is often a suitable description of an Itô diffusion, it can also be characterised through an infinitesimal generator, $\mathcal{A}$, which describes how functions of the process are expected to evolve. We define this partial differential operator through its action on a function, $f \in C_0(\mathcal{X})$, as:

$$\mathcal{A} f(x_t) = \lim_{\Delta t \to 0} \frac{\mathbb{E}[f(X_{t+\Delta t}) \mid X_t = x_t] - f(x_t)}{\Delta t}, \tag{19}$$

though $\mathcal{A}$ can be associated with the drift and volatility of $(X_t)_{t \geq 0}$ by the relation:

$$\mathcal{A} f(x) = \sum_i b_i(x) \frac{\partial f}{\partial x_i}(x) + \frac{1}{2} \sum_{i,j} V_{ij}(x) \frac{\partial^2 f}{\partial x_i \partial x_j}(x), \tag{20}$$

where $V_{ij}(x)$ denotes the component in row $i$ and column $j$ of $\sigma(x)\sigma(x)^T$ [27].

As in the discrete case, we can describe the transition kernel of a continuous time Markov process, $P^t(x_0, \cdot)$. In the case of an Itô diffusion, $P^t(x_0, \cdot)$ admits a density, $p_t(x|x_0)$, which, in fact, varies smoothly as a function of $t$. The Fokker–Planck equation describes this variation in terms of the drift and volatility and is given by:

$$\frac{\partial}{\partial t} p_t(x|x_0) = -\sum_i \frac{\partial}{\partial x_i}\left[b_i(x)\,p_t(x|x_0)\right] + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left[V_{ij}(x)\,p_t(x|x_0)\right]. \tag{21}$$

Although, typically, the form of $P^t(x_0, \cdot)$ is unknown, the expectation and variance of $X_t \sim P^t(x_0, \cdot)$ are given by the integral equations:

$$\mathbb{E}[X_t \mid X_0 = x_0] = x_0 + \mathbb{E}\left[\int_0^t b(X_s)\,ds\right],$$

$$\mathbb{E}[(X_t - \mathbb{E}[X_t])(X_t - \mathbb{E}[X_t])^T \mid X_0 = x_0] = \mathbb{E}\left[\int_0^t \sigma(X_s)\sigma(X_s)^T\,ds\right],$$

where the second of these is a result of the Itô isometry [27].
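Equations (15) and (16) suggest a simple way to simulate such a diffusion approximately: over a small step, add the drift times the step length and a Gaussian increment scaled by the volatility. The sketch below applies this Euler–Maruyama scheme to many independent one-dimensional paths at once. The function name, the step size and the choice $b(x) = -x/2$, $\sigma(x) = 1$ (an Ornstein–Uhlenbeck process, whose invariant distribution is $N(0, 1)$) are assumptions made for illustration.

```python
import numpy as np

def euler_maruyama(b, sigma, x0, dt, n_steps, rng):
    """Approximate X_T for dX = b(X) dt + sigma(X) dB, one scalar path per entry of x0."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # Brownian increments over a step of length dt are N(0, dt).
        dB = np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + b(x) * dt + sigma(x) * dB
    return x

rng = np.random.default_rng(3)
# 20,000 independent paths, all started at x0 = 3 and run to time T = 10.
x_T = euler_maruyama(b=lambda x: -0.5 * x,
                     sigma=lambda x: np.ones_like(x),
                     x0=np.full(20_000, 3.0), dt=0.01, n_steps=1000, rng=rng)
sample_mean, sample_var = x_T.mean(), x_T.var()
```

By time $T = 10$ the influence of the starting point has largely decayed, so the sample mean and variance should be close to 0 and 1; the discretisation itself introduces a small bias that shrinks with `dt`.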
Continuing the analogy, a natural question is whether a diffusion process has an invariant distribution, $\pi(\cdot)$, and whether:

$$\lim_{t \to \infty} P^t(x_0, A) = \pi(A) \tag{22}$$

for any $A \in \mathcal{B}$ and any $x_0 \in \mathcal{X}$, in some sense. For a large class of diffusions (which we confine ourselves to), this is, in fact, the case. Specifically, in the case of positive Harris recurrent diffusions with invariant distribution $\pi(\cdot)$, all compact sets must be small for some skeleton chain; see [29] for details. In addition, Equation (21) provides a means of finding $\pi(\cdot)$, given $b$ and $\sigma$. Setting the left-hand side of Equation (21) to zero gives:

$$\sum_i \frac{\partial}{\partial x_i}\left[b_i(x)\,\pi(x)\right] = \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\left[V_{ij}(x)\,\pi(x)\right], \tag{23}$$

which can be solved to find $\pi(\cdot)$.

3.2. Langevin Diffusions

Given Equation (23), our goal becomes clearer: find drift and volatility terms so that the resulting dynamics describe a diffusion which converges to some user-defined invariant distribution, $\pi(\cdot)$. This process can then be used as a basis for choosing $Q$ in a Metropolis–Hastings algorithm. The Langevin diffusion, first used to describe the dynamics of molecular systems [30], is such a process, given by the solution to the stochastic differential equation:

$$dX_t = \frac{1}{2} \nabla \log \pi(X_t)\,dt + dB_t, \quad X_0 = x_0. \tag{24}$$

Since $V_{ij}(x) = \mathbb{1}_{\{i=j\}}$, it is clear that

$$\frac{1}{2} \frac{\partial}{\partial x_i}\left[\log \pi(x)\right] \pi(x) = \frac{1}{2} \frac{\partial}{\partial x_i} \pi(x), \quad \forall i, \tag{25}$$

which is a sufficient condition for Equation (23) to hold. Therefore, for any case in which $\pi(x)$ is suitably regular, so that $\nabla \log \pi(x)$ is well-defined and the derivatives in Equation (23) exist, we can use (24) to construct a diffusion which has invariant distribution $\pi(\cdot)$.

Roberts and Tweedie [31] give sufficient conditions on $\pi(\cdot)$ under which a diffusion, $(X_t)_{t \geq 0}$, with dynamics given by Equation (24), will be ergodic, meaning:

$$\|P^t(x_0, \cdot) - \pi(\cdot)\|_{TV} \to 0 \tag{26}$$

as $t \to \infty$, for any $x_0 \in \mathcal{X}$.

3.3. Metropolis-Adjusted Langevin Algorithm

We can use Langevin diffusions as a basis for MCMC in many ways, but a popular variant is known as the Metropolis-adjusted Langevin algorithm (MALA), whereby $Q(x, \cdot)$ is constructed through a Euler–Maruyama discretisation of (24) and used as a candidate kernel in a Metropolis–Hastings algorithm. The resulting $Q$ is:

$$Q(x, \cdot) \equiv N\left(x + \frac{\lambda^2}{2} \nabla \log \pi(x),\; \lambda^2 I\right), \tag{27}$$

where $\lambda$ is again a tuning parameter.

Before we discuss the theoretical properties of the approach, we first offer an intuition for the dynamics. From Equation (27), it can be seen that Langevin-type proposals comprise a deterministic shift towards a local mode of $\pi(x)$, combined with some random additive Gaussian noise, with variance $\lambda^2$ for each component. The relative weights of the deterministic and random parts are fixed, given as they are by the parameter, $\lambda$. Typically, if $\lambda^{1/2} \gg \lambda$, then the random part of the proposal will dominate and vice versa in the opposite case, though this also depends on the form of $\nabla \log \pi(x)$ [31].

Again, since this is a Metropolis–Hastings method, choosing $\lambda$ is a balance between proposing large enough jumps and ensuring that a reasonable proportion are accepted. It has been shown that in the limit, as $n \to \infty$, the optimal acceptance rate for the algorithm is 0.574 for forms of $\pi(\cdot)$ which either have independent and identically distributed components or whose components only differ by some scaling factor [20]. In these cases, as $n \to \infty$, the parameter, $\lambda$, must be $\propto n^{-1/3}$, so we say the algorithm efficiency scales $O(n^{-1/3})$. Note that these results compare favourably with the $O(n^{-1})$ scaling of the random walk algorithm.

Convergence properties of the method have also been established. Roberts and Tweedie [31] highlight some cases in which MALA is either geometrically ergodic or not. Typically, results are based on the tail behaviour of $\pi(x)$.
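To make the proposal in Equation (27) concrete, the following sketch combines it with the Metropolis–Hastings accept/reject step. It is a minimal illustration, not a reference implementation: the function name `mala`, the two-dimensional standard normal target and the value $\lambda = 1$ are assumptions chosen for the example. Unlike the random walk case, the proposal is not symmetric, so the density $q$ must appear in the acceptance ratio.

```python
import numpy as np

def mala(log_pi, grad_log_pi, x0, lam, n_iter, rng):
    """Metropolis-adjusted Langevin sampler using the proposal in Equation (27)."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))

    def proposal_mean(z):
        # Deterministic shift along the gradient of log pi.
        return z + 0.5 * lam**2 * grad_log_pi(z)

    def log_q(y, z):
        # log q(y | z) up to an additive constant that cancels in the ratio.
        return -np.sum((y - proposal_mean(z)) ** 2) / (2.0 * lam**2)

    chain = np.empty((n_iter, x.size))
    accepted = 0
    for i in range(n_iter):
        prop = proposal_mean(x) + lam * rng.standard_normal(x.size)
        log_alpha = (log_pi(prop) + log_q(x, prop)) - (log_pi(x) + log_q(prop, x))
        if np.log(rng.uniform()) < log_alpha:
            x = prop
            accepted += 1
        chain[i] = x
    return chain, accepted / n_iter

# Hypothetical target: two independent N(0, 1) components.
rng = np.random.default_rng(4)
chain, rate = mala(log_pi=lambda x: -0.5 * np.sum(x**2),
                   grad_log_pi=lambda x: -x,
                   x0=np.zeros(2), lam=1.0, n_iter=20_000, rng=rng)
```

For this light-tailed target the sampler behaves well; the heavy-tailed and very light-tailed cases are where the tail conditions matter.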
If the tails of $\pi(x)$ are heavier than exponential, then the method is typically not geometrically ergodic, and similarly if the tails are lighter than Gaussian. However, in the in-between case, the converse is true. We again offer two simple examples for intuition here.

Example: Take $\pi(x) \propto 1/(1 + x^2)$ as in the previous example. Then, $\nabla \log \pi(x) = -2x/(1 + x^2) \to 0$ as $|x| \to \infty$. Therefore, if $x_0$ is far away from zero, then the MALA will be approximately equal to the RWM algorithm and, so, will also dissolve into a random walk.

Example: Take $\pi(x) \propto e^{-x^4}$. Then, $\nabla \log \pi(x) = -4x^3$ and, from Equation (27), $X' \sim N(x - 2\lambda^2 x^3, \lambda^2)$. Therefore, for any fixed $\lambda$, there exists $c > 0$ such that, for $|x_0| > c$, we have $|2\lambda^2 x^3| \gg |x|$ and $|x - 2\lambda^2 x^3| \gg \lambda$, suggesting that MALA proposals will quickly spiral further and further away from any neighbourhood of zero, and hence, nearly all will be rejected.

For cases where there is a strong correlation between elements of $x$ or each element has a different marginal variance, the MALA can also be "pre-conditioned" in a similar way to the RWM, so that the covariance structure of proposals more accurately reflects that of $\pi(x)$ [32]. In this case, proposals take the form:

$$Q(x, \cdot) \equiv N\left(x + \frac{\lambda^2}{2} \Sigma \nabla \log \pi(x),\; \lambda^2 \Sigma\right), \tag{28}$$

where $\lambda$ is again a tuning parameter. It can be shown that, provided $\Sigma$ is a constant matrix, $\pi(x)$ is still the invariant distribution for the diffusion on which Equation (28) is based [33].

4. Geometric Concepts in Markov Chain Monte Carlo

Ideas from information geometry have been successfully applied to statistics from as early as [34]. More widely, other geometric ideas have also been applied, offering new insight into common problems (e.g., [35,36]). A survey is given in [37]. In this section, we suggest why some ideas from differential geometry may be beneficial for sampling methods based on Markov chains. We then review what is meant by a "diffusion on a manifold", before turning to the specific case of Equation (24).
After this, we discuss what can be learned from work in information geometry in this context.

4.1. Manifolds and Markov Chains

We often make assumptions in MCMC about the properties of the space, $\mathcal{X}$, in which our Markov chains evolve. Often $\mathcal{X} = \mathbb{R}^n$ or a simple re-parametrisation would make it so. However, here, $\mathbb{R}^n = \{(a_1, \ldots, a_n) : a_i \in (-\infty, \infty)\ \forall i\}$. The additional assumption that is often made is that $\mathbb{R}^n$ is Euclidean, an inner product space with the induced distance metric:

$$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}. \tag{29}$$

For sampling methods based on Markov chains that explore the space locally, like the RWM and MALA, it may be advantageous to instead impose a different metric structure on the space, $\mathcal{X}$, so that some points are drawn closer together and others pushed further apart. Intuitively, one can picture distances in the space being defined such that if the current position in the chain is far from an area of $\mathcal{X}$ which is "likely to occur" under $\pi(\cdot)$, then the distance to such a typical set could be reduced. Similarly, once this region is reached, the space could be "stretched" or "warped", so that it is explored as efficiently as possible.

While the idea is attractive, it is far from a constructive definition. We only have the pre-requisite that $(\mathcal{X}, d)$ must be a metric space. However, as Langevin dynamics use gradient information, we will require $(\mathcal{X}, d)$ to be a space on which we can do differential calculus. Riemannian manifolds are an appropriate choice, therefore, as the rules of differentiation are well understood for functions defined on them [38,39], while we are still free to define a more local notion of distance than Euclidean. In this section, we write $\mathbb{R}^n$ to denote the Euclidean vector space.

4.2. Preliminaries

We do not provide a full overview of Riemannian geometry here [38–40].
We simply note that for our purposes, we can consider an $n$-dimensional Riemannian manifold (henceforth, manifold) to be an $n$-dimensional metric space, in which distances are defined in a specific way. We also only consider manifolds for which a global coordinate chart exists, meaning that a mapping $r : \mathbb{R}^n \to M$ exists, which is both differentiable and invertible and for which the inverse is also differentiable (a diffeomorphism). Although this restricts the class of manifolds available (the sphere, for example, is not in this class), it is again suitable for our needs and avoids the practical challenges of switching between coordinate patches. The connection with $\mathbb{R}^n$ defined through $r$ is crucial for making sense of differentiability in $M$. We say a function $f : M \to \mathbb{R}$ is "differentiable" if $(f \circ r) : \mathbb{R}^n \to \mathbb{R}$ is [39].

As has been stated, Equation (29) can be induced via a Euclidean inner product, which we denote $\langle \cdot, \cdot \rangle$. However, it will aid intuition to think of distances in $\mathbb{R}^n$ via curves:

$$\gamma : [0, 1] \to \mathbb{R}^n. \tag{30}$$

We could think of the distance between two points $x, y \in \mathbb{R}^n$ as the minimum length among all curves that pass through $x$ and $y$. If $\gamma(0) = x$ and $\gamma(1) = y$, the length is defined as:

$$L(\gamma) = \int_0^1 \sqrt{\langle \gamma'(t), \gamma'(t) \rangle}\,dt, \tag{31}$$

giving the metric:

$$d(x, y) = \inf\{L(\gamma) : \gamma(0) = x,\ \gamma(1) = y\}. \tag{32}$$

In $\mathbb{R}^n$, the curve with a minimum length will be a straight line, so that Equation (32) agrees with Equation (29). More generally, we call a solution to Equation (32) a geodesic [38].

In a vector space, metric properties can always be induced through an inner product (which also gives a notion of orthogonality). Such a space can be thought of as "flat", since for any two points, $y$ and $z$, the straight line $ay + (1 - a)z$, $a \in [0, 1]$, is also contained in the space. In general, manifolds do not have vector space structure globally, but do so at the infinitesimal level. As such, we can think of them as "curved".
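The curve-length definition (31) is easy to check numerically in the Euclidean case. The sketch below approximates $L(\gamma)$ by summing the lengths of short straight segments along the curve, and compares a straight line with a detour between the same endpoints. The function name `curve_length`, the endpoints and the particular bent curve are assumptions made for this illustration.

```python
import numpy as np

def curve_length(gamma, n=10_001):
    """Approximate L(gamma) in (31) by a fine polygonal path through gamma(t)."""
    t = np.linspace(0.0, 1.0, n)
    pts = np.array([gamma(ti) for ti in t])
    # Sum of Euclidean lengths of consecutive segments.
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])

# Straight line from x to y, and a hypothetical detour with the same endpoints.
straight = curve_length(lambda t: (1.0 - t) * x + t * y)
bent = curve_length(lambda t: np.array([3.0 * t, 4.0 * t + np.sin(np.pi * t)]))

euclidean = float(np.sqrt(np.sum((x - y) ** 2)))  # Equation (29)
```

The straight line recovers the Euclidean distance (29), while the detour is strictly longer, which is the sense in which (32) agrees with (29) in $\mathbb{R}^n$.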
We cannot always define an inner product, but we can still define distances through (32). We define a curve on a manifold, $M$, as $\gamma_M : [0, 1] \to M$. At each point $\gamma_M(t) = p \in M$, the velocity vector, $\gamma_M'(t)$, lies in an $n$-dimensional vector space, which touches $M$ at $p$. These are known as tangent spaces, denoted $T_pM$, which can be thought of as local linear approximations to $M$. We can define an inner product on each as $g_p : T_pM \times T_pM \to \mathbb{R}$, which allows us to define a generalisation of (31) as:

$$L(\gamma_M) = \int_0^1 \sqrt{g_p(\gamma_M'(t), \gamma_M'(t))}\,dt, \tag{33}$$

and provides a means to define a distance metric on the manifold as $d(x, y) = \inf\{L(\gamma_M) : \gamma_M(0) = x,\ \gamma_M(1) = y\}$. We emphasise the difference between this distance metric on $M$ and $g_p$, which is called a Riemannian metric or metric tensor and which defines an inner product on $T_pM$.

Embeddings and Local Coordinates

So far, we have introduced manifolds as abstract objects. In fact, they can also be considered as objects that are embedded in some higher-dimensional Euclidean space. A simple example is any two-dimensional surface, such as the unit sphere, lying in $\mathbb{R}^3$. If a manifold is embedded in this way, then metric properties can be induced from the ambient Euclidean space.

We seek to make these ideas more concrete through an example, the graph of a function, $f(x_1, x_2)$, of two variables, $x_1$ and $x_2$. The resulting map, $r$, is:

$$r : \mathbb{R}^2 \to M \tag{34}$$

$$r(x_1, x_2) = (x_1, x_2, f(x_1, x_2)). \tag{35}$$

We can see that $M$ is embedded in $\mathbb{R}^3$, but that any point can be identified using only two coordinates, $x_1$ and $x_2$. In this case, each $T_pM$ is a plane, and therefore, a two-dimensional subspace of $\mathbb{R}^3$, so: (i) it inherits the Euclidean inner product, $\langle \cdot, \cdot \rangle$; and (ii) any vector, $v \in T_pM$, can be expressed as a linear combination of any two linearly independent basis vectors (a canonical choice is the partial derivatives $\partial r/\partial x_1 := r_1$ and $r_2$, evaluated at $x = r^{-1}(p) \in \mathbb{R}^2$).
The resulting inner product, $g_p(v, w)$, between two vectors, $v, w \in T_pM$, can be induced from the Euclidean inner product as:

$$\begin{aligned}
\langle v, w \rangle &= \langle v_1 r_1(x) + v_2 r_2(x),\; w_1 r_1(x) + w_2 r_2(x) \rangle, \\
&= v_1 w_1 \langle r_1(x), r_1(x) \rangle + v_1 w_2 \langle r_1(x), r_2(x) \rangle + v_2 w_1 \langle r_2(x), r_1(x) \rangle + v_2 w_2 \langle r_2(x), r_2(x) \rangle, \\
&= v^T G(x) w,
\end{aligned}$$

where:

$$G(x) = \begin{pmatrix} \langle r_1(x), r_1(x) \rangle & \langle r_1(x), r_2(x) \rangle \\ \langle r_1(x), r_2(x) \rangle & \langle r_2(x), r_2(x) \rangle \end{pmatrix} \tag{36}$$

and we use $v_i$, $w_i$ to denote the components of $v$ and $w$. To write (33) using this notation, we define the curve, $x(t) \in \mathbb{R}^2$, corresponding to $\gamma_M(t) \in M$ as $x = (r^{-1} \circ \gamma_M) : [0, 1] \to \mathbb{R}^2$. Equation (33) can then be written:

$$L(\gamma_M) = \int_0^1 \sqrt{x'(t)^T G(x(t))\, x'(t)}\,dt, \tag{37}$$

which can be used in (32) as before.

The key point is that, although we have started with an object embedded in $\mathbb{R}^3$, we can compute the Riemannian metric, $g_p(v, w)$ (and, hence, distances in $M$), using only the two-dimensional "local" coordinates $(x_1, x_2)$. We also need not have explicit knowledge of the mapping, $r$, only the components of the positive definite matrix, $G(x)$. The Nash embedding theorem [41] in essence enables us to define manifolds by the reverse process: simply choose the matrix, $G(x)$, so that we define a metric space with suitable distance properties, and some object embedded in some higher-dimensional Euclidean space will exist for which these metric properties can be induced as above. Therefore, to define our new space, we simply choose an appropriate matrix-valued map, $G(x)$ (we discuss this choice in Section 4.4). If $G(x)$ does not depend on $x$, then $M$ has a vector space structure and can be thought of as "flat". Trivially, $G(x) = I$ gives Euclidean $n$-space.

We can also define volumes on a Riemannian manifold in local coordinates. Following standard coordinate transformation rules, we can see that for the above example, the area element, $dx$, in $\mathbb{R}^2$ will change according to a Jacobian $J = |(Dr)^T(Dr)|^{1/2}$, where $Dr = \partial(p_1, p_2, p_3)/\partial(x_1, x_2)$.
This reduces to $J = |G(x)|^{1/2}$, which is also the case for more general manifolds [38]. We therefore define the Riemannian volume measure on a manifold, $M$, in local coordinates as:

$$\mathrm{Vol}_M(dx) = |G(x)|^{\frac{1}{2}}\,dx. \tag{38}$$

If $G(x) = I$, then this reduces to the Lebesgue measure.

4.3. Diffusions on Manifolds

By a "diffusion on a manifold" in local coordinates, we actually mean a diffusion defined on Euclidean space. For example, a realisation of Brownian motion on the surface, $S \subset \mathbb{R}^3$, defined in Figure 2 through $r(x_1, x_2) = (x_1, x_2, \sin(x_1) + 1)$ will be a sample path which is defined on $S$ and "looks locally" like Brownian motion in a neighbourhood of any point, $p \in S$. However, the pre-image of this sample path (through $r^{-1}$) will not be a realisation of a Brownian motion defined on $\mathbb{R}^2$, owing to the nonlinearity of the mapping. Therefore, to define "Brownian motion on $S$", we define some diffusion $(X_t)_{t \geq 0}$ that takes values in $\mathbb{R}^2$, for which the process $(r(X_t))_{t \geq 0}$ "looks locally" like a Brownian motion (and lies on $S$). See [42] for more intuition here.

Our goal, therefore, is to define a diffusion on Euclidean space which, when mapped onto a manifold through $r$, becomes the Langevin diffusion described in (24) by the above procedure. Such a diffusion takes the form:

$$dX_t = \frac{1}{2} \tilde{\nabla} \log \tilde{\pi}(X_t)\,dt + d\tilde{B}_t, \tag{39}$$

where those objects marked with a tilde must be defined appropriately. The next few paragraphs are technical, and readers aiming to simply grasp the key points may wish to skip to the end of this Subsection.

We turn first to $(\tilde{B}_t)_{t \geq 0}$, which we use to denote Brownian motion on a manifold. Intuitively, we may think of a construction based on embedded manifolds, by setting $\tilde{B}_0 = p \in M$, and for each increment sampling some random vector in the tangent space $T_pM$, and then moving along the manifold in the prescribed direction for an infinitesimal period of time before re-sampling another velocity vector from the next tangent space [42].
In fact, we can define such a construction using Stratonovich calculus and show that the infinitesimal generator can be written using only local coordinates [28]. Here, we instead take the approach of generalising the generator directly from Euclidean space to the local coordinates of a manifold, arriving at the same result. We then deduce the stochastic differential equation describing $(\tilde{B}_t)_{t \geq 0}$ in Itô form using (20).

For a standard Brownian motion on $\mathbb{R}^n$, $A = \triangle/2$, where $\triangle$ denotes the Laplace operator:

$$
\triangle f = \sum_i \frac{\partial^2 f}{\partial x_i^2} = \mathrm{div}(\nabla f). \tag{40}
$$

Substituting $A = \triangle/2$ into (20) trivially gives $b_i(x) = 0\ \forall i$, $V_{ij}(x) = \mathbb{1}_{\{i = j\}}$, as required. The Laplacian, $\triangle f(x)$, is the divergence of the gradient vector field of some function, $f \in C^2(\mathbb{R}^n)$, and its value at $x \in \mathbb{R}^n$ can be thought of as measuring how the average value of $f$ in a small neighbourhood of $x$ differs from $f(x)$ [43].

Figure 2. A two-dimensional manifold (surface) embedded in $\mathbb{R}^3$ through $r(x_1, x_2) = (x_1, x_2, \sin(x_1) + 1)$, parametrised by the local coordinates, $x_1$ and $x_2$. The distance between points A and B is given by the length of the curve $\gamma(t) = (t, t, \sin(t) + 1)$.

To define a Brownian motion on any manifold, the gradient and divergence must be generalised. We provide a full derivation in Appendix B, which shows that the gradient operator on a manifold can be written in local coordinates as $\nabla_M = G^{-1}(x)\nabla$. Combining with the operator, $\mathrm{div}_M$, we can define a generalisation of the Laplace operator, known as the Laplace–Beltrami operator (e.g., [44,45]), as:

$$
\triangle_{LB} f = \mathrm{div}_M(\nabla_M f) = |G(x)|^{-1/2} \sum_{i=1}^n \frac{\partial}{\partial x_i} \left( |G(x)|^{1/2} \sum_{j=1}^n \{G^{-1}(x)\}_{ij} \frac{\partial f}{\partial x_j} \right), \tag{41}
$$

for some $f \in C_0^2(M)$. The generator of a Brownian motion on $M$ is $\triangle_{LB}/2$ [44]. Using (20), the resulting diffusion has dynamics given by:

$$
d\tilde{B}_t = \Omega(X_t)\, dt + \sqrt{G^{-1}(X_t)}\, dB_t, \qquad \Omega_i(X_t) = \frac{1}{2} |G(X_t)|^{-1/2} \sum_{j=1}^n \frac{\partial}{\partial x_j} \left( |G(X_t)|^{1/2} \{G^{-1}(X_t)\}_{ij} \right).
$$

Those familiar with the Itô formula will not be surprised by the additional drift term, $\Omega(X_t)$.
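The drift term $\Omega$ can be made concrete with a small numerical sketch. It assumes the metric induced by the Figure 2 surface, $G(x) = \mathrm{diag}(1 + \cos^2 x_1, 1)$, and approximates the derivatives in the expression for $\Omega_i$ by central finite differences; the helper names (`bm_drift`, `bm_step`) and the step size `h` are illustrative choices, not part of the original text. For this metric a direct calculation gives $\Omega_1 = \tfrac{1}{2}\cos x_1 \sin x_1 / (1 + \cos^2 x_1)^2$ and $\Omega_2 = 0$, which the code uses as a check.

```python
import numpy as np

def metric(x):
    """G(x) for the surface r(x1, x2) = (x1, x2, sin(x1) + 1) (assumed example)."""
    return np.array([[1.0 + np.cos(x[0]) ** 2, 0.0],
                     [0.0, 1.0]])

def bm_drift(x, metric_fn, h=1e-5):
    """Omega_i(x) = 0.5 |G|^{-1/2} sum_j d/dx_j ( |G|^{1/2} {G^{-1}}_{ij} ),
    with the derivatives approximated by central finite differences."""
    x = np.asarray(x, dtype=float)
    n = x.size

    def weighted_inverse(z):
        G = metric_fn(z)
        return np.sqrt(np.linalg.det(G)) * np.linalg.inv(G)

    omega = np.zeros(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        dA_dxj = (weighted_inverse(x + e) - weighted_inverse(x - e)) / (2.0 * h)
        omega += dA_dxj[:, j]  # adds d/dx_j of entry (i, j) for every i
    return 0.5 * omega / np.sqrt(np.linalg.det(metric_fn(x)))

def bm_step(x, metric_fn, dt, rng):
    """One Euler-Maruyama step of dX = Omega(X) dt + sqrt(G^{-1}(X)) dB."""
    G_inv = np.linalg.inv(metric_fn(x))
    L = np.linalg.cholesky(G_inv)  # any matrix square root of G^{-1} gives the same law
    noise = L @ rng.standard_normal(x.size)
    return x + bm_drift(x, metric_fn) * dt + np.sqrt(dt) * noise

x0 = np.array([0.7, -0.3])
omega = bm_drift(x0, metric)

# Closed form for this particular metric, used only as a sanity check.
c, s = np.cos(x0[0]), np.sin(x0[0])
omega_closed_form = np.array([0.5 * c * s / (1.0 + c ** 2) ** 2, 0.0])
print(omega, omega_closed_form)
```

Finite differences are used here only to keep the sketch short; in practice the derivatives of $G(x)$ would more likely be supplied analytically or by automatic differentiation.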
As Itô integrals do not follow the chain rule of ordinary calculus, non-linear mappings of martingales, such as $(B_t)_{t \geq 0}$, typically result in drift terms being added to the dynamics (e.g., [27]).

To define $\tilde{\nabla}$, we simply note that this is again the gradient operator on a general manifold, so $\tilde{\nabla} = G^{-1}(x)\nabla$. For the density, $\tilde{\pi}(x)$, we note that this density will now implicitly be defined with respect to the volume measure, $|G(x)|^{1/2} dx$, on the manifold. Therefore, to ensure the diffusion (39) has the correct invariant density with respect to the Lebesgue measure, we define:

$$
\tilde{\pi}(x) = \pi(x) |G(x)|^{-1/2}. \tag{42}
$$

Putting these three elements together, Equation (39) becomes:

$$
dX_t = \frac{1}{2} G^{-1}(X_t) \nabla \log \left( \pi(X_t) |G(X_t)|^{-1/2} \right) dt + \Omega(X_t)\, dt + \sqrt{G^{-1}(X_t)}\, dB_t,
$$

which, upon simplification, becomes:

$$
dX_t = \frac{1}{2} G^{-1}(X_t) \nabla \log \pi(X_t)\, dt + \Lambda(X_t)\, dt + \sqrt{G^{-1}(X_t)}\, dB_t, \tag{43}
$$

$$
\Lambda_i(X_t) = \frac{1}{2} \sum_j \frac{\partial}{\partial x_j} \{G^{-1}(X_t)\}_{ij}. \tag{44}
$$

It can be shown that this diffusion has invariant Lebesgue density $\pi(x)$, as required [33]. Intuitively, when a set is mapped onto the manifold, distances are changed by a factor, $\sqrt{G(x)}$. Therefore, to end up with the initial distances, they must first be changed by a factor of $\sqrt{G^{-1}(x)}$ before the mapping, which explains the volatility term in Equation (43).

The resulting Metropolis–Hastings proposal kernel for this “MALA on a manifold” was clarified in [33] and is given by:

$$
Q(x, \cdot) \equiv N\left( x + \frac{\lambda^2}{2} G^{-1}(x) \nabla \log \pi(x) + \lambda^2 \Lambda(x),\ \lambda^2 G^{-1}(x) \right), \tag{45}
$$

where $\lambda^2$ is a tuning parameter. The nonlinear drift term here is slightly different to that reported in [1,32], for reasons discussed in [33].

4.4. Choosing a Metric

We now turn to the question of which metric structure to put on the manifold, or equivalently, how to choose $G(x)$. In this section, we sometimes switch notation slightly, denoting the target density, $\pi(x|y)$, as some of the discussion is directed towards Bayesian inference, where $\pi(\cdot)$ is the posterior distribution for some parameter, $x$, after observing some data, $y$.
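As a concrete reference point for the discussion of metrics that follows, the proposal kernel (45) can be sketched in code. This is an illustrative sketch rather than a definitive implementation: the function names are hypothetical, the user is assumed to supply `log_pi`, `grad_log_pi` and a function `metric_fn` returning $G(x)$, the argument `step` plays the role of $\lambda$, and $\Lambda(x)$ from (44) is approximated by central finite differences purely for brevity. Because the proposal covariance depends on the current position, the acceptance step uses the full Metropolis–Hastings ratio.

```python
import numpy as np

def lambda_drift(x, metric_fn, h=1e-5):
    """Lambda_i(x) = 0.5 * sum_j d/dx_j {G^{-1}(x)}_{ij}, equation (44),
    approximated with central finite differences."""
    x = np.asarray(x, dtype=float)
    n = x.size
    total = np.zeros(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        d_ginv = (np.linalg.inv(metric_fn(x + e)) - np.linalg.inv(metric_fn(x - e))) / (2.0 * h)
        total += d_ginv[:, j]
    return 0.5 * total

def mmala_proposal_params(x, grad_log_pi, metric_fn, step):
    """Mean and covariance of the Gaussian proposal (45)."""
    g_inv = np.linalg.inv(metric_fn(x))
    mean = (x + 0.5 * step ** 2 * (g_inv @ grad_log_pi(x))
            + step ** 2 * lambda_drift(x, metric_fn))
    cov = step ** 2 * g_inv
    return mean, cov

def log_gaussian_density(y, mean, cov):
    """Log density of N(mean, cov) evaluated at y."""
    diff = y - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (quad + logdet + diff.size * np.log(2.0 * np.pi))

def mmala_step(x, log_pi, grad_log_pi, metric_fn, step, rng):
    """One Metropolis-Hastings step using proposal (45)."""
    mean_x, cov_x = mmala_proposal_params(x, grad_log_pi, metric_fn, step)
    y = rng.multivariate_normal(mean_x, cov_x)
    mean_y, cov_y = mmala_proposal_params(y, grad_log_pi, metric_fn, step)
    log_alpha = (log_pi(y) + log_gaussian_density(x, mean_y, cov_y)
                 - log_pi(x) - log_gaussian_density(y, mean_x, cov_x))
    return y if np.log(rng.uniform()) < log_alpha else x
```

With a constant `metric_fn`, the finite-difference term vanishes and the proposal reduces to a pre-conditioned MALA proposal, which is a convenient way to sanity-check an implementation.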
The problem statement is: what is an appropriate choice of distance between points in the sample space of a given probability distribution? A related (but distinct) question is how to define a distance between two probability distributions from the same parametric family, but with different parameters. This has been a key theme in information geometry, explored by Rao [46] and others [2] for many years. Although generic measures of distance between distributions (such as total variation) are often appropriate, one can deduce from information-theoretic principles that for a given parametric family, $\{p_x(y) : x \in \mathcal{X}\}$, it is in some sense natural to consider this “space of distributions” to be a manifold, where the Fisher information is the matrix, $G(x)$ (with the $\alpha = 0$ connection employed; see [2] for details).

Because of this, Girolami and Calderhead [1] proposed a variant of the Fisher metric for geometric Markov chain Monte Carlo, as:

$$
G(x) = E_{y|x}\left[ -\frac{\partial^2}{\partial x_i \partial x_j} \log f(y|x) \right] - \frac{\partial^2}{\partial x_i \partial x_j} \log \pi_0(x), \tag{46}
$$

where $\pi(x|y) \propto f(y|x)\pi_0(x)$ is the target density, $f$ denotes the likelihood and $\pi_0$ the prior. The metric is tailored to Bayesian problems, which are a common use for MCMC, so the Fisher information is combined with the negative Hessian of the log-prior. One can also view this metric as the expected negative Hessian of the log target, since this naturally reduces to (46).

The motivation for a Hessian-style metric can also be understood from studying MCMC proposals. From (45) and by the same logic as for general pre-conditioning methods [32], the objective is to choose $G^{-1}(x)$ to match the covariance structure of $\pi(x|y)$ locally. If the target density were Gaussian with covariance matrix, $\Sigma$, then:

$$
-\frac{\partial^2}{\partial x_i \partial x_j} \log \pi(x|y) = \{\Sigma^{-1}\}_{ij}, \tag{47}
$$

so that the inverse of the negative Hessian recovers $\Sigma$. In the non-Gaussian case, the negative Hessian is no longer constant, but we can imagine that it matches the correlation structure of $\pi(x|y)$ locally at least.
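The Gaussian case can be checked numerically. The sketch below, which assumes an arbitrary illustrative covariance matrix `Sigma` and a hypothetical helper `neg_hessian`, approximates the negative Hessian of a zero-mean Gaussian log density by central finite differences and compares its inverse with the covariance matrix.

```python
import numpy as np

def neg_hessian(log_density, x, h=1e-4):
    """Central finite-difference approximation to -d^2 log_density / dx_i dx_j."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            H[i, j] = (log_density(x + ei + ej) - log_density(x + ei - ej)
                       - log_density(x - ei + ej) + log_density(x - ei - ej)) / (4.0 * h * h)
    return -H

# Illustrative covariance matrix (an assumption made for this example only).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def log_gauss(x):
    """Zero-mean Gaussian log density, up to an additive constant."""
    return -0.5 * x @ Sigma_inv @ x

G = neg_hessian(log_gauss, np.array([0.3, -1.2]))
print(np.linalg.inv(G))  # should be close to Sigma, wherever the Hessian is evaluated
```

For a Gaussian target the result does not depend on the evaluation point, which is exactly the sense in which a constant pre-conditioning matrix suffices in that case.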
The idea of using the negative Hessian as a local proxy for the inverse covariance has been discussed in the geostatistics literature previously [47]. One problem with simply using (47) to define a metric is that unless $\pi(x|y)$ is log-concave, the negative Hessian will not be globally positive-definite, although Petra et al. [48] conjecture that it may be appropriate for use in some realistic scenarios and suggest some computationally efficient approximation procedures.

Example: Take $\pi(x) \propto 1/(1 + x^2)$, and set $G(x) = -\partial^2 \log \pi(x)/\partial x^2$. Then, $G^{-1}(x) = (1 + x^2)^2/(2 - 2x^2)$, which is negative if $x^2 > 1$, so unusable as a proposal variance.

Girolami and Calderhead [1] use the Fisher metric in part to counteract this problem. Taking expectations over the data ensures that the likelihood contribution to $G(x)$ in (46) will be positive (semi-)definite globally (e.g., [49]); so, provided a log-concave prior is chosen, (46) should be a suitable choice for $G(x)$. Indeed, Girolami and Calderhead [1] provide several examples in which geometric MCMC methods using this Fisher metric perform better than their “non-geometric” counterparts.

Betancourt [50] also starts from the viewpoint that the Hessian (47) is an appropriate choice for $G(x)$ and defines a mapping from the set of $n \times n$ matrices to the set of positive-definite $n \times n$ matrices by taking a “smooth” absolute value of the eigenvalues of the Hessian. This is done in a way such that derivatives of $G(x)$ are still computable, which motivates the name SoftAbs metric. For a fixed value of $x$, the negative Hessian, $H(x)$, is first computed and then decomposed into $U^T D U$, where $D$ is the diagonal matrix of eigenvalues. Each diagonal element of $D$ is then altered by the mapping $t_\alpha : \mathbb{R} \to \mathbb{R}$, given by:

$$
t_\alpha(\lambda_i) = \lambda_i \coth(\alpha \lambda_i), \tag{48}
$$

where $\alpha$ is a tuning parameter (typically chosen to be as large as possible for which eigenvalues remain non-zero numerically). The function, $t_\alpha$, acts as an absolute value function, but also uplifts eigenvalues that are close to zero to $\approx 1/\alpha$.
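The eigenvalue mapping (48) can be sketched as follows. This is a minimal illustration, not the reference implementation of [50]: the function name `softabs_metric`, the default value of `alpha` and the threshold used to detect numerically zero eigenvalues are all assumptions made for this example. Note that `numpy.linalg.eigh` returns the decomposition in the form $H = V \,\mathrm{diag}(\lambda)\, V^T$, so $V$ corresponds to $U^T$ in the notation above.

```python
import numpy as np

def softabs_metric(H, alpha=1e6):
    """Map a symmetric matrix H to a positive-definite one by applying
    t_alpha(lam) = lam * coth(alpha * lam), equation (48), to each eigenvalue."""
    eigvals, V = np.linalg.eigh(H)           # H = V diag(eigvals) V^T
    scaled = alpha * eigvals
    small = np.abs(scaled) < 1e-8            # eigenvalues treated as numerically zero
    safe = np.where(small, 1.0, scaled)      # placeholder value avoids 0/0 below
    # lam * coth(alpha * lam) = lam / tanh(alpha * lam); its limit as lam -> 0 is 1 / alpha.
    soft = np.where(small, 1.0 / alpha, (safe / alpha) / np.tanh(safe))
    return (V * soft) @ V.T                  # V diag(soft) V^T

# An indefinite symmetric matrix with eigenvalues -1 and 3.
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])
G = softabs_metric(H)
print(np.linalg.eigvalsh(G))
```

For eigenvalues that are large relative to $1/\alpha$ the mapping behaves like an absolute value, so in this example the eigenvalue $-1$ is reflected to approximately $1$ while the eigenvalue $3$ is left essentially unchanged.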
It should be noted that while the Fisher metric is only defined for models in which a likelihood is present and for which the expectation is tractable, the SoftAbs metric can be found for any target distribution, $\pi(\cdot)$.

Many authors (e.g., [1,48]) have noted that for many problems, the terms involving derivatives of $G(x)$ are often small, and so, it is not always worth the computational effort of evaluating them. Girolami and Calderhead [1] propose the simplified manifold MALA, in which proposals are of the form:

$$
Q(x, \cdot) \equiv N\left( x + \frac{\lambda^2}{2} G^{-1}(x) \nabla \log \pi(x),\ \lambda^2 G^{-1}(x) \right). \tag{49}
$$

Using this method means derivatives of $G(x)$ are no longer needed, so more pragmatic ways of regularising the Hessian are possible. One simple approach would be to take the absolute values of each eigenvalue, giving $G(x) = U^T |D| U$, where $H(x) = U^T D U$ is the negative Hessian and $|D|$ is a diagonal matrix with $\{|D|\}_{ii} = |\lambda_i|$ (this approach may fall into difficulties if eigenvalues are numerically zero). Another would be choosing $G(x)$ as the “nearest” positive-definite matrix to the negative Hessian, according to some distance metric on the set of $n \times n$ matrices. The problem has, in fact, been well-studied in mathematical finance, in the context of finding correlations using incomplete data sets [51], and tackled using distances induced by the Frobenius norm. Approximate solution algorithms are discussed in Higham [51]. It is not clear to us at present whether small changes to the Hessian would result in large changes to the corresponding positive definite matrix under a given distance or, indeed, whether given a distance metric on the space of matrices, there is always a well-defined unique “nearest” positive definite matrix.

Below, we provide two simple examples showing how a “Hessian-style metric” can alleviate some of the difficulties associated with both heavy and light-tailed target densities.

Example: Take $\pi(x) \propto 1/(1 + x^2)$, and set $G(x) = |-\partial^2 \log \pi(x)/\partial x^2|$.
Then, $G^{-1}(x) \nabla \log \pi(x) = -x(1 + x^2)/|1 - x^2|$, which no longer tends to zero as $|x| \to \infty$, suggesting a manifold variant of MALA with a Hessian-style metric may avoid some of the pitfalls of the standard algorithm. Note that the drift may become very large if $|x| \approx 1$, but since the event $|x| = 1$ occurs with probability zero, we do not see it as a major cause for concern.

Example: Take $\pi(x) \propto e^{-x^4}$, and set $G(x) = |-\partial^2 \log \pi(x)/\partial x^2|$. Then, $G^{-1}(x) \nabla \log \pi(x) = -x/3$, which is $O(x)$, so alleviating the problem of spiralling proposals for light-tailed targets demonstrated by MALA in an earlier example.

Other choices for $G(x)$ have been proposed, which are not based on the Hessian. These have the advantage that gradients need not be computed (either analytically or using computational methods). Sejdinovic et al. [52] propose a Metropolis–Hastings method, which can be viewed as a geometric variant of the RWM, where the choice for $G(x)$ is based on mapping samples to an appropriate feature space, and performing principal component analysis on the resulting features to choose a local covariance structure for proposals. If we consider the RWM with Gaussian proposals to be an Euler–Maruyama discretisation of Brownian motion on a manifold, then proposals will take the form $Q(x, \cdot) \equiv N(x + \lambda^2 \Omega(x), \lambda^2 G^{-1}(x))$. If we assume (like in the simplified manifold MALA) that $\Omega(x) \approx 0$, then we have proposals centred at the current point in the Markov chain with a local covariance structure (the full Hastings acceptance rate must now be used as $q(x'|x) \neq q(x|x')$ in general). As no gradient information is needed, the Sejdinovic et al. metric can be used in conjunction with the pseudo-marginal MCMC algorithm, so that $\pi(x|y)$ need not be known exactly. Examples from the article demonstrate the power of the approach [52].

An important property of any Riemannian metric is how it transforms under coordinate change (e.g., [2]).
The Fisher information metric commonly studied in information geometry is an example of a “coordinate invariant” choice for $G(x)$. If we consider two parametrisations for a statistical model given by $x$ and $z = t(x)$, computing the Fisher information under $x$ and then transforming this matrix using the Jacobian for the mapping, $t$, will give the same result as computing the Fisher information under $z$. It should be noted that because of either the prior contribution in (46) or the nonlinear transformations applied in other cases, none of the metrics we have reviewed here have this property, which means that we have no principled way of understanding how $G(x)$ will relate to $G(z)$. It is intuitive, however, that using information from all of $\pi(x)$, rather than only the likelihood contribution, $f(y|x)$, would seem sensible when trying to sample from $\pi(\cdot)$.

5. Survey of Applications

Rather than conduct our own simulation study, we instead highlight some cases in the literature where geometric MCMC methods have been used with success.

Martin et al. [53] consider Bayesian inference for a statistical inverse problem, in which a surface explosion causes seismic waves to travel down into the ground (the subsurface medium). Often, the properties of the subsurface vary with distance from ground level or because of obstacles in the medium, in which case, a fraction of the waves will scatter off these boundaries and be reflected back up to ground level at later times. The observations here are the initial explosion and the waves, which return to the surface, together with return times. The challenge is to infer the properties of the subsurface medium from this data. The authors construct a likelihood based on the wave equation for the data and perform Bayesian inference using a variant of the manifold MALA.
Figures are provided showing the local correlations present in the posterior and, therefore, highlighting the need for an algorithm that can navigate the high density region efficiently. Several methods are compared in the paper, but the variant of MALA that incorporates a local correlation structure is shown to be the most efficient, particularly as the dimension of the problem increases [53].

Calderhead and Girolami [54] dealt with two models for biological phenomena based on nonlinear dynamical systems. A model of circadian control in the Arabidopsis thaliana plant comprised a system of six nonlinear differential equations, with twenty-two parameters to be inferred. Another model for cell signalling consisted of a system of six nonlinear differential equations with eight parameters, with inference complicated by the fact that observations of the model are not recorded directly [54]. The resulting inference was performed using RWM, MALA and geometric methods, with the results highlighting the benefits of taking the latter approach. The simplified variant of MALA on a manifold is reported to have produced the most efficient inferences overall, in terms of effective sample size per unit of computational time.

Stathopoulos and Girolami [55] considered the problem of inferring parameters in Markov jump processes. In the paper, a linear noise approximation is shown, which can make inference in such models more straightforward, enabling an approximate likelihood to be computed. Models based on chemical reaction dynamics are considered; one such from chemical kinetics contained four unknown parameters; another from gene expression consisted of seven. Inference was performed using the RWM, the simplified manifold MALA and Hamiltonian methods, with the MALA reported as most efficient according to the chosen diagnostics.
The authors note that the simplified manifold method is both conceptually simple and able to account for local correlations, making it an attractive choice for inference [55].

Konukoglu et al. [56] designed a method for personalising a generic model for a physiological process to a specific patient, using clinical data. The personalisation took the form of patient-specific parameter inference. The authors highlight some of the difficulties of this task in general, including the complexity of the models and the relative sparsity of the datasets, which often result in a parameter identifiability issue [56]. The example discussed in the paper is the Eikonal-diffusion model describing electrical activity in cardiac tissue, which results in a likelihood for the data based on a nonlinear partial differential equation, combined with observation noise [56]. A method for inference was developed by first approximating the likelihood using a spectral representation and then using geometric MCMC methods on the resulting approximate posterior. The method was first evaluated on synthetic data and then repeated on clinical data taken from a study for ventricular tachycardia radio-frequency ablation [56].

6. Discussion

The geometric viewpoint is not necessary to understand manifold variants of the MALA. Indeed, several authors [32,33] have discussed these algorithms without considering them to be “geometric”, rather simply Metropolis–Hastings methods in which proposal kernels have a position-dependent covariance structure. We do not claim that the geometric view is the only one that should be taken. Our goal is merely to point out that such position-dependent methods can often be viewed as methods defined on a manifold and that studying the structure of the manifold itself may lead to new insights on the methods.
For example, taking the geometric viewpoint and noting the connection with information geometry enabled Girolami and Calderhead to adopt the Fisher metric for calculations [1]. We list here a few open questions on which the geometric viewpoint may help shed some insight.

Computationally-minded readers will have noted that using position-dependent covariance matrices adds a significant computational overhead in practice, with additional $O(n^3)$ matrix inversions required at each step of the corresponding Metropolis–Hastings algorithms. Clearly, there will be many problems for which the matrix, $G(x)$, does not change very much, and therefore, choosing a constant covariance $G^{-1}(x) = \Sigma$ may result in a more efficient algorithm overall. Geometrically, this would correspond to a manifold with scalar curvature close to zero everywhere. It may be that geometric ideas could be used to understand whether the manifold is flat enough that a constant choice of $G(x)$ is sufficient. To make sense of this truly would require a relationship between curvature, an inherently local property, and more global statements about the manifold. Many results in differential geometry, beginning with the celebrated Gauss–Bonnet theorem, have previously related global and local properties in this way [57]. It is unknown to the authors whether results exist relating the curvature of a manifold to some global property, but this is an interesting avenue for further research. A related question is when to choose the simplified manifold MALA over the full method. Problems in which the term, $\|\Lambda(x)\|$, is sufficiently large to warrant calculation correspond to those for which the manifold has very high curvature in many places; so again, making some global statement related to curvature could help here.
Although there is a reasonably intuitive argument for why the Hessian is an appropriate starting point for $G(x)$, the lack of positive-definiteness may be seen as a cause for concern by some. After all, it could be argued that if the curvature is not positive-definite in a region, then how can it be a reasonable approximation to the local covariance structure? Many statistical models used to describe natural phenomena are characterised by distributions with heavy tails or multiple modes, for which this is the case. In addition, for target densities of the form $\pi(x) \propto e^{-|x|}$, the Hessian is everywhere equal to zero! The attempts to force positive-definiteness we have described will typically result in small moves being proposed in such regions of the sample space, which may not be an optimal strategy. Much work in information geometry has centred on the geometry of Hessian structures [58], and some insights from this field may help to better understand the question of choosing an appropriate metric. In addition, the field of pseudo-Riemannian geometry deals with forms of $G(x)$, which need not be positive-definite [39]; so again, understanding could be gained from here.

Some recent work in high-dimensional inference has centred on defining MCMC methods for which efficiency scales $O(1)$ with respect to the dimension, $n$, of $\pi(\cdot)$ [19,59]. In the case where $X$ takes values in some infinite-dimensional function space, this can be done provided a Gaussian prior measure is defined for $X$. A notable feature of infinite-dimensional probability spaces is that two different probability measures defined over the same space have a striking tendency to have disjoint supports [60]. The key challenge for MCMC is to define transition kernels for which proposed moves are inside the support for $\pi(\cdot)$.
A straightforward approach is to define proposals for which the prior is invariant, since the likelihood contribution to the posterior typically will not alter its support from that of the prior [19]. However, the posterior may still look very different from the prior, as noted in [61], so this proposal mechanism, though $O(1)$, can still result in slow exploration. Understanding the geometry of the support and defining methods that incorporate the likelihood term, but also respect this geometry, so as to ensure proposals remain in the support of $\pi(\cdot)$, is an intriguing research proposition.

The methods reviewed in this paper are based on first order Langevin diffusions. Algorithms have also been developed that are based on second order Langevin diffusions, in which a stochastic differential equation governs the behaviour of the velocity of a process [62,63]. A natural extension to the work of Girolami and Calderhead [1] and Xifara et al. [33] would be to map such diffusions onto a manifold and derive Metropolis–Hastings proposal kernels based on the resulting dynamics. The resulting scheme would be a generalisation of [63], though the most appropriate discretisation scheme for a second order process to facilitate sampling is unclear and perhaps a question worthy of further exploration.

We have focused primarily here on the sample space $\mathcal{X} = \mathbb{R}^n$ and on defining an appropriate manifold on which to construct Markov chains. In some inference problems, however, the sample space is a pre-defined manifold, for example the set of $n \times n$ rotation matrices, commonly found in the field of directional statistics [64]. Such manifolds are often not globally mappable to Euclidean $n$-space. Methods have been devised for sampling from such spaces [65,66]. In order to use the methods described here for such problems, an appropriate approach for switching between coordinate patches at the relevant time would need to be devised, which could be an interesting area of further study.
Alongside these geometric problems, we can also discuss geometric MCMC methods from a statistical perspective. The last example given in the previous section hinted that the manifold MALA may cope better with target distributions with heavy tails. In fact, Latuszynski et al. [67] have shown that, in one dimension, the manifold MALA is geometrically ergodic for a class of targets of the form $\pi(x) \propto \exp(-|x|^\beta)$ for any choice of $\beta \neq 1$. This incorporates cases where tails are heavier than exponential and lighter than Gaussian, two scenarios under which geometric ergodicity fails for the MALA.

Finding optimal acceptance rates and scaling of $\lambda$ with dimension are two other related challenges. In this case, the picture is more complex. Traditional results have been shown for Metropolis–Hastings methods in the case where the components of the target distribution are independent and identically distributed, or where there is some other suitable symmetry and regularity in the shape of $\pi(\cdot)$. Manifold methods are, however, specifically tailored to scenarios in which this is not the case, scenarios in which there is a high correlation between components of $x$, which changes depending on the value of $x$. It is less clear how to proceed with finding relevant results that can serve as guidelines to practitioners here. Indeed, Sherlock [18] notes that a requirement for optimal acceptance rate results for the RWM to be appropriate is that the curvature of $\pi(x)$ does not change too much, yet this is the very scenario in which we would want to use a manifold method.

Acknowledgments: We thank the two reviewers for helpful comments and suggestions. Samuel Livingstone is funded by a PhD Scholarship from Xerox Research Centre Europe. Mark Girolami is funded by an Engineering and Physical Sciences Research Council Established Career Research Fellowship, EP/J016934/1, and a Royal Society Wolfson Research Merit Award.
Author Contributions: The article was written by Samuel Livingstone under the guidance of Mark Girolami. All authors have read and approved the final manuscript.

Appendix A. Total Variation Distance

We show how to obtain (10) from (9). Denoting two probability distributions, $\mu(\cdot)$ and $\nu(\cdot)$, and associated densities, $\mu(x)$ and $\nu(x)$, we have:

$$
\|\mu(\cdot) - \nu(\cdot)\|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|.
$$

Define the set $B = \{x \in \mathcal{X} : \mu(x) > \nu(x)\}$. To see that $B \in \mathcal{B}$, note that $B = \cup_{q \in \mathbb{Q}} \left( \{x \in \mathcal{X} : \mu(x) > q\} \cap \{x \in \mathcal{X} : \nu(x) < q\} \right)$, and the result follows from properties of $\mathcal{B}$ (e.g., [68]). Now, for any $A \in \mathcal{B}$:

$$
\mu(A) - \nu(A) \leq \mu(A \cap B) - \nu(A \cap B) \leq \mu(B) - \nu(B),
$$

and similarly:

$$
\nu(A) - \mu(A) \leq \nu(B^c) - \mu(B^c),
$$

so, the supremum will be attained either at $B$ or $B^c$. However, since $\mu(\mathcal{X}) = \nu(\mathcal{X}) = 1$, then:

$$
[\mu(B) - \nu(B)] - [\nu(B^c) - \mu(B^c)] = 0,
$$

so that $|\mu(B) - \nu(B)| = |\mu(B^c) - \nu(B^c)|$. Using these facts gives an alternative characterisation of the total variation distance as:

$$
\|\mu(\cdot) - \nu(\cdot)\|_{TV} = \frac{1}{2} \left( |\mu(B) - \nu(B)| + |\mu(B^c) - \nu(B^c)| \right) = \frac{1}{2} \int_{\mathcal{X}} |\mu(x) - \nu(x)|\, dx
$$

as required.

Appendix B. Gradient and Divergence Operators on a Riemannian Manifold

The gradient of a function on $\mathbb{R}^n$ is the unique vector field, such that, for any unit vector, $u$:

$$
\langle \nabla f(x), u \rangle = D_u[f(x)] = \lim_{h \to 0} \left( \frac{f(x + hu) - f(x)}{h} \right), \tag{A1}
$$

the directional derivative of $f$ along $u$ at $x \in \mathbb{R}^n$. On a manifold, the gradient operator, $\nabla_M$, can still be defined, such that the inner product $g_p(\nabla_M f(x), u) = D_u[f(x)]$. Setting $\nabla_M = G(x)^{-1}\nabla$ gives:

$$
g_p(\nabla_M f(x), u) = (G^{-1}(x) \nabla f(x))^T G(x) u = \langle \nabla f(x), u \rangle,
$$

which is equal to the directional derivative along $u$ as required.

The divergence of some vector field, $v$, at a point, $x \in \mathbb{R}^n$, is the net outward flow generated by $v$ through some small neighbourhood of $x$. Mathematically, the divergence of $v(x) \in \mathbb{R}^3$ is given by $\sum_i \partial v_i/\partial x_i$. On a more general manifold, the divergence is also a sum of derivatives, but here, they are covariant derivatives.
A short introduction is provided in Appendix C. Here, we simply state that the covariant derivative of a vector field, $v$, at a point $p \in M$ is the orthogonal projection of the directional derivative onto the tangent space, $T_pM$. Intuitively, a vector field on a manifold is a field of vectors, each of which lies in the tangent space to a point, $p \in M$. It only makes sense therefore to discuss how vector fields change along the manifold or in the direction of vectors, which also lie in the tangent space. Although the idea seems simple, the covariant derivative has some attractive geometric properties; notably, it can be completely written in local coordinates, and so does not depend on knowledge of an embedding in some ambient space.

The divergence of a vector field, $v$, defined on a manifold, $M$, at the point, $p \in M$, is defined as:

$$
\mathrm{div}_M(v) = \sum_{i=1}^n D^c_{e_i}[v_i],
$$

where $e_i$ denotes the $i$-th basis vector for the tangent space, $T_pM$, at $p \in M$, and $v_i$ denotes the $i$-th coefficient. This can be written in local coordinates (see Appendix C) as:

$$
\mathrm{div}_M(v) = |G(x)|^{-1/2} \sum_{i=1}^n \frac{\partial}{\partial x_i} \left( |G(x)|^{1/2} v_i \right),
$$

and can be combined with $\nabla_M$ to form the Laplace–Beltrami operator (41).

Appendix C. Vector Fields and the Covariant Derivative

Here, we provide a short introduction to vector fields and differentiation on a smooth manifold; see [38,39]. The following geometric notation is used here: (i) vector components are indexed with a superscript, e.g., $v = (v^1, ..., v^n)$; and (ii) repeated subscripts and superscripts are summed over, e.g., $v^i e_i = \sum_i v^i e_i$ (known as the Einstein summation convention).

For any smooth manifold, $M$, the set of all tangent vectors to points on $M$ is known as the tangent bundle and denoted $TM$. A $C^r$ vector field defined on $M$ is a mapping that assigns to each point, $p \in M$, a tangent vector, $v(p) \in T_pM$. In addition, the components of $v(p)$ in any basis for $T_pM$ must also be $C^r$ [38]. We will denote the set of all vector fields on $M$ as $\Gamma(TM)$.
For some vector field, $v \in \Gamma(TM)$, at any point, $p \in M$, the vector, $v(p) \in T_pM$, can be written as a linear combination of some $n$ basis vectors $\{e_1, ..., e_n\}$ as $v = v^i e_i$. To understand how $v$ will change in a particular direction along $M$, it only makes sense, therefore, to consider derivatives along vectors in $T_pM$. Two other things must be considered when defining a derivative along a manifold: (i) how the components, $v^i$, along each basis vector will change; and (ii) how each basis vector, $e_i$, itself will change. For the usual directional derivative on $\mathbb{R}^n$, the basis vectors do not change, as the tangent space is the same at each point, but for a more general manifold, this is no longer the case: the $e_i$'s are referred to as a “local” basis for each $T_pM$.

The covariant derivative, $D^c$, is defined so as to account for these shortcomings. When considering differentiation along a vector, $u^* \notin T_pM$, $u^*$ is simply projected onto the tangent space. The derivative with respect to any $u \in T_pM$ can now be decomposed into a linear combination of derivatives of basis vectors and vector components:

$$
D^c_u[v] = D^c_{u^i e_i}[v^j e_j], \tag{A2}
$$

where the argument, $p$, has been dropped, but is implied for both components and local basis vectors. The operator, $D^c_u[v]$, is defined to be linear in both $u$ and $v$ and to satisfy the product rule [38]; so, Equation (A2) can be decomposed into:

$$
D^c_u[v] = u^i \left( D^c_{e_i}[v^j]\, e_j + v^j D^c_{e_i}[e_j] \right). \tag{A3}
$$

The operator, $D^c$, need, therefore, only be defined along the direction of basis vectors $e_i$ and for vector component $v^i$ and basis vector $e_i$ arguments. For components $v^i$, $D^c_{e_j}[v^i]$ is defined as simply the partial derivative $\partial_j v^i := \partial v^i/\partial x_j$. The directional derivative of some basis vector $e_i$ along some $e_j$ is best understood through the example of a regular surface $\Sigma \subset \mathbb{R}^3$. Here, $D_{e_j}[e_i]$ will be a vector, $w \in \mathbb{R}^3$.
Taking the basis for this space at the point, $p$, as $\{e_1, e_2, \hat{n}\}$, where $\hat{n}$ denotes the unit normal to $T_p\Sigma$, we can write $w = \alpha e_1 + \beta e_2 + \kappa \hat{n}$. The covariant derivative, $D^c_{e_j}[e_i]$, is simply the projection of $w$ onto $T_p\Sigma$, given by $w^* = \alpha e_1 + \beta e_2$. More generally, at some point, $p$, in a smooth manifold, $M$, the covariant derivative $D^c_{e_j}[e_i] = \Gamma^k_{ji} e_k$ (with upper and lower indices summed over). The coefficients, $\Gamma^k_{ji}$, are known as the Christoffel symbols: $\Gamma^k_{ji}$ denotes the coefficient of the $k$-th basis vector when taking the derivative of the $i$-th with respect to the $j$-th. If a Riemannian metric, $g$, is chosen for $M$, then they can be expressed completely as a function of $g$ (or in local coordinates as a function of the matrix, $G$). Using these definitions, Equation (A3) can be re-written as:

$$
D^c_u[v] = u^i \left( \partial_i v^k + v^j \Gamma^k_{ij} \right) e_k. \tag{A4}
$$

The divergence of a vector field, $v \in \Gamma(TM)$, at the point, $p \in M$, is given by:

$$
\mathrm{div}_M(v) = D^c_{e_i}[v^i], \tag{A5}
$$

where, again, repeated indices are summed over. If $M = \mathbb{R}^n$, this reduces to the usual sum of partial derivatives, $\partial_i v^i$. On a more general manifold, $M$, the equivalent expression is:

$$
D^c_{e_i}[v^i] = \partial_i v^i + v^i \Gamma^j_{ij}, \tag{A6}
$$

where, again, repeated indices are summed. As has been previously stated, if a metric, $g$, and coordinate chart is chosen for $M$, the Christoffel symbols can be written in terms of the matrix, $G(x)$. In this case [69]:

$$
\Gamma^j_{ij} = |G(x)|^{-1/2}\, \partial_i \left( |G(x)|^{1/2} \right), \tag{A7}
$$

so Equation (A6) becomes:

$$
D^c_{e_i}[v^i] = |G(x)|^{-1/2}\, \partial_i \left( |G(x)|^{1/2} v^i \right), \tag{A8}
$$

where $v = v(x)$.

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B 2011, 73, 123–214. 2. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2007; Volume 191. 3. Marriott, P.; Salmon, M.
Applications of Differential Geometry to Econometrics; Cambridge University Press: Cambridge, UK, 2000. 4. Betancourt, M.; Girolami, M. Hamiltonian Monte Carlo for Hierarchical Models. 2013, arXiv: 1312.0906. 5. Neal, R. MCMC using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo; Chapman and Hall/CRC: Boca Raton, FL, USA, 2011; pp. 113–162. 6. Betancourt, M.; Stein, L.C. The Geometry of Hamiltonian Monte Carlo. 2011, arXiv: 1112.4118. 7. Robert, C.P.; Casella, G. Monte Carlo Statistical Methods; Springer: New York, NY, USA, 2004; Volume 319. 8. Tierney, L. Markov chains for exploring posterior distributions. Ann. Stat. 1994, 22, 1701–1728. 9. Kipnis, C.; Varadhan, S. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Commun. Math. Phys. 1986, 104, 1–19. 10. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2012. 11. Plummer, M.; Best, N.; Cowles, K.; Vines, K. CODA: Convergence diagnosis and output analysis for MCMC. R. News 2006, 6, 7–11. 12. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. 13. Jones, G.L.; Hobert, J.P. Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Stat. Sci. 2001, 16, 312–334. 14. Jones, G.L. On the Markov chain central limit theorem. Probab. Surv. 2004, 1, 299–320. 15. Gelman, A.; Rubin, D.B. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992, 7, 457–472. 16. Sherlock, C.; Fearnhead, P.; Roberts, G.O. The random walk Metropolis: Linking theory and practice through a case study. Stat. Sci. 2010, 25, 172–190. 17. Sherlock, C.; Roberts, G. Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets. Bernoulli 2009, 15, 774–798. 18. Sherlock, C. Optimal scaling of the random walk Metropolis: General criteria for the 0.234 acceptance rule. J. 
Appl. Probab. 2013, 50, 1–15. 19. Beskos, A.; Kalogeropoulos, K.; Pazos, E. Advanced MCMC methods for sampling on diffusion pathspace. Stoch. Processes Appl. 2013, 123, 1415–1453. 20. Roberts, G.O.; Rosenthal, J.S. Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 2001, 16, 351–367. 21. Roberts, G.O.; Tweedie, R.L. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 1996, 83, 95–110. 22. Mengersen, K.L.; Tweedie, R.L. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat. 1996, 24, 101–121. 23. Jarner, S.F.; Hansen, E. Geometric ergodicity of Metropolis algorithms. Stoch. Processes Appl. 2000, 85, 341–361. 24. Christensen, O.F.; Møller, J.; Waagepetersen, R.P. Geometric ergodicity of Metropolis–Hastings algorithms for conditional simulation in generalized linear mixed models. Methodol. Comput. Appl. Probab. 2001, 3, 309–327. 25. Neal, P.; Roberts, G. Optimal scaling for random walk Metropolis on spherically constrained target densities. Methodol. Comput. Appl. Probab. 2008, 10, 277–297. 26. Jarner, S.F.; Tweedie, R.L. Necessary conditions for geometric and polynomial ergodicity of random-walk-type. Bernoulli 2003, 9, 559–578. 27. Øksendal, B. Stochastic Differential Equations; Springer: New York, NY, USA, 2003. 28. Rogers, L.C.G.; Williams, D. Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus; Cambridge University Press: Cambridge, UK, 2000; Volume 2. 29. Meyn, S.P.; Tweedie, R.L. Stability of Markovian processes III: Foster–Lyapunov criteria for continuous-time processes. Adv. Appl. Probab. 1993, 25, 518–548. 30. Coffey, W.; Kalmykov, Y.P.; Waldron, J.T. The Langevin Equation: With Applications to Stochastic Problems in Physics, Chemistry, and Electrical Engineering; World Scientific: Singapore, Singapore, 2004; Volume 14. 31. Roberts, G.O.; Tweedie, R.L.
Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 1996, 2, 341–363. 32. Roberts, G.O.; Stramer, O. Langevin diffusions and Metropolis–Hastings algorithms. Methodol. Comput. Appl. Probab. 2002, 4, 337–357. 33. Xifara, T.; Sherlock, C.; Livingstone, S.; Byrne, S.; Girolami, M. Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Stat. Probab. Lett. 2013, 91, 14–19. 34. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1946, 186, 453–461. 35. Critchley, F.; Marriott, P.; Salmon, M. Preferred point geometry and statistical manifolds. Ann. Stat. 1993, 21, 1197–1224. 36. Marriott, P. On the local geometry of mixture models. Biometrika 2002, 89, 77–93. 37. Barndorff-Nielsen, O.; Cox, D.; Reid, N. The role of differential geometry in statistical theory. Int. Stat. Rev. 1986, 54, 83–96. 38. Boothby, W.M. An Introduction to Differentiable Manifolds and Riemannian Geometry; Academic Press: San Diego, CA, USA, 1986; Volume 120. 39. Lee, J.M. Smooth Manifolds; Springer: New York, NY, USA, 2003. 40. Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992. 41. Nash, J.F., Jr. The imbedding problem for Riemannian manifolds. In The Essential John Nash; Princeton University Press: Princeton, NJ, USA, 2002; p. 151. 42. Manton, J.H. A Primer on Stochastic Differential Geometry for Signal Processing. 2013, arXiv: 1302.0430. 43. Stewart, J. Multivariable Calculus; Cengage Learning: Boston, MA, USA, 2011. 44. Hsu, E.P. Stochastic Analysis on Manifolds; American Mathematical Society: Providence, RI, USA, 2002; Volume 38. 45. Kent, J. Time-reversible diffusions. Adv. Appl. Probab. 1978, 10, 819–835. 46. Radhakrishna Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91. 47. Christensen, O.F.; Roberts, G.O.; Sköld, M. 
Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models. J. Comput. Graph. Stat. 2006, 15, 1–17. 48. Petra, N.; Martin, J.; Stadler, G.; Ghattas, O. A computational framework for infinite-dimensional Bayesian inverse problems: Part II. Stochastic Newton MCMC with application to ice sheet flow inverse problems. 2013, arXiv: 1308.6221. 49. Pawitan, Y. In All Likelihood: Statistical Modelling and Inference Using Likelihood; Oxford University Press: Oxford, UK, 2001. 50. Betancourt, M. A General Metric for Riemannian Manifold Hamiltonian Monte Carlo. In Geometric Science of Information; Springer: New York, NY, USA, 2013; pp. 327–334. 51. Higham, N.J. Computing the nearest correlation matrix—a problem from finance. IMA J. Numer. Anal. 2002, 22, 329–343. 52. Sejdinovic, D.; Garcia, M.L.; Strathmann, H.; Andrieu, C.; Gretton, A. Kernel Adaptive Metropolis–Hastings. 2013, arXiv: 1307.5302. 53. Martin, J.; Wilcox, L.C.; Burstedde, C.; Ghattas, O. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM J. Sci. Comput. 2012, 34, A1460–A1487. 54. Calderhead, B.; Girolami, M. Statistical analysis of nonlinear dynamical systems using differential geometric sampling methods. Interface Focus 2011, 1, 821–835. 55. Stathopoulos, V.; Girolami, M.A. Markov chain Monte Carlo inference for Markov jump processes via the linear noise approximation. Philos. Trans. R. Soc. A 2013, 371, 20110541. 56. Konukoglu, E.; Relan, J.; Cilingir, U.; Menze, B.H.; Chinchapatnam, P.; Jadidi, A.; Cochet, H.; Hocini, M.; Delingette, H.; Jaïs, P.; et al. Efficient probabilistic model personalization integrating uncertainty on data and parameters: Application to eikonal-diffusion models in cardiac electrophysiology. Prog. Biophys. Mol. Biol. 2011, 107, 134–146. 57. Do Carmo, M.P.
Differential Geometry of Curves and Surfaces; Prentice-Hall: Englewood Cliffs, NJ, USA, 1976; Volume 2. 58. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, Singapore, 2007; Volume 1. 59. Cotter, S.; Roberts, G.; Stuart, A.; White, D. MCMC methods for functions: Modifying old algorithms to make them faster. Stat. Sci. 2013, 28, 424–446. 60. Da Prato, G.; Zabczyk, J. Stochastic Equations in Infinite Dimensions; Cambridge University Press: Cambridge, UK, 2008. 61. Law, K.J. Proposals which speed up function-space MCMC. J. Comput. Appl. Math. 2014, 262, 127–138. 62. Ottobre, M.; Pillai, N.S.; Pinski, F.J.; Stuart, A.M. A Function Space HMC Algorithm With Second Order Langevin Diffusion Limit. 2013, arXiv: 1308.0543. 63. Horowitz, A.M. A generalized guided Monte Carlo algorithm. Phys. Lett. B 1991, 268, 247–252. 64. Mardia, K.V.; Jupp, P.E. Directional Statistics; Wiley: New York, NY, USA, 2009; Volume 494. 65. Byrne, S.; Girolami, M. Geodesic Monte Carlo on embedded manifolds. Scand. J. Stat. 2013, 40, 825–845. 66. Diaconis, P.; Holmes, S.; Shahshahani, M. Sampling from a manifold. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton; Institute of Mathematical Statistics: Washington, DC, USA, 2013; pp. 102–125. 67. Latuszynski, K.; Roberts, G.O.; Thiery, A.; Wolny, K. Discussion on “Riemann manifold Langevin and Hamiltonian Monte Carlo methods” (by Girolami, M. and Calderhead, B.). J. R. Stat. Soc. Ser. B 2011, 73, 188–189. 68. Capinski, M.; Kopp, P.E. Measure, Integral and Probability; Springer: New York, NY, USA, 2004. 69. Schutz, B.F. Geometrical Methods of Mathematical Physics; Cambridge University Press: Cambridge, UK, 1984.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Article

Variational Bayes for Regime-Switching Log-Normal Models

Hui Zhao and Paul Marriott *

University of Waterloo, 200 University Avenue West, Waterloo, ON N2L 3G1, Canada; E-Mail: h6zhao@uwaterloo.ca
* E-Mail: pmarriot@uwaterloo.ca; Tel.: +1-519-888-4567.

Received: 14 April 2014; in revised form: 12 June 2014 / Accepted: 7 July 2014 / Published: 14 July 2014

Abstract: The power of projection using divergence functions is a major theme in information geometry. One version of this is the variational Bayes (VB) method. This paper looks at VB in the context of other projection-based methods in information geometry. It also describes how to apply VB to the regime-switching log-normal model and how it provides a computationally fast solution to quantify the uncertainty in the model specification. The results show that the method can recover the model structure exactly, gives reasonable point estimates and is computationally very efficient. The potential problems of the method in quantifying the parameter uncertainty are discussed.

Keywords: information geometry; variational Bayes; regime-switching log-normal model; model selection; covariance estimation

1. Introduction

While, in principle, the calculation of the posterior distribution is mathematically straightforward, in practice, the computation of many of its features, such as posterior densities, normalizing constants and posterior moments, is a major challenge in Bayesian analysis. Such computations typically involve high dimensional integrals, which often have no analytical or tractable forms. The variational Bayes (VB) method was developed to generate tractable approximations to these quantities. This method provides analytic approximations to the posterior distribution by minimizing the Kullback–Leibler (KL) divergence from the approximations to the actual posterior and has been demonstrated to be computationally very fast.
VB gains its computational advantages by making simplifying assumptions about the posterior dependence structure. For example, in the simplest form, it assumes posterior independence between selected sets of parameters. Under these assumptions, the resultant approximate posterior is either known analytically or can be computed by a simple iterative algorithm similar to the Expectation-Maximization (EM) algorithm. In this paper, we show that, as well as having advantages of computational speed, the VB algorithm does an excellent job of model selection, in particular in finding the appropriate number of regimes.

While the simplification in the dependence gives computational advantages, it also comes at a cost. For example, we also found that the posterior variance may be underestimated. In [1], we propose a novel method to compute the true posterior covariance matrix by only using the information obtained from VB approximations.

The use of projections to particular families is, of course, not new to information geometry (IG). In [2], we find the famous Pythagorean results concerning projection using α-divergences to α-families, and other important results on projections based on divergences can be found in [3] and [4] (Chapter 7).

Entropy 2014, 16, 3832–3847; doi:10.3390/e16073832; www.mdpi.com/journal/entropy

1.1. Variational Bayes

Suppose that, in a Bayesian inference problem, we use $q(\tau)$ to approximate the posterior $p(\tau|y)$, where $y$ is the data and $\tau = \{\tau_1, \cdots, \tau_p\}$ is the model parameter vector. The KL divergence between them is defined as:

$$\mathrm{KL}\left[q(\tau)\,\|\,p(\tau|y)\right] = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)}\, d\tau, \tag{1}$$

provided the integral exists. We want to balance two things: keeping the discrepancy between $p$ and $q$ small, while keeping $q$ tractable. Hence, we seek the $q(\tau)$ that minimizes Equation (1), while keeping $q(\tau)$ in an analytically tractable form.
First, note that the evaluation of Equation (1) requires $p(\tau|y)$, which may be unavailable, since in the general Bayesian problem, its normalizing constant is one of the main intractable integrals. However, we note that:

$$\mathrm{KL}\left[q(\tau)\,\|\,p(\tau|y)\right] = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)\,p(y)}\, d\tau + \log p(y) = -\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\, d\tau + \log p(y). \tag{2}$$

Thus, since $\log p(y)$ does not depend on $q$, minimizing Equation (1) is equivalent to maximizing $\int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\, d\tau$, the integral appearing in the first term of the right-hand side of Equation (2). The key computational point is that, often, the term $p(\tau, y)$ is available even when the full posterior $\frac{p(\tau, y)}{\int p(\tau, y)\, d\tau}$ is not.

Definition 1. Let $F(q) = \int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\, d\tau$ and:

$$\hat{q} = \arg\max_{q \in Q} F(q), \tag{3}$$

where $Q$ is a predetermined set of probability density functions over the parameter space. Then $\hat{q}$ is called the variational approximation or variational posterior distribution, and functions of $\hat{q}$ (such as mean, variance, etc.) are called variational parameters.

Some of the power of Definition 1 comes when we assume that all elements of $Q$ have tractable forms. In that case, all variational parameters will also be tractable when the optimization can be achieved. A prime example of a choice for $Q$ is the set of all densities that factorize as

$$q(\tau) = \prod_{i=1}^{d} q_i(\tau_i).$$

This reduces the computational problem from computing a high dimensional integral to one of computing a number of one-dimensional ones. Furthermore, as we see in the example of this paper, it is often the case that the variational families are standard exponential families (since they are often ‘maximum entropy models’ in some sense), and the optimisation problem (3) can be solved by simple iterative methods with very fast convergence.

The core of the method builds on the principle of variational free energy minimization in physics, which is concerned with finding the maxima and minima of a functional over a class of functions, and the method takes its name from this root.
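The identity in Equation (2) can be checked numerically on a toy problem. The following is a minimal sketch, assuming a hypothetical discrete parameter $\tau$ with three possible values, so that the integrals become sums; the numbers in `joint` and `q` are illustrative only and do not come from the paper.

```python
import math

# Hypothetical toy model: tau takes three values. The entries are illustrative
# values of the joint p(tau, y) for one fixed observation y.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # p(y)
posterior = [j / evidence for j in joint]    # p(tau | y)


def kl(q, p):
    """KL[q || p] for discrete distributions with strictly positive p."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def free_energy(q, joint):
    """F(q) = sum over tau of q(tau) * log(p(tau, y) / q(tau))."""
    return sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint) if qi > 0)


q = [0.3, 0.5, 0.2]                          # an arbitrary candidate approximation

# Equation (2): KL[q || p(tau | y)] = log p(y) - F(q)
lhs = kl(q, posterior)
rhs = math.log(evidence) - free_energy(q, joint)

# F(q) attains its maximum, log p(y), when q equals the exact posterior.
f_at_posterior = free_energy(posterior, joint)
```

Because the KL divergence is non-negative, $F(q)$ never exceeds $\log p(y)$, which is why maximizing $F$ over a restricted set $Q$ gives the closest tractable approximation in the sense of Equation (1).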
Early developments of the method can be found in machine learning, especially in applications to neural networks [5,6]. The method has been successfully applied in many different disciplines and domains, for example, in independent component analysis [7,8], graphical models [9,10], information retrieval [11] and factor analysis [12].

In the statistical literature, an early application of the variational principle can be found in the work of [13] to construct Bayes estimators. In recent years, the method has received more attention from both applied and theoretical perspectives; see, for example, [14–18].

1.2. Regime-Switching Models

In this paper, we illustrate the strengths and weaknesses of VB through a detailed case study. In particular, we look at a model that is used in finance, risk management and actuarial science, the so-called regime-switching log-normal model (RSLN) proposed, in this context, by [19].

Switching between different states, or regimes, is a common phenomenon in many time series, and regime-switching models, originally proposed by [20], have been used to model these switching processes. As demonstrated in [21], the maximum likelihood estimate (MLE) does not give a simple method to deal with parameter uncertainty; for details of this method, see [21]. The asymptotic normality of maximum likelihood estimators may not apply for sample sizes commonly found in practice. Hence, to understand parameter uncertainty, [21] considered the RSLN model in a Bayesian framework using the Metropolis–Hastings algorithm.

Furthermore, model uncertainty, in particular selecting the correct number of regimes, is a major issue. Hence, model selection criteria have to be used to choose the “best” model.
Hardy [19] found that a two-regime RSLN model maximized the Bayes information criterion (BIC) [22] for both monthly TSE 300 total return data and S&P 500 total return data; however, according to the Akaike information criterion (AIC) [23], a three-regime model was optimal for the S&P data. To account for the model uncertainty associated with the number of regimes, [24] offered a trans-dimensional model using reversible jump MCMC [25]. We note that BIC is not necessarily ideal for model selection with state space models [26], although it is still commonly used in the literature.

MCMC methods make possible the computation of all posterior quantities; however, there are a number of practical issues associated with their implementation. A primary concern is determining that the generated chain has, in fact, “converged”. In practice, MCMC practitioners have to resort to convergence diagnostic techniques. Furthermore, the computational cost can be a concern. Other implementational issues include the difficulty of making good initialisation choices, whether to run the MCMC algorithm as one long chain or as several shorter chains in parallel, etc. Detailed discussions can be found in [27].

One of the main contributions of this paper is to apply the variational Bayes (VB) method to the RSLN model and present a solution to quantify the uncertainty in model specification. The VB method is a technique that provides analytical approximations to the posterior quantities, and in practice, it is demonstrated to be a much faster alternative to MCMC methods.

2. Variational Bayes and Informational Geometry

In this section, we explore the relationship between VB and IG, in particular the statistical properties of divergence-based projections onto exponential families. Here, we use the IG of [2], in particular the ±1 dual affine parameters for exponential families. One of the most striking results from [2] is the Pythagorean property of these dual affine coordinate systems.
This is illustrated in Figure 1, which shows a schematic representing a model space containing the distribution $f_0(x)$ and an exponential family $f(x; \theta)$.

[Figure 1. Projections onto an exponential family. The schematic shows $f_0(x)$, the family $f(x; \theta)$ with parameter $\theta$, and the −1-geodesic and +1-geodesic.]

The Pythagorean result comes from using the KL divergence to project onto the exponential family $f(x; \theta) = \nu(x) \exp\{s(x)\theta - \psi(\theta)\}$, i.e.,

$$\min_{\theta} \int -\log \frac{f(x; \theta)}{f_0(x)}\, f_0(x)\, dx.$$

All distributions that project to the same point form a −1-flat space defined by all distributions $f(x)$ with the same mean, i.e., $E_{\hat{\theta}}(s(x)) = E_{f(x)}(s(x))$, and further, it is Fisher orthogonal to the +1-flat family $f(x; \theta)$. The statistical interpretation of this concerns the behaviour of a model $f(x; \theta)$ when the data generation process does not lie in the model.

In contrast to this, we have the VB method, which uses the reverse KL divergence for the projection, i.e.,

$$\min_{\theta} \int \log \frac{f(x; \theta)}{f_0(x)}\, f(x; \theta)\, dx.$$

This results in a Fisher orthogonal projection, shown in Figure 1, but now using a +1-flat family. This does not have the property that the mean of $s(x)$ is constant, but as we shall see, it does have nice computational properties when used in the context of Bayesian analysis.

In order to investigate the information geometry of VB, we consider two examples. The first, in Section 3.1, is selected to maximally illustrate the underlying geometric issues and to get some understanding of the quality of the VB approximation. The second, in Section 3.2, shows an important real-world application from actuarial science and is illustrated with simulated and real data.

3. Applications of Variational Bayes

3.1. Geometric Foundation

We consider the simplest model that shows dependence. Let $X_1, X_2$ be two binary random variables, with distribution $\pi := (\pi_{00}, \pi_{10}, \pi_{01}, \pi_{11})$, where $P(X_1 = i, X_2 = j) = \pi_{ij}$, $i, j \in \{0, 1\}$. Further, let the marginal distributions be denoted by $\pi_1 = P(X_1 = 1)$, $\pi_2 = P(X_2 = 1)$.
We want to consider the geometry of the VB projection from a general distribution to the family of independent distributions. This represents the way that VB gains its computational advantages by simplifying the posterior dependence structure. The model space is illustrated in Figure 2, where $\pi$ is represented by a point in the three-simplex, and the independence surface, where $\pi_{00}\pi_{11} = \pi_{10}\pi_{01}$, is also shown.

[Figure 2. Space of distributions with independence surface: marginal probability and dependence. Axes: x, y, z.]

Both the interior of the simplex and the independence surface are exponential families, and it is convenient to use the natural parameters for the interior of the simplex:

$$\xi_1 = \log \frac{\pi_{10}}{\pi_{00}}, \quad \xi_2 = \log \frac{\pi_{01}}{\pi_{00}}, \quad \xi_3 = \log \frac{\pi_{11}\pi_{00}}{\pi_{10}\pi_{01}},$$

where the independence surface is given by $\xi_3 = 0$. The independence surface can also be parameterised by the marginal distributions $\pi_1, \pi_2$ or the corresponding natural parameters $\xi^{ind}_i := \log(\pi_i/(1 - \pi_i))$. Any distribution, $\pi$, represented in natural parameters by $(\xi_1, \xi_2, \xi_3)$, has its VB approximation defined implicitly by the simultaneous equations:

$$\xi^{ind}_1(\pi_1) = \xi_1 + \xi_3 \pi_2, \tag{4}$$

$$\xi^{ind}_2(\pi_2) = \xi_2 + \xi_3 \pi_1. \tag{5}$$

These can be solved, as is typical with VB methods, by iterating updated estimates of $\pi_1$ and $\pi_2$ across the two equations. We show this in a realistic example in the following section.

Having seen the VB solution in this simple model, we can investigate the quality of the approximation. If we were using the forward KL projection, as proposed by [2], then the mean would be preserved by the approximation, while, of course, the variance structure is distorted. In the case of using the reverse KL projection, as used by VB, the mean will not be preserved, but in this example, we can investigate the distortion explicitly. Let $(\xi_1(\alpha), \xi_2(\alpha), \xi_3(\alpha))$ be a +1-geodesic, which cuts the independence surface orthogonally and is parameterised by $\alpha$, where $\alpha = 0$ corresponds to the independence surface.
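Returning briefly to Equations (4) and (5), the iteration across the two equations can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the starting point, the stopping tolerance and the example probabilities are all assumptions made here. Each update inverts the logit $\xi^{ind}_i = \log(\pi_i/(1-\pi_i))$ with the logistic function.

```python
import math


def sigmoid(x):
    """Inverse of the logit map xi -> log(pi / (1 - pi))."""
    return 1.0 / (1.0 + math.exp(-x))


def natural_parameters(p00, p10, p01, p11):
    """Natural parameters (xi1, xi2, xi3) of a 2x2 table, P(X1=i, X2=j) = p_ij."""
    xi1 = math.log(p10 / p00)
    xi2 = math.log(p01 / p00)
    xi3 = math.log(p11 * p00 / (p10 * p01))
    return xi1, xi2, xi3


def vb_marginals(xi1, xi2, xi3, start=(0.5, 0.5), tol=1e-12, max_iter=10_000):
    """Iterate Equations (4) and (5) until successive updates stop changing.

    The start value, tolerance and iteration cap are illustrative choices.
    """
    pi1, pi2 = start
    for _ in range(max_iter):
        new_pi1 = sigmoid(xi1 + xi3 * pi2)       # Equation (4)
        new_pi2 = sigmoid(xi2 + xi3 * new_pi1)   # Equation (5)
        if abs(new_pi1 - pi1) < tol and abs(new_pi2 - pi2) < tol:
            return new_pi1, new_pi2
        pi1, pi2 = new_pi1, new_pi2
    return pi1, pi2


# Hypothetical dependent distribution (p00, p10, p01, p11).
xi = natural_parameters(0.4, 0.2, 0.1, 0.3)
vb_pi1, vb_pi2 = vb_marginals(*xi)

# Exact marginals of the same table, for comparison with the VB solution.
true_pi1, true_pi2 = 0.2 + 0.3, 0.1 + 0.3
```

When $\xi_3 = 0$ the two equations decouple and the iteration returns the exact marginals after one pass; for $\xi_3 \neq 0$ the VB marginals generally differ from the true ones, which is the distortion of the mean discussed in the surrounding text.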
In this example, all such geodesics can be computed explicitly. Figure 3 shows the distortion associated with the VB approximation. In the left-hand panel, we show the mean, which is the marginal probability, $P(X_1 = 1)$, for all points on the orthogonal geodesic. We see, as expected, that this is not constant, but it is locally constant at $\alpha = 0$, showing that the distortion of the mean can be small near the independence surface. The right-hand panel shows the dependence, as measured by the log-odds, for points on the geodesic. As expected, the VB does not preserve the dependence structure; indeed, it is designed to exploit the simplification of the dependence structure.

[Figure 3. Distortion implied by variational Bayes (VB) approximation.]

3.2. Variational Bayes for the RSLN Model

The regime-switching log-normal model [19] with a fixed finite number, $K$, of regimes can be described as a bivariate discrete time process with the observed data sequence $w_{1:T} = \{w_t\}_{t=1}^{T}$ and the unobserved regime sequence $S_{1:T} = \{S_t\}_{t=1}^{T}$, where $S_t \in \{1, \cdots, K\}$ and $T$ is the number of observations. The logarithm of $w_t$, denoted by $y_t = \log w_t$, is assumed normally distributed, having mean $\mu_i$ and variance $\sigma_i^2$, both dependent on the hidden regime $S_t = i$. The sequence $S_{1:T}$ is assumed to follow a first order Markov chain having transition probabilities $A = (a_{ij})$, with initial probabilities $\pi = (\pi_i)_{i=1}^{K}$ for the first regime. The RSLN model is a special case of more general state-space models, which were studied in detail by [28].

In this paper, we use this model and simulated and real data to illustrate the VB method in practice. We also calibrate its performance by referring to [24], which used MCMC methods to fit the same model to the same data. Here, we are regarding the MCMC analysis as a form of “gold-standard”, but with the cost of being orders of magnitude slower than VB in computational time.
In the Bayesian framework, we use a symmetric Dirichlet prior for $\pi$, that is $p(\pi) = \mathrm{Dir}\left(\pi; \frac{C^{\pi}}{K}, \cdots, \frac{C^{\pi}}{K}\right)$, for $C^{\pi} > 0$. Let $a_i$ denote the $i$-th row vector of $A$. The prior for $A$ is chosen as $p(A) = \prod_{i=1}^{K} p(a_i) = \prod_{i=1}^{K} \mathrm{Dir}\left(a_i; \frac{C^{A}}{K}, \cdots, \frac{C^{A}}{K}\right)$, for $C^{A} > 0$, and the prior distribution for $\{(\mu_i, \sigma_i^2)\}_{i=1}^{K}$ is chosen to be normal-inverse gamma, $p(\{\mu_i, \sigma_i^2\}_{i=1}^{K}) = \prod_{i=1}^{K} N\left(\mu_i | \sigma_i^2; \gamma, \frac{\sigma_i^2}{\eta^2}\right) IG(\sigma_i^2; \alpha, \beta)$. In the above setting, $C^{\pi}$, $C^{A}$, $\gamma$, $\eta^2$, $\alpha$ and $\beta$ are hyper-parameters. Thus, the joint posterior distribution of $\pi$, $A$, $\{\mu_i, \sigma_i^2\}_{i=1}^{K}$ and $S_{1:T}$ is $p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T} | y_{1:T})$ and is proportional to:

$$p(S_1|\pi) \prod_{t=1}^{T-1} p(S_{t+1}|S_t; A) \prod_{t=1}^{T} p(y_t|S_t; \{\mu_i, \sigma_i^2\}_{i=1}^{K})\, p(\pi)\, p(A)\, p(\{\mu_i, \sigma_i^2\}_{i=1}^{K}). \tag{6}$$

This posterior distribution and its corresponding marginal posterior distributions are analytically intractable. In VB, we seek an approximation of Equation (6), denoted by $q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T})$, for which we want to balance two things: keeping the discrepancy between Equation (6) and $q$ small, while keeping $q$ tractable.

In general, there are two ways to choose $q$. The first is to specify a particular distributional family for $q$, for example the multivariate normal distribution. The other is to choose $q$ with a simpler dependency structure than that of Equation (6); for example, we choose $q$, which factorizes as:

$$q(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}) = q(\pi) \prod_{i=1}^{K} q(a_i) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T}). \tag{7}$$

The Kullback–Leibler (KL) divergence [29] can be used as the measure of dissimilarity between Equations (6) and (7). For succinctness, we denote $\tau = (\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T})$; thus the KL divergence is defined as:

$$\mathrm{KL}(q(\tau)\,\|\,p(\tau|y)) = \int q(\tau) \log \frac{q(\tau)}{p(\tau|y)}\, d\tau. \tag{8}$$

Note that the evaluation of Equation (8) requires $p(\tau|y)$, which is unavailable.
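However, we note that:

$$\mathrm{KL}(q(\tau)\,\|\,p(\tau|y)) = \log p(y) - \int q(\tau) \log \frac{p(\tau, y)}{q(\tau)}\, d\tau.$$

Given the factorization in Equation (7), and writing $q(A) = \prod_{i=1}^{K} q(a_i)$, this can be written as:

$$\mathrm{KL}(q(\tau)\,\|\,p(\tau|y)) = \log p(y) - \int \sum_{S_{1:T}} q(\pi)\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T}) \log \frac{p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})}{q(\pi)\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T})}\, d\pi\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^{K}.$$

Consider first the $q(\pi)$ term. The right-hand side can be rearranged as:

$$\mathrm{KL}\left( q(\pi)\, \middle\|\, \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \right\}}{Z_{\pi}} \right) + K_{\pi}, \tag{9}$$

where:

$$K_{\pi} = \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2) \log \left[ q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2)\, q(S_{1:T}) \right] dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} - \log Z_{\pi} + \log p(y),$$

and $Z_{\pi}$ is a normalizing term. The first term of Equation (9) is the only term that depends on $q(\pi)$. Thus, the minimum value of $\mathrm{KL}(q(\tau)\,\|\,p(\tau|y))$ is achieved when this term equals zero. Hence, we obtain:

$$q(\pi) = \frac{\exp\left\{ \int \sum_{S_{1:T}} q(S_{1:T})\, q(A) \prod_{i=1}^{K} q(\mu_i|\sigma_i^2)\, q(\sigma_i^2) \log p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})\, dA\, d\{\mu_i, \sigma_i^2\}_{i=1}^{K} \right\}}{Z_{\pi}}. \tag{10}$$

Given the joint distribution $p(\pi, A, \{\mu_i, \sigma_i^2\}_{i=1}^{K}, S_{1:T}, y_{1:T})$ in the form of Equation (6), the straightforward evaluation of Equation (10) results in:

$$q(\pi) \propto \prod_{i=1}^{K} \pi_i^{\frac{C^{\pi}}{K} + w^s_i - 1} = \mathrm{Dir}(\pi; w^{\pi}_1, \cdots, w^{\pi}_K); \quad w^{\pi}_i = \frac{C^{\pi}}{K} + w^s_i, \quad w^s_i = E_{q(S_{1:T})}[S_{1,i}], \tag{11}$$

where $S_{1,i} = 1$ if the process is in state $i$ at time 1, and zero otherwise.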
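The update in Equation (11) is simple enough to sketch directly. The code below is an illustration under stated assumptions, not the authors' code: the function names and the example values of $w^s_i$ and $C^{\pi}$ are hypothetical, and the digamma function is a small hand-written approximation so that the sketch needs only the standard library. It also evaluates $\exp\{E_{q(\pi)}[\log \pi_i]\}$, using the standard Dirichlet identity $E[\log \pi_i] = \psi(w_i) - \psi(\sum_j w_j)$, since expectations of this kind are what an iterative scheme needs from $q(\pi)$.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (upward recurrence plus asymptotic series)."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:                   # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ log x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result


def update_q_pi(w_s, c_pi):
    """Dirichlet parameters of q(pi): w_i = C^pi / K + w^s_i, as in Equation (11)."""
    k = len(w_s)
    return [c_pi / k + w for w in w_s]


def expected_log_pi(w_pi):
    """E_q[log pi_i] = psi(w_i) - psi(sum_j w_j) for a Dirichlet q(pi)."""
    total = digamma(sum(w_pi))
    return [digamma(w) - total for w in w_pi]


# Hypothetical values of w^s_i = E_q[S_{1,i}] for K = 3 regimes (they sum to one).
w_s = [0.7, 0.2, 0.1]
w_pi = update_q_pi(w_s, c_pi=1.0)
posterior_mean = [w / sum(w_pi) for w in w_pi]
pi_star = [math.exp(v) for v in expected_log_pi(w_pi)]
```

Note that the values in `pi_star` sum to less than one (by Jensen's inequality), so they act as unnormalised weights rather than as a probability vector.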
Similarly, we can rearrange Equation (9) with respect to $\{q(a_i)\}_{i=1}^{K}$, $\{q(\mu_i|\sigma_i^2)\}_{i=1}^{K}$, $\{q(\sigma_i^2)\}_{i=1}^{K}$ and $q(S_{1:T})$, respectively, and using the same arguments, we obtain:

$$q(A) = \prod_{i=1}^{K} \mathrm{Dir}(a_i; w^A_{i1}, \ldots, w^A_{iK}); \quad w^A_{ij} = \frac{C^A}{K} + v^s_{ij}, \tag{12}$$

$$q(\mu_i|\sigma_i^2) = N\left(\gamma'_i, \frac{\sigma_i^2}{\kappa_i}\right), \quad \gamma'_i = \frac{\eta^2 \gamma + p^s_i}{\eta^2 + q^s_i}, \quad \kappa_i = \eta^2 + q^s_i, \tag{13}$$

$$q(\sigma_i^2) = IG\left(\alpha'_i, \beta'_i\right), \quad \alpha'_i = \alpha + \frac{q^s_i}{2}, \quad \beta'_i = \beta + \frac{r^s_i}{2} + \frac{\eta^2}{2}(\gamma'_i - \gamma)^2, \tag{14}$$

$$q(S_{1:T}) = \frac{1}{\tilde{Z}} \prod_{i=1}^{K} (\pi^*_i)^{S_{1,i}} \prod_{t=1}^{T-1} \prod_{i=1}^{K} \prod_{j=1}^{K} (a^*_{ij})^{S_{t,i} S_{t+1,j}} \prod_{t=1}^{T} \prod_{i=1}^{K} (\theta^*_{i,t})^{S_{t,i}}, \tag{15}$$

where $S_{t,i} = 1$ if the process is in state $i$ at time $t$, and zero otherwise, $\tilde{Z}$ is a normalizing term, and with

$$\pi^*_i = \exp\{E_{q(\pi)}[\log \pi_i]\}, \quad a^*_{ij} = \exp\{E_{q(A)}[\log a_{ij}]\}, \quad \theta^*_{i,t} = \exp\{E_{q(\mu_i|\sigma_i^2) q(\sigma_i^2)}[\log \phi_i(y_t)]\},$$

$$v^s_{ij} = \sum_{t=1}^{T-1} E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}], \quad p^s_i = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}]\, y_t, \quad q^s_i = \sum_{t=1}^{T} E_{q(S_{1:T})}[S_{t,i}], \quad r^s_i = \sum_{t=1}^{T} (\gamma'_i - y_t)^2 E_{q(S_{1:T})}[S_{t,i}].$$

Here, $\psi$ is the digamma function, $\phi_i$ is the normal density function for regime $i$ and the exact functional forms used in the updates are shown in Algorithm 1.

Algorithm 1. Variational Bayes algorithm for the regime-switching log-normal (RSLN) model.

Initialize $w^{s\,(0)}_i$, $v^{s\,(0)}_{ij}$, $p^{s\,(0)}_i$, $q^{s\,(0)}_i$ and $r^{s\,(0)}_i$ at step 0.

while $w^{\pi\,(t-1)}_i$, $w^{A\,(t-1)}_{ij}$, $\gamma'^{\,(t-1)}_i$, $\alpha'^{\,(t-1)}_i$, $\beta'^{\,(t-1)}_i$, $\pi^{*\,(t-1)}_i$, $a^{*\,(t-1)}_{ij}$ and $\theta^{*\,(t-1)}_{i,t}$ do not converge do

1. Compute $w^{\pi\,(t)}_i$, $w^{A\,(t)}_{ij}$, $\gamma'^{\,(t)}_i$, $\kappa^{(t)}_i$, $\alpha'^{\,(t)}_i$ and $\beta'^{\,(t)}_i$ at step $t$ by:

$$w^{\pi\,(t)}_i = \frac{C^{\pi}}{K} + w^{s\,(t-1)}_i, \quad w^{A\,(t)}_{ij} = \frac{C^{A}}{K} + v^{s\,(t-1)}_{ij}, \quad \gamma'^{\,(t)}_i = \frac{\eta^2 \gamma + p^{s\,(t-1)}_i}{\eta^2 + q^{s\,(t-1)}_i}, \quad \kappa^{(t)}_i = \eta^2 + q^{s\,(t-1)}_i,$$

$$\alpha'^{\,(t)}_i = \alpha + \frac{q^{s\,(t-1)}_i}{2}, \quad \beta'^{\,(t)}_i = \beta + \frac{r^{s\,(t-1)}_i}{2} + \frac{\eta^2}{2}\left(\gamma'^{\,(t)}_i - \gamma\right)^2.$$

2. Compute $\pi^{*\,(t)}_i$, $\theta^{*\,(t)}_{i,t}$ and $a^{*\,(t)}_{ij}$ at step $t$ by:

$$\pi^{*\,(t)}_i = \exp\left\{ \psi\left(w^{\pi\,(t)}_i\right) - \psi\left(\sum_{i} w^{\pi\,(t)}_i\right) \right\}, \quad a^{*\,(t)}_{ij} = \exp\left\{ \psi\left(w^{A\,(t)}_{ij}\right) - \psi\left(\sum_{j} w^{A\,(t)}_{ij}\right) \right\},$$

$$\theta^{*\,(t)}_{i,t} = \exp\left\{ -\frac{1}{2} \log 2\pi - \frac{1}{2}\left(\log \beta'^{\,(t)}_i - \psi\left(\alpha'^{\,(t)}_i\right)\right) - \frac{1}{2}\left[ \left(y_t - \gamma'^{\,(t)}_i\right)^2 \frac{\alpha'^{\,(t)}_i}{\beta'^{\,(t)}_i} + \frac{1}{\kappa^{(t)}_i} \right] \right\}.$$

3. Compute $w^{s\,(t)}_i$, $v^{s\,(t)}_{ij}$, $p^{s\,(t)}_i$, $q^{s\,(t)}_i$ and $r^{s\,(t)}_i$ at step $t$ by:

$$w^{s\,(t)}_i = E_{q^{(t)}(S_{1:T})}[S_{1,i}], \quad v^{s\,(t)}_{ij} = \sum_{t=1}^{T-1} E_{q^{(t)}(S_{1:T})}[S_{t,i} S_{t+1,j}], \quad p^{s\,(t)}_i = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}]\, y_t,$$

$$q^{s\,(t)}_i = \sum_{t=1}^{T} E_{q^{(t)}(S_{1:T})}[S_{t,i}], \quad r^{s\,(t)}_i = \sum_{t=1}^{T} \left(\gamma'^{\,(t)}_i - y_t\right)^2 E_{q^{(t)}(S_{1:T})}[S_{t,i}].$$

$t \Leftarrow t + 1$

end while

The VB method proceeds, as was shown with the simple Equations (4) and (5), by iteratively updating the variational parameters to solve a set of simultaneous equations. In this example, the update equations for the variables $\pi$, $A$, $\{\mu_i, \sigma_i^2\}_{i=1}^{K}$, $S_{1:T}$ are given explicitly by Algorithm 1. For the initialisation, we choose symmetric values for most of the parameters and choose random values for others, as appropriate. For this example, this worked very satisfactorily, although we note that for more general state space models, [28] states that finding good initial values can be non-trivial.

3.3. Interpretation of Results

First, all approximating distributions above turn out to lie in well-known parametric families. The only unknown quantities are the parameters of these distributions, which are often called the variational parameters. The evaluation of the parameters of $q(\pi)$, $q(A)$, $q(\mu_i|\sigma_i^2)$ and $q(\sigma_i^2)$ requires knowledge of $q(S_{1:T})$, and also, the evaluation of $\pi^*_i$, $a^*_{ij}$ and $\theta^*_{i,t}$ requires knowledge of $q(\pi)$, $q(A)$, $q(\mu_i|\sigma_i^2)$ and $q(\sigma_i^2)$. This structure leads to an iterative updating scheme, described in Algorithm 1.

The main computational effort in Algorithm 1 is computing $E_{q(S_{1:T})}[S_{t,i}]$ and $E_{q(S_{1:T})}[S_{t,i} S_{t+1,j}]$, which have no simple tractable forms. We note that the distributional form of $q(S_{1:T})$ has a very similar structure to the conditional distribution $p(S_{1:T}|Y_{1:T}, \tau)$, for which the forward-backward algorithm [30] is commonly used to compute $E_{p(S_{1:T}|Y_{1:T},\tau)}[S_{t,i}|Y_{1:T}, \tau]$ and $E_{p(S_{1:T}|Y_{1:T},\tau)}[S_{t,i} S_{t+1,j}|Y_{1:T}, \tau]$.
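A forward-backward pass for a distribution of the form of Equation (15) can be sketched as below. This is a generic scaled forward-backward recursion written for illustration, not the authors' code; the function names and the small numerical inputs are assumptions. The weights $\pi^*_i$, $a^*_{ij}$ and $\theta^*_{i,t}$ need not be normalised, because the per-step scaling constants absorb the normalisation. A brute-force enumeration over all regime paths is included only as a check on small problems.

```python
from itertools import product


def forward_backward(pi_star, a_star, theta_star):
    """Return (gamma, xi) with gamma[t][i] = E_q[S_{t,i}] and
    xi[t][i][j] = E_q[S_{t,i} S_{t+1,j}] for q(S) as in Equation (15).

    pi_star[i], a_star[i][j] and theta_star[t][i] are non-negative weights.
    """
    T, K = len(theta_star), len(pi_star)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            raw = [pi_star[i] * theta_star[0][i] for i in range(K)]
        else:
            raw = [theta_star[t][j] * sum(alpha[t - 1][i] * a_star[i][j] for i in range(K))
                   for j in range(K)]
        c = sum(raw)                      # scaling constant, avoids underflow
        scale.append(c)
        alpha.append([r / c for r in raw])
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(a_star[i][j] * theta_star[t + 1][j] * beta[t + 1][j] for j in range(K))
                   / scale[t + 1] for i in range(K)]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(K)] for t in range(T)]
    xi = [[[alpha[t][i] * a_star[i][j] * theta_star[t + 1][j] * beta[t + 1][j] / scale[t + 1]
            for j in range(K)] for i in range(K)] for t in range(T - 1)]
    return gamma, xi


def brute_force_marginals(pi_star, a_star, theta_star):
    """Same marginals by enumerating all K**T paths (small T only)."""
    T, K = len(theta_star), len(pi_star)
    gamma = [[0.0] * K for _ in range(T)]
    total = 0.0
    for path in product(range(K), repeat=T):
        w = pi_star[path[0]]
        for t in range(T):
            w *= theta_star[t][path[t]]
            if t > 0:
                w *= a_star[path[t - 1]][path[t]]
        total += w
        for t in range(T):
            gamma[t][path[t]] += w
    return [[g / total for g in row] for row in gamma]


# Hypothetical unnormalised weights for K = 2 regimes and T = 4 time points.
pi_star = [0.6, 0.3]
a_star = [[0.9, 0.05], [0.2, 0.7]]
theta_star = [[0.5, 0.1], [0.4, 0.3], [0.05, 0.6], [0.2, 0.2]]
gamma, xi = forward_backward(pi_star, a_star, theta_star)
check = brute_force_marginals(pi_star, a_star, theta_star)
```

The returned `gamma` and `xi` are exactly the expectations that enter $w^s_i$, $v^s_{ij}$, $p^s_i$, $q^s_i$ and $r^s_i$; the recursion costs on the order of $TK^2$ operations, whereas the enumeration grows as $K^T$.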
Therefore, we also use the forward-backward algorithm to compute $E_{q(S_{1:T})}[S_{t,i}]$ and $E_{q(S_{1:T})}[S_{t,i}S_{t+1,j}]$.

Since the conditional distribution $q(\mu_i|\sigma_i^2)$ is $N\left(\mu_i|\sigma_i^2; \gamma'_i, \frac{\sigma_i^2}{\kappa_i}\right)$, the marginal distribution of $\mu_i$ is the location-scale t distribution, denoted as $t_{2\alpha'_i}\left(\mu_i; \gamma'_i, \frac{\kappa_i}{\beta'_i/\alpha'_i}\right)$, where the density function of $t_\nu(x; \mu, \lambda)$ is defined as

$p(x|\nu, \mu, \lambda) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi\nu}\right)^{\frac{1}{2}} \left(1 + \frac{\lambda(x-\mu)^2}{\nu}\right)^{-\frac{\nu+1}{2}}$,

for $x, \mu \in (-\infty, +\infty)$ and $\nu, \lambda > 0$.

4. Numerical Studies

4.1. Simulated Data

In this section, we apply the VB solutions to four sets of simulated data, which are used in [24]. Through these simulation studies, we test the performance of VB in detecting the number of regimes and compare it with that of the BIC and the MCMC methods [24]. For this paper, we present only an initial study with a relatively small number of datasets. The results are highly promising, but more extensive studies are needed to draw comprehensive conclusions. Furthermore, see [28] for general results on VB in hidden state space models.

To estimate the number of regimes, we construct a matrix, called the relative magnitude matrix (RMM), defined as $A' = (\hat{a}'_{ij})$, where $\hat{a}'_{ij} = \frac{w^A_{ij}}{w^A_0}$, $w^A_0 = \sum_{i=1}^{K}\sum_{j=1}^{K} w^A_{ij}$ and $w^A_{ij}$ is the parameter of $q(A)$. Our model selection procedure is to fit a VB with a large number of regimes and to examine the rows and columns of the RMM. If the values of the entries in the i-th row and the i-th column of $A'$ are all equal to $\frac{C^A/K}{T - 1 + C^A \times K}$, then we declare regime i nonexistent. This method is validated by the following observations. It can be shown that the term $v^s_{ij}$ in $w^A_{ij}$ is equal to the number of times the process leaves regime i and enters regime j. Therefore, for the i-th regime, values of zero for all of $v^s_{ji}$ and $v^s_{ij}$ with $j = 1, \cdots, K$ indicate that there is no transition process entering or leaving regime i.
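The RMM-based selection rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `relative_magnitude_matrix` and `absent_regimes`, the relative tolerance `rel_tol` and the toy numbers in the usage note are hypothetical and do not come from the paper; only the normalisation by $w^A_0$ and the baseline value $(C^A/K)/(T - 1 + C^A K)$ are taken from the text.

```python
def relative_magnitude_matrix(w_A):
    """Divide every Dirichlet parameter w^A_ij of q(A) by the grand total w^A_0."""
    total = sum(sum(row) for row in w_A)
    return [[w / total for w in row] for row in w_A]


def absent_regimes(w_A, C_A, T, rel_tol=1e-6):
    """Return the (0-based) regimes whose whole RMM row and column sit at the
    baseline (C_A / K) / (T - 1 + C_A * K), i.e. regimes that no expected
    transition enters or leaves. rel_tol is an assumed numerical tolerance."""
    K = len(w_A)
    rmm = relative_magnitude_matrix(w_A)
    baseline = (C_A / K) / (T - 1 + C_A * K)
    absent = []
    for i in range(K):
        # Entries of the i-th row followed by entries of the i-th column.
        entries = rmm[i] + [rmm[j][i] for j in range(K)]
        if all(abs(e - baseline) <= rel_tol * baseline for e in entries):
            absent.append(i)
    return absent
```

As a usage illustration with made-up numbers, taking K = 3, C_A = 0.3 and T = 11, a matrix `w_A` whose middle row and column all equal C_A/K = 0.1 would lead `absent_regimes` to flag the middle regime only. In practice the tolerance would have to be chosen with the rounding of the reported RMM entries in mind.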
Table 1 specifies the parameters for the four cases, and we generate 671 observations for each case (equal to the number of months from January 1956 to September 2011). The parameters used in Case 1 are identical to the maximum likelihood estimates for TSX monthly return data from 1956 to 1999 [19]. Case 2 only has one regime present. Case 3 is similar to Case 1, but the two regimes have the same mean. Case 4 adds a third regime. For each case, we use MLE to fit a one-regime, two-regime, three-regime and four-regime RSLN model and report the corresponding BIC and log-likelihood scores. We then misspecify the number of regimes and run a four-regime VB algorithm.

Table 1. Parameters of the simulated data.
Case   Regime 1 (μi, σi)   Regime 2 (μi, σi)   Regime 3 (μi, σi)   Transition Probability
1      (0.012, 0.035)      (−0.016, 0.078)     -                   $\begin{pmatrix} 0.963 & 0.037 \\ 0.210 & 0.790 \end{pmatrix}$
2      (0.014, 0.050)      -                   -                   -
3      (0.000, 0.035)      (0.000, 0.078)      -                   $\begin{pmatrix} 0.963 & 0.037 \\ 0.210 & 0.790 \end{pmatrix}$
4      (0.012, 0.035)      (−0.016, 0.078)     (0.04, 0.01)        $\begin{pmatrix} 0.953 & 0.037 & 0.01 \\ 0.210 & 0.780 & 0.01 \\ 0.80 & 0.190 & 0.01 \end{pmatrix}$

Table 2 shows the number of iterations that VB takes to converge in each case and the corresponding computational time (on a MacBook, 2 GHz processor). On average, VB converges after a hundred iterations and takes about one minute. On the same computer, a $10^4$-iteration Reversible Jump MCMC (RJMCMC) will take about 10 h to finish. Using diagnostics, this seemed to be enough for convergence, while not being an “unfair” comparison in terms of time with VB. We can see that the computational efficiency is a very attractive feature of the VB method.

The results of the BIC with the log-likelihood (in parentheses), the relative magnitude matrices and the posterior probabilities for the models with the different numbers of regimes estimated by MCMC (cited from Hartman and Heaton [24]) are given in Table 3. In Case 1, the BIC favors the two-regime model.
The posterior probability estimated by MCMC for the one-regime model is the largest, but there is still a large probability for the two-regime model. Note that the prior specification for the number of regimes can affect these numbers and is always an issue with these forms of multidimensional MCMC. The relative magnitude matrix clearly shows that there are only two regimes whose $\hat{a}'_{ij}$ are not negligible. This implies that VB removes excess transition and emission processes and discovers the exact number of hidden regimes. In Case 2 and Case 3, both VB and the BIC select the correct number of regimes, and the posterior probability for the one-regime model estimated by MCMC is still the largest. In Case 4, VB does not detect the third regime. The transition probability to this regime is only 0.01, and the means and standard deviations of Regime 1 make the rare data from Regime 3 easily merged with the data from Regime 1. From Table 3, it is clear that for all of the cases, the log-likelihood always increases as the number of regimes increases.

Table 2. Computational efficiency of VB.
                          Case 1    Case 2    Case 3    Case 4
Iterations to converge    62        182       132       94
Computational time [s]    27.161    80.842    58.510    45.044

Table 3. The estimated number of regimes by VB, BIC and MCMC.
Case   No. of Regimes   MLE: BIC (Log-Likelihood)   RJMCMC: Posterior Probability
1      1                1,108.875 (1,115.384)       0.647
1      2                1,158.227 (1,174.499)       0.214
1      3                1,156.370 (1,182.405)       0.088
1      4                1,153.150 (1,188.948)       <0.052
2      1                1,045.448 (1,051.957)       0.864
2      2                1,038.360 (1,054.632)       0.109
2      3                1,030.733 (1,056.768)       0.020
2      4                1,026.882 (1,062.680)       <0.006
3      1                1,110.903 (1,117.411)       0.629
3      2                1,139.214 (1,155.486)       0.221
3      3                1,131.904 (1,157.719)       0.098
3      4                1,121.921 (1,157.940)       <0.052
4      1                1,044.819 (1,051.328)       0.641
4      2                1,092.610 (1,108.881)       0.203
4      3                1,087.435 (1,113.470)       0.094
4      4                1,080.240 (1,116.038)       <0.06
VB relative magnitude matrix (four-regime fit), one per case:
Case 1: $\begin{pmatrix} 0.14357 & 0.00004 & 0.00004 & 0.03153 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.03018 & 0.00004 & 0.00004 & 0.79428 \end{pmatrix}$
Case 2: $\begin{pmatrix} 0.99944 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \end{pmatrix}$
Case 3: $\begin{pmatrix} 0.11322 & 0.00004 & 0.00004 & 0.02647 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.02659 & 0.00004 & 0.00004 & 0.83327 \end{pmatrix}$
Case 4: $\begin{pmatrix} 0.22643 & 0.00004 & 0.00004 & 0.05518 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.00004 & 0.00004 & 0.00004 & 0.00004 \\ 0.05377 & 0.00004 & 0.00004 & 0.66417 \end{pmatrix}$

4.2. Real Data

In this section, we apply the VB solution to the TSX monthly total return index in the period from January 1956 to December 1999 (528 observations in total, studied in [19,21]). A four-regime VB is implemented first. VB converges after 100 iterations in about 34.284 s (on a MacBook, 2 GHz processor). The relative magnitude matrix, given in Table 4, clearly shows that VB identifies two regimes. This matches both the BIC-based and the AIC-based results [19]. Based on these results, we then fit a two-regime VB, which converges after 83 iterations in about 14.241 s. Table 5 gives the marginal distributions for all of the parameters. Figure 4 presents the corresponding density functions, where we can see that all of the plots show a symmetric and bell-shaped pattern.

Table 4. Estimations of the number of regimes for TSX data.
January 1956–December 1999, R.M.M.: $\begin{pmatrix} 0.11496 & 0.00005 & 0.00005 & 0.02803 \\ 0.00005 & 0.00005 & 0.00005 & 0.00005 \\ 0.00005 & 0.00005 & 0.00005 & 0.00005 \\ 0.02853 & 0.00005 & 0.00005 & 0.82791 \end{pmatrix}$

Table 5. The marginal distributions of the parameters estimated by VB.
Parameter   Distribution                        Mean               s.d.
μ1          $t_{454.61}(0.0123, 370778.19)$     0.0123             0.00165
σ1²         IG(227.30, 0.28)                    0.00122 (0.0349)   0.00008
μ2          $t_{80.39}(−0.0161, 12987.55)$      −0.0161            0.00889
σ2²         IG(40.20, 0.24)                     0.00603 (0.0777)   0.00098
p1,2        Beta(15.21, 434.78)                 0.0338             0.00851
p2,1        Beta(15.00, 61.21)                  0.1969             0.04525
Transition probability (from the means of p1,2 and p2,1): $\begin{pmatrix} 0.9662 & 0.0338 \\ 0.1969 & 0.8031 \end{pmatrix}$

Figure 4. The VB marginal distributions of the parameters. (a) μ2 (left) and μ1 (right); (b) σ1² (left) and σ2² (right); (c) p1,2 (left) and p2,1 (right).

Table 6 (the upper part) gives the maximum likelihood estimates (cited from [19]), mean parameters computed by the MCMC method (cited from [21]) and mean parameters computed by VB. It clearly shows that the point estimates by VB are very close to those by MLE and MCMC. The numbers in parentheses in Table 6 are the standard deviations computed by the three methods, respectively. It is worth noting that all of the variances estimated by VB are smaller than those by the MLE or MCMC methods. In fact, other researchers also report the underestimation of posterior variance in other VB applications, for example [31,32]. In the paper [1], we look at some diagnostic methods that can assess how well the VB approximates the true posterior, particularly with regard to its covariance structure. The methods proposed also allow us to generate simple corrections when the approximation error is large.

Table 6. Estimates and standard deviations by VB, MLE and MCMC.
        μ1                 σ1                 p1,2               μ2                  σ2                 p2,1
VB      0.0123 (0.00165)   0.0349 (0.00008)   0.0338 (0.00851)   −0.0161 (0.00889)   0.0777 (0.00098)   0.1969 (0.04525)
MLE     0.0123 (0.002)     0.0347 (0.001)     0.0371 (0.012)     −0.0157 (0.010)     0.0778 (0.009)     0.2101 (0.086)
MCMC    0.0122 (0.002)     0.0351 (0.002)     0.0334 (0.012)     −0.0164 (0.010)     0.0804 (0.009)     0.2058 (0.065)

5. Conclusions

Variational Bayes can be thought of in terms of information geometry as a projection-based approximation technique; it provides a framework to approximate posteriors. We applied this method to the regime-switching log-normal model and provided solutions that account for both model uncertainty and parameter uncertainty. The numerical results show that our method can recover exactly the number of regimes and gives reasonable point estimates. The VB method is also demonstrated to be very computationally efficient. The application to the TSX monthly total return index data in the period from January 1956 to December 1999 confirms similar results in the literature on the number of regimes.

Author Contributions

The article was written by Hui Zhao under the guidance of Paul Marriott. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhao, H.; Marriott, P. Diagnostics for variational Bayes approximations. 2013, arXiv:1309.5117.
2. Amari, S.-I. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1990.
3. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803.
4. Kass, R.; Vos, P. Geometrical Foundations of Asymptotic Inference; Wiley: New York, NY, USA, 1997.
5. Hinton, G.E.; van Camp, D. Keeping neural networks simple by minimizing the description length of the weights.
In Proceedings of the 6th ACM Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; ACM: New York, NY, USA, 1993.
6. MacKay, D. Developments in Probabilistic Modelling with Neural Networks – Ensemble Learning. In Neural Networks: Artificial Intelligence and Industrial Applications; Springer: London, UK, 1995; pp. 191–198.
7. Attias, H. Independent Factor Analysis. Neur. Comput. 1999, 11, 803–851.
8. Lappalainen, H. Ensemble Learning For Independent Component Analysis. In Proceedings of the First International Workshop on Independent Component Analysis, Aussois, France, 11–15 January 1999; pp. 7–12.
9. Beal, M.; Ghahramani, Z. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. Bayesian Stat. 2003, 7, 453–463.
10. Winn, J. Variational Message Passing and its Applications. Ph.D. Thesis, Department of Physics, University of Cambridge, Cambridge, UK, 2003.
11. Blei, D.M.; Ng, A.Y.; Jordan, M.I.; Lafferty, J. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
12. Ghahramani, Z.; Beal, M.J. A Variational Inference for Bayesian Mixtures of Factor Analysers. Adv. Neur. Inf. Process. Syst. 2000, 12, 449–455.
13. Haff, L.R. The Variational Form of Certain Bayes Estimators. Ann. Stat. 1991, 19, 1163–1190.
14. Faes, C.; Ormerod, J.T.; Wand, M.P. Variational Bayesian Inference for Parametric and Nonparametric Regression With Missing Data. J. Am. Stat. Assoc. 2011, 106, 959–971.
15. McGrory, C.; Titterington, D.; Reeves, R.; Pettitt, A.N. Variational Bayes for estimating the parameters of a hidden Potts model. Stat. Comput. 2009, 19, 329–340.
16. Ormerod, J.T.; Wand, M.P. Gaussian Variational Approximate Inference for Generalized Linear Mixed Models. J. Comput. Graph. Stat. 2011, 21, 1–16.
17. Hall, P.; Humphreys, K.; Titterington, D.M. On the Adequacy of Variational Lower Bound Functions for Likelihood-Based Inference in Markovian Models with Missing Values. J. R. Stat. Soc. Ser. B 2002, 64, 549–564.
18. Wang, B.; Titterington, M. Convergence Properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Anal. 2006, 1, 625–650.
19. Hardy, M.R. A Regime-Switching Model of Long-Term Stock Returns. N. Am. Actuar. J. 2001, 5, 41–53.
20. Hamilton, J.D. A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica 1989, 57, 357–384.
21. Hardy, M.R. Bayesian Risk Management for Equity-Linked Insurance. Scand. Actuar. J. 2002, 2002, 185–211.
22. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
23. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
24. Hartman, B.M.; Heaton, M.J. Accounting for regime and parameter uncertainty in regime-switching models. Insur. Math. Econ. 2011, 49, 429–437.
25. Green, P.J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 1995, 82, 711–732.
26. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: Cambridge, UK, 2009.
27. Brooks, S.P. Markov Chain Monte Carlo Method and Its Application. J. R. Stat. Soc. Ser. D 1998, 47, 69–100.
28. Ghahramani, Z.; Hinton, G.E. Variational learning for switching state-space models. Neur. Comput. 1998, 12, 831–864.
29. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
30. Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 1970, 41, 164–171.
31. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B 2009, 71, 319–392.
32. Bishop, C.M.
Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

On Clustering Histograms with k-Means by Using Mixed α-Divergences

Frank Nielsen 1,2,*, Richard Nock 3 and Shun-ichi Amari 4

1 Sony Computer Science Laboratories, Inc, Tokyo 141-0022, Japan
2 École Polytechnique, 91128 Palaiseau Cedex, France
3 NICTA and The Australian National University, Locked Bag 9013, Alexandria NSW 1435, Australia
4 RIKEN Brain Science Institute, 2-1 Hirosawa Wako City, Saitama 351-0198, Japan; E-Mail: amari@brain.riken.jp
* E-Mail: Frank.Nielsen@acm.org; Tel.: +81-3-5448-4380.

Received: 15 May 2014; in revised form: 10 June 2014 / Accepted: 13 June 2014 / Published: 17 June 2014

Abstract: Clustering sets of histograms has become popular thanks to the success of the generic method of bag-of-X used in text categorization and in visual categorization applications. In this paper, we investigate the use of a parametric family of distortion measures, called the α-divergences, for clustering histograms. Since it usually makes sense to deal with symmetric divergences in information retrieval systems, we symmetrize the α-divergences using the concept of mixed divergences. First, we present a novel extension of k-means clustering to mixed divergences. Second, we extend the k-means++ seeding to mixed α-divergences and report a guaranteed probabilistic bound. Finally, we describe a soft clustering technique for mixed α-divergences.

Keywords: bag-of-X; α-divergence; Jeffreys divergence; centroid; k-means clustering; k-means seeding

1. Introduction: Motivation and Background

1.1. Clustering Histograms in the Bag-of-Word Modeling Paradigm

A common task of information retrieval (IR) systems is to classify documents into categories. Given a training set of documents labeled with categories, one asks to classify new incoming documents. Text categorisation [1,2] proceeds by first defining a dictionary of words from a corpus. It then models each document by a word count, yielding a word distribution histogram per document (see the University of California, Irvine, UCI, machine learning repository for such data-sets [3]). The importance of the words in the dictionary can be weighted by the term frequency-inverse document frequency [2] (tf-idf), which takes into account both the frequency of the words in a given document and the frequency of the words in all documents. Namely, the tf-idf weight for a given word in a given document is the product of the frequency of that word in the document and the logarithm of the ratio of the number of documents to the document frequency of the word [2]. Defining a proper distance between histograms allows one to:

• Classify a new on-line document: we first calculate its word distribution histogram signature and seek the labeled document that has the most similar histogram, in order to deduce its category tag.
• Find the initial set of categories: we cluster all document histograms and assign a category per cluster.

This text classification method based on the representation of the bag-of-words (BoWs) has also been instrumental in computer vision for efficient object categorization [4] and recognition in natural images [5]. This paradigm is called bag-of-features [6] (BoFs) in the general case. It first requires one to create a dictionary of “visual words” by quantizing keypoints (e.g., affine invariant descriptors of image patches) of the training database.
Quantization is performed using the k-means [7–9] algorithm that partitions n data X = {x1, ..., xn} into k pairwise disjoint clusters C1, ..., Ck, where each data element belongs to the closest cluster center (i.e., the cluster prototype). From a given initialization, batched k-means first assigns data points to their closest centers, then updates the cluster centers, and reiterates this process until it converges to a local minimum (not necessarily the global minimum) after a provably finite number of steps. Csurka et al. [4] used the squared Euclidean distance for building the visual vocabulary. Depending on the chosen features, other distances have proven useful. For example, the symmetrized Kullback–Leibler (KL) divergence was shown to perform experimentally better than the Euclidean or squared Euclidean distances for a compressed histogram of gradient descriptors [10] (CHoGs), even if it is not a metric distance, since it fails to satisfy the triangle inequality. To summarize, k-means histogram clustering with respect to the symmetrized KL (called Jeffreys divergence J) can be used to quantize both visual words and document categories.

Entropy 2014, 16, 3273–3301; doi:10.3390/e16063273 www.mdpi.com/journal/entropy

Nowadays, the seminal bag-of-word method has been generalized fruitfully to various settings using the generic bag-of-X paradigm, like the bag-of-textons [6], the bag-of-readers [11], etc. Bag-of-X represents each datum (e.g., document, image, etc.) as a histogram of codeword count indices. Furthermore, the semantic space [12] paradigm has been recently explored to overcome two drawbacks of the bag-of-X paradigms: the high dimensionality of the histograms (number of bins) and the difficult human interpretation of the codewords due to the lack of semantic information. In semantic space, modeling relies on semantic multinomials that are discrete frequency histograms; see [12].
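The batched k-means loop recalled earlier in this subsection (assign each point to its closest center, relocate the centers, repeat until nothing changes) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the cited works: it uses the squared Euclidean distance, for which the center relocation is the arithmetic mean, and the names `squared_euclidean`, `batched_kmeans` and `max_iter` are hypothetical.

```python
def squared_euclidean(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def batched_kmeans(points, centers, max_iter=100):
    """Lloyd-style batched k-means from given initial centers.

    Returns the final centers and, for each point, the index of its cluster.
    max_iter is an assumed safety cap; the loop stops earlier as soon as the
    assignment no longer changes.
    """
    centers = [list(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda l: squared_euclidean(x, centers[l]))
            for x in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Relocation step: each non-empty cluster center becomes the mean.
        for l in range(len(centers)):
            members = [x for x, a in zip(points, assignment) if a == l]
            if members:
                centers[l] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment
```

Swapping `squared_euclidean` for another distortion measure changes the assignment step directly, but the relocation step then needs the centroid of that measure, which is the topic of the later sections.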
In summary, clustering histograms with respect to symmetric distances (like the symmetrized KL divergence) is playing an increasing role. It turns out that the symmetrized KL divergence belongs to a one-parameter family of divergences, called symmetrized α-divergences, or Jeffreys α-divergences [13].

1.2. Contributions

Since divergences D(p : q) are usually asymmetric distortion measures between two objects p and q, one often has to consider two kinds of centroids, obtained by carrying out the minimization process either on the left argument or on the right argument of the divergences; see [14]. In theory, it is enough to consider only one type of centroid, say the right centroid, since the left centroid with respect to a divergence D(p : q) is equivalent to the right centroid with respect to the mirror divergence D′(p : q) = D(q : p).

In this paper, we consider mixed divergences [15] that allow one to handle in a unified way the arithmetic symmetrization $S(p, q) = \frac{1}{2}(D(p : q) + D(q : p))$ of a given divergence D(p : q) together with both sided divergences: D(p : q) and its mirror divergence D′(p : q). The mixed α-divergence is the mixed divergence obtained for the α-divergence. We term α-clustering the clustering with respect to α-divergences and mixed α-clustering the clustering w.r.t. mixed α-divergences [16]. Our main contributions are to extend the celebrated batched k-means [7–9] algorithm to mixed divergences by associating two dual centroids per cluster and to generalize the probabilistically guaranteed good seeding of k-means++ [17] to mixed α-divergences. The mixed α-seedings provide guaranteed probabilistic clustering bounds by picking seeds from the data and do not require explicitly computing centroids. This yields a fast clustering technique in practice, even when cluster centers are not available in closed form.
We also consider clustering histograms by explicitly building the symmetrized α-centroids and end up with a variational k-means when the centroids are not available in closed form. Finally, we investigate soft mixed α-clustering and discuss topics related to α-clustering. Note that clustering with respect to non-symmetrized α-divergences has been recently investigated independently in [18] and proven useful in several applications.

1.3. Outline of the Paper

The paper is organized as follows: Section 2 introduces the notion of mixed divergences, presents an extension of k-means to mixed divergences and recalls some properties of α-divergences. Section 3 describes the α-seeding techniques and reports a probabilistically-guaranteed bound on the clustering quality. Section 4 investigates the various sided/symmetrized/mixed calculations of the α-centroids. Section 5 presents the soft α-clustering with respect to α-mixed divergences. Finally, Section 6 summarises the contributions, discusses related topics and hints at further perspectives. The paper is followed by two appendices. Appendix B studies several properties of α-divergences that are used to derive the guaranteed probabilistic performance of the α-seeding. Appendix C proves that α-sided centroids are quasi-arithmetic means for the power generator functions.

2. Mixed Centroid-Based k-Means Clustering

2.1. Divergences, Centroids and k-Means

Consider a set H of n histograms h1, ..., hn, each with d bins, with all positive real-valued bins: $h_j^i > 0$, $\forall\, 1 \le i \le d$, $1 \le j \le n$. A histogram h is called a frequency histogram when its bins sum up to one: $w(h) = w_h = \sum_i h^i = 1$. Otherwise, it is called a positive histogram that can eventually be normalized to a frequency histogram:

$\tilde{h} := \frac{h}{w(h)}$. (1)

The frequency histograms belong to the (d−1)-dimensional open probability simplex $\Delta_d$:

$\Delta_d := \left\{ (x^1, \ldots, x^d) \in \mathbb{R}^d \;\middle|\; \forall i,\ x^i > 0, \text{ and } \sum_{i=1}^{d} x^i = 1 \right\}$. (2)

That is, although frequency histograms have d bins, the constraint that those bin values should sum up to one yields d−1 degrees of freedom. In probability theory, frequency or counting histograms model either discrete multinomial probabilities or discrete positive measures (also called positive arrays [19]).

The celebrated k-means clustering [8,9] is one of the most famous clustering techniques and has been generalized in many ways [20,21]. In information geometry [22], a divergence D(p : q) is a smooth $C^3$ differentiable dissimilarity measure that is not necessarily symmetric ($D(p : q) \neq D(q : p)$, hence the notation “:” instead of the classical “,” reserved for metric distances), but is non-negative and satisfies the separability property: D(p : q) = 0 iff p = q. More precisely, let $\partial_i D(x : y) = \frac{\partial}{\partial x^i} D(x : y)$ and $\partial_{,i} D(x : y) = \frac{\partial}{\partial y^i} D(x : y)$. Then, we require $\partial_i D(x : x) = \partial_{,i} D(x : x) = 0$ and $-\partial_i \partial_{,j} D(x : y)$ positive definite for defining a divergence.

For a distance function D(· : ·), we denote by D(x : H) the weighted average distance of x to a set of weighted histograms:

$D(x : H) := \sum_{j=1}^{n} w_j D(x : h_j)$. (3)

An important class of divergences on frequency histograms is the f-divergences [23–25], defined for a convex generator f (with $f(1) = f'(1) = 0$ and $f''(1) = 1$):

$I_f(p : q) := \sum_{i=1}^{d} q^i f\left(\frac{p^i}{q^i}\right)$.

Those divergences preserve information monotonicity [19] under any arbitrary transition probability (Markov morphisms). f-divergences can be extended to positive arrays [19].

The k-means algorithm on a set of weighted histograms can be tailored to any divergence as follows: First, we initialize the k cluster centers C = {c1, ..., ck} (say, by picking arbitrary distinct seeds at random). Then, we iteratively repeat until convergence the following two steps:

• Assignment: Assign each histogram $h_j$ to its closest cluster center: $l(h_j) := \arg\min_{l=1}^{k} D(h_j : c_l)$.
This yields a partition of the histogram set $H = \cup_{l=1}^{k} A_l$, where $A_l$ denotes the set of histograms of the l-th cluster: $A_l = \{h_j \mid l(h_j) = l\}$.
• Center relocation: Update the cluster centers by taking their centroids: $c_l := \arg\min_x \sum_{h_j \in A_l} w_j D(h_j : x)$.

Throughout this paper, centroid shall be understood in the broader sense of a barycenter when weights are non-uniform.

2.2. Mixed Divergences and Mixed k-Means Clustering

Since divergences are potentially asymmetric, we can define two-sided k-means or always consider a right-sided k-means, but then define another sided divergence D′(p : q) = D(q : p). We can also consider the symmetrized k-means with respect to the symmetrized divergence: S(p, q) = D(p : q) + D(q : p). Eventually, we may skew the symmetrization with a parameter λ ∈ [0, 1]: $S_\lambda(p, q) = \lambda D(p : q) + (1 - \lambda) D(q : p)$ (and consider other averaging schemes instead of the arithmetic mean).

In order to handle those sided and symmetrized k-means under the same framework, let us introduce the notion of mixed divergences [15] as follows:

Definition 1 (Mixed divergence).

$M_\lambda(p : q : r) := \lambda D(p : q) + (1 - \lambda) D(q : r)$, (4)

for λ ∈ [0, 1].

A mixed divergence includes the sided divergences for λ ∈ {0, 1} and the symmetrized (arithmetic mean) divergence for $\lambda = \frac{1}{2}$. We generalize k-means clustering to mixed k-means clustering [15] by considering two centers per cluster (for the special cases of λ = 0, 1, it is enough to consider only one). Algorithm 1 sketches the generic mixed k-means algorithm. Note that a simple initialization consists of choosing randomly the k distinct seeds from the dataset with $l_i = r_i$.

Algorithm 1: Mixed divergence-based k-means clustering.
Input: Weighted histogram set H, divergence D(· : ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds $C = \{(l_i, r_i)\}_{i=1}^{k}$;
repeat
  // Assignment
  for i = 1, 2, ..., k do
    $C_i \leftarrow \{h \in H : i = \arg\min_j M_\lambda(l_j : h : r_j)\}$;
  // Dual-sided centroid relocation
  for i = 1, 2, ..., k do
    $r_i \leftarrow \arg\min_x D(C_i : x)$, where $D(C_i : x) = \sum_{h \in C_i} w_h D(h : x)$;
    $l_i \leftarrow \arg\min_x D(x : C_i)$, where $D(x : C_i) = \sum_{h \in C_i} w_h D(x : h)$;
until convergence;
Output: Partition of H into k clusters following C;

Notice that the mixed k-means clustering is different from the k-means clustering with respect to the symmetrized divergences $S_\lambda$, which considers only one centroid per cluster.

2.3. Sided, Symmetrized and Mixed α-Divergences

For α ≠ ±1, we define the family of α-divergences [26] on positive arrays [27] as:

$D_\alpha(p : q) := \sum_{i=1}^{d} \frac{4}{1 - \alpha^2} \left( \frac{1 - \alpha}{2} p^i + \frac{1 + \alpha}{2} q^i - (p^i)^{\frac{1-\alpha}{2}} (q^i)^{\frac{1+\alpha}{2}} \right) = D_{-\alpha}(q : p), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\}$, (5)

with the limit cases $D_{-1}(p : q) = KL(p : q)$ and $D_1(p : q) = KL(q : p)$, where KL is the extended Kullback–Leibler divergence:

$KL(p : q) := \sum_{i=1}^{d} p^i \log \frac{p^i}{q^i} + q^i - p^i$. (6)

Divergence $D_0$ is the squared Hellinger symmetric distance (scaled by a multiplicative factor of four) extended to positive arrays:

$D_0(p : q) = 2 \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx = 4 H^2(p, q)$, (7)

with the Hellinger distance:

$H(p, q) = \sqrt{\frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx}$. (8)

Note that α-divergences are defined for the full range of α values: α ∈ R. Observe that α-divergences of Equation (5) are homogeneous of degree one: $D_\alpha(\lambda p : \lambda q) = \lambda D_\alpha(p : q)$ for λ > 0.

When histograms p and q are both frequency histograms, we have:

$D_\alpha(\tilde{p} : \tilde{q}) = \frac{4}{1 - \alpha^2} \left( 1 - \sum_{i=1}^{d} (\tilde{p}^i)^{\frac{1-\alpha}{2}} (\tilde{q}^i)^{\frac{1+\alpha}{2}} \right) = D_{-\alpha}(\tilde{q} : \tilde{p}), \quad \alpha \in \mathbb{R} \setminus \{-1, 1\}$, (9)

and the extended Kullback–Leibler divergence reduces to the traditional Kullback–Leibler divergence: $KL(\tilde{p} : \tilde{q}) = \sum_{i=1}^{d} \tilde{p}^i \log \frac{\tilde{p}^i}{\tilde{q}^i}$.
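To make Equations (5) and (6) concrete, the following Python sketch evaluates the α-divergence on positive arrays and falls back to the extended Kullback–Leibler divergence at the limit values α = ±1. It is an illustrative sketch only: the function names `extended_kl` and `alpha_divergence` are assumptions of this example, histograms are assumed to be plain lists of strictly positive floats of equal length, and no input validation is attempted.

```python
import math


def extended_kl(p, q):
    """Extended Kullback-Leibler divergence KL(p : q) on positive arrays (Eq. (6))."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))


def alpha_divergence(p, q, alpha):
    """alpha-divergence D_alpha(p : q) on positive arrays (Eq. (5)).

    The limit cases alpha = -1 and alpha = 1 are handled by the extended KL
    divergence, following D_{-1}(p : q) = KL(p : q) and D_1(p : q) = KL(q : p).
    """
    if alpha == -1:
        return extended_kl(p, q)
    if alpha == 1:
        return extended_kl(q, p)
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total
```

With these definitions one can check numerically, on any pair of positive arrays, the properties stated above: the reference duality $D_\alpha(p : q) = D_{-\alpha}(q : p)$, the homogeneity of degree one, and the fact that $D_0$ equals twice the sum of squared differences of the square roots of the bins.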
The Kullback–Leibler divergence between frequency histograms $\tilde{p}$ and $\tilde{q}$ (α = ±1) is interpreted as the cross-entropy minus the Shannon entropy:

$KL(\tilde{p} : \tilde{q}) := H^\times(\tilde{p} : \tilde{q}) - H(\tilde{p})$.

Often, $\tilde{p}$ denotes the true model (hidden by nature), and $\tilde{q}$ is the estimated model from observations. However, in information retrieval, both $\tilde{p}$ and $\tilde{q}$ play the same symmetrical role, and we prefer to deal with a symmetric divergence.

The Pearson and Neyman χ2 distances are obtained for α = −3 and α = 3, respectively:

$D_3(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{p}^i}$, (10)

$D_{-3}(\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_i \frac{(\tilde{q}^i - \tilde{p}^i)^2}{\tilde{q}^i}$. (11)

The α-divergences belong to the class of Csiszár f-divergences with the following generator:

$f(t) = \begin{cases} \frac{4}{1 - \alpha^2} \left( 1 - t^{(1+\alpha)/2} \right), & \text{if } \alpha \neq \pm 1, \\ t \ln t, & \text{if } \alpha = 1, \\ -\ln t, & \text{if } \alpha = -1. \end{cases}$ (12)

Remark 1. Historically, the α-divergences have been introduced by Chernoff [28,29] in the context of hypothesis testing. In Bayesian binary hypothesis testing, we are asked to decide whether an observation belongs to one class or the other class, based on priors w1 and w2 and class-conditional probabilities p1 and p2. The average expected error of the best decision maximum a posteriori (MAP) rule is called the probability of error, denoted by $P_e$. When prior probabilities are identical ($w_1 = w_2 = \frac{1}{2}$), we have $P_e(p_1, p_2) = \frac{1}{2} \int \min(p_1(x), p_2(x)) dx$. Let $S(p, q) = \int \min(p(x), q(x)) dx$ denote the intersection similarity measure, with 0 < S ≤ 1 (generalizing the histogram intersection distance often used in computer vision [30]). S is bounded by the α-Chernoff affinity coefficient:

$S(p, q) \le C_\beta(p, q) = \int p^\beta(x) q^{1-\beta}(x) dx$,

for all β ∈ [0, 1]. We can convert the affinity coefficient $0 < C_\beta \le 1$ into a divergence $D_\beta$ by simply taking $D_\beta = 1 - C_\beta$. Since the absolute value of divergences does not matter, we can rescale the divergence appropriately. One nice rescaling is multiplication by $\frac{1}{\beta(1-\beta)}$: $D_\beta = \frac{1}{\beta(1-\beta)}(1 - C_\beta)$. This makes the parameterized divergence coincide with the fundamental Kullback–Leibler divergence for the limit values β ∈ {0, 1}. Last, choosing $\beta = \frac{1-\alpha}{2}$ yields the well-known expression of the α-divergences.

Interestingly, the α-divergences can be interpreted as a generalized α-Kullback–Leibler divergence [26] with deformed logarithms.

Next, we introduce the mixed α-divergence of a histogram x to two histograms p and q as follows:

Definition 2 (Mixed α-divergence). The mixed α-divergence of a histogram x to two histograms p and q is defined by:

$M_{\lambda,\alpha}(p : x : q) = \lambda D_\alpha(p : x) + (1 - \lambda) D_\alpha(x : q) = \lambda D_{-\alpha}(x : p) + (1 - \lambda) D_{-\alpha}(q : x) = M_{1-\lambda,-\alpha}(q : x : p)$. (13)

The α-Jeffreys symmetrized divergence is obtained for $\lambda = \frac{1}{2}$:

$S_\alpha(p, q) = M_{\frac{1}{2},\alpha}(q : p : q) = M_{\frac{1}{2},\alpha}(p : q : p)$.

The skew symmetrized α-divergence is defined by:

$S_{\lambda,\alpha}(p : q) = \lambda D_\alpha(p : q) + (1 - \lambda) D_\alpha(q : p)$.

2.4. Notations and Hard/Soft Clusterings

Throughout the paper, superscript index i denotes the histogram bin numbers and subscript index j the histogram numbers. Index l is used to iterate on the clusters. The left-sided, right-sided and symmetrized histogram positive and frequency α-centroids are denoted by $l_\alpha$, $r_\alpha$, $s_\alpha$ and $\tilde{l}_\alpha$, $\tilde{r}_\alpha$, $\tilde{s}_\alpha$, respectively.

In this paper, we investigate the following kinds of clusterings for sets of histograms:

Hard clustering. Each histogram belongs to exactly one cluster:
• k-means with respect to mixed divergences $M_{\lambda,\alpha}$.
• k-means with respect to symmetrized divergences $S_{\lambda,\alpha}$.
• Randomized seeding for mixed/symmetrized k-means by extending k-means++ with guaranteed probabilistic bounds for α-divergences.

Soft clustering. Each histogram belongs to all clusters according to some weight distribution: the soft mixed α-clustering.

3. Coupled k-Means++ α-Seeding

It is well-known that the Lloyd k-means clustering algorithm monotonically decreases the loss function and stops after a finite number of iterations at a local optimum.
Globally optimizing the k-means loss is NP-hard [17] when d > 1 and k > 1. In practice, the performance of the k-means algorithm heavily relies on the initialization. A breakthrough was obtained by the k-means++ seeding [17], which guarantees in expectation a good starting partition. We extend this scheme to the coupled α-clustering. However, although k-means++ is popular and is often used in practice with very good results, it has recently been pointed out that "worst case" configurations exist, even in small dimensions, on which the algorithm cannot significantly beat its expected approximability with high probability [31]. Still, the expected approximability ratio, roughly in O(log k), is very good, as long as the number of clusters is not too large.

Algorithm 2: Mixed α-seeding; MAS(H, k, λ, α)
Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
Let C ← {($h_j$, $h_j$)}, with $h_j$ picked from H with uniform probability;
for i = 2, 3, ..., k do
    Pick at random histogram h ∈ H with probability:

$$\pi_H(h) \doteq \frac{w_h M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y M_{\lambda,\alpha}(c_y : y : c_y)}, \tag{14}$$

    // where $(c_h, c_h) \doteq \arg\min_{(z,z) \in C} M_{\lambda,\alpha}(z : h : z)$;
    C ← C ∪ {(h, h)};
Output: Set of initial cluster centers C;

Algorithm 2 provides our adaptation of k-means++ seeding [15,17]. It works for all three of our sided/symmetrized and mixed clustering settings:
• Pick λ = 1 for the left-sided centroid initialization,
• Pick λ = 0 for the right-sided centroid initialization (a left-sided initialization for −α),
• with arbitrary λ, for the λ-$J_\alpha$ (skew Jeffreys) centroids or mixed λ centroids. Indeed, the initialization is the same (see the MAS procedure in Algorithm 2).

Our proof follows and generalizes the proof described for the case of mixed Bregman seeding [15] (Lemma 2).
In fact, our proof is more precise, as it quantifies the expected potential with respect to the optimum only, whereas in [15], the optimal potential is averaged with a dual optimal potential, which depends on the optimal centers, but may be larger than the optimum sought.

Theorem 1. Let $C_{\lambda,\alpha}$ denote for short the cost function related to the clustering type chosen (left-, right-, skew Jeffreys or mixed) in MAS and $C^{\mathrm{opt}}_{\lambda,\alpha}$ denote the optimal related clustering in k clusters, for $\lambda \in [0, 1]$ and $\alpha \in (-1, 1)$. Then, on average, with respect to distribution (14), the initial clustering of MAS satisfies:

$$E_\pi[C_{\lambda,\alpha}] \leq 4 \begin{cases} f(\lambda)\, g(k)\, h^2(\alpha)\, C^{\mathrm{opt}}_{\lambda,\alpha} & \text{if } \lambda \in (0, 1),\\ g(k)\, z(\alpha)\, h^4(\alpha)\, C^{\mathrm{opt}}_{\lambda,\alpha} & \text{otherwise}. \end{cases} \tag{15}$$

Here, $f(\lambda) = \max\left\{\frac{1-\lambda}{\lambda}, \frac{\lambda}{1-\lambda}\right\}$, $g(k) = 2(2 + \log k)$, $z(\alpha) = \left(\frac{1+|\alpha|}{1-|\alpha|}\right)^{\frac{8|\alpha|^2}{(1-|\alpha|)^2}}$, $h(\alpha) = \max_i p_i^{|\alpha|} / \min_i p_i^{|\alpha|}$; the min is defined on strictly positive coordinates, and π denotes the picking distribution of Algorithm 2.

Remark 2. The bound is particularly good when λ is close to 1/2, and in particular for the α-Jeffreys clustering, as in these cases, the only additional penalty compared to the Euclidean case [17] is $h^2(\alpha)$, a penalty that relies on an optimal triangle inequality for α-divergences that we provide in Lemma A6 below.

Remark 3. This guaranteed initialization is particularly useful for α-Jeffreys clustering, as there is no closed form solution for the centroids (except when α = ±1, see [32]).
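The seeding of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: histograms are strictly positive rows of a NumPy array, α ≠ ±1, and the α-divergence is the extended form for positive arrays that appears in the appendix. The names `alpha_divergence`, `mixed_divergence`, `mixed_alpha_seeding` and the `rng` argument are hypothetical.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Extended alpha-divergence D_alpha(p : q) for positive arrays, alpha != +-1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def mixed_divergence(l, x, r, lam, alpha):
    """M_{lam,alpha}(l : x : r) = lam * D_alpha(l : x) + (1 - lam) * D_alpha(x : r)."""
    return lam * alpha_divergence(l, x, alpha) + (1.0 - lam) * alpha_divergence(x, r, alpha)

def mixed_alpha_seeding(H, k, lam, alpha, weights=None, rng=None):
    """Return indices of k seeds; each seed h stands for the couple (h, h)."""
    rng = np.random.default_rng(rng)
    H = np.asarray(H, float)
    n = len(H)
    w = np.ones(n) if weights is None else np.asarray(weights, float)
    seeds = [int(rng.integers(n))]  # first seed: uniform pick
    for _ in range(1, k):
        # Cost of each histogram with respect to its closest seed couple (c, c).
        cost = np.array([
            min(mixed_divergence(H[c], h, H[c], lam, alpha) for c in seeds)
            for h in H
        ])
        cost = np.maximum(cost, 0.0)  # guard against floating-point round-off
        probs = w * cost
        total = probs.sum()
        # Degenerate case (all costs zero): fall back to weight-proportional picks.
        probs = probs / total if total > 0 else w / w.sum()
        seeds.append(int(rng.choice(n, p=probs)))
    return seeds
```

A call such as `mixed_alpha_seeding(H, 8, 0.5, 0.7)` would mirror the setting of a skew Jeffreys seeding with λ = 1/2; no centroid is computed at any point, since the seeds are picked among the data.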
Algorithm 3: Mixed α-hard clustering: MAhC(H, k, λ, α) Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R; Let C = {(li, ri)}k i=1 ← MAS(H, k, λ, α); repeat //Assignment for i = 1, 2, ..., k do Ai ← {h ∈ H : i = arg minj Mλ,α(lj : h : rj)}; // Centroid relocation for i = 1, 2, ..., k do ri ← � ∑h∈Ai wih 1−α 2 � 2 1−α ; li ← � ∑h∈Ai wih 1+α 2 � 2 1+α ; until convergence; Output: Partition of H in k clusters following C; Algorithm 3 presents the general hard mixed k-means clustering, which can be adapted also to left- (λ = 1) and right- (λ = 0) α-clustering. For skew Jeffreys centers, since the centroids are not available in closed form [32], we adopt a variational approach of k-means by updating iteratively the centroid in each cluster (thus improving the overall loss function without computing the optimal centroids that would eventually require infinitely many iterations). 4. Sided, Symmetrized and Mixed α-Centroids The k-means clustering requires assigning data elements to their closest cluster center and then updating those cluster centers by taking their centroids. This section investigates the centroid computations for the sided, symmetrized and mixed α-divergences. Note that the mixed α-seeding presented in Section 3 does not require computing centroids and, yet, guarantees probabilistically a good clustering partition. Since mixed α-divergences are f-divergences, we start with the generic f-centroids. 4.1. Csiszár f-Centroids The centroids induced by f-divergences of a set of positive measures (that relaxes the normalisation constraint) have been studied by Ben-Tal et al. [33]. Those entropic centroids are 176 Entropy 2014, 16, 3273–3301 shown to be unique, since f-divergences are convex statistical distances in both arguments. Let Ef denote the energy to minimize when considering f-divergences: Ef .= min x∈X If (H : x) = n ∑ j=1 wjIf (hj : x), (16) = min x∈X n ∑ j=1 wj d ∑ i=1 pi j f � ci hi j � . 
(17) When the domain is the open probability simplex X = Δd, we get a constrained optimisation problem to solve. We transform this constrained minimisation problem (i.e., x ∈ Δd) into an equivalent unconstrained minimisation problem by using the Lagrange multiplier, γ: min x∈Rd n ∑ j=1 wjIf (hj : c) + γ � d ∑ i=1 xi − 1 � . (18) Taking the derivatives according to xi, we get: ∀i ∈ {1, ..., d}, n ∑ j=1 wj f ′ � xi hi j � − γ = 0. (19) We now consider this equation for α-divergences and symmetrized α-divergences, both f-divergences. 4.2. Sided Positive and Frequency α-Centroids The positive sided α-centroids for a set of weighted histograms were reported in [34] using the representation Bregman divergence. We summarise the results in the following theorem: Theorem 2 (Sided positive α-centroids [34]). The left-sided lα and right-sided rα positive weighted α-centroid coordinates of a set of n positive histograms h1, ..., hn are weighted α-means: ri α = f −1 α � n ∑ j=1 wj fα(hi j) � , li α = ri −α with fα(x) = � x 1−α 2 α ̸= ±1, log x α = 1. Furthermore, the frequency-sided α-centroids are simply the normalized-sided α-centroids. Theorem 3 (Sided frequency α-centroids [16]). The coordinates of the sided frequency α-centroids of a set of n weighted frequency histograms are the normalised weighted α-means. Table 1 summarizes the results concerning the sided positive and frequency α-centroids. 177 Entropy 2014, 16, 3273–3301 Table 1. Positive and frequency α-centroids: the frequency α-centroids are normalized positive α-centroids, where w(h) denotes the cumulative sum of the histogram bins. The arithmetic mean is obtained for r−1 = l1 and the geometric mean for r1 = l−1. Positive centroid Frequency centroid Right-sided centroid riα = � (∑n j=1 wj(hi j) 1−α 2 ) 2 1−α α ̸= 1 ri 1 = ∏n j=1(hi j)wj α = 1 ˜riα = ri α w(˜rα) Left-sided centroid liα = ri−α = � (∑n j=1 wj(hi j) 1+α 2 ) 2 1+α α ̸= −1 li −1 = ∏n j=1(hi j)wj α = −1 ˜liα = ˜ri−α = ri −α w(˜r−α) 4.3. 
Mixed α-Centroids

The mixed α-centroids for a set of n weighted histograms are defined as the minimizers of:

$$\sum_j w_j M_{\lambda,\alpha}(l : h_j : r). \tag{20}$$

We state the theorem generalizing [15]:

Theorem 4. The two mixed α-centroids are the left-sided and right-sided α-centroids.

Figure 1 depicts a clustering result obtained with our α-clustering software. We remark that the clusters found are all approximately subclusters of the "distinct" clusters that appear in the figure. When those distinct clusters are actually the optimal clusters (which is likely to be the case when they are separated by a large minimal distance from other clusters), this is clearly a desirable qualitative property, as long as the number of experimental clusters is not too large compared to the number of optimal clusters. We remark also that in the experiment displayed, there is no closed form solution for the cluster centers.

Figure 1. Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins with k = 8, α = 0.7 and λ = 1/2.

4.4. Symmetrized Jeffreys-Type α-Centroids

The Kullback–Leibler divergence can be symmetrized in various ways: Jeffreys divergence, Jensen–Shannon divergence and Chernoff information, just to mention a few. Here, we consider the following symmetrization of α-divergences extending Jeffreys J-divergence:

$$\begin{aligned} S_\alpha(p, q) &= \frac{1}{2}\left(D_\alpha(p : q) + D_\alpha(q : p)\right) = S_{-\alpha}(p, q),\\ &= M_{\frac{1}{2},\alpha}(p : q : p). \end{aligned} \tag{21}$$

For α = ±1, we get half of Jeffreys divergence:

$$S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^{d} (p^i - q^i) \log \frac{p^i}{q^i}.$$

In particular, when p and q are frequency histograms, we have for α ≠ ±1:

$$J_\alpha(\tilde{p} : \tilde{q}) = \frac{8}{1 - \alpha^2}\left(1 - \sum_{i=1}^{d} H_{\frac{1-\alpha}{2}}(\tilde{p}^i, \tilde{q}^i)\right), \tag{22}$$

where $H_{\frac{1-\alpha}{2}}(a, b)$ is a symmetric Heinz mean [35,36]:

$$H_\beta(a, b) = \frac{a^{\beta} b^{1-\beta} + a^{1-\beta} b^{\beta}}{2}.$$

Heinz means interpolate the arithmetic and geometric means and satisfy the inequality:

$$\sqrt{ab} = H_{\frac{1}{2}}(a, b) \leq H_\beta(a, b) \leq H_0(a, b) = \frac{a + b}{2}.$$
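The role of the Heinz mean can be illustrated numerically. The sketch below is a hedged example, not code from the paper: it assumes frequency histograms given as Python lists of strictly positive bins summing to one and α ≠ ±1, and it uses the form $S_\alpha = \frac{4}{1-\alpha^2}\left(1 - \sum_i H_{\frac{1-\alpha}{2}}(\tilde{p}^i, \tilde{q}^i)\right)$ obtained here by averaging the two sided α-divergences. All function names are hypothetical.

```python
def heinz_mean(a, b, beta):
    """Heinz mean H_beta(a, b) = (a^beta b^(1-beta) + a^(1-beta) b^beta) / 2."""
    return 0.5 * (a**beta * b**(1.0 - beta) + a**(1.0 - beta) * b**beta)

def sided_alpha_divergence(p, q, alpha):
    """D_alpha(p : q) for frequency histograms (bins sum to 1), alpha != +-1."""
    affinity = sum(
        pi**((1.0 - alpha) / 2.0) * qi**((1.0 + alpha) / 2.0)
        for pi, qi in zip(p, q)
    )
    return 4.0 / (1.0 - alpha**2) * (1.0 - affinity)

def symmetrized_alpha_divergence(p, q, alpha):
    """S_alpha(p, q) written with Heinz means of the bins (assumed derivation)."""
    beta = (1.0 - alpha) / 2.0
    heinz_sum = sum(heinz_mean(pi, qi, beta) for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha**2) * (1.0 - heinz_sum)

if __name__ == "__main__":
    p, q, alpha = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2], 0.7
    average = 0.5 * (sided_alpha_divergence(p, q, alpha)
                     + sided_alpha_divergence(q, p, alpha))
    # Both values should agree up to floating-point round-off.
    print(symmetrized_alpha_divergence(p, q, alpha), average)
```

Since the Heinz mean of two equal numbers is that number, the sum of Heinz means of a histogram with itself is one, and the symmetrized divergence vanishes as expected.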
(Another interesting property of Heinz means is the integral representation of the logarithmic mean: L(x, y) = x−y log x−log y = � 1 0 Hβ(x, y)dβ. This allows one to prove easily that √xy ≤ L(x, y) ≤ x+y 2 .) The Jα-divergence is a Csiszár f-divergence [24,25]. Observe that it is enough to consider α ∈ [0, ∞) and that the symmetrized α-divergence for positive and frequency histograms coincide only for α = ±1. For α = ±1, Sα(p, q) tends to the Jeffreys divergence: J(p, q) = KL(p, q) + KL(q, p) = d ∑ i=1 (pi − qi)(log pi − log qi). (23) The Jeffreys divergence writes mathematically the same for frequency histograms: J( ˜p, ˜q) = KL( ˜p, ˜q) + KL( ˜q, ˜p) = d ∑ i=1 ( ˜pi − ˜qi)(log ˜pi − log ˜qi). (24) We state the results reported in [32]: Theorem 5 (Jeffreys positive centroid [32]). The Jeffreys positive centroid c = (c1, ..., cd) of a set {h1, ..., hn} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function: ci = ai W( ai gi e) , where ai = ∑n j=1 πjhi j denotes the coordinate-wise arithmetic weighted means and gi = ∏n j=1(hi j)πj the coordinate-wise geometric weighted means. 179 Entropy 2014, 16, 3273–3301 The Lambert analytic function W [37] (positive branch) is defined by W(x)eW(x) = x for x ≥ 0. Theorem 6 (Jeffreys frequency centroid [32]). Let ˜c denote the Jeffreys frequency centroid and ˜c′ = c wc the normalised Jeffreys positive centroid. Then, the approximation factor α˜c′ = S1(˜c′, ˜H) S1(˜c, ˜H) is such that 1 ≤ α˜c′ ≤ 1 wc (with wc ≤ 1). Therefore, we shall consider α ̸= ±1 in the remainder. We state the following lemma generalizing the former results in [38] that were tailored to the symmetrized Kullback–Leibler divergence or the symmetrized Bregman divergence [14]: Lemma 1 (Reduction property). 
The symmetrized Jα-centroid of a set of n weighted histograms amount to computing the symmetrized α-centroid for the weighted α-mean and −α-mean: min Jα(x, H) = min x (Dα(x : rα) + Dα(lα : x)) . Proof. It follows that the minimization problem minx Sα(x, H) = ∑n j=1 wjSα(x, hj) reduces to the following minimization: min d ∑ i=1 xi − (xi) 1+α 2 ¯hi α − (xi) 1−α 2 ¯hi −α. (25) This is equivalent to minimizing: ≡ d ∑ i=1 xi − (xi) 1+α 2 ((¯hi α) 2 1−α ) 1−α 2 − (xi) 1−α 2 ((¯hi −α) 2 1+α ) 1+α 2 , ≡ d ∑ i=1 xi − (xi) 1+α 2 (ri α) 1−α 2 − (xi) 1−α 2 (li α) 1+α 2 ≡ Dα(x : rα) + Dα(lα : x). Note that α = ±1, the lemma states that the minimization problem is equivalent to minimizing KL(a : x) + KL(x : g) with respect to x, where a = l1 and g = r1 denote the arithmetic and geometric means, respectively. The lemma states that the optimization problem with n weighted histograms is equivalent to the optimization with only two equally weighted histograms. The positive symmetrized α-centroid is equivalent to computing a representation symmetrized Bregman centroid [14,34]. The frequency symmetrized α-centroid asks to minimize the following problem: min ˜x∈Δd ∑ j wjSα( ˜x, ˜hi). Instead of seeking for ˜x in the probability simplex, we can optimize on the unconstrained domain Rd−1 by using a reparameterization. Indeed, frequency histograms belong to the exponential families [39] (multinomials). Exponential families also include many other continuous distributions, like the Gaussian, Beta or Dirichlet distributions. It turns out the α-divergences can be computed in closed-form for members of the same exponential family: 180 Entropy 2014, 16, 3273–3301 Lemma 2. 
The α-divergence for distributions belonging to the same exponential families amounts to computing a divergence on the corresponding natural parameters: Aα(p : q) = 4 1 − α2 � 1 − e−J( 1−α 2 ) F (θp:θq) � , where Jβ F(θ1 : θ2) = βF(θ1) + (1 − β)F(θ2) − F(βθ1 + (1 − β)θ2) is a skewed Jensen divergence defined for the log-normaliser F of the family. The proof follows from the fact that � pα(x)q1−α(x)dx = e−J (α)(θp:θq) F ; see [40]. First, we convert a frequency histogram ˜h to its natural parameter θ with θi = log ˜hi ˜hd ; see [39]. The log-normaliser is a non-separable convex function F(θ) = log(1 + ∑i eθi). To convert back a multinomial to a frequency histogram with d bins, we first set ˜hd = 1 1+∑d−1 l=1 eθl and then retrieve the other bin values as ˜hi = ˜hdeθi. The centroids with respect to skewed Jensen divergences has been investigated in [13,40]. Remark 4. Note that for the special case of α = 0 (squared Hellinger centroid), the sided and symmetrized centroids coincide. In that case, the coordinates si 0 of the squared Hellinger centroid are: si 0 = � n ∑ j=1 wj � hi j �2 , 1 ≤ i ≤ d. Remark 5. The symmetrized positive α-centroids can be solved in special cases (α = ±3, α = ±1 corresponding to the symmetrized χ2 and Jeffreys positive centroids). For frequency centroids, when dealing with binary histograms (d = 2), we have only one degree of freedom and can solve the binary frequency centroids. Binary histograms (and mixtures thereof) are used in computer vision and pattern recognition [41]. Remark 6. Since α-divergences are Csiszár f-divergences and f-divergences can always be symmetrized by taking generator s(t) = f (t) + t f ( 1 t ), we deduce that symmetrized α-divergences Sα are f-divergences for the generator: f (t) = − log((1 − α) + αt) − t log � (1 − α) + α t � . (26) Hence, Sα divergences are convex in both arguments, and the sα centroids are unique. 5. 
Soft Mixed α-Clustering

Algorithm 4 reports the general clustering with soft membership, which can be adapted to left ($\lambda_{\mathrm{init}} = 1$), right ($\lambda_{\mathrm{init}} = 0$) or mixed clustering. We have not considered a weighted histogram set, in order not to overload the notation and because the extension is straightforward. Again, for skew Jeffreys centers, we shall adopt a variational approach. Notice that the soft clustering approach learns all parameters, including λ (if not constrained to zero or one) and α ∈ R. This is not the case for Matsuyama's α-expectation maximization (EM) algorithm [42], in which α is fixed beforehand (and, thus, not learned). Assuming we model the prior for histograms by:

$$p_{\lambda,\alpha,j}(h_i) \propto \lambda \exp\left(-D_\alpha(l_j : h_i)\right) + (1 - \lambda) \exp\left(-D_\alpha(h_i : r_j)\right), \tag{27}$$

the negative log-likelihood involves the α-dependent quantity:

$$\sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \log p_{\lambda,\alpha,j}(h_i) \geq \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j)\, p(j|h_i), \tag{28}$$

because of the concavity of the logarithm function. Therefore, the maximization step for α involves finding:

$$\arg\max_{\alpha} \sum_{j=1}^{k} \sum_{i=1}^{m} M_{\lambda,\alpha}(l_j : h_i : r_j)\, p(j|h_i). \tag{29}$$

Algorithm 4: Mixed α-soft clustering; MAsC(H, k, λ, α)
Input: Histogram set H with |H| = m, integer k > 0, real λ ← $\lambda_{\mathrm{init}}$ ∈ [0, 1], real α ∈ R;
Let $C = \{(l_i, r_i)\}_{i=1}^{k}$ ← MAS(H, k, λ, α);
repeat
    // Expectation
    for i = 1, 2, ..., m do
        for j = 1, 2, ..., k do
            $p(j|h_i) = \frac{\pi_j \exp(-M_{\lambda,\alpha}(l_j : h_i : r_j))}{\sum_{j'} \pi_{j'} \exp(-M_{\lambda,\alpha}(l_{j'} : h_i : r_{j'}))}$;
    // Maximization
    for j = 1, 2, ..., k do
        $\pi_j \leftarrow \frac{1}{m} \sum_i p(j|h_i)$;
        $l_j \leftarrow \left(\frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i)\, h_i^{\frac{1+\alpha}{2}}\right)^{\frac{2}{1+\alpha}}$;
        $r_j \leftarrow \left(\frac{1}{\sum_i p(j|h_i)} \sum_i p(j|h_i)\, h_i^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}$;
    // Alpha - Lambda
    $\alpha \leftarrow \alpha - \eta_1 \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) \frac{\partial}{\partial \alpha} M_{\lambda,\alpha}(l_j : h_i : r_j)$;
    if $\lambda_{\mathrm{init}} \neq 0, 1$ then
        $\lambda \leftarrow \lambda - \eta_2 \left(\sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(l_j : h_i) - \sum_{j=1}^{k} \sum_{i=1}^{m} p(j|h_i) D_\alpha(h_i : r_j)\right)$;
    // for some small $\eta_1, \eta_2$; ensure that λ ∈ [0, 1].
until convergence;
Output: Soft clustering of H according to k densities p(j|.)
following C; No closed-form solution are known, so we compute the gradient update in Algorithm 4 with: ∂Mλ,α(lj : hi : rj) ∂α = λ∂Dα(lj : hi) ∂α + (1 − λ)∂Dα(hi : rj) ∂α , (30) 182 Entropy 2014, 16, 3273–3301 ∂Dα(p : q) ∂α = 2 (1 − α)2 × � q − �1 − α 1 + α �2 p + p 1−α 2 q 1+α 2 � 4α 1 − α2 − ln � q p ��� . (31) The update in λ is easier as: ∂Mλ,α(lj : hi : rj) ∂λ = Dα(lj : hi) − Dα(hi : rj) . (32) Maximizing the likelihood in λ would imply choosing λ ∈ {0, 1} (a hard choice for left/right centers), yet we prefer the soft update for the parameter, like for α. 6. Conclusions The family of α-divergences plays a fundamental role in information geometry: These statistical distortion measures are the canonical divergences of dual spaces on probability distribution manifolds with constant curvature κ = 1−α2 4 and the canonical divergences of dually flat manifolds for positive distribution manifolds [19]. In this work, we have presented three techniques for clustering (positive or frequency) histograms using k-means: (1) Sided left or right α-centroid k-means, (2) Symmetrized Jeffreys-type α-centroid (variational) k-means, and (3) Coupled k-means with respect to mixed α-divergences relying on dual α-centroids. Sided and mixed dual centroids are always available in closed-forms and are therefore highly attractive from the standpoint of implementation. Symmetrized Jeffreys centroids are in general not available in closed-form and require one to implement a variational k-means by updating incrementally the cluster centroids in order to monotonically decrease the k-means loss function. From the clustering standpoint, this appears not to be a problem when guaranteed expected approximations to the optimal clustering are enough. Indeed, we also presented and analyzed an extension of k-means++ [17] for seeding those k-means algorithms. 
The mixed α-seeding initializations do not require one to calculate centroids and behave like a discrete k-means by picking the seeds among the data. We reported guaranteed probabilistic clustering bounds. Thus, mixed α-seeding yields a fast hard/soft data partitioning technique with respect to mixed or symmetrized α-divergences. Recently, the advantage of clustering using α-divergences by tuning α in applications has been demonstrated in [18]. We thus expect the computationally fast mixed α-seeding with guaranteed performance to be useful in a growing number of applications.

Acknowledgments: NICTA is funded by the Australian Government as represented by the Department of Broadband, Communication and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

Author Contributions: All authors contributed equally to the design of the research. The research was carried out by all authors. Frank Nielsen and Richard Nock wrote the paper. Frank Nielsen implemented the algorithms and performed experiments. All authors have read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix Proof Sketch of Theorem 1

We give here the key results allowing one to obtain the proof of the Theorem, following the proof scheme of [15]. In order not to overload the notation, weights are considered uniform. The extension to non-uniform weights is immediate, as it boils down to duplicating histograms in the histogram set and does not change the approximation result. Let A ⊆ H be an arbitrary cluster of $C_{\mathrm{opt}}$. Let us define $U_A$ and $\pi_A$ as the uniform and biased distributions conditioned on A. The key to the proof is to relate the expected potential of A under $U_A$ and $\pi_A$ to its contribution to the optimal potential.

Lemma A1. Let A ⊆ H be an arbitrary cluster of $C_{\mathrm{opt}}$.
Then: Ec∼UA[Mλ,α(A, c)] = Mopt,λ,α(A) + Mopt,λ,−α(A) = Ec∼UA[Mλ,−α(A, c)] , where UA is the uniform distribution over A. Proof. α-coordinates have the property that for any subset A ⊆ H, (1/|A|) ∑p∈A uα(p) = uα(rα,A). Hence, we have: ∀c ∈ A , ∑ p∈A Dα(p : c) = ∑ p∈A Dϕα(uα(p) : uα(c)) = ∑ p∈A Dϕα(uα(p) : uα(rα,A)) + |A|Dϕα(uα(rα,A) : uα(c)) = ∑ p∈A Dα(p : rα,A) + |A|Dα(rα,A : c) . (A1) Because Dα(p : q) = D−α(q : p) and lα = r−α, we obtain: ∀c ∈ A , ∑ p∈A Dα(c : p) = ∑ p∈A D−α(p : c) = ∑ p∈A D−α(p : r−α,A) + |A|D−α(r−α,A : c) = ∑ p∈A Dα(lα,A : p) + |A|Dα(c : lα,A) . (A2) It comes now from (A1) and (A2) that: Ec∼UA[Mλ,α(A, c)] = 1 |A| ∑ c∈A ∑ p∈A {λDα(c : p) + (1 − λ)Dα(p : c)} = (1 − λ) ∑ p∈A Dα(p : rα,A) + (1 − λ) ∑ p∈A Dα(rα,A : p) +λ ∑ p∈A Dα(lα,A : p) + λ ∑ p∈A Dα(p : lα,A) = (1 − λ)Mopt,0,α(A) + λMopt,1,α(A) +(1 − λ)Mopt,0,−α(A) + λMopt,1,−α(A) = Mopt,λ,α(A) + Mopt,λ,−α(A) . (A3) 184 Entropy 2014, 16, 3273–3301 This gives the left-hand side equality of the Lemma. The right-hand side follows from the fact that Ec∼UA[Mλ,−α(A, c)] = Mopt,1−λ,α(A) + Mopt,1−λ,−α(A). Instead of Mopt,λ,α(A) + Mopt,λ,−α(A), we want a term depending solely on Mopt,λ,α(A) as it is the “true” optimum. We now give two lemmata that shall be useful in obtaining this upper bound. The first is of independent interest, as it shows that any α-divergence is a scaled, squared Hellinger distance between geometric means of points. Lemma A2. For any p, q and α ̸= 1, there exists r ∈ [p, q], such that (1 − α)2Dα(p : q) = D0(p1−αrα : q1−αrα). Proof. 
By the definition of Bregman divergences, for any x, y, there exists some z ∈ [x, y], such that: Dϕα(x : y) = 1 2(x − y)2ϕ”α(z) = 1 2(x − y)2 � 1 + 1 − α 2 z � 2α 1−α , and since uα is continuous and strictly increasing, for any p, q, there exists some r ∈ [p, q], such that: Dα(p : q) = Dϕα(uα(p) : uα(q)) = 1 2(uα(p) − uα(q))2 � 1 + 1 − α 2 uα(r) � 2α 1−α = 2 (1 − α)2 � p 1−α 2 − q 1−α 2 �2 rα = 2 (1 − α)2 � p1−α + q1−α − 2(pq) 1−α 2 � rα = 1 (1 − α)2 D0(p1−αrα : q1−αrα) . Lemma A3. Let discrete random variable x take non-negative values x1, x2, ..., xm with uniform probabilities. Then, for any β > −1, we have var(x1+β/uβ) ≤ var(x), with u .= (1 + β)β maxi xi. Proof. First, ∀β > −1, remark that for any x, function f (x) = x(uβ − xβ) is increasing for x ≤ u/(1 + β)β. Hence, assuming that the xis are put in non-increasing order without loss of generality, we have f (xi) ≥ f (xj), and so, xi(uβ − xβ i ) ≥ xj(uβ − xβ j ), ∀i ≥ j, as long as xi ≤ u/(1 + β)β. Choosing 185 Entropy 2014, 16, 3273–3301 u = x1(1 + β)β yields, after reordering and putting the exponent, (x1+β i − x1+β j )2 ≤ (xiuβ − xjuβ)2. Hence: 1 m ∑ i x2(1+β) i − � 1 m ∑ i x(1+β) i �2 = 1 2m2 ∑ i,j (x1+β i − x1+β j )2 ≤ 1 2m2 ∑ i,j (xiuβ − xjuβ)2 = u2β 2m2 ∑ i,j (xi − xj)2 = u2β ⎛ ⎝ 1 m ∑ i x2 i − � 1 m ∑ i xi �2⎞ ⎠ . Dividing by u2β the leftmost and rightmost terms and using the fact that var(λx) = λ2var(x) yields the statement of the Lemma. We are now ready to upper bound Mopt,λ,−α(A) as a function of Mopt,λ,α(A). Lemma A4. For any cluster A of Copt, Mopt,λ,−α(A) ≤ Mopt,λ,α(A) × � f (λ) if λ ∈ (0, 1) z(α)h2(α) otherwise , where z(α), f (λ) and h(α) are defined in Theorem 1. Proof. The case λ ̸= 0, 1 is fast, as we have by definition: Mopt,λ,−α(A) = ∑ p∈A λD−α(l−α,A : p) + (1 − λ)D−α(p : r−α,A) = ∑ p∈A λDα(p : l−α,A) + (1 − λ)Dα(r−α,A : p) = ∑ p∈A λDα(p : rα,A) + (1 − λ)Dα(lα,A : p) ≤ max �1 − λ λ , λ 1 − λ � Mopt,λ,α(A) = f (λ)Mopt,λ,α(A) . Suppose now that λ = 0 and α ≥ 0. 
Because Mopt,0,−α(A) = ∑p∈A D−α(p : r−α,A) = ∑p∈A Dα(lα,A : p) = Mopt,1,α(A), what we wish to do is upper bound ∑p∈A Dα(lα,A : p) = Mopt,1,α(A) as a function of ∑p∈A Dα(p : rα,A) = Mopt,0,α(A). We use Lemmatas A2 and A3 in the following derivations, using r(p) to refer to the r in Lemma A2, assuming α ≥ 0. We also note varA( f (p)) as 186 Entropy 2014, 16, 3273–3301 the variance, under the uniform distribution over A, of discrete random variable f (p), for p ∈ A. We have: ∑ p∈A Dα(lα,A : p) = ∑ p∈A D−α(p : lα,A) = 1 (1 + α)2 ∑ p∈A r(p)−αD0(p1+α : l1+α α,A ) ≤ 1 (1 + α)2 minA pα ∑ p∈A D0(p1+α : l1+α α,A ) = 1 (1 + α)2 minA pα ∑ p∈A � p1+α + l1+α α,A − 2p 1+α 2 l 1+α 2 α,A � = |A| (1 + α)2 minA pα ⎛ ⎝ 1 |A| ∑ p∈A p1+α − � 1 |A| ∑ p∈A p 1+α 2 �2⎞ ⎠ = |A|varA(p 1+α 2 ) (1 + α)2 minA pα . (A4) We have used the expression of left centroid l1+α α,A to simplify the expressions. Now, picking xi = p 1−α i 2 , β = 2α/(1 − α) and u = � 1+α 1−α � 2α 1−α maxA p 1−α 2 in Lemma A3 yields: varA(p 1+α 2 ) = u2βvarA(p 1+α 2 /uβ) = u2βvarA � p 1−α 2 pα/uβ� = u2βvar(x1+β/uβ) ≤ u2βvar(x) = u2βvarA � p 1−α 2 � = �1 + α 1 − α � 8α2 (1−α)2 max A p2αvarA � p 1−α 2 � . (A5) 187 Entropy 2014, 16, 3273–3301 Plugging this in (A4) yields: ∑ p∈A Dα(lα,A : p) ≤ �1 + α 1 − α � 8α2 (1−α)2 |A| maxA p2αvarA � p 1−α 2 � (1 + α)2 minA pα = �1 + α 1 − α � 8α2 (1−α)2 −2 �maxA p minA p �2α × |A| minA pαvarA(p 1−α 2 ) (1 − α)2 = �1 + α 1 − α � 8α2 (1−α)2 −2 �maxA p minA p �2α × minA pα (1 − α)2 ∑ p∈A D0(p1−α : r1−α α,A ) (A6) ≤ �1 + α 1 − α � 8α2 (1−α)2 −2 �maxA p minA p �2α × 1 (1 − α)2 ∑ p∈A r(p)αD0(p1−α : r1−α α,A ) = �1 + α 1 − α � 8α2 (1−α)2 −2 �maxA p minA p �2α × ∑ p∈A Dα(p : rα,A) ≤ z(α) �maxA p minA p �2α × ∑ p∈A Dα(p : rα,A) . (A7) Here, (A6) follows the path backwards of derivations that lead to (A4). The cases λ = 1 or α < 0 are obtained using the same chains of derivations and achieve the proof of Lemma A4. 
Lemma A4 can be directly used to refine the bound of Lemma A1 in the uniform distribution. We give the Lemma for the biased distribution, directly integrating the refinement of the bound. Lemma A5. Let A be an arbitrary cluster of Copt and C an arbitrary clustering. If we add a random couple (c, c) to C, chosen from A with π as in Algorithm 2, then: Ec∼πA[Mλ,α(A, c)] ≤ 4 � f (λ)h2(α)Mopt,λ,α(A) if λ ∈ (0, 1) z(α)h4(α)Mopt,λ,α(A) otherwise , (A8) where f (λ) and h(α) are defined in Theorem 1. Proof. The proof essentially follows the proof of Lemma 3 in [15]. To complete it, we need a triangle inequality involving α-divergences. We give it here. Lemma A6. For any p, q, r and α, we have: � Dα(p : q) ≤ �maxi{pi, qi, ri} mini{pi, qi, ri} �|α| �� Dα(p : r) + � Dα(r : q) � (A9) (where the min is over strictly positive values) Remark: take α = 0; we find the triangle inequality for the squared Hellinger distance. Proof. Using the proof of Lemma 2 in [15] for Bregman divergence Dϕα, we get: � Dϕα(x : z) ≤ ρ(α) �� Dϕα(x : y) + � Dϕα(y : z) � , (A10) 188 Entropy 2014, 16, 3273–3301 where: ρ(α) = max u,v � 1 + 1−α 2 u � 2α 1−α � 1 + 1−α 2 v � 2α 1−α . (A11) Taking x = uα(p), y = uα(q), z = uα(r) yields ρ(α) = maxs,t∈{pi,qi,ri}(s/t)|α| and the statement of Lemma A6. The rest of the proof of Lemma A5 follows the proof of Lemma 3 in [15]. We get all of the ingredients to our proof, and there remains to use Lemma 4 in [15] to achieve the proof of Theorem 1. Appendix Properties of α-Divergences For positive arrays p and q, the α-divergence Dα(p : q) can be defined as an equivalent representational Bregman divergence [19,34] Bϕα(uα(p) : uα(q)) over the (uα, vα)-structure [43] with: ϕα(x) .= 2 1 + α � 1 + 1 − α 2 x � 2 1−α , (A12) uα(p) .= 2 1 − α � p 1−α 2 − 1 � , (A13) vα(p) .= 2 1 + α p 1+α 2 , (A14) where we assume that α ̸= ±1. Otherwise, for α = ±1, we compute Dα(p : q) by taking the sided Kullback–Leibler divergence extended to positive arrays. 
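The equivalence between the α-divergence and the representational Bregman divergence built from (A12) and (A13) can be checked numerically. The following sketch is illustrative only and rests on assumptions: positive arrays stored as NumPy vectors, α ≠ ±1, and the extended α-divergence formula for positive arrays. The derivative `phi_alpha_prime` is obtained here by differentiating (A12) by hand, and all function names are hypothetical.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Extended alpha-divergence D_alpha(p : q) for positive arrays, alpha != +-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def u_alpha(p, alpha):
    """Representation function u_alpha of (A13)."""
    return 2.0 / (1.0 - alpha) * (p**((1.0 - alpha) / 2.0) - 1.0)

def phi_alpha(x, alpha):
    """Convex generator phi_alpha of (A12)."""
    return 2.0 / (1.0 + alpha) * (1.0 + (1.0 - alpha) / 2.0 * x)**(2.0 / (1.0 - alpha))

def phi_alpha_prime(x, alpha):
    """Derivative of phi_alpha (hand-derived); equals v_alpha(p) of (A14) at x = u_alpha(p)."""
    base = 1.0 + (1.0 - alpha) / 2.0 * x
    return 2.0 / (1.0 + alpha) * base**((1.0 + alpha) / (1.0 - alpha))

def representational_bregman(p, q, alpha):
    """Bregman divergence of phi_alpha evaluated on the u_alpha-representations."""
    x, y = u_alpha(p, alpha), u_alpha(q, alpha)
    return np.sum(phi_alpha(x, alpha) - phi_alpha(y, alpha)
                  - (x - y) * phi_alpha_prime(y, alpha))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, q = rng.uniform(0.1, 2.0, 5), rng.uniform(0.1, 2.0, 5)
    for alpha in (-0.5, 0.0, 0.7):
        # The two columns should agree up to floating-point round-off.
        print(alpha, alpha_divergence(p, q, alpha), representational_bregman(p, q, alpha))
```

Working in the $u_\alpha$-representation is what allows Bregman-type arguments, such as the decomposition used in (A1), to be applied to α-divergences.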
In the proof of Theorem 1, we have used two properties of α-divergences of independent interest: • any α-divergence can be explained as a scaled squared Hellinger distance between geometric means of its arguments and a point that belong to their segment (Lemma A2); • any α-divergence satisfies a generalized triangle inequality (Lemma A6). Notice that this Lemma is optimal in the sense that for α = 0, it is possible to recover the triangle inequality of the Hellinger distance. The following lemma shows how to bound the mixed divergence as a function of an α-divergence. Lemma A7. For any positive arrays l, h, r and α ̸= ±1, define η .= λ(1 − α)/(1 − α(2λ − 1)) ∈ [0, 1], gη with gi η .= (li)η(ri)1−η and aη with ai η .= ηli + (1 − η)ri. Then, we have: Mλ,α(l : h : r) ≤ 1 − α2(2λ − 1)2 1 − α2 Dα(2λ−1)(gη : h) +2(1 − α(2λ − 1)) 1 − α2 ∑ i � ai η − gi η � . Proof. For all index i, we have: Mλ,α(li : hi : ri) = λDα(li : hi) + (1 − λ)Dα(hi : ri) = 4 1 − α2 �λ(1 − α) 2 li + (1 − λ)(1 + α) 2 ri + 1 + α(2λ − 1) 2 hi (A15) −λ(li) 1−α 2 (hi) 1+α 2 − (1 − λ)(ri) 1+α 2 (hi) 1−α 2 � . (A16) 189 Entropy 2014, 16, 3273–3301 The arithmetic-geometric-harmonic (AGH) inequality implies: λ(li) 1−α 2 (hi) 1+α 2 + (1 − λ)(ri) 1+α 2 (hi) 1−α 2 ≥ (li) λ(1−α) 2 (ri) (1−λ)(1+α) 2 (hi) 1+α(2λ−1) 2 = � (li) λ(1−α) 1−α(2λ−1) (ri) (1−λ)(1+α) 1−α(2λ−1) � 1−α(2λ−1) 2 (hi) 1+α(2λ−1) 2 = � (li)η(ri)1−η� 1−α(2λ−1) 2 (hi) 1+α(2λ−1) 2 = (gi η) 1−α(2λ−1) 2 (hi) 1+α(2λ−1) 2 . It follows that (A16) yields: Mλ,α(li : hi : ri) ≤ 4 1 − α2 �1 − α(2λ − 1) 2 � ηli + (1 − η)ri� + (A17) 1 + α(2λ − 1) 2 hi − (gi η) 1−α(2λ−1) 2 (hi) 1+α(2λ−1) 2 � = 1 − α2(2λ − 1)2 1 − α2 Dα(2λ−1)(gi η : hi) + 2(1 − α(2λ − 1)) 1 − α2 � ai η − gi η � ,(A18) out of which we get the statement of the Lemma. Appendix Sided α-Centroids For the sake of completeness, we prove the following theorem: Theorem A1 (Sided positive α-centroids [34]). 
The left-sided lα and right-sided rα positive weighted α-centroid coordinates of a set of n positive histograms h1, ..., hn are weighted α-means: ri α = f −1 α � n ∑ j=1 wj fα(hi j) � , li α = ri −α with: fα(x) = � x 1−α 2 α ̸= ±1, log x α = 1. Proof. We distinguish three cases: α ̸= ±1, α = −1 and α = 1. First, consider the general case α ̸= ±1. We have to minimize: Rα(x, H) = 4 1 − α2 n ∑ j=1 wj× d ∑ i=1 �1 − α 2 hi j + 1 + α 2 xi − (hi j) 1−α 2 (xi) 1+α 2 � . Removing all additive terms independent of xi and the overall constant multiplicative factor 4 1−α2 ̸= 0, we get the following equivalent minimisation problem: R′ α(x, H) = d ∑ i=1 1 + α 2 xi − (xi) 1+α 2 � n ∑ j=1 wj(hi j) 1−α 2 � � �� � ¯hiα , (A19) 190 Entropy 2014, 16, 3273–3301 where ¯hi α denote the following aggregation term: ¯hi α = n ∑ j=1 wj(hi j) 1−α 2 . Setting coordinate-wise the derivative to zero of Equation (A19) (i.e., ∇xR′(x, H) = 0), we get: 1 + α 2 − 1 + α 2 (xi) α−1 2 ¯hi α = 0 Thus, we find that the coordinates of the right-sided α-centroids are: ci α = (¯hi α) 2 1−α = � n ∑ j=1 wj(hi j) 1−α 2 � 2 1−α = ˆhi α. We recognise the expression of a quasi-arithmetic mean for the strictly monotonous generator fα(x): ri α = f −1 α � n ∑ j=1 wj fα(hi j) � , (A20) with: fα(x) = x 1−α 2 , f −1 α (x) = x 2 1−α , α ̸= ±1. Therefore, we conclude that the coordinates of the positive α-centroid are the weighted α-means of the histogram coordinates (for α ̸= ±1). Quasi-arithmetic means are also called in the literature quasi-linear means or f-means. When α = −1, we search for the right-sided extended Kullback–Leibler divergence centroid by minimising: R−1(x; ˜H) = n ∑ j=1 wj d ∑ i=1 hi j log hi j xi + xi − hi j. It is equivalent to minimizing: R′ −1(x; ˜H) = d ∑ i=1 xi − � n ∑ j=1 wjhi j � � �� � a log xi, where a denotes the arithmetic mean. Solving coordinate-wise, we get ci = ai = ∑n j=1 wjhi j. When α = 1, the right-sided reverse extended KL centroid is a left-sided extended KL centroid. 
The minimisation problem is: R1(x; ˜H) = n ∑ j=1 wj d ∑ i=1 xi log xi hi j + hi j − xi. Since ∑j wj = 1, we solve coordinate-wise and find log x = ∑j wj log hj. That is, ri 1 is the geometric mean: ri 1 = n ∏ j=1 (hi j)wj. Both the arithmetic mean and the geometric mean are power means in the limit case (and hence quasi-arithmetic means). Thus, ri α = f −1 α � n ∑ j=1 wj fα(hi j) � , (A21) 191 Entropy 2014, 16, 3273–3301 with: fα(x) = � x 1−α 2 α ̸= ±1, log x α = 1. References 1. Baker, L.D.; McCallum, A.K.Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 96–103. 2. Bigi, B. Using Kullback–Leibler distance for text categorization. In Proceedings of the 25th European conference on IR research (ECIR), Pisa, Italy, 14–16 April 2003; Springer-Verlag: Berlin/Heidelberg, Germany, 2003; ECIR’03, pp. 305–319. 3. Bag of Words Data Set. Available online: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words (accessed on 17 June 2014). 4. Csurka, G.; Bray, C.; Dance, C.; Fan, L. Visual Categorization with Bags of Keypoints; Workshop on Statistical Learning in Computer Vision (ECCV); Xerox Research Centre Europe: Meylan, France, 2004, pp. 1–22. 5. Jégou, H.; Douze, M.; Schmid, C. Improving Bag-of-Features for Large Scale Image Search. Int. J. Comput. Vis. 2010, 87, 316–336. 6. Yu, Z.; Li, A.; Au, O.; Xu, C. Bag of textons for image segmentation via soft clustering and convex shift. In Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 781–788. 7. Steinhaus, H. Sur la division des corp matériels en parties. Bull. Acad. Polon. Sci. 1956, 1, 801–804. (in French) 8. Lloyd, S.P. Least Squares Quantization in PCM; Technical Report RR-5497; Bell Laboratories: Murray Hill, NJ, USA, 1957. 9. Lloyd, S.P. 
Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. 10. Chandrasekhar, V.; Takacs, G.; Chen, D.M.; Tsai, S.S.; Reznik, Y.A.; Grzeszczuk, R.; Girod, B. Compressed histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vis. 2012, 96, 384–399. 11. Nock, R.; Nielsen, F.; Briys, E. Non-linear book manifolds: Learning from associations the dynamic geometry of digital libraries. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY, USA, 2013; pp. 313–322. 12. Kwitt, R.; Vasconcelos, N.; Rasiwasia, N.; Uhl, A.; Davis, B.C.; Häfner, M.; Wrba, F. Endoscopic image analysis in semantic space. Med. Image Anal. 2012, 16, 1415–1422. 13. Nielsen, F. A family of statistical symmetric divergences based on Jensen’s inequality. 2010, arXiv:1009.4004. 14. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. 15. Nock, R.; Luosto, P.; Kivinen, J. Mixed Bregman clustering with approximation guarantees. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium, 15–19 September 2008; Springer-Verlag: Berlin/Heidelberg, Germany, 2008; pp. 154–169. 16. Amari, S. Integration of Stochastic Models by Minimizing α-Divergence. Neural Comput. 2007, 19, 2780–2796. 17. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, LA, USA, 7–9 January 2007; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035. 18. Olszewski, D.; Ster, B. Asymmetric clustering using the alpha-beta divergence. Pattern Recognit. 2014, 47, 2031–2041. 19. Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. 20. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. 
Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. 21. Teboulle, M. A unified continuous optimization framework for center-based clustering methods. J. Mach. Learn. Res. 2007, 8, 65–102. 22. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000. 23. Morimoto, T. Markov Processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331. 24. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142. 25. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Studi. Sci. Math. Hung. 1967, 2, 229–318. 26. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. 27. Zhu, H.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of Neural Networks; Operations Research/Computer Science Interfaces Series; Ellacott, S., Mason, J., Anderson, I., Eds.; Springer: New York, NY, USA, 1997; Volume 8, pp. 394–398. 28. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. 29. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272. 30. Wu, J.; Rehg, J. Beyond the euclidean distance: creating effective visual codebooks using the histogram intersection kernel. In Proceedings of 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 630–637. 31. Bhattacharya, A.; Jaiswal, R.; Ailon, N. A tight lower bound instance for k-means++ in constant dimension.
In Theory and Applications of Models of Computation; Lecture Notes in Computer Science; Gopal, T., Agrawal, M., Li, A., Cooper, S., Eds.; Springer International Publishing: New York, NY, USA, 2014; Volume 8402, pp. 7–22. 32. Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660. 33. Ben-Tal, A.; Charnes, A.; Teboulle, M. Entropic means. J. Math. Anal. Appl. 1989, 139, 537–551. 34. Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In Proceedings of International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June 2009; pp. 71–78. 35. Heinz, E. Beiträge zur Störungstheorie der Spektralzerlegung. Math. Anna. 1951, 123, 415–438. (in German) 36. Besenyei, A. On the invariance equation for Heinz means. Math. Inequal. Appl. 2012, 15, 973–979. 37. Barry, D.A.; Culligan-Hensley, P.J.; Barry, S.J. Real values of the W-function. ACM Trans. Math. Softw. 1995, 21, 161–171. 38. Veldhuis, R.N.J. The centroid of the symmetrical Kullback–Leibler distance. IEEE Signal Process. Lett. 2002, 9, 96–99. 39. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. 2009, arXiv.org: 0911.4863. 40. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. 41. Romberg, S.; Lienhart, R. Bundle min-hashing for logo recognition. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, Dallas, TX, USA, 16–19 April 2013; ACM: New York, NY, USA, 2013; pp. 113–120. 42. Matsuyama, Y. The alpha-EM algorithm: Surrogate likelihood maximization using alpha-logarithmic information measures. IEEE Trans. Inf. Theory 2003, 49, 692–706. 43. Amari, S.I. New developments of information geometry (26): Information geometry of convex programming and game theory. 
In Mathematical Sciences (suurikagaku); Number 605; The Science Company: Denver, CO, USA, 2013; pp. 65–74. (In Japanese)

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

New Riemannian Priors on the Univariate Normal Model

Salem Said *, Lionel Bombrun and Yannick Berthoumieu

Groupe Signal et Image, CNRS Laboratoire IMS, Institut Polytechnique de Bordeaux, Université de Bordeaux, UMR 5218, Talence, 33405, France; E-Mails: lionel.bombrun@u-bordeaux.fr (L.B.); yannick.berthoumieu@u-bordeaux.fr (Y.B.)

* E-Mail: salem.said@u-bordeaux.fr; Tel.: +33-(0)5-4000-6185.

Received: 17 April 2014; in revised form: 23 June 2014 / Accepted: 9 July 2014 / Published: 17 July 2014

Abstract: The current paper introduces new prior distributions on the univariate normal model, with the aim of applying them to the classification of univariate normal populations. These new prior distributions are entirely based on the Riemannian geometry of the univariate normal model, so that they can be thought of as “Riemannian priors”. Precisely, if $\{p_\theta; \theta \in \Theta\}$ is any parametrization of the univariate normal model, the paper considers prior distributions $G(\bar\theta, \gamma)$ with hyperparameters $\bar\theta \in \Theta$ and $\gamma > 0$, whose density with respect to Riemannian volume is proportional to $\exp(-d^2(\theta, \bar\theta)/2\gamma^2)$, where $d^2(\theta, \bar\theta)$ is the square of Rao’s Riemannian distance. The distributions $G(\bar\theta, \gamma)$ are termed Gaussian distributions on the univariate normal model. The motivation for considering a distribution $G(\bar\theta, \gamma)$ is that this distribution gives a geometric representation of a class or cluster of univariate normal populations. Indeed, $G(\bar\theta, \gamma)$ has a unique mode $\bar\theta$ (precisely, $\bar\theta$ is the unique Riemannian center of mass of $G(\bar\theta, \gamma)$, as shown in the paper), and its dispersion away from $\bar\theta$ is given by $\gamma$.
Therefore, one thinks of members of the class represented by $G(\bar\theta, \gamma)$ as being centered around $\bar\theta$ and lying within a typical distance determined by $\gamma$. The paper defines rigorously the Gaussian distributions $G(\bar\theta, \gamma)$ and describes an algorithm for computing maximum likelihood estimates of their hyperparameters. Based on this algorithm and on the Laplace approximation, it describes how the distributions $G(\bar\theta, \gamma)$ can be used as prior distributions for Bayesian classification of large univariate normal populations. In a concrete application to texture image classification, it is shown that this leads to an improvement in performance over the use of conjugate priors.

Keywords: Fisher information; Riemannian metric; prior distribution; univariate normal distribution; image classification

1. Introduction

In this paper, a new class of prior distributions is introduced on the univariate normal model. The new prior distributions, which will be called Gaussian distributions, are based on the Riemannian geometry of the univariate normal model. The paper introduces these new distributions, uncovers some of their fundamental properties and applies them to the problem of the classification of univariate normal populations. It shows that, in the context of a real-life application to texture image classification, the use of these new prior distributions leads to improved performance in comparison with the use of more standard conjugate priors.

To motivate the introduction of the new prior distributions considered in the following, recall some general facts on the Riemannian geometry of parametric models. In information geometry [1], it is well known that a parametric model $\{p_\theta; \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^p$, can be equipped with a Riemannian geometry, determined by Fisher’s information matrix, say $I(\theta)$.
Entropy 2014, 16, 4015–4031; doi:10.3390/e16074015; www.mdpi.com/journal/entropy

Indeed, assuming $I(\theta)$ is strictly positive definite, for each $\theta \in \Theta$, a Riemannian metric on $\Theta$ is defined by:

$$ds^2(\theta) = \sum_{i,j=1}^{p} I_{ij}(\theta)\, d\theta_i\, d\theta_j \tag{1}$$

The fact that the length element Equation (1) is invariant to any change of parametrization was realized by Rao [2], who was the first to propose the application of Riemannian geometry in statistics.

Once the Riemannian metric Equation (1) is introduced, the whole machinery of Riemannian geometry becomes available for application to statistical problems relevant to the parametric model $\{p_\theta; \theta \in \Theta\}$. This includes the notion of Riemannian distance between two distributions, $p_\theta$ and $p_{\theta'}$, which is known as Rao’s distance, say $d(\theta, \theta')$, the notion of Riemannian volume, which is exactly the same as Jeffreys prior [3], and the notion of Riemannian gradient, which can be used in numerical optimization and coincides with the so-called natural gradient of Amari [4].

It is quite natural to apply Rao’s distance to the problem of classifying populations that belong to the parametric model $\{p_\theta; \theta \in \Theta\}$. In the case where this parametric model is the univariate normal model, this approach to classification is implemented in [5]. For more general parametric models, beyond the univariate normal model, similar applications of Rao’s distance to problems of image segmentation and statistical tests can be found in [6–8].

The idea of [5] is quite elegant. In general, it requires that some classes $\{S_L; L = 1, \ldots, C\}$ (based on a learning sequence) have been identified with “centers” $\bar\theta_L \in \Theta$. Then, in order to assign a test population, given by the parameter $\theta_t$, to a class $L^*$, it is proposed to choose the $L^*$ that minimizes the squared Rao’s distance $d^2(\theta_t, \bar\theta_L)$ over $L = 1, \ldots, C$. In the specific context of the classification of univariate normal populations [5], this leads to the introduction of hyperbolic Voronoi diagrams.
The present paper is also concerned with the case where the parametric model $\{p_\theta; \theta \in \Theta\}$ is a univariate normal model. It starts from the idea that a class $S_L$ should be identified not only with a center $\bar\theta_L$, as in [5], but also with a kind of “variance”, say $\gamma^2$, which will be called a dispersion parameter. Accordingly, assigning a test population given by the parameter $\theta_t$ to a class $L$ should be based on a tradeoff between the square of Rao’s distance $d^2(\theta_t, \bar\theta_L)$ and the dispersion parameter $\gamma^2$.

Of course, this idea has a strong Bayesian flavor. It proposes to give more “confidence” to classes that have a smaller dispersion parameter. Thus, in order to implement it in a concrete way, the paper starts by introducing prior distributions on the univariate normal model, which it calls Gaussian distributions. By definition, a Gaussian distribution $G(\bar\theta, \gamma^2)$ has a probability density function, with respect to Riemannian volume, given by:

$$p(\theta \mid \bar\theta, \gamma) \propto \exp\left(-\frac{d^2(\theta, \bar\theta)}{2\gamma^2}\right) \tag{2}$$

Given this definition of a Gaussian distribution (which is developed in a detailed way in Section 3), classification of univariate normal populations can be carried out by associating to each class $S_L$ of univariate normal populations a Gaussian distribution $G(\bar\theta_L, \gamma_L^2)$ and by assigning any test population with parameter $\theta_t$ to the class $L^*$ that maximizes the likelihood $p(\theta_t \mid \bar\theta_L, \gamma_L)$ over $L = 1, \ldots, C$.

The present paper develops in a rigorous way the general approach to the classification of univariate normal populations, which has just been described. It proceeds as follows.

Section 2, which is basically self-contained, provides the concepts, regarding the Riemannian geometry of the univariate normal model, which will be used throughout the paper. Section 3 introduces Gaussian distributions on the univariate normal model and uncovers some of their general properties.
In particular, Section 3.2 gives a Riemannian gradient descent algorithm for computing maximum likelihood estimates of the parameters $\bar\theta$ and $\gamma$ of a Gaussian distribution.

Section 4 states the general approach to classification of univariate normal populations proposed in this paper. It deals with two problems: (i) given a class $S$ of univariate normal populations $S_i$, how to fit a Gaussian distribution $G(\bar z, \gamma)$ to this class; and (ii) given a test univariate normal population $S_t$ and a set of classes $\{S_L, L = 1, \ldots, C\}$, how to assign $S_t$ to a suitable class $S_{L^*}$. In the present paper, the chosen approach for resolving these two problems is marginalized likelihood estimation, in the asymptotic framework where each univariate normal population contains a large number of data points. In this asymptotic framework, the Laplace approximation plays a major role [9]. In particular, it reduces the first problem, of fitting a Gaussian distribution to a class of univariate normal populations, to the problem of maximum likelihood estimation, covered in Section 3.2.

The final result of Section 4 is the decision rule Equation (37). This generalizes the one developed in [5] and already explained above, by taking into account the dispersion parameter $\gamma$, in addition to the center $\bar\theta$, for each class.

In Section 5, the formalism of Section 4 is applied to texture image classification, using the VisTeX image database [10]. This database is used to compare the performance obtained using Gaussian distributions, as in Section 4, to that obtained using conjugate prior distributions. It is shown that Gaussian distributions, proposed in the current paper, lead to a significant improvement in performance.

Before going on, it should be noted that probability density functions of the form (2), on general Riemannian manifolds, were considered by Pennec in [11].
However, they were not specifically used as prior distributions, but rather as a representation of uncertainty in medical image analysis and directional or shape statistics.

2. Riemannian Geometry of the Univariate Normal Model

The current section presents in a self-contained way the results on the Riemannian geometry of the univariate normal model, which are required for the remainder of the paper. Section 2.1 recalls the fact that the univariate normal model can be reparametrized, so that its Riemannian geometry is essentially the same as that of the Poincaré upper half plane. Section 2.2 uses this fact to give analytic formulas for distance, geodesics and integration on the univariate normal model. Finally, Section 2.3 presents, in general form, the Riemannian gradient descent algorithm.

2.1. Derivation of the Fisher Metric

This paper considers the Riemannian geometry of the univariate normal model, as based on the Fisher metric (1). To be precise, the univariate normal model has a two-dimensional parameter space $\Theta = \{\theta = (\mu, \sigma) \mid \mu \in \mathbb{R}, \sigma > 0\}$, and is given by:

$$p_\theta(x) = |2\pi\sigma^2|^{-1/2} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{3}$$

where each $p_\theta$ is a probability density function with respect to the Lebesgue measure on $\mathbb{R}$. The Fisher information matrix, obtained from Equation (3), is the following:

$$I(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{pmatrix}$$

As in [12], this expression can be made more symmetric by introducing the parametrization $z = (x, y)$, where $x = \mu/\sqrt{2}$ and $y = \sigma$. This yields the Fisher information matrix:

$$I(z) = 2 \times \begin{pmatrix} \dfrac{1}{y^2} & 0 \\ 0 & \dfrac{1}{y^2} \end{pmatrix}$$

It is suitable to drop the factor two in this expression and introduce the following Riemannian metric for the univariate normal model,

$$ds^2(z) = \frac{dx^2 + dy^2}{y^2} \tag{4}$$

This is essentially the same as the Fisher metric (up to the factor two) and will be considered throughout the following. The resulting Rao’s distance and Riemannian geometry are given in the following paragraph.

2.2. Distance, Geodesics and Volume

The Riemannian metric (4), obtained in the last paragraph, happens to be a very well-known object in differential geometry. Precisely, the parameter space $H = \{z = (x, y) \mid y > 0\}$ equipped with the metric (4) is known as the Poincaré upper half plane and is a basic model of a two-dimensional hyperbolic space [13].

Rao’s distance between two points $z_1 = (x_1, y_1)$ and $z_2 = (x_2, y_2)$ in $H$ can be expressed as follows (for results in the present paragraph, see [13], or any suitable reference on hyperbolic geometry),

$$d(z_1, z_2) = \operatorname{acosh}\left(1 + \frac{(x_1 - x_2)^2 + (y_1 - y_2)^2}{2 y_1 y_2}\right) \tag{5}$$

where acosh denotes the inverse hyperbolic cosine.

Starting from $z_1$, in any given direction, it is possible to draw a unique geodesic ray $\gamma : \mathbb{R}_+ \to H$. This is a curve having the property that $\gamma(0) = z_1$ and, for any $t \in \mathbb{R}_+$, if $\gamma(t) = z_2$ then $d(z_1, z_2) = t$. In other words, the length of $\gamma$ between $z_1$ and $z_2$ is equal to the distance between $z_1$ and $z_2$.

The equation of a geodesic ray starting from $z \in H$ is conveniently written down in complex notation (that is, by treating points of $H$ as complex numbers). To begin, consider the case of $z = i$ (which stands for $x = 0$ and $y = 1$). The geodesic in the direction making an angle $\psi$ with the y-axis is the curve,

$$\gamma_i(t) = \frac{e^{t/2} \cos(\psi/2)\, i - e^{-t/2} \sin(\psi/2)}{e^{t/2} \sin(\psi/2)\, i + e^{-t/2} \cos(\psi/2)} \tag{6}$$

In particular, $\psi = 0$ gives $\gamma_i(t) = e^{t} i$ and $\psi = \pi$ gives $\gamma_i(t) = e^{-t} i$. If $\psi$ is not a multiple of $\pi$, $\gamma_i(t)$ traces out a portion of a circle, which becomes parallel to the y-axis in the limit $t \to \infty$.

For a general starting point $z$, the geodesic ray in the direction making an angle $\psi$ with the y-axis can be written:

$$\gamma_z(t, \psi) = x + y\, \gamma_i(t/y, \psi) \tag{7}$$

where $z = (x, y)$ and $\gamma_i(t, \psi)$ is given by Equation (6). A more detailed treatment of Rao’s distance (5) and of geodesics in the Poincaré upper half plane, along with applications in image clustering, can be found in [5].
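As an illustration of Equations (5) and (6), the following minimal Python sketch evaluates Rao's distance for the metric (4) and checks numerically that the geodesic from $i$ is traversed at unit speed. The function names (`to_half_plane`, `rao_distance`, `geodesic_from_i`) are hypothetical and chosen only for this example; points of $H$ are represented as Python complex numbers, which is an assumption of the sketch rather than something prescribed by the paper.

```python
import math


def to_half_plane(mu: float, sigma: float) -> complex:
    """Map a normal population N(mu, sigma^2) to z = x + iy, x = mu/sqrt(2), y = sigma."""
    return complex(mu / math.sqrt(2.0), sigma)


def rao_distance(z1: complex, z2: complex) -> float:
    """Equation (5): distance in the Poincare upper half plane for the metric (4)."""
    sq = (z1.real - z2.real) ** 2 + (z1.imag - z2.imag) ** 2
    return math.acosh(1.0 + sq / (2.0 * z1.imag * z2.imag))


def geodesic_from_i(t: float, psi: float) -> complex:
    """Equation (6): geodesic ray from i making an angle psi with the y-axis."""
    c, s = math.cos(psi / 2.0), math.sin(psi / 2.0)
    a, b = math.exp(t / 2.0), math.exp(-t / 2.0)
    return (a * c * 1j - b * s) / (a * s * 1j + b * c)


if __name__ == "__main__":
    # Distance between N(0, 1) and N(1, 1) under the metric (4).
    print(rao_distance(to_half_plane(0.0, 1.0), to_half_plane(1.0, 1.0)))
    # The point reached after "time" t should lie at distance t from i.
    print(rao_distance(1j, geodesic_from_i(1.3, 0.7)))
```

For the two populations in the usage example, Equation (5) reduces to acosh(1.25), which equals ln 2 in exact arithmetic. Since the metric (4) drops the factor two of the Fisher metric, the distance induced by the Fisher metric itself would be larger by a factor of $\sqrt{2}$.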
The Riemannian volume (or area, since $H$ is of dimension 2) element corresponding to the Riemannian metric (4) is $dA(z) = dx\,dy/y^2$. Accordingly, the integral of a function $f : H \to \mathbb{R}$ with respect to $dA$ is given by:

$$\int_H f(z)\, dA(z) = \int_0^{+\infty} \int_{-\infty}^{+\infty} \frac{f(x, y)}{y^2}\, dx\, dy \tag{8}$$

In many cases, the analytic computation of this integral can be greatly simplified by using polar coordinates $(r, \varphi)$ defined with respect to some “origin” $\bar z \in H$. Polar coordinates $(r, \varphi)$ map to the point $z(r, \varphi)$ given by:

$$z(r, \varphi) = \gamma_{\bar z}\left(r, \frac{\pi}{2} - \varphi\right) \tag{9}$$

where the right-hand side is defined according to Equation (7). The polar coordinates $(r, \varphi)$ do indeed define a global coordinate system of $H$, in the sense that the map that takes a complex number $r e^{i\varphi}$ to the point $z(r, \varphi)$ in $H$ is a diffeomorphism. The standard notation from differential geometry is:

$$\exp_{\bar z}\left(r e^{i\varphi}\right) = z(r, \varphi) \tag{10}$$

In these coordinates, the Riemannian metric (4) takes on the form:

$$ds^2(z) = dr^2 + \sinh^2(r)\, d\varphi^2 \tag{11}$$

The integral Equation (8) can be computed in polar coordinates using the formula [13],

$$\int_H f(z)\, dA(z) = \int_0^{2\pi} \int_0^{+\infty} (f \circ \exp_{\bar z})\left(r e^{i\varphi}\right) \sinh(r)\, dr\, d\varphi \tag{12}$$

where $\exp_{\bar z}$ was defined in Equation (10) and $\circ$ denotes composition. This is particularly useful when $f \circ \exp_{\bar z}$ does not depend on $\varphi$.

2.3. Riemannian Gradient Descent

In this paper, the problem of minimizing, or maximizing, a differentiable function $f : H \to \mathbb{R}$ will play a central role. A popular way of handling the minimization of a differentiable function defined on a Riemannian manifold (such as $H$) is through Riemannian gradient descent [14]. Here, the definition of Riemannian gradient is reviewed, and a generic description of Riemannian gradient descent is provided.
The Riemannian gradient of $f$ is here defined as a mapping $\nabla f : H \to \mathbb{C}$ with the following property:

$$\frac{1}{y^2} \times \mathrm{Re}\{\nabla f(z)\, h^*\} = \mathrm{Re}\{df(z)\, h^*\} \tag{13}$$

for any complex number $h$, where Re denotes the real part, $*$ denotes conjugation and $df$ is the “derivative”, $df = (\partial f/\partial x) + (\partial f/\partial y)\, i$. For example, if $f(z) = y$, then $df = i$ and it follows from Equation (13) that $\nabla f(z) = y^2 i$.

Riemannian gradient descent consists in following the direction of $-\nabla f$ at each step, with the length of the step (in other words, the step size) being determined by the user. The generic algorithm is, up to some variations, the following:

    INPUT ẑ ∈ H                     % Initial guess
    WHILE ‖∇f(ẑ)‖ > ε               % ε ≈ 0 machine precision
        ẑ ← exp_ẑ(−λ∇f(ẑ))          % λ > 0 step size, depends on ẑ
    END WHILE
    OUTPUT ẑ                        % near critical point of f

Here, in the condition for the while loop, $\|\nabla f(\hat z)\|$ is the Riemannian norm of the gradient $\nabla f(\hat z)$. In other words, writing $\hat z = (\hat x, \hat y)$,

$$\|\nabla f(\hat z)\|^2 = \frac{1}{\hat y^2} \times \mathrm{Re}\{\nabla f(\hat z)\, \nabla f(\hat z)^*\}$$

Just like a classical gradient descent algorithm, the above Riemannian gradient descent consists in following the direction of the negative gradient $-\nabla f(\hat z)$, in order to define a new estimate. This is repeated as long as the gradient is appreciably nonzero, in the sense of the loop condition.

The generic algorithm described above has no guarantee of convergence. Convergence and behavior near limit points depend on the function $f$, on the initialization of the algorithm and on the step sizes $\lambda$. For these aspects, the reader may consult [14] (Chapter 4).

3. Riemannian Prior on the Univariate Normal Model

The current section introduces new prior distributions on the univariate normal model. These may be referred to as “Riemannian priors”, since they are entirely based on the Riemannian geometry of this model, and will also be called “Gaussian distributions”, when viewed as probability distributions on the Poincaré half plane.
Here, Section 3.1 defines in a rigorous way Gaussian distributions on $H$ (based on the intuitive Formula (2)). A Gaussian distribution $G(\bar z, \gamma)$ has two parameters, $\bar z \in H$, called the center of mass, and $\gamma > 0$, called the dispersion parameter. Section 3.2 uses the Riemannian gradient descent algorithm of Section 2.3 to provide an algorithm for computing maximum likelihood estimates of $\bar z$ and $\gamma$. Finally, Section 3.3 proves that $\bar z$ is the Riemannian center of mass or Karcher mean of the distribution $G(\bar z, \gamma)$, and that $\gamma$ is uniquely related to mean square Rao’s distance from $\bar z$. (Historically, it is more correct to speak of the “Fréchet mean”, since this concept was proposed by Fréchet in 1948 [15].) The reader may wish to note that the results of Section 3.3 are not used in the following, so this paragraph may be skipped on a first reading.

3.1. Gaussian Distributions on H

A Gaussian distribution $G(\bar z, \gamma)$ on $H$ is a probability distribution with the following probability density function:

$$p(z \mid \bar z, \gamma) = \frac{1}{Z(\gamma)} \exp\left(-\frac{d^2(z, \bar z)}{2\gamma^2}\right) \tag{14}$$

Here, $\bar z \in H$ is called the center of mass and $\gamma > 0$ the dispersion parameter of the distribution $G(\bar z, \gamma)$. The squared distance $d^2(z, \bar z)$ refers to Rao’s distance (5). The probability density function (14) is understood with respect to the Riemannian volume element $dA(z)$. In other words, the normalization constant $Z(\gamma)$ is given by:

$$Z(\gamma) = \int_H f(z)\, dA(z), \qquad f(z) = \exp\left(-\frac{d^2(z, \bar z)}{2\gamma^2}\right)$$

Using polar coordinates, as in Equation (12), it is possible to calculate this integral explicitly. To do so, let $(r, \varphi)$ be polar coordinates whose origin is $\bar z$. Then, $d^2(z, \bar z) = r^2$ when $z = z(r, \varphi)$, as in Equation (9). It follows that:

$$(f \circ \exp_{\bar z})(r, \varphi) = \exp\left(-\frac{r^2}{2\gamma^2}\right) \tag{15}$$

According to Equation (12), the integral $Z(\gamma)$ reduces to:

$$Z(\gamma) = \int_0^{2\pi} \int_0^{+\infty} \exp\left(-\frac{r^2}{2\gamma^2}\right) \sinh(r)\, dr\, d\varphi$$

which is readily calculated,

$$Z(\gamma) = 2\pi \times \sqrt{\frac{\pi}{2}}\, \gamma \times e^{\gamma^2/2} \times \operatorname{erf}\left(\frac{\gamma}{\sqrt{2}}\right) \tag{16}$$

where erf denotes the error function. Formula (16) completes the definition of the Gaussian distribution $G(\bar z, \gamma)$.
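The closed form (16) can be checked against a direct numerical evaluation of the radial integral that precedes it. The sketch below is only a sanity check under stated assumptions: the function names `Z_closed_form` and `Z_numeric`, the composite Simpson rule and the truncation radius $\gamma^2 + 12\gamma$ are choices made for this example and do not appear in the paper.

```python
import math


def Z_closed_form(gamma: float) -> float:
    """Normalization constant of G(z_bar, gamma), Equation (16)."""
    return (2.0 * math.pi * math.sqrt(math.pi / 2.0) * gamma
            * math.exp(gamma ** 2 / 2.0) * math.erf(gamma / math.sqrt(2.0)))


def Z_numeric(gamma: float, n: int = 20000) -> float:
    """2*pi times the integral of exp(-r^2 / (2 gamma^2)) * sinh(r) over r >= 0.

    Composite Simpson rule on [0, R] with n (even) sub-intervals. The
    integrand peaks near r = gamma^2 with width of order gamma, so the
    upper limit R = gamma^2 + 12*gamma (an assumption of this sketch)
    leaves a negligible tail.
    """
    upper = gamma ** 2 + 12.0 * gamma
    h = upper / n

    def integrand(r: float) -> float:
        return math.exp(-r * r / (2.0 * gamma ** 2)) * math.sinh(r)

    total = integrand(0.0) + integrand(upper)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * integrand(k * h)
    return 2.0 * math.pi * total * h / 3.0


if __name__ == "__main__":
    for g in (0.5, 1.0, 2.0):
        print(g, Z_closed_form(g), Z_numeric(g))
```

The two columns printed by the usage example should agree to many digits; the integral over $\varphi$ contributes only the factor $2\pi$ because the integrand does not depend on $\varphi$.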
This definition is the same as suggested in [11], with the difference that, in the present work, it has been possible to compute exactly the normalization constant $Z(\gamma)$.

It is noteworthy that the normalization constant $Z(\gamma)$ depends only on $\gamma$ and not on $\bar z$. This shows that the shape of the probability density function (14) does not depend on $\bar z$, which only plays the role of a location parameter. At a deeper mathematical level, this reflects the fact that $H$ is a homogeneous Riemannian space [13].

The probability density function (14) bears a clear resemblance to the usual Gaussian (or normal) probability density function. Indeed, both are proportional to the exponential of minus the “squared distance”, but in one case, the distance is interpreted as Euclidean distance and, in the other (that of Equation (14)), as Rao’s distance.

3.2. Maximum Likelihood Estimation of $\bar z$ and $\gamma$

Consider the problem of computing maximum likelihood estimates of the parameters $\bar z$ and $\gamma$ of the Gaussian distribution $G(\bar z, \gamma)$, based on independent samples $\{z_i\}_{i=1}^{N}$ from this distribution. Given the expression (14) of the density $p(z \mid \bar z, \gamma)$, the log-likelihood function $\ell(\bar z, \gamma)$ can be written,

$$\ell(\bar z, \gamma) = -N \log\{Z(\gamma)\} - \frac{1}{2\gamma^2} \sum_{i=1}^{N} d^2(z_i, \bar z) \tag{17}$$

Since $\bar z$ only appears in the second term, the maximum likelihood estimate of $\bar z$, say $\hat z$, can be computed first. It is given by the minimization problem:

$$\hat z = \operatorname{argmin}_{z \in H} \frac{1}{2} \sum_{i=1}^{N} d^2(z_i, z) \tag{18}$$

In other words, the maximum likelihood estimate $\hat z$ minimizes the sum of squared Rao distances to the samples $z_i$. This exhibits $\hat z$ as the Riemannian center of mass, also called the Karcher or the Fréchet mean [16], of the samples $z_i$.

The notion of Riemannian center of mass is currently a widely popular one in signal and image processing, with applications ranging from blind source separation and radar signal processing [17,18] to shape and motion analysis [19,20].
The definition of Gaussian distributions, proposed in the present paper, shows how the notion of Riemannian center of mass is related to maximum likelihood estimation, thereby giving it a statistical foundation.

An original result, due to Cartan and cited in [16], states that $\hat z$, as defined in Equation (18), exists and is unique, since $H$, with the Riemannian metric (4), has constant negative curvature. Here, $\hat z$ is computed using Riemannian gradient descent, as described in Section 2.3. The cost function $f$ to be minimized is given by (the factor $N^{-1}$ is conventional),

$$f(z) = \frac{1}{2N} \sum_{i=1}^{N} d^2(z_i, z) \tag{19}$$

Its Riemannian gradient $\nabla f(z)$ is easily found by noting the following fact. Let $f_i(z) = (1/2)\, d^2(z, z_i)$. Then, the Riemannian gradient of this function is (see [21] (page 407)),

$$\nabla f_i(z) = -\log_z(z_i) \tag{20}$$

where $\log_z : H \to \mathbb{C}$ is the inverse of $\exp_z : \mathbb{C} \to H$. The minus sign reflects the fact that $\log_z(z_i)$ points from $z$ towards $z_i$, which is the direction in which $f_i$ decreases. It follows from Equation (20) that,

$$\nabla f(z) = -\frac{1}{N} \sum_{i=1}^{N} \log_z(z_i) \tag{21}$$

The analytic expression of $\log_z$, for any $z \in H$, will be given below (see Equation (23)).

Here, the gradient descent algorithm for computing $\hat z$ is described. This algorithm uses a constant step size $\lambda$, which is fixed manually.

Once the maximum likelihood estimate $\hat z$ has been computed, using the gradient descent algorithm, the maximum likelihood estimate of $\gamma$, say $\hat\gamma$, is found by solving the equation:

$$F(\gamma) = \frac{1}{N} \sum_{i=1}^{N} d^2(z_i, \hat z) \quad \text{where} \quad F(\gamma) = \gamma^3 \times \frac{d}{d\gamma} \log\{Z(\gamma)\} \tag{22}$$

The gradient descent algorithm for computing $\hat z$ is the following,

    INPUT {z_1, . . . , z_N}        % N independent samples from G(z̄, γ)
          ẑ ∈ H                     % Initial guess
    WHILE ‖∇f(ẑ)‖ > ε               % ε ≈ 0 machine precision
        ẑ ← exp_ẑ(−λ∇f(ẑ))          % ∇f(ẑ) given by Equation (21)
                                    % step size λ is constant
    END WHILE
    OUTPUT ẑ                        % near Riemannian center of mass

Application of Formula (21) requires computation of $\log_{\hat z}(z_i)$ for $i = 1, \ldots, N$. Fortunately, this can be done analytically as follows.
In general, for $\hat z = (\bar x, \bar y)$,

$$\log_{\hat z}(z) = \bar y\, \log_i\left(\frac{z - \bar x}{\bar y}\right) \tag{23}$$

where $\log_i$ is found by inverting Equation (6). Precisely,

$$\log_i(z) = r e^{i\varphi} \tag{24}$$

where, for $z = (x, y)$ with $x \neq 0$,

$$r = \operatorname{acosh}\left(1 + \frac{x^2 + (y - 1)^2}{2y}\right)$$

and:

$$\cos(\varphi) = \frac{x}{y \sinh(r)}, \qquad \sin(\varphi) = \frac{\cosh(r) - y^{-1}}{\sinh(r)}$$

and, for $z = (0, y)$, $\log_i(z) = \ln(y)\, i$ with ln denoting the natural logarithm.

3.3. Significance of $\bar z$ and $\gamma$

The parameters $\bar z$ and $\gamma$ of a Gaussian distribution $G(\bar z, \gamma)$ have been called the center of mass and the dispersion parameter. In the present paragraph, it is proven that,

$$\bar z = \operatorname{argmin}_{z \in H} \frac{1}{2} \int_H d^2(z', z)\, p(z' \mid \bar z, \gamma)\, dA(z') \tag{25}$$

and also that:

$$F(\gamma) = \int_H d^2(z', \bar z)\, p(z' \mid \bar z, \gamma)\, dA(z') \tag{26}$$

where $F(\gamma)$ was defined in Equation (22) and $p(z' \mid \bar z, \gamma)$ is the probability density function of $G(\bar z, \gamma)$, given in Equation (14).

Note that Equations (25) and (26) are asymptotic versions of Equations (18) and (22). Indeed, Equations (25) and (26) can be written:

$$\bar z = \operatorname{argmin}_{z \in H} \frac{1}{2} E_{\bar z, \gamma}\, d^2(z', z), \qquad F(\gamma) = E_{\bar z, \gamma}\, d^2(z, \bar z) \tag{27}$$

where $E_{\bar z, \gamma}$ denotes the expectation with respect to $G(\bar z, \gamma)$, and the expectation is carried out on the variable $z'$ in the first formula. Now, these two formulae are the same as Equations (18) and (22), but with expectation instead of empirical mean.

Note, moreover, that Equations (25) and (26) can be interpreted as follows. If $z'$ is distributed according to the Gaussian distribution $G(\bar z, \gamma)$, then Equation (25) states that $\bar z$ is the unique point, out of all $z \in H$, which minimizes the expectation of squared Rao’s distance to $z'$. Moreover, Equation (26) states that the expectation of squared Rao’s distance between $\bar z$ and $z'$ is equal to $F(\gamma)$, so $F(\gamma)$ is the least possible expected squared Rao’s distance between a point $z \in H$ and $z'$. This interpretation justifies calling $\bar z$ the center of mass of $G(\bar z, \gamma)$ and shows that $\gamma$ is uniquely related to the expected dispersion, as measured by squared Rao’s distance, away from $\bar z$.
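As a numerical companion to Equations (18) to (24), the following Python sketch implements the exponential and logarithm maps of $H$, a constant-step Riemannian gradient descent for the center of mass, and a bisection solver for Equation (22). Several ingredients are assumptions of this sketch and not statements of the paper: all function names are hypothetical; tangent vectors at $z = (x, y)$ are complex numbers with Riemannian norm $|v|/y$; the closed form used in `exp_i` was obtained by expanding Equation (6) with $\psi = \pi/2 - \varphi$; the gradient of the cost (19) is taken as $-\frac{1}{N}\sum_i \log_z(z_i)$, so that a descent step moves towards the samples; the expression used for $F(\gamma)$ results from differentiating Equation (16) by hand; and the step size 0.5 and the bisection bracket are arbitrary defaults.

```python
import math


def exp_i(v: complex) -> complex:
    """exp_i(r e^{i phi}): point at distance r from i in direction phi."""
    r = abs(v)
    if r == 0.0:
        return 1j
    cos_phi, sin_phi = v.real / r, v.imag / r
    denom = math.cosh(r) - sin_phi * math.sinh(r)
    return complex(cos_phi * math.sinh(r) / denom, 1.0 / denom)


def exp_map(z: complex, v: complex) -> complex:
    """exp_z(v) = x + y * exp_i(v / y), the inverse of Equation (23)."""
    return z.real + z.imag * exp_i(v / z.imag)


def log_i(z: complex) -> complex:
    """Equation (24): logarithm map at the base point i."""
    x, y = z.real, z.imag
    if x == 0.0:
        return complex(0.0, math.log(y))
    r = math.acosh(1.0 + (x * x + (y - 1.0) ** 2) / (2.0 * y))
    sh = math.sinh(r)
    if sh == 0.0:  # z is numerically indistinguishable from i
        return 0j
    return r * complex(x / (y * sh), (math.cosh(r) - 1.0 / y) / sh)


def log_map(z_hat: complex, z: complex) -> complex:
    """Equation (23): logarithm map at a general base point z_hat."""
    return z_hat.imag * log_i((z - z_hat.real) / z_hat.imag)


def center_of_mass(samples, step=0.5, tol=1e-10, max_iter=1000):
    """Constant-step Riemannian gradient descent for Equation (18)."""
    z = samples[0]
    for _ in range(max_iter):
        grad = -sum(log_map(z, zi) for zi in samples) / len(samples)
        if abs(grad) / z.imag < tol:  # Riemannian norm of the gradient
            break
        z = exp_map(z, -step * grad)
    return z


def F(gamma: float) -> float:
    """F(gamma) = gamma^3 * d/dgamma log Z(gamma), with Z from Equation (16)."""
    ratio = (math.sqrt(2.0 / math.pi) * math.exp(-gamma ** 2 / 2.0)
             / math.erf(gamma / math.sqrt(2.0)))
    return gamma ** 2 + gamma ** 4 + gamma ** 3 * ratio


def solve_gamma(mean_sq_dist: float, lo=1e-6, hi=50.0, iters=200) -> float:
    """Solve F(gamma) = mean_sq_dist by bisection, assuming F is increasing."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) < mean_sq_dist:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    c = 0.3 + 2.0j
    tangents = [0.4 + 0.2j, -0.7 + 0.5j, 0.1 - 0.9j]
    # Samples placed symmetrically about c, so c is their center of mass.
    pts = [exp_map(c, s * v) for v in tangents for s in (1.0, -1.0)]
    z_hat = center_of_mass(pts)
    msd = sum((abs(log_map(z_hat, p)) / z_hat.imag) ** 2 for p in pts) / len(pts)
    print(z_hat, solve_gamma(msd))
```

In the usage example the samples are placed symmetrically about a known point, so the descent should return that point; the squared distances needed in Equation (22) are obtained from the Riemannian norm of the logarithm map. The fixed step size is only a convenient default for samples lying fairly close together, in line with the remark of Section 2.3 that convergence depends on the step sizes.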
In order to prove Equation (25), consider the log-likelihood function,

$$\ell(\bar z, \gamma; z) = -\log\{Z(\gamma)\} - \frac{1}{2\gamma^2}\, d^2(z, \bar z) \tag{28}$$

Let $f_z(\bar z) = (1/2)\, d^2(z, \bar z)$. The score function with respect to $\bar z$ is then, by definition,

$$\nabla_{\bar z}\, \ell(\bar z, \gamma; z) = -\frac{1}{\gamma^2} \nabla f_z(\bar z) \tag{29}$$

where $\nabla_{\bar z}$ indicates that the Riemannian gradient (defined in Equation (13) of Section 2.3) is taken with respect to the variable $\bar z$. Under certain regularity conditions, which are here easily verified, the expectation of the score function is identically zero. Since the factor $-1/\gamma^2$ is constant, this gives,

$$E_{\bar z, \gamma} \nabla f_z(\bar z) = 0 \tag{30}$$

Let $f(z)$ be defined by:

$$f(z) = E_{\bar z, \gamma} f_{z'}(z) = \frac{1}{2} E_{\bar z, \gamma}\, d^2(z', z)$$

with the expectation carried out on the variable $z'$. Clearly, $f(z)$ is the expression to be minimized in Equation (25) (or in the first formula in Equation (27), which is just the same). By interchanging Riemannian gradient and expectation,

$$\nabla f(\bar z) = E_{\bar z, \gamma} \nabla f_z(\bar z) = 0$$

where the last equality follows from Equation (30). It has just been proved that $\bar z$ is a stationary point of $f$ (a point where the gradient is zero). Theorem 2.1 in [16] states that the function $f$ has one and only one stationary point, which is moreover a global minimizer. This concludes the proof of Equation (25). The proof of Equation (26) follows exactly the same method, defining the score function with respect to $\gamma$ and noting that its expectation is identically zero.

4. Classification of Univariate Normal Populations

The previous section studied Gaussian distributions on $H$, “as they stand”, focusing on the fundamental issue of maximum likelihood estimation of their parameters. The present section considers the use of Gaussian distributions as prior distributions on the univariate normal model.

The main motivation behind the introduction of Gaussian distributions is that a Gaussian distribution $G(\bar z, \gamma)$ can be used to give a geometric representation of a cluster or class of univariate normal populations.
Recall that each point $(x, y) \in H$ is identified with a univariate normal population with mean $\mu = \sqrt{2}\,x$ and standard deviation $\sigma = y$. The idea is that populations belonging to the same cluster, represented by $G(\bar z, \gamma)$, should be viewed as centered on $\bar z$ and lying within a typical distance determined by $\gamma$.

In the remainder of this section, it is shown how the maximum likelihood estimation algorithm of Section 3.2 can be used to fit the hyperparameters $\bar z$ and $\gamma$ to data consisting of a class $S = \{S_i;\ i = 1, \ldots, K\}$ of univariate normal populations. This is then applied to the problem of the classification of univariate normal populations. The whole development is based on marginalized likelihood estimation, as follows.

Assume each population $S_i$ contains $N_i$ points, $S_i = \{s_j;\ j = 1, \ldots, N_i\}$, and the points $s_j$, in any class, are drawn from a univariate normal distribution with mean $\mu$ and standard deviation $\sigma$. The focus will be on the asymptotic case where the number $N_i$ of points in each population $S_i$ is large. In order to fit the hyperparameters $\bar z$ and $\gamma$ to the data $S$, assume moreover that the distribution of $z = (x, y)$, where $(x, y) = (\mu/\sqrt{2}, \sigma)$, is a Gaussian distribution $G(\bar z, \gamma)$. Then, the distribution of $S$ can be written in integral form:

$$p(S|\bar z, \gamma) = \prod_{i=1}^{K} \int_H p(S_i|z)\, p(z|\bar z, \gamma)\, dA(z) \tag{31}$$

where $p(z|\bar z, \gamma)$ is the probability density of a Gaussian distribution $G(\bar z, \gamma)$, defined in Equation (14). Moreover, expressing $p(S_i|z)$ as a product of univariate normal distributions $p(s_j|z)$, it follows,

$$p(S|\bar z, \gamma) = \prod_{i=1}^{K} \int_H \prod_{j=1}^{N_i} p(s_j|z)\, p(z|\bar z, \gamma)\, dA(z) \tag{32}$$

This expression, given the data $S$, is to be maximized over $(\bar z, \gamma)$. Using the Laplace approximation, this task is reduced to the maximum likelihood estimation problem addressed in Section 3.2. The Laplace approximation will here be applied in its "basic form" [9], that is, up to terms of order $N_i^{-1}$.
To do so, write each of the integrals in Equation (32), using Equation (8) of Section 2.2. These integrals then take on the form:

$$\int_0^{+\infty}\!\!\int_{-\infty}^{+\infty} \prod_{j=1}^{N_i} |2\pi y^2|^{-1/2} \exp\left(-\frac{\left(s_j - \sqrt{2}\,x\right)^2}{2y^2}\right) \times p(z|\bar z, \gamma) \times \frac{1}{y^2}\, dx\, dy \tag{33}$$

where the univariate normal distribution $p(s_j|z)$ has been replaced by its full expression. Now, the product of these normal densities can be written $\prod_{j=1}^{N_i} p(s_j|z) = \exp\left[N_i\, h(x, y)\right]$, where:

$$h(x, y) = -\frac{1}{2} \ln\left(2\pi y^2\right) - \frac{B_i^2 + V_i^2}{2y^2}$$

Here, $B_i$ and $V_i^2$ are the empirical bias and variance within population $S_i$,

$$B_i = \hat S_i - \sqrt{2}\,x \qquad V_i^2 = N_i^{-1} \sum_{j=1}^{N_i} (\hat S_i - s_j)^2$$

where $\hat S_i$ is the empirical mean of the population, $\hat S_i = N_i^{-1} \sum_{j=1}^{N_i} s_j$. The expression $h(x, y)$ is maximized when $x = \hat x_i$ and $y = \hat y_i$, where $\hat z_i = (\hat x_i, \hat y_i)$ is the couple of maximum likelihood estimates of the parameters $(x, y)$, based on the population $S_i$.

According to the Laplace approximation, the integral in Equation (33) is equal to:

$$2\pi \left|\partial^2 h(\hat x_i, \hat y_i)\right|^{-1/2} \times \exp\left[N_i\, h(\hat x_i, \hat y_i)\right] \times p(\hat z_i|\bar z, \gamma) \times \frac{1}{\hat y_i^2} + O(N_i^{-1})$$

where $\partial^2 h(\hat x_i, \hat y_i)$ is the matrix of second derivatives of $h$, and $|\cdot|$ denotes the determinant. Now, since $h$ is essentially the logarithm of $p(s_j|z)$, a direct calculation shows that $\partial^2 h(\hat x_i, \hat y_i)$ is the same as the Fisher information matrix derived in Section 2.1 (where it was denoted $I(z)$). Thus, the first factor in the above expression is $2\pi \hat y_i^2$, and cancels out with the last factor. Finally, the Laplace approximation of the integral in Equation (33) reads:

$$2\pi \times \exp\left[N_i\, h(\hat x_i, \hat y_i)\right] \times p(\hat z_i|\bar z, \gamma) + O(N_i^{-1})$$

and the resulting approximation of the distribution of $S$, as given by Equation (32), can be written:

$$p(S|\bar z, \gamma) \approx \prod_{i=1}^{K} \alpha \times p(\hat z_i|\bar z, \gamma) \tag{34}$$

where $\alpha$ is a constant, which does not depend either on the data or on the parameters, and $p(\hat z_i|\bar z, \gamma)$ has the expression (14).
Accepting this expression to give the distribution of the data $S$, conditionally on the hyperparameters $(\bar z, \gamma)$, the task of estimating these hyperparameters becomes the same as the maximum likelihood estimation problem described in Section 3.2. In conclusion, if one assumes the populations $S_i$ belong to a single cluster or class $S$ and wishes to fit the hyperparameters $\bar z$ and $\gamma$ of a Gaussian distribution representing this cluster, it is enough to start by computing the maximum likelihood estimates $\hat x_i$ and $\hat y_i$ for each population $S_i$ and then to consider these as input to the maximum likelihood estimation algorithm described in Section 3.2.

The same reasoning just carried out, using the Laplace approximation, can be generalized to the problem of classification of univariate normal populations. Indeed, assume that classes $\{S_L,\ L = 1, \ldots, C\}$, each containing some number $K_L$ of univariate normal populations, have been identified based on some training sequence. Using the Laplace approximation and the maximum likelihood estimation approach of Section 3.2, to each one of these classes, it is possible to fit hyperparameters $(\bar z_L, \gamma_L)$ of a Gaussian distribution $G(\bar z_L, \gamma_L)$ on $H$. For a test population $S_t$, the maximum likelihood rule, for deciding which of the classes $S_L$ this test population $S_t$ belongs to, requires finding the following maximum:

$$L^* = \operatorname{argmax}_L\ p(S_t|\bar z_L, \gamma_L) \tag{35}$$

and assigning the test population $S_t$ to the class with label $L^*$. If the number of points $N_t$ in the population $S_t$ is large, the Laplace approximation, in the same way used above, approximates the maximum in Equation (35) by:

$$L^* = \operatorname{argmax}_L\ p(\hat z_t|\bar z_L, \gamma_L) \tag{36}$$

where $\hat z_t = (\hat x_t, \hat y_t)$ is the couple of maximum likelihood estimates computed based on the test population $S_t$ and where $p(\hat z_t|\bar z_L, \gamma_L)$ is given by Equation (14).
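The maximum likelihood estimates $(\hat x_i, \hat y_i)$ that serve as inputs above depend on a population only through its empirical mean and variance. The following minimal sketch (function names are hypothetical) computes them under the identification $\mu = \sqrt{2}\,x$, $\sigma = y$, and evaluates $h(x, y)$ with the bias and variance terms written out, so that one can check numerically that the log-likelihood of a population of $N_i$ points equals $N_i\, h(x, y)$.

```python
import math

SQRT2 = math.sqrt(2.0)

def summary(samples):
    """Empirical mean and (biased, 1/N) variance of a population."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((mean - s) ** 2 for s in samples) / n
    return mean, var

def h(x, y, samples):
    """Per-sample log-likelihood written with the empirical bias B and variance V^2."""
    mean, var = summary(samples)
    bias = mean - SQRT2 * x
    return -0.5 * math.log(2.0 * math.pi * y * y) - (bias ** 2 + var) / (2.0 * y * y)

def log_likelihood(x, y, samples):
    """Direct sum of log normal densities with mean sqrt(2)*x and standard deviation y."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * y * y) - (s - SQRT2 * x) ** 2 / (2.0 * y * y)
        for s in samples
    )

def ml_point(samples):
    """Maximum likelihood estimates (x_hat, y_hat) of the point z = (x, y) in H."""
    mean, var = summary(samples)
    return (mean / SQRT2, math.sqrt(var))
```

The point returned by `ml_point` is what would be passed, one per population, to the estimation algorithm of Section 3.2.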
Now, writing out Equation (14), the decision rule becomes:

$$L^* = \operatorname{argmax}_L \left\{ -\log\{Z(\gamma_L)\} - \frac{1}{2\gamma_L^2}\, d^2(\hat z_t, \bar z_L) \right\} \tag{37}$$

Under the homoscedasticity assumption, that all of the $\gamma_L$ are equal, this decision rule essentially becomes the same as the one proposed in [5], which requires $S_t$ to be assigned to the "nearest" cluster, in terms of Rao's distance. Indeed, if all the $\gamma_L$ are equal, then Equation (37) is the same as,

$$L^* = \operatorname{argmin}_L\ d^2(\hat z_t, \bar z_L) \tag{38}$$

This decision rule is expected to be less efficient than the one proposed in Equation (37), which also takes into account the uncertainty associated with each cluster, as measured by its dispersion parameter $\gamma_L$.

5. Application to Image Classification

In this section, the framework proposed in Section 4, for classification of univariate normal populations, is applied to texture image classification using Gabor filters. Several authors have found that Gabor energy features are well-suited texture descriptors. In the following, consider 24 Gabor energy sub-bands that are the result of three scales and eight orientations. Hence, each texture image can be decomposed as the collection of those 24 sub-bands. For more information concerning the implementation, the interested reader is referred to [22].

Starting from the VisTex database of 40 images [10] (these are displayed in Figure 1), each image was divided into 16 non-overlapping subimages of 128 × 128 pixels each. A training sequence was formed by choosing randomly eight subimages out of each image. To each subimage in the training sequence, a bank of 24 Gabor filters was applied. The result of applying a Gabor filter with scale $s$ and orientation $o$ to a subimage $i$ belonging to an image $L$ is a univariate normal population $S_{i,s,o}$ of 128 × 128 points (one point for each pixel, after the filter is applied).

Figure 1. Forty images of the VisTex database.
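The decision rules of Equations (37) and (38) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: `rao_distance` uses the standard hyperbolic distance of the half plane with metric $(dx^2 + dy^2)/y^2$, and `log_Z` uses the normalizing constant $Z(\gamma) = 2\pi\sqrt{\pi/2}\;\gamma\, e^{\gamma^2/2}\operatorname{erf}(\gamma/\sqrt{2})$ obtained by integrating $\exp(-r^2/2\gamma^2)$ in geodesic polar coordinates for that metric; the paper's own expressions (Equations (6) and (14)) are not reproduced in this section and may differ by constants. The class dictionary and all names are hypothetical.

```python
import math

def rao_distance(z1, z2):
    """Assumed distance: hyperbolic distance between z1 = (x1, y1) and z2 = (x2, y2)."""
    (x1, y1), (x2, y2) = z1, z2
    return math.acosh(1.0 + ((x1 - x2) ** 2 + (y1 - y2) ** 2) / (2.0 * y1 * y2))

def log_Z(gamma):
    """Assumed log normalizing constant of exp(-d^2 / (2 gamma^2)) on the half plane."""
    return (
        math.log(2.0 * math.pi * math.sqrt(math.pi / 2.0) * gamma)
        + gamma ** 2 / 2.0
        + math.log(math.erf(gamma / math.sqrt(2.0)))
    )

def classify(z_hat, classes):
    """Rule (37): classes maps a label L to (zbar_L, gamma_L)."""
    def score(label):
        zbar, gamma = classes[label]
        return -log_Z(gamma) - rao_distance(z_hat, zbar) ** 2 / (2.0 * gamma ** 2)
    return max(classes, key=score)

def classify_nearest(z_hat, classes):
    """Rule (38): nearest center in Rao's distance, ignoring the dispersions."""
    return min(classes, key=lambda label: rao_distance(z_hat, classes[label][0]))
```

With a tight class `"A": ((0.0, 1.0), 0.1)` and a dispersed class `"B": ((2.0, 1.0), 2.0)`, the test point `(0.8, 1.0)` is nearer to the center of A, yet rule (37) assigns it to B because A's small dispersion makes a point at that distance very unlikely. When all dispersions are equal, the two rules agree.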
These populations $S_{i,s,o}$ (called sub-bands) are considered independent, each one of them univariate normal with mean $\mu_{i,s,o} = \sqrt{2}\,x_{i,s,o}$, standard deviation $\sigma_{i,s,o} = y_{i,s,o}$ and with $z_{i,s,o} = (x_{i,s,o}, y_{i,s,o})$. The couple of maximum likelihood estimates for these parameters is denoted $\hat z_{i,s,o} = (\hat x_{i,s,o}, \hat y_{i,s,o})$. An image $L$ (recall, there are 40 images) contains, in each sub-band, eight populations $S_{i,s,o}$, with which hyperparameters $\bar z_{L,s,o}$ and $\gamma_{L,s,o}$ are associated, by applying the maximum likelihood estimation algorithm of Section 3.2 to the inputs $\hat z_{i,s,o}$.

If $S_t$ is a test subimage, then one should begin by applying the 24 Gabor filters to it, obtaining independent univariate normal populations $S_{t,s,o}$, and then compute for each population the couple of maximum likelihood estimates $\hat z_{t,s,o} = (\hat x_{t,s,o}, \hat y_{t,s,o})$. The decision rule Equation (37) of Section 4 requires that $S_t$ should be assigned to the image $L^*$, which realizes the maximum:

$$L^* = \operatorname{argmax}_L \sum_{s,o} \left[ -\log\{Z(\gamma_{L,s,o})\} - \frac{1}{2\gamma_{L,s,o}^2}\, d^2(\hat z_{t,s,o}, \bar z_{L,s,o}) \right] \tag{39}$$

When considering the homoscedasticity assumption, i.e., $\gamma_{L,s,o} = \gamma_{s,o}$ for all $L$, this decision rule becomes:

$$L^* = \operatorname{argmin}_L \sum_{s,o} d^2(\hat z_{t,s,o}, \bar z_{L,s,o}) \tag{40}$$

For this concrete application to the VisTex database, it is pertinent to compare the rate of successful classification (or overall accuracy) obtained using the Riemannian prior, based on the framework of Section 4, to that obtained using a more classical conjugate prior, i.e., a normal-inverse gamma distribution of the mean $\mu = \sqrt{2}\,x$ and the standard deviation $\sigma = y$.
This conjugate prior is given by:

$$p(\mu|\sigma, \mu_p, \kappa_p) = \frac{\sqrt{\kappa_p}}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\kappa_p}{2\sigma^2}(\mu - \mu_p)^2\right)$$

with an inverse gamma prior on $\sigma^2$,

$$p(\sigma^2|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \left(\sigma^2\right)^{-(\alpha+1)} \exp\left(-\frac{\beta}{\sigma^2}\right) \tag{41}$$

Using this conjugate prior, instead of a Riemannian prior, and following the procedure of applying the Laplace approximation, a different decision rule is obtained, where $L^*$ is taken to be the label that maximizes the following expression:

$$\sum_{s,o} \left[ \frac{\ln \kappa_{p_{L,s,o}}}{2} - \frac{\kappa_{p_{L,s,o}}}{2\hat y_{t,s,o}^2} \left(\sqrt{2}\,\hat x_{t,s,o} - \mu_{p_{L,s,o}}\right)^2 + \alpha_{L,s,o} \ln \beta_{L,s,o} - \ln \Gamma(\alpha_{L,s,o}) - 2(\alpha_{L,s,o} + 1) \ln \hat y_{t,s,o} - \frac{\beta_{L,s,o}}{\hat y_{t,s,o}^2} \right] \tag{42}$$

where, as in Equation (39), $\hat x_{t,s,o}$ and $\hat y_{t,s,o}$ are the maximum likelihood estimates computed for the population $S_{t,s,o}$.

Both the Riemannian and conjugate priors have been applied to the VisTex database, with half of the database used for training and half for testing. In the course of 100 Monte Carlo runs, a significant gain of about 3% is observed with the Riemannian prior compared to the conjugate prior. This is summarized in the following table.

| Prior model | Overall accuracy |
|---|---|
| Riemannian prior, Equation (39) | 71.88% ± 2.16% |
| Riemannian prior, homoscedasticity assumption, Equation (40) | 69.06% ± 1.96% |
| Conjugate prior, Equation (42) | 68.73% ± 2.92% |

Recall that the overall accuracy is the ratio of the number of successfully classified subimages to the total number of subimages. The table shows that the use of a Riemannian prior, even under a homoscedasticity assumption, yields significant improvement upon the use of a conjugate prior.

6. Conclusions

Motivated by the problem of the classification of univariate normal populations, this paper introduced a new class of prior distributions on the univariate normal model.
With the univariate normal model viewed as the Poincaré half plane $H$, these new prior distributions, called Gaussian distributions, were meant to reflect the geometric picture (in terms of Rao's distance) that a cluster or class of univariate normal populations can be represented as having a center $\bar z \in H$ and a "variance" or dispersion $\gamma^2$. Precisely, a Gaussian distribution $G(\bar z, \gamma)$ has a probability density function $p(z)$, with respect to Riemannian volume of the Poincaré half plane, which is proportional to $\exp\left(-\frac{d^2(z, \bar z)}{2\gamma^2}\right)$.

Using Gaussian distributions as prior distributions in the problem of the classification of univariate normal populations was shown to lead to a new, more general and efficient decision rule. This decision rule was implemented in a real-world application to texture image classification, where it led to significant improvement in performance, in comparison to decision rules obtained by using conjugate priors.

The general approach proposed in this paper contains several simplifications and approximations, which could be improved upon in future work. First, it is possible to use different prior distributions, which are more geometrically rich than Gaussian distributions, to represent classes of univariate normal populations. For example, it may be helpful to replace Gaussian distributions that are "isotropic", in the sense of having a scalar dispersion parameter $\gamma$, by non-isotropic distributions, with a dispersion matrix $\Gamma$ (a 2 × 2 symmetric positive definite matrix). Another possibility would be to represent each class of univariate normal populations by a finite mixture of Gaussian distributions, instead of representing it by a single Gaussian distribution.
These variants, which would allow classes with a more complex geometric structure to be taken into account, can be integrated in the general framework proposed in the paper, based on: (i) fitting each class to a prior distribution (Gaussian non-isotropic, mixture of Gaussians); and (ii) choosing, for a test population, the most adequate class, based on a decision rule. These two steps can be realized as above, through the Laplace approximation and maximum likelihood estimation, or through alternative techniques, based on Markov chain Monte Carlo stochastic optimization.

In addition to generalizing the approach of this paper and improving its performance, a further important objective for future work will be to extend it to other parametric models, beyond univariate normal models. Indeed, there is an increasing number of parametric models (generalized Gaussian, elliptical models, etc.), whose Riemannian geometry is becoming well understood and where the present approach may be helpful.

Author Contributions

Salem Said carried out the mathematical development and specified the algorithms appearing in Sections 2, 3 and 4. Lionel Bombrun carried out all numerical simulations and proposed the theoretical development of Section 4. Yannick Berthoumieu devised the main idea of the paper, that is, the use of Riemannian priors as a geometric representation of a class or cluster of univariate normal populations. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000.
2. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
3. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–234.
4. Amari, S.I. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
5. Nielsen, F.; Nock, R. Hyperbolic Voronoi diagrams made easy. 2009, arXiv:0903.3287.
6. Lenglet, C.; Rousson, M.; Deriche, R.; Faugeras, O. Statistics on the manifold of multivariate normal distributions: Theory and application to diffusion tensor MRI processing. J. Math. Imaging Vis. 2006, 25, 423–444.
7. Verdoolaege, G.; Scheunders, P. On the geometry of multivariate generalized Gaussian models. J. Math. Imaging Vis. 2012, 43, 180–193.
8. Berkane, M.; Oden, K. Geodesic estimation in elliptical distributions. J. Multival. Anal. 1997, 63, 35–46.
9. Erdélyi, A. Asymptotic Expansions; Dover Books: Mineola, NY, USA, 2010.
10. MIT Vision and Modeling Group. Vision Texture. Available online: http://vismod.media.mit.edu/pub/VisTex (accessed on 10 June 2014).
11. Pennec, X. Intrinsic statistics on Riemannian manifold: Basic tools for geometric measurements. J. Math. Imaging Vis. 2006, 25, 127–154.
12. Atkinson, C.; Mitchell, A.F.S. Rao's distance measure. Sankhya Ser. A 1981, 43, 345–365.
13. Gallot, S.; Hulin, D.; Lafontaine, J. Riemannian Geometry; Springer-Verlag: Berlin, Germany, 2004.
14. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2006.
15. Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l'I.H.P. 1948, 10, 215–310. (In French)
16. Afsari, B. Riemannian Lp center of mass: Existence, uniqueness and convexity. Proc. Am. Math. Soc. 2011, 139, 655–673.
17. Manton, J.H. A centroid (Karcher mean) approach to the joint approximate diagonalisation problem: The real symmetric case. Digit. Sign. Process. 2006, 16, 468–478.
18. Arnaudon, M.; Barbaresco, F. Riemannian medians and means with applications to RADAR signal processing. IEEE J. Sel. Top. Sign. Process. 2013, 7, 595–604.
19. Le, H. On the consistency of procrustean mean shapes. Adv. Appl. Prob. 1998, 30, 53–63.
20. Turaga, P.; Veeraraghavan, A.; Chellappa, R. Statistical Analysis on Stiefel and Grassmann Manifolds with Applications in Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; doi:10.1109/CVPR.2008.4587733.
21. Chavel, I. Riemannian Geometry: A Modern Introduction; Cambridge University Press: Cambridge, UK, 2008.
22. Grigorescu, S.E.; Petkov, N.; Kruizinga, P. Comparison of texture features based on Gabor filter. IEEE Trans. Image Process. 2002, 11, 1160–1167.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

Combinatorial Optimization with Information Geometry: The Newton Method

Luigi Malagò 1 and Giovanni Pistone 2,*

1 Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico, 39/41, 20135 Milano, Italy; E-Mail: malago@di.unimi.it
2 de Castro Statistics, Collegio Carlo Alberto, Via Real Collegio 30, 10024 Moncalieri, Italy
* E-Mail: giovanni.pistone@carloalberto.org; Tel.: +39-011-670-5033; Fax: +39-011-670-5082.

Received: 31 March 2014; in revised form: 10 July 2014 / Accepted: 11 July 2014 / Published: 28 July 2014

Abstract: We discuss the use of the Newton method in the computation of $\max(p \mapsto \mathrm{E}_p[f])$, where $p$ belongs to a statistical exponential family on a finite state space. In a number of papers, the authors have applied first order search methods based on information geometry. Second order methods have been widely used in optimization on manifolds, e.g., matrix manifolds, but appear to be new in statistical manifolds. These methods require the computation of the Riemannian Hessian in a statistical manifold.
We use a non-parametric formulation of information geometry in view of further applications in the continuous state space cases, where the construction of a proper Riemannian structure is still an open problem.

Keywords: statistical manifold; Riemannian Hessian; combinatorial optimization; Newton method

1. Introduction

In this paper, statistical exponential families [1] are thought of as differentiable manifolds along the approach called information geometry [2] or the exponential statistical manifold [3]. Specifically, our aim is to discuss optimization on statistical manifolds using the Newton method, as is suggested in ([4] (Ch. 5 and 6)); see also the monograph [5]. This method is based on classical Riemannian geometry [6], but here, we put our emphasis on coordinate-free differential geometry; see [7,8].

We mainly refer to the above-mentioned references [2,4], with one notable exception in the description of the tangent space. Our manifold will be an exponential family $\mathcal{E}_{\mathcal{V}}$ of positive densities, $\mathcal{V}$ being a vector space of sufficient statistics. Given a one-dimensional statistical model $p(t) \in \mathcal{E}_{\mathcal{V}}$, $t \in I$, we define its velocity at time $t$ to be its Fisher score $s(t) = \frac{d}{dt} \ln p(t)$ [9]. The Fisher score $s(t)$ is a random variable with zero expectation with respect to $p(t)$, $\mathrm{E}_{p(t)}[s(t)] = 0$. Because of that, the tangent space at $p \in \mathcal{E}_{\mathcal{V}}$ is a vector space of random variables with zero expectation at $p$. A vector field is a mapping from $p$ to a random variable $V(p)$, such that for all $p \in \mathcal{E}_{\mathcal{V}}$, the random variable $V(p)$ is centered at $p$, $\mathrm{E}_p[V(p)] = 0$. In other words, each point of the manifold has a different tangent space, and this tangent space can be used as a non-parametric model space of the manifold. In this formalism, a vector field is a mapping from densities to centered random variables, that is, it is what in statistics is called a pivot of the statistical model.
To avoid confusion with the product of random variables, we do not use the standard notation for the action of a vector field on a real function.

This approach is possibly unusual in differential geometry, but it is fully natural from the statistical point of view, where the Fisher score has a central place. Moreover, this approach scales nicely from the finite state space to the general state space; see the discussion in [9] and the review in [3]. A complete construction of the geometric framework based on the idea of using the Fisher scores as elements of the tangent bundle has actually been worked out. In this paper, we go on by considering a second order geometry based on the non-parametric settings.

Entropy 2014, 16, 4260–4289; doi:10.3390/e16084260 www.mdpi.com/journal/entropy

Our main motivation for such a geometrical construction is its application to combinatorial optimization using exponential families, whose first order version was developed in [10–14]. We illustrate the methods with the following toy example.

Consider the function $f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2 + a_{12} x_1 x_2$, with $x_1, x_2 = \pm 1$, $a_0, a_1, a_2, a_{12} \in \mathbb{R}$. The function $f$ is a real random variable on the sample space $\Omega = \{+1, -1\}^2$ with the uniform probability $\lambda$. Note that the coordinate mappings $X_1, X_2$ of $\Omega$ generate an orthonormal basis $1, X_1, X_2, X_1 X_2$ of $L^2(\Omega, \lambda)$ and that $f$ is the general form of a real random variable on such a space. Let $\mathcal{P}_>$ be the open simplex of positive densities on $(\Omega, \lambda)$, and let $\mathcal{E}_{\mathcal{V}}$ be a statistical model, i.e., a subset of $\mathcal{P}_>$. The relaxed mapping $F\colon \mathcal{E}_{\mathcal{V}} \to \mathbb{R}$,

$$F(p) = \mathrm{E}_p[f] = a_0 + a_1 \mathrm{E}_p[X_1] + a_2 \mathrm{E}_p[X_2] + a_{12} \mathrm{E}_p[X_1 X_2], \tag{1}$$

is strictly bounded by the maximum of $f$, $F(p) = \mathrm{E}_p[f] < \max_{x \in \Omega} f(x)$, unless $f$ is constant. We are looking for a sequence $p_n$, $n \in \mathbb{N}$, such that $\mathrm{E}_{p_n}[f] \to \max_{x \in \Omega} f(x)$ as $n \to \infty$. The existence of such a sequence is a nontrivial condition for the model $\mathcal{E}_{\mathcal{V}}$.
Precisely, the closure of $\mathcal{E}_{\mathcal{V}}$ must contain a density whose support is contained in the set of maxima $\{x \in \Omega \mid f(x) = \max f\}$. This condition is satisfied by the independence model, $\mathcal{V} = \operatorname{Span}\{X_1, X_2\}$, where we can write:

$$F(\eta_1, \eta_2) = a_0 + a_1 \eta_1 + a_2 \eta_2 + a_{12} \eta_1 \eta_2, \qquad \eta_i = \mathrm{E}_p[X_i], \tag{2}$$

See Figure 1.

The gradient of Equation (2) has components $\partial_1 F = a_1 + a_{12} \eta_2$, $\partial_2 F = a_2 + a_{12} \eta_1$, and the flow along the gradient produces increasing values for $F$; however, the gradient flow does not converge to the maximum of $F$; see the dotted line in Figure 2. However, one can follow the suggestion by [15] and use a modified gradient (the "natural" gradient) flow that produces better results in our problem; see Figure 3. Full details on this example are given in Section 2.5.2.

Figure 1. Relaxation of the Function (2) on the independence model. $a_1 = 1$, $a_2 = 2$, $a_{12} = 3$. (Panel title: Expectation parameters; axes: $\eta_1$, $\eta_2$.)

Figure 2. Gradient flow of the Function (2). The domain has been increased to include values outside the square $[-1, +1]^2$. (Panel title: Expectation parameters; axes: $\eta_1$, $\eta_2$.)

Figure 3. Gradient flow (blue line) and natural gradient flow (black line) for the Function (2), starting at $(-1/4, -1/4)$. (Panel title: Gradient vs Natural gradient; axes: $\eta_1$, $\eta_2$.)

In combinatorial optimization, the values of the function $f$ are assumed to be available at each point, and the curve of steepest ascent of the relaxed function is learned through a simulation procedure based on exponential statistical models.

In this paper, we introduce, in Section 2, the geometry of exponential families and its first order calculus. The second order calculus and the Hessian are discussed in Section 3.
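For the toy example of Equation (2), the contrast between the two flows can be reproduced with a few lines of explicit Euler integration. This is a minimal sketch under one labeled assumption: in the expectation parameters of the independence model, the natural gradient flow is taken to be $\dot\eta_i = (1 - \eta_i^2)\,\partial_i F$, which uses $\operatorname{Var}(X_i) = 1 - \eta_i^2$ for $\pm 1$ valued variables; the paper's own derivation is in Section 2.5.2 and is not reproduced here. Step size, horizon and function names are arbitrary choices.

```python
A0, A1, A2, A12 = 0.0, 1.0, 2.0, 3.0   # coefficients used in Figure 1 (a0 set to 0 here)

def F(eta):
    """Relaxed function of Equation (2) in the expectation parameters."""
    e1, e2 = eta
    return A0 + A1 * e1 + A2 * e2 + A12 * e1 * e2

def grad_F(eta):
    """Euclidean gradient of Equation (2)."""
    e1, e2 = eta
    return (A1 + A12 * e2, A2 + A12 * e1)

def euler_flow(eta0, natural, dt=0.01, steps=2000):
    """Integrate the gradient flow (natural=False) or the assumed natural gradient flow."""
    e1, e2 = eta0
    path = [(e1, e2)]
    for _ in range(steps):
        g1, g2 = grad_F((e1, e2))
        if natural:
            # Assumed form: scale each component by Var(X_i) = 1 - eta_i^2.
            g1 *= 1.0 - e1 * e1
            g2 *= 1.0 - e2 * e2
        e1 += dt * g1
        e2 += dt * g2
        path.append((e1, e2))
    return path

start = (-0.25, -0.25)
natural_path = euler_flow(start, natural=True)
plain_path = euler_flow(start, natural=False, steps=200)
```

Starting from $(-1/4, -1/4)$, the natural flow stays inside the square and approaches the vertex $(1, 1)$, where $f$ attains its maximum $a_0 + 6$, while the plain gradient flow leaves the square $[-1, +1]^2$, where the parameters no longer correspond to a probability.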
Finally, in Section 4, we apply the formalism to the discussion of the Newton method in the context of the maximization of the relaxed function.

2. Models on a Finite State Space

We consider here the exponential statistical manifold on the set of positive densities on a measure space $(\Omega, \mu)$ with $\Omega$ finite and counting measure $\mu$. The setup we describe below is not strictly required in the finite case, because in such a case, other approaches are possible, but it provides a mathematical formalism that has its own advantages and that scales naturally to the infinite case. We provide below a schematic presentation of our formalism as an introduction to this section.

• Two different exponential families can actually be the same statistical model, as the sets of densities in the two exponential families are actually equal. This fact is due to both the arbitrariness of the reference density and the fact that sufficient statistics are actually a vector basis of the vector space generated by the sufficient statistics. In a non-parametric approach, we can refer directly to the vector space of centered log-densities, while the change of reference density is geometrically interpreted as a change of chart. The set of all possible such charts defines a manifold.
• We make a specific interpretation of the tangent bundle as the vector space of Fisher's scores at each density and use such tangent spaces as the space of coordinates. This produces a different tangent space/space of coordinates at each density, and different tangent spaces are mapped one onto another by a proper parallel transport, which is nothing else than the re-centering of random variables.
• If a basis is chosen, a parametrization is given, and such a parametrization is, in fact, a new chart, whose values are real vectors. In the real parametrization, the natural scalar product in each scores space is given by Fisher's information matrix.
• Riemannian gradients are defined in the usual way. It is customary in information geometry to call "natural gradient" the real coordinate presentation of the Riemannian gradient. The natural gradient is computed by applying the inverse of the Fisher information matrix to the Euclidean gradient. It seems that there are three gradients involved, but they all represent the same object when correctly understood.
• The classical notion of expectation parameters for exponential families carries on as another chart on the statistical manifold, which gives rise to a further presentation of a geometrical object.
• While the statistical manifold is unique, there are at least three relevant connections as structures on the vector bundles of the manifold: one relating to the exponential charts, one relating to the expectation charts and one depending on the Riemannian structure.

2.1. Exponential Families As Manifolds

On the finite sample space $\Omega$, $\#\Omega = n$, let a set of random variables $B = \{X_1, \ldots, X_m\}$ be given, such that $\sum_j \alpha_j X_j$ is constant if, and only if, the $\alpha_j$'s are zero, or, equivalently, such that $X_0 = 1, X_1, \ldots, X_m$ are affinely independent. The condition implies, necessarily, the linear independence of $B$. A common choice is to take a set of linearly independent and $\mu$-centered random variables.

We write $\mathcal{V} = \operatorname{Span}\{X_1, \ldots, X_m\}$ and define the following exponential family of positive densities, for a given $p \in \mathcal{P}_>$:

$$\mathcal{E}_{\mathcal{V}} = \left\{ q \in \mathcal{P}_> \,\middle|\, q \propto e^{V} p,\ V \in \mathcal{V} \right\}. \tag{3}$$

Given any couple $p, q \in \mathcal{E}_{\mathcal{V}}$, there exists a unique set of parameters $\theta = \theta_p(q)$, such that:

$$q = \exp\left( \sum_j \theta_j\, {}^{e}\mathsf{U}_p X_j - \psi_p(\theta) \right) \cdot p \tag{4}$$

where ${}^{e}\mathsf{U}_p$ is the centering at $p$, that is,

$${}^{e}\mathsf{U}_p \colon \mathcal{V} \ni U \mapsto U - \mathrm{E}_p[U] \in {}^{e}\mathsf{U}_p \mathcal{V}. \tag{5}$$

The linear mapping ${}^{e}\mathsf{U}_p$ is one-to-one on $\mathcal{V}$, and ${}^{e}\mathsf{U}_p X_j$, $j = 1, \ldots, m$, is a basis of ${}^{e}\mathsf{U}_p \mathcal{V}$.
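We view each choice of a specific reference $p$ as providing a chart centered at $p$ on the exponential family $\mathcal{E}_{\mathcal{V}}$, namely:

$$\sigma_p \colon \exp\left( \sum_j \theta_j\, {}^{e}\mathsf{U}_p X_j - \psi_p(\theta) \right) \cdot p \mapsto \theta, \tag{6}$$

If:

$$U = {}^{e}\mathsf{U}_p U + \mathrm{E}_p[U] = \sum_{j=1}^{m} \theta_j\, {}^{e}\mathsf{U}_p X_j + \mathrm{E}_p[U], \tag{7}$$

then:

$$\mathrm{E}_p\left[ U\, {}^{e}\mathsf{U}_p X_i \right] = \sum_{j=1}^{m} \theta_j\, \mathrm{E}_p\left[ {}^{e}\mathsf{U}_p X_i\, {}^{e}\mathsf{U}_p X_j \right], \tag{8}$$

so that $\theta = I_B^{-1}(p)\, \mathrm{E}_p[U\, {}^{e}\mathsf{U}_p X]$, where:

$$I_B(p) = \left[ \operatorname{Cov}_p(X_i, X_j) \right]_{ij} = \mathrm{E}_p[X X'] - \mathrm{E}_p[X]\, \mathrm{E}_p[X'] \tag{9}$$

is the Fisher information matrix of the basis $B = \{X_1, \ldots, X_m\}$. The mappings:

$$\sigma_p \colon \mathcal{E}_{\mathcal{V}} \ni q \mapsto U \mapsto \theta \in \mathbb{R}^m \tag{10}$$

where:

$$s_p \colon q \mapsto U = \log\left(\frac{q}{p}\right) - \mathrm{E}_p\left[\log\left(\frac{q}{p}\right)\right], \tag{11}$$

$$\sigma_p \colon q \mapsto \theta = I_B^{-1}(p)\, \mathrm{E}_p\left[ U\, {}^{e}\mathsf{U}_p X \right] = I_B^{-1}(p)\, \mathrm{E}_p\left[ \log\left(\frac{q}{p}\right) {}^{e}\mathsf{U}_p X \right], \tag{12}$$

are global charts in the non-parametric and parametric coordinates, respectively. Notice that Equation (12) provides the regression coefficients of the least squares estimate on ${}^{e}\mathsf{U}_p \mathcal{V}$ of the log-likelihood.

We denote by $e_p \colon \mathbb{R}^m \to \mathcal{E}_{\mathcal{V}}$ the inverse of $\sigma_p$, i.e.,

$$e_p(\theta) = \exp\left( \sum_{j=1}^{m} \theta_j\, {}^{e}\mathsf{U}_p X_j - \psi_p(\theta) \right) \cdot p, \tag{13}$$

so that the representation of the divergence $q \mapsto D(p \,\|\, q)$ in the chart $\sigma_p$ is $\psi_p$:

$$\psi_p(\theta) = \log\left( \mathrm{E}_p\left[ e^{\sum_{j=1}^{m} \theta_j\, {}^{e}\mathsf{U}_p X_j} \right] \right) = \mathrm{E}_p\left[ \log\left( \frac{p}{e_p(\theta)} \right) \right] = D\left( p \,\|\, e_p(\theta) \right). \tag{14}$$

The mapping $I_B \colon p \mapsto \operatorname{Cov}_p(X, X) \in \mathbb{R}^{m \times m}$ is represented in the chart centered at $p$ by:

$$I_{B,p}(\theta) = I_B(e_p(\theta)) = \left[ \operatorname{Cov}_{e_p(\theta)}(X_i, X_j) \right]_{i,j} = \operatorname{Hess} \psi_p(\theta), \tag{15}$$

See [1].

2.2. Change of Chart

Fix $p, \bar p \in \mathcal{E}_{\mathcal{V}}$; then, we can express $p$ in the chart centered at $\bar p$,

$$p = \exp\left( \bar U - k_{\bar p}(\bar U) \right) \cdot \bar p, \qquad \bar U \in {}^{e}\mathsf{U}_{\bar p} \mathcal{V}, \qquad k_{\bar p}(\bar U) = \log\left( \mathrm{E}_{\bar p}\left[ e^{\bar U} \right] \right). \tag{16}$$

In coordinates, $\bar U = \sum_{j=1}^{m} \bar\theta_j\, {}^{e}\mathsf{U}_{\bar p} X_j$. For all $q \in \mathcal{E}_{\mathcal{V}}$, $q = \exp\left( U - k_p(U) \right) \cdot p$, $U \in {}^{e}\mathsf{U}_p \mathcal{V}$, $k_p(U) = \log\left( \mathrm{E}_p\left[ e^{U} \right] \right)$, in coordinates $U = \sum_{j=1}^{m} \theta_j\, {}^{e}\mathsf{U}_p X_j$, we can write:

$$\begin{aligned} q &= \exp\left( U - k_p(U) \right) \cdot p \\ &= \exp\left( U - k_p(U) \right) \exp\left( \bar U - k_{\bar p}(\bar U) \right) \cdot \bar p \\ &= \exp\left( U - k_p(U) + \bar U - k_{\bar p}(\bar U) \right) \cdot \bar p \\ &= \exp\left( \left( (U + \bar U) - \mathrm{E}_{\bar p}[U] \right) - \left( k_p(U) + k_{\bar p}(\bar U) - \mathrm{E}_{\bar p}[U] \right) \right) \cdot \bar p, \end{aligned} \tag{17}$$

hence, the non-parametric coordinate of $q$ in the chart centered at $\bar p$ is $U + \bar U - \mathrm{E}_{\bar p}[U] = {}^{e}\mathsf{U}_{\bar p}(U) + \bar U$.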
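From Equation (12):

$$\sigma_{\bar p}(q) = I_B^{-1}(\bar p)\, \mathrm{E}_{\bar p}\left[ \left( {}^{e}\mathsf{U}_{\bar p} U + \bar U \right) {}^{e}\mathsf{U}_{\bar p} X \right] = \theta + \bar\theta \tag{18}$$

This provides the change of charts $\sigma_{\bar p} \circ \sigma_p^{-1} \colon \theta \mapsto \theta + \bar\theta$. This atlas of charts defines the affine manifold $(\mathcal{E}_{\mathcal{V}}, (\sigma_p))$. This fact has deep consequences that we do not discuss here, e.g., our manifold is an instance of a Hessian manifold [16].

2.3. Tangent Bundle

The space of Fisher scores at $p$ is ${}^{e}\mathsf{U}_p \mathcal{V}$, and it is identified with the tangent space of the manifold at $p$, $T_p \mathcal{E}_{\mathcal{V}}$; see the discussion in [3,9]. Let us check the consistency of this statement with our $\theta$-parametrization. Let:

$$q(\tau) = \exp\left( \sum_{j=1}^{m} \theta_j(\tau)\, {}^{e}\mathsf{U}_{q(0)} X_j - \psi_{q(0)}(\theta(\tau)) \right) \cdot q(0), \tag{19}$$

$\tau \in I$, $I$ an open interval containing zero, be a curve in $\mathcal{E}_{\mathcal{V}}$. In the chart centered at $q(0)$, we have from Equation (12):

$$\begin{aligned} \sigma_{q(0)}(q(\tau)) &= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ \log\left( \frac{q(\tau)}{q(0)} \right) {}^{e}\mathsf{U}_{q(0)} X \right] \\ &= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ \left( \sum_{j=1}^{m} \theta_j(\tau)\, {}^{e}\mathsf{U}_{q(0)} X_j - \psi_{q(0)}(\theta(\tau)) \right) {}^{e}\mathsf{U}_{q(0)} X \right] \\ &= I_B^{-1}(q(0)) \sum_{j=1}^{m} \theta_j(\tau)\, \mathrm{E}_{q(0)}\left[ {}^{e}\mathsf{U}_{q(0)} X_j\, {}^{e}\mathsf{U}_{q(0)} X \right] \\ &= I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ {}^{e}\mathsf{U}_{q(0)} X \left( {}^{e}\mathsf{U}_{q(0)} X \right)' \right] \theta(\tau) = \theta(\tau). \end{aligned} \tag{20}$$

The vector space ${}^{e}\mathsf{U}_p \mathcal{V}$ is represented by the coordinates in the base ${}^{e}\mathsf{U}_p B$. The tangent bundle $T\mathcal{E}_{\mathcal{V}}$ as a manifold is defined by the charts $(\sigma_p, \dot\sigma_p)$ on the domain:

$$T\mathcal{E}_{\mathcal{V}} = \left\{ (p, v) \,\middle|\, p \in \mathcal{E}_{\mathcal{V}},\ v \in T_p \mathcal{E}_{\mathcal{V}} \right\} \tag{21}$$

with:

$$(\sigma_p, \dot\sigma_p) \colon (q, V) \mapsto \left( I_B^{-1}(p)\, \mathrm{E}_p\left[ \log\left( \frac{q}{p} \right) {}^{e}\mathsf{U}_p X \right],\ I_B^{-1}(p)\, \mathrm{E}_p\left[ V\, {}^{e}\mathsf{U}_p X \right] \right). \tag{22}$$

The dot notation $\dot\sigma_p$ for the charts on the tangent spaces is justified by the computation in Equation (23) below:

$$\left. \frac{d}{d\tau} \sigma_{q(0)}(q(\tau)) \right|_{\tau=0} = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ \left. \frac{d}{d\tau} \log(q(\tau)) \right|_{\tau=0} {}^{e}\mathsf{U}_{q(0)} X \right] = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ \delta q(0)\, {}^{e}\mathsf{U}_{q(0)} X \right] = \dot\sigma_{q(0)}(\delta q(0)). \tag{23}$$

The velocity at $\tau = 0$ is $\delta q(0) = \left. \frac{d}{d\tau} \log(q(\tau)) \right|_{\tau=0} \in T_{q(0)} \mathcal{E}_{\mathcal{V}}$ and:

$$\left. \frac{d}{d\tau} \theta(\tau) \right|_{\tau=0} = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ \left. \frac{d}{d\tau} \log(q(\tau)) \right|_{\tau=0} {}^{e}\mathsf{U}_{q(0)} X \right] = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ \delta q(0)\, {}^{e}\mathsf{U}_{q(0)} X \right], \tag{24}$$

which is consistent with both the definition of tangent space as set of Fisher scores and with the chart of the tangent bundle as defined in Equation (22).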
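The charts of Equations (12) and (13), the divergence identity of Equation (14) and the change of chart of Equation (18) can be checked numerically on a small state space. The following is a minimal sketch, assuming the sample space $\{+1, -1\}^2$ with sufficient statistics $X_1, X_2$ and densities stored as probability vectors with respect to the counting measure; the function names are hypothetical.

```python
import numpy as np

# Sample space {+1,-1}^2; column j of X holds the values of X_j on the four points.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

def centered(p):
    """Centering at p applied to the basis: X - E_p[X]."""
    return X - p @ X

def fisher(p):
    """Equation (9): covariance matrix of the basis under p."""
    Xc = centered(p)
    return Xc.T @ (p[:, None] * Xc)

def psi(p, theta):
    """Equation (14): log of E_p[exp(sum_j theta_j (X_j - E_p[X_j]))]."""
    return np.log(np.sum(p * np.exp(centered(p) @ theta)))

def e_chart(p, theta):
    """Equation (13): the density with coordinates theta in the chart centered at p."""
    return p * np.exp(centered(p) @ theta - psi(p, theta))

def sigma(p, q):
    """Equation (12): coordinates of q in the chart centered at p."""
    return np.linalg.solve(fisher(p), centered(p).T @ (p * np.log(q / p)))

p_bar = np.full(4, 0.25)
theta_bar = np.array([0.3, -0.2])
theta = np.array([0.5, 1.0])
p = e_chart(p_bar, theta_bar)          # p has coordinates theta_bar at p_bar
q = e_chart(p, theta)                  # q has coordinates theta at p
kl = np.sum(p * np.log(p / q))         # D(p || q)
```

Here `sigma(p, q)` recovers `theta`, `sigma(p_bar, q)` returns `theta + theta_bar` as in Equation (18), and `kl` agrees with `psi(p, theta)` as in Equation (14).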
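The velocity at a generic $\tau$ is $\delta q(\tau) = \frac{d}{d\tau} \log(q(\tau)) \in T_{q(\tau)} \mathcal{E}_{\mathcal{V}}$ and has coordinates in the chart centered at $q(0)$:

$$\frac{d}{d\tau} \theta(\tau) = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ \frac{d}{d\tau} \log(q(\tau))\, {}^{e}\mathsf{U}_{q(0)} X \right] = I_B^{-1}(q(0))\, \mathrm{E}_{q(0)}\left[ \delta q(\tau)\, {}^{e}\mathsf{U}_{q(0)} X \right]. \tag{25}$$

If $V, W$ are vector fields on $T\mathcal{E}_{\mathcal{V}}$, i.e., $V(p), W(p) \in T_p \mathcal{E}_{\mathcal{V}} = {}^{e}\mathsf{U}_p \mathcal{V}$, $p \in \mathcal{E}_{\mathcal{V}}$, we define a Riemannian metric $g(V, W)$ by:

$$g(V, W)(p) = g_p(V(p), W(p)) = \mathrm{E}_p[V(p) W(p)] \tag{26}$$

In coordinates at $p$, $V(p) = \sum_j \dot\sigma_p^j(V)\, {}^{e}\mathsf{U}_p X_j$, $W(p) = \sum_j \dot\sigma_p^j(W)\, {}^{e}\mathsf{U}_p X_j$, so that:

$$g_p(V(p), W(p)) = \dot\sigma_p(V)'\, I_B(p)\, \dot\sigma_p(W). \tag{27}$$

2.4. Gradients

Given a function $\phi \colon \mathcal{E}_{\mathcal{V}} \to \mathbb{R}$, let $\phi_p = \phi \circ e_p$, $e_p = \sigma_p^{-1}$, be its representation in the chart centered at $p$:

$$\phi_p = \phi \circ e_p \colon\ \mathbb{R}^m \xrightarrow{\ e_p\ } \mathcal{E}_{\mathcal{V}} \xrightarrow{\ \phi\ } \mathbb{R} \tag{28}$$

The derivative of $\theta \mapsto \phi_p(\theta)$ at $\theta = 0$ along $\alpha \in \mathbb{R}^m$ is:

$$\nabla\phi_p(0)\,\alpha = \nabla\phi_p(0)\, I_B^{-1}(p)\, I_B(p)\, \alpha = \left( I_B^{-1}(p)\, \nabla\phi_p(0)' \right)' I_B(p)\, \alpha = g_p\left( I_B^{-1}(p)\, \nabla\phi_p(0)',\ \alpha \right). \tag{29}$$

The mapping $\widetilde\nabla\phi \colon p \mapsto I_B^{-1}(p)\, (\nabla\phi_p(0))' \in \mathbb{R}^m$ that appears in Equation (29) is Amari's natural gradient of $\phi \colon \mathcal{E}_{\mathcal{V}} \to \mathbb{R}$; see [15]. It is a standard notion in Riemannian geometry; cf. [4] (p. 46).

More generally, the derivative of $\theta \mapsto \phi_p(\theta)$ at $\theta$ along $\alpha \in \mathbb{R}^m$ is:

$$\nabla\phi_p(\theta)\,\alpha = \nabla\phi_p(\theta)\, I_B^{-1}(e_p(\theta))\, I_B(e_p(\theta))\, \alpha = \left( I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta)' \right)' I_B(e_p(\theta))\, \alpha = g_{e_p(\theta)}\left( I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta)',\ \alpha \right). \tag{30}$$

Let us compare $\nabla\phi_q(0)$ and $\nabla\phi_p(\theta)$ when $q = e_p(\theta)$. As $\phi_p = \phi \circ e_p$ and $\phi_q = \phi \circ e_q$, we have the change of charts:

$$\phi_q = \phi \circ e_q = \phi \circ e_p \circ \sigma_p \circ e_q = \phi_p \circ \sigma_p \circ e_q, \tag{31}$$

hence $\nabla\phi_q(0) = \nabla\phi_p(\sigma_p(q))\, J(\sigma_p \circ e_q)(0)$, where $J(\sigma_p \circ e_q)$ is the Jacobian of $\sigma_p \circ e_q$. As $\sigma_p \circ e_q(\theta) = \theta + \sigma_p(q)$, we have $J(\sigma_p \circ e_q) = \operatorname{Id}$, and in conclusion, $\nabla\phi_{e_p(\theta)}(0) = \nabla\phi_p(\theta)$. For all $p \in \mathcal{E}_{\mathcal{V}}$ and $\theta \in \mathbb{R}^m$,

$$\widetilde\nabla\phi(e_p(\theta)) = I_B^{-1}(e_p(\theta))\, \nabla\phi_p(\theta). \tag{32}$$

Alternatively, for all $q, p \in \mathcal{E}_{\mathcal{V}}$, $\widetilde\nabla\phi \colon \mathcal{E}_{\mathcal{V}} \to \mathbb{R}^m$ is defined by:

$$\widetilde\nabla\phi(q) = I_B^{-1}(q)\, \nabla\phi_p(\sigma_p(q)). \tag{33}$$

The Riemannian gradient of $\phi \colon \mathcal{E}_{\mathcal{V}} \to \mathbb{R}$ is the vector field $\nabla\phi$, such that $D_Y \phi = g(\nabla\phi, Y)$. Note that the Riemannian gradient takes values in the tangent bundle, while the natural gradient takes values in $\mathbb{R}^m$. We compute the Riemannian gradient at $p$ as follows.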
If y = ˙σp(Y(p)), DYφ(p) = dφp(0)y = gp( �∇φ(p), y) = Ep [∇φ(p)Y(p)] , (34) hence �∇φ(p) = I−1 B (p)∇φp(0)′ is the representation in the chart centered at p of the vector field ∇φ: EV. Explicitly, we have (see Equation (22)), �∇φ(p) = I−1 B (p)(∇φp(0))′ = I−1 B (p) Ep �∇φ(p) eUpX � , (35) ∇φ(p) = ∑ j ( �∇φ(p))j eUpXj (36) The Euclidean gradient ∇φp(θ) is sometimes called the “vanilla gradient.” It is equal to the covariance between the Riemannian gradient ∇φ(p) and the basis X, (∇φp(0))′ = Ep [∇φ(p) eUpX]. We summarize in a display the relations between our three gradients: Euclidean ∇φp(0), natural �∇φ(p) and Riemannian ∇φ(p). TEV (σp, ˙σp)� π � R2m π1 � EV σp � Rm TpEV ˙σp � Rm IB(p) � EV ∇φ(p) � ∇φp(0) � Rm ˙σp ◦ ∇φ(p) = I−1 B ∇φp(0) = �∇φ(p) (37) In the following, we shall frequently use the fact that the representation of the gradient vector field ∇φ in a generic chart centered at p is: (∇φ)p(θ) = ˙σp(∇φ(ep(θ))) = ( �∇φ)(ep(θ)) = I−1 B,p(θ)∇φp(θ). (38) It should be noted that the leftmost term (∇φ)p(θ) is the presentation of the gradient in the charts of the tangent bundle, while in the rightmost term, ∇φp(θ) denotes the Euclidean gradient of the presentation of the function φ in the charts of the manifold. 216 Entropy 2014, 16, 4260–4289 2.4.1. Expectation Parameters As ψp is strictly convex, the gradient mapping θ �→ (∇ψp(θ))′ is a homeomorphism from the space of parameters Rm to the interior of the convex set generated by the image of eUpX; see [1]. The function μp : EV defined by: μp(q) = Eq �eUpX � = Eq [X] − Ep [X] = (∇ψp(θ))′, θ = σp(q) (39) is a chart for all p ∈ EV. The value of the inverse q = Lp(μ) is characterized as the unique q ∈ EV, such that μ = Eq [eUpX], i.e., the maximum likelihood estimator. Let us compute the change of chart from p to ¯p: μ ¯p ◦ μ−1 p (η) = ¯η = η + Ep [X] − E ¯p [X] . (40) In fact, μ = ELp(μ) [eUpX] and ¯μ = μ ¯p(Lp(μ)) = ELp(μ) �eU ¯pX � . 
We do not discuss here the rich theory started in [2] about the duality between σp and μp. We limit ourselves to the computation of the Riemannian gradient in the expectation parameters. If φ: EV, φp(θ) = φ ◦ ep(θ) = φ ◦ Lp ◦ μp ◦ ep(θ) = (φ ◦ Lp) ◦ (∇ψp)(θ), (41) because μp ◦ ep(θ) = Eep(θ) [eUpX] = ∇φp(θ), hence: ∇φp(θ) = ∇(φ ◦ Lp)(∇ψp(θ)) Hess ψp(θ), (42) �∇φ(p) = IV(p)−1(∇(φ ◦ Lp)(0) Hess ψp(0))′ = (∇(φ ◦ Lp)(0))′, (43) ∇φ(p) = ∇(φ ◦ Lp)(0) eUpX, (44) that is, the natural gradient �∇φ at p = Lp(μ) is equal to the Euclidean gradient of μ �→ φ ◦ Lp(μ) at μ = 0. 2.4.2. Vector Fields If V is a vector field of TEV and φ: EV is a real function, then we define the action of V on φ, ∇Vφ, to be the real function: ∇Vφ: EV ∋ p �→ ∇Vφ(p) = ∇φp(0) ˙σp (V(p)) . (45) We prefer to avoid the standard notation Vφ, because in our setting, V(p) is a random variable, and the product V(p)φ(p) is otherwise defined as the ordinary product. Let us represent ∇Vφ in the chart centered at p: (∇Vφ)p(θ) = ∇Vφ(ep(θ)) = ∇φep(θ)(0) ˙σep(θ) � V(ep(θ)) � = ∇φp(θ)Vp(θ), (46) where we have used the equality ∇φep(θ)(0) = ∇φp(θ) and Vp(θ) = ˙σep(θ) � V(ep(θ)) � . If W is a vector field, we can compute ∇W∇Vφ at p as: ∇W∇Vφ(p) = ∇(∇Vφ)p(0) ˙σp(W(p)) = Vp(0)′ Hess φp(0)Wp(0) + ∇φp(0)JVp(0)Wp(0), (47) where J denotes the Jacobian matrix. The Lie bracket [W, V]φ (see [7] (§4.2), [8] (V, §1), [4] (Section 5.3.1)) is given by: [W, V]φ(p) = ∇W∇Vφ(p) − ∇V∇wφ(p) = ∇φp(0) � JVp(0)Wp(0) − JWp(0)Vp(0) � , (48) because of Equation (47) and the symmetry of the Hessian. 217 Entropy 2014, 16, 4260–4289 The flow of the smooth vector field V : EV is a family of curves γ(t, p), p ∈ EV, t ∈ Jp, Jp open real interval containing zero, such that for all p ∈ EV and t ∈ Jp, γ(0, p) = p, (49) δγ(t, p) = V(γ(t, p)). (50) As uniqueness holds in Equation (50) (see [8] (VI, §1) or [7] (§4.1)), we have semi-group property γ(s + t, p) = γ(s, γ(t, p)), and Equation (50) is equivalent to δγ(0, p) = V(γ(0, p)), p ∈ EV. 
If a flow of $V$ is available, we have an interpretation of $\nabla_V\phi$ as a derivative of $\phi$ along $\gamma(t,p)$:

$$\left.\frac{d}{dt}\phi(\gamma(t,p))\right|_{t=0} = \left.\left(\nabla\phi_p(\sigma_p(\gamma(t,p)))\,\frac{d}{dt}\sigma_p(\gamma(t,p))\right)\right|_{t=0} = \nabla\phi_p(0)\,\dot\sigma_p(V(p)) = \nabla_V\phi(p).$$ (51)

2.5. Examples

The following examples show how the formalism of gradients is used in basic computations.

2.5.1. Expectation

Let $f$ be any random variable, and define $F\colon \mathcal{E}_V \to \mathbb{R}$ by $F(p) = \mathrm{E}_p[f]$. In the chart centered at $p$, we have:

$$F_p(\theta) = \int f \exp\Bigl(\sum_j \theta_j\,{}^{e}\mathbb{U}_p X_j - \psi_p(\theta)\Bigr)\cdot p\; d\mu$$ (52)

and the Euclidean gradient:

$$\nabla F_p(0) = \operatorname{Cov}_p(f, X) \in (\mathbb{R}^m)'.$$ (53)

The natural gradient is:

$$\widetilde{\nabla} F(p) = \operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, f) \in \mathbb{R}^m,$$ (54)

and the Riemannian gradient is:

$$\nabla F(p) = (\widetilde{\nabla} F(p))'\,{}^{e}\mathbb{U}_p X = \operatorname{Cov}_p(f, X)\operatorname{Cov}_p(X, X)^{-1}\,{}^{e}\mathbb{U}_p X \in T_p\mathcal{E}_V.$$ (55)

From Equation (55), it follows that $\nabla F(p)$ is the $L^2(p)$-projection of $f$ onto ${}^{e}\mathbb{U}_p V$, while the components of $\widetilde{\nabla} F(p)$ in Equation (54) are the coordinates of the projection.

Let us consider the family of curves:

$$\gamma(t,p) = \exp\Bigl(\sum_{j=1}^m t\,(\widetilde{\nabla} F(p))_j\,{}^{e}\mathbb{U}_p X_j - \psi_p(t\,\widetilde{\nabla} F(p))\Bigr)\cdot p, \quad t \in \mathbb{R}.$$ (56)

The velocity is:

$$\delta\gamma(t,p) = \frac{d}{dt}\Bigl(\sum_{j=1}^m t\,(\widetilde{\nabla} F(p))_j\,{}^{e}\mathbb{U}_p X_j - \psi_p(t\,\widetilde{\nabla} F(p))\Bigr) = \nabla F(p) - \mathrm{E}_{\gamma(t,p)}[\nabla F(p)],$$ (57)

which is different from $\nabla F(\gamma(t,p))$, unless $f \in V \oplus \mathbb{R}$. Hence, $\gamma$ is not, in general, the flow of $\nabla F$, but it is a local approximation, as $\delta\gamma(0,p) = \nabla F(p)$. These computations are the basis of model-based methods in combinatorial optimization; see [10–14].

2.5.2. Binary Independent Variables

Here, we present, in full generality, the toy example of the Introduction; see [17] for more information on the application to combinatorial optimization. Our example is a very special case of Ising exactly solvable models [18]; our aim here is to explore the geometric framework.

Let $\Omega = \{+1, -1\}^m$ with counting measure $\mu$, and let the space $V$ be generated by the coordinate projections $B = \{X_1, \dots, X_d\}$ (here $d = m$). Note that we use here the coding $+1, -1$ (from physics) instead of the coding $0, 1$, which is more common in combinatorial optimization.
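As a sanity check of Equations (54) and (55) on the sample space just introduced, the following minimal sketch enumerates $\{+1,-1\}^m$ for a small $m$, computes the coordinates $\operatorname{Cov}_p(X,X)^{-1}\operatorname{Cov}_p(X,f)$, and evaluates the covariance of the projection residual with each $X_k$, which should vanish. The function names (`independent_density`, `natural_gradient`, `projection_residual`) and the choice of an independent density as the base point $p$ are illustrative assumptions, not part of the original paper.

```python
import itertools
import math


def independent_density(theta):
    """Points of {+1,-1}^m and the density p(x) proportional to exp(sum_j theta_j x_j)."""
    omega = list(itertools.product([1, -1], repeat=len(theta)))
    weights = [math.exp(sum(t * x for t, x in zip(theta, xs))) for xs in omega]
    total = sum(weights)
    return omega, [w / total for w in weights]


def expectation(omega, p, g):
    """E_p[g] by direct enumeration of the sample space."""
    return sum(pi * g(xs) for xs, pi in zip(omega, p))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def moments(theta, f):
    """Return omega, p, E[X], E[f], Cov(X, f) and Cov(X, X) under p."""
    omega, p = independent_density(theta)
    m = len(theta)
    mean_x = [expectation(omega, p, lambda xs, j=j: xs[j]) for j in range(m)]
    mean_f = expectation(omega, p, f)
    cov_xf = [expectation(omega, p, lambda xs, j=j: (xs[j] - mean_x[j]) * (f(xs) - mean_f))
              for j in range(m)]
    cov_xx = [[expectation(omega, p, lambda xs, i=i, j=j:
                           (xs[i] - mean_x[i]) * (xs[j] - mean_x[j]))
               for j in range(m)] for i in range(m)]
    return omega, p, mean_x, mean_f, cov_xf, cov_xx


def natural_gradient(theta, f):
    """Coordinates Cov(X,X)^{-1} Cov(X,f) of the projection, as in Equation (54)."""
    _, _, _, _, cov_xf, cov_xx = moments(theta, f)
    return solve(cov_xx, cov_xf)


def projection_residual(theta, f):
    """E_p[(f - E_p f - grad F(p)) (X_k - E_p X_k)] for each k; zero for a projection."""
    omega, p, mean_x, mean_f, _, _ = moments(theta, f)
    coords = natural_gradient(theta, f)
    m = len(theta)

    def residual(xs):
        proj = sum(c * (xs[j] - mean_x[j]) for j, c in enumerate(coords))
        return f(xs) - mean_f - proj

    return [expectation(omega, p, lambda xs, k=k: residual(xs) * (xs[k] - mean_x[k]))
            for k in range(m)]
```

For instance, `natural_gradient([0.3, -0.2], lambda xs: 1 + xs[0] + 2 * xs[1] + 3 * xs[0] * xs[1])` returns the two coordinates of the projection for a quadratic $f$, and `projection_residual` with the same arguments returns values at the level of rounding error.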
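The exponential family is $\mathcal{E}_V = \bigl\{\exp\bigl(\sum_{j=1}^m \theta_j X_j - \psi_\lambda(\theta)\bigr)\cdot 2^{-m}\bigr\}$, with $\lambda(x) = 2^{-m}$ for $x \in \Omega$ being the uniform density. The independence of the sufficient statistics $X_j$ under all distributions in $\mathcal{E}_V$ implies:

$$\psi_\lambda(\theta) = \sum_{j=1}^m \psi(\theta_j), \qquad \psi(\theta) = \log(\cosh(\theta)).$$ (58)

We have:

$$\nabla\psi_\lambda(\theta) = [\tanh(\theta_j)\colon j = 1, \dots, d] = \eta_\lambda(\theta),$$ (59)

$$\operatorname{Hess}\psi_\lambda(\theta) = \operatorname{diag}\bigl(\cosh^{-2}(\theta_j)\colon j = 1, \dots, d\bigr) = \operatorname{diag}\bigl(e^{-2\psi(\theta_j)}\colon j = 1, \dots, d\bigr) = I_{B,\lambda}(\theta),$$ (60)

$$I_{B,\lambda}(\theta)^{-1} = \operatorname{diag}\bigl(\cosh^{2}(\theta_j)\colon j = 1, \dots, d\bigr) = \operatorname{diag}\bigl(e^{2\psi(\theta_j)}\colon j = 1, \dots, d\bigr).$$ (61)

The quadratic function $f(X) = a_0 + \sum_j a_j X_j + \sum_{\{i,j\}} a_{i,j} X_i X_j$ has expected value at $p = e_\lambda(\theta)$, i.e., relaxed value, equal to:

$$F(p) = F_\lambda(\theta) = \mathrm{E}_\theta[f(X)] = a_0 + \sum_j a_j \tanh(\theta_j) + \sum_{\{i,j\}} a_{i,j}\tanh(\theta_i)\tanh(\theta_j),$$ (62)

and covariance with $X_k \in B$ equal to:

$$\operatorname{Cov}_\theta(f(X), X_k) = \sum_j a_j \operatorname{Cov}_\theta(X_j, X_k) + \sum_{\{i,j\}} a_{i,j}\operatorname{Cov}_\theta(X_i X_j, X_k) = a_k \operatorname{Var}_\theta(X_k) + \sum_{i \neq k} a_{i,k}\,\mathrm{E}_\theta[X_i]\operatorname{Var}_\theta(X_k) = \cosh^{-2}(\theta_k)\Bigl(a_k + \sum_{i \neq k} a_{i,k}\tanh(\theta_i)\Bigr).$$ (63)

In the computation, we have used the independence and the special algebra of $\pm 1$, which implies $X_i^2 = 1$, so that $\operatorname{Cov}_\theta(X_i X_j, X_k) = 0$ if $i, j \neq k$; otherwise, $\operatorname{Cov}_\theta(X_i X_k, X_k) = \mathrm{E}_\theta[X_i] - \mathrm{E}_\theta[X_i]\,\mathrm{E}_\theta[X_k]^2$; see [13].

The Euclidean gradient, the natural gradient and the Riemannian gradient are, respectively:

$$\nabla F_\lambda(\theta) = \Bigl[\cosh^{-2}(\theta_j)\Bigl(a_j + \sum_{i \neq j} a_{i,j}\tanh(\theta_i)\Bigr)\colon j = 1, \dots, d\Bigr],$$ (64)

$$\widetilde{\nabla} F(e_\lambda(\theta)) = \Bigl[a_j + \sum_{i \neq j} a_{i,j}\tanh(\theta_i)\colon j = 1, \dots, d\Bigr],$$ (65)

$$\nabla F(e_\lambda(\theta)) = \sum_{j=1}^m \Bigl(a_j + \sum_{i \neq j} a_{i,j}\,\mathrm{E}_\theta[X_i]\Bigr)\bigl(X_j - \mathrm{E}_\theta[X_j]\bigr).$$ (66)

The (natural) gradient flow equations are:

$$\dot\theta_j(t) = a_j + \sum_{i \neq j} a_{i,j}\tanh(\theta_i(t)), \quad j = 1, \dots, d.$$ (67)

Equations (64)–(66) are usable in practice if the $a_j$'s and the $a_{i,j}$'s are estimable. Otherwise, one can use Equation (63) and the following forms of the gradients:

$$\nabla F_\lambda(\theta) = \bigl[\operatorname{Cov}_\theta(X_j, f(X))\colon j = 1, \dots, d\bigr],$$ (68)

$$\widetilde{\nabla} F(e_\lambda(\theta)) = \bigl[\cosh^{2}(\theta_j)\operatorname{Cov}_\theta(f(X), X_j)\colon j = 1, \dots, d\bigr],$$ (69)

in which case, the gradient flow equations are:

$$\dot\theta_j(t) = \cosh^{2}(\theta_j)\operatorname{Cov}_\theta(f(X), X_j), \quad j = 1, \dots, d.$$ (70)

Let us study the relaxed function in the expectation parameters $\eta_j = \eta_j(\theta)$, $j = 1, \dots, d$:

$$F_\lambda(\eta) = a_0 + \sum_j a_j\eta_j + \sum_{\{i,j\}} a_{i,j}\eta_i\eta_j, \quad \eta \in \left]-1, +1\right[^m.$$ (71)

The Euclidean gradient with respect to $\eta$ has components:

$$\partial_j F_\lambda(\eta) = a_j + \sum_{i \neq j} a_{i,j}\eta_i,$$ (72)

which are equal to the components of the natural gradient; see Section 2.4.1. As:

$$\dot\eta_j(t) = \frac{d}{dt}\tanh(\theta_j(t)) = \cosh^{-2}(\theta_j(t))\,\dot\theta_j(t) = \bigl(1 - \eta_j(t)^2\bigr)\dot\theta_j(t), \quad j = 1, \dots, m,$$ (73)

the gradient flow expressed in the $\eta$-parameters has equations:

$$\dot\eta_j(t) = \bigl(1 - \eta_j(t)^2\bigr)\Bigl(a_j + \sum_{i \neq j} a_{i,j}\eta_i(t)\Bigr), \quad j = 1, \dots, d.$$ (74)

Alternatively, in vector form:

$$\dot\eta(t) = \operatorname{diag}\bigl(1 - \eta_j(t)^2\colon j = 1, \dots, d\bigr)\,(a + A\eta(t)),$$ (75)

where $a = [a_j\colon j = 1, \dots, d]^t$ and $A_{i,j} = 0$ if $i = j$, $A_{i,j} = a_{i,j}$ otherwise. The matrix $A$ is symmetric with zero diagonal, and it has the meaning of the adjacency matrix of the (weighted) interaction graph. We do not know a closed-form solution of Equation (74). An example of a numerical solution is shown in Figure 3.

2.5.3. Escort Probabilities

For a given $a > 0$, consider the function $C^{(a)}\colon \mathcal{E}_V \to \mathbb{R}$ defined by $C^{(a)}(p) = \int p^a\, d\mu$. We have:

$$C^{(a)}_p(\theta) = \int \exp\Bigl(a\sum_{j=1}^m \theta_j\,{}^{e}\mathbb{U}_p X_j - a\psi_p(\theta)\Bigr) p^a\, d\mu$$ (76)

and:

$$dC^{(a)}_p(0)\alpha = \int a\Bigl(\sum_{j=1}^m \alpha_j\,{}^{e}\mathbb{U}_p X_j\Bigr) p^a\, d\mu = \sum_{j=1}^m \alpha_j \int a\,{}^{e}\mathbb{U}_p X_j\, p^a\, d\mu = \sum_{j=1}^m \alpha_j \operatorname{Cov}_p\bigl(X_j, a p^{a-1}\bigr),$$ (77)

that is, the Euclidean gradient is $\nabla C^{(a)}_p(0) = \operatorname{Cov}_p\bigl(a p^{a-1}, X\bigr)$ (row vector). The natural gradient is computed from Equation (35) as:

$$\widetilde{\nabla} C^{(a)}(p) = I_B^{-1}(p)\bigl(\nabla C^{(a)}_p(0)\bigr)' = \operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p\bigl(X, a p^{a-1}\bigr),$$ (78)

while the Riemannian gradient follows from Equation (36):

$$\nabla C^{(a)}(p) = \operatorname{Cov}_p\bigl(a p^{a-1}, X\bigr)\operatorname{Cov}_p(X, X)^{-1}\,{}^{e}\mathbb{U}_p X.$$ (79)

Note that the Riemannian gradient is the orthogonal projection of the random variable $a p^{a-1}$ onto the tangent space $T_p\mathcal{E}_V = {}^{e}\mathbb{U}_p V$.

The probability density $p^a/C^{(a)}(p)$ is called the escort density in the literature on non-extensive statistical mechanics; see, e.g., [19] (Section 7.4). We now compute the tangent mapping of $\mathcal{E}_V \ni p \mapsto p^a/C^{(a)}(p) \in \mathcal{P}_>$.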
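Before continuing, the gradient flow of Section 2.5.2 in the expectation parameters, Equation (75), can be integrated numerically. The sketch below uses a plain forward Euler scheme, which is an assumption made here for brevity (the paper does not state which integrator produced its figure); the names `relaxed_value`, `flow_rhs` and `euler_flow`, the step size and the number of steps are all illustrative choices.

```python
def relaxed_value(eta, a, A, a0=0.0):
    """F(eta) = a0 + sum_j a_j eta_j + sum_{i<j} A_ij eta_i eta_j, as in Equation (71)."""
    m = len(eta)
    pair = sum(A[i][j] * eta[i] * eta[j] for i in range(m) for j in range(i + 1, m))
    return a0 + sum(aj * ej for aj, ej in zip(a, eta)) + pair


def flow_rhs(eta, a, A):
    """Right-hand side of Equation (75): diag(1 - eta_j^2) (a + A eta)."""
    m = len(eta)
    return [(1.0 - eta[j] ** 2) * (a[j] + sum(A[j][i] * eta[i] for i in range(m)))
            for j in range(m)]


def euler_flow(eta0, a, A, dt=0.01, steps=200):
    """Forward Euler trajectory of the gradient flow; returns the list of visited points."""
    traj = [list(eta0)]
    for _ in range(steps):
        eta = traj[-1]
        rhs = flow_rhs(eta, a, A)
        traj.append([e + dt * r for e, r in zip(eta, rhs)])
    return traj
```

With `a = [1.0, 2.0]` and `A = [[0.0, 3.0], [3.0, 0.0]]`, calling `euler_flow([0.0, 0.0], a, A)` yields a trajectory that stays inside the open square and along which `relaxed_value` increases. The factor $1 - \eta_j^2$ slows the motion near the boundary, so the trajectory approaches a vertex without reaching it.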
Let us extend the basis X1, . . . , Xm to a basis X1, . . . , Xn, n ≥ m, whose exponential family is full, i.e., equal to P>. The non-parametric coordinate of q = � exp � ∑m j=1 θj eUpXj − ψp(θ) � p �a /C(a) p (θ) in the chart centered at ¯p = pa/C(a) p (0) is the ¯p-centering of the random variable: log � q ¯p � = log ⎛ ⎜ ⎝ � exp � ∑m j=1 θj eUpXj − ψp(θ) � p �a /C(a) p (θ) pa/C(a) p (0) ⎞ ⎟ ⎠ = a m ∑ j=1 θj eUpXj − aψp(θ) + ln C(a) ( 0) − ln C(a) p (θ), (80) that is, v = a m ∑ j=1 θj eU ¯pXj. (81) The coordinates of v in the basis eU ¯pX1, . . . , eU ¯pXn are (aθ1, . . . , aθm, 0, . . . , 0), and the Jacobian of θ �→ (aθ, 0n−m) is the m × n matrix [aIm|0m×(n−m)]. 221 Entropy 2014, 16, 4260–4289 2.5.4. Polarization Measure The polarization measure has been introduced in Economics by [20]. Here, we consider the qualitative version of [21]. If π is a distribution of a finite set, the probability that in three independent samples from π there are exactly two equal is 3 ∑j π2 j (1 − πj). If p ∈ EV, define: G(p) = � p2(1 − p) dμ = C(2)(p) − C(3)(p), (82) where C(2) and C(3) are defined as in Example 2.5.3. From Equation (78), we find the natural gradient: �∇G(p) = Covp (X, X)−1 Covp � X, 2p − 3p2� . (83) Note that �∇G(p) = 0 if p is constant; see Figure 4. Figure 4. Normalized polarization. 3. Second Order Calculus In this section, we turn to considering second order calculus, in particular Hessians, in order to prepare the discussion of the Newton method for the relaxed optimization of Section 4. 3.1. Metric Derivative (Levi–Civita connection) Let V, W : EV be vector fields, that is, V(p), W(p) ∈ TpEV = eUpV, p ∈ EV. Consider the real function R = g(V, W): EV → R, whose value at p ∈ EV is R(p) = gp(V(p), W(p)) = Ep [V(p)W(p)]. Assuming smoothness, we want to compute the derivative of R along the vector field Y : EV, that is, (DYR)(p) = dRp(0)α, with α = ˙σp(Y(p)). 
The expression of R in the chart centered at p is, according to Equation (27), θ �→ Rp(θ) = ˙σp(V(ep(θ)))′IB(ep(θ)) ˙σp(W(ep(θ))) = Vp(θ)′IB,p(θ)Wp(θ), (84) where Vp and Wp are the presentation in the chart of the vector fields V and W, respectively. 222 Entropy 2014, 16, 4260–4289 The i-th component ∂iRp(θ) of the Euclidean gradient ∇Rp(θ) is: ∂iRp(θ) = ∂i � Vp(θ)′IB,p(θ)Wp(θ) � = ∂iVp(θ)′IB,p(θ)Wp(θ) + Vp(θ)′∂iIB,p(θ)Wp(θ) + Vp(θ)′IB,p(θ)∂iWp(θ) = � ∂iVp(θ) + 1 2 I−1 B,p(θ)∂iIB,p(θ)Vp(θ) �′ IB,p(θ)Wp(θ)+ Vp(θ)′IB,p(θ) � ∂iWp(θ) + 1 2 I−1 B,p(θ)∂iIB,p(θ)Wp(θ) � , (85) so that the derivative at θ along α = ˙σep(θ)(Y(ep(θ))) is: dRp(θ)α = � dVp(θ)α + 1 2 I−1 B,p(θ) � dIB,p(θ)α � Vp(θ) �′ IB,p(θ)Wp(θ)+ Vp(θ)′IB,p(θ) � dWp(θ)α + 1 2 I−1 B,p(θ) � dIB,p(θ)α � Wp(θ) � . (86) Proposition 1. If we define DYV to be the vector field on EV, whose value at q = ep(θ) has coordinates centered at p given by: ˙σp(DYV(q)) = dVp(θ)α + 1 2 I−1 B (p) � dIB,p(θ)α � Vp(θ), α = ˙σp(Y(q)), (87) then: DYg(V, W) = g(DYV, W) + g(V, DYW), (88) i.e., Equation (87) is a metric covariant derivative; see [6] (Ch. 2 §3), [8] (VIII §4), [4] (§5.3.2). The metric derivative Equation (87) could be computed from the flow of the vector field Y. Let (t, p) �→ γ(t, p) be the flow of the vector field V, i.e., δγ(t, p) = V(γ(t, p)) and γ(0, p) = p. Using Equation (23), we have: d dt ˙σ(V(γ(t, p))) ���� t=0 = d dtVp(σp(γ(t, p))) ���� t=0 = dVp(σp(γ(t, p))) d dtσp(γ(t, p)) ���� t=0 = dVp(0) ˙σp(δγ(0, p)) = dVp(0) ˙σp(Y(p)), (89) and: d dt IV(γ(t, p)) ���� t=0 = d dt IB,p(σpγ(t, p)) ���� t=0 = dIB,p(0) ˙σp(δγ(0, p)) = dIB,p(0) ˙σp(Y(p))Vp(0), (90) so that: ˙σ(DYV(p)) = d dt ˙σV(γ(t, p)) ���� t=0 + 1 2 I−1 V (p) d dt IV(γ(t, p)) ���� t=0 . (91) Let us check the symmetry of the metric covariant derivative to show that it is actually the unique Riemannian or Levi–Civita affine connection; see [6] (Th. 3.6). 
The Lie bracket of the vector fields V and W is the vector field [V, W], whose coordinates are: [V, W]p(θ) = dVp(0)˙σp(W(p)) − dWp(0) ˙σp(V(p)). (92) 223 Entropy 2014, 16, 4260–4289 As the ij entry of ∂kIB,p(0) is ∂k∂i∂jψp(0), then the symmetry (dIB,p(0)α)β = (dIB,p(0)β)α holds, and we have: ˙σp (DWV(p) − DVW(p)) = dVp(0)˙σp(W(p)) + 1 2 I−1 B (p) � dIB,p(0) ˙σp(W(p)) � Vp(0) − dWp(0) ˙σp(V(p)) − 1 2 I−1 B (p) � dIB,p(0) ˙σp(V(p)) � Wp(0) = ˙σ[V, W](p). (93) The term Γk(p) = 1 2 I−1 p (0)∂kdIB,p(0) of Equation (87) is sometimes referred to as the Christoffel matrix, but we do not use this terminology in this paper. As: IB,p(θ) = IB(ep(θ)) = � Covep(θ) � Xi, Xj �� i,j=1,...,m = � ∂i∂jψp(θ) � i,j=i,...,m , (94) we have ∂kIB(ep(θ)) = [∂i∂j∂kψp(θ)]i,j=i,...,m = � Covep(θ) � Xi, Xj, Xk �� i,j=i,...,m and: Γk(p) = 1 2 � Covp � Xi, Xj ��−1 i,j=i,...,m � Covp � Xi, Xj, Xk �� i,j=i,...,m (95) . If V, W are vector fields of TEV, we have: Γ(p, V, W) = 1 2 I−1 B (p) Covp (X, V, W) = 1 2 I−1 B (p) Ep �eUpXVW � , (96) which is the projection of V(p)W(p)/2 on eUpV. Notice also that: (dI−1 p (0)α)IB,p(0) = −I−1 p (0)(dIB,p(0)α)I−1 p (0)IB,p(0)y = −I−1 p (0) � dIB,p(0)α � . (97) 3.2. Acceleration Let p(t), t ∈ I, be a smooth curve in EV. Then, the velocity δp(t) = d dt log (p(t)) is a vector field V(p(t)) = δp(t), defined on the support p(I) of the curve. As the curve is the flow of the velocity field, we can compute the metric derivative of the velocity along the the velocity itself Dδpδp from Equation (91) with V(p(0)) = δp(0); we can use Equation (91) to get: ˙σp(Dδpδp)(p(0)) = d dt ˙σp(0) (δ(p(t))) ���� t=0 + 1 2 I−1 B (p(0)) d dt IB(p(t)) ���� t=0 = d2 dt2 σp(0)(p(t)) ���� t=0 + 1 2 I−1 B (p(0)) d dt IB(p(t)) ���� t=0 . (98) which can be defined to be the Riemannian acceleration of the curve at t = 0. 
Let us write $\theta(t) = \sigma_p(p(t))$, $p = p(0)$ and:

$$p(t) = \exp\Bigl(\sum_{j=1}^m \theta_j(t)\,{}^{e}\mathbb{U}_p X_j - \psi_p(\theta(t))\Bigr)\cdot p,$$ (99)

so that $\dot\sigma_p(\delta p)(0) = \dot\theta(0)$ and $\left.\frac{d^2}{dt^2}\sigma_p(p(t))\right|_{t=0} = \ddot\theta(0)$. We have:

$$\left.\frac{d}{dt} I_B(p(t))\right|_{t=0} = \left.\frac{d}{dt} I_{B,p}(\theta(t))\right|_{t=0} = \left.\frac{d}{dt}\operatorname{Hess}\psi_p(\theta(t))\right|_{t=0} = \operatorname{Cov}_p\Bigl(X, X, \sum_{j=1}^m \dot\theta_j(0) X_j\Bigr)$$ (100)

so that the acceleration at $p$ has coordinates:

$$\ddot\theta(0) + \frac12\sum_{i,j=1}^m \dot\theta_i(0)\dot\theta_j(0)\operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, X_i, X_j) = \ddot\theta(0) + \frac12\operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p\Bigl(X, \sum_{i=1}^m \dot\theta_i(0) X_i, \sum_{j=1}^m \dot\theta_j(0) X_j\Bigr).$$ (101)

A geodesic is a curve whose acceleration is zero at each point. The exponential map is the mapping $\operatorname{Exp}\colon T\mathcal{E}_V \to \mathcal{E}_V$ defined by:

$$(p, U) \mapsto \operatorname{Exp}_p U = p(1),$$ (102)

where $t \mapsto p(t)$ is the geodesic such that $p(0) = p$ and $\delta p(0) = U$, for all $U$ such that the geodesic exists for $t = 1$.

The exponential map is a particular retraction, that is, a family of mappings $R_p$, $p \in \mathcal{E}$, from the tangent space at $p$ to the manifold, $R_p\colon T_p\mathcal{E} \to \mathcal{E}$, such that $R_p(0) = p$ and $dR_p(0) = \mathrm{Id}$; see [4] (§5.4). It should be noted that exponential manifolds have natural retractions other than $\operatorname{Exp}$, a notable one being the exponential family itself. A retraction provides a crucial step in gradient search algorithms by mapping a direction of increase of the objective function to a new trial point.

3.2.1. Example: Binary Independent 2.5.2 Continued

Let us consider the binary independent model of Section 2.5.2. We have:

$$I_B(e_\lambda(\theta)) = I_{B,\lambda}(\theta) = \operatorname{diag}\bigl(\cosh^{-2}(\theta_j)\colon j = 1, \dots, d\bigr);$$ (103)

it follows that:

$$\partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\bigl(\cosh^{-2}(\theta_j)\colon j = 1, \dots, d\bigr) = -2\cosh^{-3}(\theta_k)\sinh(\theta_k)\,E_{kk},$$ (104)

where $E_{kk}$ is the $d \times d$ matrix with entry one at $(k, k)$ and zero otherwise. The $k$-th Christoffel matrix in the second term of the definition of the metric derivative (also known as the Levi–Civita connection) is:

$$\Gamma^k_B(e_\lambda(\theta)) = \Gamma^k_\lambda(\theta) = \frac12 I^{-1}_{B,\lambda}(\theta)\,\partial_k I_{B,\lambda}(\theta) = -\tanh(\theta_k)\,E_{kk}.$$ (105)

In terms of the moments, we have $I_{B,\lambda}(\theta) = \operatorname{Cov}_\theta(X, X') = \operatorname{Hess}\psi_\lambda(\theta)$.
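The closed form in Equation (105) is easy to confirm numerically. The following sketch, whose function names are illustrative assumptions, differentiates one diagonal entry of the Fisher matrix of Equation (103) by central differences and compares $\tfrac12 I^{-1}\partial_k I$ with $-\tanh(\theta_k)$.

```python
import math


def fisher_entry(theta_k):
    """Diagonal entry cosh^{-2}(theta_k) of the Fisher matrix in Equation (103)."""
    return 1.0 / math.cosh(theta_k) ** 2


def christoffel_finite_difference(theta_k, h=1e-5):
    """(1/2) I^{-1} dI/dtheta_k, with the derivative taken by central differences."""
    d_fisher = (fisher_entry(theta_k + h) - fisher_entry(theta_k - h)) / (2.0 * h)
    return 0.5 * d_fisher / fisher_entry(theta_k)


def christoffel_closed_form(theta_k):
    """The (k, k) entry of the Christoffel matrix in Equation (105)."""
    return -math.tanh(theta_k)
```

The two functions agree up to the discretization error of the finite difference; because the Fisher matrix is diagonal in this model, checking a single diagonal entry is enough.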
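As $\partial_k\partial_i\partial_j\psi_\lambda(\theta) = \operatorname{Cov}_\theta(X_k, X_i, X_j)$, we can write:

$$\partial_k I_{B,\lambda}(\theta) = \partial_k \operatorname{diag}\bigl(\operatorname{Var}_\theta(X_j)\colon j = 1, \dots, d\bigr) = \operatorname{Cov}_\theta(X_k, X_k, X_k)\,E_{kk}$$ (106)

and:

$$\Gamma^k_\lambda(\theta) = \frac12\operatorname{Cov}_\theta(X_k, X_k)^{-1}\operatorname{Cov}_\theta(X_k, X_k, X_k)\,E_{kk} = \frac12\bigl(1 - \eta_k^2\bigr)^{-1}\bigl(-2\eta_k + 2\eta_k^3\bigr)E_{kk} = -\eta_k E_{kk}.$$ (107)

The equations for the geodesics starting from $\theta(0)$ with velocity $\dot\theta(0) = u$ are:

$$\ddot\theta_k(t) + \sum_{i,j=1}^m \Gamma^k_{ij}(\theta(t))\,\dot\theta_i(t)\dot\theta_j(t) = \ddot\theta_k(t) - \tanh(\theta_k(t))\bigl(\dot\theta_k(t)\bigr)^2 = 0, \quad k = 1, \dots, d.$$ (108)

The ordinary differential equation:

$$\ddot\theta - \tanh(\theta)\,\dot\theta^2 = 0$$ (109)

has the closed-form solution:

$$\theta(t) = \operatorname{gd}^{-1}\Bigl(\operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t\Bigr) = \tanh^{-1}\Bigl(\sin\Bigl(\operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t\Bigr)\Bigr)$$ (110)

for all $t$ such that:

$$-\pi/2 < \operatorname{gd}(\theta(0)) + \frac{\dot\theta(0)}{\cosh(\theta(0))}\,t < \pi/2,$$ (111)

where $\operatorname{gd}\colon \mathbb{R} \to \left]-\pi/2, +\pi/2\right[$ is the Gudermannian function, that is, $\operatorname{gd}'(x) = 1/\cosh x$, $\operatorname{gd}(0) = 0$; in closed form, $\operatorname{gd}(x) = \arcsin(\tanh(x))$. In fact, if $\theta$ is a solution of Equation (109), then:

$$\frac{d}{dt}\operatorname{gd}(\theta(t)) = \frac{\dot\theta(t)}{\cosh(\theta(t))}$$ (112)

$$\frac{d^2}{dt^2}\operatorname{gd}(\theta(t)) = -\frac{\sinh(\theta(t))\bigl(\dot\theta(t)\bigr)^2}{\cosh^2(\theta(t))} + \frac{\ddot\theta(t)}{\cosh(\theta(t))} = \frac{1}{\cosh(\theta(t))}\Bigl(\ddot\theta(t) - \tanh(\theta(t))\bigl(\dot\theta(t)\bigr)^2\Bigr) = 0,$$ (113)

so that $t \mapsto \operatorname{gd}(\theta(t))$ coincides (where it is defined) with an affine function characterized by the initial conditions.

In particular, at $t = 1$, the geodesic Equation (110) defines the Riemannian exponential $\operatorname{Exp}\colon T\mathcal{E}_V \to \mathcal{E}_V$. If $(p, U) \in T\mathcal{E}_V$, that is, $p \in \mathcal{E}_V$ and $U \in T_p\mathcal{E}_V$, then $\sigma_\lambda(p) = \theta(0)$ and $U = \sum_j u_j\,{}^{e}\mathbb{U}_p X_j$, $\dot\sigma_\lambda(U) = u$. If:

$$-\pi/2 < \operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)} < \pi/2,$$ (114)

then we can take $\dot\theta(0) = u$ and $t = 1$, so that:

$$\operatorname{Exp}_p\colon U \overset{\dot\sigma_\lambda}{\longmapsto} u \mapsto \Bigl(\operatorname{gd}^{-1}\Bigl(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\Bigr)\colon j = 1, \dots, d\Bigr) \overset{e_\lambda}{\longmapsto} \prod_{j=1}^m \exp\Bigl(\operatorname{gd}^{-1}\Bigl(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\Bigr) X_j - \psi\Bigl(\operatorname{gd}^{-1}\Bigl(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\Bigr)\Bigr)\Bigr)\, 2^{-m}.$$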
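(115)

We have:

$$\exp\bigl(\operatorname{gd}^{-1}(v)\bigr) = \exp\bigl(\tanh^{-1}(\sin(v))\bigr) = \sqrt{\frac{1 + \sin v}{1 - \sin v}}$$ (116)

and:

$$\psi\bigl(\operatorname{gd}^{-1}(v)\bigr) = \log\cosh\bigl(\operatorname{gd}^{-1}(v)\bigr) = \log\Bigl(\frac{1}{\cos v}\Bigr),$$ (117)

hence $u \mapsto \operatorname{Exp}_p\bigl(\sum_{j=1}^d u_j\,{}^{e}\mathbb{U}_p X_j\bigr)$ is given for:

$$u \in \mathop{\times}_{j=1}^{d}\;\Bigl]\cosh(\theta_j)\bigl(-\pi/2 - \operatorname{gd}(\theta_j)\bigr),\; \cosh(\theta_j)\bigl(\pi/2 - \operatorname{gd}(\theta_j)\bigr)\Bigr[,$$ (118)

by:

$$\operatorname{Exp}_\theta(u) = \prod_{j=1}^m \cos\Bigl(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\Bigr)\left(\frac{1 + \sin\bigl(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\bigr)}{1 - \sin\bigl(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\bigr)}\right)^{X_j/2} 2^{-m} = \prod_{j=1}^m \Bigl(1 + \sin\Bigl(\operatorname{gd}(\theta_j) + \frac{u_j}{\cosh(\theta_j)}\Bigr) X_j\Bigr)\, 2^{-m} \in \mathcal{E}_V.$$ (119)

The expectation parameters are:

$$\eta_i(t) = \mathrm{E}_{\theta=0}\Bigl[X_i \prod_{j=1}^m \Bigl(1 + \sin\Bigl(\operatorname{gd}(\theta_j) + \frac{t u_j}{\cosh(\theta_j)}\Bigr) X_j\Bigr)\Bigr] = \sin\Bigl(\operatorname{gd}(\theta_i) + \frac{t u_i}{\cosh(\theta_i)}\Bigr),$$ (120)

and:

$$\operatorname{gd}(\theta_j) = \arcsin(\eta_j), \qquad \cosh(\theta_j) = \frac{1}{\bigl(1 - \eta_j^2\bigr)^{1/2}},$$ (121)

so that the exponential in terms of the expectation parameters is:

$$\operatorname{Exp}_\eta(u) = \Bigl(\sin\Bigl(\arcsin\eta_j + \bigl(1 - \eta_j^2\bigr)^{1/2} u_j\Bigr)\colon j = 1, \dots, m\Bigr).$$ (122)

The inverse of the Riemannian exponential provides a notion of translation between two elements of the exponential model, which is a particular parametrization of the model:

$$\overrightarrow{\eta_1\eta_2} = \operatorname{Exp}^{-1}_{\eta_1}\eta_2 = \Bigl(\bigl(1 - (\eta_1^j)^2\bigr)^{-1/2}\bigl(\arcsin\eta_2^j - \arcsin\eta_1^j\bigr)\colon j = 1, \dots, m\Bigr)$$ (123)

In particular, at $\theta = 0$, we have the geodesic:

$$t \mapsto \prod_{j=1}^d \bigl(1 + \sin(t u_j) X_j\bigr)\, 2^{-m}, \quad |t| < \frac{\pi}{2\max_j |u_j|}.$$ (124)

Some geodesic curves are shown in Figure 5.

Figure 5. Geodesics from $\eta = (0.75, 0.75)$. Axes: $\eta_1$, $\eta_2$ (expectation parameters).

3.3. Riemannian Hessian

Let $\phi\colon \mathcal{E}_V \to \mathbb{R}$ with Riemannian gradient $\nabla\phi(p) = \sum_i (\widetilde{\nabla}\phi)_i(p)\,{}^{e}\mathbb{U}_p X_i$, $\widetilde{\nabla}\phi(p) = I_B^{-1}(p)\nabla\phi_p(0)$. The Riemannian Hessian of $\phi$ is the metric derivative of the gradient $\nabla\phi$ along the vector field $Y$, that is, $\operatorname{Hess}_Y\phi = D_Y\nabla\phi$; see [6] (Ch. 6, Ex. 11), [4] (§5.5). In the following, we denote by the symbol $\operatorname{Hess}$, without a subscript, the ordinary Hessian matrix. From Equation (87), we have the coordinates of $\operatorname{Hess}_Y\phi(p)$.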
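Returning briefly to the geodesics of Section 3.2.1, the closed forms in Equations (110), (122) and (123) can be exercised with a few lines of code. This is only a sketch: the function names are illustrative assumptions, and the ODE check uses simple central differences rather than any method from the paper.

```python
import math


def gd(x):
    """Gudermannian function, gd(x) = arcsin(tanh(x))."""
    return math.asin(math.tanh(x))


def gd_inv(v):
    """Inverse Gudermannian on (-pi/2, pi/2), gd^{-1}(v) = artanh(sin(v))."""
    return math.atanh(math.sin(v))


def geodesic_theta(theta0, u, t):
    """One-dimensional geodesic of Equation (110) with theta(0)=theta0, theta'(0)=u."""
    return gd_inv(gd(theta0) + u * t / math.cosh(theta0))


def geodesic_ode_residual(theta0, u, t, h=1e-4):
    """Residual of Equation (109), theta'' - tanh(theta) theta'^2, by central differences."""
    lo, mid, hi = (geodesic_theta(theta0, u, s) for s in (t - h, t, t + h))
    first = (hi - lo) / (2.0 * h)
    second = (hi - 2.0 * mid + lo) / (h * h)
    return second - math.tanh(mid) * first ** 2


def exp_eta(eta, u):
    """Riemannian exponential in the expectation parameters, Equation (122)."""
    return [math.sin(math.asin(e) + math.sqrt(1.0 - e * e) * uj) for e, uj in zip(eta, u)]


def log_eta(eta1, eta2):
    """Inverse exponential of Equation (123): the tangent coordinates from eta1 to eta2."""
    return [(math.asin(b) - math.asin(a)) / math.sqrt(1.0 - a * a)
            for a, b in zip(eta1, eta2)]
```

For example, `exp_eta([0.75, 0.75], log_eta([0.75, 0.75], [0.2, -0.5]))` recovers the target point up to rounding, and `geodesic_ode_residual(0.3, 0.5, 0.4)` is small, as expected for a solution of Equation (109).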
Given a generic tangent vector α, we compute from Equation (38): d(∇φ)p(θ)α �� θ=0 = d � I−1 B,p(θ)∇φp(θ) � α ��� θ=0 = (dI−1 B,p(0)α)∇φp(0) + I−1 B,p(0) Hess φp(0)α = −I−1 B (p)(dIB,p(0)α) �∇φ(p) + I−1 B (p) Hess φp(0)α (125) and, upon substitution of (∇φ)p to Vp in Equation (87), ˙σp(HessY φ(p)) = d(∇φ)p(0)α + 1 2 I−1 B (p) � dIB,p(0)α � (∇φ)p(0), α = Sp(Y(p)) = −I−1 B (p)(dIB,p(0)α) �∇φ(p) + I−1 B (p) Hess φp(0) + 1 2 I−1 B (p) � dIB,p(0)α � �∇φ(p) = I−1 B (p) Hess φp(0)α − 1 2 I−1 B (p) � dIB,p(0)α � �∇φ(p) = I−1 B (p) � Hess φp(0)α − 1 2 � dIB,p(0)α � �∇φ(p) � (126) HessY φ is characterized by knowing the value of g(HessY φ, X): EV for all vector fields X. We have from Equation (126), with α = ˙σp(Y(p)) and β = ˙σp(X(p)), gp(HessY(p) φ(p), X(p)) = β′ Hess φp(0)α − 1 2 β′ � dIB,p(0)α � �∇φ(p). (127) 228 Entropy 2014, 16, 4260–4289 This is the presentation of the Riemannian Hessian as a bi-linear form on TEV; see the comments in [4] (Prop. 5.5.2-3). Note that the Riemannian Hessian is positive definite if: α′ Hess φp(0)α ≥ 1 2α′ � dIB,p(0)α � �∇φ(p), α ∈ Rm. (128) 4. Application to Combinatorial Optimization We conclude our paper by showing how the geometric method applies to the problem of finding the maximum of the expected value of a function. 4.1. Hessian of a Relaxed Function Here is a key example of vector field. Let f be any bounded random variable, and define the relaxed function to be φ(p) = Ep [ f ], p ∈ P>. Define F(p) to be the projection of f, as an element of L2(p), onto TpEV = eUpV, i.e., F(p) is the element of eUpV, such that: Ep [( f − F(p))v] = 0, v ∈ eUpV (129) In the basis eUpB, we have F(p) = ∑i ˆfp,i eUpXi and: Covp � f, Xj � = ∑ i ˆfp,i Ep �eUpXi eUpXj � , j = 1, . . . , m, (130) so that ˆfp = I−1 B (p) Covp (X, f ) and F(p) = ˆf ′ p eUpX = Covp ( f, X) I−1 B (p) eUpX. (131) Let us compute the gradient of the relaxed function φ = E· [ f ] : EV. 
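We have $\phi_p(\theta) = \mathrm{E}_{e_p(\theta)}[f]$, and from the properties of exponential families, the Euclidean gradient is $\nabla\phi_p(0) = \operatorname{Cov}_p(f, X)$. It follows that the natural gradient is:

$$\widetilde{\nabla}\phi(p) = I_B^{-1}(p)\operatorname{Cov}_p(X, f) = \hat f_p,$$ (132)

and the Riemannian gradient is $\nabla\phi(p) = F(p)$. From the properties of exponential families, we have $\operatorname{Hess}\phi_p(0) = \operatorname{Cov}_p(X, X, f)$, so that, in this case, Equation (127), when written in terms of the moments, is:

$$\beta'\operatorname{Cov}_p(X, X, f)\,\alpha - \frac12\beta'\operatorname{Cov}_p(X, X, \alpha\cdot X)\operatorname{Cov}_p(X, X)^{-1}\operatorname{Cov}_p(X, f).$$ (133)

4.1.1. Example: Binary Independent 2.5.2 and 3.2.1 Continued

We list below the computation of the Hessian in the case of two binary independent variables. Computations were done with Sage [22], which allows both the reduction $x_i^2 = 1$ in the ring of polynomials and the simplifications in the symbolic ring of parameters.

$$\operatorname{Cov}_\eta(X, f) = \begin{pmatrix} -(\eta_1^2 - 1)a_1 - (\eta_1^2\eta_2 - \eta_2)a_{12} \\ -(\eta_2^2 - 1)a_2 - (\eta_1\eta_2^2 - \eta_1)a_{12} \end{pmatrix} = \begin{pmatrix} -(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1) \\ -(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2) \end{pmatrix}$$ (134)

$$\operatorname{Cov}_\eta(X, X) = \begin{pmatrix} -\eta_1^2 + 1 & 0 \\ 0 & -\eta_2^2 + 1 \end{pmatrix} = \begin{pmatrix} -(\eta_1 - 1)(\eta_1 + 1) & 0 \\ 0 & -(\eta_2 - 1)(\eta_2 + 1) \end{pmatrix}$$ (135)

$$\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, f) = \begin{pmatrix} a_{12}\eta_2 + a_1 \\ a_{12}\eta_1 + a_2 \end{pmatrix} = \nabla F(\eta)$$ (136)

$$\operatorname{Cov}_\eta(X, X, f) = \begin{pmatrix} 2(\eta_1^3 - \eta_1)a_1 + 2(\eta_1^3\eta_2 - \eta_1\eta_2)a_{12} & (\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1)a_{12} \\ (\eta_1^2\eta_2^2 - \eta_1^2 - \eta_2^2 + 1)a_{12} & 2(\eta_1\eta_2^3 - \eta_1\eta_2)a_{12} + 2(\eta_2^3 - \eta_2)a_2 \end{pmatrix} = \begin{pmatrix} 2(\eta_1 - 1)(\eta_1 + 1)(a_{12}\eta_2 + a_1)\eta_1 & (\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)a_{12} \\ (\eta_2 - 1)(\eta_2 + 1)(\eta_1 - 1)(\eta_1 + 1)a_{12} & 2(\eta_2 - 1)(\eta_2 + 1)(a_{12}\eta_1 + a_2)\eta_2 \end{pmatrix}$$ (137)

$$\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, f) = \begin{pmatrix} -2(a_{12}\eta_2 + a_1)\eta_1 & -a_{12}\eta_2^2 + a_{12} \\ -a_{12}\eta_1^2 + a_{12} & -2(a_{12}\eta_1 + a_2)\eta_2 \end{pmatrix}$$ (138)

$$\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) = \begin{pmatrix} 2(a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\eta_1 & 0 \\ 0 & 2(a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\eta_2 \end{pmatrix}$$ (139)

$$\operatorname{Cov}_\eta(X, X)^{-1}\operatorname{Cov}_\eta(X, X, \nabla F(\eta)) = \begin{pmatrix} -2(a_{12}\eta_2 + a_1)\eta_1 & 0 \\ 0 & -2(a_{12}\eta_1 + a_2)\eta_2 \end{pmatrix}$$ (140)

The Riemannian Hessian as a matrix in the basis of the tangent space is:

$$\operatorname{Hess} F(\eta) = \operatorname{Cov}_\eta(X, X)^{-1}\Bigl(\operatorname{Cov}_\eta(X, X, f) - \frac12\operatorname{Cov}_\eta(X, X, \nabla F(\eta))\Bigr) = \begin{pmatrix} -(a_{12}\eta_2 + a_1)\eta_1 & -a_{12}(\eta_2 + 1)(\eta_2 - 1) \\ -a_{12}(\eta_1 + 1)(\eta_1 - 1) & -(a_{12}\eta_1 + a_2)\eta_2 \end{pmatrix}$$ (141)

As a check, let us compute the Riemannian Hessian as a natural Hessian in the Riemannian parameters, $\left.\operatorname{Hess}\phi\circ\operatorname{Exp}_p(u)\right|_{u=0}$; see [4] (Prop. 5.5.4). We have:

$$F\circ\operatorname{Exp}_\eta(u) = a_{12}\sin\Bigl(\sqrt{-\eta_1^2 + 1}\,u_1 + \arcsin(\eta_1)\Bigr)\sin\Bigl(\sqrt{-\eta_2^2 + 1}\,u_2 + \arcsin(\eta_2)\Bigr) + a_1\sin\Bigl(\sqrt{-\eta_1^2 + 1}\,u_1 + \arcsin(\eta_1)\Bigr) + a_2\sin\Bigl(\sqrt{-\eta_2^2 + 1}\,u_2 + \arcsin(\eta_2)\Bigr)$$ (142)

and:

$$\left.\operatorname{Hess} F\circ\operatorname{Exp}_\eta(u)\right|_{u=0} = \begin{pmatrix} (\eta_1^2 - 1)a_{12}\eta_1\eta_2 + (\eta_1^2 - 1)a_1\eta_1 & (\eta_1^2 - 1)(\eta_2^2 - 1)a_{12} \\ (\eta_1^2 - 1)(\eta_2^2 - 1)a_{12} & (\eta_2^2 - 1)a_{12}\eta_1\eta_2 + (\eta_2^2 - 1)a_2\eta_2 \end{pmatrix} = \begin{pmatrix} (a_{12}\eta_2 + a_1)(\eta_1 + 1)(\eta_1 - 1)\eta_1 & a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) \\ a_{12}(\eta_1 + 1)(\eta_1 - 1)(\eta_2 + 1)(\eta_2 - 1) & (a_{12}\eta_1 + a_2)(\eta_2 + 1)(\eta_2 - 1)\eta_2 \end{pmatrix}.$$ (143)

Note the presence of the factor $\operatorname{Cov}_\eta(X, X)$.

4.2. Newton Method

The Newton method is an iterative method that generates a sequence of points $p_t$, with $t = 0, 1, \dots$, that converges towards a stationary point $\hat p$ of $F(p) = \mathrm{E}_p[f]$, $p \in \mathcal{E}_V$, that is, a critical point of the vector field $p \mapsto \nabla F(p)$, $\nabla F(\hat p) = 0$. Here, we follow [4] (Ch. 5–6), and in particular Algorithm 5 on Page 113.

Let $\nabla F$ be a gradient field. We reproduce in our case the basic derivation of the Newton method in the following. Note that, in this section, we use the notation $\operatorname{Hess}\bullet[\alpha]$ to denote $\operatorname{Hess}_\alpha\bullet$. Using the definition of metric derivative, we have for a geodesic curve $[0, 1] \ni t \mapsto p(t) \in \mathcal{E}_V$ connecting $p = p(0)$ to $\hat p = p(1)$ that:

$$\frac{d}{dt} g_{p(t)}\bigl(\nabla F(p(t)), \delta p(t)\bigr) = g_{p(t)}\bigl(\operatorname{Hess} F(p(t))[\delta p(t)], \delta p(t)\bigr)$$ (144)

hence the increment from $p$ to $\hat p$ is:

$$g_{\hat p}\bigl(\nabla F(\hat p), \delta p(1)\bigr) - g_p\bigl(\nabla F(p), \delta p(0)\bigr) = \int_0^1 g_{p(t)}\bigl(\operatorname{Hess} F(p(t))[\delta p(t)], \delta p(t)\bigr)\, dt.$$ (145)

Now, we assume that $\nabla F(\hat p) = 0$ and that in Equation (145), the integral is approximated by the initial value of the integrand, that is to say, the Hessian is approximately constant on the geodesic from $p$ to $\hat p$; we obtain:

$$-g_p\bigl(\nabla F(p), \delta p(0)\bigr) = g_p\bigl(\operatorname{Hess} F(p)[\delta p(0)], \delta p(0)\bigr) + \epsilon.$$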
(146)

If we can solve the Newton equation:

$$\operatorname{Hess} F(p)[u] = -\nabla F(p)$$ (147)

then $u$ is approximately equal to the initial velocity of the geodesic connecting $p$ to $\hat p$, that is, $\hat p = \operatorname{Exp}_p(u)$. The particular structure of the exponential manifold suggests at least two natural retractions that could be used to move from $u$ to $\hat p$. Namely, we have the Riemannian exponential $(\theta_t, \theta_{t+1}) \mapsto \operatorname{Exp}_{\theta_t}(\theta_{t+1} - \theta_t)$ and the e-retraction coming from the exponential family itself and defined by $(\theta_t, \theta_{t+1}) \mapsto e_{\theta_t}(\theta_{t+1} - \theta_t)$, with $\theta_{t+1} - \theta_t = u_t$.

In the $\theta$ parameters, with the e-retraction, the Newton method generates a sequence $(\theta_t)$ according to the following updating rule:

$$\theta_{t+1} = \theta_t - \lambda\operatorname{Hess} F(\theta_t)^{-1}\,\widetilde{\nabla} F(\theta_t)$$ (148)

where $\lambda > 0$ is an extra parameter intended to control the step size and, in turn, the convergence to $\hat\theta$; see [5]. We can rewrite Equation (148) in terms of covariances as:

$$\theta_{t+1} = \theta_t - \lambda\Bigl(\operatorname{Cov}_{\theta_t}(X, X, f) - \frac12\operatorname{Cov}_{\theta_t}\bigl(X, X, \widetilde{\nabla} F(\theta_t)\bigr)\Bigr)^{-1}\widetilde{\nabla} F(\theta_t).$$ (149)

4.3. Example: Binary Independent

In the $\eta$ parameters, the Newton step is:

$$u = -\operatorname{Hess} F(\eta)^{-1}\nabla F(\eta) = \begin{pmatrix} \dfrac{a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} \\[2ex] \dfrac{a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} \end{pmatrix}$$ (150)

and the new $\eta$ in the Riemannian retraction is:

$$\operatorname{Exp}_\eta(u) = \begin{pmatrix} \sin\Bigl(\dfrac{\bigl(a_{12}^2\eta_1 + a_{12}a_2 + (a_1a_{12}\eta_1 + a_1a_2)\eta_2\bigr)\sqrt{-\eta_1^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} + \arcsin(\eta_1)\Bigr) \\[2ex] \sin\Bigl(\dfrac{\bigl(a_1a_2\eta_1 + a_1a_{12} + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2\bigr)\sqrt{-\eta_2^2 + 1}}{a_{12}^2\eta_1^2 + (a_{12}a_2\eta_1 + a_{12}^2)\eta_2^2 - a_{12}^2 + (a_1a_{12}\eta_1^2 + a_1a_2\eta_1)\eta_2} + \arcsin(\eta_2)\Bigr) \end{pmatrix}.$$ (151)

In Figure 6, we represent the vector field associated with the Newton step in the $\eta$ parameters, with $\lambda = 0.05$, using the Riemannian retraction, for the case $a_1 = 1$, $a_2 = 2$ and $a_{12} = 3$, with:

$$\operatorname{Exp}_\eta(u) = \begin{pmatrix} \sin\Bigl(\dfrac{\lambda\sqrt{-\eta_1^2 + 1}\,\bigl((3\eta_1 + 2)\eta_2 + 9\eta_1 + 6\bigr)}{3(2\eta_1 + 3)\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\eta_2 - 9} + \arcsin(\eta_1)\Bigr) \\[2ex] \sin\Bigl(\dfrac{\lambda\bigl(3(2\eta_1 + 3)\eta_2 + 2\eta_1 + 3\bigr)\sqrt{-\eta_2^2 + 1}}{3(2\eta_1 + 3)\eta_2^2 + 9\eta_1^2 + (3\eta_1^2 + 2\eta_1)\eta_2 - 9} + \arcsin(\eta_2)\Bigr) \end{pmatrix}.$$ (152)

The red dotted lines in the figure identify the basins of attraction of the vector field and correspond to the solutions of the explicit equation in $\eta$ for which the Newton step $u$ is not defined. This vector field can be compared to that in Figure 7, associated with the Newton step for $F(\eta)$ using the Euclidean geometry. In the Euclidean geometry, $F(\eta)$ is a quadratic function with one saddle point, so that from any $\eta$, the Newton step points in the direction of the critical point. This makes the Newton step unsuitable for an optimization algorithm. On the other hand, in the Riemannian geometry, the vertices of the polytope are critical points for $F(\eta)$, and they determine the presence of multiple basins of attraction, as expected.

Figure 6. The Newton step in the $\eta$ parameters, Riemannian retraction, $\lambda = 0.05$. The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined; cf. Equation (150). The instability close to the critical lines is represented by the longer arrows. Axes: $\eta_1$, $\eta_2$ (expectation parameters).

Figure 7. The Newton step in the $\eta$ parameters, Euclidean geometry, $\lambda = 0.05$. Axes: $\eta_1$, $\eta_2$ (expectation parameters).

Figure 8. The Newton step in the $\theta$ parameters, exponential retraction, $\lambda = 0.015$.
The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined. The instability along the critical lines, which identifies the basins of attraction, is not represented. 233 Entropy 2014, 16, 4260–4289 −2 −1 0 1 2 −2 −1 0 1 2 θ1 θ2 −2 0 2 4 −2 0 2 4 Natural parameters 0 0 0 0 Figure 9. The Newton step in the θ parameters, Euclidean geometry, λ = 0.15. The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined. The instability along the critical lines, which identifies the basins of attraction, is not represented. Figure 8 shows the Newton step in the θ parameters based on the e-retraction of Equation (149), while Figure 9 represents the Newton step evaluated with respect to the Euclidean geometry. A comparison of the two vector fields shows that, differently from the η parameters, the number of basins of attraction is the same in the two geometries; however, the scale of the vectors is different. In particular, notice how on the plateau, for diverging θ, the Newton step in the Euclidean geometry vanishes, while in the Riemannian geometry, it gets larger. This behavior suggests better convergence properties for an optimization algorithm based on the Newton step evaluated using the proper Riemannian geometry. In the θ parameters, the boundaries of the basins of attraction represented by the red dotted lines have been computed numerically and correspond to the values of θ for which the update step is not defined. Finally, notice that in both the η and θ parameters, the step is not always in the direction of descent for the function, a common behavior of the Newton method, which converges to the critical points. 5. 
Discussion and Conclusions

In this paper, we introduced second-order calculus over a statistical manifold, following the approach described in [4], which has been adapted to the special case of exponential statistical models [2,3]. By defining the Riemannian Hessian and using the notion of retraction, we developed the machinery necessary for the definition of the updating rule of the Newton method for the optimization of a function defined over an exponential family.

The examples discussed in the paper show that by taking into account the proper Riemannian geometry of a statistical exponential family, the vector fields associated with the Newton step in the different parametrizations change profoundly. Not only do new basins of attraction associated with local and global minima appear, as for the expectation parameters, but the magnitude of the Newton step is also affected, as over the plateau in the natural parameters. Such differences are expected to have a strong impact on the performance of an optimization algorithm based on the Newton step, from the point of view of both achievable convergence and the speed of convergence to the optimum.

The Newton method is a popular second-order optimization technique based on the computation of the Hessian of the function to be optimized and is well known for its super-linear convergence properties. However, the use of the Newton method poses a number of issues in practice. First of all, as the examples in Figures 6 and 8 show, the Newton step does not always point in the direction of the natural gradient, and the algorithm may not converge to a (local) optimum of the function. Such behavior is not unexpected; indeed, the Newton method tends to converge to critical points of the function to be optimized, which include local minima, local maxima and saddle points.
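The attraction of the Newton method to saddle points can be seen on a toy problem. The following is a minimal sketch in plain Euclidean geometry, not the Riemannian construction of this paper; the test function `F`, the starting point and the step size of the comparison gradient step are illustrative assumptions.

```python
# Illustrative quadratic with a single saddle point at the origin (assumed example):
#   F(x, y) = x**2 - y**2
def F(p):
    x, y = p
    return x * x - y * y

def grad(p):
    x, y = p
    return (2.0 * x, -2.0 * y)

def hessian(p):
    # Constant and indefinite: one positive and one negative eigenvalue.
    return ((2.0, 0.0), (0.0, -2.0))

def solve2(h, g):
    """Solve the 2x2 linear system h @ d = g by Cramer's rule."""
    (a, b), (c, d) = h
    det = a * d - b * c
    return ((d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det)

def newton_update(p, lam=1.0):
    """One (damped) Euclidean Newton update p - lam * H^{-1} grad F(p)."""
    d = solve2(hessian(p), grad(p))
    return (p[0] - lam * d[0], p[1] - lam * d[1])

start = (0.7, -1.3)
end = newton_update(start)  # a full step jumps to the saddle point
g = grad(start)
gd = (start[0] - 0.1 * g[0], start[1] - 0.1 * g[1])  # plain gradient-descent step

print("Newton lands on:", end, "with F =", F(end))
print("F at start:", F(start), " F after gradient step:", F(gd))
```

In this example the Newton update reaches the critical point in one step and increases the value of `F`, while the small gradient step decreases it, which is the behavior described above for an indefinite Hessian.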
In order to obtain a direction of ascent for the function to be optimized, the Hessian must be negative-definite, i.e., its eigenvalues must be strictly negative, which is not guaranteed in the general case.

Another important remark is related to the computational complexity associated with the evaluation of the Hessian, compared to the (natural) gradient. Indeed, to obtain the Newton step d, Christoffel matrices have to be evaluated, together with the third-order covariances between sufficient statistics and the function, and the Hessian has to be inverted. Finally, notice that when the Hessian is close to being non-invertible, numerical problems may arise in the computation of the Newton step, and the algorithm may become unstable and diverge.

In the literature, different methods have been proposed to overcome these issues. Among them, we mention quasi-Newton methods, where the update vector is obtained using a modified Hessian, which has been made negative-definite, for instance, by adding a proper correction matrix.

This paper represents the first step in the design of an algorithm based on the Newton method for the optimization over a statistical model. The authors are working on the computational aspects related to the implementation of the method, and a new paper with experimental results is in progress.

Acknowledgments: Luigi Malagò was supported by the Xerox University Affairs Committee Award and by de Castro Statistics, Collegio Carlo Alberto, Moncalieri. Giovanni Pistone is supported by de Castro Statistics, Collegio Carlo Alberto, Moncalieri, and is a member of GNAMPA–INdAM, Roma.

Author Contributions: All authors contributed to the design of the research. The research was carried out by all authors. The study of the Hessian and of the Newton method in statistical manifolds was originally suggested by Luigi Malagò. The manuscript was written by Luigi Malagò and Giovanni Pistone. All authors have read and approved the final manuscript.
Conflicts of Interest: The authors declare no conflict of interest.

References

1. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes. Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986; p. 283.
2. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; p. 206.
3. Pistone, G. Nonparametric Information Geometry. In Geometric Science of Information, Proceedings of the First International Conference, GSI 2013, Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science, Volume 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36.
4. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008; pp. xvi+224.
5. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006; pp. xxii+664.
6. Do Carmo, M.P. Riemannian geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston, MA, USA, 1992; pp. xiv+300.
7. Abraham, R.; Marsden, J.E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications, 2nd ed.; Applied Mathematical Sciences, Volume 75; Springer: New York, NY, USA, 1988; pp. x+654.
8. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1995; pp. xiv+364.
9. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2010.
10. Malagò, L.; Matteucci, M.; Dal Seno, B. An information geometry perspective on estimation of distribution algorithms: Boundary analysis. In Proceedings of the 2008 GECCO Conference Companion On Genetic and Evolutionary Computation (GECCO ’08); ACM: New York, NY, USA, 2008; pp. 2081–2088.
11. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. In Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
12. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by Estimation of Empirical Covariances. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
13. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA ’11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
14. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
15. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
16. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007; pp. xiv+246.
17. Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis, Politecnico di Milano, Milano, Italy, 2012.
18. Gallavotti, G. Statistical Mechanics: A Short Treatise; Texts and Monographs in Physics; Springer: Berlin, Germany, 1999; pp. xiv+339.
19. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149.
20. Esteban, J.; Ray, D. On the Measurement of Polarization. Econometrica 1994, 62, 819–851.
21. Montalvo, J.; Reynal-Querol, M. Ethnic polarization, potential conflict, and civil wars. Am. Econ. Rev. 2005, 796–816.
22. Stein, W. et al. Sage Mathematics Software (Version 6.0). The Sage Development Team, 2013. Available online: http://www.sagemath.org (accessed on 27 March 2014).

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

Information Geometric Complexity of a Trivariate Gaussian Statistical Model

Domenico Felice 1,2,*, Carlo Cafaro 3 and Stefano Mancini 1,2

1 School of Science and Technology, University of Camerino, I-62032 Camerino, Italy; E-Mail: stefano.mancini@unicam.it
2 INFN-Sezione di Perugia, Via A. Pascoli, I-06123 Perugia, Italy
3 Department of Mathematics, Clarkson University, Potsdam, 13699 NY, USA; E-Mail: carlocafaro2000@yahoo.it
* E-Mail: domenico.felice@unicam.it

Received: 1 April 2014; in revised form: 21 May 2014 / Accepted: 22 May 2014 / Published: 26 May 2014

Abstract: We evaluate the information geometric complexity of entropic motion on low-dimensional Gaussian statistical manifolds in order to quantify how difficult it is to make macroscopic predictions about systems in the presence of limited information. Specifically, we observe that the complexity of such entropic inferences not only depends on the amount of available pieces of information but also on the manner in which such pieces are correlated. Finally, we uncover that, for certain correlational structures, the impossibility of reaching the most favorable configuration from an entropic inference viewpoint seems to lead to an information geometric analog of the well-known frustration effect that occurs in statistical physics.

Keywords: probability theory; Riemannian geometry; complexity

1.
Introduction

One of the main efforts in physics is modeling and predicting natural phenomena using relevant information about the system under consideration. Theoretical physics has had a general measure of the uncertainty associated with the behavior of a probabilistic process for more than 100 years: the Shannon entropy [1]. The Shannon information theory was applied to dynamical systems and became successful in describing their unpredictability [2].

Along a similar avenue lies Entropic Dynamics (ED) [3], which makes use of inductive inference (Maximum Entropy Methods [4]) and Information Geometry [5]. This is clearly remarkable given that microscopic dynamics can be far removed from the phenomena of interest, such as in complex biological or ecological systems. Extension of ED to temporally complex dynamical systems on curved statistical manifolds led to relevant measures of chaoticity [6]. In particular, an information geometric approach to chaos (IGAC) has been pursued, studying chaos in informational geodesic flows describing physical, biological or chemical systems. It is the information geometric analogue of conventional geometrodynamical approaches [7], where the classical configuration space is replaced by a statistical manifold, with the additional possibility of considering chaotic dynamics arising from non-conformally flat metrics. Within this framework, it seems natural to consider as a complexity measure the (time-averaged) statistical volume explored by geodesic flows, namely an Information Geometry Complexity (IGC).

This quantity might help uncover connections between microscopic dynamics and experimentally observable macroscopic dynamics, which is a fundamental issue in physics [8]. An interesting manifestation of such a relationship appears in the study of the effects of microscopic external noise (noise imposed on the microscopic variables of the system) on the observed collective motion (macroscopic variables) of a globally coupled map [9].
These effects are quantified in terms of the complexity of the collective motion. Furthermore, it turns out that noise at a microscopic level reduces the complexity of the macroscopic motion, which in turn is characterized by the number of effective degrees of freedom of the system. The investigation of the macroscopic behavior of complex systems in terms of the underlying statistical structure of its microscopic degrees of freedom also reveals effects due to the presence of microcorrelations [10].

In this article we first show which macro-states should be considered in a Gaussian statistical model in order to have a reduction in time of the Information Geometry Complexity. Then, dealing with correlated bivariate and trivariate Gaussian statistical models, the ratio between the IGC in the presence and in the absence of microcorrelations is explicitly computed, finding an intriguing, even though not yet deeply understood, connection with the phenomenon of geometric frustration [11].

The layout of the article is as follows. In Section 2 we introduce a general statistical model discussing its geometry and describing both its dynamics and information geometry complexity. In Section 3, Gaussian statistical models (up to a trivariate model) are considered. There, we compute the asymptotic temporal behaviors of their IGCs. Finally, in Section 4 we draw our conclusions by outlining our findings and proposing possible further investigations.

Entropy 2014, 16, 2944–2958; doi:10.3390/e16062944; www.mdpi.com/journal/entropy

2. Statistical Models and Information Geometry Complexity

Given $n$ real-valued random variables $X_1, \dots, X_n$ defined on the sample space $\Omega$ with joint probability density $p : \mathbb{R}^n \to \mathbb{R}$ satisfying the conditions

$$p(x) \ge 0 \ \ (\forall x \in \mathbb{R}^n) \quad \text{and} \quad \int_{\mathbb{R}^n} dx\, p(x) = 1,$$ (1)

let us consider a family $\mathcal{P}$ of such distributions and suppose that they can be parametrized using $m$ real-valued variables $(\theta^1, \dots, \theta^m)$ so that

$$\mathcal{P} = \{ p_\theta = p(x|\theta) \,|\, \theta = (\theta^1, \dots, \theta^m) \in \Theta \},$$ (2)

where $\Theta \subseteq \mathbb{R}^m$ is the parameter space and the mapping $\theta \to p_\theta$ is injective. In such a way, $\mathcal{P}$ is an $m$-dimensional statistical model on $\mathbb{R}^n$. The mapping $\varphi : \mathcal{P} \to \mathbb{R}^m$ defined by $\varphi(p_\theta) = \theta$ allows us to consider $\varphi = [\theta^i]$ as a coordinate system for $\mathcal{P}$. Assuming parametrizations which are $C^\infty$, we can turn $\mathcal{P}$ into a $C^\infty$ differentiable manifold (thus, $\mathcal{P}$ is called a statistical manifold) [5].

The values $x_1, \dots, x_n$ taken by the random variables define the micro-state of the system, while the values $\theta^1, \dots, \theta^m$ taken by the parameters define the macro-state of the system.

Let $\mathcal{P} = \{p_\theta \,|\, \theta \in \Theta\}$ be an $m$-dimensional statistical model. Given a point $\theta$, the Fisher information matrix of $\mathcal{P}$ in $\theta$ is the $m \times m$ matrix $G(\theta) = [g_{ij}]$, where the $(i, j)$ entry is defined by

$$g_{ij}(\theta) := \int_{\mathbb{R}^n} dx\, p(x|\theta)\, \partial_i \log p(x|\theta)\, \partial_j \log p(x|\theta),$$ (3)

with $\partial_i$ standing for $\frac{\partial}{\partial \theta^i}$. The matrix $G(\theta)$ is symmetric, positive semidefinite and determines a Riemannian metric on the parameter space $\Theta$ [5]. Hence, it is possible to define a Riemannian statistical manifold $\mathcal{M} := (\Theta, g)$, where $g = g_{ij}\, d\theta^i \otimes d\theta^j$ $(i, j = 1, \dots, m)$ is the metric whose components $g_{ij}$ are given by Equation (3) (throughout the paper we use the Einstein summation convention).

Given the Riemannian manifold $\mathcal{M} = (\Theta, g)$, it is well known that there exists only one linear connection $\nabla$ (the Levi–Civita connection) on $\mathcal{M}$ that is compatible with the metric $g$ and symmetric [12]. We remark that the manifold $\mathcal{M}$ has one chart, $\Theta$ being an open set of $\mathbb{R}^m$, and the Levi–Civita connection is uniquely defined by means of the Christoffel coefficients

$$\Gamma^k_{ij} = \frac{1}{2}\, g^{kl} \left( \frac{\partial g_{lj}}{\partial \theta^i} + \frac{\partial g_{il}}{\partial \theta^j} - \frac{\partial g_{ij}}{\partial \theta^l} \right), \quad (i, j, k = 1, \dots, m)$$ (4)

where $g^{kl}$ is the $(k, l)$ entry of the inverse of the Fisher matrix $G(\theta)$.

The idea of curvature is the fundamental tool to understand the geometry of the manifold $\mathcal{M} = (\Theta, g)$. Actually, it is the basic geometric invariant and the intrinsic way to obtain it is by means of geodesics.
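As an illustration of Equation (3), the integral can be evaluated numerically for a simple model. The sketch below assumes a univariate Gaussian with macro-variables (μ, σ), anticipating the model of Section 3.1; the function names, the integration window and the grid size are illustrative choices, and the score functions are the standard derivatives of the Gaussian log-density.

```python
import math

def log_p(x, mu, sigma):
    """Log-density of a univariate Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def score(x, mu, sigma):
    """(d/dmu, d/dsigma) of log p(x | mu, sigma), computed analytically."""
    z = x - mu
    return (z / sigma ** 2, (z * z - sigma ** 2) / sigma ** 3)

def fisher_matrix(mu, sigma, half_width=12.0, n=20000):
    """Approximate g_ij = ∫ p ∂_i log p ∂_j log p dx (Eq. (3)) with the midpoint rule."""
    a = mu - half_width * sigma           # integrate over mu ± half_width * sigma
    dx = 2.0 * half_width * sigma / n
    g = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n):
        x = a + (k + 0.5) * dx
        w = math.exp(log_p(x, mu, sigma)) * dx
        s = score(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                g[i][j] += w * s[i] * s[j]
    return g

mu, sigma = 0.3, 1.7
G = fisher_matrix(mu, sigma)
print(G)  # compare with diag(1/sigma**2, 2/sigma**2)
```

For this model the numerical matrix can be compared with the metric $g = \frac{1}{\sigma^2} d\mu^2 + \frac{2}{\sigma^2} d\sigma^2$ quoted in Section 3.1.3.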
It is well known that, given any point $\theta \in \mathcal{M}$ and any vector $v$ tangent to $\mathcal{M}$ at $\theta$, there is a unique geodesic starting at $\theta$ with initial tangent vector $v$. Indeed, within the considered coordinate system, the geodesics are solutions of the following nonlinear second-order coupled ordinary differential equations [12]

$$\frac{d^2\theta^k}{d\tau^2} + \Gamma^k_{ij}\, \frac{d\theta^i}{d\tau} \frac{d\theta^j}{d\tau} = 0,$$ (5)

with $\tau$ denoting the time.

The recipe to compute some curvatures at a point $\theta \in \mathcal{M}$ is the following: first, select a 2-dimensional subspace $\Pi$ of the tangent space to $\mathcal{M}$ at $\theta$; second, follow the geodesics through $\theta$ whose initial tangent vectors lie in $\Pi$ and consider the 2-dimensional submanifold $S_\Pi$ swept out by them, inheriting a Riemannian metric from $\mathcal{M}$; finally, compute the Gaussian curvature of $S_\Pi$ at $\theta$, which can be obtained from its Riemannian metric as stated in the Theorema Egregium [13]. The number $K(\Pi)$ found in this manner is called the sectional curvature of $\mathcal{M}$ at $\theta$ associated with the plane $\Pi$. In terms of local coordinates, to compute the sectional curvature we need the curvature tensor,

$$R^h_{\ ijk} = \frac{\partial \Gamma^h_{jk}}{\partial \theta^i} - \frac{\partial \Gamma^h_{ik}}{\partial \theta^j} + \Gamma^l_{jk} \Gamma^h_{il} - \Gamma^l_{ik} \Gamma^h_{jl}.$$ (6)

For any basis $(\xi, \eta)$ for a 2-plane $\Pi \subset T_\theta \mathcal{M}$, the sectional curvature at $\theta \in \mathcal{M}$ is given by [12]

$$K(\xi, \eta) = \frac{R(\xi, \eta, \eta, \xi)}{|\xi|^2 |\eta|^2 - \langle \xi, \eta \rangle^2},$$ (7)

where $R$ is the Riemann curvature tensor, which is written in coordinates as $R = R_{ijkl}\, d\theta^i \otimes d\theta^j \otimes d\theta^k \otimes d\theta^l$ with $R_{ijkl} = g_{lh} R^h_{\ ijk}$, and $\langle \cdot, \cdot \rangle$ is the inner product defined by the metric $g$.

The sectional curvature is directly related to the topology of the manifold; along this direction the Cartan–Hadamard Theorem [13] is enlightening, stating that any complete, simply connected $n$-dimensional manifold with non-positive sectional curvature is diffeomorphic to $\mathbb{R}^n$.

We can consider upon the statistical manifold $\mathcal{M} = (\Theta, g)$ the macro-variables $\theta$ as accessible information and then derive the information dynamical Equation (5) from a standard principle of least action of Jacobi type [3].
The geodesic Equations (5) describe a reversible dynamics whose solution is the trajectory between an initial and a final macro-state, $\theta_{\text{initial}}$ and $\theta_{\text{final}}$, respectively. The trajectory can be equally traversed in both directions [10]. Actually, an equation relating instability with geometry exists, and it raises the hope that some global information about the average degree of instability (chaos) of the dynamics is encoded in global properties of the statistical manifolds [7]. That this can happen is shown by the special case of constant-curvature manifolds, for which the Jacobi–Levi-Civita equation simplifies to [7]

$$\frac{d^2 J^i}{d\tau^2} + K J^i = 0,$$ (8)

where $K$ is the constant sectional curvature of the manifold (see Equation (7)) and $J$ is the geodesic deviation vector field. On a positively curved manifold, the norm of the separating vector $J$ does not grow, whereas on a negatively curved manifold, the norm of $J$ grows exponentially in time, and if the manifold is compact, so that its geodesics are sooner or later obliged to fold, this provides an example of chaotic geodesic motion [14].

Taking into consideration these facts, we single out as a suitable indicator of dynamical (temporal) complexity the information geometric complexity, defined as the average dynamical statistical volume [15]

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right] := \frac{1}{\tau} \int_0^\tau d\tau'\, \mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right],$$ (9)

where

$$\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right] := \int_{\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')} \sqrt{\det(G(\theta))}\; d\theta,$$ (10)

with $G(\theta)$ the information matrix whose components are given by Equation (3). The integration space $\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')$ is defined as follows

$$\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau') := \left\{ \theta = (\theta^1, \dots, \theta^m) : \theta^k(0) \le \theta^k \le \theta^k(\tau') \right\},$$ (11)

where $\theta^k \equiv \theta^k(s)$ with $0 \le s \le \tau'$ such that $\theta^k(s)$ satisfies (5). The quantity $\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right]$ is the volume of the effective parameter space explored by the system at time $\tau'$.
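The time average in Equation (9) is straightforward to approximate once the volume of Equation (10) is known as a function of time. The sketch below is a minimal illustration with two hypothetical volume profiles, one growing linearly and one decaying exponentially, of the kinds that appear in Section 3; the function name, the constants and the trapezoidal discretization are assumptions made for the example.

```python
import math

def time_averaged_volume(vol, tau, n=10000):
    """(1/tau) * ∫_0^tau vol(t) dt, as in Eq. (9), by the composite trapezoidal rule."""
    dt = tau / n
    total = 0.5 * (vol(0.0) + vol(tau))
    for k in range(1, n):
        total += vol(k * dt)
    return total * dt / tau

# Hypothetical volume profiles (the constants are arbitrary illustrative values).
a1, a2 = 0.8, 0.5
rate = 0.6

def linear_volume(t):
    return a1 * t + a2

def decaying_volume(t):
    return math.sqrt(2.0) * math.exp(-rate * t)

tau = 200.0
igc_linear = time_averaged_volume(linear_volume, tau)     # grows like (a1 / 2) * tau
igc_decaying = time_averaged_volume(decaying_volume, tau)  # decays like sqrt(2) / (rate * tau)
print(igc_linear, igc_decaying)
```

A linearly growing volume yields an average that grows linearly with half the slope, while an exponentially decaying volume yields an average that falls off as 1/τ at large times.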
The temporal average has been introduced in order to average out the possibly very complex fine details of the entropic dynamical description of the system’s complexity dynamics. Relevant properties, concerning the complexity of geodesic paths on curved statistical manifolds, of the quantity (10) compared to the Jacobi vector field are discussed in [16].

3. The Gaussian Statistical Model

In the following we devote our attention to a Gaussian statistical model $\mathcal{P}$ whose elements are multivariate normal joint distributions for $n$ real-valued variables $X_1, \dots, X_n$ given by

$$p(x|\theta) = \frac{1}{\sqrt{(2\pi)^n \det C}} \exp\left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right],$$ (12)

where $\mu = \left( E(X_1), \dots, E(X_n) \right)$ is the $n$-dimensional mean vector and $C$ denotes the $n \times n$ covariance matrix with entries $c_{ij} = E(X_i X_j) - E(X_i) E(X_j)$, $i, j = 1, \dots, n$. Since $\mu$ is an $n$-dimensional real vector and $C$ is an $n \times n$ symmetric matrix, the number of parameters involved in this model is $n + \frac{n(n+1)}{2}$. Moreover, $C$ is a symmetric, positive definite matrix, hence the parameter space is given by

$$\Theta := \{ (\mu, C) \,|\, \mu \in \mathbb{R}^n,\ C \in \mathbb{R}^{n \times n},\ C > 0 \}.$$ (13)

Hereafter we consider the statistical model given by Equation (12) when the covariance matrix $C$ has only the variances $\sigma_i^2 = E(X_i^2) - (E(X_i))^2$ as parameters. In fact, we assume that the off-diagonal entry $(i, j)$ of the covariance matrix $C$ equals $\rho \sigma_i \sigma_j$, with $\rho \in \mathbb{R}$ quantifying the degree of correlation.

We may further notice that the function $f_{ij}(x) := \partial_i \log p(x|\theta)\, \partial_j \log p(x|\theta)$, when $p(x|\theta)$ is given by Equation (12), is a polynomial in the variables $x_i$ $(i = 1, \dots, n)$ whose degree is not greater than four. Indeed, we have that

$$\partial_i \log p(x|\theta) = \frac{1}{p(x|\theta)}\, \partial_i p(x|\theta) = \partial_i \log \frac{1}{\sqrt{(2\pi)^n \det C}} + \partial_i \left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right],$$ (14)

and, therefore, the differentiation does not affect the variables $x_i$.
With this in mind, in order to compute the integral in (3), we can use the following formula [17]

$$\frac{1}{\sqrt{(2\pi)^n \det C}} \int dx\, f_{ij}(x) \exp\left[ -\frac{1}{2} (x - \mu)^t C^{-1} (x - \mu) \right] = \exp\left[ \frac{1}{2} \sum_{h,k=1}^{n} c_{hk} \frac{\partial}{\partial x_h} \frac{\partial}{\partial x_k} \right] f_{ij} \Big|_{x = \mu},$$ (15)

where the exponential denotes the power series over its argument (the differential operator).

3.1. The Monovariate Gaussian Statistical Model

We now start to apply the concepts of the previous section to the Gaussian statistical model of Equation (12) for $n = 1$. In this case, the dimension of the statistical Riemannian manifold $\mathcal{M} = (\Theta, g)$ is at most two. Indeed, to describe elements of the statistical model $\mathcal{P}$ given by Equation (12), we basically need the mean $\mu = E(X)$ and the variance $\sigma^2 = E[(X - \mu)^2]$. We deal separately with the cases when the monovariate model has only $\mu$ as macro-variable (Case 1), when $\sigma$ is the unique macro-variable (Case 2), and finally when both $\mu$ and $\sigma$ are macro-variables (Case 3).

3.1.1. Case 1

Consider the monovariate model with only $\mu$ as macro-variable by setting $\sigma = 1$. In this case the manifold $\mathcal{M}$ is trivially the real flat straight line, since $\mu \in (-\infty, +\infty)$. Indeed, the integral in (3) is equal to 1 when the distribution $p(x|\theta)$ reads $p(x|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x - \mu)^2 \right]$, so the metric is $g = d\mu^2$. Furthermore, from Equations (4) and (5) the information dynamics is described by the geodesic $\mu(\tau) = A_1 \tau + A_2$, where $A_1, A_2 \in \mathbb{R}$. Hence, the volume of Equation (10) is $\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right] = \int d\mu = A_1 \tau' + A_2$; since this quantity must be positive we assume $A_1, A_2 > 0$. Finally, the asymptotic behavior of the IGC (9) is

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right] \approx \left( \frac{A_1}{2} \right) \tau.$$ (16)

This shows that the complexity increases linearly in time, meaning that acquiring information about $\mu$ and updating it is not enough to increase our knowledge about the micro-state of the system.

3.1.2. Case 2

Consider now the monovariate Gaussian statistical model of Equation (12) when $\mu = E(X) = 0$ and the macro-variable is only $\sigma$.
In this case the probability distribution function reads $p(x|\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{x^2}{2\sigma^2} \right]$, while the Fisher–Rao metric becomes $g = \frac{2}{\sigma^2} d\sigma^2$. Emphasizing that in this case the manifold is flat as well, we derive the information dynamics by means of Equations (4) and (5) and obtain the geodesic $\sigma(\tau) = A_1 \exp(A_2 \tau)$. The volume in Equation (10) is then

$$\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right] = \int \frac{\sqrt{2}}{\sigma}\, d\sigma = \sqrt{2}\, \log\left[ A_1 \exp(A_2 \tau') \right].$$ (17)

Again, to have positive volume we have to assume $A_1, A_2 > 0$. Finally, the (asymptotic) IGC (9) becomes

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right] \approx \left( \frac{\sqrt{2} A_2}{2} \right) \tau.$$ (18)

This shows that also in this case the complexity increases linearly in time, meaning that acquiring information about $\sigma$ and updating it is not enough to increase our knowledge about the micro-state of the system.

3.1.3. Case 3

The take-home message of the previous cases is that we have to account for both mean $\mu$ and variance $\sigma$ as macro-variables to look for possible non-increasing complexity. Hence, consider the probability distribution function given by

$$p(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right].$$ (19)

The dimension of the Riemannian manifold $\mathcal{M} = (\Theta, g)$ is two, where the parameter space $\Theta$ is given by $\Theta = \{ (\mu, \sigma) \,|\, \mu \in (-\infty, +\infty),\ \sigma > 0 \}$ and the Fisher–Rao metric reads $g = \frac{1}{\sigma^2} d\mu^2 + \frac{2}{\sigma^2} d\sigma^2$. Here, the sectional curvature given by Equation (7) is negative (for this metric a direct computation gives the constant value $-\frac{1}{2}$), and we expect a decreasing behavior in time of the IGC. Thanks to Equation (4), we find that the only non-vanishing Christoffel coefficients are $\Gamma^1_{12} = -\frac{1}{\sigma}$, $\Gamma^2_{11} = \frac{1}{2\sigma}$ and $\Gamma^2_{22} = -\frac{1}{\sigma}$. Substituting them into Equation (5) we derive the following geodesic equations

$$\begin{cases} \dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma} \dfrac{d\sigma}{d\tau} \dfrac{d\mu}{d\tau} = 0, \\[2ex] \dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma} \left( \dfrac{d\sigma}{d\tau} \right)^2 + \dfrac{1}{2\sigma} \left( \dfrac{d\mu}{d\tau} \right)^2 = 0. \end{cases}$$ (20)

The integration of the above coupled differential equations is non-trivial.
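Independently of the analytical route, the system (20) can be integrated numerically as a sanity check. The sketch below uses a classical fourth-order Runge–Kutta scheme and compares the result with the closed-form solution reported in Equation (21); the values of σ0 and A1, the step size and the choice of initial data (taken from the closed form at τ = 0) are illustrative assumptions.

```python
import math

def rhs(state):
    """Eq. (20) as a first-order system in (mu, sigma, dmu/dtau, dsigma/dtau)."""
    mu, sigma, dmu, dsigma = state
    ddmu = 2.0 * dsigma * dmu / sigma
    ddsigma = dsigma ** 2 / sigma - dmu ** 2 / (2.0 * sigma)
    return (dmu, dsigma, ddmu, ddsigma)

def rk4_step(state, h):
    """One classical Runge-Kutta step of size h."""
    k1 = rhs(state)
    k2 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k1)))
    k3 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k2)))
    k4 = rhs(tuple(s + h * k for s, k in zip(state, k3)))
    return tuple(s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def closed_form(tau, sigma0, A1):
    """mu(tau) and sigma(tau) as reported in Eq. (21)."""
    a = sigma0 * abs(A1) / math.sqrt(2.0)
    e2 = math.exp(2.0 * a * tau)
    sigma = 2.0 * sigma0 * math.exp(a * tau) / (1.0 + e2)
    mu = -2.0 * sigma0 * math.sqrt(2.0) * A1 / (abs(A1) * (1.0 + e2))
    return mu, sigma

sigma0, A1 = 1.5, 0.8            # illustrative constants
mu0, s0 = closed_form(0.0, sigma0, A1)
# Initial velocities implied by the closed form at tau = 0:
# dmu/dtau = A1 * sigma0**2 and dsigma/dtau = 0.
state = (mu0, s0, A1 * sigma0 ** 2, 0.0)

h, steps = 1.0e-3, 3000          # integrate up to tau = 3
for _ in range(steps):
    state = rk4_step(state, h)

mu_num, sigma_num = state[0], state[1]
mu_ref, sigma_ref = closed_form(h * steps, sigma0, A1)
print(mu_num - mu_ref, sigma_num - sigma_ref)
```

The two printed differences measure how closely the numerical trajectory follows the closed form over the chosen time window.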
We follow the method described in [10] and arrive at

$$\sigma(\tau) = \frac{2\sigma_0 \exp\left( \frac{\sigma_0 |A_1|}{\sqrt{2}} \tau \right)}{1 + \exp\left( \frac{2\sigma_0 |A_1|}{\sqrt{2}} \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 \sqrt{2}\, A_1}{|A_1| \left[ 1 + \exp\left( \frac{2\sigma_0 |A_1|}{\sqrt{2}} \tau \right) \right]},$$ (21)

where $\sigma_0$ and $A_1$ are real constants. Then, using (21), the volume of Equation (10) is

$$\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right] = \int \frac{\sqrt{2}}{\sigma^2}\, d\sigma\, d\mu = \frac{\sqrt{2} A_1}{|A_1|} \exp\left( -\frac{\sigma_0 |A_1|}{\sqrt{2}} \tau' \right).$$ (22)

Since the last quantity must be positive, we assume $A_1 > 0$. Finally, employing the above expression in Equation (9) we arrive at

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right] \approx \left( \frac{2}{\sigma_0 A_1} \right) \frac{1}{\tau}.$$ (23)

We can now see a reduction in time of the complexity, meaning that acquiring information about both $\mu$ and $\sigma$ and updating them allows us to increase our knowledge about the micro-state of the system. Hence, comparing Equations (16), (18) and (23), we conclude that entropic inference on a Gaussian distributed micro-variable is carried out in a more efficient manner when both its mean and its variance are available in the form of information constraints. Macroscopic predictions when only one of these pieces of information is available are more complex.

3.2. Bivariate Gaussian Statistical Model

Consider now the Gaussian statistical model $\mathcal{P}$ of Equation (12) when $n = 2$. In this case the dimension of the Riemannian manifold $\mathcal{M} = (\Theta, g)$ is at most four. From the analysis of the monovariate Gaussian model in Section 3.1 we have understood that both mean and variance should be considered. Hence the minimal assumption is to consider $E(X_1) = E(X_2) = \mu$ and $E[(X_1 - \mu)^2] = E[(X_2 - \mu)^2] = \sigma^2$. Furthermore, in this case we also have to take into account the possible presence of (micro) correlations, which appear at the level of macro-states as off-diagonal terms in the covariance matrix. In short, this implies considering the following probability distribution function

$$p(x_1, x_2|\mu, \sigma) = \frac{\exp\left\{ -\frac{1}{2\sigma^2 (1 - \rho^2)} \left[ (x_1 - \mu)^2 - 2\rho (x_1 - \mu)(x_2 - \mu) + (x_2 - \mu)^2 \right] \right\}}{2\pi \sigma^2 \sqrt{1 - \rho^2}},$$ (24)

where $\rho \in (-1, 1)$.
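Thanks to Equation (15) we compute the Fisher information matrix $G$ and find $g = g_{11}\, d\mu^2 + g_{22}\, d\sigma^2$ with

$$g_{11} = \frac{2}{\sigma^2 (\rho + 1)}; \qquad g_{22} = \frac{4}{\sigma^2}.$$ (25)

The only non-vanishing Christoffel coefficients (4) are $\Gamma^1_{12} = -\frac{1}{\sigma}$, $\Gamma^2_{11} = \frac{1}{2\sigma(\rho + 1)}$ and $\Gamma^2_{22} = -\frac{1}{\sigma}$. In this case as well, the sectional curvature (Equation (7)) of the manifold $\mathcal{M}$ is a negative function, and so we may expect a decreasing asymptotic behavior for the IGC. From Equation (5) it follows that the geodesic equations are

$$\begin{cases} \dfrac{d^2\mu(\tau)}{d\tau^2} - \dfrac{2}{\sigma} \dfrac{d\sigma}{d\tau} \dfrac{d\mu}{d\tau} = 0, \\[2ex] \dfrac{d^2\sigma(\tau)}{d\tau^2} - \dfrac{1}{\sigma} \left( \dfrac{d\sigma}{d\tau} \right)^2 + \dfrac{1}{2(1 + \rho)\sigma} \left( \dfrac{d\mu}{d\tau} \right)^2 = 0, \end{cases}$$ (26)

whose solutions are

$$\sigma(\tau) = \frac{2\sigma_0 \exp\left( \frac{\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right)}{1 + \exp\left( \frac{2\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 \sqrt{2(1 + \rho)}\, A_1}{|A_1| \left[ 1 + \exp\left( \frac{2\sigma_0 |A_1|}{\sqrt{2(1+\rho)}} \tau \right) \right]}.$$ (27)

Using (27) in Equation (10) gives the volume

$$\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right] = \int \frac{2\sqrt{2}}{\sqrt{1 + \rho}\, \sigma^2}\, d\sigma\, d\mu = \frac{4 A_1}{|A_1|} \exp\left( -\frac{\sigma_0 |A_1|}{\sqrt{2(1 + \rho)}} \tau' \right).$$ (28)

For it to be positive we have to assume $A_1 > 0$. Finally, employing (28) in (9) leads to the IGC

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right] \approx \left( \frac{4\sqrt{2}}{\sigma_0 A_1} \right) \frac{\sqrt{1 + \rho}}{\tau},$$ (29)

with $\rho \in (-1, 1)$. We may compare the asymptotic expressions of the IGCs in the presence and in the absence of correlations, obtaining

$$R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right]_{\rho = 0}} = \sqrt{1 + \rho},$$ (30)

where “strong” stands for the fully connected lattice underlying the micro-variables. The ratio $R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)$ is a monotonically increasing function of $\rho$.

While the temporal behavior of the IGC (29) is similar to that of the IGC in (23), here correlations play a fundamental role. From Equation (30), we conclude that entropic inference on two Gaussian distributed micro-variables on a fully connected lattice is carried out in a more efficient manner when the two micro-variables are negatively correlated. Instead, when such micro-variables are positively correlated, macroscopic predictions become more complex than in the absence of correlations.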
Intuitively, this is due to the fact that for anticorrelated variables, an increase in one variable implies a decrease in the other one (different directional change): variables become more distant, thus more distinguishable in the Fisher–Rao information metric sense. Similarly, for positively correlated variables, an increase or decrease in one variable always predicts the same directional change for the second variable: variables do not become more distant, and thus not more distinguishable, in the Fisher–Rao information metric sense. This may lead us to guess that in the presence of anticorrelations, motion on curved statistical manifolds via the Maximum Entropy updating methods becomes less complex.

3.3. Trivariate Gaussian Statistical Model

In this section we consider a Gaussian statistical model $\mathcal{P}$ of Equation (12) when $n = 3$. In this case as well, in order to understand the asymptotic behavior of the IGC in the presence of correlations between the micro-states, we make the minimal assumption that, given the random vector $X = (X_1, X_2, X_3)$ distributed according to a trivariate Gaussian, $E(X_1) = E(X_2) = E(X_3) = \mu$ and $E[(X_1 - \mu)^2] = E[(X_2 - \mu)^2] = E[(X_3 - \mu)^2] = \sigma^2$. Therefore, the space of the parameters of $\mathcal{P}$ is given by $\Theta = \{ (\mu, \sigma) \,|\, \mu \in \mathbb{R},\ \sigma > 0 \}$.

The manifold $\mathcal{M} = (\Theta, g)$ changes its metric structure depending on the number of correlations between micro-variables, namely, one, two, or three. The covariance matrices corresponding to these cases read, modulo the congruence via a permutation matrix [17],

$$C_1 = \sigma^2 \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad C_2 = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & 0 \\ \rho & 0 & 1 \end{pmatrix}, \qquad C_3 = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}.$$ (31)

3.3.1. Case 1

First, we consider the trivariate Gaussian statistical model of Equation (12) when $C \equiv C_1$. Then, proceeding as in Section 3.2, we have $g = g_{11}\, d\mu^2 + g_{22}\, d\sigma^2$, where $g_{11} = \frac{3 + \rho}{(1 + \rho)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. Also in this case we find that the sectional curvature of Equation (7) is a negative function.
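Hence, as we state in Section 2, we may expect a decreasing (in time) behavior of the information geometry complexity. Furthermore, we obtain the geodesics

$$\sigma(\tau) = \frac{2\sigma_0 \exp\left( \sigma_0 \sqrt{A(\rho)}\, \tau \right)}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)}\, \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\, \frac{1}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)}\, \tau \right)},$$ (32)

where $A(\rho) = \frac{A_1^2 (3 + \rho)}{6(1 + \rho)}$ and $A_1 \in \mathbb{R}$. We remark that $A(\rho) > 0$ for all $\rho \in (-1, 1)$. Then, the volume (10) becomes (the integrand being $\sqrt{g_{11} g_{22}}$ for the metric of this case)

$$\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right] = \int \sqrt{\frac{6(3 + \rho)}{1 + \rho}}\, \frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6 A_1}{|A_1|} \exp\left( -\sigma_0 \sqrt{A(\rho)}\, \tau' \right),$$ (33)

requiring $A_1 > 0$ for its positivity. Finally, using (33) in (9) we arrive at the asymptotic behavior of the IGC

$$\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right] \approx \left( \frac{6\sqrt{6}}{\sigma_0 A_1} \right) \sqrt{\frac{1 + \rho}{3 + \rho}}\, \frac{1}{\tau}.$$ (34)

Comparing (34) in the presence and in the absence of correlations yields

$$R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho) := \frac{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right]}{\widetilde{\mathrm{vol}}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau)\right]_{\rho = 0}} = \sqrt{3}\, \sqrt{\frac{1 + \rho}{3 + \rho}},$$ (35)

where “weak” stands for a low degree of connection in the lattice underlying the micro-variables. Notice that $R^{\mathrm{weak}}_{\mathrm{trivariate}}(\rho)$ is a monotonically increasing function of the argument $\rho \in (-1, 1)$.

3.3.2. Case 2

When the trivariate Gaussian statistical model of Equation (12) has $C \equiv C_2$, the condition $C > 0$ constrains the correlation coefficient to be $\rho \in \left( -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)$. Proceeding again as in Section 3.2, we have $g = g_{11}\, d\mu^2 + g_{22}\, d\sigma^2$, where $g_{11} = \frac{3 - 4\rho}{(1 - 2\rho^2)\sigma^2}$ and $g_{22} = \frac{6}{\sigma^2}$. The sectional curvature of Equation (7) is a negative function as well, and so we may apply the arguments of Section 2, expecting a decrease in time of the complexity. Furthermore, we obtain the geodesics

$$\sigma(\tau) = \frac{2\sigma_0 \exp\left( \sigma_0 \sqrt{A(\rho)}\, \tau \right)}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)}\, \tau \right)}, \qquad \mu(\tau) = -\frac{2\sigma_0 A_1}{\sqrt{A(\rho)}}\, \frac{1}{1 + \exp\left( 2\sigma_0 \sqrt{A(\rho)}\, \tau \right)},$$ (36)

where $A(\rho) = \frac{A_1^2 (3 - 4\rho)}{6(1 - 2\rho^2)}$ and $A_1 \in \mathbb{R}$. We remark that $A(\rho) > 0$ for all $\rho \in \left( -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2} \right)$. Then, the volume (10) becomes

$$\mathrm{vol}\left[\mathcal{D}^{(\mathrm{geodesic})}_\Theta(\tau')\right] = \int \sqrt{\frac{6(3 - 4\rho)}{1 - 2\rho^2}}\, \frac{1}{\sigma^2}\, d\sigma\, d\mu = \frac{6 A_1}{|A_1|} \exp\left( -\sigma_0 \sqrt{A(\rho)}\, \tau' \right).$$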
(37) We have to set A1 > 0 for the positivity of the volume (37), and using it in (9) we arrive at the asymptotic behavior of the IGC � vol � D(geodesic) Θ (τ) � ≈ � 6 √ 6 σ0A1 �� 1 − 2ρ2 3 − 4ρ 1 τ . (38) Then, comparing (38) in the presence and in the absence of correlations yields R mildly weak trivariate (ρ) := � vol � D(geodesic) Θ (τ) � � vol � D(geodesic) Θ (τ) � ρ=0 = √ 3 � 1 − 2ρ2 3 − 4ρ , (39) where “mildly weak” stands for a lattice (underlying micro-variables) neither fully connected nor with minimal connection. This is a function of the argument ρ ∈ (− √ 2 2 , √ 2 2 ) that attains the maximum � 3 2 at ρ = 1 2, while in the extrema of the interval (− √ 2 2 , √ 2 2 ) it tends to zero. 3.3.3. Case 3 Last, we consider the trivariate Gaussian statistical model of the Equation (12) when C ≡ C3. In this case, the condition C > 0 requires the correlation coefficient to be ρ ∈ (− 1 2, 1). Proceeding again like in Section 3.2 we have g = g11dμ2 + g22dσ2, where g11 = 3 (1+2ρ)σ2 and g22 = 6 σ2 . We find that the sectional curvature of Equation (7) is a negative function; hence, we may expect a decreasing (in time) behavior of the complexity. It follows the geodesics σ(τ) = 2σ0 exp � σ0 � A(ρ) τ � 1 + exp � 2σ0 � A(ρ) τ �, μ(τ) = − 2σ0A1 � A(ρ) 1 1 + exp � 2σ0 � A(ρ) τ �, (40) where A(ρ) = A2 1 2(1+2ρ) and A1 ∈ R. We note that A(ρ) > 0 for all ρ ∈ (− 1 2, 1). Using (40), we compute vol � D(geodesic) Θ (τ′) � = � 3 √ 2 � (1 + 2ρ) 1 σ2 dσdμ = 6 √ 2A1 |A1| exp � − σ0 � A(ρ) τ � . (41) Also in this case we need to assume A1 > 0 to have positive volume. Finally, substituting Equation (41) into Equation (9), the asymptotic behavior of the IGC results � vol � D(geodesic) Θ (τ) � ≈ � 12 σ0A1 �� 1 + 2ρ 1 τ . 
(42) The comparison of (42) in the presence and in the absence of correlations yields R strong trivariate(ρ) := � vol � D(geodesic) Θ (τ) � � vol � D(geodesic) Θ (τ) � ρ=0 = � 1 + 2ρ, (43) 245 Entropy 2014, 16, 2944–2958 where “strong” stands for a fully connected lattice underlying the (three) micro-variables. We remark the latter ratio is a monotonically increasing function of the argument ρ ∈ (− 1 2, 1). The behaviors of R(ρ) of Equations (30), (35), (39) and (43) are reported in Figure 1. −1 −0.5 0 0.5 1 ρpeak ρ R(ρ) Figure 1. Ratio R(ρ) of volumes vs. degree of correlations ρ. Solid line refers to R strong bivariate(ρ); Dotted line refers to Rweak trivariate(ρ); Dashed line referes to R mildly weak trivariate (ρ); Dash-dotted refers to R strong trivariate(ρ). The non-monotonic behavior of the ratio R mildly weak trivariate (ρ) in Equation (39) corresponds to the information geometric complexities for the mildly weak connected three-dimensional lattice. Interestingly, the growth stops at a critical value ρpeak = 1 2 at which R mildly weak trivariate (ρpeak) = R strong bivariate(ρpeak). From Equation (30), we conclude that entropic inferences on three Gaussian distributed micro-variables on a fully connected lattice is carried out in a more efficient manner when the two micro-variables are negatively correlated. Instead, when such micro-variables are positively correlated, macroscopic predictions become more complex that in the absence of correlations. Furthermore, the ratio R strong trivariate(ρ) of the information geometric complexities for this fully connected three-dimensional lattice increases in a monotonic fashion. These conclusions are similar to those presented for the bivariate case. 
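The metric coefficients and ratios quoted for the three cases can be cross-checked numerically. The sketch below is a minimal illustration, not the authors' code: it assumes that $g_{11}$ is the Fisher information of the common mean, so that $\sigma^2 g_{11} = \mathbf{1}^T (C/\sigma^2)^{-1} \mathbf{1}$, and that each ratio equals $\sqrt{g_{11}(0)/g_{11}(\rho)}$, which is what Equations (34), (38) and (42) imply. All function and variable names are hypothetical.

```python
import numpy as np

# Which pairs of micro-variables are correlated in C1, C2, C3 of Equation (31).
CORRELATED_PAIRS = {
    1: [(0, 1)],
    2: [(0, 1), (0, 2)],
    3: [(0, 1), (0, 2), (1, 2)],
}


def correlation_matrix(case, rho):
    """Return C_case / sigma^2 from Equation (31) as a 3x3 array."""
    c = np.eye(3)
    for i, j in CORRELATED_PAIRS[case]:
        c[i, j] = c[j, i] = rho
    return c


def sigma2_g11(case, rho):
    """sigma^2 * g11, computed as 1^T (C / sigma^2)^{-1} 1.

    Assumption: g11 is the Fisher information of the common mean mu.
    """
    ones = np.ones(3)
    return float(ones @ np.linalg.solve(correlation_matrix(case, rho), ones))


# Closed forms of sigma^2 * g11 quoted in Sections 3.3.1 to 3.3.3.
CLOSED_FORM = {
    1: lambda rho: (3 + rho) / (1 + rho),
    2: lambda rho: (3 - 4 * rho) / (1 - 2 * rho**2),
    3: lambda rho: 3 / (1 + 2 * rho),
}


def ratio(case, rho):
    """R(rho) = sqrt(g11(0) / g11(rho)), as in Equations (35), (39), (43)."""
    return float(np.sqrt(sigma2_g11(case, 0.0) / sigma2_g11(case, rho)))


if __name__ == "__main__":
    for case in (1, 2, 3):
        for rho in (-0.3, 0.2, 0.5):
            print(case, rho, sigma2_g11(case, rho),
                  CLOSED_FORM[case](rho), ratio(case, rho))
```

The values of ρ used in the demonstration loop are chosen to lie inside the admissible interval of all three cases, so every matrix is positive definite.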
However, there is a key feature of the IGC to emphasize when passing from the two-dimensional to the three-dimensional manifolds associated with fully connected lattices: the effects of negative correlations and positive correlations are amplified with respect to the respective absence-of-correlations scenarios,

$$\frac{R^{\mathrm{strong}}_{\mathrm{trivariate}}(\rho)}{R^{\mathrm{strong}}_{\mathrm{bivariate}}(\rho)} = \sqrt{\frac{1+2\rho}{1+\rho}}, \tag{44}$$

where $\rho \in \left(-\frac{1}{2}, 1\right)$. Specifically, carrying out entropic inferences on the higher-dimensional manifold in the presence of anti-correlations, that is for $\rho \in \left(-\frac{1}{2}, 0\right)$, is less complex than on the lower-dimensional manifold, as is evident from Equation (44). The converse is true in the presence of positive correlations, that is for $\rho \in (0, 1)$.

4. Concluding Remarks

In summary, we considered low-dimensional Gaussian statistical models (up to a trivariate model) and investigated their dynamical (temporal) complexity. This has been quantified by the volume of geodesics for parameters characterizing the probability distribution functions. To the best of our knowledge, there is no dynamic measure of complexity of geodesic paths on curved statistical manifolds that could be compared to our IGC. However, it could be worthwhile to understand the connection, if any, between our IGC and the complexity of paths of dynamic systems introduced in [20]. Specifically, according to the Alekseev-Brudno theorem in the algorithmic theory of dynamical systems [21], a way to predict each new segment of a chaotic trajectory is obtained by adding information proportional to the length of this segment and independent of the full previous length of the trajectory. This means that this information cannot be extracted from observation of the previous motion, even an infinitely long one!
If the instability is a power law, then the required information per unit time is inversely proportional to the full previous length of the trajectory and, asymptotically, the prediction becomes possible. For the sake of completeness, we also point out that the relevance of volumes in quantifying the static model complexity of statistical models was already pointed out in [22] and [23]: complexity is related to the volume of a model in the space of distributions regarded as a Riemannian manifold of distributions with a natural metric defined by the Fisher–Rao metric tensor. Finally, we would like to point out that two of the Authors have recently associated Gaussian statistical models to networks [17]. Specifically, it is assumed that random variables are located on the vertices of the network while correlations between random variables are regarded as weighted edges of the network. Within this framework, a static network complexity measure has been proposed as the volume of the corresponding statistical manifold. We emphasize that such a static measure could be, in principle, applied to time-dependent networks by accommodating time-varying weights on the edges [24]. This requires the consideration of a time-sequence of different statistical manifolds. Thus, we could follow the time-evolution of a network complexity through the time evolution of the volumes of the associated manifolds. In this work we uncover that in order to have a reduction in time of the complexity one has to consider both mean and variance as macro-variables. This leads to different topological structures of the parameter space in (13); in particular, we have to consider at least a 2-dimensional manifold in order to have effects such as a power law decay of the complexity. Hence, the minimal hypothesis in a multivariate Gaussian model consists in considering all mean values equal and all covariances equal. 
In such a case, however, the complexity shows interesting features depending on the correlation among micro-variables (as summarized in Figure 1). For a trivariate model with only two correlations, the information geometric complexity ratio exhibits a non-monotonic behavior in ρ (the correlation parameter), taking zero value at the extrema of the range of ρ. In contrast, for closed configurations (bivariate and trivariate models with all micro-variables correlated with each other) the complexity ratio exhibits a monotonic behavior in terms of the correlation parameter. The fact that in such a case this ratio cannot be zero at the extrema of the range of ρ is reminiscent of the geometric frustration phenomenon that occurs in the presence of loops [11]. Specifically, recall that a geometrically frustrated system cannot simultaneously minimize all interactions because of geometric constraints [11,18]. For example, geometric frustration can occur in an Ising model, which is an array of spins (for instance, atoms that can take states ±1) that are magnetically coupled to each other. If one spin is, say, in the +1 state, then it is energetically favorable for its immediate neighbors to be in the same state in the case of a ferromagnetic model. On the contrary, in antiferromagnetic systems, nearest-neighbor spins want to align in opposite directions. This rule can be easily satisfied on a square. However, due to geometrical frustration, it is not possible to satisfy it on a triangle: for an antiferromagnetic triangular Ising model, any three neighboring spins are frustrated. Geometric frustration in triangular Ising models can be observed by considering spin configurations with total spin J = ±1 and analyzing the fluctuations in energy of the spin system as a function of temperature. There is no peak at all in the standard deviation of the energy in the case J = −1, and a monotonic behavior is recorded.
This indicates that the antiferromagnetic system does not have a phase transition to a state with long-range order. Instead, in the case J = +1, a peak in the energy fluctuations emerges. This significant change in the behavior of energy fluctuations as a function of temperature in triangular configurations of spin systems is a signature of the presence of frustrated interactions in the system [19].

In this article, we observe a significant change in the behavior of the information geometric complexity ratios as a function of the correlation coefficient in the trivariate Gaussian statistical models. Specifically, in the fully connected trivariate case, no peak arises and a monotonic behavior in ρ of the information geometric complexity ratio is observed. In the mildly weak connected trivariate case, instead, a peak in the information geometric complexity ratio is recorded at ρpeak ≥ 0. This dramatic disparity of behavior can be ascribed to the fact that, when carrying out statistical inferences with positively correlated Gaussian random variables, the maximum entropy favorable scenario is incompatible with these working hypotheses. Thus, the system appears frustrated. These considerations lead us to conclude that we have uncovered a very interesting information geometric resemblance of the more standard geometric frustration effect in Ising spin models. However, for a conclusive claim of the existence of an information geometric analog of the frustration effect, we feel we have to further deepen our understanding. A forthcoming research project along these lines will be a detailed investigation of both arbitrary triangular and square configurations of correlated Gaussian random variables, where we take into consideration both the presence of different intensities and signs of pairwise interactions ($\rho_{ij} \neq \rho_{ik}$ if $j \neq k$, $\forall i$).
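The contrast between a peaked and a monotonic complexity ratio can be illustrated with a simple grid search. The following is a rough sketch under stated assumptions: the trivariate ratios are the closed forms of Equations (35), (39) and (43), while the bivariate ratio $\sqrt{1+\rho}$ is inferred from Equation (44) because Equation (30) is not reproduced in this section. The helper `interior_peak` and all other names are hypothetical.

```python
import numpy as np


def r_weak(rho):
    """Equation (35): one correlated pair."""
    return np.sqrt(3 * (1 + rho) / (3 + rho))


def r_mildly_weak(rho):
    """Equation (39): two correlated pairs."""
    return np.sqrt(3 * (1 - 2 * rho**2) / (3 - 4 * rho))


def r_strong(rho):
    """Equation (43): fully connected trivariate model."""
    return np.sqrt(1 + 2 * rho)


def r_bivariate(rho):
    """Assumed form of the bivariate ratio, inferred from Equation (44)."""
    return np.sqrt(1 + rho)


def interior_peak(f, lo, hi, n=200001):
    """Grid search for the maximum of f on the open interval (lo, hi).

    Returns the maximizing rho, the maximum value, and a flag telling
    whether the maximum sits strictly inside the grid (a genuine peak)
    or at its edge (monotonic behavior).
    """
    grid = np.linspace(lo, hi, n)[1:-1]  # drop the endpoints of the open interval
    values = f(grid)
    k = int(np.argmax(values))
    return float(grid[k]), float(values[k]), 0 < k < len(grid) - 1


if __name__ == "__main__":
    half_sqrt2 = np.sqrt(2) / 2
    print("mildly weak:", interior_peak(r_mildly_weak, -half_sqrt2, half_sqrt2))
    print("weak:       ", interior_peak(r_weak, -1.0, 1.0))
    print("strong:     ", interior_peak(r_strong, -0.5, 1.0))
```

Under these assumptions only the mildly weak case reports an interior maximum, located near ρ = 1/2, where its value coincides with the assumed bivariate ratio.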
Acknowledgments: Domenico Felice and Stefano Mancini acknowledge the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FET-Open grant agreement TOPDRIM, number FP7-ICT-318121.

Author Contributions: The authors have equally contributed to the paper. All authors read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Feldman, D.F.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
2. Kolmogorov, A.N. A new metric invariant of transitive dynamical systems and of automorphism of Lebesgue spaces. Doklady Akademii Nauk SSSR 1958, 119, 861–864.
3. Caticha, A. Entropic Dynamics. Bayesian Inference and Maximum Entropy Methods in Science and Engineering, the 22nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Moscow, Idaho, 3–7 August 2002; Fry, R.L., Ed.; American Institute of Physics: College Park, MD, USA, 2002; Volume 617, p. 302.
4. Caticha, A.; Preuss, R. Maximum entropy and Bayesian data analysis: Entropic prior distributions. Phys. Rev. E 2004, 70, 046127.
5. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000.
6. Cafaro, C. Works on an information geometrodynamical approach to chaos. Chaos Solitons Fractals 2009, 41, 886–891.
7. Pettini, M. Geometry and Topology in Hamiltonian Dynamics and Statistical Mechanics; Springer-Verlag: Berlin/Heidelberg, Germany, 2007.
8. Lebowitz, J.L. Microscopic Dynamics and Macroscopic Laws. Ann. N. Y. Acad. Sci. 1981, 373, 220–233.
9. Shibata, T.; Chawanya, T.; Chawanya, K. Noiseless Collective Motion out of Noisy Chaos. Phys. Rev. Lett. 1999, 82, doi: http://dx.doi.org/10.1103/PhysRevLett.82.4424.
10. Ali, S.A.; Cafaro, C.; Kim, D.-H.; Mancini, S.
The effect of the microscopic correlations on the information geometric complexity of Gaussian statistical models. Physica A 2010, 389, 3117–3127.
11. Sadoc, J.F.; Mosseri, R. Geometrical Frustration; Cambridge University Press: Cambridge, UK, 2006.
12. Lee, J.M. Riemannian Manifolds: An Introduction to Curvature; Springer: New York, NY, USA, 1997.
13. Do Carmo, M.P. Riemannian Geometry; Springer: New York, NY, USA, 1992.
14. Cafaro, C.; Ali, S.A. Jacobi fields on statistical manifolds of negative curvature. Physica D 2007, 234, 70–80.
15. Cafaro, C.; Giffin, A.; Ali, S.A.; Kim, D.-H. Reexamination of an information geometric construction of entropic indicators of complexity. Appl. Math. Comput. 2010, 217, 2944–2951.
16. Cafaro, C.; Mancini, S. Quantifying the complexity of geodesic paths on curved statistical manifolds through information geometric entropies and Jacobi fields. Physica D 2011, 240, 607–618.
17. Felice, D.; Mancini, S.; Pettini, M. Quantifying Networks Complexity from Information Geometry Viewpoint. J. Math. Phys. 2014, 55, 043505.
18. Moessner, R.; Ramirez, A.P. Geometrical Frustration. Phys. Today 2006, 59, 24–29.
19. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
20. Brudno, A.A. The complexity of the trajectories of a dynamical system. Uspekhi Mat. Nauk 1978, 33, 207–208.
21. Alekseev, V.M.; Yacobson, M.V. Symbolic dynamics and hyperbolic dynamic systems. Phys. Rep. 1981, 75, 290–325.
22. Myung, J.; Balasubramanian, V.; Pitt, M.A. Counting probability distributions: differential geometry and model selection. Proc. Natl. Acad. Sci. USA 2000, 97, 11170.
23. Rodriguez, C.C. The volume of bitnets. AIP Conf. Proc. 2004, 735, 555–564.
24. Motter, A.E.; Albert, R. Networks in motion. Phys. Today 2012, 65, 43–48.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

Learning from Complex Systems: On the Roles of Entropy and Fisher Information in Pairwise Isotropic Gaussian Markov Random Fields

Alexandre Levada

Computing Department, Federal University of São Carlos, Rod. Washington Luiz, km 235, São Carlos, SP, Brazil; E-Mail: alexandre@dc.ufscar.br

Received: 4 December 2013 / Accepted: 30 January 2014 / Published: 18 February 2014

Abstract: Markov random field models are powerful tools for the study of complex systems. However, little is known about how the interactions between the elements of such systems are encoded, especially from an information-theoretic perspective. In this paper, our goal is to shed light on the connection between Fisher information, Shannon entropy, information geometry and the behavior of complex systems modeled by isotropic pairwise Gaussian Markov random fields. We propose analytical expressions to compute local and global versions of these measures using Besag's pseudo-likelihood function, characterizing the system's behavior through its Fisher curve, a parametric trajectory across the information space that provides a geometric representation for the study of complex systems in which temperature deviates from infinity. Computational experiments show how the proposed tools can be useful in extracting relevant information from complex patterns. The obtained results quantify and support our main conclusion, which is: in terms of information, moving towards higher entropy states (A → B) is different from moving towards lower entropy states (B → A), since the Fisher curves are not the same, given a natural orientation (the direction of time).

Keywords: Markov random fields; information theory; Fisher information; entropy; maximum pseudo-likelihood estimation

1.
Introduction With the increasing value of information in modern society and the massive volume of digital data that is available, there is an urgent need for developing novel methodologies for data filtering and analysis in complex systems. In this scenario, the notion of what is informative or not is a top priority. Sometimes, patterns that at first may appear to be locally irrelevant may turn out to be extremely informative in a more global perspective. In complex systems, this is a direct consequence of the intricate non-linear relationship between the pieces of data along different locations and scales. Within this context, information theoretic measures play a fundamental role in a huge variety of applications once they represent statistical knowledge in a systematic, elegant and formal framework. Since the first works of Shannon [1], and later with many other generalizations [2–4], the concept of entropy has been adapted and successfully applied to almost every field of science, among which we can cite physics [5], mathematics [6–8], economics [9] and, fundamentally, information theory [10–12]. Similarly, the concept of Fisher information [13,14] has been shown to reveal important properties of statistical procedures, from lower bounds on estimation methods [15–17] to information geometry [18,19]. Roughly speaking, Fisher information can be thought of as the likelihood analog of entropy, which is a probability-based measure of uncertainty. In general, classical statistical inference is focused on capturing information about location and dispersion of unknown parameters of a given family of distribution and studying how this information is related to uncertainty in estimation procedures. 
Entropy 2014, 16, 1002–1036; doi:10.3390/e16021002; www.mdpi.com/journal/entropy

In typical situations, an exponential family of distributions and an independence hypothesis (independent random variables) are often assumed, giving the likelihood function a series of desirable mathematical properties [15–17]. Although mathematically convenient for many problems, in complex systems modeling the independence assumption is not reasonable, because much of the information is somehow encoded in the relations between the random variables [20,21]. In order to overcome this limitation, Markov random field (MRF) models appear to be a natural generalization of the classical approach, replacing the independence assumption by a more realistic conditional independence assumption. Basically, in every MRF, knowledge of a finite-support neighborhood around a given variable isolates it from all the remaining variables. A further simplification consists in considering a pairwise interaction model, constraining the size of the maximum clique to be two (in other words, the model captures only binary relationships). Moreover, if the MRF model is isotropic, which means that the parameter controlling the interactions between neighboring variables is invariant to changes in direction, all the information regarding the spatial dependence structure of the system is conveyed by a single parameter, from now on denoted by β (or simply, the inverse temperature). In this paper, we assume an isotropic pairwise Gaussian Markov random field (GMRF) model [22,23], also known as an auto-normal model or a conditional auto-regressive model [24,25]. Basically, the question that motivated this work and that we are trying to elucidate here is: What kind of information is encoded by the β parameter in such a model?
We want to know how this parameter, and as a consequence, the whole spatial dependence structure of a complex system modeled by a Gaussian Markov random field, is related to both local and global information theoretic measures, more precisely the observed and expected Fisher information, as well as self-information and Shannon entropy. In searching for answers for our fundamental question, investigations led us to an exact expression for the asymptotic variance of the maximum pseudo-likelihood (MPL) estimator of β in an isotropic pairwise GMRF model, suggesting that asymptotic efficiency is not granted. In the context of statistical data analysis, Fisher information plays a central role in providing tools and insights for modeling the interactions between complex systems and their components. The advantage of MRF models over the traditional statistical ones is that MRFs take into account the dependence between pieces of information as a function of the system’s temperature, which may even be variable along time. Briefly speaking, this investigation aims to explore ways to measure and quantify distances between complex systems operating in different thermodynamical conditions. By analyzing and comparing the behavior of local patterns observed throughout the system (defined over a regular 2D lattice), it is possible to measure how informative those patterns for a given inverse temperature are, or simply β (which encodes the expected global behavior). In summary, our idea is to describe the behavior of a complex system in terms of information as its temperature deviates from infinity (when the particles are statistically independent) to a lower bound. The obtained results suggest that, in the beginning, when the temperature is infinite and the information equilibrium prevails, the information is somehow spread along the system. 
However, when the temperature is low and this equilibrium condition no longer holds, we have a sparser representation in terms of information, since this information is concentrated in the boundaries of the regions that define a smooth global configuration. In the vast remainder of this "universe", due to this smoothness constraint, the strong alignment between the particles prevails, which is exactly the expected global behavior for temperatures below a critical value (making the majority of the interaction patterns along the system uninformative).

The remainder of the paper is organized as follows: Section 2 discusses a technique for the estimation of the inverse temperature parameter, called the maximum pseudo-likelihood (MPL) approach, and provides derivations for the observed Fisher information in an isotropic pairwise GMRF model. Intuitive interpretations for the two versions of this local measure are discussed. In Section 3, we derive analytical expressions for the computation of the expected Fisher information, which allows us to assign a global information measure to a given system configuration. Similarly, in Section 4, an expression for the global entropy of a system modeled by a GMRF is shown. The results suggest a connection between maximum pseudo-likelihood and minimum entropy criteria in the estimation of the inverse temperature parameter on GMRFs. Section 5 discusses the uncertainty in the estimation of this important parameter by defining an expression for the asymptotic variance of its maximum pseudo-likelihood estimator in terms of both forms of Fisher information. In Section 6, the definition of the Fisher curve of a system as a parametric trajectory in the information space is proposed. Section 7 shows the experimental setup.
Computational simulations with both Markov chain Monte Carlo algorithms and some real data were conducted, showing how the proposed tools can be used to extract relevant information from complex systems. Finally, Section 8 presents our conclusions, final remarks and possibilities for future works.

2. Fisher Information in Isotropic Pairwise GMRFs

The remarkable Hammersley–Clifford theorem [26] states the equivalence between Gibbs random fields (GRF) and Markov random fields (MRF), which implies that any MRF can be defined either in terms of a global (joint Gibbs distribution) or a local (set of local conditional density functions) model. For our purposes, we will choose the latter representation.

Definition 1. An isotropic pairwise Gaussian Markov random field regarding a local neighborhood system, $\eta_i$, defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ is completely characterized by a set of $n$ local conditional density functions $p(x_i \mid \eta_i, \vec{\theta})$, given by:

$$p\left(x_i \mid \eta_i, \vec{\theta}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \right\} \tag{1}$$

with $\vec{\theta} = (\mu, \sigma^2, \beta)$, where μ and σ² are the expected value and the variance of the random variables, and β = 1/T is the parameter that controls the interaction between the variables (inverse temperature).

Note that, for β = 0, the model degenerates to the usual Gaussian distribution. From an information geometry perspective [18,19], this means that we are constrained to a sub-manifold within the Riemannian manifold of probability distributions, where the natural Riemannian metric (tensor) is given by the Fisher information. It has been shown that the geometric structure of exponential family distributions exhibits constant curvature. However, little is known about information geometry on more general statistical models, such as GMRFs. For β > 0, some degree of correlation between the observations is expected, making the interactions grow stronger.
Typical choices for $\eta_i$ are the first and second order non-causal neighborhood systems, defined by the sets of four and eight nearest neighbors, respectively.

2.1. Maximum Pseudo-Likelihood Estimation

Maximum likelihood estimation is intractable in MRF parameter estimation, due to the existence of the partition function in the joint Gibbs distribution. An alternative, proposed by Besag [24], is maximum pseudo-likelihood estimation, which is based on the conditional independence principle. The pseudo-likelihood function is defined as the product of the local conditional density functions (LCDFs) for all the $n$ variables of the system, modeled as a random field.

Definition 2. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the set corresponding to the observations at time $t$, the pseudo-likelihood function of the model is defined by:

$$L\left(\vec{\theta}; X^{(t)}\right) = \prod_{i=1}^{n} p\left(x_i \mid \eta_i, \vec{\theta}\right) \tag{2}$$

Note that the pseudo-likelihood function is a function of the parameters. For better mathematical tractability, it is usual to take the logarithm of $L(\vec{\theta}; X^{(t)})$. Plugging Equation (1) into Equation (2) and taking the logarithm leads to:

$$\log L\left(\vec{\theta}; X^{(t)}\right) = -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \tag{3}$$

By differentiating Equation (3) with respect to each parameter and properly solving the pseudo-likelihood equations, we obtain the following maximum pseudo-likelihood estimators for the parameters μ, σ² and β:

$$\hat{\beta}_{MPL} = \frac{\displaystyle\sum_{i=1}^{n} \left[ \left(x_i - \mu\right) \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]}{\displaystyle\sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2} \tag{4}$$

$$\hat{\mu}_{MPL} = \frac{1}{n\left(1 - k\beta\right)} \sum_{i=1}^{n} \left( x_i - \beta \sum_{j \in \eta_i} x_j \right) \tag{5}$$

$$\hat{\sigma}^2_{MPL} = \frac{1}{n} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right]^2 \tag{6}$$

where $k$ denotes the cardinality of the non-causal neighborhood set $\eta_i$. Note that if β = 0, the MPL estimators of both μ and σ² become the widely known sample mean and sample variance.
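The estimators above translate directly into array operations on a regular lattice. The sketch below is a minimal illustration, not the paper's implementation: it assumes a first-order (four-neighbor) system and periodic (toroidal) boundary conditions, which the text does not specify, and every function name is hypothetical.

```python
import numpy as np


def neighbor_sum(x, mu):
    """Sum of (x_j - mu) over the four nearest neighbors of every site.

    Assumption: periodic (toroidal) boundaries, implemented with np.roll.
    """
    d = x - mu
    return (np.roll(d, 1, axis=0) + np.roll(d, -1, axis=0)
            + np.roll(d, 1, axis=1) + np.roll(d, -1, axis=1))


def beta_mpl(x, mu):
    """Maximum pseudo-likelihood estimate of beta, Equation (4)."""
    s = neighbor_sum(x, mu)
    return float(np.sum((x - mu) * s) / np.sum(s**2))


def sigma2_mpl(x, mu, beta):
    """Maximum pseudo-likelihood estimate of sigma^2, Equation (6)."""
    residual = (x - mu) - beta * neighbor_sum(x, mu)
    return float(np.mean(residual**2))


def log_pseudo_likelihood(x, mu, sigma2, beta):
    """Log pseudo-likelihood of the whole lattice, Equation (3)."""
    residual = (x - mu) - beta * neighbor_sum(x, mu)
    return float(-0.5 * x.size * np.log(2 * np.pi * sigma2)
                 - np.sum(residual**2) / (2 * sigma2))


if __name__ == "__main__":
    # Checkerboard of +1 / -1: every site disagrees with all four neighbors.
    rows, cols = np.indices((8, 8))
    board = np.where((rows + cols) % 2 == 0, 1.0, -1.0)
    print("checkerboard beta:", beta_mpl(board, 0.0))

    # Independent Gaussian noise: the estimate should be close to zero.
    noise = np.random.default_rng(0).normal(size=(64, 64))
    print("noise beta:", beta_mpl(noise, noise.mean()))
```

With periodic boundaries every site is a neighbor of exactly k other sites, so Equation (5) reduces to the sample mean for any β ≠ 1/k, which is why the demonstration plugs in `noise.mean()` for μ.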
Since the cardinality of the neighborhood system, $k = |\eta_i|$, is spatially invariant (we are assuming a regular neighborhood system) and each variable is dependent on a fixed number of neighbors on a lattice, $\hat{\beta}_{MPL}$ can be rewritten in terms of cross-covariances:

$$\hat{\beta}_{MPL} = \frac{\displaystyle\sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\displaystyle\sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}} \tag{7}$$

where $\hat{\sigma}_{ij}$ denotes the sample covariance between the central variable, $x_i$, and $x_j \in \eta_i$. Similarly, $\hat{\sigma}_{jk}$ denotes the sample covariance between two variables belonging to the neighborhood system, $\eta_i$ (the definition of the neighborhood system, $\eta_i$, does not include the location, $s_i$).

2.2. Fisher Information of Spatial Dependence Parameters

Basically, Fisher information measures the amount of information a sample conveys about an unknown parameter. It can be thought of as the likelihood analog of entropy, which is a probability-based measure of uncertainty. Often, when we are dealing with independent and identically distributed (i.i.d.) random variables, the computation of the global Fisher information present in a random sample $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ is quite straightforward, since each observation, $x_i$, $i = 1, 2, \ldots, n$, brings exactly the same amount of information (when we are dealing with independent samples, the superscript, $t$, is usually suppressed, since the underlying dependence structure does not change through time). However, this is not true for spatial dependence parameters in MRFs, since different configuration patterns ($x_i \cup \eta_i$) provide distinct contributions to the local observed Fisher information, which can be used to derive a reasonable approximation to the global Fisher information [27].

2.3. The Information Equality

It is widely known from statistical inference theory that, under certain regularity conditions, information equality holds in the case of independent observations in the exponential family [15–17].
In other words, we can compute the Fisher information of a random sample regarding a parameter of interest, θ, by:

$$I\left(\theta; X^{(t)}\right) = E\left[ \left( \frac{\partial}{\partial\theta} \log L\left(\theta; X^{(t)}\right) \right)^2 \right] = -E\left[ \frac{\partial^2}{\partial\theta^2} \log L\left(\theta; X^{(t)}\right) \right] \tag{8}$$

where $L(\theta; X^{(t)})$ denotes the likelihood function at a time instant, $t$. In our investigations, to avoid the joint Gibbs distribution, often intractable due to the presence of the partition function (global Gibbs field), we replace the usual likelihood function by Besag's pseudo-likelihood function, and then, we work with the local model instead (local Markov field).

However, given the intrinsic spatial dependence structure of Gaussian Markov random field models, information equilibrium is not a natural condition. As we will discuss later, in general, information equality fails. Thus, in a GMRF model, we have to consider two kinds of Fisher information, from now on denoted by Type I (due to the first derivative of the pseudo-likelihood function) and Type II (due to the second derivative of the pseudo-likelihood function). Eventually, when certain conditions are satisfied, these two values of information will converge to a unique bound. Essentially, β is the parameter responsible for controlling whether both forms of information converge or diverge. Knowing the role of β (inverse temperature) in a GMRF model, it is expected that for β = 0 (or T → ∞), information equilibrium prevails. In fact, we will see in the following sections that as β deviates from zero (and long-term correlations start to emerge), the divergence between the two kinds of information increases.

In terms of information geometry, it has been shown that the geometric structure of the exponential family of distributions is basically given by the Fisher information matrix, which is the natural Riemannian metric (metric tensor) [18,19]. So, when the inverse temperature parameter is zero, the geometric structure of the model is a surface, since the parametric space is 2D (μ and σ²).
However, as the inverse temperature parameter starts to increase, the original surface is gradually transformed to a 3D Riemannian manifold, equipped with a novel metric tensor (the 3 × 3 Fisher information matrix for μ, σ² and β). In this context, by measuring the Fisher information regarding the inverse temperature parameter along an interval ranging from βMIN = A = 0 to βMAX = B, we are essentially trying to capture part of the deformation in the geometric structure of the model. In this paper, we focus on the computation of this measure. In future works we expect to derive the complete Fisher information matrix in order to completely characterize the transformations in the metric tensor.

2.4. Observed Fisher Information

In order to quantify the amount of information conveyed by a local configuration pattern in a complex system, the concept of observed Fisher information must be defined.

Definition 3. Consider an MRF defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$. The Type I local observed Fisher information for the observation, $x_i$, regarding the spatial dependence parameter, β, is defined in terms of its local conditional density function as:

$$\phi_\beta(x_i) = \left[ \frac{\partial}{\partial\beta} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \right]^2 \tag{9}$$

Hence, for an isotropic pairwise GMRF model, the Type I local observed Fisher information regarding β for the observation, $x_i$, is given by:

$$\phi_\beta(x_i) = \frac{1}{\sigma^4} \left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left(x_j - \mu\right) \right] \left[ \sum_{j \in \eta_i} \left(x_j - \mu\right) \right] \right\}^2 = \frac{1}{\sigma^4} \left[ \sum_{j \in \eta_i} \left(x_i - \mu\right)\left(x_j - \mu\right) - \beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \left(x_j - \mu\right)\left(x_k - \mu\right) \right]^2 \tag{10}$$

Definition 4. Consider an MRF defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$.
The Type II local observed Fisher information for the observation, $x_i$, regarding the spatial dependence parameter, $\beta$, is defined in terms of its local conditional density function as:

$$\psi_\beta(x_i) = -\frac{\partial^{2}}{\partial\beta^{2}}\log p\left(x_i \mid \eta_i, \vec{\theta}\right) \tag{11}$$

In the case of an isotropic pairwise GMRF model, the Type II local observed Fisher information regarding $\beta$ for the observation, $x_i$, is given by:

$$\psi_\beta(x_i) = \frac{1}{\sigma^2}\left[\sum_{j\in\eta_i}\sum_{k\in\eta_i}\left(x_j-\mu\right)\left(x_k-\mu\right)\right] \tag{12}$$

Note that $\psi_\beta(x_i)$ does not depend on $x_i$, only on the neighborhood system, $\eta_i$. Therefore, we have two local measures, $\phi_\beta(x_i)$ and $\psi_\beta(x_i)$, that can be assigned to every element of a system modeled by an isotropic pairwise GMRF. In the following, we discuss some interpretations of what is being measured with the proposed tools and how to define global versions of these measures by means of the expected Fisher information.

2.5. The Role of Fisher Information in GMRF Models

At this point, a relevant issue is the interpretation of these Fisher information measures in a complex system modeled by an isotropic pairwise GMRF. Roughly speaking, $\phi_\beta(x_i)$ is the quadratic rate of change of the logarithm of the local likelihood function at $x_i$, given a global value of $\beta$. This global value of $\beta$ determines the expected global behavior: if $\beta$ is large, a high degree of correlation among the observations is expected, and if $\beta$ is close to zero, the observations are independent. It is therefore reasonable to admit that configuration patterns showing values of $\phi_\beta(x_i)$ close to zero are more likely to be observed throughout the field, since their likelihood values are high (close to the maximum local likelihood condition). In other words, these patterns are more "aligned" with the expected global behavior, and therefore, they convey little information about the spatial dependence structure (these samples are not informative, since they are expected to exist in a system operating at that particular value of inverse temperature).
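The two local measures can be evaluated directly from Equations (10) and (12). The following is a minimal Python sketch rather than code from the paper; the function name `local_fisher` and the example values are hypothetical and chosen only for illustration.

```python
def local_fisher(x_i, neighbors, mu, sigma2, beta):
    """Return (phi, psi) for one observation and the values in its neighborhood.

    phi follows Equation (10) and psi follows Equation (12) for the
    isotropic pairwise GMRF local model. All names here are illustrative.
    """
    s = sum(x_j - mu for x_j in neighbors)   # sum of centered neighbors
    residual = (x_i - mu) - beta * s         # x_i - mu - beta * sum_j (x_j - mu)
    phi = (residual * s) ** 2 / sigma2 ** 2  # squared first derivative w.r.t. beta
    psi = s ** 2 / sigma2                    # minus the second derivative w.r.t. beta
    return phi, psi


# A pattern that agrees with its four neighbors (hypothetical values).
aligned = local_fisher(1.0, [1.0, 1.0, 1.0, 1.0], mu=0.0, sigma2=1.0, beta=0.25)
# A pattern that disagrees with the same neighbors.
boundary = local_fisher(-1.0, [1.0, 1.0, 1.0, 1.0], mu=0.0, sigma2=1.0, beta=0.25)
# Neighbors symmetric around the mean.
symmetric = local_fisher(1.0, [1.0, -1.0, 2.0, -2.0], mu=0.0, sigma2=1.0, beta=0.25)
```

In this toy setting the aligned pattern has zero Type I information, the disagreeing pattern has a large one, and the symmetric neighborhood has zero Type II information, which matches the informal interpretation given in the text.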
Now, let us move on to configuration patterns showing high values of $\phi_\beta(x_i)$. Those samples can be considered landmarks, because they convey a large amount of information about the global spatial dependence structure. Roughly speaking, those points are very informative, since they are not expected to exist for that particular value of $\beta$ (which guides the expected global behavior of the system). Therefore, minimizing the Type I local observed Fisher information in GMRFs can be a useful tool for producing novel configuration patterns that are more likely to exist given the chosen value of inverse temperature. Basically, $\phi_\beta(x_i)$ tells us how informative a given pattern is for that specific global behavior (represented by a single parameter in an isotropic pairwise GMRF model). In summary, this measure quantifies the degree of agreement between an observation, $x_i$, and the configuration defined by its neighborhood system for a given $\beta$. As we will see later in the experiments section, typical informative patterns (those showing high values of $\phi_\beta(x_i)$) in an organized system are located at the boundaries of the regions defining homogeneous areas, since these boundary samples show an unexpected behavior for large $\beta$: there is no strong agreement between $x_i$ and its neighbors.

Let us now analyze the Type II local observed Fisher information, $\psi_\beta(x_i)$. Informally speaking, this measure can be interpreted as a curvature measure, that is, how curved the local likelihood function is at $x_i$. Thus, patterns showing low values of $\psi_\beta(x_i)$ tend to have a nearly flat local likelihood function. This means that we are dealing with a pattern that could have been observed for a variety of $\beta$ values (a large set of $\beta$ values have approximately the same likelihood).
An implication of this fact is that in a system dominated by this kind of pattern (patterns for which $\psi_\beta(x_i)$ is close to zero), small perturbations may cause a sharp change in $\beta$ (and, therefore, in the expected global behavior). In other words, these patterns are more susceptible to changes, since they do not have a "stable" configuration (which raises our uncertainty about the true value of $\beta$). On the other hand, if the global configuration is mostly composed of patterns exhibiting large values of $\psi_\beta(x_i)$, changes in the global structure are unlikely to happen (the uncertainty on $\beta$ is sufficiently small). Basically, $\psi_\beta(x_i)$ measures the degree of agreement or dependence among the observations belonging to the same neighborhood system. If at a given $x_i$, the observations belonging to $\eta_i$ are totally symmetric around the mean value, $\psi_\beta(x_i)$ would be zero. This is reasonable, since in this situation there is no information about the induced spatial dependence structure (that is, no contextual information is available at this point).

Notice that the role of $\psi_\beta(x_i)$ is not the same as that of $\phi_\beta(x_i)$. Actually, these two measures are almost inversely related: if at $x_i$ the value of $\phi_\beta(x_i)$ is high (it is a landmark or boundary pattern), then $\psi_\beta(x_i)$ is expected to be low (at decision boundaries or edges, the uncertainty about $\beta$ is higher, causing $\psi_\beta(x_i)$ to be small). In fact, we will observe this behavior in some computational experiments conducted in later sections of the paper. It is important to mention that these rather informal arguments form the basis for understanding the meaning of the asymptotic variance of maximum pseudo-likelihood estimators, as we will discuss ahead.
In summary, $\psi_\beta(x_i)$ is a measure of how sure or confident we are about the local spatial dependence structure (at a given point, $x_i$), since a high average curvature is desired for predicting the system's global behavior in a reasonable manner (reducing the uncertainty of the $\beta$ estimation).

3. Expected Fisher Information

In order to avoid the use of approximations in the computation of the global Fisher information in an isotropic pairwise GMRF, in this section, we provide exact expressions for $\hat{\phi}_\beta$ and $\hat{\psi}_\beta$ as the Type I and Type II expected Fisher information. One advantage of using the expected Fisher information instead of its global observed counterpart is the faster computing time. As we will see, instead of requiring a local measure for each observation, $x_i \in X$, followed by an average, both $\Phi_\beta$ and $\Psi_\beta$ depend only on the covariance matrix of the configuration patterns observed along the random field.

3.1. The Type I Expected Fisher Information

Recall that the Type I expected Fisher information, from now on denoted by $\Phi_\beta$, is given by:

$$\Phi_\beta = E\left[\left(\frac{\partial}{\partial\beta}\log L\left(\vec{\theta}; X^{(t)}\right)\right)^{2}\right] \tag{13}$$

The Type II expected Fisher information, from now on denoted by $\Psi_\beta$, is given by:

$$\Psi_\beta = -E\left[\frac{\partial^{2}}{\partial\beta^{2}}\log L\left(\vec{\theta}; X^{(t)}\right)\right] \tag{14}$$

We first proceed to the definition of $\Phi_\beta$.
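Plugging Equation (3) into Equation (13), and after some algebra, we obtain the following expression:

$$
\begin{aligned}
\Phi_\beta &= \frac{1}{\sigma^4} E\left\{\left[\sum_{s=1}^{n}\left(x_s-\mu-\beta\sum_{j\in\eta_s}\left(x_j-\mu\right)\right)\left(\sum_{j\in\eta_s}\left(x_j-\mu\right)\right)\right]^{2}\right\} \\
&= \frac{1}{\sigma^4} E\left\{\sum_{s=1}^{n}\sum_{r=1}^{n}\left[x_s-\mu-\beta\sum_{j\in\eta_s}\left(x_j-\mu\right)\right]\left[x_r-\mu-\beta\sum_{k\in\eta_r}\left(x_k-\mu\right)\right] \times \left[\sum_{j\in\eta_s}\left(x_j-\mu\right)\right]\left[\sum_{k\in\eta_r}\left(x_k-\mu\right)\right]\right\} \\
&= \frac{1}{\sigma^4} E\left\{\sum_{s=1}^{n}\sum_{r=1}^{n}\left[\left(x_s-\mu\right)\left(x_r-\mu\right)-\beta\sum_{k\in\eta_r}\left(x_s-\mu\right)\left(x_k-\mu\right)-\beta\sum_{j\in\eta_s}\left(x_r-\mu\right)\left(x_j-\mu\right)+\beta^{2}\sum_{j\in\eta_s}\sum_{k\in\eta_r}\left(x_j-\mu\right)\left(x_k-\mu\right)\right]\left[\sum_{j\in\eta_s}\sum_{k\in\eta_r}\left(x_j-\mu\right)\left(x_k-\mu\right)\right]\right\} \\
&= \frac{1}{\sigma^4}\sum_{s=1}^{n}\sum_{r=1}^{n}\Bigg\{\sum_{j\in\eta_s}\sum_{k\in\eta_r}E\left[\left(x_s-\mu\right)\left(x_r-\mu\right)\left(x_j-\mu\right)\left(x_k-\mu\right)\right] \\
&\qquad -\beta\sum_{j\in\eta_s}\sum_{k\in\eta_r}\sum_{l\in\eta_r}E\left[\left(x_s-\mu\right)\left(x_j-\mu\right)\left(x_k-\mu\right)\left(x_l-\mu\right)\right] \\
&\qquad -\beta\sum_{m\in\eta_s}\sum_{j\in\eta_s}\sum_{k\in\eta_r}E\left[\left(x_r-\mu\right)\left(x_m-\mu\right)\left(x_j-\mu\right)\left(x_k-\mu\right)\right] \\
&\qquad +\beta^{2}\sum_{m\in\eta_s}\sum_{j\in\eta_s}\sum_{k\in\eta_r}\sum_{l\in\eta_r}E\left[\left(x_m-\mu\right)\left(x_j-\mu\right)\left(x_k-\mu\right)\left(x_l-\mu\right)\right]\Bigg\}
\end{aligned}
\tag{15}
$$

Hence, the expression for $\Phi_\beta$ is composed of four main terms, each of them involving a summation of higher-order cross-moments. According to Isserlis' theorem [28], for zero-mean jointly normally distributed random variables, we can compute higher-order moments in terms of the covariance matrix through the following identity:

$$E\left[X_1X_2X_3X_4\right] = E\left[X_1X_2\right]E\left[X_3X_4\right] + E\left[X_1X_3\right]E\left[X_2X_4\right] + E\left[X_2X_3\right]E\left[X_1X_4\right] \tag{16}$$

Then, the first term of Equation (15) is reduced to:

$$
\begin{aligned}
&\sum_{j\in\eta_s}\sum_{k\in\eta_r}E\left[\left(x_s-\mu\right)\left(x_r-\mu\right)\left(x_j-\mu\right)\left(x_k-\mu\right)\right] \\
&\quad= \sum_{j\in\eta_s}\sum_{k\in\eta_r}\Big\{E\left[\left(x_s-\mu\right)\left(x_r-\mu\right)\right]E\left[\left(x_j-\mu\right)\left(x_k-\mu\right)\right] + E\left[\left(x_s-\mu\right)\left(x_j-\mu\right)\right]E\left[\left(x_r-\mu\right)\left(x_k-\mu\right)\right] \\
&\qquad\qquad + E\left[\left(x_r-\mu\right)\left(x_j-\mu\right)\right]E\left[\left(x_s-\mu\right)\left(x_k-\mu\right)\right]\Big\} \\
&\quad= \sum_{j\in\eta_s}\sum_{k\in\eta_r}\left[\sigma_{sr}\sigma_{jk}+\sigma_{sj}\sigma_{rk}+\sigma_{rj}\sigma_{sk}\right]
\end{aligned}
\tag{17}
$$

where $\sigma_{sr}$ denotes the covariance between variables $x_s$ and $x_r$ (note that in an MRF, we have $\sigma_{sr} = 0$ if $x_r \notin \eta_s$). We now proceed to the expansion of the second main term of Equation (15).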
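As a brief aside before the remaining terms are expanded, the identity in Equation (16) can be written as a small helper. This is an illustrative Python sketch under the assumption of zero-mean, jointly Gaussian variables; the function name `isserlis_fourth_moment` and the covariance values are hypothetical.

```python
def isserlis_fourth_moment(cov, a, b, c, d):
    """E[X_a X_b X_c X_d] for zero-mean jointly Gaussian variables.

    `cov` is the covariance matrix as a list of rows; the three products
    are the three pairings that appear in Equation (16).
    """
    return (cov[a][b] * cov[c][d]
            + cov[a][c] * cov[b][d]
            + cov[b][c] * cov[a][d])


# Hypothetical 2x2 covariance matrix: Var(X0) = 2, Var(X1) = 1, Cov = 0.5.
cov = [[2.0, 0.5],
       [0.5, 1.0]]

fourth_moment_x0 = isserlis_fourth_moment(cov, 0, 0, 0, 0)  # 3 * Var(X0)^2
mixed_moment = isserlis_fourth_moment(cov, 0, 0, 1, 1)      # Var0*Var1 + 2*Cov^2
```

The first value recovers the familiar Gaussian result $E[X^4] = 3\sigma^4$, which is a quick consistency check on the pairing structure used in Equations (17) to (20).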
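Similarly, by applying Isserlis' identity, we have:

$$\sum_{j\in\eta_s}\sum_{k\in\eta_r}\sum_{l\in\eta_r}E\left[\left(x_s-\mu\right)\left(x_j-\mu\right)\left(x_k-\mu\right)\left(x_l-\mu\right)\right] = \sum_{j\in\eta_s}\sum_{k\in\eta_r}\sum_{l\in\eta_r}\left[\sigma_{sj}\sigma_{kl}+\sigma_{sk}\sigma_{jl}+\sigma_{jk}\sigma_{sl}\right] \tag{18}$$

The third term of Equation (15) can be rewritten as:

$$\sum_{m\in\eta_s}\sum_{j\in\eta_s}\sum_{k\in\eta_r}E\left[\left(x_r-\mu\right)\left(x_m-\mu\right)\left(x_j-\mu\right)\left(x_k-\mu\right)\right] = \sum_{m\in\eta_s}\sum_{j\in\eta_s}\sum_{k\in\eta_r}\left[\sigma_{rm}\sigma_{jk}+\sigma_{rj}\sigma_{mk}+\sigma_{mj}\sigma_{rk}\right] \tag{19}$$

Finally, the fourth term is:

$$\sum_{m\in\eta_s}\sum_{j\in\eta_s}\sum_{k\in\eta_r}\sum_{l\in\eta_r}E\left[\left(x_m-\mu\right)\left(x_j-\mu\right)\left(x_k-\mu\right)\left(x_l-\mu\right)\right] = \sum_{m\in\eta_s}\sum_{j\in\eta_s}\sum_{k\in\eta_r}\sum_{l\in\eta_r}\left[\sigma_{mj}\sigma_{kl}+\sigma_{mk}\sigma_{jl}+\sigma_{ml}\sigma_{jk}\right] \tag{20}$$

Therefore, by combining Equations (17)–(20), we have the complete expression for $\Phi_\beta$, the Type I expected Fisher information for an isotropic pairwise GMRF model regarding the inverse temperature parameter:

$$
\begin{aligned}
\Phi_\beta = \frac{1}{\sigma^4}\sum_{s=1}^{n}\sum_{r=1}^{n}\Bigg\{&\sum_{j\in\eta_s}\sum_{k\in\eta_r}\left[\sigma_{sr}\sigma_{jk}+\sigma_{sj}\sigma_{rk}+\sigma_{rj}\sigma_{sk}\right] \\
&-\beta\sum_{j\in\eta_s}\sum_{k\in\eta_r}\sum_{l\in\eta_r}\left[\sigma_{sj}\sigma_{kl}+\sigma_{sk}\sigma_{jl}+\sigma_{jk}\sigma_{sl}\right] \\
&-\beta\sum_{m\in\eta_s}\sum_{j\in\eta_s}\sum_{k\in\eta_r}\left[\sigma_{rm}\sigma_{jk}+\sigma_{rj}\sigma_{mk}+\sigma_{mj}\sigma_{rk}\right] \\
&+\beta^{2}\sum_{m\in\eta_s}\sum_{j\in\eta_s}\sum_{k\in\eta_r}\sum_{l\in\eta_r}\left[\sigma_{mj}\sigma_{kl}+\sigma_{mk}\sigma_{jl}+\sigma_{ml}\sigma_{jk}\right]\Bigg\}
\end{aligned}
\tag{21}
$$

However, since we are interested in studying how the spatial correlations change as the system evolves, we need to estimate a value for $\Phi_\beta$ given a single global state $X^{(t)} = \left(x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\right)$. Hence, to compute $\Phi_\beta$ from a single static configuration $X^{(t)}$ (a snapshot of the system at a given moment), we consider $n = 1$ in the previous equation, which means, among other things, that $s = r$ (which implies $\eta_s = \eta_r$) and that observations belonging to different local neighborhoods are independent of each other (as we are dealing with a pairwise interaction Markovian process, it does not make sense to model the interactions between variables that are far away from each other in the lattice).

Before proceeding, we would like to clarify some points regarding the estimation of the $\beta$ parameter and the computation of the expected Fisher information in the isotropic pairwise GMRF model.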
Basically, there are two main possibilities: (1) the parameter is spatially invariant, which means that we have a unique value, $\hat{\beta}^{(t)}$, for a global configuration of the system, $X^{(t)}$ (this is our assumption); or (2) the parameter is spatially variant, which means that we have a set of $\hat{\beta}_s$ values, for $s = 1, 2, \ldots, n$, each one of them estimated from $X_s = \left(x_s^{(1)}, x_s^{(2)}, \ldots, x_s^{(t)}\right)$ (we are observing the outcomes of a random pattern along time at a fixed position of the lattice). When we are dealing with the first model ($\beta$ is spatially invariant), all possible observation patterns (samples) are extracted from the global configuration by a sliding window (with the shape of the neighborhood system) that moves through the lattice at a fixed time instant, $t$. In this case, we are interested in studying the spatial correlations, not the temporal ones. In other words, we would like to investigate how the spatial structure of a GMRF model is related to Fisher information (this is exactly the scenario described above, for which $n = 1$). Our motivation here is to characterize, via information-theoretic measures, the behavior of the system as it evolves from states of minimum entropy to states of maximum entropy (and vice versa) by providing a geometrical tool based on the definition of the Fisher curve, which will be introduced in the following sections.

Therefore, in our case ($n = 1$), Equation (21) is further simplified for practical usage.
By unifying $s$ and $r$ into a single index, $i$, we have a final expression for $\Phi_\beta$ in terms of the local covariances between the random variables in a given neighborhood system (e.g., for the eight nearest neighbors):

$$
\begin{aligned}
\Phi_\beta = \frac{1}{\sigma^4}\Bigg\{&\sum_{j\in\eta_i}\sum_{k\in\eta_i}\left[\sigma^{2}\sigma_{jk}+2\sigma_{ij}\sigma_{ik}\right] - 2\beta\sum_{j\in\eta_i}\sum_{k\in\eta_i}\sum_{l\in\eta_i}\left[\sigma_{ij}\sigma_{kl}+\sigma_{ik}\sigma_{jl}+\sigma_{il}\sigma_{jk}\right] \\
&+\beta^{2}\sum_{j\in\eta_i}\sum_{k\in\eta_i}\sum_{l\in\eta_i}\sum_{m\in\eta_i}\left[\sigma_{jk}\sigma_{lm}+\sigma_{jl}\sigma_{km}+\sigma_{jm}\sigma_{kl}\right]\Bigg\}
\end{aligned}
\tag{22}
$$

Note that we have two types of covariances in the definition of $\Phi_\beta$ for an isotropic pairwise GMRF: (1) covariances between the central variable, $x_i$, and a neighboring variable, $x_j$, denoted by $\sigma_{ij}$, for $j \in \eta_i$; and (2) covariances between two neighboring variables, $x_j$ and $x_k$, for $j, k \in \eta_i$. In the next sections, we will see how to compute the value of $\Psi_\beta$ directly from the covariance matrix of the local patterns.

3.2. The Type II Expected Fisher Information

Following the same methodology of replacing the likelihood function by the pseudo-likelihood function of the GMRF model, a closed-form expression for $\Psi_\beta$ is developed. Plugging Equation (3) into Equation (14) leads us to:

$$
\begin{aligned}
\Psi_\beta &= \frac{1}{\sigma^2}\sum_{i=1}^{n}E\left\{\left[\sum_{j\in\eta_i}\left(x_j-\mu\right)\right]^{2}\right\} = \frac{1}{\sigma^2}\sum_{i=1}^{n}E\left[\sum_{j\in\eta_i}\sum_{k\in\eta_i}\left(x_j-\mu\right)\left(x_k-\mu\right)\right] \\
&= \frac{1}{\sigma^2}\sum_{i=1}^{n}\left[\sum_{j\in\eta_i}\sum_{k\in\eta_i}E\left[\left(x_j-\mu\right)\left(x_k-\mu\right)\right]\right] = \frac{1}{\sigma^2}\sum_{i=1}^{n}\sum_{j\in\eta_i}\sum_{k\in\eta_i}\sigma_{jk}
\end{aligned}
\tag{23}
$$

Note that unlike $\Phi_\beta$, $\Psi_\beta$ does not depend explicitly on $\beta$ (inverse temperature). As we have seen before, $\Phi_\beta$ is a quadratic function of the spatial dependence parameter.

In order to simplify the notation and to make computations easier, the expressions for $\Phi_\beta$ and $\Psi_\beta$ can be rewritten in matrix-vector form. Let $\Sigma_p$ be the covariance matrix of the random vectors $\vec{p}_i$, $i = 1, 2, \ldots, n$, obtained by lexicographic ordering of the local configuration patterns $x_i \cup \eta_i$. Thus, considering a neighborhood system, $\eta_i$, of size $K$, we have $\Sigma_p$ given by a $(K+1) \times (K+1)$ symmetric matrix (for $K+1$ odd, i.e., $K = 4, 8, 12, \ldots$):

$$
\Sigma_p = \begin{pmatrix}
\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+1} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+1} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
\sigma_{(K/2)+1,1} & \cdots & \sigma_{(K/2)+1,K/2} & \sigma_{(K/2)+1,(K/2)+1} & \sigma_{(K/2)+1,(K/2)+2} & \cdots & \sigma_{(K/2)+1,K+1} \\
\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+1} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+1} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
\end{pmatrix}
$$

Let $\Sigma_p^{-}$ be the submatrix of dimensions $K \times K$ obtained by removing the central row and central column of $\Sigma_p$ (the covariances between $x_i$ and each one of its neighbors, $x_j$). Then, for $K+1$ odd, we have:

$$
\Sigma_p^{-} = \begin{pmatrix}
\sigma_{1,1} & \cdots & \sigma_{1,K/2} & \sigma_{1,(K/2)+2} & \cdots & \sigma_{1,K+1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{K/2,1} & \cdots & \sigma_{K/2,K/2} & \sigma_{K/2,(K/2)+2} & \cdots & \sigma_{K/2,K+1} \\
\sigma_{(K/2)+2,1} & \cdots & \sigma_{(K/2)+2,K/2} & \sigma_{(K/2)+2,(K/2)+2} & \cdots & \sigma_{(K/2)+2,K+1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{K+1,1} & \cdots & \sigma_{K+1,K/2} & \sigma_{K+1,(K/2)+2} & \cdots & \sigma_{K+1,K+1}
\end{pmatrix}
\tag{24}
$$

Thus, $\Sigma_p^{-}$ is a matrix that stores only the covariances among the neighboring variables. Furthermore, let $\vec{\rho}$ be the vector of dimensions $K \times 1$ formed by all the elements of the central row of $\Sigma_p$, excluding the middle one (which is actually a variance), that is:

$$\vec{\rho} = \left[\sigma_{(K/2)+1,1} \;\cdots\; \sigma_{(K/2)+1,K/2} \;\; \sigma_{(K/2)+1,(K/2)+2} \;\cdots\; \sigma_{(K/2)+1,K+1}\right] \tag{25}$$

Therefore, we can rewrite Equation (22) (for $n = 1$) using Kronecker products. The following definition provides a fast way to compute $\Phi_\beta$ exploiting these tensor products.

Definition 5. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$, of size $K$ (usual choices for $K$ are even values: four, eight, 12, 20 or 24). Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$ and that $\vec{\rho}$ and $\Sigma_p^{-}$ are defined as in Equations (25) and (24), the Type I expected Fisher information, $\Phi_\beta$, for this state, $X^{(t)}$, is:

$$\Phi_\beta = \frac{1}{\sigma^4}\left[\sigma^{2}\left\|\Sigma_p^{-}\right\|_{+} + 2\left\|\vec{\rho}\otimes\vec{\rho}^{\,T}\right\|_{+} - 6\beta\left\|\vec{\rho}^{\,T}\otimes\Sigma_p^{-}\right\|_{+} + 3\beta^{2}\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+}\right] \tag{26}$$

where $\|A\|_{+}$ denotes the summation of all the entries of the matrix, $A$ (not to be confused with a matrix norm), and $\otimes$ denotes the Kronecker (tensor) product. From an information geometry perspective, the presence of tensor products indicates the intrinsic differential geometry of a manifold in the form of the Riemann curvature tensor [18]. Note that all the necessary information for computing the Fisher information is encoded in the covariance matrix of the local configuration patterns, $(x_i \cup \eta_i)$, $i = 1, 2, \ldots, n$, as would be expected in the case of Gaussian variables (second-order statistics). The same procedure is applied to the Type II expected Fisher information.

Definition 6. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$, of size $K$ (usual choices for $K$ are four, eight, 12, 20 or 24). Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$ and that $\Sigma_p^{-}$ is defined as in Equation (24), the Type II expected Fisher information, $\Psi_\beta$, for this state, $X^{(t)}$, is given by:

$$\Psi_\beta = \frac{1}{\sigma^2}\left\|\Sigma_p^{-}\right\|_{+} \tag{27}$$

3.3. Information Equilibrium in GMRF Models

From the definition of both $\Phi_\beta$ and $\Psi_\beta$, a natural question that arises is: under what conditions do we have $\Phi_\beta = \Psi_\beta$ in an isotropic pairwise GMRF model? As we can see from Equations (26) and (27), the difference between $\Phi_\beta$ and $\Psi_\beta$, from now on denoted by $\Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right)$, is simply:

$$\Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right) = \frac{1}{\sigma^4}\left[2\left\|\vec{\rho}\otimes\vec{\rho}^{\,T}\right\|_{+} - 6\beta\left\|\vec{\rho}^{\,T}\otimes\Sigma_p^{-}\right\|_{+} + 3\beta^{2}\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+}\right] \tag{28}$$

Then, intuitively, the condition for information equality is achieved when $\Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right) = 0$.
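The matrix-vector expressions in Equations (26) to (28) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`total`, `kron`, `split_patterns_covariance`, `expected_fisher`) are hypothetical, the covariance matrix below is made up for a four-neighbor system, and $\sigma^2$ is passed explicitly rather than estimated.

```python
def total(m):
    """Sum of all entries of a matrix given as a list of rows (the ||.||_+ operator)."""
    return sum(sum(row) for row in m)


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def split_patterns_covariance(sigma_p):
    """Return (rho, sigma_minus) as in Equations (25) and (24)."""
    c = len(sigma_p) // 2                          # index of the central variable
    keep = [i for i in range(len(sigma_p)) if i != c]
    rho = [sigma_p[c][j] for j in keep]            # central row without the variance
    sigma_minus = [[sigma_p[i][j] for j in keep] for i in keep]
    return rho, sigma_minus


def expected_fisher(sigma_p, sigma2, beta):
    """Return (Phi, Psi, Delta) following Equations (26), (27) and (28)."""
    rho, sm = split_patterns_covariance(sigma_p)
    rho_col = [[r] for r in rho]                   # K x 1
    rho_row = [rho]                                # 1 x K
    a = total(kron(rho_col, rho_row))              # ||rho (x) rho^T||_+
    b = total(kron(rho_row, sm))                   # ||rho^T (x) Sigma^-||_+
    c = total(kron(sm, sm))                        # ||Sigma^- (x) Sigma^-||_+
    phi = (sigma2 * total(sm) + 2 * a - 6 * beta * b + 3 * beta ** 2 * c) / sigma2 ** 2
    psi = total(sm) / sigma2
    return phi, psi, phi - psi


# Hypothetical 5x5 covariance matrix (K = 4): unit variances, neighbor-neighbor
# covariance 0.1 and center-neighbor covariance 0.2.
K = 4
center = K // 2
sigma_p = [[1.0 if i == j else 0.1 for j in range(K + 1)] for i in range(K + 1)]
for j in range(K + 1):
    if j != center:
        sigma_p[center][j] = sigma_p[j][center] = 0.2

rho, sm = split_patterns_covariance(sigma_p)
phi, psi, delta = expected_fisher(sigma_p, sigma2=1.0, beta=0.1)
```

Because the entries of a Kronecker product are all pairwise products of the entries of its factors, the sum of its entries equals the product of the two individual sums. The three Kronecker terms can therefore also be obtained as $\|\vec{\rho}\|_+^2$, $\|\vec{\rho}\|_+\|\Sigma_p^-\|_+$ and $\|\Sigma_p^-\|_+^2$ without forming the products explicitly.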
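As $\Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right)$ is a simple quadratic function of the inverse temperature parameter, $\beta$, we can easily find that the value, $\beta^{*}$, for which $\Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right) = 0$, is:

$$\beta^{*} = \frac{\left\|\vec{\rho}^{\,T}\otimes\Sigma_p^{-}\right\|_{+}}{\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+}} \pm \frac{\sqrt{3}}{3}\,\frac{\sqrt{3\left\|\vec{\rho}^{\,T}\otimes\Sigma_p^{-}\right\|_{+}^{2} - 2\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+}\left\|\vec{\rho}\otimes\vec{\rho}^{\,T}\right\|_{+}}}{\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+}} \tag{29}$$

provided that $3\left\|\vec{\rho}^{\,T}\otimes\Sigma_p^{-}\right\|_{+}^{2} \geq 2\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+}\left\|\vec{\rho}\otimes\vec{\rho}^{\,T}\right\|_{+}$ and $\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+} \neq 0$. Note that if $\left\|\vec{\rho}\otimes\vec{\rho}^{\,T}\right\|_{+} = 0$, then one solution of the above equation is $\beta^{*} = 0$. In other words, when $\sigma_{ij} = 0$, $\forall j \in \eta_i$ (no correlation between $x_i$ and its neighbors, $x_j$), information equilibrium is achieved for $\beta^{*} = 0$, which in this case is the maximum pseudo-likelihood estimate of $\beta$, since in this matrix-vector notation, $\hat{\beta}_{MPL}$ is given by:

$$\hat{\beta}_{MPL} = \frac{\sum_{j\in\eta_i}\hat{\sigma}_{ij}}{\sum_{j\in\eta_i}\sum_{k\in\eta_i}\hat{\sigma}_{jk}} = \frac{\left\|\vec{\rho}\right\|_{+}}{\left\|\Sigma_p^{-}\right\|_{+}} \tag{30}$$

In the isotropic pairwise GMRF model, if $\beta = 0$, then we have $\left\|\vec{\rho}\right\|_{+} = 0$, and as a consequence, $\Phi_\beta = \Psi_\beta$. However, the converse is not necessarily true, that is, we may observe that $\Phi_\beta = \Psi_\beta$ for a non-zero $\beta$. One example is $\beta^{*}$, a solution of $\Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right) = 0$.

4. Entropy in Isotropic Pairwise GMRFs

We define the entropy by repeating the same process employed to derive $\Phi_\beta$ and $\Psi_\beta$. Since the entropy of a random variable $x$ is defined as the expected value of the self-information, $-\log p(x)$, it can be thought of as a probability-based counterpart to the Fisher information.

Definition 7. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$, then the entropy, $H_\beta$, for this state $X^{(t)}$ is given by:

$$
\begin{aligned}
H_\beta &= -E\left[\log L\left(\vec{\theta}; X^{(t)}\right)\right] = -E\left[\log\prod_{i=1}^{n}p\left(x_i \mid \eta_i, \vec{\theta}\right)\right] \\
&= \frac{n}{2}\log\left(2\pi\sigma^{2}\right) + \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}E\left\{\left[x_i-\mu-\beta\sum_{j\in\eta_i}\left(x_j-\mu\right)\right]^{2}\right\} \\
&= \frac{n}{2}\log\left(2\pi\sigma^{2}\right) + \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left\{E\left[\left(x_i-\mu\right)^{2}\right] - 2\beta E\left[\sum_{j\in\eta_i}\left(x_i-\mu\right)\left(x_j-\mu\right)\right] + \beta^{2}E\left\{\left[\sum_{j\in\eta_i}\left(x_j-\mu\right)\right]^{2}\right\}\right\}
\end{aligned}
\tag{31}
$$

After some algebra, the expression for $H_\beta$ becomes:

$$
\begin{aligned}
H_\beta &= \frac{n}{2}\log\left(2\pi\sigma^{2}\right) + \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left\{\sigma^{2} - 2\beta\sum_{j\in\eta_i}\sigma_{ij} + \beta^{2}\sum_{j\in\eta_i}\sum_{k\in\eta_i}\sigma_{jk}\right\} \\
&= \left[\frac{n}{2}\log\left(2\pi\sigma^{2}\right) + \frac{n}{2}\right] - \frac{\beta}{\sigma^{2}}\sum_{i=1}^{n}\left[\sum_{j\in\eta_i}\sigma_{ij}\right] + \frac{\beta^{2}}{2\sigma^{2}}\sum_{i=1}^{n}\left[\sum_{j\in\eta_i}\sum_{k\in\eta_i}\sigma_{jk}\right]
\end{aligned}
\tag{32}
$$

Using the same matrix-vector notation introduced in the previous sections, we can further simplify the expression for $H_\beta$ (considering $n = 1$).

Definition 8. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$ and that $\vec{\rho}$ and $\Sigma_p^{-}$ are defined as in Equations (25) and (24), the entropy, $H_\beta$, for this state, $X^{(t)}$, is given by:

$$H_\beta = H_G - \left[\frac{\beta}{\sigma^{2}}\left\|\vec{\rho}\right\|_{+} - \frac{\beta^{2}}{2\sigma^{2}}\left\|\Sigma_p^{-}\right\|_{+}\right] = H_G - \left[\frac{\beta}{\sigma^{2}}\left\|\vec{\rho}\right\|_{+} - \frac{\beta^{2}}{2}\Psi_\beta\right] \tag{33}$$

where $H_G$ denotes the entropy of a Gaussian random variable with variance $\sigma^2$ and $\Psi_\beta$ is the Type II expected Fisher information.

Note that the Shannon entropy is a quadratic function of the spatial dependence parameter, $\beta$. Since the coefficient of the quadratic term is non-negative ($\Psi_\beta$ is the Type II expected Fisher information), the entropy is a convex function of $\beta$. Furthermore, as expected, when $\beta = 0$ and there is no induced spatial dependence in the system, the resulting expression for $H_\beta$ is the usual entropy of a Gaussian random variable, $H_G$. Thus, there is a value, $\hat{\beta}_{MH}$, of the inverse temperature parameter that minimizes the entropy of the system.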
In fact, $\hat{\beta}_{MH}$ is given by:

$$\frac{\partial H_\beta}{\partial\beta} = \frac{\beta}{\sigma^{2}}\left\|\Sigma_p^{-}\right\|_{+} - \frac{1}{\sigma^{2}}\left\|\vec{\rho}\right\|_{+} = 0 \quad\Longrightarrow\quad \hat{\beta}_{MH} = \frac{\left\|\vec{\rho}\right\|_{+}}{\left\|\Sigma_p^{-}\right\|_{+}} = \hat{\beta}_{MPL} \tag{34}$$

showing that the maximum pseudo-likelihood and the minimum-entropy estimates are equivalent in an isotropic pairwise GMRF model. Moreover, using the derived equations, we see a relationship between $\Phi_\beta$, $\Psi_\beta$ and $H_\beta$:

$$\Phi_\beta - \Psi_\beta = \Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right), \qquad \frac{\partial^{2}H_\beta}{\partial\beta^{2}} = \Psi_\beta \tag{35}$$

where the functional $\Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right)$, which represents the difference between $\Phi_\beta$ and $\Psi_\beta$, is defined by Equation (28). These equations relate the entropy and one form of Fisher information ($\Psi_\beta$) in GMRF models, showing that $\Psi_\beta$ can be roughly viewed as the curvature of $H_\beta$. In this sense, in a hypothetical information equilibrium condition $\Psi_\beta = \Phi_\beta = 0$, the entropy's curvature would be null ($H_\beta$ would never change). These results suggest that an increase in the value of $\Psi_\beta$, which means stability (a measure of agreement between the neighboring observations of a given point), contributes to the curvature and, therefore, to inducing a change in the entropy of the system. In this context, the analysis of the Fisher information could bring us insights for predicting the entropy of a system.

5. Asymptotic Variance of MPL Estimators

It is known from the statistical inference literature that unbiasedness is a property that is guaranteed neither by maximum likelihood estimation nor by maximum pseudo-likelihood (MPL) estimation. Actually, there is no universal method that guarantees the existence of unbiased estimators for a fixed $n$-sized sample. Often, in the exponential family of distributions, maximum likelihood estimators (MLEs) coincide with the UMVU (uniformly minimum variance unbiased) estimators, because MLEs are functions of complete sufficient statistics. There is an important result in statistical inference that shows that if the MLE is unique, then it is a function of sufficient statistics.
We could list many further properties that make maximum likelihood estimation a reference method [15–17]. One of the most important concerns the asymptotic behavior of MLEs: as the sample size grows to infinity ($n \to \infty$), MLEs become asymptotically unbiased and efficient. Unfortunately, there is no result showing that the same occurs in maximum pseudo-likelihood estimation. The objective of this section is to propose a closed-form expression for the asymptotic variance of the maximum pseudo-likelihood estimator of $\beta$ in an isotropic pairwise GMRF model. Unsurprisingly, this variance is completely defined as a function of both forms of expected Fisher information, $\Psi_\beta$ and $\Phi_\beta$, since for general values of the inverse temperature parameter, the information equality condition fails.

5.1. The Asymptotic Variance of the Inverse Temperature Parameter

In mathematical statistics, asymptotic evaluations uncover several fundamental properties of inference methods, providing a powerful and general tool for studying and characterizing the behavior of estimators. In this section, our objective is to derive an expression for the asymptotic variance of the maximum pseudo-likelihood estimator of the inverse temperature parameter ($\beta$) in isotropic pairwise GMRF models. It is known from the statistical inference literature that both maximum likelihood and maximum pseudo-likelihood estimators share two important properties: consistency and asymptotic normality [29,30]. It is possible, therefore, to completely characterize their behavior in the limiting case. In other words, the asymptotic distribution of $\hat{\beta}_{MPL}$ is normal, centered around the real parameter value (since consistency means that the estimator is asymptotically unbiased), with the asymptotic variance representing the uncertainty about how far we are from the mean (real value).
From a statistical perspective, $\hat{\beta}_{MPL} \approx N\left(\beta, \upsilon_\beta\right)$, where $\upsilon_\beta$ denotes the asymptotic variance of the maximum pseudo-likelihood estimator. It is known that the asymptotic covariance matrix of maximum pseudo-likelihood estimators is given by [31]:

$$C(\vec{\theta}) = H^{-1}(\vec{\theta})\,J(\vec{\theta})\,H^{-1}(\vec{\theta}) \tag{36}$$

with:

$$H(\vec{\theta}) = E_\beta\left[\nabla^{2}\log L\left(\vec{\theta}; X^{(t)}\right)\right] \tag{37}$$

$$J(\vec{\theta}) = \mathrm{Var}_\beta\left[\nabla\log L\left(\vec{\theta}; X^{(t)}\right)\right] \tag{38}$$

where $H$ and $J$ are built from, respectively, the Hessian and the Jacobian (gradient) of the logarithm of the pseudo-likelihood function. Thus, considering the parameter of interest, $\beta$, we have the following definition for its asymptotic variance, $\upsilon_\beta$ (the derivatives are taken with respect to $\beta$):

$$\upsilon_\beta = \frac{\mathrm{Var}_\beta\left[\frac{\partial}{\partial\beta}\log L\left(\vec{\theta}; X^{(t)}\right)\right]}{E_\beta^{2}\left[\frac{\partial^{2}}{\partial\beta^{2}}\log L\left(\vec{\theta}; X^{(t)}\right)\right]} = \frac{E_\beta\left[\left(\frac{\partial}{\partial\beta}\log L\left(\vec{\theta}; X^{(t)}\right)\right)^{2}\right] - E_\beta^{2}\left[\frac{\partial}{\partial\beta}\log L\left(\vec{\theta}; X^{(t)}\right)\right]}{E_\beta^{2}\left[\frac{\partial^{2}}{\partial\beta^{2}}\log L\left(\vec{\theta}; X^{(t)}\right)\right]} \tag{39}$$

However, note that the expected value of the first derivative of $\log L\left(\vec{\theta}; X^{(t)}\right)$ with respect to $\beta$ is zero:

$$E\left[\frac{\partial}{\partial\beta}\log L\left(\vec{\theta}; X^{(t)}\right)\right] = \frac{1}{\sigma^{2}}\sum_{i=1}^{n}\left\{E\left[x_i-\mu\right] - \beta\sum_{j\in\eta_i}E\left[x_j-\mu\right]\right\} = 0 \tag{40}$$

Therefore, the second term of the numerator of Equation (39) vanishes, and the final expression for the asymptotic variance of the inverse temperature parameter is given as the ratio between $\Phi_\beta$ and $\Psi_\beta^{2}$:

$$
\begin{aligned}
\upsilon_\beta = \frac{1}{\left[\sum_{j\in\eta_i}\sum_{k\in\eta_i}\sigma_{jk}\right]^{2}}\Bigg\{&\sum_{j\in\eta_i}\sum_{k\in\eta_i}\left[\sigma^{2}\sigma_{jk}+2\sigma_{ij}\sigma_{ik}\right] - 2\beta\sum_{j\in\eta_i}\sum_{k\in\eta_i}\sum_{l\in\eta_i}\left[\sigma_{ij}\sigma_{kl}+\sigma_{ik}\sigma_{jl}+\sigma_{il}\sigma_{jk}\right] \\
&+\beta^{2}\sum_{j\in\eta_i}\sum_{k\in\eta_i}\sum_{l\in\eta_i}\sum_{m\in\eta_i}\left[\sigma_{jk}\sigma_{lm}+\sigma_{jl}\sigma_{km}+\sigma_{jm}\sigma_{kl}\right]\Bigg\}
\end{aligned}
\tag{41}
$$

This derivation leads us to another definition concerning an isotropic pairwise GMRF.

Definition 9. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$ and that $\vec{\rho}$ and $\Sigma_p^{-}$ are defined as in Equations (25) and (24), the asymptotic variance of the maximum pseudo-likelihood estimator of the inverse temperature parameter, $\beta$, is given by (using the same matrix-vector notation from the previous sections):

$$
\begin{aligned}
\upsilon_\beta &= \frac{\sigma^{2}\left\|\Sigma_p^{-}\right\|_{+} + 2\left\|\vec{\rho}\otimes\vec{\rho}^{\,T}\right\|_{+} - 6\beta\left\|\vec{\rho}^{\,T}\otimes\Sigma_p^{-}\right\|_{+} + 3\beta^{2}\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+}}{\left\|\Sigma_p^{-}\right\|_{+}^{2}} \\
&= \frac{\sigma^{2}}{\left\|\Sigma_p^{-}\right\|_{+}} + \frac{\sigma^{4}\Delta_\beta\left(\vec{\rho}, \Sigma_p^{-}\right)}{\left\|\Sigma_p^{-}\right\|_{+}^{2}} = \frac{1}{\Psi_\beta} + \frac{1}{\Psi_\beta^{2}}\left(\Phi_\beta - \Psi_\beta\right)
\end{aligned}
\tag{42}
$$

Note that when information equilibrium prevails, that is, $\Phi_\beta = \Psi_\beta$, the asymptotic variance is given by the inverse of the expected Fisher information. Moreover, this equation indicates that the uncertainty in the estimation of the inverse temperature parameter is minimized when $\Psi_\beta$ is maximized. Essentially, this means that on average, the local pseudo-likelihood functions are not flat, that is, small changes in the local configuration patterns along the system cannot cause abrupt changes in the expected global behavior (the global spatial dependence structure is not susceptible to sharp changes). To reach this condition, there must be a reasonable degree of agreement between the neighboring elements throughout the system, a behavior that is usually associated with low temperature states ($\beta$ is above a critical value and there is a visible induced spatial dependence structure).

6. The Fisher Curve of a System

With the definition of $\Phi_\beta$, $\Psi_\beta$ and $H_\beta$, we have the necessary tools to compute three important information-theoretic measures of a global configuration of the system. Our idea is that we can study the behavior of a complex system by constructing a parametric curve in this information-theoretic space as a function of the inverse temperature parameter, $\beta$.
Our expectation is that the resulting trajectory provides a geometrical interpretation of how the system moves from an initial configuration, $A$ (with a low entropy value, for instance), to a desired final configuration, $B$ (with a greater value of entropy, for instance), since the Fisher information plays an important role in providing a natural metric to the Riemannian manifolds of statistical models [18,19]. We will call the path from global State $A$ to global State $B$ the Fisher curve (from $A$ to $B$) of the system, denoted by $\vec{F}_A^B(\beta)$. Instead of using time as the parameter to build the curve, $\vec{F}$, we parametrize $\vec{F}$ by the inverse temperature parameter, $\beta$.

Definition 10. Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system, $\eta_i$, and let $X^{(\beta_1)}, X^{(\beta_2)}, \ldots, X^{(\beta_n)}$ be a sequence of outcomes (global configurations) produced by different values of $\beta_i$ (inverse temperature parameters) for which $A = \beta_{MIN} = \beta_1 < \beta_2 < \cdots < \beta_n = \beta_{MAX} = B$. The system's Fisher curve from $A$ to $B$ is defined as the function $\vec{F} : \Re \to \Re^{3}$ that maps each configuration, $X^{(\beta_i)}$, to a point $\left(\Phi_{\beta_i}, \Psi_{\beta_i}, H_{\beta_i}\right)$ of the information space, that is:

$$\vec{F}_A^B(\beta) = \left(\Phi_\beta, \Psi_\beta, H_\beta\right), \qquad \beta = A, \ldots, B \tag{43}$$

where $\Phi_\beta$, $\Psi_\beta$ and $H_\beta$ denote the Type I expected Fisher information, the Type II expected Fisher information and the Shannon entropy of the global configuration, $X^{(\beta)}$, defined by:

$$\Phi_\beta = \frac{1}{\sigma^4}\left[\sigma^{2}\left\|\Sigma_p^{-}\right\|_{+} + 2\left\|\vec{\rho}\otimes\vec{\rho}^{\,T}\right\|_{+} - 6\beta\left\|\vec{\rho}^{\,T}\otimes\Sigma_p^{-}\right\|_{+} + 3\beta^{2}\left\|\Sigma_p^{-}\otimes\Sigma_p^{-}\right\|_{+}\right] \tag{44}$$

$$\Psi_\beta = \frac{1}{\sigma^2}\left\|\Sigma_p^{-}\right\|_{+} \tag{45}$$

$$H_\beta = \frac{1}{2}\left[\log\left(2\pi\sigma^{2}\right) + 1\right] - \left[\frac{\beta}{\sigma^{2}}\left\|\vec{\rho}\right\|_{+} - \frac{\beta^{2}}{2}\Psi_\beta\right] \tag{46}$$

In the next sections, we show some computational experiments that illustrate the effectiveness of the proposed tools in measuring the information encoded in complex systems. We want to investigate what happens to the Fisher curve as the inverse temperature parameter is modified in order to control the system's global behavior.
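To make Definition 10 concrete, a point of the Fisher curve can be computed from two scalar summaries of a configuration, $a = \|\vec{\rho}\|_+$ and $S = \|\Sigma_p^-\|_+$. The Python sketch below is illustrative only: it relies on the fact that the sum of the entries of a Kronecker product is the product of the sums of its factors, so the three Kronecker terms reduce to $a^2$, $aS$ and $S^2$. The function names and the numeric summaries are hypothetical; in practice each $(a, S)$ pair would be estimated from a simulated or observed configuration $X^{(\beta)}$.

```python
import math


def fisher_curve_point(beta, rho_sum, sigma_minus_sum, sigma2):
    """Return (Phi, Psi, H) following Equations (44), (45) and (46)."""
    a, s = rho_sum, sigma_minus_sum
    phi = (sigma2 * s + 2 * a * a - 6 * beta * a * s + 3 * beta ** 2 * s * s) / sigma2 ** 2
    psi = s / sigma2
    h_gauss = 0.5 * (math.log(2 * math.pi * sigma2) + 1)   # entropy of N(mu, sigma2)
    h = h_gauss - (beta * a / sigma2 - 0.5 * beta ** 2 * psi)
    return phi, psi, h


def asymptotic_variance(phi, psi):
    """Asymptotic variance of the MPL estimator of beta, Equation (42)."""
    return 1.0 / psi + (phi - psi) / psi ** 2


# Hypothetical summaries (beta, ||rho||_+, ||Sigma^-||_+) for three configurations.
summaries = [(0.0, 0.0, 4.0), (0.1, 0.8, 5.2), (0.2, 1.6, 6.4)]
curve = [fisher_curve_point(b, a, s, sigma2=1.0) for b, a, s in summaries]
variances = [asymptotic_variance(phi, psi) for phi, psi, _ in curve]
```

For the first triple ($\beta = 0$ and no center-neighbor covariance) the two forms of Fisher information coincide and the variance reduces to $1/\Psi_\beta$, while for the other two they differ.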
Our main conclusion, which is supported by experimental analysis, is that ⃗FB A(β) ̸= ⃗FA B (β). In other words, in terms of information, moving towards higher entropy states is not the same as moving towards lower entropy states, since the Fisher curves that represent the trajectory between the initial State A and the final State B are significantly different. 265 Entropy 2014, 16, 1002–1036 7. Computational Simulations This section discusses some numerical experiments proposed to illustrate some applications of the derived tools in both simulations and real data. Our computational investigations were divided into two main sets of experiments: (1) Local analysis: analysis of the local and global versions of the measures (φβ, ψβ, Φβ, Ψβ and Hβ), considering a fixed inverse temperature parameter; (2) Global analysis: analysis of the global versions of the measures (Φβ, Ψβ and Hβ) along Markov chain Monte Carlo (MCMC) simulations in which the inverse temperature parameter is modified to control the expected global behavior. 7.1. Learning from Spatial Data with Local Information-Theoretic Measures First, in order to illustrate a simple application of both forms of local observed Fisher information, φβ and ψβ, we performed an experiment using some synthetic images generated by the Metropolis–Hastings algorithm. The basic idea of this simulation process is to start at an initial configuration in which temperature is infinite (or β = 0). This basic initial condition is randomly chosen, and after a fixed number of steps, the algorithm produces a configuration that is considered to be a valid outcome of an isotropic pairwise GMRF model. Figure 1 shows an example of the initial condition and the resulting system configuration after 1,000 iterations considering a second order neighborhood system (eight nearest neighbors). The model parameters were chosen as: μ = 0, σ2 = 5 and β = 0.8. 266 Entropy 2014, 16, 1002–1036 Figure 1. 
Example of Gaussian Markov random field (GMRF) model outputs. The values of the inverse temperature parameter, β, in the left and right configurations are zero and 0.8, respectively.

Three Fisher information maps were generated for both the initial and the resulting configurations. The first map was obtained by calculating the value, φβ(xi), for every point of the system, that is, for i = 1, 2, . . . , n. Similarly, the second one was obtained by using ψβ(xi). The last information map was built by using the ratio between φβ(xi) and ψβ(xi), motivated by the fact that boundaries are often composed of patterns that are not expected to be “aligned” to the global behavior (and, therefore, show high values of φβ(xi)) and are also somewhat unstable (show low values of ψβ(xi)). We will call this measure, Lβ(xi) = φβ(xi)/ψβ(xi), the local L-information, since it is defined in terms of the first two derivatives of the logarithm of the local pseudo-likelihood function. Figure 2 shows the obtained information maps as images. Note that while φβ has a strong response for boundaries (the edges are light), ψβ has a weak one (so the edges are dark), which is evidence in favor of considering L-information in boundary detection procedures. Note also that in the initial condition, when the temperature is infinite, the informative patterns are almost uniformly spread all over the system, while the final configuration shows a more sparse representation in terms of information. Figure 3 shows the distribution of local L-information for both system configurations depicted in Figure 1.

Figure 2. Fisher information maps. The first row shows the information maps of the system when the temperature is infinite (β = 0). The second row shows the same maps when the temperature is low (β = 0.8). The first and second columns show information maps that were generated by computing φβ(xi) and ψβ(xi) for each observation in the lattice.
The third column map was produced by computing the local L-information, that is, the ratio between both local information measures. In terms of information, low temperature configurations are more sparse, since most local patterns are uninformative, due to the strong alignment of the particles throughout the system, which is the expected global behavior for β above a certain critical value.

Figure 3. Distribution of local L-information. When the temperature is infinite, the information is spread along the system. For low temperature configurations, the number of local patterns with zero information content significantly increases, that is, the system is more sparse in terms of Fisher information.

7.2. Analyzing Dynamical Systems with Global Information-Theoretic Measures

In order to study the behavior of a complex system that evolves from an initial State A to another State B, we use the Metropolis–Hastings algorithm, an MCMC simulation method, to generate a sequence of valid isotropic pairwise GMRF model outcomes for different values of the inverse temperature parameter, β. This process is an attempt to perform a random walk on the state space of the system, that is, in the space of all possible global configurations, in order to analyze the behavior of the proposed global measures: entropy and both forms of Fisher information. The main purpose of this experiment is to observe what happens to Φβ, Ψβ and Hβ when the system evolves from a random initial state to other global configurations. In other words, we want to investigate the Fisher curve of the system in order to characterize its behavior in terms of information.
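The sampling step mentioned above can be sketched as a single-site Metropolis–Hastings sweep. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: it assumes that the local conditional density of each site is Gaussian with mean μ + β Σ (xj − μ) over the eight nearest neighbors and variance σ², that the lattice wraps around at its borders, and that proposals are Gaussian random-walk steps. All function names are hypothetical.

```python
import math
import random

# Offsets of the eight nearest neighbours (second-order neighbourhood).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def log_local_density(value, neighbours, mu, sigma2, beta):
    """Log of the assumed Gaussian local conditional, up to a constant."""
    mean = mu + beta * sum(v - mu for v in neighbours)
    return -0.5 * (value - mean) ** 2 / sigma2

def mh_sweep(x, mu, sigma2, beta, rng, step=1.0):
    """Visit every site once and apply a Metropolis-Hastings update in place."""
    rows, cols = len(x), len(x[0])
    for i in range(rows):
        for j in range(cols):
            # Toroidal (wrap-around) boundary: an assumption of this sketch.
            nb = [x[(i + di) % rows][(j + dj) % cols] for di, dj in OFFSETS]
            proposal = x[i][j] + rng.gauss(0.0, step)
            log_ratio = (log_local_density(proposal, nb, mu, sigma2, beta)
                         - log_local_density(x[i][j], nb, mu, sigma2, beta))
            # Symmetric proposal, so the acceptance ratio is the density ratio.
            if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                x[i][j] = proposal
    return x
```

A lattice can be initialized at β = 0 by drawing every site independently from a Gaussian with mean μ and variance σ², and then passed through repeated sweeps while β is changed between sweeps.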
Basically, the idea is to use the Fisher curve as a kind of signature for the expected behavior of any system modeled by an isotropic pairwise GMRF, making it possible to gain insights into the understanding of large complex systems.

To simulate a system in which we can control the inverse temperature parameter, we define an updating rule for β based on fixed increments. In summary, we start with a minimum value βMIN (when βMIN = 0, the temperature of the system is infinite). Then, the value of β in the iteration, t, is defined as the value of β in t − 1 plus a small increment (Δβ), until it reaches a pre-defined upper bound, βMAX. The process is then repeated with negative increments −Δβ, until the inverse temperature reaches its minimum value, βMIN, again. This process continues for a fixed number of iterations, NMAX, during an MCMC simulation. As a result of this approach, a sequence of GMRF samples is produced. We use this sequence to calculate Φβ, Ψβ and Hβ and define the Fisher curve $\vec{F}$, for β = βMIN, . . . , βMAX. Figure 4 shows some of the system’s configurations along an MCMC simulation. In this experiment, the parameters were defined as: βMIN = 0, Δβ = 0.001, βMAX = 0.15, NMAX = 1,000, μ = 0, σ² = 5 and ηi = {(i − 1, j − 1), (i − 1, j), (i − 1, j + 1), (i, j − 1), (i, j + 1), (i + 1, j − 1), (i + 1, j), (i + 1, j + 1)}.

Figure 4. Global configurations along a Markov chain Monte Carlo (MCMC) simulation. Evolution of the system as the inverse temperature parameter, β, is modified to control the expected global behavior.

A plot of both forms of the expected Fisher information, Φβ and Ψβ, for each iteration of the MCMC simulation is shown in Figure 5. The graph produced by this experiment shows some interesting results.
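The updating rule for β described above can be sketched as a triangular schedule. The function below is a minimal illustration with a hypothetical name; it counts increments with an integer so that βMIN and βMAX are reached without accumulated rounding error, which is a design choice of this sketch rather than something stated in the text.

```python
def beta_schedule(beta_min, beta_max, delta, n_max):
    """Return the value of beta used at each of n_max iterations.

    beta rises from beta_min to beta_max in steps of delta, falls back
    to beta_min in steps of -delta, and repeats until n_max values exist.
    """
    steps = round((beta_max - beta_min) / delta)  # increments per ramp
    if steps <= 0:
        return [beta_min] * n_max
    betas = []
    k, direction = 0, 1  # k counts increments above beta_min
    for _ in range(n_max):
        betas.append(beta_min + k * delta)
        if k == steps:
            direction = -1   # upper bound reached: start decreasing
        elif k == 0:
            direction = 1    # lower bound reached: start increasing
        k += direction
    return betas
```

With the parameters reported for this experiment (βMIN = 0, Δβ = 0.001, βMAX = 0.15, NMAX = 1,000), one full rise and fall takes 300 iterations, so the schedule covers a little more than three cycles.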
First of all, regarding upper and lower bounds on these measures, it is possible to note that when there is no induced spatial dependence structure (β ≈ 0), we have an information equilibrium condition (Φβ = Ψβ and the information equality holds). In this condition, the observations are practically independent in the sense that all local configuration patterns convey approximately the same amount of information. Thus, it is hard to find and separate the two categories of patterns we know: the informative and the non-informative ones. Since they all behave in a similar manner, there is no informative pattern to highlight. Moreover, in this information equilibrium situation, Ψβ reaches its lower bound (in this simulation, we observed that in the equilibrium Φβ ≈ Ψβ ≈ 8), indicating that this condition emerges when the system is most susceptible to a change in the expected global behavior, since the uncertainty about β is maximum at this moment. In other words, modification in the behavior of a small subset of local patterns may guide the system to a totally different stable configuration in the future.

The results also show that the difference between Φβ and Ψβ is maximum when the system operates with large values of β, that is, when organization emerges and there is a strong dependence structure among the random variables (the global configuration shows clear visible clusters and boundaries between them). In such states, it is expected that the majority of patterns be aligned to the global behavior, which causes the appearance of few, but highly informative patterns: those connecting elements from different regions (boundaries). Besides that, the results suggest that it takes more time for the system to go from the information equilibrium state to organization than the opposite. We will see how this fact becomes clear by analyzing the Fisher curve along Markov chain Monte Carlo (MCMC) simulations.
Finally, the results also suggest that both Φβ and Ψβ are bounded above by a value possibly related to the size of the neighborhood system.

Figure 5. Evolution of Fisher information along an MCMC simulation. As the difference between Φβ and Ψβ is maximized (*), the uncertainty about the real inverse temperature parameter is minimized and the number of informative patterns increases. In the information equilibrium condition (**), it is hard to find informative patterns, since there is no induced spatial dependence structure.

Figure 6 shows the real parameter values used to generate the GMRF outputs (blue line), the maximum pseudo-likelihood estimate used to calculate Φβ and Ψβ (red line) and also a plot of the asymptotic variances (uncertainty about the inverse temperature) along the entire MCMC simulation.

Figure 6. Real and estimated inverse temperatures along the MCMC simulation. The system’s global behavior is controlled by the real inverse temperature parameter values (blue line), used to generate the GMRF outputs. The maximum pseudo-likelihood estimate is used to compute both Φβ and Ψβ. Note that the uncertainty about the inverse temperature increases as β → 0 and the system approaches the information equilibrium condition.

We now proceed to the analysis of the Shannon entropy of the system along the simulation. Despite showing a behavior similar to Ψβ, the range of values for entropy is significantly smaller. In this simulation, we observed that 0 ≤ Hβ ≤ 4.5, 0 ≤ Φβ ≤ 18 and 8 ≤ Ψβ ≤ 61.
An interesting point is that knowledge of Φβ and Ψβ allows us to infer the entropy of the system. For example, looking at Figures 5 and 7, we can see that Φβ and Ψβ start to diverge (t ≈ 80) a little before the entropy in a GMRF model begins to grow (t ≈ 120). Therefore, in an isotropic pairwise GMRF model, if the system is close to the information equilibrium condition, then Hβ is low, since there is little variability in the observed configuration patterns. When the difference between Φβ and Ψβ is large, Hβ increases.

Figure 7. Evolution of Shannon entropy along an MCMC simulation. Hβ starts to grow when the system leaves the equilibrium condition, where the entropy in the isotropic pairwise GMRF model is identical to the entropy of a simple Gaussian random variable (since β → 0).

Another interesting global information-theoretic measure is L-information, from now on denoted by Lβ, since it conveys all the information about the likelihood function (in a GMRF model, only the first two derivatives of $L(\vec{\theta}; X(t))$ are not null). Lβ is defined as the ratio between the two forms of expected Fisher information, Φβ and Ψβ. A nice property of this measure is that 0 ≤ Lβ ≤ 1. With this single measurement, it is possible to gain insights about the global system behavior. Figure 8 shows that a value close to one indicates a system approximating the information equilibrium condition, while a value close to zero indicates a system close to the maximum entropy condition (a stable configuration with boundaries and informative patterns).

Figure 8. Evolution of L-information along an MCMC simulation.
When Lβ approaches one, the system tends to the information equilibrium condition. For values close to zero, the system tends to the maximum entropy condition.

To investigate the intrinsic non-linear connection between Φβ, Ψβ and Hβ in a complex system modeled by an isotropic pairwise GMRF model, we now analyze its Fisher curves. The first curve, which is a planar one, is defined as $\vec{F}(\beta) = (\Phi_\beta, \Psi_\beta)$, for A = βMIN to B = βMAX, and shows how Fisher information changes when the inverse temperature of the system is modified to control the global behavior. Figure 9 shows the results. In the first image, the blue piece of the curve is the path from A to B, that is, $\vec{F}_A^B(\beta)$, and the red piece is the inverse path (from B to A), that is, $\vec{F}_B^A(\beta)$. We must emphasize that $\vec{F}_A^B(\beta)$ is the trajectory from a lower entropy global configuration to a higher entropy global configuration. On the other hand, when the system moves from B to A, we are moving towards entropy minimization. To make this clear, the second image of Figure 9 illustrates the same Fisher curve as before, but now in three dimensions, that is, $\vec{F}(\beta) = (\Phi_\beta, \Psi_\beta, H_\beta)$. For comparison purposes, Figure 10 shows the Fisher curves for another MCMC simulation with different parameter settings. Note that the shapes of the curves are quite similar to those in Figure 9.

[Figure 9 panels: “2D Fisher curve for a GMRF model” (axes PHI and PSI, with the equilibrium line and points A and B marked) and “3D Fisher curve for a GMRF model” (axes PHI, PSI and H, with points A and B marked).]

Figure 9. 2D and 3D Fisher curves of a complex system along an MCMC simulation. The graph shows a parametric curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from a differential geometry perspective, as the divergence between Φβ and Ψβ increases, the torsion of the parametric curve becomes evident (the curve leaves the plane of constant entropy).
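One simple way to examine the difference between the path from A to B and the path from B to A numerically is to split a recorded cycle at the largest β and compare the two branches at matching positions. The sketch below is an illustration only: the function names are hypothetical, it assumes one rise and fall of β recorded with the same increments in both directions, and the Euclidean distance in the (Φβ, Ψβ, Hβ) space is an arbitrary choice of this example, not a measure used in the paper.

```python
import math

def split_branches(betas, points):
    """Split one recorded cycle into the A-to-B and B-to-A branches.

    betas[t] is the inverse temperature at iteration t and points[t] is
    the tuple (Phi, Psi, H) measured there. Both returned branches are
    ordered by increasing beta.
    """
    peak = max(range(len(betas)), key=lambda t: betas[t])
    forward = list(zip(betas[:peak + 1], points[:peak + 1]))
    backward = list(zip(betas[peak:], points[peak:]))[::-1]
    return forward, backward

def branch_gaps(betas, points):
    """Distance between the two branches at each shared beta position."""
    forward, backward = split_branches(betas, points)
    n = min(len(forward), len(backward))
    return [(forward[k][0], math.dist(forward[k][1], backward[k][1]))
            for k in range(n)]
```

If the two branches coincided, every gap would be zero; non-zero gaps at intermediate β values indicate that the curve does not retrace itself.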
[Figure 10 panels: “2D Fisher curve for an isotropic pairwise GMRF model” (axes PHI and PSI, with the equilibrium line and points A and B marked) and “3D Fisher curve for an isotropic pairwise GMRF model” (axes PHI, PSI and H, with points A and B marked).]

Figure 10. 2D and 3D Fisher curves along another MCMC simulation. The graph shows a parametric curve obtained by varying the β parameter from βMIN to βMAX and back. Note that, from a geometrical perspective, the properties of these curves are essentially the same as the ones from the previous simulation.

We can see that most points along the Fisher curve are concentrated around two regions of high curvature: (A) around the information equilibrium condition (an absence of short-term and long-term correlations, since β = 0); and (B) around the maximum entropy value, where the divergence between the information values is maximum (self-organization emerges, since β is greater than a critical value, βc). The points that lie in the middle of the path connecting these two regions represent the system undergoing a phase transition. Its properties change rapidly and in an asymmetric way, since $\vec{F}_A^B(\beta) \neq \vec{F}_B^A(\beta)$ for a given natural orientation.

At this point, some observations can be highlighted. First, the natural orientation of the Fisher curve defines the direction of time. The natural A–B path (increase in entropy) is given by the blue curve and the natural B–A path (decrease in entropy) is given by the red curve.
In other words, the only possible way to walk from A to B (increase Hβ) by the red path or to walk from B to A (decrease Hβ) by the blue path would be moving back in time (by running the recorded simulation backwards). We believe that a possible explanation for this fact could be that the deformation process that takes the original geometric structure (with constant curvature) of the usual Gaussian model (A) to the novel geometric structure of the isotropic pairwise GMRF model (B) is not reversible. In other words, the way the model is "curved" is not simply the reversal of the "flattening" process (when it is restored to its constant curvature form). Thus, even the basic notion of time seems to be deeply connected with the relationship between entropy and Fisher information in a complex system: in the natural orientation (forward in time), it seems that the divergence between Φβ and Ψβ is the cause of an increase in the entropy, and the decrease of entropy is the cause of the convergence of Φβ and Ψβ.

During the experimental analysis, we repeated the MCMC simulations with different parameter settings, and the observed behavior for Fisher information and entropy was the same. Figure 11 shows the graphs of Φβ, Ψβ and Hβ for another recorded MCMC simulation. The results indicate that in the natural orientation (in the direction of time), an increase in Ψβ seems to be a trigger for an increase in the entropy and a decrease in the entropy seems to be a trigger for a decrease in Ψβ. Roughly speaking, Ψβ “pushes Hβ up” and Hβ “pushes Ψβ down”.

Figure 11. Relations between entropy and Fisher information.
When a system modeled by an isotropic pairwise GMRF evolves in the natural orientation (forward in time), two rules that relate Fisher information and entropy can be observed: (1) an increase in Ψβ is the cause of an increase in Hβ (the increase in Hβ is a consequence of the increase in Ψβ); (2) a decrease in Hβ is the cause of a decrease in Ψβ (the decrease in Ψβ is a consequence of the decrease in Hβ). In other words, when moving towards higher entropy states, changes in Fisher information precede changes in entropy (Ψβ “pushes Hβ up”). When moving towards lower entropy states, changes in entropy precede changes in Fisher information (Hβ “pushes Ψβ down”).

In summary, the central idea discussed here is that while entropy provides a measure of order/disorder of the system at a given configuration, X(t), Fisher information links these thermodynamical states through a path (Fisher curve). Thus, Fisher information is a powerful mathematical tool in the study of complex and dynamical systems, since it establishes how these different thermodynamical states are related along the evolution of the inverse temperature. Instead of only knowing whether the entropy, Hβ, is increasing or decreasing, with Fisher information, it is possible to know how and why this change is happening.

7.2.1. The Effect of Induced Perturbations in the System

To test whether a system can recover part of its original configuration after a perturbation is induced, we conducted another computational experiment. During a stable simulation, two kinds of perturbations were induced in the system: (1) the value of the inverse temperature parameter was set to zero for the next two consecutive iterations; (2) the value of the inverse temperature parameter was set to the equilibrium value, β∗ (the solution of Equation (28)), for the next two consecutive iterations. We should mention that in both cases, the original value of β is recovered after these two iterations are completed.
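The perturbation protocol described above amounts to overwriting the inverse temperature for two consecutive iterations and then restoring it. A minimal sketch, with a hypothetical function name and with the perturbed value (zero or β∗) passed in as an argument, could look as follows.

```python
def perturb_schedule(betas, start, value, duration=2):
    """Return a copy of the beta sequence with a temporary perturbation.

    The entries at iterations start, ..., start + duration - 1 are replaced
    by `value`; all other entries keep their original beta.
    """
    perturbed = list(betas)
    for t in range(start, min(start + duration, len(perturbed))):
        perturbed[t] = value
    return perturbed
```

For example, `perturb_schedule([0.15] * 6, 2, 0.0)` keeps β = 0.15 everywhere except at iterations 2 and 3, where β is set to zero.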
When the system is disturbed by setting β to zero, the simulations indicate that the system is not successful in recovering components from its previous stable configuration (note that Φβ and Ψβ clearly touch one another in the graph). When the same perturbation is induced, but using the smallest of the two β∗ values (minimum solution of Equation (28)), after a short period of turbulence, the system can recover parts (components, clusters) of its previous stable state. This behavior suggests that this softer perturbation is not enough to remove all the information encoded within the spatial dependence structure of the system, preserving some of the long-term correlations in the data (stronger bonds) and only slightly remodeling the large clusters present in the system. Figures 12 and 13 illustrate the results.

7.3. Considerations and Final Remarks

The goal of this section is to summarize the main results obtained in this paper, focusing on the interpretation of the Fisher curve of a system modeled by a GMRF. First, our system is initialized with a random configuration, simulating that in the moment of its creation, the temperature is infinite (β = 0). We observe two important things at this moment: (1) there is a perfect symmetry in information, since the equilibrium condition prevails, that is, Φβ = Ψβ; (2) the entropy of the system is minimal. By mere convention, we name this initial state of minimal entropy A. By reducing the global temperature (β increases), this “universe” deviates from this initial condition. As the system drifts away from the initial condition, we clearly see a break in the symmetry of information (Φβ diverges from Ψβ), which apparently is the cause of an increase in the system’s entropy, since this symmetry break seems to precede an increase in the entropy, H.
This is a fundamental symmetry break, since other forms of ruptures that will happen in the future and will give rise to several properties of the system, including the basic notion of time as an irreversible process, follow from this first one. During this first stage of evolution, the system evolves to the condition of maximum entropy, named B. Hence, after the break in the information equilibrium condition, there is a significant increase in the entropy as the system continues to evolve. This stage lasts while the temperature of the system is further reduced or kept stable. When the temperature starts to increase (β decreases), another form of symmetry break takes place. By moving towards the initial condition (A) from B, changes in the entropy seem to precede changes in Fisher information (when moving from A to B, we observe exactly the opposite). Moreover, the variations in entropy and Fisher information towards A are not symmetric with the variations observed when we move towards B, a direct consequence of that first fundamental break of the information equilibrium condition. By continuing this process of increasing the temperature of the system until infinity (β approaches zero), we take our system to a configuration that is equivalent to the initial condition, that is, where information equilibrium prevails.

This fundamental symmetry break becomes evident when we look at the Fisher curve of the system. We clearly see that the path from the state of minimum entropy, A, to the state of maximum entropy, B, defined by the curve, $\vec{F}_A^B$ (the blue trajectory in Figure 9), is not the same as the path from B to A, defined by the curve, $\vec{F}_B^A$ (the red trajectory in Figure 9). An implication of this behavior is that if the system is moving along the arrow of time, then we are moving through the Fisher curve in the clockwise orientation. Thus, the only way to go from A to B along $\vec{F}_B^A$ (the red path) is going back in time.
Figure 12. Disturbing the system to induce changes. Variation in Φβ and Ψβ after the system is disturbed by an abrupt change in the value of β. In the first image, the inverse temperature is set to zero. Note that Φβ and Ψβ touch one another, indicating that no residual information is kept, as if the simulation had been restarted from a random configuration. In the second image, the inverse temperature is set to the equilibrium value, β∗. The results suggest that this kind of perturbation is not enough to remove all the information within the spatial dependence structure, allowing the system to recover a significant part of its original configuration after a short stabilization period.

Figure 13. The sequence of outputs along the MCMC simulation before and after the system is disturbed. The first row (when β is set to zero) shows that the system evolved to a different stable configuration after the perturbation. The second row (when β is set to β∗) indicates that the system was able to recover a significant part of its previous stable configuration.

Therefore, if that first fundamental symmetry break did not exist, or even if it had happened, but all the subsequent evolution of Φβ, Ψβ and Hβ were absolutely symmetric (i.e., the variations in these measures were exactly the same when moving from A to B and when moving from B to A), what we would actually see is that $\vec{F}_A^B = \vec{F}_B^A$. As a consequence, to decrease/increase the system’s temperature would be like moving towards the future/past.
In fact, the basic notion of time in that system would be compromised, since time would be a perfectly reversible process (just like a spatial dimension, in which we can move in both directions). In other words, we would not be able to distinguish whether the system is moving forward or backward in time.

8. Conclusions

The definition of what is information in a complex system is a fundamental concept in the study of many problems. In this paper, we discussed the roles of two important statistical measures in isotropic pairwise Markov random fields composed of Gaussian variables: Shannon entropy and Fisher information. By using the pseudo-likelihood function of the GMRF model, we derived analytical expressions for these measures. The definition of a Fisher curve as a geometric representation for the study and analysis of complex systems allowed us to reveal the intrinsic non-linear relation between these information-theoretic measures and gain insights about the behavior of such systems. Computational experiments demonstrate the effectiveness of the proposed tools in decoding information from the underlying spatial dependence structure of a Gaussian Markov random field. Typical informative patterns in a complex system are located at the boundaries of the clusters. One of the main conclusions of this scientific investigation concerns the notion of time in a complex system. The obtained results suggest that the relationship between Fisher information and entropy determines whether the system is moving forward or backward in time. Apparently, in the natural orientation (when the system is evolving forward in time), when β is growing, that is, the temperature of the system is decreasing, an increase in Fisher information leads to an increase in the system’s entropy, and when β is decreasing, that is, the temperature of the system is growing, a decrease in the system’s entropy leads to a decrease in the Fisher information.
In future work, we expect to completely characterize the metric tensor that represents the geometric structure of the isotropic pairwise GMRF model by specifying all the elements of the Fisher information matrix. Future investigations should also include the definition and analysis of the proposed tools in other Markov random field models, such as the Ising and Potts pairwise interaction models. Besides, a topic of interest concerns the investigation of minimum and maximum information paths in graphs to explore intrinsic similarity measures between objects belonging to a common surface or manifold in ℜn. We believe this study could bring benefits to some pattern recognition and data analysis computational applications.

Acknowledgments: The author would like to thank CNPQ (Brazilian Council for Research and Development) for the financial support through research grant number 475054/2011-3.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Shannon, C.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, Chicago, IL & London, USA, 1949.
2. Rényi, A. On measures of information and entropy. In Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
3. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
4. Bashkirov, A. Rényi entropy as a statistical entropy for complex systems. Theor. Math. Phys. 2006, 149, 1559–1573.
5. Jaynes, E. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630.
6. Grad, H. The many faces of entropy. Comm. Pure Appl. Math. 1961, 14, 323–354.
7. Adler, R.; Konheim, A.; McAndrew, A. Topological entropy. Trans. Am. Math. Soc. 1965, 114, 309–319.
8. Goodwyn, L. Comparing topological entropy with measure-theoretic entropy. Am. J. Math. 1972, 94, 366–388.
9. Samuelson, P.A. Maximum principles in analytical economics. Am. Econ. Rev. 1972, 62, 249–262.
10. Costa, M. Writing on dirty paper. IEEE T. Inform. Theory 1983, 29, 439–441.
11. Dembo, A.; Cover, T.; Thomas, J. Information theoretic inequalities. IEEE T. Inform. Theory 1991, 37, 1501–1518.
12. Cover, T.; Zhang, Z. On the maximum entropy of the sum of two dependent random variables. IEEE T. Inform. Theory 1994, 40, 1244–1246.
13. Frieden, B.R. Science from Fisher Information: A Unification; Cambridge University Press: Cambridge, UK, 2004.
14. Frieden, B.R.; Gatenby, R.A. Exploratory Data Analysis Using Fisher Information; Springer: London, UK, 2006.
15. Lehmann, E.L. Theory of Point Estimation; Wiley: New York, NY, USA, 1983.
16. Bickel, P.J. Mathematical Statistics; Holden Day: New York, NY, USA, 1991.
17. Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Duxbury: New York, NY, USA, 2002.
18. Amari, S.; Nagaoka, H. Methods of Information Geometry (Translations of Mathematical Monographs, Vol. 191); AMS Bookstore: Tokyo, Japan, 2000.
19. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–234.
20. Anandkumar, A.; Tong, L.; Swami, A. Detection of Gauss-Markov random fields with nearest-neighbor dependency. IEEE T. Inform. Theory 2009, 55, 816–827.
21. Gómez-Villegas, M.A.; Main, P.; Susi, R. The effect of block parameter perturbations in Gaussian Bayesian networks: Sensitivity and robustness. Inform. Sci. 2013, 222, 439–458.
22. Moura, J.; Balram, N. Recursive structure of noncausal Gauss-Markov random fields. IEEE T. Inform. Theory 1992, 38, 334–354.
23. Moura, J.; Goswami, S. Gauss-Markov random fields (GMRF) with continuous indices. IEEE T. Inform. Theory 1997, 43, 1560–1573.
24. Besag, J. Spatial interaction and the statistical analysis of lattice systems. J. Roy. Stat. Soc. B Stat. Meth. 1974, 36, 192–236.
25. Besag, J. Statistical analysis of non-lattice data. The Statistician 1975, 24, 179–195.
26. Hammersley, J.; Clifford, P. (University of California, Berkeley, Oxford and Bristol). Markov Field on Finite Graphs and Lattices. Unpublished work, 1971.
27. Efron, B.F.; Hinkley, D.V. Assessing the accuracy of the ML estimator: Observed versus expected Fisher information. Biometrika 1978, 65, 457–487.
28. Isserlis, L. On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika 1918, 12, 134–139.
29. Jensen, J.; Künsch, H. On asymptotic normality of pseudo likelihood estimates for pairwise interaction processes. Ann. Inst. Stat. Math. 1994, 46, 475–486.
30. Winkler, G. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods: A Mathematical Introduction; Springer-Verlag New York, Inc.: Secaucus, NJ, USA, 2006.
31. Liang, G.; Yu, B. Maximum pseudo likelihood estimation in network tomography. IEEE T. Signal Proces. 2003, 51, 2043–2053.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

Network Decomposition and Complexity Measures: An Information Geometrical Approach

Masatoshi Funabashi

Sony Computer Science Laboratories, Inc., Takanawa muse bldg. 3F, 3-14-13, Higashi Gotanda, Shinagawa-ku, Tokyo 141-0022, Japan; E-Mail: masa_funabashi@csl.sony.co.jp; Tel.: +81-3-5448-4380; Fax: +81-3-5448-4273

Received: 28 March 2014; in revised form: 24 June 2014 / Accepted: 14 July 2014 / Published: 23 July 2014

Abstract: We consider the graph representation of the stochastic model with n binary variables, and develop an information theoretical framework to measure the degree of statistical association existing between subsystems as well as the ones represented by each edge of the graph representation.
Besides, we consider novel measures of complexity with respect to system decompositionability, by introducing the geometric product of Kullback–Leibler (KL-) divergences. The novel complexity measures satisfy the boundary condition of vanishing at the limits of completely random and completely ordered states, and also when an independent subsystem of any size exists. Such complexity measures based on geometric means are relevant to the heterogeneity of dependencies between subsystems, and to the amount of information propagation shared across the entire system.

Keywords: information geometry; complexity measure; complex network; system decompositionability; geometric mean

1. Introduction

Complex systems sciences emphasize the importance of non-linear interactions that cannot be easily approximated linearly. In other words, the degrees of non-linear interactions are the source of complexity. The classical reductionist approach generally decomposes a system into its components with linear interactions, and tries to evaluate whether the properties of the whole system can still be reproduced. If this decomposition destroys too much information to reproduce the system's overall properties, the plausibility of such reductionism is lost. Conversely, if we can evaluate how much information is ignored by the decomposition, we can estimate how much of the complexity of the whole system is lost. This gives us a way to measure the complexity of a system with respect to system decomposition.

In stochastic systems described as a set of joint distributions, the interaction can basically be expressed as the statistical association between the variables. The simplest reductionist approach is to separate the whole system into some subsets of variables, and to assume independence between them. If such a decomposition does not affect the system's properties, the isolated subsystem is independent from the rest.
On the other hand, if the decomposition loses too much information, then the subsystem lies inside a larger subsystem with strong internal dependencies and cannot be easily separated.

Stochastic models have often been represented as graphs and treated under the name of complex networks [1–3]. Generally, the nodes represent the variables and the weights on the edges are the statistical associations between them. However, if we consider the information contained in the different orders of dependencies among variables, a graph with a single kind of edge is not sufficient to express the whole information of the system [4]. An edge of a graph with n nodes contains the information of statistical association up to the n-th order dependencies among n variables. If we try to decompose the system into independent parts by cutting this information, we have to consider what it means to cut an edge of the graph from the information theoretical point of view.

Entropy 2014, 16, 4132–4167; doi:10.3390/e16074132 www.mdpi.com/journal/entropy

Indeed, the analysis of the degree of dependency between variables has led to many definitions of complexity in stochastic models [5], which have mostly been studied from an information theoretical perspective. Beginning with the seminal works of Lempel and Ziv (e.g., [6]), the computation-oriented definition of complexity takes a deterministic formalization and measures the information necessary to reproduce a given symbolic sequence exactly, which is classified under the name of algorithmic complexity [7–9]. On the other hand, the statistical approach to complexity, namely statistical complexity, assumes some stochastic model as its theoretical basis, and refers to the structure of the information source on it in a measure-theoretic way [10–12].
One of the most classical statistical complexities is the mutual information between two stochastic variables; its generalized form, which measures the dependence between n variables, has been proposed (e.g., [13]) and explored in relation to statistical models and theories by several authors [14–16]. We should also recall that complexity is not necessarily conditioned only by information theory, but is rather motivated by the organization of living systems such as brain activity. The TSE complexity further extends generalized mutual information to a biological context, where complexity exists as the heterogeneity between different system hierarchies [17]. These statistical complexities are all based on the boundary condition of vanishing at the limits of completely random and completely ordered states [18].

A complexity measure is usually a projection from the system's variables to a one-dimensional quantity, composed so as to express the degree of the characteristic that we define to be important in what "complexity" means. Since a complexity measure is always a many-to-one association, it both compresses information to classify systems from simple to complex, and loses resolution of the system's phase space. If the system has n variables, we generally need n independent complexity measures to completely characterize the system with real-value resolution. The problem of defining a complexity measure lies in balancing the compression of information on the system's complexity, with theoretical support, against the resolution of system identification, which must be kept high enough to avoid trivial classification. The latter criterion becomes more important as the system size grows. A better complexity measure is therefore a set of indices, as few as possible, that characterizes the major features related to the complexity of the system.
In this sense, the ensemble of complexity measures is also analogous to the feature space of a support vector machine. A non-trivial set of complexity measures needs to be mutually complementary in parameter space for the best possible discrimination between different systems.

In this paper, we first consider a stochastic system with binary variables and theoretically develop a way to measure the information between subsystems that is consistent with the information represented by the edges of the graph representation. Next, we focus on the generalized mutual information as the starting point of the argument, and further consider incorporating network heterogeneity into novel measures of complexity with respect to the system's decompositionability. This approach will be shown to be complementary to TSE complexity, as the difference between arithmetic and geometric means of information.

2. System Decomposition

Let us consider a stochastic system with n binary variables $x = (x_1, \cdots, x_n)$ where $x_i \in \{0, 1\}$ ($1 \le i \le n$). We denote the joint distribution of $x$ by $p(x)$. We define the decomposition $p^{dec}(x)$ of $p(x)$ into two subsystems $y^1 = (x^1_1, \cdots, x^1_{n_1})$ and $y^2 = (x^2_1, \cdots, x^2_{n_2})$ ($n_1 + n_2 = n$, $y^1 \cup y^2 = x$, $y^1 \cap y^2 = \emptyset$) as follows:

\[ p^{dec}(x) = p(y^1)\, p(y^2), \tag{1} \]

where $p(y^1)$ and $p(y^2)$ are the joint distributions of $y^1$ and $y^2$, respectively. For simplicity, hereafter we denote a system decomposition by labelling each variable with the smallest subscript among the variables of its subsystem. For example, in case n = 4, $y^1 = (x_1, x_3)$ and $y^2 = (x_2, x_4)$, we describe the decomposed system $p^{dec}(x)$ as < 1212 >. The system decomposition cuts all statistical association between the two subsystems, which is expressed by setting them to be independent of each other. We will further consider Equation (1) in terms of the graph representation.
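Equation (1) can be illustrated numerically. The following is a minimal sketch, assuming the joint distribution is stored as a NumPy array with one axis of length 2 per binary variable; the helper names `marginal`, `decompose` and `kl` are illustrative choices and do not come from the paper. The label tuple follows the < ··· > notation, and the KL-divergence between the system and its decomposition is the information lost by the cut.

```python
import numpy as np

def marginal(p, axes_kept):
    """Marginal of the joint table p over the listed axes, kept broadcastable against p."""
    drop = tuple(a for a in range(p.ndim) if a not in axes_kept)
    return p.sum(axis=drop, keepdims=True)

def decompose(p, labels):
    """Product of subsystem marginals; labels follows the <...> notation, e.g. (1, 2, 1, 2)."""
    q = np.ones_like(p)
    for lab in sorted(set(labels)):
        block = [i for i, l in enumerate(labels) if l == lab]
        q = q * marginal(p, block)
    return q

def kl(p, q):
    """Kullback-Leibler divergence D[p : q] in nats."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical example: a random joint distribution of n = 4 binary variables.
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()

p_dec = decompose(p, (1, 2, 1, 2))   # y1 = (x1, x3), y2 = (x2, x4)
loss = kl(p, p_dec)                  # information lost by the decomposition
```

The decomposed table keeps the marginals of each subsystem unchanged and only removes the dependence between them, which is why `loss` is non-negative and vanishes when the two subsystems are already independent.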
We define the undirected graph $\Gamma := (V, E)$ of the system $p(x)$, whose vertices $V = \{x_1, \cdots, x_n\}$ and edges $E = V \times V$ represent the variables and the statistical associations, respectively. To express the system, we set the value of each vertex to the value of the corresponding variable, and the weight of each edge to the degree of dependency between the connected variables.

There is, however, a problem with a representation that uses a single kind of edge. Statistical association is not only defined between two variables, but can be independently defined among multiple variables up to the n-th order. Therefore, the exact definition of the edge weights remains unclear. To clarify this problem, we consider the hierarchical marginal distributions $\boldsymbol{\eta}$ as another coordinate system of $p(x)$:

\[ \boldsymbol{\eta} = (\boldsymbol{\eta}^1; \boldsymbol{\eta}^2; \cdots; \boldsymbol{\eta}^n), \tag{2} \]

where

\[ \begin{aligned}
\boldsymbol{\eta}^1 &= (\eta_1, \cdots, \eta_i, \cdots, \eta_n), \quad (1 \le i \le n), \\
\boldsymbol{\eta}^2 &= (\eta_{1,2}, \cdots, \eta_{i,j}, \cdots, \eta_{n-1,n}), \quad (1 \le i < j \le n), \\
&\;\;\vdots \\
\boldsymbol{\eta}^n &= \eta_{1,2,\cdots,n},
\end{aligned} \tag{3} \]

and

\[ \begin{aligned}
\eta_1 &= \sum_{i_2, \cdots, i_n \in \{0,1\}} p(1, i_2, \cdots, i_n), \\
&\;\;\vdots \\
\eta_n &= \sum_{i_1, \cdots, i_{n-1} \in \{0,1\}} p(i_1, \cdots, i_{n-1}, 1), \\
\eta_{1,2} &= \sum_{i_3, \cdots, i_n \in \{0,1\}} p(1, 1, i_3, \cdots, i_n), \\
&\;\;\vdots \\
\eta_{n-1,n} &= \sum_{i_1, \cdots, i_{n-2} \in \{0,1\}} p(i_1, \cdots, i_{n-2}, 1, 1), \\
&\;\;\vdots \\
\eta_{1,2,\cdots,n} &= p(1, 1, \cdots, 1).
\end{aligned} \tag{4} \]

Since $\boldsymbol{\eta}$ is defined as a linear transformation of $p(x)$, both coordinate systems have $\sum_{k=1}^{n} {}_nC_k = 2^n - 1$ degrees of freedom. The subcoordinates $\boldsymbol{\eta}^1$ are simply the set of marginal distributions of each variable. The subcoordinates $\boldsymbol{\eta}^k$ ($1 < k \le n$) include the statistical association among k variables that cannot be expressed with coordinates of order lower than k. This means that different statistical associations exist independently in each order among the corresponding sets of variables. The statistical association represented by the weight of a graph edge $\{x_i, x_j\}$ is therefore the superposition of the different dependencies defined on every subset of $x$ that includes $x_i$ and $x_j$.
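As a concrete reading of Equations (2)–(4), the sketch below computes every $\eta$ coordinate of a small joint distribution. It assumes the same array layout as before (one axis of length 2 per variable); the function name `eta_coordinates` and the dictionary keyed by 1-based index tuples are illustrative conventions, not part of the original text.

```python
import itertools
import numpy as np

def eta_coordinates(p):
    """Map each non-empty index subset S (1-based) to eta_S = P(x_i = 1 for all i in S)."""
    n = p.ndim
    eta = {}
    for k in range(1, n + 1):
        for subset in itertools.combinations(range(n), k):
            # Fix the variables in S at 1 and sum over all the others.
            index = tuple(1 if i in subset else slice(None) for i in range(n))
            eta[tuple(i + 1 for i in subset)] = float(p[index].sum())
    return eta

# Hypothetical example: three independent fair binary variables.
p = np.full((2, 2, 2), 1.0 / 8.0)
eta = eta_coordinates(p)
n_coordinates = len(eta)   # number of eta coordinates, 2**n - 1
```

For this uniform example each first-order coordinate is 1/2, each second-order coordinate 1/4, and the third-order coordinate 1/8, and the number of coordinates equals the degrees of freedom stated above.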
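To measure the degree of statistical association in each order, information geometry provides the following setting [19]. We first define another coordinate system $\boldsymbol{\theta} = (\boldsymbol{\theta}^1; \boldsymbol{\theta}^2; \cdots; \boldsymbol{\theta}^n)$, the dual coordinates of $\boldsymbol{\eta}$ with respect to the Legendre transformation of the exponential family's potential function $\psi(\boldsymbol{\theta})$ to its conjugate potential $\phi(\boldsymbol{\eta})$, as follows:

\[ \begin{aligned}
\boldsymbol{\theta}^1 &= (\theta_1, \cdots, \theta_n), \\
\boldsymbol{\theta}^2 &= (\theta_{1,2}, \cdots, \theta_{n-1,n}), \\
&\;\;\vdots \\
\boldsymbol{\theta}^n &= \theta_{1,2,\cdots,n},
\end{aligned} \tag{5} \]

where

\[ \psi(\boldsymbol{\theta}) = \log \frac{1}{p(0, \cdots, 0)}, \qquad \phi(\boldsymbol{\eta}) = \sum_i \theta_i \eta_i + \sum_{i} \cdots \]

[...] give its dual coordinates $\boldsymbol{\theta}^{dec}$ as follows:

\[ \begin{aligned}
\theta^{dec}_1 &= \log \frac{\eta^{dec}_1 - \eta^{dec}_{1,2}}{1 - \eta^{dec}_1 - \eta^{dec}_2 + \eta^{dec}_{1,2}} = \log \frac{\eta_1}{1 - \eta_1}, \\
\theta^{dec}_2 &= \log \frac{\eta^{dec}_2 - \eta^{dec}_{1,2}}{1 - \eta^{dec}_1 - \eta^{dec}_2 + \eta^{dec}_{1,2}} = \log \frac{\eta_2}{1 - \eta_2}, \\
\theta^{dec}_{1,2} &= \log \frac{\eta^{dec}_{1,2}\,(1 - \eta^{dec}_1 - \eta^{dec}_2 + \eta^{dec}_{1,2})}{(\eta^{dec}_1 - \eta^{dec}_{1,2})(\eta^{dec}_2 - \eta^{dec}_{1,2})} = 0,
\end{aligned} \tag{15} \]

which means that the first and second nodes are independent.

For n = 3, the above defined $\boldsymbol{\eta}^{dec}$ for the system decomposition < 122 > gives its dual coordinates $\boldsymbol{\theta}^{dec}$ as follows:

\[ \begin{aligned}
\theta^{dec}_1 &= \log \frac{\eta^{dec}_1 - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3}}{1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}} = \log \frac{\eta_1}{1 - \eta_1}, \\
\theta^{dec}_2 &= \log \frac{\eta^{dec}_2 - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3}}{1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}} = \log \frac{\eta_2 - \eta_{2,3}}{1 - \eta_2 - \eta_3 + \eta_{2,3}}, \\
\theta^{dec}_3 &= \log \frac{\eta^{dec}_3 - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3}}{1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}} = \log \frac{\eta_3 - \eta_{2,3}}{1 - \eta_2 - \eta_3 + \eta_{2,3}},
\end{aligned} \tag{16} \]

\[ \begin{aligned}
\theta^{dec}_{1,2} &= \log \frac{(\eta^{dec}_{1,2} - \eta^{dec}_{1,2,3})(1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}{(\eta^{dec}_1 - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_2 - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})} = 0, \\
\theta^{dec}_{1,3} &= \log \frac{(\eta^{dec}_{1,3} - \eta^{dec}_{1,2,3})(1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}{(\eta^{dec}_1 - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_3 - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})} = 0, \\
\theta^{dec}_{2,3} &= \log \frac{(\eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})(1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})}{(\eta^{dec}_2 - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_3 - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})} = \log \frac{\eta_{2,3}(1 - \eta_2 - \eta_3 + \eta_{2,3})}{(\eta_2 - \eta_{2,3})(\eta_3 - \eta_{2,3})},
\end{aligned} \tag{17} \]

\[ \begin{aligned}
\theta^{dec}_{1,2,3} = \log \Bigg[ &\frac{\eta^{dec}_{1,2,3}}{(\eta^{dec}_{1,2} - \eta^{dec}_{1,2,3})(\eta^{dec}_{1,3} - \eta^{dec}_{1,2,3})(\eta^{dec}_{2,3} - \eta^{dec}_{1,2,3})} \\
&\times \frac{(\eta^{dec}_1 - \eta^{dec}_{1,2} - \eta^{dec}_{1,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_2 - \eta^{dec}_{1,2} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})(\eta^{dec}_3 - \eta^{dec}_{1,3} - \eta^{dec}_{2,3} + \eta^{dec}_{1,2,3})}{1 - \eta^{dec}_1 - \eta^{dec}_2 - \eta^{dec}_3 + \eta^{dec}_{1,2} + \eta^{dec}_{1,3} + \eta^{dec}_{2,3} - \eta^{dec}_{1,2,3}} \Bigg] = 0,
\end{aligned} \tag{18} \]

which means that the first node is independent from the other nodes.

The generalization is possible with the use of a recurrence formula between system sizes n and n + 1, according to the symmetry of the model and the Legendre transformation between the $\boldsymbol{\eta}^{dec}$ and $\boldsymbol{\theta}^{dec}$ coordinates. A numerical check can be obtained by directly computing the zero elements of $\boldsymbol{\theta}^{dec}$ from $\boldsymbol{\eta}^{dec}$.

The definition of $\boldsymbol{\eta}^{dec}$ means decomposing the hierarchical marginal distributions $\boldsymbol{\eta}$ into products of the subsystems' marginal distributions whenever the subscripts traverse the two subsystems. Therefore, only the statistical associations between the two subsystems are set to be independent, while the internal dependencies of each subsystem remain unchanged.

This is analytically equivalent to composing another set of mixture coordinates $\boldsymbol{\xi}$, namely the < ··· >-cut coordinates, with the system decomposition properly described by < ··· >. The $\boldsymbol{\xi}$ coordinates consist of the $\boldsymbol{\eta}$ coordinates whose subscripts do not traverse between the decomposed subsystems, and the $\boldsymbol{\theta}$ coordinates whose subscripts traverse between them. For simplicity, we only describe here the case n = 4 and the decomposition < 1133 > (the first and second nodes form one subsystem, and the third and fourth nodes the other). The system $p(x)$ is expressed with the < 1133 >-cut coordinates $\boldsymbol{\xi}$ as

\[ \begin{aligned}
\xi_1 &= \eta_1, \;\cdots,\; \xi_4 = \eta_4, \\
\xi_{1,2} &= \eta_{1,2}, \quad \xi_{1,3} = \theta_{1,3}, \quad \xi_{1,4} = \theta_{1,4}, \\
\xi_{2,3} &= \theta_{2,3}, \quad \xi_{2,4} = \theta_{2,4}, \quad \xi_{3,4} = \eta_{3,4}, \\
\xi_{1,2,3} &= \theta_{1,2,3}, \;\cdots,\; \xi_{2,3,4} = \theta_{2,3,4}, \\
\xi_{1,2,3,4} &= \theta_{1,2,3,4}.
\end{aligned} \tag{19} \]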
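The split of coordinates in Equation (19) can be enumerated mechanically. The following sketch, whose function name `cut_coordinates` is a hypothetical helper, sorts every index subset into the $\eta$-type group (subscripts inside one subsystem) or the $\theta$-type group (subscripts traversing the two subsystems) for a decomposition given in the < ··· > notation.

```python
import itertools

def cut_coordinates(labels):
    """Split index subsets into eta-type (inside one subsystem) and theta-type (traversing)."""
    n = len(labels)
    eta_type, theta_type = [], []
    for k in range(1, n + 1):
        for subset in itertools.combinations(range(1, n + 1), k):
            blocks = {labels[i - 1] for i in subset}   # subsystems touched by this subset
            (eta_type if len(blocks) == 1 else theta_type).append(subset)
    return eta_type, theta_type

eta_type, theta_type = cut_coordinates((1, 1, 3, 3))   # the <1133> decomposition
n_theta = len(theta_type)
```

For < 1133 > this yields the six $\eta$-type subscripts 1, 2, 3, 4, (1,2), (3,4) and nine $\theta$-type subscripts, matching the pattern of Equation (19).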
The decomposed system with no statistical association between the two subsystems has the following coordinates $\boldsymbol{\xi}^{dec}$, which, for any decomposition, is equivalent to setting all $\theta$ in $\boldsymbol{\xi}$ to 0:

\[ \begin{aligned}
\xi^{dec}_1 &= \eta_1, \;\cdots,\; \xi^{dec}_4 = \eta_4, \\
\xi^{dec}_{1,2} &= \eta_{1,2}, \quad \xi^{dec}_{1,3} = 0, \quad \xi^{dec}_{1,4} = 0, \\
\xi^{dec}_{2,3} &= 0, \quad \xi^{dec}_{2,4} = 0, \quad \xi^{dec}_{3,4} = \eta_{3,4}, \\
\xi^{dec}_{1,2,3} &= 0, \;\cdots,\; \xi^{dec}_{2,3,4} = 0, \\
\xi^{dec}_{1,2,3,4} &= 0.
\end{aligned} \tag{20} \]

This is analytically equivalent to the definition of the decomposition (13)–(14) in the case of < 1133 >. Therefore, the KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{dec})]$ measures the information lost by the system decomposition. The following asymptotic agreement with the $\chi^2$ test also holds.

Proposition 2.

\[ 2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{dec})] \sim \chi^2(\sharp\theta(\boldsymbol{\xi})), \tag{21} \]

where $\sharp\theta(\boldsymbol{\xi})$ is the number of $\boldsymbol{\theta}$ coordinates appearing in the $\boldsymbol{\xi}$ coordinates.

3. Edge Cutting

We further expand the concept of system decomposition to eventually quantify the total amount of information expressed by an edge of the graph. Let us consider cutting an edge $\{x_i, x_j\}$ ($1 \le i < j \le n$) of the graph with n vertices. Hereafter we call this operation the edge cutting i − j. In the same way as for the system decomposition, the edge cutting corresponds to modifying the $\boldsymbol{\eta}$ coordinates to produce the $\boldsymbol{\eta}^{ec}$ coordinates as follows:

\[ \begin{aligned}
\eta^{ec}_{i,j} &= \eta_i \eta_j, \\
\eta^{ec}_{s[i,j,k_1]} &= \eta_{s[i,k_1]}\, \eta_{s[j,k_1]}, \\
\eta^{ec}_{s[i,j,k_1,k_2]} &= \eta_{s[i,k_1,k_2]}\, \eta_{s[j,k_1,k_2]}, \\
&\;\;\vdots \\
\eta^{ec}_{s[i,j,k_1,\cdots,k_{n-2}]} &= \eta_{s[i,k_1,\cdots,k_{n-2}]}\, \eta_{s[j,k_1,\cdots,k_{n-2}]}, \quad (\{i, j, k_1, \cdots, k_{n-2}\} = \{1, \cdots, n\}),
\end{aligned} \tag{22} \]

and the rest of $\boldsymbol{\eta}^{ec}$ remains the same as in $\boldsymbol{\eta}$. The formation of $\boldsymbol{\eta}^{ec}$ from $\boldsymbol{\eta}$ consists of replacing the k-th order elements ($k \ge 3$) of $\boldsymbol{\eta}$ that include both i and j in their subscripts with the product of the (k − 1)-th order elements of $\boldsymbol{\eta}$ in the maximum subgraphs (k − 1 vertices), each including i or j. This means that all orders of statistical association including the variables $x_i$ and $x_j$ are set to be independent only between them. Other relations that do not simultaneously include $x_i$ and $x_j$ remain unchanged.
Certain combinations of edge cuttings coincide with system decompositions. For example, in case n = 4, the edge cuttings 1 − 2, 1 − 3, and 1 − 4 together are equivalent to the system decomposition < 1222 >.

We define the i − j-cut mixture coordinates $\boldsymbol{\xi}$ for the orthogonal decomposition of the statistical association represented by the edge $\{x_i, x_j\}$. Although the actual calculation can be performed with the $\boldsymbol{\eta}$ coordinates alone, this generalization is necessary to obtain a geometrical definition of orthogonality. For simplicity, we only describe $\boldsymbol{\xi}$ for the edge cutting 1 − 2 in the case of n = 4:

\[ \begin{aligned}
\xi_1 &= \eta_1, \;\cdots,\; \xi_4 = \eta_4, \\
\xi_{1,2} &= \theta_{1,2}, \quad \xi_{1,3} = \eta_{1,3}, \quad \xi_{1,4} = \eta_{1,4}, \quad \xi_{2,3} = \eta_{2,3}, \quad \xi_{2,4} = \eta_{2,4}, \quad \xi_{3,4} = \eta_{3,4}, \\
\xi_{1,2,3} &= \theta_{1,2,3}, \quad \xi_{1,2,4} = \theta_{1,2,4}, \quad \xi_{1,3,4} = \eta_{1,3,4}, \quad \xi_{2,3,4} = \eta_{2,3,4}, \\
\xi_{1,2,3,4} &= \theta_{1,2,3,4},
\end{aligned} \tag{23} \]

where orthogonality between the elements of $\boldsymbol{\eta}$ and $\boldsymbol{\theta}$ holds with respect to the Fisher information matrix. Calculating the dual coordinates $\boldsymbol{\theta}^{ec}$ of $\boldsymbol{\eta}^{ec}$, we can define the coordinates $\boldsymbol{\xi}^{ec}$ of the system after the edge cutting 1 − 2 as follows:

\[ \begin{aligned}
\xi^{ec}_1 &= \eta_1, \;\cdots,\; \xi^{ec}_4 = \eta_4, \\
\xi^{ec}_{1,2} &= \theta^{ec}_{1,2}, \quad \xi^{ec}_{1,3} = \eta_{1,3}, \quad \xi^{ec}_{1,4} = \eta_{1,4}, \quad \xi^{ec}_{2,3} = \eta_{2,3}, \quad \xi^{ec}_{2,4} = \eta_{2,4}, \quad \xi^{ec}_{3,4} = \eta_{3,4}, \\
\xi^{ec}_{1,2,3} &= \theta^{ec}_{1,2,3}, \quad \xi^{ec}_{1,2,4} = \theta^{ec}_{1,2,4}, \quad \xi^{ec}_{1,3,4} = \eta_{1,3,4}, \quad \xi^{ec}_{2,3,4} = \eta_{2,3,4}, \\
\xi^{ec}_{1,2,3,4} &= \theta^{ec}_{1,2,3,4}.
\end{aligned} \]

Note that the edge cutting cannot be defined simply by setting the corresponding elements of $\boldsymbol{\theta}^{ec}$ to 0. The KL-divergence $D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{ec})]$ then represents the total amount of information represented by the edge 1 − 2. The following asymptotic agreement with the $\chi^2$ test also holds:

Proposition 3.

\[ 2N \cdot D[p(x, \boldsymbol{\xi}) : p(x, \boldsymbol{\xi}^{ec})] \sim \chi^2\Big(1 + \sum_{k=1}^{n-2} {}_{n-2}C_k\Big). \tag{24} \]

We call this $\chi^2$ value, or the KL-divergence itself, the edge information of edge 1 − 2.

4. Generalized Mutual Information as Complexity with Respect to the Total System Decomposition

In the previous sections, we have introduced a measure of complexity in terms of system decomposition, by measuring the KL-divergence between a given system and its independently decomposed subsystems. We consider here the total system decomposition, and measure the informational distance I between the system and the totally decomposed system in which all elements are independent:

\[ I := \sum_{i=1}^{n} H(x_i) - H(x_1, \cdots, x_n), \tag{25} \]

where

\[ H(x) := -\sum_{x} p(x) \log p(x). \tag{26} \]

This quantity is the generalization of mutual information, and is named in various ways by different authors, such as generalized mutual information, integration, complexity, multi-information, etc. For simplicity, we call I the "multi-information", following [15]. This quantity can be interpreted as a measure of complexity that sums up the order-wise statistical association existing in each subset of components, with an information geometrical formalization [14]. For simplicity, we denote the multi-information I of n-dimensional stochastic binary variables as follows, using the notation of the system decomposition:

\[ I = D[\langle 111 \cdots 1 \rangle : \langle 123 \cdots n \rangle]. \tag{27} \]

5. Rectangle-Bias Complexity

The multi-information leaves some degrees of freedom in case n > 2. That is, we can define a set of distributions $\{p(x) \,|\, I = \mathrm{const.}\}$ with different parameters but the same I value. This fact can be clearly explained with the use of information geometry. From the Pythagorean relation, we obtain the following in the case of n = 3:

\[ \begin{aligned}
D[\langle 111 \rangle : \langle 113 \rangle] + D[\langle 113 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle], \\
D[\langle 111 \rangle : \langle 121 \rangle] + D[\langle 121 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle], \\
D[\langle 111 \rangle : \langle 122 \rangle] + D[\langle 122 \rangle : \langle 123 \rangle] &= D[\langle 111 \rangle : \langle 123 \rangle].
\end{aligned} \tag{28} \]

Using these relations, we can schematically represent the decomposed systems on a circle diagram with diameter $\sqrt{I}$.
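Equations (25)–(28) can be checked numerically on a small example. The sketch below assumes a joint distribution stored as a 2 × 2 × 2 NumPy array and uses natural logarithms; all helper names are illustrative and the random distribution is a hypothetical test case, not data from the paper. It computes I from entropies, compares it with the KL-divergence of Equation (27), and evaluates both sides of the three Pythagorean relations.

```python
import numpy as np

def marginal(p, axes_kept):
    """Marginal of p over the listed axes, kept broadcastable against p."""
    drop = tuple(a for a in range(p.ndim) if a not in axes_kept)
    return p.sum(axis=drop, keepdims=True)

def decompose(p, labels):
    """Product of subsystem marginals for a decomposition in the <...> notation."""
    q = np.ones_like(p)
    for lab in sorted(set(labels)):
        q = q * marginal(p, [i for i, l in enumerate(labels) if l == lab])
    return q

def kl(p, q):
    """KL-divergence D[p : q] in nats."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    """Shannon entropy H = -sum p log p in nats."""
    v = p[p > 0]
    return float(-np.sum(v * np.log(v)))

def multi_information(p):
    """I = sum_i H(x_i) - H(x_1, ..., x_n), Equation (25)."""
    singles = sum(entropy(marginal(p, [i]).ravel()) for i in range(p.ndim))
    return singles - entropy(p.ravel())

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()

I = multi_information(p)
finest = decompose(p, (1, 2, 3))
I_as_kl = kl(p, finest)                      # Equation (27)
pythagoras = {}
for labels in [(1, 1, 3), (1, 2, 1), (1, 2, 2)]:
    q = decompose(p, labels)
    pythagoras[labels] = kl(p, q) + kl(q, finest)   # left-hand sides of Equation (28)
```

Each value in `pythagoras` agrees with `I` up to rounding, although the way I is split between the two terms generally differs from one decomposition to another; this freedom is what the circle diagram visualizes.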
This representation is based on the analogy between the Pythagorean relation of the KL-divergence and that of Euclidean geometry, where the angle inscribed in a semicircle is always $\pi/2$.

Figure 1 represents two different cases with the same I value for n = 3. The distance between two systems in the same diagram corresponds to the square root of the KL-divergence between them. Clearly, the left and right figures represent different dependencies between nodes, although they both have the same I value. Such geometrical variation is possible because the dual coordinates have abundant degrees of freedom compared to the given constraint. There exist 7 degrees of freedom in the $\eta$ or $\theta$ coordinates for n = 3, while the only constraint is the invariance of the I value, which removes only 1 degree of freedom. The remaining 6 degrees of freedom can then be deployed to produce geometrical variation in the circle diagram.

Regarding system decomposition, in the left figure it is difficult to obtain decomposed systems without losing much information, while in the right figure there exists a relatively easy decomposition < 122 >, which loses less information than any decomposition in the left figure. We call this degree of ease of decomposition with respect to the information lost the system decompositionability. In this sense, the left system is more complex although the two systems have the same I value. In particular, in case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] = D[< 111 >:< 121 >], the system does not have any easiest way of decomposition, and any isolation of a node loses a significant amount of information.

To further incorporate such geometrical structure reflecting system decompositionability into a measure of complexity, we consider a mathematical way to distinguish between these two figures.
Although the total sum of KL-divergences along a sequence of system decompositions is always identical to I by the Pythagorean relation, their product can vary according to the geometrical composition in the circle diagram. This is analogous to the isoperimetric inequality for rectangles, where the square gives the maximum area among rectangles of constant perimeter.

We provisionally propose a new measure of complexity, namely the rectangle-bias complexity $C_r$, as follows:

\[ C_r = \frac{1}{|SD| - 2} \sum_{\langle \cdots \rangle \in SD} D[\langle 11 \cdots 1 \rangle : \langle \cdots \rangle] \cdot D[\langle \cdots \rangle : \langle 12 \cdots n \rangle], \tag{29} \]

where SD is the set of possible system decompositions of n binary variables, and |SD| is the number of elements of SD. For example, SD = {< 111 >, < 122 >, < 121 >, < 113 >, < 123 >} and |SD| = 5 for n = 3. This measure distinguishes between the two systems in Figure 1, and gives a larger value for the left figure. It also attains its maximum value in case D[< 111 >:< 122 >] = D[< 111 >:< 113 >] = D[< 111 >:< 121 >].

Figure 1. Circle diagrams of system decomposition in a 3-node network. Both systems have the same value of multi-information I, which is expressed as the identical diameter of the circles.
Two variations are shown, where the left system is more complex (high $C_r$) in the sense that any system decomposition loses more information than the easiest one (< 122 >) in the right figure (low $C_r$).

6. Complementarity between Complexities Defined with Arithmetic and Geometric Means

We evaluate the possibilities and limits of the rectangle-bias complexity $C_r$ in comparison with other proposed measures of complexity. Interest in measuring network heterogeneity has developed toward the incorporation of multi-scale characteristics into complexity measures. The TSE complexity is motivated by the structure of the functional differentiation of brain activity, and measures the difference in neural integration between subsystems of all sizes and the whole system [17]. The biologically motivated TSE complexity has also been investigated from a theoretical point of view, to further give it desirable properties as a universal complexity measure independent of system size [20]. The hierarchical structure of the exponential family in information geometry also leads to the order-wise description of statistical association, which can be regarded as a multi-scale complexity measure [14]. The relation between the order-wise dependencies and the TSE complexity has been theoretically investigated to establish the order-wise component correspondence between them [15].

These indices of network heterogeneity, however, all depend on the arithmetic mean of component-wise information theoretical measures. We show that these arithmetic means still fail to measure certain modularity based on the statistical independence between subsystems. Figure 2 presents simplified cases that complexity measures based on arithmetic means fail to distinguish. We consider two systems with different heterogeneity but identical multi-information I. Here, the multi-information cannot reflect the network heterogeneity.
The TSE complexity and its information geometrical counterpart in [15] are sensitive to network heterogeneity, but since the arithmetic mean is taken over all subsystems, they do not distinguish the component-wise breaking of symmetry between different scales. The TSE complexity renormalized with respect to the multi-information I still has the same insensitivity. Even when the information of each subsystem scale is incorporated, the arithmetic mean can balance out the scale-wise variations, and a large range of heterogeneity at different scales can realize the same value of these complexities. For applications in neuroscience, the assumption of a model with simple parametric heterogeneity and the comparison of TSE complexity between different I values alleviate this limitation [17].

Figure 2. Schematic examples of stochastic systems with identical multi-information I that complexity measures with arithmetic mean fail to distinguish. (a): Example 1 of a stochastic system with a homogeneous mean of edge information and symmetric fluctuation of its heterogeneity; (b): Example 2 of a heterogeneous stochastic system with a bimodal edge information distribution, and the same multi-information I and arithmetic-mean-based complexity as example 1; (c): schematic representation of the distribution of statistical association (edge information) in the upper network; (d): schematic representation of the distribution of statistical association (edge information) in the upper network.

In contrast to complexities based on the arithmetic mean, the rectangle-bias complexity $C_r$ is related to the geometric mean. $C_r$ can distinguish the two systems in Figure 2, giving a relatively high $C_r$ value to the left system and a low value to the right one. This does not mean, however, that $C_r$ has a finer resolution than other complexity measures.
The constant conditions of complexity measures are constraints on the $\sum_{k=1}^{n} {}_nC_k$ degrees of freedom in model parameter space, which define different geometrical compositions of the corresponding submanifolds. We basically need $\sum_{k=1}^{n} {}_nC_k$ independent measures to assure real-value resolution in the characterization of network features. Complexities with arithmetic and geometric means simply give complementary information on network heterogeneity, or different structures of constant-complexity submanifolds in the statistical manifold, as depicted in Figure 3. Therefore, it is also possible to construct a class of systems that has identical I and $C_r$ values but different TSE complexity. Complexity measures should be used in combination, with respect to the non-linear separation capacity of the network features of interest.

Figure 3. Schematic representation of the complementarity between complexity measures based on the arithmetic mean ($C_a$) and the geometric mean ($C_g$) of informational distance. Examples of the n − 1 dimensional constant-complexity submanifolds with respect to the conditions $C_a$ = const. and $C_g$ = const. are depicted as yellow and orange surfaces, respectively. The dimension of the whole statistical manifold S is the parameter number n.

7. Cuboid-Bias Complexity with Respect to System Decompositionability

We consider the extension of $C_r$ to general system size n. The situation for n ≥ 4 differs from that for n ≤ 3 in the existence of a hierarchical structure between system decompositions. Figure 4 shows the hierarchy of the system decompositions in case n = 4. Such hierarchical structure between system decompositions is not homogeneous with respect to the number of subsystems, and depends on the isomorphic types of the decomposed systems. This produces a certain difference in the meaning of each KL-divergence for complexity when considering the system decompositionability.

Figure 4. Hierarchy of system decomposition for a 4-node network (n = 4).
Possible sequences of $Seq = \{SD_1(i_s) \to SD_2(i_s) \to SD_3(i_s) \to SD_4(i_s) \,|\, 1 \le i_s \le |Seq| = 18\}$ are connected with lines.

A simple example in a 4-node network is shown in Figure 5.

Figure 5. Hierarchical effect of sequential system decomposition on cuboid volume and rectangle surface on the circle graph. We consider increasing the diameter of the green circle from the dotted to the dashed one without changing those of the red and blue circles, which gives a different effect on the change of D[< 1222 >:< 1233 >] and D[< 1133 >:< 1134 >] according to the hierarchical structure of the decomposition sequences.

We consider the modification of two KL-divergences in the figure, D[< 1111 >:< 1222 >] and D[< 1111 >:< 1133 >], from the diameter of the green dotted circle to the dashed one. The joint distribution $P(x_1, x_2, x_3, x_4)$ of 4 binary variables ($x_1, x_2, x_3, x_4 \in \{0, 1\}$) has $2^4 - 1 = 15$ parameters, which define the dually flat coordinates of the statistical manifold in information geometry. On the other hand, the possible system decompositions for n = 4 are the following:

\[ \begin{aligned}
SD := \{ &\langle 1111 \rangle, \langle 1114 \rangle, \langle 1131 \rangle, \langle 1211 \rangle, \langle 1222 \rangle, \langle 1133 \rangle, \langle 1212 \rangle, \langle 1221 \rangle, \\
&\langle 1134 \rangle, \langle 1214 \rangle, \langle 1231 \rangle, \langle 1224 \rangle, \langle 1232 \rangle, \langle 1233 \rangle, \langle 1234 \rangle \}.
\end{aligned} \tag{31} \]

Since the number of possible system decompositions is 15, and each is associated with the modification of a different set of parameters of $P(x_1, x_2, x_3, x_4)$, the system decompositions and the KL-divergences between them can be defined independently. This holds even under the condition of a constant I value or constant values of other complexity measures, except those imposing dependency between system decompositions. This means that we can independently modify the diameter of the green dotted circle in Figure 5 without changing the diameters of the red and blue circles, which define the system decompositions < 1233 > and < 1134 > in the sub-hierarchy of < 1222 > and < 1133 >, respectively.
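The set SD of Equation (31) can be generated rather than listed by hand. The sketch below enumerates every partition of n variables and writes it in the smallest-subscript notation; the function name `decompositions` is an illustrative choice, and the recursive construction is one possible way to enumerate set partitions, not a procedure taken from the paper.

```python
def decompositions(n):
    """All system decompositions of n variables as label tuples in the <...> notation."""
    def extend(prefix):
        i = len(prefix) + 1                  # 1-based index of the next variable
        if i > n:
            yield tuple(prefix)
            return
        for lab in sorted(set(prefix)):      # join an existing subsystem
            yield from extend(prefix + [lab])
        yield from extend(prefix + [i])      # or open a new subsystem, labelled by its smallest index
    return list(extend([]))

sd3 = decompositions(3)
sd4 = decompositions(4)
```

With this construction `sd3` contains the five decompositions quoted for n = 3 and `sd4` the fifteen decompositions of Equation (31).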
Other KL-divergences can also be kept at given constant values for the same reason. The rectangle-bias complexity Cr increases with such a modification, but does not reflect the heterogeneity of KL-divergences according to the hierarchy of system decompositions. If we consider the system decompositionability as the mean facility of decomposing the given system into its finest components with respect to all possible system decompositions, such a hierarchical difference also has a meaning in the definition of complexity.

The effect of modifying the diameter of the green dotted circle differs between the decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and < 1111 >→< 1133 >→< 1134 >→< 1234 >. The decrease of the KL-divergence D[< 1222 >:< 1233 >] is less than that of D[< 1133 >:< 1134 >], since the diameter of the red dotted circle is larger than the blue one in Figure 5. This means that changing D[< 1111 >:< 1222 >] and D[< 1111 >:< 1133 >] by the same amount has a larger effect on the sequence < 1111 >→< 1133 >→< 1134 >→< 1234 > than on < 1111 >→< 1222 >→< 1233 >→< 1234 >, if compared at the sequence level. The rectangle-bias complexity Cr does not reflect such characteristics, since it does not distinguish the hierarchical structure among the diameters of the green, red and blue dotted circles.

To incorporate such a hierarchical effect in a complexity measure with geometric mean, we naturally extend the rectangle-bias complexity Cr to the cuboid-bias complexity Cc, which is defined as follows:

$$C_c := \frac{1}{|Seq|} \sum_{i_s=1}^{|Seq|} \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)], \tag{32}$$

where Seq represents the possible sequences of hierarchical system decompositions as follows:

$$Seq = \{SD_1(i_s) \to SD_2(i_s) \to \cdots SD_i(i_s) \cdots \to SD_n(i_s) \mid 1 \le i_s \le |Seq|\}. \tag{33}$$

The elements SDi(is) of Seq correspond to system decompositions, which are aligned according to the hierarchy with the following algorithmic procedure (based on [15]):

(1) Initialization: Set the initial system decomposition of all sequences in Seq to the whole system, SD1(is) := < 111 · · · 1 > (1 ≤ is ≤ |Seq|).
(2) Step i → i + 1: If the system decomposition is the total system decomposition (SDi(is) := < 123 · · · n >), then stop. Otherwise, choose a non-decomposed subsystem SSi(is) of the system decomposition SDi(is), and further divide it into two independent subsystems SS^1_i(is) and SS^2_i(is), different for each is. SDi+1(is) is then defined as the system decomposition of the total system that independently separates the subsystems SS^1_i(is) and SS^2_i(is), in addition to the previous decomposition SDi(is).
(3) Go to the next step i + 1 → i + 2.

The value of |Seq| corresponds to the number of different sequences generated by this algorithm. For example, |Seq| = 3 and |Seq| = 18 hold for n = 3 and n = 4, respectively. The general analytical form |Seq|n of |Seq| for system size n is given by the following recurrence formula:

$$|Seq|_n = \sum_{i=1}^{\lfloor n/2 \rfloor} {}_nC_i \, |Seq|_{n-i} \, |Seq|_i, \tag{34}$$

where ⌊·⌋ is the floor function, with the formal definition |Seq|1 := 1.

The products of KL-divergences along the hierarchical sequences of system decompositions in Equation (32) are related to the volume of (n − 1)-dimensional cuboids in the circle diagram. An example for n = 4 is presented in Figure 5, where two cuboids with 3 orthogonal edges are depicted for the different decomposition sequences < 1111 >→< 1222 >→< 1233 >→< 1234 > and < 1111 >→< 1133 >→< 1134 >→< 1234 >. Their cuboid volumes are

$$\sqrt{D[\langle 1111 \rangle : \langle 1222 \rangle] \, D[\langle 1222 \rangle : \langle 1233 \rangle] \, D[\langle 1233 \rangle : \langle 1234 \rangle]}, \tag{35}$$

and

$$\sqrt{D[\langle 1111 \rangle : \langle 1133 \rangle] \, D[\langle 1133 \rangle : \langle 1134 \rangle] \, D[\langle 1134 \rangle : \langle 1234 \rangle]}, \tag{36}$$

respectively.
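The procedure above can be sketched as a brute-force enumeration. This is a minimal illustration, not the authors' implementation: it does not use the recurrence of Equation (34), the function names `binary_splits`, `label` and `sequences` are hypothetical, and the labelling rule (each node carries the smallest node index of its block) is an assumption inferred from labels such as < 1133 > and < 1222 >.

```python
from itertools import combinations

def binary_splits(block):
    """Yield every unordered split of a block into two non-empty parts."""
    first, rest = block[0], block[1:]
    # Keeping `first` on the left side lists each unordered split once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(v for v in rest if v not in extra)
            yield left, right

def label(partition, n):
    """Label such as <1133>: each node carries the smallest index of its block."""
    tag = [0] * n
    for block in partition:
        for node in block:
            tag[node - 1] = min(block)
    return "<" + "".join(str(t) for t in tag) + ">"

def sequences(n):
    """All decomposition sequences from <11...1> down to <12...n>."""
    def walk(partition, path):
        splittable = [b for b in partition if len(b) > 1]
        if not splittable:  # total system decomposition reached
            yield path
            return
        for block in splittable:  # step (2): pick a subsystem and split it in two
            for left, right in binary_splits(block):
                nxt = [b for b in partition if b != block] + [left, right]
                yield from walk(nxt, path + [label(nxt, n)])
    start = [tuple(range(1, n + 1))]  # step (1): the whole system
    return list(walk(start, [label(start, n)]))

if __name__ == "__main__":
    for n in (3, 4):
        print(n, len(sequences(n)))
    print(" -> ".join(sequences(4)[0]))
```

For n = 3 and n = 4 the enumeration yields 3 and 18 sequences, matching the values quoted in the text, and for n = 4 the labels visited are the 15 system decompositions of Equation (31). The labels are only unambiguous for n < 10.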
In the same way as for Cr, the definition of Cc takes the arithmetic average of cuboid volumes so as to renormalize the combinatorial increase of the decomposition paths (|Seq|) with the system size n. Note that, on the other hand, we did not renormalize the rectangle-bias complexity Cr and the cuboid-bias complexity Cc by taking the exact geometric mean of each product of KL-divergences, such as $\sqrt[n-1]{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}$. This keeps the measures accessible to theoretical analysis such as the variational method (see the "Further Consideration" section), and does not change the qualitative behavior of Cr and Cc, since the power root is a monotonically increasing function. This treatment can be interpreted as taking the (n − 1)-th power of the geometric means for the hierarchical sequences of KL-divergences.

A more comprehensive example of the utility of the cuboid-bias complexity Cc with respect to the rectangle-bias complexity Cr is shown in Figure 6. We consider 6-node networks (n = 6) with the same I and Cr values but different heterogeneity. The system in the top left figure has a circularly connected structure with medium intensity, while that of the top right figure has 3 strongly connected subsystems. These systems have qualitatively five different ways of system decomposition, which are the basic generators of all hierarchical sequences Seq = {SD1(is) → · · · → SD5(is) | 1 ≤ is ≤ |Seq|} for these networks. The five basic system decompositions are labeled ①, ②, ②′, ③ and ④ in the top figures. The circle diagrams of these systems are depicted in the middle figures. To impose the same constant value of Cr on both systems, the following condition is satisfied in the middle right figure: D[< 111111 >: ②] < D[< 111111 >: ① in the middle left figure] < D[< 111111 >: ①] < D[< 111111 >: ② in the middle left figure] < D[< 111111 >: ③] < D[< 111111 >: ④].
Furthermore, the total surface of the right triangles sharing the circle diameter as hypotenuse is conditioned to be identical in the middle left and middle right figures, so the rectangle-bias complexity Cr fails to distinguish the two systems. On the other hand, under the same condition, the cuboid-bias complexity Cc distinguishes between these two systems and gives a higher value to the left one. The volumes of the 5-dimensional cuboids of the decomposition sequence $\langle 111111 \rangle \xrightarrow{①\,②\,②'\,③\,④} \langle 123456 \rangle$ are schematically shown in the bottom figures, maintaining the quantitative difference between KL-divergences. Since the multi-information I is identical between the two systems, so is the value of the KL-divergence D[< 111111 >:< 123456 >], which by the Pythagorean theorem is the sum of the KL-divergences along the sequence $\langle 111111 \rangle \xrightarrow{①\,②\,②'\,③\,④} \langle 123456 \rangle$. This means that the inequality between the cuboid volumes can be represented as the isoperimetric inequality of a high-dimensional cuboid. As a consequence, the left system has a quantitatively higher value of Cc than the right one. The cuboid-bias complexity Cc is thus sensitive to such heterogeneity.

Figure 6. Meaning of taking the geometric mean over the sequence of system decompositions in the cuboid-bias complexity Cc. (a): Example of a 6-node network with a circularly connected structure of medium intensity. Edge width is proportional to edge information; (b): Example of a 6-node network with 3 strongly connected subsystems. Edge width is proportional to edge information. The multi-information I of the two systems in the top figures is conditioned to be identical; the dotted lines schematically represent possible system decompositions.
(c,d): Circle diagrams of each system decomposition in the upper networks; the total surface of the right triangles sharing the circle diameter as hypotenuse in (c) and (d) is conditioned to be identical, so the rectangle-bias complexity Cr fails to distinguish the two systems. (e,f): 5-dimensional cuboids of the upper networks (Figure 6a,b) whose edges are the square roots of the KL-divergences for the chain of system decompositions $\langle 111111 \rangle \xrightarrow{①\,②\,②'\,③\,④} \langle 123456 \rangle$. Only the first 3-dimensional part is shown with solid lines, and the remaining 2-dimensional part is represented with dotted lines. The volume of the cuboid in (e) is larger than that in (f), according to the isoperimetric inequality of a high-dimensional cuboid. The total squared length of the sides is identical between the two cuboids and represents the multi-information I = D[< 111111 >:< 123456 >].

8. Regularized Cuboid-Bias Complexity with Respect to Generalized Mutual Information

We further consider the geometrical composition of system decompositions in the circle diagram and argue for the necessity of renormalizing the cuboid-bias complexity Cc with the multi-information I, which gives another measure of complexity, namely the "regularized cuboid-bias complexity C^R_c."

We consider the situation in actual data where the multi-information I varies. Figure 7 shows n = 3 cases that Cc fails to distinguish. Both the blue and red systems are supposed to have the same Cc value, obtained by adjusting the red system to have relatively smaller values of the KL-divergences D[< 111 >:< 122 >] and D[< 113 >:< 123 >] than the blue one. Such conditioning is possible since the KL-divergences are parameters independent of each other.

Figure 7. Examples of 3-node systems with identical cuboid-bias complexity Cc but different multi-information I on the circle graph. (a): System with smaller I but larger C^R_c; (b): System with larger I but smaller C^R_c; (c): Superposition of the above two systems.
The regularized cuboid-bias complexity C^R_c distinguishes between the blue and red systems.

Although the Cc value is identical, the two systems have different geometrical compositions of system decompositions in the circle diagram. The red system has a relatively easier decomposition < 111 >→< 122 > when renormalized with the total system decomposition < 111 >→< 123 >. This relative decompositionability with respect to the renormalization with the multi-information I can be clearly understood by superimposing the circle diagrams of the two systems and comparing the angles between each decomposition path and the total decomposition path (bottom figure). The red system has a larger angle between the decomposition paths < 111 >→< 122 > and < 111 >→< 123 > than any other in the blue system, which represents the relative facility of the decomposition under renormalization with I. In these terms, the paths < 111 >→< 121 > in the red and blue systems do not change their relative facility, and the path < 111 >→< 113 > is easier in the blue system.

To express the system decompositionability based on these geometrical compositions in a comprehensive manner, we define the regularized cuboid-bias complexity C^R_c as follows:

$$C^R_c := \frac{1}{|Seq|} \sum_{i_s=1}^{|Seq|} \prod_{i=1}^{n-1} \frac{D[SD_i(i_s) : SD_{i+1}(i_s)]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]} = \frac{C_c}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}} = \frac{C_c}{I^{n-1}}. \tag{37}$$

The red system then has a quantitatively smaller C^R_c value than the blue system in Figure 7.

9. Modular Complexity with Respect to the Easiest System Decomposition Path

We have so far considered the system decompositionability with respect to all possible decomposition sequences. This was also a way to prevent local fluctuations of the network heterogeneity from being reflected only in some specific decomposition paths. On the other hand, the easiest decomposition is particularly important when considering the modularity of the system.
If there exists a hierarchical structure of modularity at different scales with different coherence of the system, the KL-divergences and the sequence of the easiest decomposition give much information.

Figure 8 schematically shows a typical example in which two levels of modularity exist. Such a structure with different scales of statistical coherence appears as functional segregation in neural systems [17], and is expected to be observed widely in complex systems. The hierarchical topology of the easiest decomposition path reflects these structures. For example, in the system of Figure 8, the decompositions between < 1 1 · · · 1 > and < 1 1 1 1 5 5 5 5 9 9 9 9 13 13 13 13 > are easier than those inside the 4-node subsystems. The values of the KL-divergences also reflect the hierarchy, giving relatively low values for the decompositions between the 4-node subsystems and high values inside them. By examining the shortest decomposition path and the associated KL-divergences among the possible Seq, one can project the hierarchical structure of the modularity existing in the system.

Figure 8. Example of a 16-node system < 11 · · · 1 > that has different levels of modularity. The four 4-node subsystems < 1111 > (blue blocks) are loosely connected and easy to decompose, while the inside of each component (red blocks) is tightly connected. The degree of connection represents statistical dependency or edge information between subsystems. Such a hierarchical structure can be detected by observing the decomposition path of the modular complexity Cm.
For this reason, we define the modular complexity Cm as follows, which is the shortest-path component of the cuboid-bias complexity Cc:

$$C_m := \prod_{i=1}^{n-1} D[SD_i(i_{min}) : SD_{i+1}(i_{min})], \tag{38}$$

where the index i_min of the sequence SD1(i_min) → SD2(i_min) → · · · → SDn(i_min) is chosen as follows:

$$i_{min} = \{i_1\} \cap \{i_2\} \cap \cdots \cap \{i_{n-1}\}, \tag{39}$$

where

$$\begin{aligned}
\{i_1\} &= \operatorname*{argmin}_{i_s} \{ D[SD_1(i_s) : SD_2(i_s)] \mid 1 \le i_s \le |Seq| \}, \\
\{i_2\} &= \operatorname*{argmin}_{i_1} \{ D[SD_2(i_1) : SD_3(i_1)] \mid i_1 \in \{i_1\} \}, \\
&\;\;\vdots \\
\{i_{n-1}\} &= \operatorname*{argmin}_{i_{n-2}} \{ D[SD_{n-1}(i_{n-2}) : SD_n(i_{n-2})] \mid i_{n-2} \in \{i_{n-2}\} \},
\end{aligned} \tag{40}$$

which eventually gives

$$i_{min} = i_{n-1}. \tag{41}$$

This means that, beginning from the undecomposed state < 11 · · · 1 >, we keep choosing the shortest decomposition path in the next hierarchy of system decomposition. The minimization of the path length is guaranteed by the sequential minimization, since the geometric mean of an isometric path division is bounded below by its minimum component. i_min is unique if the system is completely heterogeneous (i.e., D[SD1(ik) : SD2(ik)] ≠ D[SD1(il) : SD2(il)], 1 ≤ ik < il ≤ |Seq|); otherwise, several decomposition paths that give the same Cm value are possible, according to the homogeneity of the system.

Besides its value, the modular complexity Cm should be used together with the sequence information of the shortest decomposition path to evaluate the modularity structure of a system.

Cases where Cm is identical but Cc differs can be constructed by varying the system decompositions outside the shortest path SD1(i_min) → SD2(i_min) → · · · → SDn(i_min) without modifying the index i_min. There also exist inverse examples with identical Cc and different Cm, due to the complementarity between Cm and Cc.

We finally define the regularized modular complexity C^R_m as follows, for the same reason as when defining C^R_c from Cc:

$$C^R_m := \prod_{i=1}^{n-1} \frac{D[SD_i(i_{min}) : SD_{i+1}(i_{min})]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]} = \frac{C_m}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}} = \frac{C_m}{I^{n-1}}. \tag{42}$$

Proposition 4.
The cuboid-bias complexities Cc and C^R_c are bounded by the modular complexities Cm and C^R_m, respectively:

$$C_c \le C_m, \tag{43}$$

$$C^R_c \le C^R_m. \tag{44}$$

They coincide at the maximum values under the given multi-information I:

$$\max\{C_m \mid I = \text{const.}\} = \max\{C_c \mid I = \text{const.}\}, \tag{45}$$

$$\max\{C^R_m\} = \max\{C^R_c\}. \tag{46}$$

The relations (43)–(46) are shown numerically in the "Numerical Comparison" section. The superiority of the modular complexities is due to the hierarchical dependency of the KL-divergence values along decomposition paths. In the shortest decomposition path defining the modular complexities, the easier system decompositions relatively increase their values, since they incorporate a larger number of edge cuttings. Since we eventually cut all edges to obtain < 12 · · · n > at the end of the decomposition sequence, collecting the edges with relatively weak edge information and cutting them together augments the value of the product of KL-divergences. The modular complexities are then the maximum-value components among the possible decomposition paths calculated in the cuboid-bias complexities:

$$C_m = \max \left\{ \prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)] \;\middle|\; 1 \le i_s \le |Seq| \right\}, \tag{47}$$

$$C^R_m = \max \left\{ \frac{\prod_{i=1}^{n-1} D[SD_i(i_s) : SD_{i+1}(i_s)]}{D[\langle 11 \cdots 1 \rangle : \langle 12 \cdots n \rangle]^{n-1}} \;\middle|\; 1 \le i_s \le |Seq| \right\}. \tag{48}$$

The difference between the cuboid-bias complexities and the modular complexities is an index of the geometrical variation of decomposed systems in the circle graph, which reflects the fluctuation of the sequence-wise system decompositionability. If the variation of the system decompositionability across system decompositions is large, the modular complexities accordingly tend to give higher values than the cuboid-bias complexities.

10. Numerical Comparison

We numerically investigate the complementarity between the proposed complexities Cc, C^R_c, Cm, and C^R_m.
Since the minimum node number giving non-trivial meaning to these measures is n = 4, the corresponding dimension of the parameter space is $\sum_{k=1}^{n} {}_nC_k = 15$. The constant-complexity submanifolds are therefore difficult to visualize due to the high dimensionality. For simplicity, we focus on a 2-dimensional subspace of this parameter space, whose first axis ranges from random to maximum dependency of the system, and whose second axis represents the system decompositionability of < 1133 >. For this purpose, we introduce the following parameters α and β (0 ≤ α, β ≤ 1) in the η-coordinates of the discrete distribution of a 4-dimensional binary stochastic variable:

$$\begin{aligned}
\eta_1 &= \eta_0, \quad \eta_2 = \eta_0, \quad \eta_3 = \eta_0, \quad \eta_4 = \eta_0, \\
\eta_{1,2} &= \eta_1\eta_2 + \alpha(\eta_0 - \epsilon - \eta_1\eta_2), \\
\eta_{3,4} &= \eta_3\eta_4 + \alpha(\eta_0 - \epsilon - \eta_3\eta_4), \\
\eta_{1,3} &= \eta_1\eta_3 + \alpha\beta(\eta_0 - \epsilon - \eta_1\eta_3), \\
\eta_{1,4} &= \eta_1\eta_4 + \alpha\beta(\eta_0 - \epsilon - \eta_1\eta_4), \\
\eta_{2,3} &= \eta_2\eta_3 + \alpha\beta(\eta_0 - \epsilon - \eta_2\eta_3), \\
\eta_{2,4} &= \eta_2\eta_4 + \alpha\beta(\eta_0 - \epsilon - \eta_2\eta_4), \\
\eta_{1,2,3} &= \eta_{1,2}\eta_3 + \alpha\beta(\eta_0 - 2\epsilon - \eta_{1,2}\eta_3), \\
\eta_{1,2,4} &= \eta_{1,2}\eta_4 + \alpha\beta(\eta_0 - 2\epsilon - \eta_{1,2}\eta_4), \\
\eta_{1,3,4} &= \eta_1\eta_{3,4} + \alpha\beta(\eta_0 - 2\epsilon - \eta_1\eta_{3,4}), \\
\eta_{2,3,4} &= \eta_2\eta_{3,4} + \alpha\beta(\eta_0 - 2\epsilon - \eta_2\eta_{3,4}), \\
\eta_{1,2,3,4} &= \eta_{1,2}\eta_{3,4} + \alpha\beta(\eta_0 - 3\epsilon - \eta_{1,2}\eta_{3,4}).
\end{aligned} \tag{49}$$

Here, α represents the degree of statistical association from random (α = 0) to maximum (α = 1), and β controls the system decompositionability of < 1133 >. If β = 1, the system has the maximum KL-divergence D[< 1111 >:< 1133 >] under the constraint of the α parameter, and β = 0 gives D[< 1111 >:< 1133 >] = 0. ϵ is the minimum value of the joint distribution of the 4-dimensional variable, which is defined to be larger than 0 to avoid singularity in the dual-flat coordinates of the statistical manifold. ϵ = 1.0 × 10^−10 and η0 = 0.5 were chosen for the calculation.

The system with maximum statistical association under a given η0 corresponds to the condition α = β = 1, whose η-coordinates become as follows:

$$\begin{aligned}
\eta_1 &= \eta_0, \;\ldots,\; \eta_4 = \eta_0, \\
\eta_{1,2} &= \eta_0 - \epsilon, \;\ldots,\; \eta_{3,4} = \eta_0 - \epsilon, \\
\eta_{1,2,3} &= \eta_0 - 2\epsilon, \;\ldots,\; \eta_{2,3,4} = \eta_0 - 2\epsilon, \\
\eta_{1,2,3,4} &= \eta_0 - 3\epsilon.
\end{aligned} \tag{50}$$
On the other hand, the totally decomposed system corresponds to the condition α = 0, and the η-coordinates are:

$$\begin{aligned}
\eta_1 &= \eta_0, \;\ldots,\; \eta_4 = \eta_0, \\
\eta_{1,2} &= \eta_0\eta_0, \;\ldots,\; \eta_{3,4} = \eta_0\eta_0, \\
\eta_{1,2,3} &= \eta_0\eta_0\eta_0, \;\ldots,\; \eta_{2,3,4} = \eta_0\eta_0\eta_0, \\
\eta_{1,2,3,4} &= \eta_0\eta_0\eta_0\eta_0.
\end{aligned} \tag{51}$$

Note that the completely deterministic case η0 = 1.0 and α = β = 1 gives I = 0. The intuitive meaning of the parameters α and β is also schematically depicted in Figure 9d.

Figure 9. Contour plot of the complexity landscape of I, Cc, Cm, C^R_c, and C^R_m on the α-β plane. (a): Superposition of the contour plots of Cc and Cm. (b): Superposition of the contour plots of C^R_c and C^R_m. (c): Contour plot of I. The color of the contour plots corresponds to the color gradient of the 3D plots in Figure 10; (d): Schematic representation of the system in different regions of the α-β plane. Edge width represents the degree of edge information, and independence is depicted with dotted lines.

Figure 10 shows the landscape of the proposed complexities on the α-β plane. Their contour plots are depicted in Figure 9. Each of the proposed complexities differs from the others almost everywhere on the α-β plane, except along the intersection lines. Therefore, these measures serve as independent features of the system, each with its specific meaning with respect to the system decompositionability. The α-β plane shows a section of the actual structure of the complementarity between the proposed complexity measures expressed in Figure 3.

The relations between the cuboid-bias complexities and the modular complexities in Equations (43)–(46) are also numerically confirmed. The modular complexities are greater than or equal to the corresponding cuboid-bias complexities, and coincide at α = β = 1, which gives the maximum values and dependencies in this parameterization.

Figure 10. Landscape of the complexities I, Cc, Cm, C^R_c, and C^R_m on the α-β plane. (a): Multi-information I; (b): Cuboid-bias complexity Cc.
(c): Modular complexity Cm; (d): Regularized cuboid-bias complexity C^R_c; (e): Regularized modular complexity C^R_m. All complexity measures show complementarity, intersecting with each other, and all except the multi-information I satisfy the boundary conditions of vanishing at α = 0 and β = 0. Note that the regularized complexities C^R_c and C^R_m show a singularity of convergence at α → 0 due to the regularization by an infinitesimal value.

In the general case without the parameterization with α, β and η0, the boundary conditions of Cc, C^R_c, Cm and C^R_m include those of the multi-information I, which vanishes at the completely random or ordered state. This is common to other complexity measures such as the LMC complexity, and fits the basic intuition that complexity is situated equally far from the completely predictable and the completely disordered states [21,22].

The proposed complexities further incorporate boundary conditions under which they vanish whenever a completely independent subsystem of any size exists. This means that the Cc, C^R_c, Cm and C^R_m of a system become 0 if we add another independent variable. This property does not reflect the intuition of complexity defined by the arithmetic average of statistical measures. The proposed complexities find their meaning better in comparison with other complexity measures such as the multi-information I, and by interactively changing the system scale to avoid trivial results caused by a small independent subsystem. For example, the proposed complexities could be used as information criteria for model selection problems, especially with an approximate modular structure based on the statistical independence of data between subsystems. We maintain that the complementarity principle between several complexity measures with different foundations is the key to understanding complexity in a comprehensive manner.
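The definitions of Equations (32), (37), (47) and (48), together with the vanishing property for an independent subsystem, can be sketched numerically as follows. This is a minimal illustration rather than the authors' code. It assumes that each system decomposition is the product of the marginal distributions of its subsystems, so that the KL-divergence of one decomposition step splitting a block S into S1 and S2 equals H(S1) + H(S2) − H(S). It evaluates Cm through the maximum-product form of Equation (47), not through the sequential minimization of Equations (39)–(41). All function names and both example distributions are hypothetical.

```python
import itertools
import math

def entropy(p, n, block):
    """Shannon entropy (nats) of the marginal of joint table p on `block`."""
    marginal = {}
    for x, px in zip(itertools.product((0, 1), repeat=n), p):
        key = tuple(x[i] for i in block)
        marginal[key] = marginal.get(key, 0.0) + px
    return -sum(q * math.log(q) for q in marginal.values() if q > 0)

def binary_splits(block):
    """Every unordered split of a block into two non-empty parts."""
    first, rest = block[0], block[1:]
    for r in range(len(rest)):
        for extra in itertools.combinations(rest, r):
            left = (first,) + extra
            right = tuple(v for v in rest if v not in extra)
            yield left, right

def sequence_products(p, n):
    """Product of step divergences for every decomposition sequence."""
    def walk(partition, product):
        splittable = [b for b in partition if len(b) > 1]
        if not splittable:
            yield product
            return
        for block in splittable:
            for left, right in binary_splits(block):
                # Assumed step divergence: H(left) + H(right) - H(block).
                step = (entropy(p, n, left) + entropy(p, n, right)
                        - entropy(p, n, block))
                rest = [b for b in partition if b != block]
                yield from walk(rest + [left, right], product * step)
    return list(walk([tuple(range(n))], 1.0))

def complexities(p, n):
    """Return (I, Cc, Cm, CRc, CRm) for a joint table over n binary variables."""
    products = sequence_products(p, n)
    multi_info = (sum(entropy(p, n, (i,)) for i in range(n))
                  - entropy(p, n, tuple(range(n))))
    c_c = sum(products) / len(products)   # arithmetic mean over sequences
    c_m = max(products)                   # maximum-product sequence
    if multi_info > 1e-12:
        scale = multi_info ** (n - 1)
        return multi_info, c_c, c_m, c_c / scale, c_m / scale
    return multi_info, c_c, c_m, float("nan"), float("nan")

states = list(itertools.product((0, 1), repeat=3))
# Hypothetical example 1: three mutually dependent binary variables.
p_dep = [0.3875 if x in ((0, 0, 0), (1, 1, 1)) else 0.0375 for x in states]
# Hypothetical example 2: x3 is a fair coin independent of the pair (x1, x2).
q_pair = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_ind = [q_pair[(x[0], x[1])] * 0.5 for x in states]

if __name__ == "__main__":
    for name, p in (("dependent", p_dep), ("independent x3", p_ind)):
        i_val, c_c, c_m, c_rc, c_rm = complexities(p, 3)
        print(f"{name}: I={i_val:.4f} Cc={c_c:.6f} Cm={c_m:.6f}")
```

In the second example the multi-information stays positive because x1 and x2 are dependent, while every sequence contains one step with zero divergence, so Cc and Cm vanish up to floating-point error. The enumeration grows combinatorially with n and is only practical for small systems.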
To characterize the properties of Cc, C^R_c, Cm and C^R_m in relation to the diverse composition of each system decomposition, it is useful to consider the geometry of their contour structure, as compared in Figure 9. The contours can be formalized as Cc, C^R_c, Cm, C^R_m = const. for each complexity measure, and D[< 11 · · · 1 >: SDi(is)] = const. (1 ≤ i ≤ n − 1, 1 ≤ is ≤ |Seq|) for each system decomposition. For that purpose, algebraic geometry can be considered a promising tool. Algebraic geometry investigates the geometrical properties of polynomial equations [23]. The complexities Cc, C^R_c, Cm and C^R_m can be interpreted as polynomial functions by taking each system decomposition as a new coordinate, and are therefore directly accessible to algebraic geometry. However, if we want to investigate the contours of the complexities in the p parameter space, the logarithmic function appears in the definition of the KL-divergence; it is a transcendental function and falls outside the scope of algebraic geometry. To make the p parameter space of information geometry compatible with algebraic geometry, it suffices to describe the model by replacing the logarithmic functions with another n variables such as q = log p, and then to consider the intersection between the result from algebraic geometry on the coordinates (p, q) and the condition q = log p.

The contours of Cc, C^R_c, Cm and C^R_m are also important when seeking to use these measures as potentials for interpreting the dynamics of statistical association as geodesics.

11. Further Consideration

11.1. Pythagorean Relations in System Decomposition and Edge Cutting

We further look back at system decomposition and edge cutting in terms of the Pythagorean relation between KL-divergences, which is based on the orthogonality between the θ and η coordinates.
In system decomposition, the distribution of the decomposed system is obtained analytically from the product of the subsystems' η coordinates, which is equivalent to setting all θ^dec parameters to 0 in the mixture coordinates ξ^dec. From the consistency of the θ^dec parameters in ξ^dec being 0 in all system decompositions, we have the Pythagorean relation according to the inclusion relation of system decompositions. For example, the following holds:

$$D[\langle 1111 \rangle : \langle 1234 \rangle] = D[\langle 1111 \rangle : \langle 1222 \rangle] + D[\langle 1222 \rangle : \langle 1233 \rangle] + D[\langle 1233 \rangle : \langle 1234 \rangle]. \tag{52}$$

The proof proceeds in the same way as for the k-cut coordinates isolating the k-tuple statistical association between variables [14].

On the other hand, the edge cutting previously defined using the product of the remaining maximum cliques' η coordinates does not coincide with the θ^ec = 0 condition in the mixture coordinates ξ^ec. We have defined the η^ec values of edge cutting based only on the orthogonal relation between the η and θ coordinates, by generalizing the rule of system decomposition in the η^ec coordinates, and did not consider the Pythagorean relation between different edge cuttings.

It is then possible to define another way of edge cutting using the θ^ec = 0 condition in ξ^ec. Indeed, in the k-cut mixture coordinates, the θ^{k+} = 0 condition is derived from the independence condition of the variables in all orders, and the k-tuple statistical association is measured by reestablishing the η parameters for the statistical association up to (k − 1)-tuple order. In the same way, we can set the θ^dec = 0 condition for ξ^dec of a system decomposition, and reestablish edges with respect to the η parameters, except the one in focus for edge cutting.

As a simple example, consider the system decomposition < 1222 > and the edge cutting 1 − 2 in a 4-node graph.
We have the mixture coordinates ξ^dec for the system decomposition as follows:

$$\begin{aligned}
\xi^{dec}_{1,2} &= \theta^{dec}_{1,2} = 0, \\
\xi^{dec}_{1,3} &= \theta^{dec}_{1,3} = 0, \\
\xi^{dec}_{1,4} &= \theta^{dec}_{1,4} = 0, \\
\xi^{dec}_{1,2,3} &= \theta^{dec}_{1,2,3} = 0, \\
\xi^{dec}_{1,3,4} &= \theta^{dec}_{1,3,4} = 0, \\
\xi^{dec}_{1,2,3,4} &= \theta^{dec}_{1,2,3,4} = 0,
\end{aligned} \tag{53}$$

where all the remaining ξ^dec coordinates are equal to the corresponding η coordinates.

We then consider the new way of edge cutting 1 − 2 by recovering the statistical association in edges 1 − 3 and 1 − 4 from the system decomposition < 1222 >, orthogonally to that of edge 1 − 2. The new mixture coordinates ξ^EC become the following:

$$\begin{aligned}
\xi^{EC}_{1,2} &= \theta^{EC}_{1,2} = 0, \\
\xi^{EC}_{1,3} &= \eta_{1,3}, \\
\xi^{EC}_{1,4} &= \eta_{1,4}, \\
\xi^{EC}_{1,2,3} &= \theta^{EC}_{1,2,3} = 0, \\
\xi^{EC}_{1,3,4} &= \eta_{1,3,4}, \\
\xi^{EC}_{1,2,3,4} &= \theta^{EC}_{1,2,3,4} = 0,
\end{aligned} \tag{54}$$

and the rest are equal to the corresponding η coordinates. This new ξ^EC is also compatible with the k-cut coordinates formalization because of its simple θ^EC = 0 conditions.

To obtain ξ^EC for an arbitrary edge cutting i − j, one should take the θ^EC containing i and j in their subscripts, set them to 0, and combine them with the η coordinates for the rest of the subscripts. For several edge cuttings i − j, · · · , k − l (1 ≤ i, j, k, l ≤ n), it suffices to take the θ^EC containing i and j, ... , k and l in their subscripts respectively, then set them to 0.

We finally obtain the Pythagorean relation between edge cuttings. Denoting the general edge cutting coordinates as ξ_{i−j,··· ,k−l}, the following holds for the example of the system decomposition < 1222 >:

$$D[\langle 1111 \rangle : \langle 1222 \rangle] = D[\langle 1111 \rangle : p(\xi_{1-2})] + D[p(\xi_{1-2}) : p(\xi_{1-2,1-3})] + D[p(\xi_{1-2,1-3}) : p(\xi_{1-2,1-3,1-4})]. \tag{55}$$

Despite the consistency with the dual structure between θ and η, we do not generally have an analytical solution for determining the η^EC values from the θ^EC = 0 conditions. We have to resort to a numerical algorithm to solve the θ^EC = 0 conditions with respect to the η^EC values, which are in general simultaneous polynomial equations of high degree.
Furthermore, the numerical convergence of the solution has to be very strict, since a tiny deviation from the conditions can become non-negligible after passing through the fractional and logarithmic functions of the θ coordinates.

On the other hand, the previously defined edge cutting with ξ^ec, using the product between the subgraphs' η coordinates, is analytically simple and does not need to consider the recovery of the other edges from a system decomposition or an independence hypothesis. We therefore chose the previous way of edge cutting for both calculability and clarity of the concept.

There have been many attempts to approximate complex networks by low-dimensional systems with the use of statistical physics and network theory. As a contemporary example, moment-closure approximation provides various ways to abstract essential dynamics, e.g., in discrete adaptive networks [24]. Although the approximation relies on several theoretical assumptions such as the random graph approximation, it is difficult to reproduce the dynamics quantitatively even in some of the simplest models. This is partly due to the homogeneous treatment of statistics, such as truncation at pair-wise order. Edge cutting can offer a complementary view on the evaluation of moment-closure approximations. Using the orthogonal decomposition between edge information, one can evaluate which part of the network links and which order of statistics contain essential information, which does not necessarily conform to the top-down theoretical treatment.

11.2. Complexity of the Systems with Continuous Phase Space

We have developed the concept of system decompositionability based on discrete binary variables. One can also apply the same principle to continuous variables. For an ergodic map G : X → X in a continuous space X, the KS entropy h(μ, G) is defined as the maximum of the entropy rate with respect to all possible system decompositions A, when the invariant measure μ exists:

$$h(\mu, G) = \sup_A h(\mu, G, A), \tag{56}$$

where A is a disjoint decomposition of X that consists of non-trivial sets ai, whose total number is n(A), defined as

$$X = \bigcup_{i=1}^{n(A)} a_i, \tag{57}$$

$$a_i \cap a_j = \emptyset, \quad i \ne j, \quad 1 \le i, j \le n(A), \tag{58}$$

which is the natural extension of system decomposition to continuous space.

The entropy rate h(μ, G, A) in Equation (56) is defined as

$$h(\mu, G, A) = \lim_{n \to \infty} \frac{1}{n} H(\mu, A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{59}$$

according to the entropy H(μ, A) based on the decomposition A = {ai},

$$H(\mu, A) = - \sum_{i=1}^{n(A)} \mu(a_i) \ln \mu(a_i), \tag{60}$$

and the product C = A ∨ B, defined as

$$C = A \vee B = \{c_i = a_j \cap b_k \mid 1 \le j \le n(A), \; 1 \le k \le n(B)\}. \tag{61}$$

In a more general case, the topological entropy hT(G) is defined simply with the number of decomposed subsystem elements generated by preimages, as follows, without requiring ergodicity and therefore without requiring the existence of an invariant measure μ:

$$h_T(G) = \sup_A \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)). \tag{62}$$

Topological entropy takes the maximum value over the possible preimage divisions, in order to measure the complexity in terms of the mixing degree of the orbits.

For example, if the KS entropy is positive, h(μ, G) > 0, the dynamics of G on an invariant set of the invariant measure μ is chaotic for almost all initial conditions. If the topological entropy is positive, hT(G) > 0, the dynamics of G contains chaotic orbits, but not necessarily an attractive chaotic invariant set, since hT(G) ≥ h(μ, G) and the KS entropy can be zero.

Although these definitions are useful to characterize the existence of chaotic dynamics, the system decompositionability is another property, representing a different aspect of system complexity. It is rather a matter of the existence of independent dynamical components, or of the degree of orbit localization between arbitrary system decompositions.

We propose the following "geometric topological entropy" hg(G), applying the same principle of taking the geometric product over all hierarchical structures of the system decomposition A.
$$h_g(G) := \prod_{\sigma(A) > 0} \lim_{n \to \infty} \frac{1}{n} \ln n(A \vee G^{-1}(A) \vee \cdots \vee G^{-n+1}(A)), \tag{63}$$

where σ(A) > 0 means taking all components of A that have positive Lebesgue measure on X. This gives 0 if the preimage of a certain ai ∈ A is ai itself, meaning that there exists a subsystem ai whose range is invariant under G and closed in itself. The system X can then be completely divided into ai and the rest. This corresponds to the existence of an independent subsystem in the cuboid-bias and modular complexities. When such independent components do not exist, hg(G) still reflects the degree of orbit localization for all possible system decompositions in a multiplicative manner. The condition σ(A) > 0 avoids trivial cases such as the existence of an unstable limit cycle, whose Lebesgue measure is 0. A typical example giving hg(G) = 0 is a function having independent ergodic components, such as the Chirikov-Taylor map with appropriate parameters [25].

12. Conclusions and Discussion

We have theoretically developed a framework to measure the degree of statistical association existing between subsystems, as well as that represented by each edge of the graph representation. We then reconsidered the problem of how to define complexity measures in terms of the construction of a non-linear feature space. We defined a new type of complexity based on the geometrical product of KL-divergences, representing the degree of system decompositionability. Different complexity measures, including the newly proposed ones, are compared on a complementarity basis on the statistical manifold.

Applications of the presented theory can encompass a large field of complex systems and data science, such as social networks, genetic expression networks, neural activities, ecological databases, and any kind of complex network with binary co-occurrence matrix data, e.g., [26–29]; databases: [30–34]. Continuous variables are also accessible through appropriate discretization of the information source with, e.g., the entropy maximization principle.
In contrast to the arithmetic mean of information over the whole system, the geometric mean has not been investigated sufficiently in the analysis of complex networks. In another field, however, theoretical ecology has already pointed out the importance of the geometric mean when considering the long-term fitness of a species population in a randomly varying environment [35,36]. Long-term fitness refers to the ecological complexity of a survival strategy under large stochastic fluctuations. Here, we can find a useful analogy between the growth rate of a population in ecology and the spatio-temporal propagation rate of information between subsystems in general. If we take an arbitrary subsystem and consider the amount of information it can exchange with all other subsystems, the proposed complexity measures with the geometric mean reflect the minimum amount exchanged among all possible other subsystems, which cannot be distinguished with the arithmetic mean.

The propagation rate of a population in ecology and the information transmission in a complex network have a mathematically analogous structure. In population ecology, the variance of the growth rate is crucial to evaluate the long-term survival of the population. Even if the arithmetic mean of the growth rate is high, a large variance will lead to a low geometric mean, even with a small number of situations of exceptionally small fitness, which ecologically means the extinction of an entire species. In a stochastic network, the variance of system decompositionability is essential to evaluate the amount of information shared between subsystems, or the persistence of information in the entire network. Even if the multi-information $I$ is high, a large heterogeneity of edge information can lead to the informational isolation of a certain subsystem, which means the extinction of its information. If such a subsystem is situated on a transmission pathway, information cannot propagate across these nodes.
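The contrast between the two means can be illustrated with a minimal sketch. The edge-information values below are hypothetical numbers chosen only for illustration (they are not data from this study), and the helper names `arithmetic_mean` and `geometric_mean` are likewise assumptions of the example.

```python
import math

def arithmetic_mean(values):
    """Plain arithmetic mean of a list of non-negative numbers."""
    return sum(values) / len(values)

def geometric_mean(values):
    """Geometric mean; it vanishes as soon as one entry vanishes."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical edge informations (arbitrary units) with the same arithmetic mean.
homogeneous = [1.0, 1.0, 1.0, 1.0]
heterogeneous = [1.9, 1.9, 0.19, 0.01]   # one nearly isolated edge

for name, edges in (("homogeneous", homogeneous), ("heterogeneous", heterogeneous)):
    print(name, arithmetic_mean(edges), geometric_mean(edges))
```

Both lists share the same arithmetic mean, while the geometric mean of the heterogeneous list is pulled down by its single weak edge and drops to zero if that edge carries no information at all.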
Therefore, the proposed complexity measures $C_C$, $C_C^R$, $C_m$ and $C_m^R$ generally reflect the minimum information propagation rate over the entire system, without excluding any isolated division.

Some recent studies on adaptive networks focus on the evolution of network topology in response to node activity, such as the game-theoretic evolution of strategies [37], opinion dynamics on an evolving network [38], epidemic spreading on an adaptive network [39], etc. The analysis of a network in which variables and interactions coevolve can capture important dynamical features of complex systems. The newly proposed complexity measures can complement topological network analysis with an analysis of the statistical dynamics. In addition to the topological change of the network model (e.g., linking dynamics of game theory, opinion community network structure, contact network of epidemic transmission), one can evaluate the emerging statistical association between the variables, which does not necessarily coincide with the network topology. An interesting feature of non-linear dynamics is the unexpected correlation between distant variables, which is quantified as Tsallis entropy [40]. The complementary relation between concrete interaction and the resulting statistical association can provide a twofold methodology to characterize the coevolutionary dynamics of adaptive networks. Such a strategy can promote integrated science from laboratory experiments to open-field in natura situations, where actual multi-scale problems remain to be solved [41].

Arithmetic and geometric means can be integrated in a common formula called the generalized mean [42]. Therefore, the proposed complexity measures with the geometric mean of KL-divergence are an extension of preexisting complexity measures with mixture coordinates. Table 1 summarizes the generalization of complexity measures in this article.
Based on the $k$-cut coordinates $\boldsymbol{\zeta}$, the weighted sum of KL-divergences representing the $k$-tuple order of statistical association yields complexity measures with a (weighted) arithmetic mean, such as the multi-information $I$ and the TSE complexity. On the other hand, we showed that subsystem-wise correlation can also be isolated with the use of mixture coordinates, namely the $\langle \cdots \rangle$-cut coordinates $\boldsymbol{\xi}$. To quantify the heterogeneity of system decompositionability, we generally took a weighted geometric mean of KL-divergences in $C_C$, $C_C^R$, $C_m$ and $C_m^R$. Here, the shortest-path selection of $C_m$ and $C_m^R$, and the regularization of $C_C^R$ and $C_m^R$ with respect to the multi-information $I$, can be interpreted as the weight function of the geometric mean. This perspective leads to the definition of a generalized class of complexity measures based on mixture coordinates and the generalized mean of KL-divergence. Information discrepancy can also be generalized from KL-divergence to Bregman divergence, providing access to the concept of multiple centroids in large stochastic data analysis such as image processing [43].

The blank cells of Table 1 imply the possibility of other complexity measures in this class. For example, the weighted geometric mean of KL-divergences defined between $k$-cut coordinates is expected to yield complexity measures that are sensitive to the heterogeneity of correlation orders. The weighted arithmetic mean of KL-divergences defined between $\langle \cdots \rangle$-cut coordinates should be sensitive to the mean decompositionability of an arbitrary subsystem. Since these measures take analytically different forms in terms of mixture coordinates and/or mean functions, their derivatives do not coincide. They therefore give independent information about the system on a complementary basis on the statistical manifold, as long as the number of complexity measures is smaller than the number of degrees of freedom of the system.

Table 1. Classification of complexity measures with KL-divergence on mixture coordinates.
Mixture coordinates | Arithmetic mean of KL-divergence | Geometric mean of KL-divergence
--- | --- | ---
$k$-cut $\boldsymbol{\zeta}$ | TSE complexity, $I$ | (blank)
$\langle \cdots \rangle$-cut $\boldsymbol{\xi}$ | (blank) | $C_C$, $C_C^R$, $C_m$, $C_m^R$

Acknowledgments: This study was partially supported by CNRS, the long term study abroad support program of the University of Tokyo, and the French government (Promotion Simone de Beauvoir).

Conflicts of Interest: The author declares no conflict of interest.

References

1. Boccaletti, S.; Latora, V.; Moreno, Y.; Chavez, M.; Hwang, D.U. Complex Networks: Structure and Dynamics. Phys. Rep. 2006, 424, 175–308.
2. Strogatz, S.H. Exploring Complex Networks. Nature 2001, 410, 268–276.
3. Wasserman, S.; Faust, K. Social Network Analysis; Cambridge University Press: Cambridge, UK, 1994.
4. Funabashi, M.; Cointet, J.P.; Chavalarias, D. Complex Network. In Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2009; Volume 207, pp. 161–172.
5. Badii, R.; Politi, A. Complexity: Hierarchical Structures and Scaling in Physics; Cambridge University Press: Cambridge, UK, 2008.
6. Lempel, A.; Ziv, J. On the Complexity of Finite Sequences. IEEE Trans. Inf. Theory 1976, 22, 75–81.
7. Li, M.; Vitanyi, P. Texts in Computer Science. In An Introduction to Kolmogorov Complexity and Its Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1997.
8. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 2006.
9. Bennett, C. On the Nature and Origin of Complexity in Discrete, Homogeneous, Locally-Interacting Systems. Found. Phys. 1986, 16, 585–592.
10. Grassberger, P. Toward a Quantitative Theory of Self-Generated Complexity. Int. J. Theor. Phys. 1986, 25, 907–938.
11. Crutchfield, J.P.; Feldman, D.P. Regularities Unseen, Randomness Observed: The Entropy Convergence Hierarchy. Chaos 2003, 15, 25–54.
12. Crutchfield, J.P. Inferring Statistical Complexity. Phys. Rev. Lett. 1989, 63, 105–108.
13. Prichard, D.; Theiler, J.
Generalized Redundancies for Time Series Analysis. Physica D 1995, 84, 476–493.
14. Amari, S. Information Geometry on Hierarchy of Probability Distributions. IEEE Trans. Inf. Theory 2001, 47, 1701–1711.
15. Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A Unifying Framework for Complexity Measures of Finite Systems; Report 06-08-028; Santa Fe Institute: Santa Fe, NM, USA, 2006.
16. MacKay, R.S. Nonlinearity in Complexity Science. Nonlinearity 2008, 21, T273–T281.
17. Tononi, G.; Sporns, O.; Edelman, M. A Measure for Brain Complexity: Relating Functional Segregation and Integration in the Nervous System. Proc. Natl. Acad. Sci. USA 1994, 91, 5033.
18. Feldman, D.P.; Crutchfield, J.P. Measures of statistical complexity: Why? Phys. Lett. A 1998, 238, 244–252.
19. Nakahara, H.; Amari, S. Information-Geometric Measure for Neural Spikes. Neural Comput. 2002, 14, 2269–2316.
20. Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How Should Complexity Scale with System Size? Eur. Phys. J. B 2008, 63, 407–415.
21. Feldman, D.P.; Crutchfield, J.P. Measures of Statistical Complexity: Why? Phys. Lett. A 1998, 238, 244–252.
22. Lopez-Ruiz, R.; Mancini, H.; Calbet, X. A Statistical Measure of Complexity. Phys. Lett. A 1995, 209, 321–326.
23. Hodge, W.; Pedoe, D. Methods of Algebraic Geometry; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1994; Volume 1–3.
24. Demirel, G.; Vazquez, F.; Bohme, G.; Gross, T. Moment-closure Approximations for Discrete Adaptive Networks. Physica D 2014, 267, 68–80.
25. Fraser, G., Ed. The New Physics for the Twenty-First Century; Cambridge University Press: Cambridge, UK, 2006; p. 335.
26. Scott, J. Social Network Analysis: A Handbook; SAGE Publications Ltd.: London, UK, 2000.
27. Geier, F.; Timmer, J.; Fleck, C. Reconstructing Gene-Regulatory Networks from Time Series, Knock-Out Data, and Prior Knowledge. BMC Syst. Biol. 2007, 1, doi:10.1186/1752-0509-1-11.
28.
Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple Neural Spike Train Data Analysis: State-of-the-Art and Future Challenges. Nat. Neurosci. 2004, 7, 456–461.
29. Yee, T.W. The Analysis of Binary Data in Quantitative Plant Ecology. Ph.D. Thesis, The University of Auckland, New Zealand, 1993.
30. Stanford Large Network Dataset Collection. Available online: http://snap.stanford.edu/data/ (accessed on 19 July 2014).
31. BioGRID. Available online: http://thebiogrid.org/ (accessed on 19 July 2014).
32. Neuroscience Information Framework. Available online: http://www.neuinfo.org/ (accessed on 19 July 2014).
33. Global Biodiversity Information Facility. Available online: http://www.gbif.org/ (accessed on 19 July 2014).
34. UCI Network Data Repository. Available online: http://networkdata.ics.uci.edu/index.php (accessed on 19 July 2014).
35. Lewontin, R.C.; Cohen, D. On Population Growth in a Randomly Varying Environment. Proc. Natl. Acad. Sci. USA 1969, 62, 1056–1060.
36. Yoshimura, J.; Clark, C.W. Individual Adaptations in Stochastic Environments. Evol. Ecol. 1969, 5, 173–192.
37. Wu, B.; Zhou, D.; Wang, L. Evolutionary Dynamics on Stochastic Evolving Networks for Multiple-Strategy Games. Phys. Rev. E 2011, 84, 046111.
38. Fu, F.; Wang, L. Coevolutionary Dynamics of Opinions and Networks: From Diversity to Uniformity. Phys. Rev. E 2008, 78, 016104.
39. Gross, T.; D’Lima, C.J.D.; Blasius, B. Epidemic Dynamics on an Adaptive Network. Phys. Rev. Lett. 2006, 96, 208701.
40. Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. 1988, 52, 479–487.
41. Quintana-Murci, L.; Alcais, A.; Abel, L.; Casanova, J.L. Immunology in natura: Clinical, Epidemiological and Evolutionary Genetics of Infectious Diseases. Nat. Immunol. 2007, 8, 1165–1171.
42. Hardy, G.; Littlewood, J.; Polya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1967; Chapter 3.
43. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf.
Theory 2009, 55, 2882–2904.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

The Entropy-Based Quantum Metric

Roger Balian

Institut de Physique Théorique, CEA/Saclay, F-91191 Gif-sur-Yvette Cedex, France; E-Mail: roger@balian.fr

Received: 15 May 2014; in revised form: 25 June 2014 / Accepted: 11 July 2014 / Published: 15 July 2014

Abstract: The von Neumann entropy $S(\hat{D})$ generates in the space of quantum density matrices $\hat{D}$ the Riemannian metric $ds^2 = -d^2S(\hat{D})$, which is physically founded and which characterises the amount of quantum information lost by mixing $\hat{D}$ and $\hat{D} + d\hat{D}$. A rich geometric structure is thereby implemented in quantum mechanics. It includes a canonical mapping between the spaces of states and of observables, which involves the Legendre transform of $S(\hat{D})$. The Kubo scalar product is recovered within the space of observables. Applications are given to equilibrium and non-equilibrium quantum statistical mechanics. There the formalism is specialised to the relevant space of observables and to the associated reduced states issued from the maximum entropy criterion, which result from the exact states through an orthogonal projection. Von Neumann’s entropy specialises into a relevant entropy. Comparison is made with other metrics. The Riemannian properties of the metric $ds^2 = -d^2S(\hat{D})$ are derived. The curvature arises from the non-Abelian nature of quantum mechanics; its general expression and its explicit form for q-bits are given, as well as geodesics.

Keywords: quantum entropy; metric; q-bit; information; geometry; geodesics; relevant entropy

1.
A Physical Metric for Quantum States

Quantum physical quantities pertaining to a given system, termed “observables” $\hat{O}$, behave as non-commutative random variables and are elements of a C*-algebra. We will consider below systems for which these observables can be represented by $n$-dimensional Hermitean matrices in a finite-dimensional Hilbert space $\mathcal{H}$. In quantum (statistical) mechanics, the “state” of such a system encompasses the expectation values of all its observables [1]. It is represented by a density matrix $\hat{D}$, which plays the rôle of a probability distribution, and from which one can derive the expectation value of $\hat{O}$ in the form

$$\langle \hat{O} \rangle = \mathrm{Tr}\, \hat{D} \hat{O} = (\hat{D}; \hat{O}). \tag{1}$$

Density matrices should be Hermitean ($\langle \hat{O} \rangle$ is real for $\hat{O} = \hat{O}^\dagger$), normalised (the expectation value of the unit observable is $\mathrm{Tr}\, \hat{D} = 1$) and non-negative (variances $\langle \hat{O}^2 \rangle - \langle \hat{O} \rangle^2$ are non-negative). They depend on $n^2 - 1$ real parameters. If we keep aside the multiplicative structure of the set of operators and focus on their linear vector space structure, Equation (1) appears as a linear mapping of the space of observables onto real numbers. We can therefore regard the observables and the density operators $\hat{D}$ as elements of two dual vector spaces, and expectation values (1) appear as scalar products.

It is of interest to define a metric in the space of states. For instance, the distance between an exact state $\hat{D}$ and an approximation $\hat{D}_{\mathrm{app}}$ would then characterise the quality of this approximation. However, all physical quantities come out in the form (1), which lies astride the two dual spaces of observables and states. In order to build a metric having physical relevance, we need to rely on another meaningful quantity which pertains only to the space of states.

Entropy 2014, 16, 3878–3888; doi:10.3390/e16073878 www.mdpi.com/journal/entropy

We note at this point that quantum states are probabilistic objects that gather information about the considered system.
Then, the amount of missing information is measured by von Neumann’s entropy

$$S(\hat{D}) \equiv -\mathrm{Tr}\, \hat{D} \ln \hat{D}. \tag{2}$$

Introduced in the context of quantum measurements, this quantity is identified with the thermodynamic entropy when $\hat{D}$ is an equilibrium state. In non-equilibrium statistical mechanics, it encompasses, in the form of “relevant entropy” (see Section 5 below), various entropies defined through the maximum entropy criterion. It is also introduced in quantum computation. Alternative entropies have been introduced in the literature, but they do not present all the distinctive and natural features of von Neumann’s entropy, such as additivity and concavity.

As $S(\hat{D})$ is a concave function, and as it is the sole physically meaningful quantity apart from expectation values, it is natural to rely on it for our purpose. We thus define [2] the distance $ds$ between two neighbouring density matrices $\hat{D}$ and $\hat{D} + d\hat{D}$ as the square root of

$$ds^2 = -d^2S(\hat{D}) = \mathrm{Tr}\, d\hat{D}\, d\ln \hat{D}. \tag{3}$$

This Riemannian metric is of the Hessian form, since the metric tensor is generated by taking second derivatives of the function $S(\hat{D})$ with respect to the $n^2 - 1$ coordinates of $\hat{D}$. We may take for such coordinates the real and imaginary parts of the matrix elements, or equivalently (Section 6) some linear transform of these (keeping aside the norm $\mathrm{Tr}\, \hat{D} = 1$).

2. Interpretation in the Context of Quantum Information

The simplest example, related to quantum information theory, is that of a q-bit (two-level system or spin $\tfrac{1}{2}$), for which $n = 2$. Its states, represented by $2 \times 2$ Hermitean normalised density matrices $\hat{D}$, can conveniently be parameterised, on the basis of Pauli matrices, by the components $r_\mu = D_{12} + D_{21},\ i(D_{12} - D_{21}),\ D_{11} - D_{22}$ ($\mu = 1, 2, 3$) of a 3-dimensional vector $\mathbf{r}$ lying within the unit Poincaré–Bloch sphere ($r \leq 1$).
From the corresponding entropy

$$S = \frac{1 + r}{2} \ln \frac{2}{1 + r} + \frac{1 - r}{2} \ln \frac{2}{1 - r}, \tag{4}$$

we derive the metric

$$ds^2 = \frac{1}{1 - r^2} \left( \frac{\mathbf{r} \cdot d\mathbf{r}}{r} \right)^2 + \frac{1}{2r} \ln \frac{1 + r}{1 - r} \left| \frac{\mathbf{r} \times d\mathbf{r}}{r} \right|^2, \tag{5}$$

which is a natural Riemannian metric for q-bits, or more generally for positive $2 \times 2$ matrices. The metric tensor characterizing (5) diverges in the vicinity of pure states $r = 1$, due to the singularity of the entropy (2) for vanishing eigenvalues of $\hat{D}$. However, the distance between two arbitrary (even pure) states $\hat{D}'$ and $\hat{D}''$ measured along a geodesic is always finite. We shall see (Equation (29)) that for $n = 2$ the geodesic distance $s$ between two neighbouring pure states $\hat{D}'$ and $\hat{D}''$, represented by unit vectors $\mathbf{r}'$ and $\mathbf{r}''$ making a small angle $\delta\varphi \sim |\mathbf{r}' - \mathbf{r}''|$, behaves as $\delta s^2 \sim \delta\varphi^2 \ln(4\sqrt{\pi}/\delta\varphi)$. The singularity of the metric tensor manifests itself through this logarithmic factor.

Identifying von Neumann’s entropy with a measure of missing information, we can give a simple interpretation to the distance between two states. Indeed, the concavity of entropy expresses that some information is lost when two statistical ensembles described by different density operators merge. By mixing two equal-size populations described by the neighbouring distributions $\hat{D}' = \hat{D} + \tfrac{1}{2}\delta\hat{D}$ and $\hat{D}'' = \hat{D} - \tfrac{1}{2}\delta\hat{D}$ separated by a distance $\delta s$, we lose an amount of information given by

$$\Delta S \equiv S(\hat{D}) - \frac{S(\hat{D}') + S(\hat{D}'')}{2} \sim \frac{\delta s^2}{8}, \tag{6}$$

and thereby directly related to the distance $\delta s$ defined by (3). The proof of this equivalence relies on the expansion of the entropies $S(\hat{D}')$ and $S(\hat{D}'')$ around $\hat{D}$, and is valid when $\mathrm{Tr}\, \delta\hat{D}^2$ is negligible compared to the smallest eigenvalue of $\hat{D}$. If $\hat{D}'$ and $\hat{D}''$ are distant, the quantity $8\Delta S$ cannot be regarded as the square of a distance that would be generated by a local metric. The equivalence (6) for neighbouring states shows that $ds^2$ is the metric that is best suited to measure losses of information by mixing.
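The equivalence (6) can be checked numerically for a q-bit. The following is a minimal sketch, assuming only the closed forms (4) and (5); the function names and the sample Bloch vector and shift are illustrative choices, not part of the original text.

```python
import math

def entropy(r):
    """Entropy (4) of a q-bit state with Bloch radius 0 <= r <= 1."""
    s = 0.0
    for p in ((1.0 + r) / 2.0, (1.0 - r) / 2.0):
        if p > 0.0:
            s -= p * math.log(p)
    return s

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def ds2(rvec, drvec):
    """Squared line element (5) for a shift drvec at Bloch vector rvec (0 < |r| < 1)."""
    r = norm(rvec)
    radial = sum(x * dx for x, dx in zip(rvec, drvec)) / r    # (r . dr) / r
    transverse2 = sum(dx * dx for dx in drvec) - radial ** 2  # |r x dr|^2 / r^2
    return radial ** 2 / (1.0 - r * r) + math.atanh(r) / r * transverse2

def mixing_loss(rvec, drvec):
    """Delta S of Equation (6) for D' = D + dD/2 and D'' = D - dD/2."""
    plus = [x + dx / 2.0 for x, dx in zip(rvec, drvec)]
    minus = [x - dx / 2.0 for x, dx in zip(rvec, drvec)]
    return entropy(norm(rvec)) - 0.5 * (entropy(norm(plus)) + entropy(norm(minus)))

r0 = (0.3, 0.2, 0.1)         # illustrative mixed state
dr = (1e-3, -2e-3, 1.5e-3)   # illustrative small shift
print(8.0 * mixing_loss(r0, dr), ds2(r0, dr))
```

For a shift this small compared with the smallest eigenvalue of the state, the two printed numbers should agree closely; the agreement is expected to degrade as the shift grows or as the state approaches the surface of the sphere.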
The singularity of $\delta s^2$ at the edge of the positivity domain of $\hat{D}$ may suggest that the result (6) holds only within this domain. In fact, this equivalence remains nearly valid even in the limit of pure states, because $\Delta S$ itself involves a similar singularity. Indeed, if the states $\hat{D}' = |\psi'\rangle\langle\psi'|$ and $\hat{D}'' = |\psi''\rangle\langle\psi''|$ are pure and close to each other, the loss of information $\Delta S$ behaves as $8\Delta S \sim \delta\varphi^2 \ln(4/\delta\varphi)$, where $\delta\varphi^2 \sim 2\, \mathrm{Tr}\, \delta\hat{D}^2$. This result should be compared to various geodesic distances between pure quantum states, which behave as $\delta s^2 \sim \delta\varphi^2 \ln(4\sqrt{\pi}/\delta\varphi)$ for the present metric, and as $\delta s^2_{\mathrm{BH}} = 4\delta s^2_{\mathrm{FS}} \sim \delta\varphi^2 \sim \mathrm{Tr}(\hat{D}' - \hat{D}'')^2$ for the Bures–Helstrom and the quantum Fubini–Study metrics, respectively (see Section 7; these behaviours hold not only for $n = 2$ but for arbitrary $n$, since only the space spanned by $|\psi'\rangle$ and $|\psi''\rangle$ is involved). Thus, among these metrics, only $ds^2 = -d^2S$ can be interpreted in terms of information loss, whether the states $\hat{D}'$ and $\hat{D}''$ are pure or mixed.

At the other extreme, around the most disordered state $\hat{D} = \hat{I}/n$, in the region $\| n\hat{D} - \hat{I} \| \ll 1$, the metric becomes Euclidean, since $ds^2 = \mathrm{Tr}\, d\hat{D}\, d\ln\hat{D} \sim n\, \mathrm{Tr}(d\hat{D})^2$ (for $n = 2$, $ds^2 = d\mathbf{r}^2$). For a given shift $d\hat{D}$, the qualitative change of a state $\hat{D}$, as measured by the distance $ds$, gets larger and larger as the state $\hat{D}$ becomes purer and purer, that is, when the information content of $\hat{D}$ increases.

3. Geometry of Quantum Statistical Mechanics

A rich geometric structure is generated for both states and observables by von Neumann’s entropy through the introduction of the metric $ds^2 = -d^2S$. This metric (3) supplements the algebraic structure of the set of observables and the above duality between the vector spaces of states and of observables, with scalar product (1). Accordingly, we can naturally define, within the space of states, scalar products, geodesics, angles and curvatures.
We can also regard the coordinates of $d\hat{D}$ and $d\ln\hat{D}$ as covariant and contravariant components of the same infinitesimal vector (Section 6). To this aim, let us introduce the mapping

$$\hat{D} \equiv \frac{e^{\hat{X}}}{\mathrm{Tr}\, e^{\hat{X}}} \tag{7}$$

between $\hat{D}$ in the space of states and $\hat{X}$ in the space of observables. The operator $\hat{X}$ appears as a parameterisation of $\hat{D}$. (The normalisation of $\hat{D}$ entails that $\hat{X}$, defined within an arbitrary additive constant operator $X_0\hat{I}$, also depends on $n^2 - 1$ independent real parameters.) The metric (3) can then be re-expressed in terms of $\hat{X}$ in the form

$$ds^2 = \mathrm{Tr}\, d\hat{D}\, d\hat{X} = \mathrm{Tr} \int_0^1 d\xi\, \hat{D} e^{-\xi\hat{X}} d\hat{X} e^{\xi\hat{X}} d\hat{X} - (\mathrm{Tr}\, \hat{D}\, d\hat{X})^2 = d^2 \ln \mathrm{Tr}\, e^{\hat{X}} = d^2F, \tag{8}$$

where we introduced the function

$$F(\hat{X}) \equiv \ln \mathrm{Tr}\, e^{\hat{X}} \tag{9}$$

of the observable $\hat{X}$. (The addition of $X_0\hat{I}$ to $\hat{X}$ results in the addition of the irrelevant constant $X_0$ to $F$.) This mapping provides us with a natural metric in the space of observables, from which we recover the scalar product between $d\hat{X}_1$ and $d\hat{X}_2$ in the form of a Kubo correlation in the state $\hat{D}$. The metric (8) has been quoted in the literature under the names of Bogoliubov–Kubo–Mori.

4. Covariance and Legendre Transformation

We can recover the above geometric mapping (7) between $\hat{D}$ and $\hat{X}$, or between the covariant and contravariant coordinates of $d\hat{D}$, as the outcome of a Legendre transformation, by considering the function $F(\hat{X})$. Taking its differential $dF = \mathrm{Tr}\, e^{\hat{X}} d\hat{X} / \mathrm{Tr}\, e^{\hat{X}}$, we identify the partial derivatives of $F(\hat{X})$ with the coordinates of the state $\hat{D} = e^{\hat{X}} / \mathrm{Tr}\, e^{\hat{X}}$, so that $\hat{D}$ appears as conjugate to $\hat{X}$ in the sense of Legendre transformations. Expressing then $\hat{X}$ as a function of $\hat{D}$ and inserting it into $F - \mathrm{Tr}\, \hat{D}\hat{X}$, we recognise that the Legendre transform of $F(\hat{X})$ is von Neumann’s entropy, $F - \mathrm{Tr}\, \hat{D}\hat{X} = S(\hat{D}) = -\mathrm{Tr}\, \hat{D} \ln \hat{D}$. The conjugation between $\hat{D}$ and $\hat{X}$ is embedded in the equations

$$dF = \mathrm{Tr}\, \hat{D}\, d\hat{X}; \qquad dS = -\mathrm{Tr}\, \hat{X}\, d\hat{D}. \tag{10}$$

Legendre transformations are currently used in equilibrium thermodynamics.
Let us show that they come out in this context directly as a special case of the present general formalism. The entropy of thermodynamics is a function of the extensive variables: energy, volume, particle numbers, etc. Let us focus for illustration on the energy $U$, keeping the other extensive variables fixed. The thermodynamic entropy $S(U)$, a function of the single variable $U$, generates the inverse temperature as $\beta = \partial S/\partial U$. Its Legendre transform is the Massieu potential $F(\beta) = S - \beta U$. In order to compare these properties with the present formalism, we recall how thermodynamics comes out in the framework of statistical mechanics. The thermodynamic entropy $S(U)$ is identified with the von Neumann entropy (2) of the Boltzmann–Gibbs canonical equilibrium state $\hat{D}$, and the internal energy with $U = \mathrm{Tr}\, \hat{D}\hat{H}$. In the relation (7), the operator $\hat{X}$ reads $\hat{X} = -\beta\hat{H}$ (within an irrelevant additive constant). By letting $U$ or $\beta$ vary, we select within the spaces of states and of observables a one-dimensional subset. In these restricted subsets, $\hat{D}$ is parameterised by the single coordinate $U$, and the corresponding $\hat{X}$ by the coordinate $-\beta$. By specialising the general relations (10) to these subsets, we recover the thermodynamic relations $dF = -U\, d\beta$ and $dS = \beta\, dU$. We also recover, by restricting the metric (3) or (8) to these subsets, the current thermodynamic metric $ds^2 = -(\partial^2 S/\partial U^2)\, dU^2 = -dU\, d\beta$.

More generally, we can consider the Boltzmann–Gibbs states of equilibrium statistical mechanics as the points of a manifold embedded in the full space of states. The thermodynamic extensive variables, which parameterise these states, are the expectation values of the conserved macroscopic observables, that is, they are a subset of the expectation values (1) which parameterise arbitrary density operators. Then the standard geometric structure of thermodynamics simply results from the restriction of the general metric (3) to this manifold of Boltzmann–Gibbs states.
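The thermodynamic relations quoted above can be verified on the simplest possible case. The sketch below assumes a hypothetical two-level Hamiltonian with energies 0 and `eps` (an illustrative choice not taken from the text) and checks $F = S - \beta U$, $dF = -U\, d\beta$ and $ds^2 = -dU\, d\beta$ by finite differences.

```python
import math

def massieu(beta, eps=1.0):
    """F(beta) = ln Tr exp(-beta H) for H with eigenvalues 0 and eps."""
    return math.log(1.0 + math.exp(-beta * eps))

def energy(beta, eps=1.0):
    """U = Tr D H in the canonical state D = exp(-beta H) / Tr exp(-beta H)."""
    w = math.exp(-beta * eps)
    return eps * w / (1.0 + w)

def entropy(beta, eps=1.0):
    """Von Neumann entropy (2) of the canonical state."""
    w = math.exp(-beta * eps)
    probs = (1.0 / (1.0 + w), w / (1.0 + w))
    return -sum(p * math.log(p) for p in probs)

beta, h = 0.7, 1e-4   # illustrative inverse temperature and finite-difference step

# Legendre relation F = S - beta U
legendre_gap = massieu(beta) - (entropy(beta) - beta * energy(beta))

# dF/dbeta should equal -U
dF = (massieu(beta + h) - massieu(beta - h)) / (2.0 * h)

# Metric coefficient: d2F/dbeta2 should equal -dU/dbeta, so that ds^2 = -dU dbeta
d2F = (massieu(beta + h) - 2.0 * massieu(beta) + massieu(beta - h)) / h ** 2
dU = (energy(beta + h) - energy(beta - h)) / (2.0 * h)

print(legendre_gap, dF + energy(beta), d2F + dU)
```

All three printed quantities should be close to zero, up to finite-difference error. The metric coefficient `d2F` is the variance of the energy in the canonical state, which is why it is positive.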
The commutation of the conserved observables simplifies the reduced thermodynamic metric, which presents the same features as a Fisher metric (see Section 6).

5. Relevant Entropy and Geometry of the Projection Method

The above ideas also extend to non-equilibrium quantum statistical mechanics [2–4]. When introducing the metric (3), we indicated that it may be used to estimate the quality of an approximation. Let us illustrate this point with the Nakajima–Zwanzig–Mori–Robertson projection method, best introduced through maximum entropy. Consider some set $\{\hat{A}_k\}$ of “relevant observables”, whose time-dependent expectation values $a_k \equiv \langle \hat{A}_k \rangle = \mathrm{Tr}\, \hat{D}\hat{A}_k$ we wish to follow, discarding all other variables. The exact state $\hat{D}$ encodes the variables $\{a_k\}$ that we are interested in, but also the expectation values (1) of the other observables that we wish to eliminate. This elimination is performed by associating at each time with $\hat{D}$ a “reduced state” $\hat{D}_R$ which is equivalent to $\hat{D}$ as regards the set $a_k = \mathrm{Tr}\, \hat{D}_R \hat{A}_k$, but which provides no more information than the values $\{a_k\}$. The former condition provides the constraints $\langle \hat{A}_k \rangle = a_k$, and the latter condition is implemented by means of the maximum entropy criterion: one expresses that, within the set of density matrices compatible with these constraints, $\hat{D}_R$ is the one which maximises von Neumann’s entropy (2), that is, which contains solely the information about the relevant variables $a_k$. The least biased state $\hat{D}_R$ thus defined has the form $\hat{D}_R = e^{\hat{X}_R} / \mathrm{Tr}\, e^{\hat{X}_R}$, where $\hat{X}_R \equiv \sum_k \lambda_k \hat{A}_k$ involves the time-dependent Lagrange multipliers $\lambda_k$, which are related to the set $a_k$ through $\mathrm{Tr}\, \hat{D}_R \hat{A}_k = a_k$.

The von Neumann entropy $S(\hat{D}_R) \equiv S_R\{a_k\}$ of this reduced state $\hat{D}_R$ is called the “relevant entropy” associated with the considered relevant observables $\hat{A}_k$. It measures the amount of missing information, when only the values $\{a_k\}$ of the relevant variables are given.
During its evolution, $\hat{D}$ keeps track of the initial information about all the variables $\langle \hat{O} \rangle$ and its entropy $S(\hat{D})$ remains constant in time. It is therefore smaller than the relevant entropy $S(\hat{D}_R)$, which accounts for the loss of information about the irrelevant variables. Depending on the choice of relevant observables $\{\hat{A}_k\}$, the corresponding relevant entropies $S_R\{a_k\}$ encompass various current entropies, such as the non-equilibrium thermodynamic entropy or Boltzmann’s H-entropy.

The same structure as the one introduced above for the full spaces of observables and states is recovered in this context. Here, for arbitrary values of the parameters $\lambda_k$, the exponents $\hat{X}_R = \sum_k \lambda_k \hat{A}_k$ constitute a subspace of the full vector space of observables, and the parameters $\{\lambda_k\}$ appear as the coordinates of $\hat{X}_R$ on the basis $\{\hat{A}_k\}$. The corresponding states $\hat{D}_R$, parameterised by the set $\{a_k\}$, constitute a subset of the space of states, the manifold $R$ of “reduced states”. (Note that this manifold is not a hyperplane, contrary to the space of relevant observables; it is embedded in the full vector space of states, but does not constitute a subspace.) By regarding $S_R\{a_k\}$ as a function of the coordinates $\{a_k\}$, we can define a metric $ds^2 = -d^2S_R\{a_k\}$ on the manifold $R$, which is the restriction of the metric (3). Its alternative expression $ds^2 = \sum_k da_k\, d\lambda_k = d^2F_R\{\lambda_k\}$, where $F_R\{\lambda_k\} \equiv \ln \mathrm{Tr} \exp \sum_k \lambda_k \hat{A}_k$, is a restriction of (8). The correspondence between the two parameterisations $\{a_k\}$ and $\{\lambda_k\}$ is again implemented by the Legendre transformation which relates $S_R\{a_k\}$ and $F_R\{\lambda_k\}$.

The projection method relies on the mapping $\hat{D} \mapsto \hat{D}_R$ which associates $\hat{D}_R$ with $\hat{D}$. It consists in replacing the Liouville–von Neumann equation of motion for $\hat{D}$ by the corresponding dynamical equation for $\hat{D}_R$ on the manifold $R$, or equivalently for the coordinates $\{a_k\}$ or for the coordinates $\{\lambda_k\}$, a programme that is in practice achieved through some approximations.
This mapping is obviously a projection in the sense that $\hat{D} \mapsto \hat{D}_R \mapsto \hat{D}_R$, but moreover the introduction of the metric (3) shows that the vector $\hat{D} - \hat{D}_R$ in the space of states is perpendicular to the manifold $R$ at the point $\hat{D}_R$. This property is readily shown by writing, in this metric, the scalar product $\mathrm{Tr}\, d\hat{D}\, d\hat{X}'$ of the vector $d\hat{D} = \hat{D} - \hat{D}_R$ by an arbitrary vector $d\hat{D}'$ in the tangent plane of $R$. The latter is conjugate to any combination $d\hat{X}'$ of observables $\hat{A}_k$, and this scalar product vanishes because $\mathrm{Tr}\, \hat{D}\hat{A}_k = \mathrm{Tr}\, \hat{D}_R\hat{A}_k$. Thus the mapping $\hat{D} \mapsto \hat{D}_R$ appears as an orthogonal projection, so that the relevant state $\hat{D}_R$ associated with $\hat{D}$ may be regarded as its best possible approximation on the manifold $R$.

6. Properties of the Metric

The metric tensor can be evaluated explicitly in a basis where the matrix $\hat{D}$ is diagonal. Denoting by $D_i$ its eigenvalues and by $dD_{ij}$ the matrix elements of its variations, we obtain from (3)

$$ds^2 = \mathrm{Tr} \int_0^\infty d\xi \left( \frac{d\hat{D}}{\hat{D} + \xi} \right)^2 = \sum_{ij} \frac{\ln D_i - \ln D_j}{D_i - D_j}\, dD_{ij}\, dD_{ji}. \tag{11}$$

(For $D_i = D_j$, whether or not $i = j$, the ratio is defined as $1/D_i$ by continuity.) In the same basis, the form (8) of the metric reads

$$ds^2 = \frac{1}{Z} \sum_{ij} \frac{e^{X_i} - e^{X_j}}{X_i - X_j}\, dX_{ij}\, dX_{ji} - \left( \frac{\sum_i e^{X_i}\, dX_{ii}}{Z} \right)^2, \tag{12}$$

with $Z = \sum_i e^{X_i}$. (For $X_i = X_j$, the ratio is $e^{X_i}$.) The singularity of the metric (11) in the vicinity of vanishing eigenvalues of $\hat{D}$, in particular near pure states (end of Section 2), is not apparent in the representation (12) of this metric, because the mapping from $\hat{D}$ to $\hat{X}$ sends the eigenvalue $X_i$ to $-\infty$ when $D_i$ tends to zero.

Let us compare the expression (11) with the corresponding classical metric, which is obtained by starting from Shannon’s entropy instead of von Neumann’s entropy. For discrete probabilities $p_i$, we then have $S\{p_i\} = -\sum_i p_i \ln p_i$, and hence the same definition $ds^2 = -d^2S\{p_i\}$ as above of an entropy-based metric yields $ds^2 = \sum_i dp_i^2 / p_i$, which is identified with the Fisher information metric.
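Equation (11) lends itself to a direct numerical check. The sketch below is an illustration under stated assumptions: the helper names `weight` and `ds2_eigenbasis` are hypothetical, the variation `dD` is supplied as a nested list of complex matrix elements in the eigenbasis of the state, and the q-bit test values are arbitrary. It verifies the classical limit $\sum_i dp_i^2/p_i$ for diagonal variations and the agreement with the closed form (5) for a q-bit whose Bloch vector points along the third axis.

```python
import math

def weight(di, dj, tol=1e-12):
    """(ln Di - ln Dj) / (Di - Dj), continued as 1/Di when Di equals Dj."""
    if abs(di - dj) <= tol * (di + dj):
        return 2.0 / (di + dj)
    return (math.log(di) - math.log(dj)) / (di - dj)

def ds2_eigenbasis(eigs, dD):
    """Equation (11): eigs are the eigenvalues D_i > 0 and dD is the Hermitian
    variation, written in the eigenbasis of D as a nested list of complex numbers."""
    n = len(eigs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += weight(eigs[i], eigs[j]) * (dD[i][j] * dD[j][i]).real
    return total

# Classical limit: diagonal variations reproduce sum_i dp_i^2 / p_i.
p = [0.5, 0.3, 0.2]
dp = [1e-2, -4e-3, -6e-3]
diag = [[dp[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
classical = sum(x * x / q for x, q in zip(dp, p))

# q-bit with Bloch vector (0, 0, r) and shift (dx, dy, dz), compared with (5).
r, dx, dy, dz = 0.6, 1e-3, 2e-3, -1e-3
eigs = [(1.0 + r) / 2.0, (1.0 - r) / 2.0]
dD = [[dz / 2.0, complex(dx, -dy) / 2.0],
      [complex(dx, dy) / 2.0, -dz / 2.0]]
closed_form = dz ** 2 / (1.0 - r * r) + math.atanh(r) / r * (dx ** 2 + dy ** 2)

print(ds2_eigenbasis(p, diag), classical)
print(ds2_eigenbasis(eigs, dD), closed_form)
```

The off-diagonal elements of `dD` follow the Pauli parameterisation of Section 2, $D_{12} = \tfrac{1}{2}(r_1 - i r_2)$, so the transverse part of the shift enters only through the logarithmic ratio, as in the second term of (5).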
The present metric thus appears as the extension to quantum statistical mechanics of the Fisher metric when the latter is interpreted in terms of entropy. In fact, the terms of (11) which involve the diagonal elements $i = j$ of the variations $d\hat{D}$ reduce to $dD_{ii}^2 / D_i$. This result was expected, since density matrices behave as probability distributions if both $\hat{D}$ and $d\hat{D}$ are diagonal.

Let us more generally consider in (11), instead of solely diagonal variations $dD_{ii}$, variations $dD_{ij}$ with indices $i$ and $j$ such that $|D_i - D_j| \ll D_i + D_j$. The expansion of $D_i$ and $D_j$ around $\tfrac{1}{2}(D_i + D_j)$ in the corresponding ratios of (11) yields $(\ln D_i - \ln D_j)/(D_i - D_j) \sim 2/(D_i + D_j)$. The considered terms of (11) are therefore the same as in the Bures–Helstrom metric

$$ds^2_{\mathrm{BH}} = \sum_{ij} \frac{2}{D_i + D_j}\, dD_{ij}\, dD_{ji}, \tag{13}$$

introduced long ago as an extension to matrices of the Fisher metric [5]. We thus recover this Bures–Helstrom metric as an approximation of the present entropy-based metric $ds^2 = -d^2S(\hat{D})$. For $n = 2$, $ds^2_{\mathrm{BH}}$ is obtained from the expression (5) of $ds^2$ by omitting the factor $\tanh^{-1} r / r$ entering the second term.

In order to express the properties of the Riemannian metric (3) in a general form, which will exhibit the tensor structure, we use a Liouville representation. There, the observables $\hat{O} = O_\mu \hat{\Omega}^\mu$, regarded as elements of a vector space, are represented by their coordinates $O_\mu$ on a complete basis $\hat{\Omega}^\mu$ of $n^2$ observables. The space of states is spanned by the dual basis $\hat{\Sigma}_\mu$, such that $\mathrm{Tr}\, \hat{\Omega}^\nu \hat{\Sigma}_\mu = \delta^\nu_\mu$, and the states $\hat{D} = D^\mu \hat{\Sigma}_\mu$ are represented by their coordinates $D^\mu$. Thus, the expectation value (1) is the scalar product $D^\mu O_\mu$. In the matrix representation, which appears as a special case, $\mu$ denotes a pair of indices $i, j$, $\hat{\Omega}^\mu$ stands for $|j\rangle\langle i|$, $\hat{\Sigma}_\mu$ for $|i\rangle\langle j|$, $O_\mu$ denotes the matrix element $O_{ji}$ and $D^\mu$ the element $D_{ij}$.
For the q-bit ($n = 2$) considered in Section 2, we have chosen the Pauli operators $\hat\sigma^\mu$ as basis $\hat\Omega^\mu$ for observables, and $\tfrac{1}{2}\hat\sigma_\mu$ as dual basis $\hat\Sigma_\mu$ for states, so that the coordinates $D^\mu = \mathrm{Tr}\, \hat D \hat\Omega^\mu$ of $\hat D = \tfrac{1}{2}(\hat I + r^\mu \hat\sigma_\mu)$ are the components $r^\mu$ of the vector $\mathbf{r}$. (The unit operator $\hat I$ is kept aside since $\hat D$ is normalised and since constants added to $\hat X$ are irrelevant.)

The function $F\{X\} = \ln \mathrm{Tr}\, e^{\hat X}$ of the coordinates $X_\mu$ of the observable $\hat X$, and the von Neumann entropy $S\{D\}$ as function of the coordinates $D^\mu$ of the state $\hat D$, are related by the Legendre transformation $F = S + D^\mu X_\mu$, and the relations (10) are expressed by $D^\mu = \partial F / \partial X_\mu$, $X_\mu = -\partial S / \partial D^\mu$. The metric tensor is given by

$$g^{\mu\nu} = \frac{\partial^2 F}{\partial X_\mu\, \partial X_\nu} , \qquad g_{\mu\nu} = -\frac{\partial^2 S}{\partial D^\mu\, \partial D^\nu} , \qquad (14)$$

and the correspondence issued from (7) between covariant and contravariant infinitesimal variations of $\hat X$ and $\hat D$ is implemented as $dD^\mu = g^{\mu\nu} dX_\nu$, $dX_\mu = g_{\mu\nu} dD^\nu$. These expressions exhibit the Hessian nature of the metric. This property simplifies the expression of the Christoffel symbol, which reduces to

$$\Gamma_{\mu\nu\rho} = -\frac{1}{2} \frac{\partial^3 S}{\partial D^\mu\, \partial D^\nu\, \partial D^\rho} , \qquad (15)$$

and which provides a parametric representation $\hat D(t)$ of the geodesics in the space of states through

$$\frac{d^2 D^\mu}{dt^2} + g^{\mu\sigma}\, \Gamma_{\sigma\nu\rho}\, \frac{dD^\nu}{dt} \frac{dD^\rho}{dt} = 0 . \qquad (16)$$

Then, the Riemann curvature tensor comes out as

$$R_{\mu\rho\,\nu\sigma} = g^{\xi\zeta} \left( \Gamma_{\mu\sigma\xi}\, \Gamma_{\nu\rho\zeta} - \Gamma_{\mu\nu\xi}\, \Gamma_{\rho\sigma\zeta} \right) , \qquad (17)$$

the Ricci tensor and the scalar curvature as

$$R_{\mu\nu} = g^{\rho\sigma} R_{\mu\rho\,\nu\sigma} , \qquad R = g^{\mu\nu} R_{\mu\nu} . \qquad (18)$$

We have noted that the classical equivalent of the entropy-based metric $ds^2 = -d^2S$ is the Fisher metric $\sum_i dp_i^2 / p_i$, which as regards the curvature is equivalent to a Euclidean metric. While the space of classical probabilities is thus flat, the above equations show that the space of quantum states is curved. This curvature arises from the non-commutation of the observables; it vanishes for the completely disordered state $\hat D = \hat I / n$. Curvature can thus be used as a measure of the degree of classicality of a state.

7.
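Geometry of the Space of q-Bits

In the illustrative example of a q-bit, the operator $\hat X = \chi^\mu \hat\sigma_\mu$ associated with $\hat D$ is parameterised by the 3 components of the vector $\chi^\mu$ ($\mu = 1, 2, 3$), related to $\mathbf{r}$ by $\chi = \tanh^{-1} r$ and $\chi^\mu / \chi = r^\mu / r$. The metric tensor given by (5) is expressed as

$$g_{\mu\nu} = K r_\mu r_\nu + \frac{\chi}{r}\, \delta_{\mu\nu} , \qquad K \equiv \frac{1}{r} \frac{d}{dr} \frac{\chi}{r} = \frac{1}{r^2} \left( \frac{1}{1 - r^2} - \frac{\chi}{r} \right) , \qquad g^{\mu\nu} = (1 - r^2)\, p^{\mu\nu} + \frac{r}{\chi}\, q^{\mu\nu} . \qquad (19)$$

(We have defined $r_\mu = r^\mu$, $\delta_{\mu\nu} = \delta^\mu_\nu = \delta^{\mu\nu}$ so as to introduce the projectors $r^\mu r^\nu / r^2 \equiv p^{\mu\nu} \equiv \delta^{\mu\nu} - q^{\mu\nu}$ in the Euclidean 3-dimensional space, and thus to simplify the subsequent calculations.) In polar coordinates $\mathbf{r} = (r, \theta, \varphi)$, the infinitesimal distance takes the form

$$ds^2 = dr\, d\chi + r\chi \left( d\theta^2 + \sin^2\theta\, d\varphi^2 \right) . \qquad (20)$$

We determine from (15) and (19) the explicit form

$$\Gamma_{\mu\nu\rho} = \frac{K}{2} \left( r_\mu \delta_{\nu\rho} + r_\nu \delta_{\mu\rho} + r_\rho \delta_{\mu\nu} \right) + \frac{1}{2r} \frac{dK}{dr}\, r_\mu r_\nu r_\rho \qquad (21)$$

of the Christoffel symbol. By raising its first index with $g^{\mu\nu}$ and using polar coordinates, we obtain from (16) the equations of geodesics for $n = 2$. Within the Poincaré–Bloch sphere the geodesics are deduced by rotations from a one-parameter family of curves which lie in the $\theta = \tfrac{1}{2}\pi$, $|\varphi| \le \tfrac{1}{2}\pi$ half-plane and which are symmetric with respect to the $\varphi = 0$ axis. This family is characterized by the equations (where $\chi = \tanh^{-1} r$):

$$\frac{d^2 r}{dt^2} + \frac{r}{1 - r^2} \left( \frac{dr}{dt} \right)^2 - \frac{r}{2} \left[ 1 + \frac{\chi}{r} \left( 1 - r^2 \right) \right] \left( \frac{d\varphi}{dt} \right)^2 = 0 , \qquad (22)$$

$$\frac{d^2 \varphi}{dt^2} + \frac{1}{r} \frac{dr}{dt} \frac{d\varphi}{dt} + \frac{1}{\chi} \frac{d\chi}{dt} \frac{d\varphi}{dt} = 0 , \qquad (23)$$

and the boundary conditions at $t = 0$:

$$r(0) = a , \qquad \varphi(0) = 0 , \qquad \frac{dr(0)}{dt} = 0 , \qquad \frac{d\varphi(0)}{dt} = \frac{1}{k} , \qquad k^2 = a \tanh^{-1} a . \qquad (24)$$

Equation (23) provides, using the boundary conditions (24):

$$\frac{d\varphi}{dt} = \frac{k}{r\chi} . \qquad (25)$$

Insertion of (25) into (22) gives rise to an equation for $r(t)$, which can be integrated by regarding $t$ as a function of $\zeta = \arcsin r$. One obtains:

$$\left( \frac{dr}{dt} \right)^2 = \left( 1 - r^2 \right) \left( 1 - \frac{k^2}{r\chi} \right) .$$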
(26)

The scale of $t$ has been fixed by relating to $r(0)$ the boundary condition (24) for $d\varphi(0)/dt$, a choice which ensures that $ds^2 = dr\, d\chi + r\chi\, d\varphi^2 = dt^2$, and hence that the parameter $t$ measures the distance along geodesics.

For $k = 0$, we obtain $r = |\sin t|$, $\varphi = \pm\pi/2$. Thus, the longest geodesics are the diameters of the Poincaré–Bloch sphere. We find the value $\pi$ for their "length", that is, for the geodesic distance between two orthogonal pure states.

At the other extreme, when the middle point $r = a$, $\varphi = 0$ of a geodesic lies close to the surface $r = 1$ of the sphere, the asymptotic form of Equation (26) is solved as

$$t = \pm 2k \sqrt{\pi}\, e^{-k^2}\, \mathrm{erf}\, \xi , \qquad \xi = \sqrt{\frac{1}{2} \ln \frac{1 - a}{1 - r}} , \qquad k^2 = \frac{1}{2} \ln \frac{2}{1 - a} \qquad (27)$$

(by taking $\xi$ as variable instead of $r$). The determination of the explicit equations of such short geodesic curves is achieved by integrating (25) into

$$\varphi = \frac{t}{k} = \pm 2 \sqrt{\pi}\, e^{-k^2}\, \mathrm{erf}\, \xi . \qquad (28)$$

From (27) and (28) we can determine the geodesic distance between two neighbouring pure states $\hat D' = |\psi'\rangle\langle\psi'|$ and $\hat D'' = |\psi''\rangle\langle\psi''|$ represented by the points $r_{\max} = 1$, $\varphi_{\max} = \pm\tfrac{1}{2}\delta\varphi$ with $\delta\varphi$ small. At these two points, we have $\xi \to \infty$, $\mathrm{erf}\, \xi = 1$, and this determines $k$ in terms of $\tfrac{1}{2}\delta\varphi$ through (28). The length of the geodesic that joins them, given by (27), is:

$$\delta s^2 = \delta\varphi^2 \ln \frac{4\sqrt{\pi}}{\delta\varphi} , \qquad \delta\varphi = \arccos \left| \langle \psi' | \psi'' \rangle \right| . \qquad (29)$$

Thus, in spite of its singularity for $r = 1$, the present 3-dimensional metric (5) in the space $r, \theta, \varphi$ defines distances between pure states represented by points on the surface $r = 1$ of the Poincaré–Bloch sphere. However, it should be noted that the presence of the logarithmic factor in (29) forbids such distances to be generated by a 2-dimensional metric in the space $\theta, \varphi$. In fact, the distance (29) is measured along a geodesic that penetrates the sphere $r = 1$, because no geodesic is tangent to the surface of this sphere nor lies on its surface.

In contrast, all geodesics produced by the Bures–Helstrom metric are tangent to the surface of the sphere, or are its great circles.
They are given by Equations (25) and (26), where $\chi$ is replaced by $r$ and $k$ by $a$; the solution of these equations provides the ellipses

$$r \cos\varphi = a \cos t , \qquad r \sin\varphi = \sin t . \qquad (30)$$

Here as above, the largest distance $\pi$ is reached for orthogonal pure states represented by opposite points on the sphere, but now a peculiarity occurs. Whereas the metric $ds^2 = -d^2S$ produces a single geodesic, the diameter joining these two points (with "length" $\pi$), the Bures metric produces a double infinity of geodesics, the half-ellipses (30) having as long axis this diameter, and having all the same "length" $\pi$. Other pairs of pure states are joined by geodesics which are arcs of great circles, and their Bures distance $\delta s_{BH} = \delta\varphi$ is identified with the ordinary length of the arc. Here for $n = 2$ as in the general case, the 3-dimensional Bures–Helstrom metric admits a restriction to pure states generated by a 2-dimensional metric, which is identified with the quantum Fubini–Study metric, itself defined only for pure states by $s_{FS} = \arccos |\langle \psi' | \psi'' \rangle| = \tfrac{1}{2} s_{BH}$.

Returning to the metric $ds^2 = -d^2S$, the Riemann curvature is obtained from (17) as

$$R^{\mu}{}_{\rho\,\nu\sigma} = \frac{K}{4} \left[ \left( r^2 + \frac{r}{\chi} - 1 \right) \left( q^\mu_\sigma q_{\nu\rho} - q^\mu_\nu q_{\rho\sigma} \right) + \left( r^2 - \frac{r}{\chi} + 1 \right) \left( p^\mu_\sigma q_{\nu\rho} - p^\mu_\nu q_{\rho\sigma} \right) + \frac{r}{\chi} \frac{1}{1 - r^2} \left( r^2 - \frac{r}{\chi} + 1 \right) \left( q^\mu_\sigma p_{\nu\rho} - q^\mu_\nu p_{\rho\sigma} \right) \right] . \qquad (31)$$

Contracting with $g^{\rho\sigma}$ the indices of (31) as in (18), we finally derive the Ricci curvature

$$R^\mu{}_\nu = -\frac{K r}{2\chi} \left( r^2 \delta^\mu_\nu + \frac{\chi - r}{\chi}\, p^\mu_\nu \right) , \qquad (32)$$

and the scalar curvature

$$R = -\frac{K r}{2\chi} \left( 3 r^2 + \frac{\chi - r}{\chi} \right) . \qquad (33)$$

Both are negative in the whole Poincaré sphere. In the limit $r \to 0$, the curvature $R$ vanishes as $R \sim -\tfrac{10}{9} r^2$, as expected from the general argument of Section 2: a weakly polarised spin behaves classically. At the other extreme $r \to 1$, $R$ behaves as $R \sim -2\, [(1 - r)\, |\ln(1 - r)|]^{-1}$; it diverges, again as expected: pure states have the largest quantum nature.
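The limiting behaviour of the scalar curvature quoted above can be explored numerically from Equation (33), with $K$ taken from Equation (19) and $\chi = \tanh^{-1} r$. The snippet below is a minimal sketch under those definitions; the function names and the sample radii are illustrative assumptions.

```python
import math


def chi(r):
    """chi = arctanh(r) for a polarisation 0 < r < 1."""
    return math.atanh(r)


def k_factor(r):
    """K = (1/r^2) * (1/(1 - r^2) - chi/r), as in Equation (19)."""
    return (1.0 / (1.0 - r * r) - chi(r) / r) / (r * r)


def scalar_curvature(r):
    """R = -(K r / (2 chi)) * (3 r^2 + (chi - r)/chi), as in Equation (33)."""
    c = chi(r)
    return -k_factor(r) * r / (2.0 * c) * (3.0 * r * r + (c - r) / c)


# Weakly polarised state: compare with the small-r form -(10/9) r^2.
r_small = 0.01
ratio_small = scalar_curvature(r_small) / (-(10.0 / 9.0) * r_small ** 2)
print("R / (-(10/9) r^2) at r = 0.01:", ratio_small)

# The curvature stays negative and grows in magnitude towards the pure states.
for r in (0.1, 0.5, 0.9, 0.99, 0.999):
    print(r, scalar_curvature(r))
```

The ratio printed for the small radius is close to one, and the magnitude of $R$ increases monotonically over the sampled radii, in line with the divergence near $r = 1$.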
The metric $ds^2 = -d^2S$, introduced above in the context of quantum mechanics for mixed states (and their pure limit) and information theory, might more generally be useful to characterise distances in spaces of positive matrices.

Conflicts of Interest: The author declares no conflict of interest.

References

1. Thirring, W. Quantum Mechanics of Large Systems. In A Course of Mathematical Physics; Volume 4; Springer-Verlag: New York, NY, USA, 1983.
2. Balian, R.; Alhassid, Y.; Reinhardt, H. Dissipation in many-body systems: A geometric approach based on information theory. Phys. Rep. 1986, 131, 1–146.
3. Balian, R. Incomplete descriptions and relevant entropies. Am. J. Phys. 1999, 67, 1078–1090.
4. Balian, R. Information in statistical physics. Stud. Hist. Philos. Mod. Phys. 2005, 36, 323–353.
5. Bures, D. An extension of Kakutani's theorem. Trans. Am. Math. Soc. 1969, 135, 199–212.

© 2014 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article

Extending the Extreme Physical Information to Universal Cognitive Models via a Confident Information First Principle

Xiaozhao Zhao 1, Yuexian Hou 1,2,*, Dawei Song 1,3 and Wenjie Li 2

1 School of Computer Science and Technology, Tianjin University, Tianjin 300072, China; E-Mails: 0.25eye@gmail.com (X.Z.); dawei.song2010@gmail.com (D.S.)
2 Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China; E-Mail: cswjli@comp.polyu.edu.hk
3 Department of Computing and Communications, The Open University, Milton Keynes MK76AA, UK
* E-Mail: yxhou@tju.edu.cn; Tel.: +86-022-27406538.
Received: 25 March 2014; in revised form: 6 June 2014 / Accepted: 20 June 2014 / Published: 1 July 2014 Abstract: The principle of extreme physical information (EPI) can be used to derive many known laws and distributions in theoretical physics by extremizing the physical information loss K, i.e., the difference between the observed Fisher information I and the intrinsic information bound J of the physical phenomenon being measured. However, for complex cognitive systems of high dimensionality (e.g., human language processing and image recognition), the information bound J could be excessively larger than I (J ≫ I), due to insufficient observation, which would lead to serious over-fitting problems in the derivation of cognitive models. Moreover, there is a lack of an established exact invariance principle that gives rise to the bound information in universal cognitive systems. This limits the direct application of EPI. To narrow down the gap between I and J, in this paper, we propose a confident-information-first (CIF) principle to lower the information bound J by preserving confident parameters and ruling out unreliable or noisy parameters in the probability density function being measured. The confidence of each parameter can be assessed by its contribution to the expected Fisher information distance between the physical phenomenon and its observations. In addition, given a specific parametric representation, this contribution can often be directly assessed by the Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér–Rao bound. We then consider the dimensionality reduction in the parameter spaces of binary multivariate distributions. We show that the single-layer Boltzmann machine without hidden units (SBM) can be derived using the CIF principle. An illustrative experiment is conducted to show how the CIF principle improves the density estimation performance. 
Keywords: information geometry; Boltzmann machine; Fisher information; parametric reduction

1. Introduction

Information has been found to play an increasingly important role in physics. As stated in Wheeler [1]: "All things physical are information-theoretic in origin and this is a participatory universe...Observer participancy gives rise to information; and information gives rise to physics". Following this viewpoint, Frieden [2] unifies the derivation of physical laws in major fields of physics, from the Dirac equation to the Maxwell-Boltzmann velocity dispersion law, using the extreme physical information principle (EPI). More specifically, a variety of equations and distributions can be derived by extremizing the physical information loss K, i.e., the difference between the observed Fisher information I and the intrinsic information bound J of the physical phenomenon being measured.

Entropy 2014, 16, 3670–3688; doi:10.3390/e16073670 www.mdpi.com/journal/entropy

The first quantity, I, measures the amount of information as a finite scalar implied by the data with some suitable measure [2]. It is formally defined as the trace of the Fisher information matrix [3]. In addition to I, the second quantity, the information bound J, is an invariant that characterizes the information that is intrinsic to the physical phenomenon [2]. During the measurement procedure, there may be some loss of information, which entails I = κJ, where κ ≤ 1 is called the efficiency coefficient of the EPI process in transferring the Fisher information from the phenomenon (specified by J) to the output (specified by I). For closed physical systems, in particular, any solution for I attains some fraction of J between 1/2 (for classical physics) and one (for quantum physics) [4]. However, it is usually not the case in cognitive science.
For complex cognitive systems (e.g., human language processing and image recognition), the target probability density function (pdf) being measured is often of high dimensionality (e.g., thousands of words in a human language vocabulary and millions of pixels in an observed image). Thus, it is infeasible for us to obtain a sufficient collection of observations, leading to excessive information loss between the observer and nature. Moreover, there is a lack of an established exact invariance principle that gives rise to the bound information in universal cognitive systems. This limits the direct application of EPI in cognitive systems. In terms of statistics and machine learning, the excessive information loss between the observer and nature will lead to serious over-fitting problems, since the insufficient observations may not provide necessary information to reasonably identify the model and support the estimation of the target pdf in complex cognitive systems. Actually, a similar problem is also recognized in statistics and machine learning, known as the model selection problem [5]. In general, we would require a complex model with a high-dimensional parameter space to sufficiently depict the original high-dimensional observations. However, over-fitting usually occurs when the model is excessively complex with respect to the given observations. To avoid over-fitting, we would need to adjust the complexity of the models to the available amount of observations and, equivalently, to adjust the information bound J corresponding to the observed information I. In order to derive feasible computational models for cognitive phenomenon, we propose a confident-information-first (CIF) principle in addition to EPI to narrow down the gap between I and J (thus, a reasonable efficiency coefficient κ is implied), as illustrated in Figure 1. 
However, we do not intend to actually derive the distribution laws by solving the differential equations of the extremization of the new information loss K′. Instead, we assume that the target distribution belongs to some general multivariate binary distribution family and focus on the problem of seeking a proper information bound with respect to the constraint of the parametric number and the given observations.

Figure 1. (a) The paradigm of the extreme physical information principle (EPI) to derive physical laws by the extremization of the information loss K∗ (K∗ = J/2 for classical physics and K∗ = 0 for quantum physics); (b) the paradigm of confident-information-first (CIF) to derive computational models by reducing the information loss K′ using a new physical bound J′.

The key to the CIF approach is how to systematically reduce the physical information bound for high-dimensional complex systems. As stated in Frieden [2], the information bound J is a functional form that depends upon the physical parameters of the system. The information is contained in the variations of the observations (often imperfect, due to insufficient sampling, noise and intrinsic limitations of the "observer"), and can be further quantified using the Fisher information of system parameters (or coordinates) [3] from the estimation theory. Therefore, the physical information bound J of a complex system can be reduced by transforming it to a simpler system using some parametric reduction approach. Assuming there exists an ideal parametric model S that is general enough to represent all system phenomena (which gives the ultimate information bound in Figure 1), our goal is to adopt a parametric reduction procedure to derive a lower-dimensional sub-model M (which gives the reduced information bound in Figure 1) for a given dataset (usually insufficient or perturbed by noise) by reducing the number of free parameters in S.
Formally speaking, let q(ξ) be the ideal distribution with parameters ξ that describes the physical system and q(ξ + Δξ) be the observations of the system with some small fluctuation Δξ in parameters. In [6], the averaged information distance I(Δξ) between the distribution and its observations, the so-called shift information, is used as a disorder measure of the fluctuated observations to reinterpret the EPI principle. More specifically, in the framework of information geometry, this information distance could also be assessed using the Fisher information distance induced by the Fisher–Rao metric, which can be decomposed into the variation in the direction of each system parameter [7]. In principle, it is possible to divide system parameters into two categories, i.e., the parameters with notable variations and the parameters with negligible variations, according to their contributions to the whole information distance. Additionally, the parameters with notable contributions are considered to be confident, since they are important for reliably distinguishing the ideal distribution from its observation distributions. On the other hand, the parameters with negligible contributions can be considered to be unreliable or noisy. Then, the CIF principle can be stated as the parameter selection criterion that maximally preserves the Fisher information distance in an expected sense with respect to the constraint of the parametric number and the given observations (if available), when projecting distributions from the parameter space of S into that of the reduced sub-model M. We call it the distance-based CIF. As a result, we could manipulate the information bound of the underlying system by preserving the information of confident parameters and ruling out noisy parameters. In this paper, the CIF principle is analyzed in the multivariate binary distribution family in the mixed-coordinate system [8]. 
It turns out that, in this setting, the confidence of a parameter can be directly evaluated by its Fisher information, which also establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér–Rao bound [3]. Hence, the CIF principle can also be interpreted as the parameter selection procedure that keeps the parameters with reliable estimates and rules out unreliable or noisy parameters. This CIF is called the information-based CIF. Note that the definition of confidence in the distance-based CIF depends on both the Fisher information and the scale of fluctuation, and the confidence in the information-based CIF (i.e., the Fisher information) can be seen as a special case of the confidence measure with respect to certain coordinate systems. This simplification allows us to further apply the CIF principle to improve existing learning algorithms for the Boltzmann machine.

The paper is organized as follows. In Section 2, we introduce the parametric formulation for the general multivariate binary distributions in terms of the information geometry (IG) framework [7]. Then, Section 3 describes the implementation details of the CIF principle. We also give a geometric interpretation of CIF by showing that it can maximally preserve the expected information distance (in Section 3.2.1), as well as an analysis of the scale of the information distance in each individual system parameter (in Section 3.2.2). In Section 4, we demonstrate that a widely used cognitive model, i.e., the Boltzmann machine, can be derived using the CIF principle. Additionally, an illustrative experiment is conducted in Section 5 to show how the CIF principle can be utilized to improve the density estimation performance of the Boltzmann machine.

2.
The Multivariate Binary Distributions

Similar to EPI, the derivation of CIF depends on the analysis of the physical information bound, where the choice of system parameters, also called "Fisher coordinates" in Frieden [2], is crucial. Based on information geometry (IG) [7], we introduce some choices of parameterizations for binary multivariate distributions (denoted as statistical manifold $S$) with a given number of variables $n$, i.e., the open simplex of all probability distributions over binary vectors $x \in \{0, 1\}^n$.

2.1. Notations for Manifold S

In IG, a family of probability distributions is considered as a differentiable manifold with certain parametric coordinate systems. In the case of binary multivariate distributions, four basic coordinate systems are often used: p-coordinates, η-coordinates, θ-coordinates and mixed-coordinates [7,9]. Mixed-coordinates are of vital importance for our analysis.

For the p-coordinates $[p]$ with $n$ binary variables, the probability distribution over the $2^n$ states of $x$ can be completely specified by any $2^n - 1$ positive numbers indicating the probability of the corresponding exclusive states on $n$ binary variables. For example, the p-coordinates of $n = 2$ variables could be $[p] = (p_{01}, p_{10}, p_{11})$. Note that IG requires all probability terms to be positive [7].

For simplicity, we use the capital letters $I, J, \ldots$ to index the coordinate parameters of a probability distribution. To distinguish the notation of Fisher information (conventionally used in the literature, e.g., data information I and information bound J in Section 1) from the coordinate indexes, we make explicit explanations when necessary from now on. An index $I$ can be regarded as a subset of $\{1, 2, \ldots, n\}$. Additionally, $p_I$ stands for the probability that all variables indicated by $I$ are equal to one and the complementary variables are zero. For example, if $I = \{1, 2, 4\}$ and $n = 4$, then $p_I = p_{1101} = \mathrm{Prob}(x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1)$.
Note that the null set can also be a legal index of the p-coordinates, which indicates the probability that all variables are zero, denoted as $p_{0 \ldots 0}$.

Another coordinate system often used in IG is the η-coordinates, which are defined by:

$$\eta_I = E[X_I] = \mathrm{Prob}\Big\{ \prod_{i \in I} x_i = 1 \Big\} \qquad (1)$$

where the value of $X_I$ is given by $\prod_{i \in I} x_i$ and the expectation is taken with respect to the probability distribution over $x$. Grouping the coordinates by their orders, the η-coordinate system is denoted as $[\eta] = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^n_{1,2,\ldots,n})$, where the superscript indicates the order number of the corresponding parameter. For example, $\eta^2_{ij}$ denotes the set of all η parameters with the order number two.

The θ-coordinates (natural coordinates) are defined by:

$$\log p(x) = \sum_{I \subseteq \{1, 2, \ldots, n\},\, I \neq \emptyset} \theta^I X_I - \psi(\theta) \qquad (2)$$

where $\psi(\theta) = \log\big( \sum_x \exp\{ \sum_I \theta^I X_I(x) \} \big)$ is the cumulant generating function and its value equals $-\log \mathrm{Prob}\{x_i = 0, \forall i \in \{1, 2, \ldots, n\}\}$. The θ-coordinate system is denoted as $[\theta] = (\theta^i_1, \theta^{ij}_2, \ldots, \theta^{1,\ldots,n}_n)$, where the subscript indicates the order number of the corresponding parameter. Note that the order indices are located at different positions in $[\eta]$ and $[\theta]$, following the convention in Amari et al. [8].

The relation between the coordinate systems $[\eta]$ and $[\theta]$ is bijective. More formally, they are connected by the Legendre transformation:

$$\theta^I = \frac{\partial \phi(\eta)}{\partial \eta_I} , \qquad \eta_I = \frac{\partial \psi(\theta)}{\partial \theta^I} \qquad (3)$$

where $\psi(\theta)$ is given in Equation (2) and $\phi(\eta) = \sum_x p(x; \eta) \log p(x; \eta)$ is the negative of entropy. It can be shown that $\psi(\theta)$ and $\phi(\eta)$ meet the following identity [7]:

$$\psi(\theta) + \phi(\eta) - \sum_I \theta^I \eta_I = 0 \qquad (4)$$

Next, we introduce mixed-coordinates, which are important for our derivation of CIF. In general, the manifold $S$ of probability distributions could be represented by the $l$-mixed-coordinates [8]:

$$[\zeta]_l = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^l_{i,j,\ldots,k}, \theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n) \qquad (5)$$

where the first part consists of η-coordinates with order less than or equal to $l$ (denoted by $[\eta^{l-}]$) and the second part consists of θ-coordinates with order greater than $l$ (denoted by $[\theta_{l+}]$), $l \in \{1, \ldots, n - 1\}$.

2.2. Fisher Information Matrix for Parametric Coordinates

For a general coordinate system $[\xi]$, the element in the $i$-th row and $j$-th column of the Fisher information matrix for $[\xi]$ (denoted by $G_\xi$) is defined as the covariance of the scores of $\xi_i$ and $\xi_j$ [3], i.e.,

$$g_{ij} = E\left[ \frac{\partial \log p(x; \xi)}{\partial \xi_i} \cdot \frac{\partial \log p(x; \xi)}{\partial \xi_j} \right]$$

under the regularity condition for the pdf that the partial derivatives exist. The Fisher information measures the amount of information in the data that a statistic carries about the unknown parameters [10]. The Fisher information matrix is of vital importance to our analysis, because the inverse of the Fisher information matrix gives an asymptotically tight lower bound to the covariance matrix of any unbiased estimate for the considered parameters [3]. Another important concept related to our analysis is the orthogonality defined by Fisher information. Two coordinate parameters $\xi_i$ and $\xi_j$ are called orthogonal if and only if their Fisher information vanishes, i.e., $g_{ij} = 0$, meaning that their influences on the log likelihood function are uncorrelated.

The Fisher information for $[\theta]$ can be rewritten as $g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \partial \theta^J}$, and for $[\eta]$, it is $g^{IJ} = \frac{\partial^2 \phi(\eta)}{\partial \eta_I \partial \eta_J}$ [7]. Let $G_\theta = (g_{IJ})$ and $G_\eta = (g^{IJ})$ be the Fisher information matrices for $[\theta]$ and $[\eta]$, respectively. It can be shown that $G_\theta$ and $G_\eta$ are mutually inverse matrices, i.e., $\sum_J g^{IJ} g_{JK} = \delta^I_K$, where $\delta^I_K = 1$ if $I = K$ and zero otherwise [7]. In order to generally compute $G_\theta$ and $G_\eta$, we develop the following Propositions 1 and 2. Note that Proposition 1 is a generalization of Theorem 2 in Amari et al. [8].

Proposition 1. The Fisher information between two parameters $\theta^I$ and $\theta^J$ in $[\theta]$ is given by:

$$g_{IJ}(\theta) = \eta_{I \cup J} - \eta_I \eta_J \qquad (6)$$

Proof. See Appendix A.

Proposition 2.
The Fisher information between two parameters $\eta_I$ and $\eta_J$ in $[\eta]$ is given by:

$$g^{IJ}(\eta) = \sum_{K \subseteq I \cap J} (-1)^{|I - K| + |J - K|} \cdot \frac{1}{p_K} \qquad (7)$$

where $|\cdot|$ denotes the cardinality operator.

Proof. See Appendix B.

Based on the Fisher information matrices $G_\eta$ and $G_\theta$, we can calculate the Fisher information matrix $G_\zeta$ for the $l$-mixed-coordinate system $[\zeta]_l$, as follows:

Proposition 3. The Fisher information matrix $G_\zeta$ of the $l$-mixed-coordinates $[\zeta]_l$ is given by:

$$G_\zeta = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} \qquad (8)$$

where $A = ((G_\eta^{-1})_{I_\eta})^{-1}$, $B = ((G_\theta^{-1})_{J_\theta})^{-1}$, $G_\eta$ and $G_\theta$ are the Fisher information matrices of $[\eta]$ and $[\theta]$, respectively, $I_\eta$ is the index set of the parameters shared by $[\eta]$ and $[\zeta]_l$, i.e., $\{\eta^1_i, \ldots, \eta^l_{i,j,\ldots,k}\}$, and $J_\theta$ is the index set of the parameters shared by $[\theta]$ and $[\zeta]_l$, i.e., $\{\theta^{i,j,\ldots,k}_{l+1}, \ldots, \theta^{1,\ldots,n}_n\}$.

Proof. See Appendix C.

3. The General CIF Principle

In this section, we propose the CIF principle to reduce the physical information bound for high-dimensional systems. Given a target distribution $q(x) \in S$, we consider the problem of realizing it by a lower-dimensional submanifold. This is defined as the problem of parametric reduction for multivariate binary distributions. The family of multivariate binary distributions has been proven to be useful when we deal with discrete data in a variety of applications in statistical machine learning and artificial intelligence, such as the Boltzmann machine in neural networks [11,12] and the Rasch model in human sciences [13,14].

Intuitively, if we can construct a coordinate system so that the confidences of its parameters entail a natural hierarchy, in which highly confident parameters are significantly distinguished from and orthogonal to less confident ones, then we can conveniently implement CIF by keeping the highly confident parameters unchanged and setting the less confident parameters to neutral values. Therefore, the choice of coordinates (or parametric representations) in CIF is crucial to its usage.
This strategy is infeasible in terms of p-coordinates, η-coordinates or θ-coordinates, since the orthogonality condition cannot hold in these coordinate systems. In this section, we will show that the l-mixed-coordinates [ζ]l meets the requirement of CIF. In principle, the confidence of parameters should be assessed according to their contributions to the expected information distance between the ideal distribution and its fluctuated observations. This is called the distance-based CIF (see Section 1). For some coordinated systems, e.g., the mixed-coordinate system [ζ]l, the confidence of a parameter can also be directly evaluated by its Fisher information. This is called the information-based CIF (see Section 1). The information-based CIF (i.e., Fisher information) can be seen as an approximation to distance-based CIF, since it neglects the influence of parameter scaling to the expected information distance. However, considering the standard mixed-coordinates [ζ]l for the manifold of multivariate binary distributions, it turns out that both distance-based CIF and information-based CIF entail the same submanifold M (refer to Section 3.2 for detailed reasons). For the purpose of legibility, we will start with the information-based CIF, where the parameter’s confidence is simply measured using its Fisher information. After that, we show that the information-based CIF leads to an optimal submanifold M, which is also optimal in terms of the more rigorous distance-based CIF. 3.1. The Information-Based CIF Principle In this section, we will show that the l-mixed-coordinates [ζ]l meet the requirement of the information-based CIF. According to Proposition 3 and the following Proposition 4, the confidences of coordinate parameters (measured by Fisher information) in [ζ]l entail a natural hierarchy: the first part of high confident parameters [ηl−] are separated from the second part of low confident parameters [θl+]. 
Additionally, those low confident parameters $[\theta_{l+}]$ have the neutral value of zero.

Proposition 4. The diagonal elements of $A$ are lower bounded by one, and those of $B$ are upper bounded by one.

Proof. See Appendix D.

Moreover, the parameters in $[\eta^{l-}]$ are orthogonal to the ones in $[\theta_{l+}]$, indicating that we could estimate these two parts independently [9]. Hence, we can implement the information-based CIF for parametric reduction in $[\zeta]_l$ by replacing low confident parameters with the neutral value zero and reconstructing the resulting distribution. It turns out that the submanifold of $S$ tailored by the information-based CIF becomes $[\zeta]_{l_t} = (\eta^1_i, \ldots, \eta^l_{ij \ldots k}, 0, \ldots, 0)$. We call $[\zeta]_{l_t}$ the $l$-tailored-mixed-coordinates.

To grasp an intuitive picture of the CIF strategy and its significance w.r.t. mixed-coordinates, let us consider an example with $[p] = (p_{001} = 0.15, p_{010} = 0.1, p_{011} = 0.05, p_{100} = 0.2, p_{101} = 0.1, p_{110} = 0.05, p_{111} = 0.3)$. Then, the confidences for coordinates in $[\eta]$, $[\theta]$ and $[\zeta]_2$ are given by the diagonal elements of the corresponding Fisher information matrices. Applying the two-tailored CIF in mixed-coordinates, the loss ratio of Fisher information is 0.001%, and the ratio of the Fisher information of the tailored parameter ($\theta^{123}_3$) to the remaining η parameter with the smallest Fisher information is 0.06%. On the other hand, the above two ratios become 7.58% and 94.45% (in η-coordinates) or 12.94% and 92.31% (in θ-coordinates), respectively. We can see that $[\zeta]_2$ gives us a much better way to tell apart confident parameters from noisy ones.

3.2. The Distance-Based CIF: A Geometric Point-of-View

In the previous section, the information-based CIF entails a submanifold of $S$ determined by the $l$-tailored-mixed-coordinates $[\zeta]_{l_t}$.
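For readers who want to experiment with the three-variable example above, the following sketch builds $G_\theta$ from Proposition 1 and $G_\eta$ from Proposition 2, checks that they are mutually inverse, and evaluates the Fisher information of $\theta^{123}_3$ in $[\zeta]_2$ as the block $B$ of Proposition 3 (a single number here, because $J_\theta$ contains one parameter). The representation of index sets as `frozenset` objects and all helper names are assumptions made for this illustration; the code does not attempt to reproduce the percentage ratios quoted in the text.

```python
from itertools import combinations

N = 3
# p-coordinates of the example; the string "110" means x1 = 1, x2 = 1, x3 = 0.
P_BY_STATE = {"001": 0.15, "010": 0.10, "011": 0.05, "100": 0.20,
              "101": 0.10, "110": 0.05, "111": 0.30}
# The all-zero state follows from normalisation.
P_BY_STATE["000"] = 1.0 - sum(P_BY_STATE.values())


def index_set(state):
    """Index set I of the variables equal to one in a state string."""
    return frozenset(i + 1 for i, bit in enumerate(state) if bit == "1")


p = {index_set(s): v for s, v in P_BY_STATE.items()}

# All non-empty index sets, ordered by size (order of the coordinate).
INDICES = [frozenset(c) for size in range(1, N + 1)
           for c in combinations(range(1, N + 1), size)]


def eta(I):
    """eta_I = Prob(all variables in I equal one), Equation (1)."""
    return sum(v for K, v in p.items() if I <= K)


def g_theta(I, J):
    """Proposition 1: g_IJ(theta) = eta_{I u J} - eta_I * eta_J."""
    return eta(I | J) - eta(I) * eta(J)


def g_eta(I, J):
    """Proposition 2: sum over K within I n J of (-1)^(|I-K|+|J-K|) / p_K."""
    common = I & J
    return sum((-1) ** (len(I - K) + len(J - K)) / v
               for K, v in p.items() if K <= common)


G_theta = [[g_theta(I, J) for J in INDICES] for I in INDICES]
G_eta = [[g_eta(I, J) for J in INDICES] for I in INDICES]

m = len(INDICES)
product = [[sum(G_eta[i][k] * G_theta[k][j] for k in range(m))
            for j in range(m)] for i in range(m)]
max_deviation = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                    for i in range(m) for j in range(m))

# For l = 2 the block B is the scalar 1 / g^{123,123}(eta), since G_theta^{-1} = G_eta.
full = frozenset(range(1, N + 1))
fisher_theta_123_mixed = 1.0 / g_eta(full, full)

print("max |G_eta G_theta - I| =", max_deviation)
print("Fisher information of theta_123 in [zeta]_2 =", fisher_theta_123_mixed)
```

In this example the Fisher information of the tailored parameter is far below one, consistent with the upper bound of Proposition 4 on the diagonal of $B$.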
A more rigorous definition for the confidence of coordinates is the distance-based confidence used in the distance-based CIF, which relies on both the coordinate's Fisher information and its fluctuation scaling. In this section, we will show that the submanifold M determined by $[\zeta]_{l_t}$ is also an optimal submanifold in terms of the distance-based CIF. Note that, for other coordinate systems (e.g., arbitrarily rescaled coordinates), the information-based CIF may not entail the same submanifold as the distance-based CIF.

Let q(x), with coordinates $\zeta_q$, denote the exact solution to the physical phenomenon being measured. The act of observation causes small random perturbations to q(x), leading to some observation q′(x) with coordinates $\zeta_q + \Delta\zeta_q$. When two distributions q(x) and q′(x) are close, the divergence between q(x) and q′(x) on manifold S can be assessed by the Fisher information distance: $D(q, q') = (\Delta\zeta_q \cdot G_\zeta \cdot \Delta\zeta_q)^{1/2}$, where $G_\zeta$ is the Fisher information matrix and the perturbation $\Delta\zeta_q$ is small. The Fisher information distance between two close distributions q(x) and q′(x) on manifold S is the Riemannian distance under the Fisher–Rao metric, which is shown to be the square root of twice the Kullback–Leibler divergence from q(x) to q′(x) [8]. Note that we adopt the Fisher information distance as the distance measure between two close distributions, since it is shown to be the unique metric meeting a set of natural axioms for the distribution metrics [7,15,16], e.g., the invariance property with respect to reparametrizations and the monotonicity with respect to the random maps on variables.

Let M be a smooth k-dimensional submanifold in S ($k < 2^n - 1$). Given the point q(x) ∈ S, the projection [8] of q(x) on M is the point p(x) that belongs to M and is closest to q(x) with respect to the Kullback–Leibler divergence (K-L divergence) [17] from the distribution q(x) to p(x).
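The local relation between the Fisher information distance and the K-L divergence stated above can be checked numerically. The following is a minimal sketch and not the authors' code. It assumes the p-coordinate form of the Fisher–Rao line element, $D^2 \approx \sum_x (\Delta p_x)^2 / p_x$, which agrees with $\Delta\zeta_q \cdot G_\zeta \cdot \Delta\zeta_q$ by invariance under reparametrization. The helper names and the random perturbation direction are hypothetical; the 8-state distribution is the example of Section 3.1, with $p_{000} = 0.05$ obtained by normalization.

```python
import numpy as np

def kl_divergence(q, q_prime):
    """K-L divergence KL(q, q') for strictly positive discrete distributions."""
    return float(np.sum(q * np.log(q / q_prime)))

def local_fisher_distance(q, dq):
    """First-order Fisher information distance for a small perturbation dq.

    Assumption: computed in p-coordinates, where the squared line element
    is sum_x dq_x^2 / q_x (dq must sum to zero to stay on the simplex).
    """
    return float(np.sqrt(np.sum(dq ** 2 / q)))

# Example distribution over 3 binary variables, ordered p000, p001, ..., p111.
q = np.array([0.05, 0.15, 0.10, 0.05, 0.20, 0.10, 0.05, 0.30])

# Hypothetical perturbation direction: random, zero-sum, max magnitude 1.
rng = np.random.default_rng(0)
direction = rng.normal(size=q.size)
direction -= direction.mean()
direction /= np.abs(direction).max()

ratios = []
for scale in (1e-2, 1e-3, 1e-4):
    dq = scale * q.min() * direction  # keeps q + dq strictly positive
    d_fisher = local_fisher_distance(q, dq)
    d_from_kl = np.sqrt(2.0 * kl_divergence(q, q + dq))
    ratios.append(d_from_kl / d_fisher)
    print(f"scale={scale:g}  sqrt(2*KL)/D = {ratios[-1]:.6f}")
```

The printed ratio should approach one as the perturbation shrinks, which is the sense in which the K-L divergence is approximated by half of the squared Fisher information distance.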
On the submanifold M, the projections of q(x) and q′(x) are p(x) and p′(x), with coordinates $\zeta_p$ and $\zeta_p + \Delta\zeta_p$, respectively, as shown in Figure 2. Let the preserved Fisher information distance be D(p, p′) after projecting on M. In order to retain the information contained in observations, we need the ratio $D(p, p') / D(q, q')$ to be as large as possible in the expected sense, with respect to the given dimensionality k of M. The next two sections will illustrate that CIF leads to an optimal submanifold M based on different assumptions on the perturbations $\Delta\zeta_q$.

Figure 2. By projecting a point q(x) on S to a submanifold M, the l-tailored mixed-coordinates $[\zeta]_{l_t}$ give a desirable M that maximally preserves the expected Fisher information distance when projecting an ε-neighborhood centered at q(x) onto M.

3.2.1. Perturbations in Uniform Neighborhood

Let $B_q$ be an ε-sphere surface centered at q(x) on manifold S, i.e., $B_q = \{q' \in S \mid KL(q, q') = \varepsilon\}$, where KL(·, ·) denotes the K-L divergence and ε is small. Additionally, q′(x) is a neighbor of q(x) uniformly sampled on $B_q$, as illustrated in Figure 2. Recall that, for a small ε, the K-L divergence can be approximated by half of the squared Fisher information distance. Thus, in the parameterization of $[\zeta]_l$, $B_q$ is indeed the surface of a hyper-ellipsoid (centered at q(x)) determined by $G_\zeta$. The following proposition shows that the general CIF leads to an optimal submanifold M that maximally preserves the expected information distance, where the expectation is taken over the uniform neighborhood $B_q$.

Proposition 5. Consider the manifold S in l-mixed-coordinates $[\zeta]_l$. Let k be the number of free parameters in the l-tailored-mixed-coordinates $[\zeta]_{l_t}$. Then, among all k-dimensional submanifolds of S, the submanifold determined by $[\zeta]_{l_t}$ can maximally preserve the expected information distance induced by the Fisher–Rao metric.

Proof. See Appendix E.

3.2.2. Perturbations in Typical Distributions

To facilitate our analysis, we make a basic assumption on the underlying distributions q(x) that at least $(2^n - 2^{n/2})$ p-coordinates are of the scale ϵ, where ϵ is a sufficiently small value. Thus, the residual p-coordinates (at most $2^{n/2}$) are all significantly larger than zero (of scale $\Theta(1/2^{n/2})$), and their sum approximates one. Note that these assumptions are common situations in real-world data collections [18], since the frequent (or meaningful) patterns are only a small fraction of all of the system states.

Next, we introduce a small perturbation Δp to the p-coordinates [p] for the ideal distribution q(x). The scale of each fluctuation $\Delta p_I$ is assumed to be proportional to the standard deviation of the corresponding p-coordinate $p_I$ by some small coefficients (upper bounded by a constant a), which can be approximated by the inverse of the square root of its Fisher information via the Cramér–Rao bound. It turns out that we can assume the perturbation $\Delta p_I$ to be $a\sqrt{p_I}$.

In this section, we adopt the l-mixed-coordinates $[\zeta]_l = (\eta^{l-}; \theta^{l+})$, where l = 2 is used in the following analysis. Let $\Delta\zeta_q = (\Delta\eta^{2-}; \Delta\theta^{2+})$ be the increment of the mixed-coordinates after the perturbation. The squared Fisher information distance $D^2(q, q') = \Delta\zeta_q \cdot G_\zeta \cdot \Delta\zeta_q$ can be decomposed into the direction of each coordinate in $[\zeta]_l$. We will clarify that, under typical cases, the scale of the Fisher information distance in each coordinate of $\theta^{l+}$ (reduced by CIF) is asymptotically negligible, compared to that in each coordinate of $\eta^{l-}$ (preserved by CIF).

The scale of the squared Fisher information distance in the direction of $\eta_I$ is proportional to $\Delta\eta_I \cdot (G_\zeta)_{I,I} \cdot \Delta\eta_I$, where $(G_\zeta)_{I,I}$ is the Fisher information of $\eta_I$ in terms of the mixed-coordinates $[\zeta]_2$. From Equation (1), for any I of order one (or two), $\eta_I$ is the sum of $2^{n-1}$ (or $2^{n-2}$) p-coordinates, and the scale is Θ(1).
Hence, the increment $\Delta\eta^{2-}$ is proportional to Θ(1), denoted as a · Θ(1). It is difficult to give an explicit expression of $(G_\zeta)_{I,I}$ analytically. However, the Fisher information $(G_\zeta)_{I,I}$ of $\eta_I$ is bounded by the (I, I)-th element of the inverse covariance matrix [19], which is exactly $1/g_{I,I}(\theta) = 1/(\eta_I - \eta_I^2)$ (see Proposition 3). Hence, the scale of $(G_\zeta)_{I,I}$ is also Θ(1). It turns out that the scale of the squared Fisher information distance in the direction of $\eta_I$ is $a^2 \cdot \Theta(1)$.

Similarly, for the part $\theta^{2+}$, the scale of the squared Fisher information distance in the direction of $\theta_J$ is proportional to $\Delta\theta_J \cdot (G_\zeta)_{J,J} \cdot \Delta\theta_J$, where $(G_\zeta)_{J,J}$ is the Fisher information of $\theta_J$ in terms of the mixed-coordinates $[\zeta]_2$. The scale of $\theta_J$ is at most $f(k)\,|\log\sqrt{\epsilon}|$ based on Equation (2), where k is the order of $\theta_J$ and f(k) is the number of p-coordinates of scale $\Theta(1/2^{n/2})$ that are involved in the calculation of $\theta_J$. Since we assume that $f(k) \le 2^{n/2}$, the maximum scale of $\theta_J$ is $2^{n/2}\,|\log\sqrt{\epsilon}|$. Thus, the increment $\Delta\theta_J$ is of a scale bounded by $a \cdot 2^{n/2}\,|\log\sqrt{\epsilon}|$. Similar to our previous derivation, the Fisher information $(G_\zeta)_{J,J}$ of $\theta_J$ is bounded by the (J, J)-th element of the inverse covariance matrix, which is exactly $1/g_{J,J}(\eta)$ (see Proposition 3). Hence, the scale of $(G_\zeta)_{J,J}$ is $(2^k - f(k))^{-1}\epsilon$. In summary, the scale of the squared Fisher information distance in the direction of $\theta_J$ is bounded by the scale of $a^2 \cdot \Theta\!\left(\frac{2^n \epsilon\, |\log\sqrt{\epsilon}|^2}{2^k - f(k)}\right)$. Since ϵ is a sufficiently small value and a is constant, the scale of the squared Fisher information distance in the direction of $\theta_J$ is asymptotically zero.

In summary, in terms of modeling the fluctuated observations of typical cognitive systems, the original Fisher information distance between the physical phenomenon (q(x)) and observations (q′(x)) is systematically reduced using CIF by projecting them on an optimal submanifold M.
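To see why the bound in the direction of $\theta_J$ vanishes, it may help to isolate its ϵ-dependence. The short derivation below is only a sketch, under the assumption that n, k, f(k) and a are held fixed (with $2^k - f(k) > 0$) while ϵ tends to zero; the substitution variable t is introduced here purely for illustration.

```latex
\begin{align*}
\epsilon\,\bigl|\log\sqrt{\epsilon}\bigr|^{2}
  &= \tfrac{1}{4}\,\epsilon\,(\log\epsilon)^{2}, \\
\epsilon = e^{-t}
  \;\Longrightarrow\;
  \tfrac{1}{4}\,\epsilon\,(\log\epsilon)^{2}
  &= \tfrac{1}{4}\,t^{2}e^{-t}
  \;\longrightarrow\; 0 \quad (t \to \infty), \\
\text{hence}\quad
a^{2}\cdot\frac{2^{n}\,\epsilon\,\bigl|\log\sqrt{\epsilon}\bigr|^{2}}{2^{k}-f(k)}
  &\;\longrightarrow\; 0 \quad (\epsilon \to 0).
\end{align*}
```

The linear factor ϵ dominates the squared logarithm, so the $\theta_J$ contribution decays even though $|\log\sqrt{\epsilon}|$ itself grows without bound.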
Based on our above analysis, the scale of the Fisher information distance in the directions of $[\eta^{l-}]$ preserved by CIF is significantly larger than that of the directions $[\theta^{l+}]$ reduced by CIF.

4. Derivation of Boltzmann Machine by CIF

In the previous section, the CIF principle is uncovered in the $[\zeta]_l$ coordinates. Now, we consider an implementation of CIF when l equals two, which gives rise to the single-layer Boltzmann machine without hidden units (SBM).

4.1. Notations for SBM

The energy function for SBM is given by:

$$E_{SBM}(x; \xi) = -\frac{1}{2} x^T U x - b^T x \qquad (9)$$

where ξ = {U, b} are the parameters and the diagonals of U are set to zero. The Boltzmann distribution over x is $p(x; \xi) = \frac{1}{Z}\exp\{-E_{SBM}(x; \xi)\}$, where Z is a normalization factor. Actually, the parametrization for SBM could be naturally expressed by the coordinate systems in IG (e.g., $[\theta] = (\theta^i_1 = b_i, \theta^{ij}_2 = U_{ij}, \theta^{ijk}_3 = 0, \ldots, \theta^{1,2,\ldots,n}_n = 0)$).

4.2. The Derivation of SBM using CIF

Given any underlying probability distribution q(x) on the general manifold S over {x}, the logarithm of q(x) can be represented by a linear decomposition of θ-coordinates, as shown in Equation (2). Since it is impractical to recognize all coordinates for the target distribution, we would like to only approximate part of them and end up with a k-dimensional submanifold M of S, where k ($\ll 2^n - 1$) is the number of free parameters. Here, we set k to be the same dimensionality as SBM, i.e., $k = \frac{n(n+1)}{2}$, so that all candidate submanifolds are comparable to the submanifold endowed by SBM (denoted as $M_{sbm}$). Next, the rationale underlying the design of $M_{sbm}$ can be illustrated using the general CIF.

Let the two-mixed-coordinates of q(x) on S be $[\zeta]_2 = (\eta^1_i, \eta^2_{ij}, \theta^{i,j,k}_3, \ldots, \theta^{1,\ldots,n}_n)$. Applying the general CIF on $[\zeta]_2$, our parametric reduction rule is to preserve the high confident parameters $[\eta^{2-}]$ and replace the low confident parameters $[\theta^{2+}]$ by a fixed neutral value of zero.
Thus, we derive the two-tailored-mixed-coordinates: $[\zeta]_{2_t} = (\eta^1_i, \eta^2_{ij}, 0, \ldots, 0)$, as the optimal approximation of q(x) by the k-dimensional submanifolds. On the other hand, given the two-mixed-coordinates of q(x), the projection $p(x) \in M_{sbm}$ of q(x) is proven to be $[\zeta]_p = (\eta^1_i, \eta^2_{ij}, 0, \ldots, 0)$ [8]. Thus, SBM defines a probabilistic parameter space that is derived from CIF.

4.3. The Learning Algorithms for SBM

Let q(x) be the underlying probability distribution from which samples $D = \{d_1, d_2, \ldots, d_N\}$ are generated independently. Then, our goal is to train an SBM (with stationary probability p(x)) based on D that realizes q(x) as faithfully as possible. Here, we briefly introduce two typical learning algorithms for SBM: maximum-likelihood and contrastive divergence [11,20,21].

Maximum-likelihood (ML) learning realizes a gradient ascent of the log-likelihood of D:

$$\Delta U_{ij} = \varepsilon \frac{\partial l(\xi; D)}{\partial U_{ij}} = \varepsilon \left(E_q[x_i x_j] - E_p[x_i x_j]\right) \qquad (10)$$

where ε is the learning rate and $l(\xi; D) = \frac{1}{N}\sum_{n=1}^{N} \log p(d_n; \xi)$. $E_q[\cdot]$ and $E_p[\cdot]$ are expectations over q(x) and p(x), respectively. Actually, $E_q[x_i x_j]$ and $E_p[x_i x_j]$ are the coordinates $\eta^2_{ij}$ of q(x) and p(x), respectively. $E_q[x_i x_j]$ can be estimated without bias from the sample. Markov chain Monte Carlo [22] is often used to approximate $E_p[x_i x_j]$ with an average over samples from p(x).

Contrastive divergence (CD) learning realizes the gradient descent of a different objective function to avoid the difficulty of computing the log-likelihood gradient, shown as follows:

$$\Delta U_{ij} = -\varepsilon \frac{\partial \left(KL(q_0 \| p) - KL(p_m \| p)\right)}{\partial U_{ij}} = \varepsilon \left(E_{q_0}[x_i x_j] - E_{p_m}[x_i x_j]\right) \qquad (11)$$

where $q_0$ is the sample distribution, $p_m$ is the distribution obtained by starting the Markov chain with the data and running m steps, and KL(·||·) denotes the K-L divergence. Taking samples in D as initial states, we could generate a set of samples for $p_m(x)$. Those samples can be used to estimate $E_{p_m}[x_i x_j]$.
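The CD update of Equation (11) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation. The conditional firing probability is derived from the energy function in Equation (9) for a symmetric U with zero diagonal. The sequential Gibbs sweep, the bias update rule, the function names and the toy data are all assumptions made here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(x, U, b, rng):
    """One sequential Gibbs sweep over all n units for a batch of states x."""
    x = x.copy()
    for i in range(x.shape[1]):
        # U has a zero diagonal, so unit i does not feed back on itself.
        p_on = sigmoid(x @ U[:, i] + b[i])
        x[:, i] = (rng.random(x.shape[0]) < p_on).astype(float)
    return x

def cd_step(data, U, b, lr, m, rng):
    """One CD-m parameter update following the form of Equation (11)."""
    x = data
    for _ in range(m):
        x = gibbs_sweep(x, U, b, rng)
    n_samples = data.shape[0]
    pos = data.T @ data / n_samples  # estimate of E_{q0}[x_i x_j]
    neg = x.T @ x / n_samples        # estimate of E_{pm}[x_i x_j]
    dU = lr * (pos - neg)
    np.fill_diagonal(dU, 0.0)        # diagonals of U stay at zero
    # Assumption: biases follow the analogous first-order rule.
    db = lr * (data.mean(axis=0) - x.mean(axis=0))
    return U + dU, b + db

# Hypothetical toy data: 200 samples of 4 binary units.
rng = np.random.default_rng(0)
data = (rng.random((200, 4)) < 0.3).astype(float)
U = np.zeros((4, 4))
b = np.zeros(4)
for _ in range(50):
    U, b = cd_step(data, U, b, lr=0.1, m=1, rng=rng)
```

Replacing the m-step chain with samples drawn from a long-run chain would turn the same update into a Monte Carlo approximation of the ML rule in Equation (10).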
From the perspective of IG, we can see that ML/CD learning updates the parameters in SBM so that its corresponding coordinates $[\eta^{2-}]$ get closer to the data (along with the decreasing gradient). This is consistent with our theoretical analysis in Sections 3 and 4.2 that SBM uses the most confident information (i.e., $[\eta^{2-}]$) for approximating an arbitrary distribution in an expected sense.

5. Experimental Study: Incorporate Data into CIF

In the information-based CIF, the actual values of the data were not used to explicitly affect the output PDF (e.g., the derivation of SBM in Section 4). The data constrain the state of knowledge about the unknown PDF. In order to force the estimate of our probabilistic model to obey the data, we need to further reduce the difference between the data information and the physical information bound. How can this be done?

In this section, the CIF principle will also be used to modify an existing SBM training algorithm (i.e., CD-1) by incorporating data information. Given a particular dataset, the CIF can be used to further recognize less-confident parameters in SBM and to reduce them properly. Our solution here is to apply CIF to take effect on the learning trajectory with respect to specific samples and, hence, further confine the parameter space to the region indicated by the most confident information contained in the samples.

5.1. A Sample-Specific CIF-Based CD Learning for SBM

The main modification of our CIF-based CD algorithm (CD-CIF for short) is that we generate the samples for $p_m(x)$ based on those parameters with confident information, where the confident information carried by a certain parameter is inherited from the sample and could be assessed using its Fisher information computed in terms of the sample. For CD-1 (i.e., m = 1), the firing probability for the i-th neuron after a one-step transition from the initial state $x^{(0)} = \{x^{(0)}_1, x^{(0)}_2, \ldots, x^{(0)}_n\}$ is:

$$p(x^{(1)}_i = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{-\sum_{j \neq i} U_{ij} x^{(0)}_j - b_i\}} \qquad (12)$$

For CD-CIF, the firing probability for the i-th neuron in Equation (12) is modified as follows:

$$p(x^{(1)}_i = 1 \mid x^{(0)}) = \frac{1}{1 + \exp\{-\sum_{(j \neq i)\,\&\,(F(U_{ij}) > \tau)} U_{ij} x^{(0)}_j - b_i\}} \qquad (13)$$

where τ is a pre-selected threshold, $F(U_{ij}) = E_{q_0}[x_i x_j] - E_{q_0}[x_i x_j]^2$ is the Fisher information of $U_{ij}$ (see Equation (6)) and the expectations are estimated from the given sample D. We can see that those weights whose Fisher information is less than τ are considered to be unreliable w.r.t. D. In practice, we could set up τ by the ratio r to specify the proportion of the total Fisher information TFI of all parameters that we would like to retain, i.e., ∑Uij>τ,i 1, ∀i ∈ {1, 2, . . . , l}.

Proof. Since H is positive definite, it is a Gramian matrix of l linearly independent vectors $v_1, v_2, \ldots, v_l$, i.e., $H_{ij} = \langle v_i, v_j \rangle$ (⟨·, ·⟩ denotes the inner product). Similarly, $H^{-1}$ is the Gramian matrix of l linearly independent vectors $w_1, w_2, \ldots, w_l$ and $(H^{-1})_{ij} = \langle w_i, w_j \rangle$. It is easy to verify that $\langle w_i, v_i \rangle = 1$, ∀i ∈ {1, 2, . . . , l}. If $H_{ii} < 1$, we can see that the norm $\|v_i\| = \sqrt{H_{ii}} < 1$. Since $\|w_i\| \times \|v_i\| \ge \langle w_i, v_i \rangle = 1$, we have $\|w_i\| > 1$. Hence, $(H^{-1})_{ii} = \langle w_i, w_i \rangle = \|w_i\|^2 > 1$.

Appendix E Proof of Proposition 5

Proof. Let $B_q$ be an ε-ball surface centered at q(x) on manifold S, i.e., $B_q = \{q' \in S \mid KL(q, q') = \varepsilon\}$, where KL(·, ·) denotes the Kullback–Leibler divergence and ε is small. $\zeta_q$ denotes the coordinates of q(x). Let q(x) + dq be a neighbor of q(x) uniformly sampled on $B_q$ and $\zeta_{q(x)+dq}$ be its corresponding coordinates. For a small ε, we can calculate the expected information distance between q(x) and q(x) + dq as follows:

$$E_{B_q} = \int \left[(\zeta_{q(x)+dq} - \zeta_q)^T G_\zeta (\zeta_{q(x)+dq} - \zeta_q)\right]^{1/2} dB_q \qquad (A1)$$

where $G_\zeta$ is the Fisher information matrix at q(x).
Since the Fisher information matrix $G_\zeta$ is both positive definite and symmetric, there exists a singular value decomposition $G_\zeta = U^T \Lambda U$, where U is an orthogonal matrix and Λ is a diagonal matrix with diagonal entries equal to the eigenvalues of $G_\zeta$ (all ≥ 0). Applying the singular value decomposition to Equation (A1), the distance becomes:

$$E_{B_q} = \int \left[(\zeta_{q(x)+dq} - \zeta_q)^T U^T \Lambda U (\zeta_{q(x)+dq} - \zeta_q)\right]^{1/2} dB_q \qquad (A2)$$

Note that U is an orthogonal matrix, and the transformation $U(\zeta_{q(x)+dq} - \zeta_q)$ is a norm-preserving rotation.

Now, we need to show that among all tailored k-dimensional submanifolds of S, $[\zeta]_{l_t}$ is the one that preserves the maximum information distance. Assume $I_T = \{i_1, i_2, \ldots, i_k\}$ is the index set of the k coordinates that we choose to form the tailored submanifold T in the mixed-coordinates [ζ]. According to the fundamental analytical properties of the surface of the hyper-ellipsoid and the orthogonality of the mixed-coordinates, there exists a strict positive monotonicity between the expected information distance $E_{B_q}$ for T and the sum of eigenvalues of the sub-matrix $(G_\zeta)_{I_T}$, where the sum equals the trace of $(G_\zeta)_{I_T}$. That is, the greater the trace of $(G_\zeta)_{I_T}$, the greater the expected information distance $E_{B_q}$ for T.

Next, we show that the sub-matrix of $G_\zeta$ specified by $[\zeta]_{l_t}$ gives a maximum trace. Based on Proposition 4, the elements on the main diagonal of the sub-matrix A are lower bounded by one and those of B are upper bounded by one. Therefore, $[\zeta]_{l_t}$ gives the maximum trace among all sub-matrices of $G_\zeta$. This completes the proof.

Author Contributions: Theoretical study and proof: Yuexian Hou and Xiaozhao Zhao. Conceived and designed the experiments: Xiaozhao Zhao, Yuexian Hou, Dawei Song and Wenjie Li. Performed the experiments: Xiaozhao Zhao. Analyzed the data: Xiaozhao Zhao, Yuexian Hou. Wrote the manuscript: Xiaozhao Zhao, Dawei Song, Wenjie Li and Yuexian Hou. All authors have read and approved the final manuscript.
Conflicts of Interest: The authors declare no conflict of interest.

References

1. Wheeler, J.A. Time Today; Cambridge University Press: Cambridge, UK, 1994; pp. 1–29.
2. Frieden, B.R. Science from Fisher Information: A Unification; Cambridge University Press: Cambridge, UK, 2004.
3. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
4. Frieden, B.R.; Gatenby, R.A. Principle of maximum Fisher information from Hardy's axioms applied to statistical systems. Phys. Rev. E 2013, 88, 042144.
5. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer: Berlin/Heidelberg, Germany, 2002.
6. Vstovsky, G.V. Interpretation of the extreme physical information principle in terms of shift information. Phys. Rev. E 1995, 51, 975–979.
7. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs; Oxford University Press: Oxford, UK, 1993.
8. Amari, S.; Kurata, K.; Nagaoka, H. Information geometry of Boltzmann machines. IEEE Trans. Neural Netw. 1992, 3, 260–271.
9. Hou, Y.; Zhao, X.; Song, D.; Li, W. Mining pure high-order word associations via information geometry for information retrieval. ACM Trans. Inf. Syst. 2013, 31, 12:1–12:32.
10. Kass, R.E. The geometry of asymptotic inference. Stat. Sci. 1989, 4, 188–219.
11. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985, 9, 147–169.
12. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
13. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; Danish Institute for Educational Research: Copenhagen, Denmark, 1960.
14. Bond, T.; Fox, C. Applying the Rasch Model: Fundamental Measurement in the Human Sciences; Psychology Press: London, UK, 2013.
15. Gibilisco, P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: Cambridge, UK, 2010.
16. Čencov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Washington, D.C., USA, 1982.
17. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
18. Buhlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory And Applications; Springer: Berlin/Heidelberg, Germany, 2011.
19. Bobrovsky, B.; Mayer-Wolf, E.; Zakai, M. Some classes of global Cramér-Rao bounds. Ann. Stat. 1987, 15, 1421–1438.
20. Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002, 14, 1771–1800.
21. Carreira-Perpinan, M.A.; Hinton, G.E. On contrastive divergence learning. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, Key West, FL, USA, 6–8 January 2005; pp. 33–40.
22. Gilks, W.R.; Richardson, S.; Spiegelhalter, D. Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice; Chapman and Hall/CRC: London, UK, 1996; pp. 1–19.
23. Nakahara, H.; Amari, S. Information geometric measure for neural spikes. Neural Comput. 2002, 14, 2269–2316.

© 2014 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).