Information theoretical approaches in network inference
Information theory (IT) offers a powerful theoretical foundation, well suited to the development of computational methodologies intended to deal with network inference problems as applied to real data in several branches of complex systems theory (Hernández-Lemus and Rangel-Escareño 2011). IT also provides an analogy with statistical mechanics (SM) that can be useful for inferring network interactions (links) from between-node correlation measures, thus enabling the use (although in a quite non-trivial manner) of the huge arsenal of tools of that science. There are, however, a number of open questions in the application of IT to probabilistic complex network inference. The applied algorithms should be able to return intelligible models relying on scarce a priori information while dealing with a potentially large number of variables. IT methods should also detect non-linear dependencies in highly noisy, non-independent probability distributions. The best benchmarking option for this kind of complex network inference, in our view, is the use of sequential search algorithms together with IT-based performance measures: stochastic search, which typically involves the assignment of structures for large constrained datasets, has a high computational complexity (even NP-hard, with an exponentially large search space), whereas IT-based feature selection is fast and efficient and also provides an easy way to communicate results.
Information theoretical measures have been applied intensively to infer interactions in complex networks, in particular in the field of computational biology (Bansal et al. 2007; de Jong 2002; Fleuret 2004; Hernández-Lemus et al. 2009; Margolin et al. 2006; Peng et al. 2005; van Someren et al. 2002), but also in social network studies (Crowley-Riddey 2009; Dong 2011; Mislove 2009; Mislove et al. 2010; Zhao et al. 2011). A group of correlation measures including mutual information, Markov random fields and Kullback-Leibler divergences, amongst others, are considered appropriate to perform probabilistic network inference (Hernández-Lemus and Rangel-Escareño 2011). However, since conditional probabilities obey the so-called tower property, a number of false positive links may appear as a consequence of indirect correlations (Hernández-Lemus and Rangel-Escareño 2011).
For instance, if node (or agent) A has a high value of conditional correlation (say, mutual information) with node B, and B is also highly correlated with node C, most common algorithms would predict (with a marginal probability p^{ind}) the presence of a (possibly non-existent) link between processes A and C. In order to correct for the presence of indirect links, we may implement methods from IT, such as bounds on the information-theoretical probability measures and the use of the Data Processing Inequality (DPI) (Sehgal et al. 2007). The DPI provides a bound on the extent to which signal processing may optimize probabilistic inference. We will discuss these and other ideas in the framework of network inference and structure assessment. We will also discuss some of their implications and potential applications in the contemporary complex systems scenario.
Some of the essential notions of IT used in this work are the (information-theoretical) entropy, the mutual information and other related measures. To define them, let X and Y denote two discrete random variables having the following features:

Finite alphabets \mathcal{X} and \mathcal{Y}, respectively

Joint probability mass distribution p(X, Y)

Marginal probability mass distributions p(X) and p(Y)

Conditional probability mass distributions p(X|Y) and p(Y|X)
Following Shannon (1949), it is possible to define the information theoretical entropy H of such a distribution as follows:
H = -K_{s}\sum_{\nu} p_{\nu}(X) \log p_{\nu}(X)
(1)
here H is called the Shannon-Weaver entropy, K_s is a constant useful to determine the units in which entropy is measured (bits, nats, and so on, depending on the base of the log used) and p_ν(X) is the probability mass for state ν of the random variable, given by X = x. IT entropy is a measure of the amount of uncertainty associated with the value of X, hence relating the predictability of an outcome to the probability distribution. Let us now consider two discrete random variables (Y, X) with a Joint Probability Distribution (JPD) p(Y, X). For these random variables the joint entropy H(Y, X) is:
H(Y,X) = -\sum_{y \in \mathcal{Y}}\sum_{x \in \mathcal{X}} p(y,x) \log p(y,x)
(2)
The maximal joint entropy corresponds to independence of the random variables Y and X, i.e. when the JPD factorizes: p(Y, X) = p(Y)p(X). The entropy of the JPD is then just the sum of the respective entropies. An inequality theorem can be stated as an upper bound for the joint entropy:
H(Y,X) \le H(Y) + H(X)
(3)
Equality holds iff X and Y are statistically independent.
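As a quick numerical check of the subadditivity bound in equation 3, the following Python sketch (using numpy, with K_s = 1 and log base 2, i.e. entropies in bits) computes the joint and marginal entropies of a small toy joint distribution; the distribution itself is an arbitrary illustrative choice, not data from this work.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability mass function (K_s = 1, log base 2)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # the convention 0 log 0 = 0
    return -np.sum(p * np.log2(p))

# A toy joint PMF in which X and Y are dependent
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal of X
p_y = p_xy.sum(axis=0)   # marginal of Y

H_joint = entropy(p_xy)
H_sum = entropy(p_x) + entropy(p_y)
print(H_joint <= H_sum)  # True: subadditivity (equation 3)

# Under independence the joint PMF factorizes and equality holds
p_ind = np.outer(p_x, p_y)
print(np.isclose(entropy(p_ind), H_sum))  # True
```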
Also, given a Conditional Probability Distribution (CPD), the corresponding conditional entropy of Y given X is given by:
H(Y|X) = -\sum_{y \in \mathcal{Y}}\sum_{x \in \mathcal{X}} p(y,x) \log p(y|x)
(4)
Conditional entropies measure the uncertainty of a random variable once another one (the conditioner) is known. It can be proved (Cover and Thomas 1991) that:
H(Y,X) = H(X) + H(Y|X) \le H(Y) + H(X)
(5)
Or:
H(Y|X) \le H(Y)
(6)
Again, equality holds iff X and Y are statistically independent. Equation 6 is useful in the inference/prediction scenario as follows: if Y is a target variable and X is a predictor, conditioning on additional predictor variables can only decrease the uncertainty about the target Y. As will be shown later, this is essential for network inference when applying IT methods. Entropy reduction by conditioning can be quantified if we consider a measure called the mutual information, I(Y,X), which is symmetrical (i.e. I(Y, X) = I(X, Y)) and is written as:
I(Y,X) = H(Y) - H(Y|X) \quad \text{or} \quad I(X,Y) = H(X) - H(X|Y)
(7)
If we resort to Shannon’s definition of entropy (equation 1) (Shannon and Weaver 1949) and substitute it into equation 7 we get:
I(Y,X) = \sum_{y \in \mathcal{Y}}\sum_{x \in \mathcal{X}} p(x,y) \log\frac{p(x,y)}{p(x)p(y)}
(8)
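Equations 7 and 8 can be verified numerically. The sketch below (Python/numpy, entropies in bits; the toy joint distribution is an illustrative assumption) computes the mutual information directly from equation 8 and checks that it agrees with H(Y) − H(Y|X), using the chain rule H(Y|X) = H(Y,X) − H(X).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability mass function."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """Equation 8: sum over x, y of p(x,y) log[ p(x,y) / (p(x) p(y)) ]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2((p_xy / (p_x * p_y))[mask]))

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

# Equation 7: I(Y,X) = H(Y) - H(Y|X), with H(Y|X) = H(Y,X) - H(X) (chain rule)
H_Y = entropy(p_xy.sum(axis=0))
H_Y_given_X = entropy(p_xy) - entropy(p_xy.sum(axis=1))
print(np.isclose(mutual_information(p_xy), H_Y - H_Y_given_X))  # True
```

For a factorized joint distribution the ratio inside the logarithm is 1 everywhere and the mutual information vanishes, consistent with MI = 0 iff independence.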
A comprehensive catalogue of algorithms to calculate diverse information theoretical measures (including mutual information) has been developed for [R], the statistical scientific computing environment (INFOTHEO 2012). We will analyze the special role that MI has in the field of complex network inference from quantitative feature data. MI has been applied successfully as a measure to infer 2-way interactions in complex networks (quite specially in the field of Gene Regulatory Networks or GRNs) (Andrecut and Kauffman 2006a; Andrecut and Kauffman 2006b; Madni and Andrecut 2007; Margolin et al. 2006). As we have seen, MI quantifies the degree of statistical dependency between two random variables (say α and β). One can see that MI(α, β) = 0 iff α and β are statistically independent.
Hence, if we measure some quantitative feature of interest ϑ (say, the expression level of genes in GRNs), by studying its profile (and more specifically the mutual correlation profile for a set of nodes) we may find interactions conforming a network. A pair of agents characterized by feature distributions ϑ_i and ϑ_j for which MI(ϑ_i, ϑ_j) ≠ 0 are said to interact with each other. Since MI is reparametrization invariant, one usually calculates the normalized mutual information. In this case MI(ϑ_i, ϑ_j) ∈ [0, 1], ∀ i, j.
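The inference step just described can be sketched as follows. This is a minimal, hypothetical Python illustration (not the aracne implementation used later in this work): pairwise MI between feature profiles is estimated by simple histogram discretization, normalized here by the smaller of the two marginal entropies (one of several possible normalizations; the text does not fix one), and edges are kept where the normalized MI exceeds an arbitrary threshold.

```python
import numpy as np

def discrete_mi(x, y, bins=8):
    """Histogram estimate of MI (in bits) between two continuous profiles."""
    p_xy, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2((p_xy / (p_x * p_y))[mask]))

def entropy_1d(x, bins=8):
    p, _ = np.histogram(x, bins=bins)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Synthetic feature profiles: node 1 is driven by node 0, the rest are independent
rng = np.random.default_rng(0)
n_nodes, n_samples = 5, 2000
data = rng.normal(size=(n_nodes, n_samples))
data[1] = data[0] + 0.3 * rng.normal(size=n_samples)

# Normalized pairwise MI matrix; declare an edge where it exceeds a threshold
nmi = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    for j in range(i + 1, n_nodes):
        denom = min(entropy_1d(data[i]), entropy_1d(data[j]))
        nmi[i, j] = nmi[j, i] = discrete_mi(data[i], data[j]) / denom

edges = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
         if nmi[i, j] > 0.2]
print(edges)
```

The threshold 0.2 and the binning are illustrative choices; a production pipeline such as aracne uses its own MI estimator and significance thresholds.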
Distinguishing between direct and indirect interactions
With these definitions in mind, let us consider two random variables, X and Y, whose mutual information is MI(X, Y). Now consider a third random variable, Z, that is a (probabilistic) function of Y only. It can be shown that P_{Z|X,Y} = P_{Z|Y}, which in turn implies that P_{X|Y,Z} = P_{X|Y}, as follows from Bayes' theorem.
An information-theoretical theorem called the Data Processing Inequality (DPI) states that Z cannot have more information about X than Y has about X; that is, MI(X;Z) ≤ MI(X;Y). We can see that MI(X;Z) = H(X) − H(X|Z) ≤ H(X) − H(X|Y, Z) = H(X) − H(X|Y) = MI(X;Y). The inequality follows because conditioning on an extra variable (in this case Y as well as Z) can only decrease entropy (in a similar way to what occurs in statistical physics when adding constraints to a thermal system). A formal statement of the theorem is as follows:
Definition 1
Three random variables X, Y and Z are said to form a Markov chain (in that order), denoted X → Y → Z, if the conditional distribution of Z depends only on Y and is conditionally independent of X; i.e. once Y is known, also knowing X adds nothing to what we already know about Z.
If X, Y and Z form a Markov chain, then the Joint Probability Distribution can be written as follows:
P(X,Y,Z) = P(X)P(Y|X)P(Z|Y)
(9)
Theorem 1
Data Processing Inequality: If X, Y and Z form a Markov chain, then
\text{MI}(X;Z) \le \text{MI}(X;Y)
(10)
Proof
By the chain rule for mutual information we can state that:
\text{MI}(X;Y,Z) = \text{MI}(X;Z) + \text{MI}(X;Y|Z) = \text{MI}(X;Y) + \text{MI}(X;Z|Y)
By the Markov property, since X and Z are conditionally independent given Y, MI(X;Z|Y) = 0. Then, since MI(X;Y|Z) ≥ 0, we have MI(X;Z) ≤ MI(X;Y). □
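Theorem 1 can be checked numerically on a toy Markov chain. In the sketch below (Python/numpy), X → Y → Z is built from two binary symmetric channels (illustrative kernels, chosen freely), the joint distribution is assembled as in equation 9, and the two mutual informations of equation 10 are compared.

```python
import numpy as np

def mi_from_joint(p):
    """Mutual information (in bits) of a 2-D joint PMF."""
    p_x = p.sum(axis=1, keepdims=True)
    p_y = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return np.sum(p[mask] * np.log2((p / (p_x * p_y))[mask]))

# Markov chain X -> Y -> Z (equation 9) with binary symmetric channel kernels
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9]])
p_z_given_y = np.array([[0.8, 0.2],
                        [0.2, 0.8]])

# p(x, y, z) = p(x) p(y|x) p(z|y), shape (2, 2, 2) indexed [x, y, z]
p_xyz = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]

mi_xy = mi_from_joint(p_xyz.sum(axis=2))  # marginalize out Z
mi_xz = mi_from_joint(p_xyz.sum(axis=1))  # marginalize out Y
print(mi_xz <= mi_xy)  # True: the Data Processing Inequality (equation 10)
```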
In reference (Margolin et al. 2006), the application of the DPI has shown that if nodes ϑ_1 and ϑ_3 interact only through a third node ϑ_2 within a given network, then MI(ϑ_1, ϑ_3) ≤ min[MI(ϑ_1, ϑ_2); MI(ϑ_2, ϑ_3)]. Hence, the least of the three MI values may come from an indirect interaction. The proposed algorithm examines each vertex triplet for which all three MIs are measured and compares them to some threshold value MI_0. If there is an edge with an MI value below the threshold, then it is removed (see Figure 1). The DPI is thus useful to quantify efficiently the dependencies among a large number of nodes. The DPI algorithm is also useful in the problem of complex network structure assessment, since it eliminates those statistical dependencies that might be of an indirect nature.
In some cases, however, it may happen that the Markov chain structure is not strictly fulfilled, say when nodes ϑ_1 and ϑ_3 interact not only through a third node ϑ_2 but also by means of a direct interaction. In this case ϑ_1 and ϑ_3 are twofold connected, and pruning out one of the links may render an inaccurate version of the actual interaction pattern. This scenario can be accounted for by establishing a tolerance threshold for removing a link: the link with the least MI is removed only if its value lies below a certain threshold (that we hereon call DPI_tol), to be determined on a case-by-case basis by close examination of the network and by considering its intended applications; otherwise it stays in the network (Figure 2).
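The triplet-pruning rule with the DPI_tol tolerance can be sketched as follows. This is a hypothetical minimal illustration, not the aracne code used later in this work: the helper `dpi_prune`, its edge-dictionary input and the MI values are all assumptions made for the example, with `dpi_tol` playing the role of DPI_tol in the text.

```python
import itertools

def dpi_prune(mi, dpi_tol=0.0):
    """DPI-style pruning sketch: for every fully connected triplet of nodes,
    drop the weakest of the three edges if its MI lies below dpi_tol
    (hypothetical helper; mi maps node pairs to MI values)."""
    edges = {frozenset(e): w for e, w in mi.items()}
    nodes = {n for e in edges for n in e}
    to_remove = set()
    for a, b, c in itertools.combinations(sorted(nodes), 3):
        triplet = [frozenset({a, b}), frozenset({b, c}), frozenset({a, c})]
        if all(e in edges for e in triplet):
            weakest = min(triplet, key=lambda e: edges[e])
            if edges[weakest] < dpi_tol:
                to_remove.add(weakest)
    return {tuple(sorted(e)): w for e, w in edges.items() if e not in to_remove}

# Nodes 1 and 3 talk mostly through node 2: MI(1,3) is the least of the triplet
mi = {(1, 2): 0.9, (2, 3): 0.8, (1, 3): 0.1}
print(dpi_prune(mi, dpi_tol=0.3))   # the weak (1, 3) edge is pruned
print(dpi_prune(mi, dpi_tol=0.05))  # 0.1 is above tolerance; all edges stay
```

A twofold-connected pair thus keeps its direct link whenever its MI clears the tolerance, as the text requires.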
There are approaches similar to the one just presented, for instance those in references (Liang and Wang 2008) and (Zhang et al. 2012). Both are based on conditional mutual information (i.e. the amount of information a variable X and a variable Y share, given a third variable, or group of variables, Z). These algorithms try to account for indirect links by conditioning the associated mutual information distributions. In reference (Zhang et al. 2012), the authors further improve the performance of the algorithm by using an adaptive computation framework. In addition to these general methods with good performance (references (Liang and Wang 2008; Zhang et al. 2012) were developed for gene regulatory networks, although with minimal adjustments they can be applied to any other probabilistically inferred network), there are also more specific approaches based on somewhat ad hoc considerations. We can mention, for instance, the MARINa algorithm (Lefebvre et al. 2010; Lefebvre et al. 2007; Mani et al. 2008), developed specifically for the assessment and reconstruction of gene regulatory networks based on statistical enrichment of certain signatures (Subramanian et al. 1554), an approach close in philosophy to that of conditioning variables which, however, requires additional information (i.e. the signatures themselves) to be useful; it is hence more restricted in its scope and applications, as are approaches relying on additional phenotypic information (Wu et al. 2009; Yu et al. 2006).
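The conditional mutual information idea behind these approaches can be illustrated with a small sketch (Python/numpy; the chain X → Z → Y and its channel kernels are illustrative assumptions): when the dependence between X and Y is entirely mediated by Z, conditioning on Z drives I(X;Y|Z) to zero, flagging the X-Y link as indirect.

```python
import numpy as np

def conditional_mi(p_xyz):
    """I(X;Y|Z) in bits, from a 3-D joint PMF indexed as p_xyz[x, y, z]."""
    p_z = p_xyz.sum(axis=(0, 1))
    cmi = 0.0
    for z in range(p_xyz.shape[2]):
        if p_z[z] == 0:
            continue
        p_xy = p_xyz[:, :, z] / p_z[z]          # p(x, y | z)
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        mask = p_xy > 0
        cmi += p_z[z] * np.sum(p_xy[mask] * np.log2((p_xy / (p_x * p_y))[mask]))
    return cmi

# Toy chain X -> Z -> Y: every X-Y dependence is mediated by Z
p_x = np.array([0.5, 0.5])
p_z_given_x = np.array([[0.9, 0.1], [0.1, 0.9]])  # p(z | x)
p_y_given_z = np.array([[0.8, 0.2], [0.2, 0.8]])  # p(y | z)

# p(x, y, z) = p(x) p(z|x) p(y|z), shape (2, 2, 2) indexed [x, y, z]
p_xyz = (p_x[:, None, None]
         * p_z_given_x[:, None, :]
         * p_y_given_z.T[None, :, :])

# Conditioning on the mediator Z removes the (indirect) X-Y dependence
print(np.isclose(conditional_mi(p_xyz), 0.0))  # True
```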
Data and algorithms
In order to introduce the importance of telling direct network interactions from indirect ones, we performed a topological analysis on the gene network of the fruit fly, Drosophila melanogaster (D.m.). The fruit fly gene network is a paradigmatic system for genetic studies and one of the best-annotated organisms in genomics databases. It also presents a high genomic similarity to mammals, humans included (about 61% of disease-associated genes in humans have a D.m. counterpart), and there is open access to its high-throughput inferred biological network (Costello et al. 2009). By discussing some features of the network structure of this highly studied species, we introduce the problem of finding direct and indirect interactions in complex networks inferred from experimental data. Once this problem has been outlined, we proceed to illustrate how the methods of information theory may be appropriate to distinguish between direct and indirect interactions in order to sketch (at least partially) the network structure, on a gene regulatory network inferred from experimental data obtained from 1191 whole-genome gene expression experiments in breast tissue from breast cancer patients/controls, and on a social network of co-authorship collaborations among researchers at Mexico's National Institute of Genomic Medicine, retrieved from the PubMed database.
As explained before, the information theoretical methods used here correspond to the implementation of MI calculations and the DPI to infer and prune, respectively, such networks. There is a number of different methods for computing these quantities in the literature (Hernández-Lemus and Rangel-Escareño 2011), and most of them are quite functional and almost equivalent in performance. Here we used the C++ implementation of the aracne algorithm for the biological networks (in particular we resort to aracne 1.0, even though there is a newer version 2.0 with improved algorithmic complexity, because version 2.0 uses a bootstrapping method that we have found to be still somewhat unstable) (Margolin et al. 2006), and Python scripts (some customized and others from the NetworkX library) for the social networks. The aracne 1.0 algorithm is useful for our purposes since it is based on crystal-clear MI calculations (Hernández-Lemus and Rangel-Escareño 2011), it allows DPI thresholds to be implemented, and its algorithmic complexity and performance are quite good (we have benchmarked aracne 1.0 in the past against other information-theoretical methodologies, such as Information-Based Similarity (ibs) and linear correlation predictors (Hernández-Lemus et al. 2009), with very acceptable results). Cytoscape and Python's NetworkX library were used to depict and analyze the networks (Assenov et al. 2008).
Microarray preprocessing of the data was performed using the affy library in BioConductor running under [R] on a 128 GB RAM, 8 Power5+ dual-core processor, symmetric multiprocessing (SMP) unit by IBM. All statistical tests were performed on a Dell Precision series 16 GB RAM quad-core workstation using the limma package in [R]/BioConductor. Information theoretical measure calculations for the biological systems were performed with the aracne v 1.0 program on the IBM SMP machine. Python scripts were used instead for the social network calculations. Graphical depiction and network analyses were performed on a MacBook Pro, 8 GB, i7.
The Drosophila melanogaster GRN used to highlight the presence of hierarchical structure was not further used here to demonstrate IT methods of network assessment. The reason is that it was inferred (Costello et al. 2009) using the Pearson correlation metric, which is a linear measure and thus unable to capture the whole statistical dependency spectrum. Let us recall that for two statistically independent random variables the Pearson correlation coefficient is 0; the converse, however, is not always true, because the Pearson correlation coefficient detects only linear dependencies between two variables. A null Pearson correlation coefficient implies statistical independence only in the special case of jointly normal distributions. Since this is not the general case in gene expression distributions, values of linear correlations are not enough to determine statistical dependency (Hernández-Lemus et al. 2009).