
Computing and Information Systems
Volume 3 Number 1



Self Organising Nonlinear Artificial Neural Network Architectures Which Separate Independent Component Signals From Their Mixture: A Preliminary Review

Mark Girolami


This paper reviews three ANN (artificial neural network) architectures and learning algorithms which separate mixtures of signals into their components. The first approach examines nonlinear extensions to linear PCA (principal components analysis) learning algorithms. The second extends Linsker's Infomax principle to nonlinear units and maximises the mutual information between the inputs and outputs of the network for the zero noise case. The final approach utilises a recursive interconnected linear network and a nonlinear learning rule which minimises the mean squared variance of the linear output neurons.

BLIND SEPARATION OF SOURCES

Separation of signal sources is the problem of separating N linear mixtures of N independent signals into their components. If N sources S are mixed by an unknown matrix A, the received signals X = AS must then be separated back into the original components S. With no a priori knowledge of the sources or the mixing matrix the problem is known as the 'blind separation of sources' or 'cocktail party' problem.
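To make the mixing model concrete, the short numpy sketch below (illustrative only: the source distributions, sample count and seed are assumptions, not taken from the paper) forms X = AS from two independent sources and a random non-singular mixing matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two statistically independent sources (N = 2), 1000 samples each:
# a uniform (sub-Gaussian) source and a Laplacian (super-Gaussian) source.
S = np.vstack([rng.uniform(-1, 1, 1000),
               rng.laplace(0, 1, 1000)])

A = rng.normal(size=(2, 2))   # unknown, non-singular mixing matrix
X = A @ S                     # received mixtures; only X is observed

# The 'blind' problem: recover S (up to scaling and permutation)
# from X alone, with no knowledge of A or the source statistics.
```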

1. LINEAR PCA TYPE LEARNING

Consider a single layer network (Figure 1) with L input neurons and M linear output neurons where M << L. The output of each neuron is defined by

yj = Σi Wji Xi    (1)

where Wji and Xi are the components of the weight matrix and the input data vector respectively.

Applying Oja's [1] symmetric Hebbian learning rule with multiplicative constraints will cause the network weight matrix to converge to a subspace spanned by the most significant eigenvectors of the input data covariance matrix.

ΔWji = η yj ( Xi - Σk yk Wki )    (2)

Figure 1

The outputs of the network will then be decorrelated and their variance maximised, so the information content of the L-dimensional input vectors is maximally transferred by the M-dimensional output vectors in the mean square sense.
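A minimal numpy sketch of (1) and (2) is given below, assuming white Gaussian training data; the dimensions and learning rate follow the simulation settings reported later in this paper, while the data, seed and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
L, M, eta = 14, 4, 0.001                 # input size, output neurons, learning rate

W = rng.normal(scale=0.1, size=(M, L))   # weight matrix, one row per output neuron
X = rng.normal(size=(5000, L))           # zero-mean input vectors (assumed data)

# Oja's symmetric subspace rule: delta_Wji = eta * yj * (Xi - sum_k yk * Wki)
for x in X:
    y = W @ x                            # linear outputs, as in (1)
    W += eta * np.outer(y, x - W.T @ y)  # update (2)

# After convergence the rows of W span the M-dimensional principal
# subspace of the input covariance matrix, and W W^T is close to I.
print(np.round(W @ W.T, 2))
```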

PCA networks can be utilised for data compression, decorrelation of data, and maximisation of data variance. PCA has also been applied to the pruning of excess parameters in trained MLP (multi-layer perceptron) networks to improve their generalisation ability [2].

The input data may be a mixture of statistically independent components; the uncorrelated outputs will, however, still be a mixture of these subcomponents. In many cases it is desirable to separate the input data into its independent subcomponents, and it has been shown in [3] that higher order moments are required for this than are yielded by PCA.

A Gaussian distribution is fully described by its first- and second-order moments (mean and variance); skewed or kurtotic distributions, however, require third- and fourth-order moments for their description. Linear PCA networks are based on second-order moments, so applying a suitable nonlinearity to the output neurons introduces higher order moments into the network processing. Fyfe & Baddeley [4] have investigated the higher order moments introduced in a nonlinear PCA network and their use for EPP (exploratory projection pursuit). We now turn our attention to nonlinear networks.
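To make the role of these moments concrete, the purely illustrative snippet below compares sample skewness and excess kurtosis for uniform (sub-Gaussian), Gaussian and Laplacian (super-Gaussian) data using scipy.stats; the distributions and sample size are assumptions, not data from the paper.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(2)
n = 100_000

samples = {
    "uniform (sub-Gaussian)":     rng.uniform(-1, 1, n),
    "Gaussian":                   rng.normal(0, 1, n),
    "Laplacian (super-Gaussian)": rng.laplace(0, 1, n),
}

for name, s in samples.items():
    # kurtosis() returns excess kurtosis: negative for platykurtic,
    # near zero for Gaussian, positive for leptokurtic distributions.
    print(f"{name:28s} skew={skew(s):+.2f}  kurtosis={kurtosis(s):+.2f}")
```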

2. NONLINEAR PCA TYPE LEARNING

Karhunen and Joutsensalo [5] have derived from Oja's subspace algorithm (2) three nonlinear equations by introducing a nonlinearity in one of three ways:

(3)

(4)

(5)

The strength of the nonlinearity increases as one, two and three nonlinear terms are introduced into (3), (4) and (5) respectively.

Karhunen & Joutensalo state that the mildly nonlinear algorithm (3) will extract the actual eigenvectors of the covariance matrix. It is interesting to note that although the symmetry of the network is preserved, the nonlinearity in the decay term has introduced asymmetry to the learning phase. So the symmetric equation (3) performs a PCA on the input data, as would the non-symmetric generalised hebbian algorithm (GHA). The author has verified this in preliminary simulations with data corrupted with white noise. Work is required to show that (3) performs well in coloured noise, and does yield superior results to (2).

Equation (5) is derived from minimising the reconstruction error of approximating a vector x of length L in terms of M basis vectors w(1) ... w(M), where the error vector e is given by:

e = x - Σi (w(i)^T x) w(i),   i = 1 ... M

The cost function to be minimised with respect to W is then

(6)

If f1 is chosen as the square of the error, it can be shown that (5) is an approximation to a stochastic gradient descent algorithm for minimising the mean square representation error.
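A minimal numpy sketch of an update of this type is shown below. It assumes the commonly quoted nonlinear PCA subspace form ΔW = η g(y) (x - W^T g(y))^T with y = Wx and g = tanh; this is intended as an illustration of the idea rather than a verbatim implementation of (5).

```python
import numpy as np

def nonlinear_pca_step(W, x, eta=0.001, g=np.tanh):
    """One stochastic step of a nonlinear PCA subspace style update.

    Assumed form: y = W x;  e = x - W^T g(y);  delta_W = eta * g(y) e^T,
    an approximate stochastic gradient step on the squared
    reconstruction error ||e||^2.
    """
    y = W @ x          # linear activations of the M output neurons
    gy = g(y)          # nonlinearity introduces higher order moments
    e = x - W.T @ gy   # reconstruction error of x from the basis vectors
    return W + eta * np.outer(gy, e)
```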

Equation (4) is derived by maximising the response of neuron j under the orthonormality constraints w(i)^T w(j) = δij.

Karhunen and Joutsensalo show that the resulting objective function is maximised by stochastic gradient ascent using (4).

Simulations

The author has carried out simulations using (3), (4) and (5) on a linear mixture of two sinusoids and Gaussian noise. A tanh() nonlinearity was used for g(x); work is required to understand the effect of the choice of nonlinearity on network performance. The frequencies and amplitudes of the sinusoids were:

f1 = 0.11, f2 = 0.2, A1 = 0.8, A2 = 1.2

and the mixture equation used was a simple summation of the individual sinusoids with the noise.
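The Figure 2 signal can be reproduced along the following lines; the number of samples, the noise level and the interpretation of the frequencies as cycles per sample are assumptions, as they are not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1000                      # number of samples (assumed)
t = np.arange(n)

# Two sinusoidal sub-components plus Gaussian noise, simply summed.
f1, f2, A1, A2 = 0.11, 0.2, 0.8, 1.2     # frequencies taken as cycles per sample
s1 = A1 * np.sin(2 * np.pi * f1 * t)
s2 = A2 * np.sin(2 * np.pi * f2 * t)
noise = 0.1 * rng.normal(size=n)         # noise level assumed

x = s1 + s2 + noise                      # the mixture signal of Figure 2

# Network input vectors: fourteen successive samples of x, as described below.
X = np.array([x[i:i + 14] for i in range(n - 14)])
```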

Figure 2

The input vector consisted of fourteen successive samples of the signal in Figure 2, and four output neurons were employed. With a learning rate of 0.001 rapid convergence was seen; no annealing of the learning rate was used. The effect of annealing is being investigated as ongoing work. Figures 3 and 4 show the outputs of neurons 1 and 2; the remaining neurons had similar outputs.

Separation of the component signals from the input mixture has taken place.

Figure 3

Figure 4

This simulation needs to be extended to investigate the effects of:

  1. Signals with a greater number of sub-components.
  2. Sub-components which are less easily distinguished from each other.
  3. The addition of coloured noise, and varying power densities.
  4. Annealing of the learning rate on convergence and final results.
  5. Choice of nonlinearity on network performance.
  6. Complex real data, e.g. voice signatures, non-stationary inputs, and images.

The same input data set was used with (4), which maximises the response of the output neurons. This robust PCA algorithm, as it is referred to in [5], is claimed to be resistant to outliers and impulsive noise; this has yet to be confirmed by simulation. The weights of the network converged to an orthonormal basis. It is interesting to note that the first three output neurons produced uncorrelated but still mixed outputs (though there are indications that separation is developing), while the fourth neuron successfully separated the higher frequency subsignal. Further work is required to investigate the performance of the 'robust' PCA algorithm over a range of data sets.

These preliminary simulations show that :

1) The strongly nonlinear learning rule derived from the minimisation of the approximation error has signal separation and noise filtering properties. To what extent these properties can be extended, and how robust they are to ranges of input data, will require further work. How close the converged weight vectors are to a full ICA (independent component analysis) is a question that will be answered by further research.

Figure 5

Figure 6

Figure 7

Figure 8

2) The second learning rule (4) causes the weights to converge to an orthonormal basis and may perform weak component separation. If this algorithm is found to be resistant to noise and outliers, it may prove to be an attractive alternative to standard statistical methods.

3) The learning rule with one nonlinearity (3) will converge to the actual principal eigenvectors and so in many instances is more useful than Oja's subspace algorithm.

3. INFORMATION MAXIMISATION

Bell and Sejnowski [6] propose an N x N neural network architecture, based on a nonlinear extension of Linsker's [7] information maximisation network, which appears to perform blind separation and blind deconvolution in most cases. For nonlinear neurons in the zero noise case, properties arise which are additional to those of the linear situation. As outlined above, nonlinearities pick up higher order moments of the input data distribution, and this has been shown to allow separation of statistically independent components.

Laughlin [8] shows that maximum information transfer through a sigmoid occurs when the sloping part of the sigmoid is optimally aligned with the peak of the input probability density function (pdf). A stochastic gradient ascent algorithm is derived to effect this alignment. Bell & Sejnowski state that the algorithm performs ICA; however, Karhunen & Joutsensalo have shown that although (5) has signal separating properties the weight vectors do not converge fully to the ICA vectors. It will be a worthwhile exercise to assess the extent to which this algorithm performs full ICA on input data.

Shannon's information theory shows that the mutual information that an output Y has about an input X is given by

I(Y,X) = H(Y) - H(Y | X)

As the second term on the right hand side is the conditional entropy of Y given X (i.e. the entropy of Y not attributable to X), it can be considered as noise. Taking derivatives with respect to the weights we have

∂I(Y,X)/∂w = ∂H(Y)/∂w

H(Y | X) does not depend on w and so does not appear in the above equation. The equation shows that for deterministic mappings of the data X to the outputs Y, the mutual information between input and output can be maximised by maximising the entropy of the output alone. Bell & Sejnowski show that for a sigmoidal nonlinearity (for a single neuron) the input and output pdfs are related by:

fy(y) = fx(x) / |∂y/∂x|

The entropy of the output is the negative expectation of the natural log of the output pdf, H(y) = -E[ ln fy(y) ], which with the relation above gives

H(y) = E[ ln |∂y/∂x| ] - E[ ln fx(x) ]

As the pdf of the input data does not depend on the weights, the second term on the right hand side is constant irrespective of the weights.

For a sigmoid y = 1 / (1 + e^-u), where u = wx + w0 is the weighted input to the neuron plus bias, stepping the weights in proportion to ∂H(y)/∂w will yield a gradient ascent algorithm to maximise the output entropy. Bell & Sejnowski derive this for the single neuron case:

Δw = η ( 1/w + x (1 - 2y) )    (6)

A similar rule is derived for bias weights.

Δw0 = η (1 - 2y)    (6 b)

We can see that (6 b) acts to align the largest value of the pdf of x with the point of maximum slope of the sigmoid; the (1 - 2y) term acts as a balance, altering the direction and magnitude of the weight delta. (6) will scale the slope of the sigmoid to match the variance of the input pdf; high variance inputs will lead to gently sloping sigmoids.

The extension of (6) to N inputs and N outputs is detailed in [6] and yields the following in matrix format:

ΔW = η [ (W^T)^-1 + (1 - 2y) x^T ]    (7)

where y is the vector of sigmoidal outputs.
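As a sketch of how an update of the form (7) might be coded, the following numpy fragment performs a single stochastic step; the logistic nonlinearity follows [6], while the learning rate and calling convention are assumptions.

```python
import numpy as np

def infomax_step(W, b, x, eta=0.01):
    """One stochastic step of the information maximisation rule.

    Assumed form (logistic units):
        y  = 1 / (1 + exp(-(W x + b)))
        dW = eta * ( inv(W^T) + (1 - 2y) x^T )
        db = eta * (1 - 2y)
    """
    u = W @ x + b
    y = 1.0 / (1.0 + np.exp(-u))                      # sigmoidal outputs
    W = W + eta * (np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x))
    b = b + eta * (1.0 - 2.0 * y)
    return W, b
```

The (W^T)^-1 anti-redundancy term is what makes each step relatively expensive, which is the matrix inversion bottleneck noted in the conclusions.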

Simulations

Preliminary simulations have been carried out using the Bell & Sejnowski network, with the same artificial data as in Section 2. A random non-singular 2 x 2 mixing matrix was used. Figures 9 to 12 show that separation has occurred, although there appears to be a residual base signature with which the individual components are modulated.

Figure 9

Figure 10

Figure 11

Figure 12

Extensions of these simulations to real data are ongoing.

Bell & Sejnowski conjecture that if the inputs to the network are platykurtic (negative kurtosis) then (7) may not maximise the information transfer and so will not minimise mutual information of the output neurons. This requires to be confirmed with simulations using sub gaussian data. [6] shows the probability density functions of most naturally occurring sounds such as speech and music are supergaussian and leptokurtic (positive kurtosis) in nature. So for practical signal processing instances this may not be of concern. For the range of possible neural applications sub gaussian data cannot be ignored. In [6] a proposal to utilise a 'flexible' sigmoid to overcome the pdf mismatching problem is made. Baram & Roth have proposed a sigmoid which is shaped by two parameters [9], this is referred to in [6] but is not developed further. This is viewed as an important area for further research.

4. A RECURSIVE LINEAR NETWORK

Jutten and Herault [10] proposed a linear neural architecture viewed as a recursive linear adaptive filter. The problem to be solved is similar to that described by Bell & Sejnowski, where the input vector X is a linear mixture of the sources S, given by X = AS with A the unknown mixing matrix. The recursive architecture of [10] has N outputs and N inputs, defined as:

yi = xi - Σ(j ≠ i) wij yj    (8)

After some simple algebraic manipulation the output of the network in matrix format is given by

Y = (I + W)^-1 X    (9)

Appropriate values of W will then provide source separation, making Y proportional to S, the source vector.

Jutten & Herault derive the final adaption rule based on the assumption that the system is already close to convergence. This assumption causes stability problems when attempts are made to extend the network to greater than two inputs [6].

For the 2 x 2 case the rate of change of the output variance with respect to the weights is given by :

(10)

where the q term is the jjth component of the inverted matrix given in (9). As we wish to minimise the variance of the outputs, we require the minimum point of (10) and the corresponding weight values. By iteratively updating the weight vector so that at each iteration we move a short distance in the direction of the greatest rate of decrease of output variance, a technique known as gradient descent, we then have

(11)

(11) will converge when Δwij = 0, and therefore when < yj yi > = 0, i.e. the time-averaged covariance is zero and so the outputs are uncorrelated, as would be expected from the second order moments yielded in (11). Applying suitable nonlinearities to (11) will then provide an independence test.

(12)

The assumption is made that the output data probability density function is even, an extreme constraint; however, Jutten & Herault argue that most naturally occurring signals to be used in this application have even pdfs. [11] and [12] investigate these constraints, and a detailed analysis of the stability of (12) is given. A detailed review of this will inform further work.

With the pdf of the output data even, the odd moments of the data will be zero; by choosing odd functions for f(x) and g(x), (12) will then provide a statistical independence test at the point of convergence.

Applying a Taylor expansion to the odd functions f(x) and g(x) gives :

(13)

(14)

Applying these to (12) and time averaging

For statistical independence

The final adaptation rule is given as

Δwij = η f(yi) g(yj),   i ≠ j    (15)
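A minimal numpy sketch of how (8), (9) and (15) fit together is shown below; the discrete-time step, learning rate and the cube and arctan choices for f() and g() (as used in the simulation which follows) are stated assumptions rather than a verbatim transcription of [10].

```python
import numpy as np

def herault_jutten_step(W, x, eta=0.01, f=lambda y: y**3, g=np.arctan):
    """One discrete step of the recursive separating network.

    The outputs satisfy y = x - W y, i.e. y = (I + W)^-1 x as in (9);
    the off-diagonal weights are then adapted with the odd functions
    f and g as in (15): delta_w_ij = eta * f(y_i) * g(y_j), i != j.
    """
    n = len(x)
    y = np.linalg.solve(np.eye(n) + W, x)   # recursive network output (9)
    dW = eta * np.outer(f(y), g(y))         # adaptation rule (15)
    np.fill_diagonal(dW, 0.0)               # no self-connections
    return W + dW, y
```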

Simulations

The same data set was used as in the previous two simulations; cube and arctan functions were used for f() and g(), and rapid convergence was found. This was not surprising, as the weights were initialised to zero. The theoretical values for the final weight matrix indicated that, provided the learning rate was suitably small, the converged weight values would be reached in a small number of iterations. The weight matrix reached in the simulation provided separation of the input into its independent sub-components. It is reported in [6] that difficulty was found in extending the network beyond three outputs. The choice of f() and g() is critical, and it appears that this is the key to successful separation. Further testing with artificial and real data is being carried out; a further review will be available once this work is completed.

5. CONCLUSIONS

I have given a preliminary report on three neural-based signal separation methods which have appeared in the literature. [6] and [10] originate from the signal processing community, whereas [5] has been motivated by the investigation of extending linear networks with nonlinear output neurons. Karhunen & Joutsensalo have not tested the nonlinear PCA networks with real data, which is necessary in order to fully compare nonlinear PCA with the methods proposed in [6] and [10].

The method in [6] reports impressive results in separating a complex mixture of voice and music data. The matrix inversion in (7), however, may prove to be a computational bottleneck; indeed, Bell & Sejnowski utilised efficient batch matrix inversion methods for the 5 x 5 case.

Jutten & Herault's architecture and algorithm has a number of stability and extendibility concerns, further work may show that this method has been superseded by more efficient and stable techniques.

The main point of interest arising from all three is the importance of the choice of nonlinearity. [6] mentions the potential requirement to 'tune' the shape of the nonlinearity to the pdf of the input data. This is considered a field of research which will be pursued by the author.

Application of nonlinearities to the negative feedback interneuron network proposed by Fyfe [13] has been investigated. It is anticipated that further research work using this architecture will be carried out.

An assessment of the effectiveness of neural blind separation techniques over non neural ones may be a worthwhile exercise.

6. ACKNOWLEDGEMENTS

The author would like to acknowledge his indebtedness to Dr Colin Fyfe who suggested this project.

References

  • [1] Oja, E (1989) Neural Networks, Principal Components and Subspaces, International Journal of Neural Systems
  • [2] Moody, J, Leen, T, Levin, A (1994) Fast Pruning Using Principal Components, Advances in Neural Information Processing Systems 6
  • [3] Comon, P (1994) Independent Component Analysis, A New Concept?, Signal Processing 36, 287-314
  • [4] Fyfe, C, Baddeley, R (1993) Nonlinear Data Structure Extraction Using Simple Hebbian Networks, Biological Cybernetics
  • [5] Karhunen, J, Joutsensalo, J (1994) Representation and Separation of Signals Using Nonlinear PCA Type Learning, Neural Networks
  • [6] Bell, A, Sejnowski, T (1995) An Information-Maximisation Approach to Blind Separation and Blind Deconvolution, Neural Computation 7
  • [7] Linsker, R (1989) An Application of the Principle of Maximum Information Preservation to Linear Systems, Advances in Neural Information Processing Systems
  • [8] Laughlin, S (1981) A Simple Coding Procedure Enhances a Neuron's Information Capacity, Z. Naturforsch. 36, 910-912
  • [9] Baram, Y, Roth, Z (1994) Multi-Dimensional Density Shaping by Sigmoidal Networks with Application to Classification, Estimation and Forecasting, CIS Report No. 9420, Dept of Computer Science, Israel Institute of Technology
  • [10] Jutten, C, Herault, J (1991) Blind Separation of Sources, Part I: An Adaptive Algorithm Based on Neuromimetic Architecture, Signal Processing 24
  • [11] Comon, P (1991) Blind Separation of Sources, Part II: Problems Statement, Signal Processing 24, 11-20
  • [12] Sorouchyari, E (1991) Blind Separation of Sources, Part III: Stability Analysis, Signal Processing 24, 21-29
  • [13] Fyfe, C, Baddeley, R (1994) A Projection Pursuit Neural Network, Irish Neural Net Conference



    M. A. Girolami is a Lecturer at Falkirk College and a part-time Researcher at the University of Paisley.