Computing and Information Systems
Volume 3 Number 1
This paper reviews three ANN (artificial neural network) architectures and learning algorithms which separate mixtures of signals into their components. The first examines nonlinear extensions to linear PCA (principal components analysis) learning algorithms. The second example extends Linsker's Infomax principle to nonlinear units and maximises the mutual information between the inputs and outputs of the network for the zero noise case. The final approach utilises a recursive interconnected linear network and nonlinear learning rule which minimizes the mean squared variance of the linear output neurons. |
Where W_{ji} and X_{i} are the components of the weight matrix and input data vector.
Applying Oja's [1] symmetric hebbian learning rule with multiplicative constraints will cause the network weight matrix to converge to a subspace spanned by the most significant eigenvectors of the input data covariance matrix.
Figure 1
The outputs of the network will then be decorrelated and their variance maximised, so the information content of the L dimensional vectors will be maximally transferred by the M dimensional output vectors in the mean square sense.
PCA networks can be utilized for data compression, decorrelation of data, and maximisation of data variance. PCA has also been applied to the pruning of excess parameters in trained MLP (multi layer perceptron) networks to improve their generalisation ability [2].
The input data may be a mixture of statistically independent components, the uncorrelated outputs will, however, be a mixture of these subcomponents. In many cases it may be desirable to separate the input data into its independent sub components, it has been shown in [3] that higher order moments are required for this than are yielded by PCA .
A Gaussian distribution is fully described by the 1'st and 2'nd moments (mean and standard deviation), however skewed or kurtotic distributions require 3'rd and 4'th order moments for their description. Linear PCA networks are based on 2'nd order moments, the application of a suitable nonlinearity to the output neurons will then introduce higher order moments to the network processing. Fyfe & Baddeley [4] have investigated the higher order moments introduced in a nonlinear PCA network and their use for EPP (exploratory projection pursuit). We now turn our attention to nonlinear networks.
The equations show an increase in the strength of the nolinearity as 1, 2 and 3 nonlinear terms are introduced to each equation respectively.
Karhunen & Joutensalo state that the mildly nonlinear algorithm (3) will extract the actual eigenvectors of the covariance matrix. It is interesting to note that although the symmetry of the network is preserved, the nonlinearity in the decay term has introduced asymmetry to the learning phase. So the symmetric equation (3) performs a PCA on the input data, as would the non-symmetric generalised hebbian algorithm (GHA). The author has verified this in preliminary simulations with data corrupted with white noise. Work is required to show that (3) performs well in coloured noise, and does yield superior results to (2).
Equation (5) is derived from minimising the reconstruction error of approximating a vector x of length L in terms of M basis vectors w(1) ... w(M) where the error vector e is given by:
The cost function to be minimised with respect to W is then
if f_{1} is chosen as the square of the error it can then be shown (5) is an approximation to a stochastic gradient descent algorithm for minimising the mean square representation error.
Equation (4) is derived by maximising the response of neuron (j) under the constraints of orthonormality :
Karhunen and Joutensalo show that for the objective function is maximised by stochastic gradient ascent using (4).
Simulations
The author has carried out simulations using (3), (4) and (5) on a linear mixture of two sinusoids and gaussian noise. A tanh() nonlinearity was used for g(x), work is required to understand the effect of the nonlinearities on the network performance. The frequencies and amplitudes of the sinusoids were :
and the mixture equation used was a simple summation of the individual sinusoids with the noise
Figure 2
The input vector consisted of fourteen successive samples of the signal in figure 2, four output neurons were employed. With a learning rate of 0.001 rapid convergence was seen, no annealing of the learning rate was utilised. The effect of annealing will be investigated and is ongoing work. Figure 3 and 4 show the outputs of neurons 1 and 2, the remaining neurons had similar outputs.
Separation of the component signals from the input mixture has taken place.
Figure 3
Figure 4
This simulation requires to be extended to investigate the effects of :
The same input data set was used with (4), which maximises the response of the output neurons. This robust PCA algorithm, as referred to in [5] is claimed to be resistant to outliers and impulsive noise, this is yet to be confirmed by simulation. The weights of the network converged to an orthonormal basis. It is interesting to note that the first three output neurons produced uncorrelated but mixed outputs, (though there are indications that separation is developing) the fourth neuron successfully separated the higher frequency subsignal. Further work is required to investigate the performance of the 'robust' PCA algorithm for a range of data sets.
These preliminary simulations show that :
1) The strongly nonlinear learning rule derived from the minimisation of the approximation error has signal separation and noise filtering properties. To what extent these properties can be extended and how robust they are to ranges of input data will require further work. How close the converged weight vectors are to a full ICA (independent component analysis), is a question that will be answered with further research.
Figure 5
Figure 6
Figure 7
Figure 8
2) The second learning rule (4) causes the weights to converge to an orthonormal basis and may perform weak component separation. If this algorithm is found to be resistant to noise and outliers it may prove to be an attractive alternative to standard statistical methods.
3) The learning rule with one nonlinearity (3) will converge to the actual principle eigen vectors and so in many instances is more useful than Oja's subspace algorithm.
Laughlin [8] shows that maximum information transfer occurs through a sigmoid when the sloping part of the sigmoid is optimally aligned with the highest values of the input data (pdf) probability density function. A stochastic gradient ascent algorithm is derived to effect this alignment. Bell & Sejnowski state that the algorithm defined performs ICA, however, Karhunen & Joutensalo have shown that although (5) has signal separating properties the weight vectors do not converge fully to the ICA vectors. It will be a worthwhile exercise to assess the extent to which this algorithm performs full ICA on input data.
Shannon's information theory shows that the mutual information that an output Y has about an input X is given by
as the second term on the right hand side is the conditional entropy of Y given X (ie the entropy of Y not attributed to X) this can be considered as noise on the inputs. Taking derivatives with respect to the weights we have
H(Y | X) is not dependent on w so does not appear in the above equation. The equation shows that for deterministic transfers of data X to outputs Y the mutual information between input and output can be maximised by maximising the entropy of the output alone. Bell & Sejnowski show that for a sigmoidal nonlinearity (for a single neuron) the input and output pdf's are related by :
The entropy of the output H(y) is equal to the expectation of the natural log of the output pdf.
As the input data pdf is independent of y the second term on the right hand side is constant irrespective of the weights.
For a sigmoid where u is the input to the neuron, wx + bias terms, then will yield a gradient ascent algorithm to maximise the output entropy. Bell & Sejnowski derive this for the single neuron case :
A similar rule is derived for bias weights.
We can see that (6 b) acts to align the largest value of the pdf of x with the point of maximum slope of the sigmoid. The (1 - 2y) term acts as a balance in altering the direction and magnitude of the weight delta. (6) will scale the slope of the sigmoid to match the variance of the input pdf, high variance inputs will lead to gently sloping sigmoids.
The extension of (6) to N inputs and N outputs is detailed in [6] and yields the following in matrix format :
Simulations
Preliminary simulations have been carried out using the Bell & Sejnowski network, using the same artificial data as found in Section 2. A random non-singular 2 x 2 mixing matrix was used Figures 9 to 12 show that separation has occurred, although there appears to be a residual base signature which the individual components are modulated with.
Figure 9
Figure 10
Figure 11
Figure 12
Extensions of these simulations to real data are ongoing.
Bell & Sejnowski conjecture that if the inputs to the network are platykurtic (negative kurtosis) then (7) may not maximise the information transfer and so will not minimise mutual information of the output neurons. This requires to be confirmed with simulations using sub gaussian data. [6] shows the probability density functions of most naturally occurring sounds such as speech and music are supergaussian and leptokurtic (positive kurtosis) in nature. So for practical signal processing instances this may not be of concern. For the range of possible neural applications sub gaussian data cannot be ignored. In [6] a proposal to utilise a 'flexible' sigmoid to overcome the pdf mismatching problem is made. Baram & Roth have proposed a sigmoid which is shaped by two parameters [9], this is referred to in [6] but is not developed further. This is viewed as an important area for further research.
After some simple algebraic manipulation the output of the network in matrix format is given by
So relevant values of W will then provide source separation making Y proportional to X the source vector.
Jutten & Herault derive the final adaption rule based on the assumption that the system is already close to convergence. This assumption causes stability problems when attempts are made to extend the network to greater than two inputs [6].
For the 2 x 2 case the rate of change of the output variance with respect to the weights is given by :
where the q term is the jj'th component of the inverted matrix given in (9). As we wish to minimise the variance of the outputs we require the minimum point of (10), and the corresponding weight values. By iteratively updating the weight vector so that at each iteration we move a short distance in the direction of the greatest rate of decrease of output variance, a technique known as gradient descent, we then have
(11) will converge when therefore < y_{j}y_{i} = 0 > ie the time averaged covariance is zero and so the outputs are uncorrelated, as would be expected due to the second order moments yielded in (11). Applying suitable nonlinearities to (11) will then provide an independence test.
The assumption is made that the output data probability density function is even, an extreme constraint, however Jutten & Herault argue that most natural occurring signals to be used in this application have even pdf's. [11] and [12] investigate these constraints and a detailed analysis of the stability of (12) is given. A detailed review of this will inform further work.
With the pdf of the output data even then the odd moments of the data will be zero, by therefore choosing odd functions for f(x) and g(x), (12) will then provide a statistical independence test at the point of convergence.
Applying a Taylor expansion to the odd functions f(x) and g(x) gives :
Applying these to (12) and time averaging
For statistical independence
The final adaption rule is given as
Simulations
The same data set was used as in the previous two simulations; cube and arc tan functions were used for f() and g(); rapid convergence was found. This was not surprising as the weights were inititialised to zero. The theoretical values for the final weight matrix were indicating that providing the learning rate was suitably small the converged weight values would be reached in a small number of iterations. The actual final weight matrix was which provided separation of the input to the independent sub-components. It is reported in [6] that difficulty was found in extending the network beyond three outputs. The choice of f() and g() is critical and it appears that this is the key to successful separation. Further testing with artificial and real data is being carried out, a further review will be available once this work is completed.
The method in [6] has impressive results detailed in separating a complex mixture of voice and music data. The matrix inversion in (7), however, may prove to be a computational bottleneck. Indeed Bell & Sejnowski utilised efficient batch matrix inversion methods for the 5 x 5 case.
Jutten & Herault's architecture and algorithm has a number of stability and extendibility concerns, further work may show that this method has been superseded by more efficient and stable techniques.
The main point of interest that arises from all three is the importance of the choice of nonlinearity. [6] mentions the potential requirements to 'tune' the shape of the nonlinearity to the pdf of the input data. This is considered as a field of research which will be pursued by the author.
Application of nonlinearities to the negative feedback interneuron network as proposed by Fyfe [13] have been investigated. It is anticipated that further research work using this architecture will be carried out.
An assessment of the effectiveness of neural blind separation techniques over non neural ones may be a worthwhile exercise.
References
1 The figure of 93kg was chosen as such a weight enables all objects to be placed in the knapsack i.e. the solution completely dominates the solution for the 50kg criterion
M. A. Girolami is a Lecturer at Falkirk College and a part-time Researcher at the University of Paisley.