The Effect of Intracortical Competition on the Formation of Topographic Maps in Models of Hebbian Learning

C. Piepenbrock and K. Obermayer¹

Fachbereich Informatik, Technische Universität Berlin, FR2-1, Franklinstraße 28/29, 10587 Berlin, Germany

email:
[email protected]

Abstract

Correlation based learning models (CBL) and self-organizing maps (SOM) are two classes of Hebbian models that have both been proposed to explain the activity driven formation of cortical maps. Both models differ significantly in the way lateral cortical interactions are treated, leading to different predictions for the formation of receptive fields. The linear CBL models predict that receptive field profiles are determined by the average values and the spatial correlations of second order of the afferent activity patterns, whereas SOM models map stimulus features. Here we investigate a class of models which are characterized by a variable degree of lateral competition and which have the CBL and SOM models as limit cases. We show that there exists a critical value for intracortical competition below which the model exhibits CBL properties and above which feature mapping sets in. The class of models is then analyzed with respect to the formation of topographic maps between two layers of neurons. For Gaussian input stimuli we find that localized receptive fields and topographic maps emerge above the critical value for intracortical competition, and we calculate this value as a function of the size of the input stimuli and the range of the lateral interaction function. Additionally, we show that the learning rule can be derived via the optimization of a global cost function in a framework of probabilistic output neurons which represent a set of input stimuli by a sparse code.
¹ Corresponding author
1 Introduction

In the past, two classes of hypotheses have been evaluated in order to explain the activity driven development of the response properties of cortical neurons and their spatial distribution across the brain's surface (see [Erwin et al., 1995] for a recent review). Both classes of hypotheses are based on activity driven Hebbian learning for the connection strengths of the afferent fibers under certain constraints. According to the first hypothesis, the so-called correlation-based learning hypothesis [Linsker, 1986, Yuille et al., 1989, Miller et al., 1989, Tanaka, 1990, Miller, 1994], neural response properties are determined by the first and second moments of the distribution of afferent activity patterns which are present at the time when the cortical response properties develop. According to the second hypothesis, the so-called competitive neural network hypothesis [Takeuchi and Amari, 1979, Kohonen, 1982, Swindale, 1982, Durbin and Mitchison, 1990, Obermayer et al., 1990c, Obermayer et al., 1992, Goodhill and Willshaw, 1990, Bauer, 1995, Riesenhuber et al., 1996], spatial correlations of arbitrary order in the input patterns influence response properties and maps, and receptive fields become similar to certain features of the individual activity patterns.

The difference between both classes of hypotheses lies in the way the lateral interactions between cortical columns are taken into account. Correlation based learning models generally assume lateral cortical interactions to be linear in the sense that the activity of cortical neurons is related to their total afferent input via a convolution with a cortical interaction kernel. Competitive neural network models, on the other hand, generally assume that only a small localized region in the cortex is active for a given afferent input at a given time, located at the region which receives the strongest total input for the current afferent pattern. The different assumptions about the nature of cortical interactions (linear vs. winner-take-all) strongly affect the hypotheses' predictions for the emerging response properties and maps. When applied to the formation of cortical maps in the primary visual cortex, for example, correlation based learning models cannot predict the emergence of localized receptive fields (i.e. the fact that a cortical neuron is connected to a localized region of the LGN layers only, hence receives input from a limited part of the visual field) and a topographic map [Miller, 1994], while competitive neural networks can [Obermayer et al., 1990c], given realistic assumptions about the afferent activity patterns. The formation of eye-dominance and orientation columns in the same part of cortex is predicted by both hypothesis classes, however, under quite different assumptions about constraints and about the nature of the afferent activity patterns [Piepenbrock et al., 1997, Obermayer et al., 1990c, Riesenhuber et al., 1996]. Also, the predicted receptive fields differ in their subfield structure, and the predicted maps differ in the spatial correlations between the response properties [Erwin et al., 1995].

Although the presence and absence of nonlinearities in the lateral cortical interactions has a profound effect on model predictions, not much is known about the nature of the couplings between individual columns in the real cortex. Clearly, the assumptions of linearity and winner-take-all are simplifications at best.
The superposition principle does not hold for cortical activity patterns, as is demonstrated in the literature on contextual effects [Levitt and Lund, 1997], which argues against linearity.
Figure 1: The model architecture. The model consists of two layers of neurons called "LGN" and "cortex". Input patterns P_i^μ are propagated from the geniculate to the cortical layer and elicit output activities O_x^μ. Cortical maps emerge as a result of Hebbian learning of the synaptic connections S_ix. Intracortical interactions are described by a softmax function and a lateral interaction function I_xy.
The winner-take-all principle does not hold in its pure form either, as becomes clear from the distributed activity patterns seen for extended stimuli, for example, in the visual cortex. But if the nature and the strength of the nonlinear contributions has not yet been determined, how many of the model predictions may actually be of value, i.e. how many of the model predictions are actually robust against the degree of nonlinearity in the lateral cortical interactions?

In this contribution we address this question, and we illustrate the influence of lateral cortical competition for a simple scenario, the formation of a topographic projection between two layers of connectionist neurons. The lateral interaction in the target layer is implemented via the application of a softmax function to the pattern of total input, and we introduce a parameter β for the degree of nonlinearity. This parameter then allows the interpolation between the above-mentioned classes of hypotheses: in the limit of weak cortical competition we recover the correlation-based learning model; in the limit of strong cortical competition we recover the self-organizing map. Using mathematical and numerical analyses we find that important properties of the limit cases are indeed robust. The low and high competition regimes are broad and are separated at a critical value of β at which the nature of the emergent response properties and the spatial maps change abruptly. Finally, a family of cost functions is provided, parametrized by the nonlinearity parameter β, which implement a sparse coding principle. The degree of sparseness is given by the value of the parameter β.

The paper is structured as follows: In section 2 we describe the model architecture and learning rule, and we show that the correlation based learning and self-organizing map models emerge as limit cases. In sections 3 and 4 we relate our model to cost functions, derive the conditions for the emergence of topographic maps, and predict the critical value of the parameter β for the case of Gaussian afferent activity patterns. Section 5 summarizes the results of numerical simulations. The paper concludes with a discussion of related modelling approaches in section 6 and a brief summary in section 7.
2 The development model

Our model (Fig. 1) consists of a set of input neurons called "LGN" (lateral geniculate nucleus) and a layer of output neurons called "cortex", in order to make contact with the
CBL and SOM literature about models of visual cortical maps. M connectionist input neurons (indexed with letters i, j or k) are fully connected with N connectionist output neurons (indexed with letters x, y or z) by connections with synaptic weights S_ix. Let P_i^μ denote the activity of the input neuron i for a pattern μ out of a given set μ = 1 ... U and O_x^μ the output of a cortical neuron x as a result of a presented pattern μ. The development of the synaptic weights S_ix in response to the input pattern μ is then described by the learning rule

\Delta S_{ix}(\mu) = \varepsilon \sum_y O_y^\mu I_{xy} P_i^\mu \quad \text{with} \quad O_y^\mu = \frac{\exp(\beta \sum_j S_{jy} P_j^\mu)}{\sum_z \exp(\beta \sum_j S_{jz} P_j^\mu)},   (1)

where ε is the learning rate and I_xy is the cortical interaction function. The mean output activities O_y^μ (right) are determined by the afferent inputs Σ_j S_jy P_j^μ subject to a nonlinear cortical activation function, a softmax function, which models a competition between the output neurons for the input stimulus and sharpens the cortical response. The parameter β determines the degree of cortical competition. The learning rule (left) is of Hebbian type, because the weight change is determined by the product of input and output activities. The interaction function I_xy (mean Ī = (1/N) Σ_x I_xy) ensures that neighboring neurons in the cortical layer perform similar learning steps. It depends only on the distance |⃗x − ⃗y| between neurons x and y and leads to the formation of cortical maps. Learning rule and output activities are chosen such that (1) is compatible with the CBL and SOM cases.

Synaptic growth under Hebbian development rules like (1) is typically unstable and needs to be constrained, otherwise the connection strengths become infinitely large. Throughout this paper we use multiplicative constraints to bound the synaptic weights for each neuron y, either by a hard constraint that enforces Σ_j (S_jy)² = const. exactly or by soft approximations to the hard constraint that take the form of biologically more appealing synaptic decay terms. The influence of subtractive constraints and weight clipping is discussed in section 6.

For the biological interpretation we refer the reader to [Miller et al., 1989, Miller, 1994, Obermayer et al., 1990c]; however, two comments need to be made here. Firstly, the softmax normalization of the neurons' outputs in eq. (1) implicitly assumes non-local lateral interactions. For the purpose of this study one may think of eq. (1) as applying to a small region in cortex defined by the spatial extent of the intracortical lateral interactions. Otherwise, an arbor function must be introduced. The purpose of this arbor function, however, is not to limit receptive field size, because the final receptive fields may be smaller than the spatial extent of the arbor if β is large enough (cf. section 5). Rather, the arbor function is needed to limit the extent of cortical activation. Secondly, eq. (1) may be interpreted the following way: The total afferent input is calculated as a weighted sum over activities of the LGN cells, but is then normalized via the influence of lateral competition. This normalization is (effectively) described by soft competition. The interaction function I_xy describes non-local effects during learning, which could for example be mediated by diffusible substances as suggested by [Miller et al., 1989, Miller, 1994]. As for eq. (1), I_xy does not mediate the competitive normalization.
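To make the update concrete, the following sketch (in Python with NumPy; it is not part of the original work, and the function and variable names are illustrative) performs a single learning step of eq. (1) followed by the hard multiplicative normalization of eq. (10):

import numpy as np

def constrained_learning_step(S, P, I, beta, eps, S_bar):
    # S: synaptic weights, shape (M, N); P: one input pattern, shape (M,)
    # I: cortical interaction function, shape (N, N)
    h = S.T @ P                                    # afferent inputs sum_j S_jy P_j
    e = np.exp(beta * (h - h.max()))               # softmax, shifted for numerical stability
    O = e / e.sum()                                # output activities O_y of eq. (1)
    S_unnorm = S + eps * np.outer(P, I @ O)        # Hebbian step: eps * sum_y O_y I_xy P_i
    # hard multiplicative constraint: sum_j S_jy^2 = M * S_bar^2 for every neuron y
    norms = np.sqrt((S_unnorm ** 2).sum(axis=0))
    return np.sqrt(S.shape[0]) * S_bar * S_unnorm / norms

Iterating this step over randomly drawn input patterns implements the model; small and large values of β then correspond to the CBL and SOM regimes discussed below.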
2.1 CBL and SOM as limit cases
Equation (1) now allows us to interpolate between the SOM and the CBL models. In the limit β → ∞ of strong cortical competition the output O_y^μ becomes a delta function, and we obtain the winner-take-all dynamics

\Delta S_{ix}(\mu) = \varepsilon I_{x,q(\mu)} P_i^\mu \quad \text{with} \quad q(\mu) = \mathrm{argmax}_y \sum_j S_{jy} P_j^\mu   (2)

of the self-organizing map (SOM) [Obermayer et al., 1990b, Obermayer et al., 1990a]. In the limit β → 0 of weak cortical competition, we expand (1) into a Taylor series around β = 0 and average over the ensemble of input patterns μ = 1 ... U to obtain the correlation based learning rule

\Delta S_{ix} = \varepsilon \frac{1}{U} \sum_{\mu,y} \Big( O_y^\mu \big|_{\beta=0} + \beta \frac{\partial O_y^\mu}{\partial \beta} \Big|_{\beta=0} + O(\beta^2) \Big) I_{xy} P_i^\mu
= \varepsilon \bar{I}\bar{P} + \varepsilon\beta \frac{1}{N} \sum_{y,j} (I_{xy} - \bar{I})\, C_{ij}\, S_{jy} + O(\beta^2).   (3)

This rule is linear in the synaptic weights S_ix and depends only on the first and second order statistics of the input patterns, i.e. the mean activity P̄ = (1/M) Σ_j P_j^μ = (1/U) Σ_μ P_j^μ and the two point correlation functions C_ij = (1/U) Σ_μ P_i^μ P_j^μ. An expansion to higher order in β would yield terms involving increasingly higher order statistics of the input patterns.

To establish the full equivalence with previous applications of CBL [Miller et al., 1989, Miller, 1994], we split the set of input neurons into different input layers and we let each index i = (a, α) now represent an input neuron at a two-dimensional location α in layer a ∈ {left, right} (two eyes for ocular dominance development) or a ∈ {ON, OFF} (different cell response properties for orientation selectivity development). Because (3) cannot explain the emergence of a topographic map, localized receptive fields and a topographic map must be hard-wired. This is usually done by an arbor function A_{αx} that forces all model weights S^a_{αx} to zero that do not represent topographic connections, for example by setting A_{αx} = 0 if the distance |⃗α − ⃗x| (measured in the input coordinate system) is larger than a given maximum receptive field radius. If I_xy is modeled by a Mexican hat function with Ī = 0 or if subtractive normalization is used, the constant growth term in (3) vanishes and we finally obtain the form used in [Miller et al., 1989, Miller, 1994] (with a learning rate η = εβ/N),

\Delta S^a_{\alpha x} = \eta\, A_{\alpha x} \sum_{y,b,\psi} I_{xy}\, C^{ab}_{\alpha\psi}\, S^b_{\psi y}.   (4)
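The two limit cases can be illustrated directly on the softmax of eq. (1). The following minimal sketch (toy sizes and β values, purely illustrative) shows that the output approaches the uniform value 1/N for small β and collapses onto the best-matching neuron for large β:

import numpy as np

def softmax_output(S, P, beta):
    # output activities O_y of eq. (1) for one input pattern P
    h = S.T @ P
    e = np.exp(beta * (h - h.max()))
    return e / e.sum()

rng = np.random.default_rng(0)
S = rng.random((50, 20))                      # M = 50 input, N = 20 output neurons (toy sizes)
P = rng.random(50)

O_weak = softmax_output(S, P, beta=1e-4)      # approx. 1/N everywhere: the CBL regime of eq. (3)
O_strong = softmax_output(S, P, beta=1e3)     # approx. a delta on the winner: the SOM limit of eq. (2)
print(np.allclose(O_weak, 1 / 20, atol=1e-3),
      O_strong.argmax() == (S.T @ P).argmax())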
3 Relation to Cost Functions

The learning rule (1) cannot be derived from a global cost function as is, because the conditions of integrability, ∂(ΔS_ix)/∂S_jy = ∂(ΔS_jy)/∂S_ix, are not satisfied. In order to appropriately modify (1), let us assume that some cost is associated with the activity O_x^μ of each output neuron. The average cost E over all neurons and input patterns is then given by

E(\{S_{ix}\}, \{O_x^\mu\}) = \frac{1}{U} \sum_{\mu,x} O_x^\mu\, \mathrm{cost}_x^\mu \quad \text{with} \quad \mathrm{cost}_x^\mu = -\sum_{y,j} I_{xy} S_{jy} P_j^\mu.   (5)

We now postulate that the set {S_ix} should minimize E. For linear connectionist units an output neuron's activity is given by

O_x^\mu = \sum_{y,j} I_{xy} S_{jy} P_j^\mu.   (6)

Gradient descent on the cost function E then yields a correlation based learning model

\Delta S_{ix} = -\varepsilon \frac{\partial E(\{S_{ix}\}, \{O_x^\mu\})}{\partial S_{ix}} = \frac{2\varepsilon}{U} \sum_{\mu,y,z,j} I_{xz} I_{zy} P_i^\mu P_j^\mu S_{jy} = 2\varepsilon \sum_{y,j} [I^2]_{xy}\, C_{ij}\, S_{jy},   (7)

which is equivalent to (4) (without an explicit arbor function A) for an effective cortical interaction function [I²]_xy = Σ_z I_xz I_zy.

In the next step, we consider binary output neurons that are either active (O_x^μ = 1) or inactive (O_x^μ = 0). The neurons interact in a competitive way such that only one neuron is active at any given time. We assume that the neurons are stochastic and that it is more likely for a neuron to become active if its associated cost is lower than the cost of the currently active one:

\mathrm{Prob}(O_x^\mu = 1 \to O_y^\mu = 1) = \frac{\exp[-\beta(\mathrm{cost}_y^\mu - \mathrm{cost}_x^\mu)]}{\sum_z \exp[-\beta(\mathrm{cost}_z^\mu - \mathrm{cost}_x^\mu)]}.

β now represents a noise parameter. For this type of stochastic network response, the stationary probability distribution over the network's state space is given by the Gibbs distribution

\mathrm{Prob}(\{S_{ix}\}, \{O_x^\mu\}) = \frac{1}{Z} \exp\big(-\beta E(\{S_{ix}\}, \{O_x^\mu\})\big),   (8)

where Z is the normalization constant (partition function). The set {S_ix} of synaptic weights should maximize (8) for a given set of input patterns on average, leading to

\mathrm{Prob}(\{S_{ix}\}) = \sum_{\{O_x^\mu\}} \mathrm{Prob}(\{S_{ix}\}, \{O_x^\mu\}) = \frac{1}{Z} \sum_{\{O_x^\mu\}} \exp\Big( \frac{\beta}{U} \sum_{\mu,x,y,j} O_x^\mu I_{xy} S_{jy} P_j^\mu \Big) = \frac{1}{Z} \prod_\mu \sum_x \exp\Big( \frac{\beta}{U} \sum_{y,j} I_{xy} S_{jy} P_j^\mu \Big) \stackrel{!}{=} \max,

where we have used the assumption that O_x^μ is always zero for all except one neuron. Gradient descent on the negative log-likelihood then leads to the learning rule

\Delta S_{ix} = \varepsilon \sum_{\mu,y} \langle O_y^\mu \rangle I_{xy} P_i^\mu \quad \text{with} \quad \langle O_y^\mu \rangle = \frac{\exp(\beta \sum_{v,j} I_{yv} S_{jv} P_j^\mu)}{\sum_z \exp(\beta \sum_{v,j} I_{zv} S_{jv} P_j^\mu)}.   (9)

The values ⟨O_y^μ⟩ are the expectation values for the binary probabilistic outputs O_y^μ and can be interpreted as output firing rates for a given level of cortical noise. They correspond to the mean fields of the binary state variables O_y^μ at a computational temperature T = 1/β. Note that (9) and (1) differ only in the interaction function I_yv in the output term ⟨O_y^μ⟩. In the limit of small β, linearization of (9) yields a CBL model similar to (3), just with I_xy − Ī replaced by [I²]_xy − N Ī². Interestingly, using (9) does not lead to qualitative changes in the model predictions, as will be shown in the following sections, but it has the advantage over (1) that a global cost function exists which is minimized.
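As a numerical sketch of how the cost (5) and the mean-field outputs of eq. (9) could be evaluated (the helper names are illustrative and not taken from the paper):

import numpy as np

def mean_field_outputs(S, P, I, beta):
    # expectation values <O_y> of eq. (9): softmax of the laterally filtered afferent input
    h = I @ (S.T @ P)                 # sum_{v,j} I_yv S_jv P_j for every neuron y
    e = np.exp(beta * (h - h.max()))
    return e / e.sum()

def average_cost(S, patterns, I, beta):
    # average cost E of eq. (5), evaluated with the outputs of eq. (9)
    E = 0.0
    for P in patterns:
        cost = -(I @ (S.T @ P))       # cost_x = - sum_{y,j} I_xy S_jy P_j
        E += mean_field_outputs(S, P, I, beta) @ cost
    return E / len(patterns)

This is the quantity plotted in the bottom right panel of Figure 2 for the numerically obtained stationary states.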
4 Emergence of a topographic map

CBL and SOM models have both been applied to the development of ocular dominance and orientation selectivity maps. They differ, however, in the fact that a topographic map emerges only in the SOM model. Thus, in this section, we study the influence of the cortical nonlinearity on the formation of localized receptive fields and topographic maps, for an arbitrary set of positive input patterns.

Let us constrain the sum of the squared weights for every cortical receptive field to Σ_j (S_jy)² = const. = M S̄² after each learning step ΔS_ix according to (1),

\Delta S_{ix}^{constr.} = \frac{\sqrt{M}\,\bar{S}\,S_{ix}^U}{\sqrt{\sum_j (S_{jx}^U)^2}} - S_{ix} \quad \text{with} \quad S_{ix}^U = S_{ix} + \Delta S_{ix}.   (10)

This rule has a fixed point for uniform synaptic weights S_ix = const. = S̄ and uniform cortical outputs O_y^μ = 1/N. This is the only stable fixed point in the CBL limit (low β), and all receptive fields cover the whole input layer and have no structure. We now perform a stability analysis by perturbing the fixed point weights S̄ with small values δ_ix. Substitution of S_ix by S̄ + δ_ix and a subsequent Taylor series expansion of the learning rule around S̄ (see Appendix for details of the derivation) yields

\Delta\delta_{ix}^{constr.} = \left[ \frac{\sqrt{M}\,\bar{S}\,S_{ix}^U}{\sqrt{\sum_j (S_{jx}^U)^2}} - S_{ix} \right]_{\delta=0} + \sum_{j,z} \frac{\partial}{\partial\delta_{jz}} \left[ \frac{\sqrt{M}\,\bar{S}\,S_{ix}^U}{\sqrt{\sum_k (S_{kx}^U)^2}} - S_{ix} \right]_{\delta=0} \delta_{jz} + O(\delta^2)
= \frac{\bar{S}}{\bar{S} + \varepsilon\bar{I}\bar{P}U} \left( \delta_{ix} - \frac{1}{M}\sum_j \delta_{jx} + \frac{\varepsilon\beta}{N} \sum_{j,z} (I_{xz}-\bar{I})\, U K_{ij}\, \delta_{jz} \right) - \delta_{ix},

where K_ij = C_ij − P̄² is the covariance function of the input stimuli. The fixed point is stable if the perturbations decay (Δδ_ix^constr. < 0), i.e. if

\delta_{ix} > \frac{\bar{S}}{\bar{S} + \varepsilon\bar{I}\bar{P}U} \left( \delta_{ix} - \frac{1}{M}\sum_j \delta_{jx} + \frac{\varepsilon\beta}{N} \sum_{j,z} (I_{xz}-\bar{I})\, U K_{ij}\, \delta_{jz} \right)
\varepsilon\bar{I}\bar{P}U\,\delta_{ix} > -\frac{\bar{S}}{M}\sum_j \delta_{jx} + \frac{\varepsilon\beta\bar{S}}{N} \sum_{j,z} (I_{xz}-\bar{I})\, U K_{ij}\, \delta_{jz}.

The first term on the right hand side vanishes for perturbations with Σ_j δ_jx = 0 (to first order, the quadratic weight constraint restricts the perturbations to this subspace), and we obtain

\delta_{ix} > \beta\,\bar{S} \sum_{j,z} \frac{1}{N\bar{I}} (I_{xz}-\bar{I})\, \frac{1}{\bar{P}}\, K_{ij}\, \delta_{jz}.   (11)

To decouple this system of linear equations, we diagonalize the N × N matrix {(1/(N Ī))(I_xz − Ī)} and the M × M matrix {(1/P̄) K_ij} and obtain

\hat{\delta} > \beta\,\bar{S}\,\lambda^I \lambda^K\,\hat{\delta},   (12)

where λ^I and λ^K are the eigenvalues of the matrices {(1/(N Ī))(I_xz − Ī)} and {(1/P̄) K_ij}. If β exceeds a critical value β*,

\beta^* = \frac{1}{\bar{S}\,\lambda^I_{max}\,\lambda^K_{max}},   (13)

given by the maximum eigenvalues of the above-mentioned matrices, the fixed point condition (11) is no longer satisfied. The solution with uniform weights becomes unstable, and localized, structured receptive fields may form. Note that the critical degree of competition β* depends on the constraint value S̄, the cortical interaction function I, and the covariance function K of the input patterns. For (9), which is derived from the cost function (5), {(1/(N Ī))(I_xz − Ī)} has to be replaced by {(1/(N Ī)) Σ_y I_xy I_yz − Ī} in (11).

We have analyzed the learning rule (1) under a hard normalization rule. The same constraint Σ_j (S_jy)² = M S̄² may also be enforced in a biologically more realistic way via a local synaptic decay term, which can be derived from (10) by a first order approximation of the hard normalization for small learning steps [Oja, 1982],

\Delta S_{ix}^{constr.} = \left[ \frac{\sqrt{M}\,\bar{S}\,S_{ix}^U}{\sqrt{\sum_j (S_{jx}^U)^2}} - S_{ix} \right]_{\Delta S=0} + \sum_j \frac{\partial}{\partial (\Delta S_{jx})} \left[ \frac{\sqrt{M}\,\bar{S}\,S_{ix}^U}{\sqrt{\sum_k (S_{kx}^U)^2}} - S_{ix} \right]_{\Delta S=0} \Delta S_{jx} + O(\Delta S)^2
= \Delta S_{ix} - S_{ix}\, \frac{1}{M\bar{S}^2} \sum_j S_{jx}\, \Delta S_{jx}.   (14)

A stability analysis for the uniform fixed point (see Appendix) again leads to the critical β* given by (13).
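The critical value (13) can be evaluated numerically whenever the interaction matrix and the set of input patterns are available as arrays. A minimal sketch is given below (the function name and the use of a dense eigendecomposition are illustrative choices; both matrices are symmetric, so a symmetric eigensolver applies):

import numpy as np

def critical_beta(I, patterns, S_bar):
    # beta* = 1 / (S_bar * lambda^I_max * lambda^K_max), eq. (13)
    N = I.shape[0]
    I_bar = I.mean()                          # mean interaction strength
    P = np.asarray(patterns, dtype=float)     # shape (U, M)
    P_bar = P.mean()                          # mean input activity
    C = P.T @ P / P.shape[0]                  # two-point correlations C_ij
    K = C - P_bar ** 2                        # covariance function K_ij
    lam_I = np.linalg.eigvalsh((I - I_bar) / (N * I_bar)).max()
    lam_K = np.linalg.eigvalsh(K / P_bar).max()
    return 1.0 / (S_bar * lam_I * lam_K)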
4.1 Topographic maps from Gaussian stimuli
In this section, we study a particular example of the development of a topographic map. We use equally distributed Gaussian shaped input patterns and Gaussian shaped lateral interactions, which allow us (i) to determine the network state in the CBL limit (uniform weights, S_ix = S̄, for all i, x), (ii) to calculate the critical β*, and (iii) to predict the receptive field profiles of the topographic map in the SOM limit. The model network now consists of two-dimensional input and output layers with m × m = M = N neurons each and with periodic boundary conditions. The neuron indexes become two-dimensional grid location vectors ⃗i for the input and ⃗x for the output neurons, and ⃗α denotes the location of a stimulus in the input layer. The inputs and the interaction functions are then given by

I_{\vec{y},\vec{x}} = G(\vec{y} - \vec{x}; \sigma^2) \quad \text{(Gaussian interactions)}
P_{\vec{j}}^{\vec{\alpha}} = G(\vec{j} - \vec{\alpha}; \sigma_p^2) \quad \text{(Gaussian input patterns)}
\quad \text{with} \quad G(\vec{x}; \sigma^2) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{\vec{x}^2}{2\sigma^2} \right).

To calculate the critical β* we need to determine the maximum eigenvalues of the matrices {(1/(U P̄)) Σ_μ P_i^μ P_j^μ − P̄} and {(1/(N Ī)) I_xz − 1/N}. For large networks and large variances of the Gaussian functions the eigenvalues become the Fourier coefficients of the Gaussians,

\lambda^K_{\vec{k}} = \exp(-\sigma_p^2\, \vec{k}^2) - \bar{P}\, \delta_{\vec{k},\vec{0}}, \qquad \lambda^I_{\vec{k}} = \exp(-\sigma^2\, \vec{k}^2) - \frac{1}{N}\, \delta_{\vec{k},\vec{0}}.

The maximum eigenvalues are determined by the centers of the Gaussians or, since those have negative delta peaks at ⃗k = ⃗0, by the nearest point on our discrete grid for λ^K and λ^I. For a square grid with side length m we obtain

\lambda^K_{max} = \exp\left( -\sigma_p^2 \left( \frac{2\pi}{m} \right)^2 \right)   (15)
\lambda^I_{max} = \exp\left( -\sigma^2 \left( \frac{2\pi}{m} \right)^2 \right)   (16)
for learning rule (9), from which β* can now be calculated as a function of the parameters σ, σ_p, m and S̄. The result shows that for larger input patterns, for wider input correlation functions, and for broader cortical interaction functions stronger cortical competition is required to develop a topographic map. This is also true for center-surround stimuli, for which the critical β* can be calculated by the same method.

We now examine the topographic maps with localized receptive fields that form for β larger than β*. We make the ansatz that the final weights are of Gaussian shape and are arranged in a topographic order,

S_{\vec{i},\vec{x}} = \sqrt{M}\, 2\sqrt{\pi}\, \bar{S}\, \sigma_s\, G(\vec{x} - \vec{i}; \sigma_s^2),   (17)

where the constant factor has been chosen to satisfy the constraint. In the continuum limit we then obtain for the unconstrained learning step (1)

\Delta S_{\vec{i}\vec{x}} = \varepsilon U \int\!\!\int \frac{\exp\left( \beta \int S_{\vec{j}\vec{y}}\, P_{\vec{j}}^{\vec{\alpha}}\, d\vec{j} \right)}{\int \exp\left( \beta \int S_{\vec{j}\vec{z}}\, P_{\vec{j}}^{\vec{\alpha}}\, d\vec{j} \right) d\vec{z}}\; I_{\vec{x}\vec{y}}\; P_{\vec{i}}^{\vec{\alpha}}\; d\vec{y}\, d\vec{\alpha}.   (18)

For multiplicative constraints, the fixed point condition is given by ΔS_ix ∝ S_ix. The integral inside the exponential function in (18) can be solved by using the convolution rule for Gaussian functions,

\int G(\vec{x} - \vec{y}; \sigma_1^2)\, G(\vec{y} - \vec{z}; \sigma_2^2)\, d\vec{y} = G(\vec{x} - \vec{z}; \sigma_1^2 + \sigma_2^2).

In order to calculate the remaining integrals we use the saddle point approximation,

\frac{\exp(\beta g(\vec{x}))}{\int \exp(\beta g(\vec{y}))\, d\vec{y}} \approx \frac{\exp(\beta g(\vec{x}_0))\, \exp\left( \frac{\beta}{2} g''(\vec{x}_0)(\vec{x} - \vec{x}_0)^2 \right)}{\int \exp(\beta g(\vec{x}_0))\, \exp\left( \frac{\beta}{2} g''(\vec{x}_0)(\vec{y} - \vec{x}_0)^2 \right) d\vec{y}} = G\left( \vec{x} - \vec{x}_0; -\frac{1}{\beta g''(\vec{x}_0)} \right),

which is valid for large values of β. Inserting g(⃗x) = G(⃗x; σ²) and g''(⃗x_0 = ⃗0) = −1/(2πσ⁴) we obtain

\frac{\exp(\beta G(\vec{x}; \sigma^2))}{\int \exp(\beta G(\vec{y}; \sigma^2))\, d\vec{y}} = G\left( \vec{x}; \frac{2\pi\sigma^4}{\beta} \right),

and we are left with a simplified stability condition

G(\vec{x} - \vec{i}; \sigma_s^2) \;\propto\; \frac{\varepsilon U}{\sqrt{M}\, 2\sqrt{\pi}\, \bar{S}\, \sigma_s} \int\!\!\int G\left( \vec{y} - \vec{\alpha}; \frac{\sqrt{\pi}(\sigma_s^2 + \sigma_p^2)^2}{\beta\sqrt{M}\, \bar{S}\, \sigma_s} \right) G(\vec{x} - \vec{y}; \sigma^2)\, G(\vec{i} - \vec{\alpha}; \sigma_p^2)\, d\vec{y}\, d\vec{\alpha}
\;\propto\; \frac{\varepsilon U}{\sqrt{M}\, 2\sqrt{\pi}\, \bar{S}\, \sigma_s}\; G\left( \vec{x} - \vec{i}; \frac{\sqrt{\pi}(\sigma_s^2 + \sigma_p^2)^2}{\beta\sqrt{M}\, \bar{S}\, \sigma_s} + \sigma^2 + \sigma_p^2 \right).

For large β the ansatz (17) indeed satisfies the fixed point condition, with the size σ_s of the receptive fields given by

\sigma_s^2 = \frac{\sqrt{\pi}(\sigma_s^2 + \sigma_p^2)^2}{\beta\sqrt{M}\, \bar{S}\, \sigma_s} + \sigma^2 + \sigma_p^2.

For the learning rule (9) the same calculation, carried out with the effective interaction [I²], yields an analogous fixed point condition that can be solved for σ_s in closed form. In the limit β → ∞ we obtain

\sigma_s^2 = \sigma^2 + \sigma_p^2   (19)

for both learning rules (1) and (9) (see [Obermayer et al., 1990b] for an alternative derivation for the SOM). Equation (19) also demonstrates that for β → ∞ and Gaussian stimuli the SOM rule (1), like (9), converges to the global minimum of the cost function (5), albeit not via gradient descent.
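Assuming the forms of (15) and (16) reconstructed above, the analytic prediction for the Gaussian case can be written down directly; the following short sketch (illustrative helper names) combines (15), (16), (13) and (19):

import numpy as np

def critical_beta_gaussian(sigma2, sigma_p2, m, S_bar):
    # beta* from eqs. (15), (16) and (13) for an m x m grid with periodic boundaries
    k2 = (2 * np.pi / m) ** 2                # squared length of the smallest nonzero wave vector
    lam_K = np.exp(-sigma_p2 * k2)           # eq. (15)
    lam_I = np.exp(-sigma2 * k2)             # eq. (16)
    return 1.0 / (S_bar * lam_I * lam_K)     # eq. (13)

def asymptotic_rf_size(sigma2, sigma_p2):
    # receptive field size in the strong-competition limit, eq. (19)
    return np.sqrt(sigma2 + sigma_p2)

print(critical_beta_gaussian(2.25, 2.25, 16, 1.0), asymptotic_rf_size(2.25, 2.25))

For the parameters of Figure 2 (σ² = σ_p² = 2.25, m = 16) and assuming S̄ = 1, this gives β* ≈ 2.0 and an asymptotic receptive field size of about 2.12, consistent with the values quoted in the figure for learning rule (9).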
[Figure 2: three panels plotting receptive field size and average cost against log₂ β, interpolating between the CBL and SOM regimes; the panel annotations mark the analytic predictions β* = 1.783 (rule (1)), β* = 2.002 (rule (9)), and the asymptotic receptive field size 2.12.]
Figure 2: The development of a topographic map for Gaussian stimuli P_i^μ (variance σ_p² = 2.25), Gaussian interaction functions I_xy (variance σ² = 2.25), and a network of 16×16 neurons with periodic boundary conditions. Each dot represents the result of an independent numerical simulation for a different value of β. Top: Receptive field size as a function of β for learning rule (1). Bottom left: Receptive field size as a function of β for learning rule (9). Bottom right: Averaged cost as a function of β for learning rule (9). The average receptive field size is determined by fitting a Gaussian to every receptive field profile S_ix and averaging the standard deviation (in grid points) over all neurons x. The field size fit for the homogeneous fixed point yields 4.65. The averaged cost is calculated from (5) for the numerically obtained stationary state. β* and the asymptotic receptive field size σ_s are calculated using (16), (15), and (19).
5 Simulation results

Figure 2 shows the results of numerical simulations for the average receptive field size (learning rules (1) and (9)) and the average cost (learning rule (9)) as a function of the competition parameter β. The onset of localization occurs at the predicted value of β*, and the average receptive field size reaches the predicted asymptotic value for high values of β. Constraints were enforced by hard normalization, cf. (10), but simulations using local synaptic decay, cf. (14), yielded identical results for sufficiently small learning steps (data not shown).

Figure 3: Effects of stimulus size σ_p and the range of the cortical interaction σ on the critical value of the competition parameter. The plane of stimulus size σ_p² versus interaction range σ² is divided into a region where a topographic map develops and a region where no topographic map forms, shown for weak competition (β = 2.0), for β = 2.82, and for strong competition (β = 4.0). Solid lines denote analytical results for learning rule (1) as given by (13). Each data point marks the first topographic map from a series of simulations performed in steps of 0.1 along directions in parameter space orthogonal to the plotted solid lines. All other parameters as in Figure 2. For each given β, topographic maps develop to the left of its corresponding line.

Figure 3 shows the critical combinations of stimulus size σ_p and the range of the lateral interaction σ for three values of β. Solid lines indicate the analytical predictions, crosses indicate the results of numerical simulations. The analytical results are well reproduced by the numerical simulations; the slight shift of the dots to the left of the solid lines is an artefact of the discretization of parameter space. The figure shows that larger stimuli or interactions of longer range require stronger competition for localized receptive fields and topographic maps to emerge.

All simulations in Figures 2 and 3 were performed on SUN ULTRA SPARC workstations and use square grids of 16×16 input and output neurons with periodic boundary conditions. The learning rate ε is set at the beginning of the simulation to change the initially random synaptic weights by 0.002 S̄ on average for the neuron that responds best to the first stimulus. Every simulation is run for 30000 pattern presentations. The final state is usually reached within 10000 iterations, except for values of β close to β*. To avoid local minima (maps with twists as in Kohonen feature maps or topographic discontinuities with bi- or multimodal receptive fields) an annealing scheme can be used, like shrinking the range of the interactions during the simulation (cf. SOM models) or slowly increasing β. Instead of annealing a parameter, however, we chose only a slight topographic bias in the random initial conditions. We initialize the synaptic weights, S_ix = 1 + 0.05 ξ + 0.01 S̃_ix, with a constant plus noise (ξ ∈ [−1, 1], uniformly distributed) and add the weights S̃_ix of an ideal topographic map. This is a much weaker constraint than the CBL arbor function A in (4), which effectively defines the maximum receptive field size and sets all other synapses to zero.
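A compressed version of such a simulation is sketched below (Python/NumPy; grid size, stimulus statistics, and the softmax rule follow the description above, while the learning rate, the value of β, and the form of the initial topographic bias are simplified, illustrative choices rather than the settings used for the figures):

import numpy as np

def gaussian_bumps(centers, grid, var, m):
    # periodic 2-D Gaussians of variance `var` on an m x m grid (one row per center)
    d = grid[None, :, :] - centers[:, None, :]
    d = (d + m / 2) % m - m / 2                       # wrap distances (periodic boundaries)
    return np.exp(-(d ** 2).sum(-1) / (2 * var)) / (2 * np.pi * var)

m = 16                                                # grid side length, M = N = m * m
sigma2, sigma_p2 = 2.25, 2.25                         # interaction and stimulus variances
beta, eps, S_bar, steps = 8.0, 0.05, 1.0, 30000       # beta > beta*; eps and steps are illustrative
rng = np.random.default_rng(1)

grid = np.array([(i, j) for i in range(m) for j in range(m)], dtype=float)
I = gaussian_bumps(grid, grid, sigma2, m)             # lateral interactions I_xy
topo = gaussian_bumps(grid, grid, sigma_p2, m)        # ideal topographic map, used as a weak bias
S = 1 + 0.05 * rng.uniform(-1, 1, (m * m, m * m)) + 0.01 * topo / topo.max()
S *= np.sqrt(m * m) * S_bar / np.linalg.norm(S, axis=0)   # enforce sum_j S_jx^2 = M * S_bar^2

for _ in range(steps):
    alpha = rng.uniform(0, m, size=(1, 2))            # random stimulus position on the torus
    P = gaussian_bumps(alpha, grid, sigma_p2, m)[0]   # input pattern P_i
    h = beta * (S.T @ P)
    O = np.exp(h - h.max()); O /= O.sum()             # softmax outputs of eq. (1)
    S += eps * np.outer(P, I @ O)                     # Hebbian step of eq. (1)
    S *= np.sqrt(m * m) * S_bar / np.linalg.norm(S, axis=0)   # hard normalization, eq. (10)

# rough receptive field width of one neuron: weighted spread around its peak position
w = S[:, 0] ** 2; w /= w.sum()
d = (grid - grid[S[:, 0].argmax()] + m / 2) % m - m / 2
print("receptive field size (grid points):", np.sqrt((w @ (d ** 2).sum(1)) / 2))

With β well above β* the printed field size should approach the asymptotic value predicted by (19); with β well below β* the fields remain essentially unstructured.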
6 Related Models

In the low-β limit (correlation-based learning) the algorithm is driven by the second order statistics of the input patterns. In the case of no interactions, I_xy → δ_xy, the synaptic weight vector of each neuron converges to a vector which is parallel to the eigenvector of the correlation matrix of the input patterns which has maximum eigenvalue. The learning rule extracts the principal component of the dataset or, if the maximum eigenvalue is degenerate, one vector from the principal subspace (cf. [MacKay and Miller, 1990]). For the learning rule (1) with hard normalization or with an Oja constraint, (14), the resulting weight vectors have finite length √M S̄. Alternatively, (14) can be simplified by replacing the decay term with −S_ix Σ_i (S_ix)², which was used in [Yuille et al., 1989] and which yields synaptic weight vectors with a length depending on the maximal eigenvalue.

The learning rule (1) could also be implemented with different types of normalization for the synaptic weights [Wiskott and Sejnowski, 1998, Miller and MacKay, 1994]. A common approach is to introduce a Lagrange parameter for each output neuron's receptive field and iterate the learning rule subject to Σ_j (S_jy)^n = const. for all neurons y. For n = 2 this yields multiplicative constraints as used in this paper. For n = 1 we obtain subtractive constraints, which compensate for constant growth and have been used in CBL models [Miller, 1994]. However, the development under subtractive constraints is still unstable and additionally requires clipping of the synaptic weights at some minimum and maximum values. As a result, all synapses (except for at most one for each neuron) saturate at either the minimum or the maximum value before the system reaches a stable state. A transition similar to Figure 2 could also be observed for subtractive constraints: from identical large patchy receptive fields (CBL limit) to a topographic map (SOM limit). In this case, however, the detailed structure of the final map critically depends on the relation between the initial noise, the learning rate, and the clipping values for the weights.

We have shown that the learning rule in (9) can be derived from a cost function, (5), that implicitly includes an objective for the cortical development: for strong competition, every stimulus should be represented by few highly active neurons (and in the limit β → ∞ only by one), which is a sparse representation under the given constraint of limited synaptic resources. A closely related learning rule can be obtained for a different objective: each output neuron should represent ("encode") a subset of the input pattern ensemble such that the input patterns can be as faithfully reconstructed as possible [Luttrell, 1994]. In our notation, each input pattern P_i^μ should be represented by one output neuron (μ is encoded as O_x^μ) given some noise I_xy on the output. The objective is to minimize the mean squared reconstruction error

E(\{S_{ix}\}, \{O_x^\mu\}) = \sum_{\mu,x} \sum_{y,j} O_x^\mu I_{xy} (P_j^\mu - S_{jy})^2   (20)

between the presented stimulus and the neurons' set of weights, which can be achieved in an EM fashion or by gradient descent [Luttrell, 1994]:

\Delta S_{ix} = \varepsilon I_{x,q(\mu)} (P_i^\mu - S_{ix}) \quad \text{with} \quad q(\mu) = \mathrm{argmin}_y \sum_{z,j} I_{yz} (P_j^\mu - S_{jz})^2.   (21)

Expansion of the error term in (20) into Σ_j I_xy [(P_j^μ)² − 2 P_j^μ S_jy + (S_jy)²] leads us to the observation that minimizing the cost functions (20) and (5) is equivalent if all input patterns have equal power Σ_j (P_j^μ)² and if the synaptic weight vectors are confined to a hypersphere, Σ_j (S_jy)² = const. These assumptions are reasonable in models for the development of cortical maps, because sensory systems have many input channels and apply mechanisms of gain control, such that the high dimensional input patterns differ more by stimulus pattern than by stimulus power. In the limit I_yz → δ_yz inside the argmin term, learning rule (21) becomes equivalent to Kohonen's self-organizing feature map [Kohonen, 1982], which has been used to model the development of cortical maps using a few stimulus feature dimensions rather than stimuli as inputs [Obermayer et al., 1992, Bauer, 1995]. The emergence of a topographic projection in feature space has recently been investigated for an extension of learning rule (21). For input patterns with independent activities along each axis distributed around zero, a network of output neurons starts to span a topographic map as soon as β exceeds a critical value β* = 1/(λ^I λ^K) (cf. (13)), where λ^K is now determined by the maximal variance of the data in feature rather than pattern space.

Scherf et al. [Scherf et al., 1995] have developed a "convolution model" for ocular dominance development that has the self-organizing map and the elastic net model [Goodhill and Willshaw, 1990] as special cases. It represents a stimulus as well as a model neuron's receptive field as points in a feature space, and the neuron's activation is a Gaussian function of the distance between these two points. The interactions between different neurons are also modeled by a Gaussian interaction function. For small receptive fields, the model becomes a Kohonen feature map, whereas for short range interactions it follows the elastic net dynamics (without elastic force). The model by Scherf et al. establishes a relation between two kinds of nonlinear feature space models, but does not address the transition between a dominant nonlinear and a dominant linear regime.

Yuille et al. [Yuille et al., 1996] study a generalized deformable model which relates the elastic net to correlation-based learning models. They use a statistical mechanical framework where the synaptic connections are binary stochastic variables (as opposed to (9), where the stochastic output activities are binary). By eliminating the binary stochastic synaptic weights from the energy they obtain an elastic net algorithm (which represents the stimuli in a feature space). By integrating out all the feature representatives, they obtain mean fields for the stochastic weights which turn out to be a correlation based learning map. Their approach shows that a high dimensional CBL model and an elastic net with a reduced feature space minimize the same objective function. Again, the issue of cortical nonlinearities for the emergence of receptive fields in high-dimensional competitive neural network models, i.e. the selection of the relevant features, remains unaddressed.
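For comparison with the sketches above, one update of the reconstruction-based rule (21) can be written in the same NumPy notation (the function name is an illustrative choice):

import numpy as np

def reconstruction_step(S, P, I, eps):
    # one update of eq. (21): move all weight vectors toward the pattern,
    # weighted by their lateral coupling to the best-reconstructing neuron q
    err = ((P[:, None] - S) ** 2).sum(axis=0)     # ||P - S_y||^2 for every output neuron y
    q = (I @ err).argmin()                        # q(mu) = argmin_y sum_z I_yz ||P - S_z||^2
    return S + eps * I[:, q] * (P[:, None] - S)

Replacing I by the identity inside the argmin recovers Kohonen's self-organizing feature map, as noted above.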
7 Conclusions

We have shown that a Hebbian learning rule can be formulated under the assumption of soft competition between cortical neurons that has correlation-based learning models and self-organizing maps as limit cases. We find that varying the degree of competition from low to high leads to a transition to feature mapping at a critical value of the nonlinearity parameter β. Applied to the formation of topographic maps we find (i) that in the limit of weak competition the system becomes linear, no topographic projection emerges, and only the correlations of the input patterns (together with the shape of the interaction function) determine the developing receptive fields and maps, and (ii) that for strong competition higher order pattern statistics do matter, localized receptive fields may form, and a topographic map emerges. For Gaussian stimuli and Gaussian interactions among the output neurons we can analytically predict the transition between a correlation-based learning regime and a topographic map, as well as the size of the receptive fields in a topographic map for strong competition. Wide input correlations or long-range cortical interactions require strong competition to develop localized receptive fields.
Acknowledgements
This work was supported by the Boehringer Ingelheim Fonds (C. Piepenbrock) and by the Wellcome Trust (grant no. 050080/Z/97).
Appendix

Here we present the detailed steps for the derivation of the stability condition for the initial phase transition. For our stability analysis of (10) we need the first order expansion of the constrained learning step in the perturbations δ_jz of the uniform fixed point S_ix = S̄,

\Delta\delta_{ix}^{constr.} = \sum_{j,z} \frac{\partial}{\partial\delta_{jz}} \left[ \frac{\sqrt{M}\,\bar{S}\,S_{ix}^U}{\sqrt{\sum_k (S_{kx}^U)^2}} - S_{ix} \right]_{S=\bar{S}} \delta_{jz} + O(\delta^2)
= \sum_{j,z} \left[ \frac{\sqrt{M}\,\bar{S}}{\sqrt{\sum_k (S_{kx}^U)^2}} \frac{\partial S_{ix}^U}{\partial S_{jz}} - \frac{\sqrt{M}\,\bar{S}\,S_{ix}^U}{\left( \sum_k (S_{kx}^U)^2 \right)^{3/2}} \sum_k S_{kx}^U \frac{\partial S_{kx}^U}{\partial S_{jz}} - \delta_{ij}\delta_{xz} \right]_{S=\bar{S}} \delta_{jz}.

We now use the expressions derived below and obtain

\Delta\delta_{ix}^{constr.} = \frac{\bar{S}}{\bar{S}+\varepsilon\bar{I}\bar{P}U} \left( \delta_{ix} + \frac{\varepsilon\beta}{N} \sum_{j,z} (I_{xz}-\bar{I}) \sum_\mu P_j^\mu P_i^\mu\, \delta_{jz} \right) - \frac{\bar{S}}{M(\bar{S}+\varepsilon\bar{I}\bar{P}U)} \left( \sum_k \delta_{kx} + \frac{\varepsilon\beta}{N} \sum_{j,z} (I_{xz}-\bar{I})\, U M \bar{P}^2\, \delta_{jz} \right) - \delta_{ix}
= \frac{\bar{S}}{\bar{S}+\varepsilon\bar{I}\bar{P}U} \left( \delta_{ix} - \frac{1}{M}\sum_k \delta_{kx} + \frac{\varepsilon\beta}{N} \sum_{j,z} (I_{xz}-\bar{I}) \Big( \sum_\mu P_i^\mu P_j^\mu - U\bar{P}^2 \Big) \delta_{jz} \right) - \delta_{ix}.

Defining the covariance function of the input stimuli, K_ij = (1/U) Σ_μ (P_i^μ − P̄)(P_j^μ − P̄) = C_ij − P̄², this becomes

\Delta\delta_{ix}^{constr.} = \frac{\bar{S}}{\bar{S}+\varepsilon\bar{I}\bar{P}U} \left( \delta_{ix} - \frac{1}{M}\sum_k \delta_{kx} + \frac{\varepsilon\beta}{N} \sum_{j,z} (I_{xz}-\bar{I})\, U K_{ij}\, \delta_{jz} \right) - \delta_{ix},

which is the expression used in section 4 and which leads to the stability condition (11).

In the above calculation we have used the following terms. With H_y^μ = exp(β Σ_k S_ky P_k^μ) the softmax output is O_y^μ = H_y^μ / Σ_x H_x^μ, and

\frac{\partial O_y^\mu}{\partial S_{jz}} = \frac{\partial}{\partial S_{jz}} \frac{H_y^\mu}{\sum_x H_x^\mu} = \beta\, \frac{\delta_{yz} H_y^\mu P_j^\mu}{\sum_x H_x^\mu} - \beta\, \frac{H_y^\mu H_z^\mu P_j^\mu}{\left( \sum_x H_x^\mu \right)^2} = \beta\, O_y^\mu \left( \delta_{yz} - O_z^\mu \right) P_j^\mu,
\frac{\partial O_y^\mu}{\partial S_{jz}} \bigg|_{S=\bar{S}} = \beta\, \frac{1}{N} \left( \delta_{yz} - \frac{1}{N} \right) P_j^\mu,
\frac{\partial \Delta S_{ix}}{\partial S_{jz}} = \varepsilon\beta \sum_{\mu,y} O_y^\mu \left( \delta_{yz} - O_z^\mu \right) P_j^\mu\, I_{xy}\, P_i^\mu,
\frac{\partial \Delta S_{ix}}{\partial S_{jz}} \bigg|_{S=\bar{S}} = \varepsilon\beta\, \frac{1}{N} (I_{xz} - \bar{I}) \sum_\mu P_j^\mu P_i^\mu = \varepsilon\beta\, \frac{1}{N} (I_{xz} - \bar{I})\, U C_{ij},
S_{ix}^U \big|_{S=\bar{S}} = \bar{S} + \varepsilon\bar{I}\bar{P}U.

Next we show that a normalization with an Oja constraint yields the same critical β* as hard normalization. Starting from (14), we perturb the fixed point weights S_ix = S̄ with small values δ_jz, S_jz = S̄ + δ_jz, and expand the learning rule to first order in the δ_jz,

\Delta\delta_{ix}^{constr.} = \left[ \Delta S_{ix} - S_{ix}\, \frac{1}{M\bar{S}^2} \sum_k S_{kx}\, \Delta S_{kx} \right]_{\delta=0} + \sum_{j,z} \frac{\partial}{\partial\delta_{jz}} \left[ \Delta S_{ix} - S_{ix}\, \frac{1}{M\bar{S}^2} \sum_k S_{kx}\, \Delta S_{kx} \right]_{\delta=0} \delta_{jz} + O(\delta^2)
= \frac{\varepsilon\beta}{N} \sum_{j,z} (I_{xz}-\bar{I}) \Big( \sum_\mu P_i^\mu P_j^\mu - U\bar{P}^2 \Big) \delta_{jz} - \frac{\varepsilon\bar{I}\bar{P}U}{\bar{S}} \left( \delta_{ix} + \frac{1}{M} \sum_j \delta_{jx} \right),

where the zeroth order term vanishes at the fixed point. The fixed point stability condition Δδ_ix^constr. < 0 becomes

0 > \beta\,\bar{S} \sum_{j,z} \frac{1}{N\bar{I}} (I_{xz}-\bar{I})\, \frac{1}{\bar{P}}\, K_{ij}\, \delta_{jz} - \delta_{ix} - \frac{1}{M} \sum_j \delta_{jx},

where the last term again vanishes for perturbations with Σ_j δ_jx = 0. This is identical to the result for the hard normalization from (11).
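The softmax derivative used above is easy to verify numerically. The following self-contained check (toy sizes, illustrative names) compares the analytic expression β O_y (δ_yz − O_z) P_j against a finite difference:

import numpy as np

def softmax_output(S, P, beta):
    h = S.T @ P
    e = np.exp(beta * (h - h.max()))
    return e / e.sum()

rng = np.random.default_rng(2)
M, N, beta, h_step = 6, 4, 1.3, 1e-6
S, P = rng.random((M, N)), rng.random(M)
j, z, y = 2, 1, 3                                    # perturb S_jz, observe O_y

S_plus = S.copy(); S_plus[j, z] += h_step
numeric = (softmax_output(S_plus, P, beta)[y] - softmax_output(S, P, beta)[y]) / h_step
O = softmax_output(S, P, beta)
analytic = beta * O[y] * ((1.0 if y == z else 0.0) - O[z]) * P[j]
print(np.isclose(numeric, analytic, rtol=1e-4))      # expected: True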
References

[Bauer, 1995] Bauer, H.-U. (1995). Development of oriented ocular dominance bands as a consequence of areal geometry. Neur. Comp., 7:36–50.

[Durbin and Mitchison, 1990] Durbin, R. and Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps. Nature, 343:644–647.

[Erwin et al., 1995] Erwin, E., Obermayer, K., and Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neur. Comp., 7:425–468.

[Goodhill and Willshaw, 1990] Goodhill, G. J. and Willshaw, D. J. (1990). Application of the elastic net algorithm to the formation of ocular dominance stripes. Network, 1:41–59.

[Kohonen, 1982] Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43(1):59–69.

[Levitt and Lund, 1997] Levitt, J. B. and Lund, J. S. (1997). Contrast dependence of contextual effects in primate visual cortex. Nature, 387:73–76.

[Linsker, 1986] Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation columns. Proc. Natl. Acad. Sci. USA, 83:8779–8783.

[Luttrell, 1994] Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neur. Comp., 6(5):767–794.

[MacKay and Miller, 1990] MacKay, D. J. C. and Miller, K. D. (1990). Analysis of Linsker's application of Hebbian rules to linear networks. Network, 1:257–297.

[Miller, 1994] Miller, K. (1994). A model for the development of simple cell receptive fields and the ordered arrangements of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci., 14:409–441.

[Miller et al., 1989] Miller, K., Keller, J. B., and Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245:605–615.

[Miller and MacKay, 1994] Miller, K. D. and MacKay, D. J. (1994). The role of constraints in Hebbian learning. Neur. Comp., 6:100–126.

[Obermayer et al., 1992] Obermayer, K., Blasdel, G. G., and Schulten, K. (1992). A statistical mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A, 45:7568–7589.
[Obermayer et al., 1990a] Obermayer, K., Ritter, H., and Schulten, K. (1990a). Large-scale simulations of self-organizing neural networks on parallel computers: Application to biological modelling. Par. Comp., 14:381–404.

[Obermayer et al., 1990b] Obermayer, K., Ritter, H., and Schulten, K. (1990b). A neural network model for the formation of topographic maps in the CNS: Development of receptive fields. In Proceedings of the International Joint Conference on Neural Networks II, San Diego, pages 423–429.

[Obermayer et al., 1990c] Obermayer, K., Ritter, H., and Schulten, K. (1990c). A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. USA, 87:8345–8349.

[Oja, 1982] Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15:267–273.

[Piepenbrock et al., 1997] Piepenbrock, C., Ritter, H., and Obermayer, K. (1997). The joint development of orientation and ocular dominance: Role of constraints. Neur. Comp., 9:959–970.

[Riesenhuber et al., 1996] Riesenhuber, M., Bauer, H.-U., and Geisel, T. (1996). Analyzing phase transitions in high-dimensional self-organizing maps. Biol. Cybern., 75:397–407.

[Scherf et al., 1995] Scherf, O., Pawelzik, K., and Geisel, T. (1995). From elastic net to SOFM: the energy function of the convolution model. In Fogelman-Soulié, F. and Gallinari, P., editors, Artificial Neural Networks: ICANN95, volume 1, pages 39–43, Paris. EC2 & Cie.

[Swindale, 1982] Swindale, N. (1982). A model for the formation of orientation columns. Proc. R. Soc. Lond. B, 215:211–230.

[Takeuchi and Amari, 1979] Takeuchi, A. and Amari, S. (1979). Formation of topographic maps and columnar microstructures. Biol. Cybern., 35:63–72.

[Tanaka, 1990] Tanaka, S. (1990). Theory of self-organization of cortical maps: Mathematical framework. Neur. Netw., 3:625–640.

[Wiskott and Sejnowski, 1998] Wiskott, L. and Sejnowski, T. (1998). Constrained optimization for neural map formation: A unifying framework for weight growth and normalization. Neur. Comp., 10:671–716.

[Yuille et al., 1989] Yuille, A. L., Kammen, D. M., and Cohen, D. S. (1989). Quadrature and the development of orientation selective cortical cells by Hebb rules. Biol. Cybern., 61:183–194.
[Yuille et al., 1996] Yuille, A. L., Kolodny, J. A., and Lee, C. W. (1996). Dimension reduction, generalized deformable models and the development of ocularity and orientation. Neur. Netw., 9:309–319.