IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 42, NO. 4, JULY 1996
and consequently ...

By definition of $V_n$ and the relation (6),

$$\limsup_{n \to \infty} P\{x : \|x - Q_n(x)\| \ge \delta\} \ge \cdots > 0$$

and therefore $\limsup_{n \to \infty} D_T(Q_n) > 0$ for every ...

ACKNOWLEDGMENT
The author wishes to thank R. Alexander for bringing the paper [1] to his attention.

REFERENCES
[1] R. Alexander, “A problem about lines and ovals,” Amer. Math. Monthly, vol. 75, no. 5, pp. 482-487, 1968.
[2] J. A. Bucklew and G. L. Wise, “Multidimensional asymptotic quantization theory with rth distortion measures,” IEEE Trans. Inform. Theory, vol. IT-28, pp. 239-247, 1982.
[3] S. Cambanis and N. Gerr, “A simple class of asymptotically optimal quantizers,” IEEE Trans. Inform. Theory, vol. IT-29, pp. 664-676, Sept. 1983.
[4] A. Gersho, “Asymptotically optimal block quantization,” IEEE Trans. Inform. Theory, vol. IT-25, pp. 373-380, 1979.
[5] R. M. Gray, K. L. Oehler, K. O. Perlmutter, and R. A. Olshen, “Combining tree-structured vector quantization with classification and regression trees,” in Proc. 27th Asilomar Conf. on Circuits Systems and Computers, 1993.
[6] G. Lugosi and A. B. Nobel, “Consistency of data-driven histogram methods for density estimation and classification,” to appear in Annals of Statistics, 1996.
[7] S. Na and D. L. Neuhoff, “Bennett’s integral for vector quantizers,” IEEE Trans. Inform. Theory, vol. 41, pp. 886-900, 1995.
[8] D. L. Neuhoff, “On the asymptotic distribution of the errors in vector quantization,” IEEE Trans. Inform. Theory, vol. 42, no. 2, pp. 461-468, Mar. 1996.
[9] A. B. Nobel, “Recursive partitioning to reduce distortion,” Tech. Rep. UIUC-BI-95-01, Beckman Institute, University of Illinois, Urbana-Champaign, 1995.
[10] A. B. Nobel, “Histogram regression estimation using data-dependent partitions,” to appear in Annals Stat.
[11] K. L. Oehler, P. C. Cosman, R. M. Gray, and J. May, “Classification using vector quantization,” in Proc. 25th Asilomar Conf. on Signals, Systems and Computers (Pacific Grove, CA, Nov. 1991), pp. 439-445.
[12] K. L. Oehler and R. M. Gray, “Combining image compression and classification using vector quantization,” IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 461-473, 1995.
[13] Q. Xie, C. A. Laszlo, and R. K. Ward, “Vector quantization for nonparametric classifier design,” IEEE Trans. Pattern Anal. Machine Intell., vol. 15, no. 12, pp. 1326-1330, 1993.
[14] Y. Yamada, S. Tazaki, and R. M. Gray, “Asymptotic performance of block quantizers with difference distortion measures,” IEEE Trans. Inform. Theory, vol. IT-26, pp. 6-14, 1980.
[15] P. Zador, “Asymptotic quantization of continuous random variables,” unpublished memo., Bell Labs., 1966.
[16] P. Zador, “Asymptotic quantization error of continuous signals and quantization dimension,” IEEE Trans. Inform. Theory, vol. IT-28, pp. 139-149, 1982.

On the Convergence of Linear Stochastic Approximation Procedures
Michael A. Kouritzin

Abstract-Many stochastic approximation procedures result in a stochastic algorithm of the form

$$h_{k+1} = h_k + \frac{1}{k}(b_k - A_k h_k), \quad \text{for all } k = 1, 2, 3, \ldots$$

Here, $\{b_k, k = 1, 2, 3, \ldots\}$ is a $\mathbb{R}^d$-valued process, $\{A_k, k = 1, 2, 3, \ldots\}$ is a symmetric, positive semidefinite $\mathbb{R}^{d \times d}$-valued process, and $\{h_k, k = 1, 2, 3, \ldots\}$ is a sequence of stochastic estimates which hopefully converges to

$$h \triangleq \left[ \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} E A_k \right]^{-1} \left\{ \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} E b_k \right\}$$

(assuming everything here is well defined). In this correspondence, we give an elementary proof which relates the almost sure convergence of $\{h_k, k = 1, 2, 3, \ldots\}$ to strong laws of large numbers for $\{b_k, k = 1, 2, 3, \ldots\}$ and $\{A_k, k = 1, 2, 3, \ldots\}$.

Index Terms-Almost sure convergence, stochastic approximation, adaptive filtering, recursive algorithms, dependent random variables, Robbins-Monro process.
Manuscript received October 1, 1994; revised December 19, 1995. This work was done while the author was visiting the Laboratory for Research in Statistics and Probability, Carleton University, Ottawa, Ont., Canada K1S 5B6.
The author is with the Institute for Mathematics and Its Applications, University of Minnesota, Minneapolis, MN 55455-0436 USA.
Publisher Item Identifier S 0018-9448(96)03642-5.
0018-9448/96$05.00 © 1996 IEEE

I. INTRODUCTION
Since the inception of stochastic approximation procedures, many statisticians, probabilists, and engineers have strived to establish limit theorems and invariance principles for these procedures. Much of the earlier effort (see, e.g., Sacks [21], Fabian [7], McLeish [17], Gaposhkin and Krasulina [10], Heyde [13], and Ruppert [20]) was concerned with procedures which can be written in the algorithmic form

$$h_{k+1} = h_k + \frac{1}{k}(b_k - A_k h_k) \quad \text{for all } k = 1, 2, 3, \ldots \tag{1.1}$$

where $h_1$ is some possibly random vector, $\{A_k, k = 1, 2, 3, \ldots\}$ and $\{b_k, k = 1, 2, 3, \ldots\}$ are, respectively, $\mathbb{R}^{d \times d}$-valued and $\mathbb{R}^d$-valued processes on some probability space $(\Omega, \mathcal{F}, P)$, and $A_k$ is constant or at least converges almost surely to some positive-definite matrix $A$. More recently, applications of the procedures (1.1) where $A_k$ is symmetric and positive semidefinite but does not converge (see, e.g., Widrow and Stearns [24, ch. 6], Benveniste, Métivier, and Priouret [1, ch. 1], and the introduction of Farden [8]) have prompted many authors (see, e.g., Fritz [9], Györfi [11], Farden [8], Eweda and Macchi [5], [6], Ljung [16], and Métivier and Priouret [18]) to study strong consistency for (1.1) under less-stringent conditions on $\{A_k, k = 1, 2, 3, \ldots\}$. In the present note, we bring forth seemingly natural almost sure convergence results for (1.1) analogous to the strong laws of large numbers for partial sums of random variables. Although our results are in some respects more general than previous results, our main contribution might be considered our elementary proof, which was motivated in part by Fabian [7, Lemma 2.1] and Eweda and Macchi [5]. (There are also some similarities between the matrix computations in Bitmead [2] and those in the sequel.) Finally, we mention that there is substantial literature (see, e.g., the books of Kushner and Clark [14] and Benveniste, Métivier, and Priouret [1]) motivating and treating nonlinear (in $h_k$) stochastic approximation procedures as well as procedures with state-dependent noise. However, to apply such results to our algorithm (1.1), one would necessarily have to impose more stringent conditions than those proposed in this correspondence.
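To make the recursion (1.1) concrete, the following is a minimal simulation sketch and not part of the correspondence itself. It assumes an adaptive-filtering style setup chosen purely for illustration: $A_k = x_k x_k^T$ and $b_k = y_k x_k$, where $x_k$ has independent unit-variance components and $y_k = x_k^T h_{\mathrm{true}} + \text{noise}$. Under these assumptions $A_k$ is symmetric, positive semidefinite, and does not converge, while the averages of $A_k$ and $b_k$ tend to $I$ and $h_{\mathrm{true}}$, so the limit one would hope for is $h_{\mathrm{true}}$. The names `simulate`, `h_true`, and `noise_std` are hypothetical.

```python
import math
import random


def simulate(n_steps, h_true, noise_std=0.1, seed=0):
    """Run h_{k+1} = h_k + (1/k)(b_k - A_k h_k) with the illustrative
    choice A_k = x_k x_k^T and b_k = y_k x_k (an assumption, see lead-in)."""
    rng = random.Random(seed)
    d = len(h_true)
    half_width = math.sqrt(3.0)  # Uniform(-sqrt(3), sqrt(3)) has unit variance
    h = [0.0] * d  # h_1, the initial estimate
    for k in range(1, n_steps + 1):
        x = [rng.uniform(-half_width, half_width) for _ in range(d)]
        y = sum(xi * ti for xi, ti in zip(x, h_true)) + rng.gauss(0.0, noise_std)
        # b_k - A_k h_k = (y_k - x_k^T h_k) x_k for this choice of A_k and b_k
        residual = y - sum(xi * hi for xi, hi in zip(x, h))
        h = [hi + residual * xi / k for hi, xi in zip(h, x)]
    return h


h_true = [1.0, -2.0]
h_est = simulate(20000, h_true)
error = max(abs(a - b) for a, b in zip(h_est, h_true))
print(h_est, error)
```

The rank-one choice of $A_k$ is only one instance of a nonconvergent, positive semidefinite sequence; other choices would need their own averages checked before expecting the same limit.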
II. NOTATION AND RESULTS
In this section, we will define our notation and provide our results.
of (1.1), converges to $h \triangleq A^{-1} b$ provided
and $\lim_{N \to \infty} \cdots A_k$
It1 a
IATI
max{i E No: i
< 00
2 t } , for
$a_{i,k} \overset{i,k}{\ll} b_{i,k}$ means that there is a $c > 0$ not depending on $i$ or $k$ such that $|a_{i,k}| \le c\,|b_{i,k}|$ for all $i, k$.
$I$ = $d \times d$ identity matrix.
$\prod_{l=p}^{q} B_l$ (with each $B_l$ being an $\mathbb{R}^{d \times d}$-matrix) $= B_q B_{q-1} \cdots B_p$.
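Because matrix multiplication does not commute, the order in the product notation above matters: the factor with the largest index sits on the left. A minimal sketch, assuming the product runs from a lower index `p` to an upper index `q` and that an empty product is taken to be the identity (both are assumptions made here for illustration, as are the helper names `matmul` and `ordered_product`):

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def ordered_product(mats, p, q):
    """Return B_q B_{q-1} ... B_p, where mats maps an index l to the matrix B_l.
    An empty range (q < p) returns the identity, which is an assumed convention."""
    d = len(next(iter(mats.values())))
    result = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for l in range(p, q + 1):
        # Each later factor multiplies on the left of what has been accumulated.
        result = matmul(mats[l], result)
    return result


B = {1: [[1.0, 1.0], [0.0, 1.0]], 2: [[1.0, 0.0], [1.0, 1.0]]}
left_to_right = ordered_product(B, 1, 2)   # B_2 B_1
reversed_order = matmul(B[1], B[2])        # B_1 B_2, for comparison
print(left_to_right, reversed_order)
```

Here `left_to_right` and `reversed_order` differ, which is why the notation fixes the order of the factors explicitly.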
0 be arbitrary, recalling some basic theory for symmetric matrices (see, e.g., [25, pp. 57-8]), and using the fact
Z-lA)
I
In-l
5 klim -m
(3.13)
m!
m=2
Indeed, with the definition
lim 1u,1
777
(3.4)
it follows by (3.4), (3.3), and Lemma A i) and ii) (to follow) that
\
0, where
(3.6)
Moreover, letting $\lambda_{\min}$ be the smallest eigenvalue of $A$ and $a > 1$ be a number small enough that $(\lambda_{\min} + \|A\|) \log(a) < 2$, we find by (3.8) that there is some $K_\varepsilon$ such that

$$\cdots \le 1 - \lambda_{\min} \log(a) + \varepsilon, \quad \text{for all } k \ge K_\varepsilon$$