Dynamic network models and graphon estimation


arXiv:1607.00673v1 [math.ST] 3 Jul 2016

Marianna Pensky
Department of Mathematics, University of Central Florida

Abstract

In the present paper we consider a dynamic stochastic network model. The objective is estimation of the tensor of connection probabilities Λ when it is generated by a Dynamic Stochastic Block Model (DSBM) or a dynamic graphon. In particular, in the context of the DSBM, we derive a penalized least squares estimator $\hat{\Lambda}$ of Λ and show that $\hat{\Lambda}$ satisfies an oracle inequality and also attains minimax lower bounds for the risk. We extend those results to estimation of Λ when it is generated by a dynamic graphon function. The estimators constructed in the paper are adaptive to the unknown number of blocks in the context of the DSBM or to the smoothness of the graphon function. The technique relies on the vectorization of the model and leads to much simpler mathematical arguments than the ones used previously in the stationary setup. In addition, all our results are non-asymptotic and allow a variety of extensions.

Keywords and phrases: dynamic network, graphon, stochastic block model, nonparametric regression, minimax rate

AMS (2000) Subject Classification: Primary: 60G05. Secondary: 05C80, 62F35

1 Introduction

Networks arise in many areas of research such as sociology, biology, genetics, ecology and information technology, to list a few. An overview of statistical modeling of random graphs can be found in, e.g., Kolaczyk (2009) and Goldenberg et al. (2011). While static network models are relatively well understood, the literature on dynamic network models is fairly recent. In this paper, we consider a dynamic network defined as an undirected graph with n nodes with connection probabilities changing in time. Assume that we observe the values of a tensor Bi,j,l ∈ {0, 1} at the time tl where 0 < t1 < · · · < tL = T. For simplicity, we assume that the time instances are equispaced and the time interval is scaled to one, i.e., tl = l/L. Here Bi,j,l = 1 if a connection between nodes i and j is observed and Bi,j,l = 0 otherwise. We set Bi,i,l = 0 and Bi,j,l = Bj,i,l for any i, j = 1, · · · , n and l = 1, · · · , L, and assume that the Bi,j,l are independent Bernoulli random variables with Λi,j,l = P(Bi,j,l = 1) and Λi,i,l = 0. In this paper, we consider a Dynamic Stochastic Block Model (DSBM) where all n nodes are grouped into m classes Ω1, · · · , Ωm, and the probability of a connection Λi,j,l is entirely determined by the groups to which the nodes i and j belong at the moment tl. In particular, if i ∈ Ωk and j ∈ Ωk′, then Λi,j,l = Gk,k′,l. Here, G is the connectivity tensor, with Gk,k′,l = Gk′,k,l. The dynamic network described above can be generalized to a dynamic graphon. Let ζ = (ζ1, · · · , ζn) be a random vector sampled from a distribution Pζ supported on [0, 1]^n. Although the most common choice for Pζ is the i.i.d. uniform distribution for each ζi, we do not make this assumption in the present paper. We further assume that there exists a function f : [0, 1]^3 → [0, 1] such that for any t one has f(x, y, t) = f(y, x, t) and

$$\Lambda_{i,j,l} = f(\zeta_i, \zeta_j, t_l), \qquad i, j = 1, \cdots, n, \quad l = 1, \cdots, L. \qquad (1.1)$$

The function f in (1.1) summarizes the behavior of the network and can be called a dynamic graphon, by analogy with the stationary case. This formulation allows one to study a different set of stochastic network models than the DSBM. It is known that graphons play an important role in the theory of graph limits described in Lovász and Szegedy (2006) and Lovász (2012). Given an observed adjacency tensor B sampled according to model (1.1), the graphon function f is not identifiable, since the topology of a network is invariant with respect to any change of labeling of its nodes. Therefore, for any f and any measure-preserving bijection µ : [0, 1] → [0, 1] (with respect to Lebesgue measure), the functions f(x, y, t) and f(µ(x), µ(y), t) define the same probability distribution on random graphs. For this reason, we consider equivalence classes of graphons. Note that, in order to be able to compare clustering of nodes across time instances, we introduce an assumption (which, to the best of our knowledge, first appeared in Matias and Miele (2015)) that there is no label switching in time, that is, every node carries the same label at any time tl, so that the function µ is independent of t.

The objective of the paper is estimation of the tensor Λ from observations B in the case of the DSBM and of the dynamic graphon (1.1). It is reasonable to assume that the tensor G does not change significantly with time. The nodes of the network, however, can switch between groups, and this can significantly affect the tensor Λ. In the last few years, dynamic network models have attracted a great deal of attention (see, e.g., Durante et al. (2015), Durante et al. (2016), Han et al. (2015), Kolar et al. (2010), Matias and Miele (2015), Minhas et al. (2015), Xing et al. (2010), Xu (2015), Xu and Hero III (2014) and Yang et al. (2011), among others). The majority of those papers describe changes in the connection probabilities and group memberships via various kinds of Bayesian or Markov random field models and carry out the inference using EM or iterative optimization algorithms. While the procedures described in these papers show good computational properties, they come without guarantees for the estimation precision. On the other hand, very recently, several authors carried out minimax studies in the context of stationary network models. In particular, Gao et al. (2015) developed the upper and the minimax lower bounds for the risk of estimation of the matrix of connection probabilities. Klopp et al. (2016) extended these results to the case when the network is sparse in the sense that the probability of connection is uniformly small and tends to zero as n → ∞. Also, Zhang and Zhou (2015) investigated minimax rates of community detection in the two-class stochastic block model. Our paper can be viewed as a nontrivial extension of the above results to the case of time-dependent network models and graphons. To the best of our knowledge, this is the first minimax study of dynamic networks and graphons: the only paper known to us that is concerned with estimation precision in dynamic network models is by Han et al. (2015), where the authors study consistency of their procedures when n → ∞ or L → ∞. In the present paper, under the assumption that the tensor G is smooth as a function of time, in Section 3 we derive penalized least squares estimators $\hat{\Lambda}$ of Λ and show that they satisfy oracle inequalities.
The technique relies on the vectorization of the model, described in Section 2.2, and allows one to take advantage of well-studied methodologies of nonparametric regression estimation. Initially, we do not place any assumptions on the changes in the memberships of the nodes in time; however, in Section 3, as a particular case, we consider a situation where at most n0 nodes can change their memberships between two consecutive time points. For the DSBM, under the latter assumption, in Section 4, we derive minimax lower bounds for the risk of any estimator of Λ and confirm that the estimators constructed in the paper attain those lower bounds. In Section 5 we extend those results to estimation of the tensor Λ when it is generated by a graphon function. We show that the estimators are minimax optimal within a logarithmic factor of L.


The estimators constructed in the paper do not require knowledge of the number of classes m in the context of the DSBM, or of the degree of smoothness of the graphon function f if Λ is generated by a dynamic graphon. Finally, in Section 6 we discuss various generalizations of the technique proposed in the paper. For example, it can be adapted to a situation where the number of nodes in the network depends on time or where the connection probabilities have jump discontinuities. Moreover, since our oracle inequality for the DSBM is derived under no assumptions about the clustering mechanism, one can accommodate a much more diverse collection of scenarios than the ones studied in the paper. The proofs of all statements are placed in Section 7.

Note that, unlike Klopp et al. (2016), we do not consider a network that is sparse in the sense that the probabilities of connections between classes are uniformly small. However, since our technique is based on model selection, it allows one to study a network where the probabilities of some connections are equal to zero, as is done in Bickel and Chen (2009), Bickel et al. (2011) or Wolfe and Olhede (2013). In particular, from our theory, one automatically obtains an oracle inequality for a sparse stationary stochastic block model (SBM).

The present paper makes several key contributions. First, to the best of our knowledge, this is the first minimax study of estimation of the tensor of connection probabilities in a dynamic stochastic network model. Second, we provide estimators that are adaptive to the number of blocks in the context of the DSBM and adaptive to the smoothness of the graphon function. In the case of L = 1 we automatically obtain an adaptive estimator in a sparse SBM model where the probabilities of interactions between some groups of nodes are identically equal to zero. All our results are non-asymptotic. Third, for the derivation of the estimators and their analysis, we use a novel approach that is based on vectorization of the model and leads to much simpler mathematical arguments than the ones used in both Gao et al. (2015) and Klopp et al. (2016). The latter feature may be generally useful for research in the area of network analysis. Finally, the technique allows various extensions that we discuss later in Section 6.

2 Notation and data structures

2.1 Notation

For any two positive sequences {an} and {bn}, an ≍ bn means that there exists a constant C > 0 independent of n such that C^{−1} an ≤ bn ≤ C an for any n. For any set Ω, we denote the cardinality of Ω by |Ω|. For any x, [x] is the largest integer no larger than x. For any vector t ∈ R^p, we denote its ℓ2, ℓ1, ℓ0 and ℓ∞ norms by, respectively, ‖t‖, ‖t‖1, ‖t‖0 and ‖t‖∞. Denote by ‖t1 − t2‖_H the Hamming distance between vectors t1 and t2. Denote by 1 and 0 the vectors that have, respectively, only unit or zero elements. Denote by e_j the vector with 1 in the j-th position and all other elements equal to zero. For a matrix A, its i-th row and j-th column are denoted, respectively, by A_{i,∗} and A_{∗,j}. Similarly, for a tensor A ∈ R^{n1×n2×n3}, we denote its l-th (n1 × n2)-dimensional sub-matrix by A_{∗,∗,l}. Let vec(A) be the vector obtained from matrix A by sequentially stacking its columns. Denote by A ⊗ B the Kronecker product of matrices A and B. Also, I_k is the identity matrix of size k. For any subset J of indices, any vector t and any matrix A, denote the restriction of t to the indices in J by t_J and the restriction of A to the columns A_{∗,j} with j ∈ J by A_J. Also, denote by t_{(J)} the modification of the vector t where all elements t_j with j ∉ J are set to zero. For any matrix A, denote its spectral and Frobenius norms by, respectively, ‖A‖ and ‖A‖_F; denote ‖A‖_H ≡ ‖vec(A)‖_H and ‖A‖_0 ≡ ‖vec(A)‖_0. For any tensor A ∈ R^{n1×n2×n3}, denote $\|A\|^2 = \sum_{l=1}^{n_3}\|A_{*,*,l}\|_F^2$.

Denote by M(m, n) a collection of membership (or clustering) matrices Z ∈ {0, 1}^{n×m}, i.e., matrices such that Z has exactly one 1 per row and Z_{ik} = 1 iff node i belongs to the class Ω_k and is zero otherwise. Denote by C(m, n, L) a set of clustering matrices such that
$$\mathcal{C}(m, n, L) \subseteq \prod_{l=1}^{L}\mathcal{M}(m, n). \qquad (2.1)$$

2.2 Vectorization of the model

Let Z^{(l)} ∈ M(m, n) be the clustering matrix at the moment t_l. Then, it is easy to check that $\Lambda_{*,*,l} = Z^{(l)}G_{*,*,l}(Z^{(l)})^T$. Apply the operation of vectorization to Λ_{∗,∗,l}. Denote
$$\lambda^{(l)} = vec(\Lambda_{*,*,l}), \qquad b^{(l)} = vec(B_{*,*,l}), \qquad g^{(l)} = vec(G_{*,*,l}). \qquad (2.2)$$
Then, Theorem 1.2.22(i) of Gupta and Nagar (2000) yields
$$\lambda^{(l)} = (Z^{(l)} \otimes Z^{(l)})\,g^{(l)}. \qquad (2.3)$$
Note that
$$b_i^{(l)} \sim \mathrm{Bernoulli}(\lambda_i^{(l)}), \qquad i = 1, \cdots, n^2, \qquad (2.4)$$
where the $b_i^{(l)}$ are independent for different values of l but not of i, due to the symmetry. In addition, some values of $b_i^{(l)}$ and $\lambda_i^{(l)}$ are equal to zero by construction. In order to remove the redundant entries from λ^{(l)} and b^{(l)}, we use double-indexing. In particular, we replace the index i in λ^{(l)} corresponding to the element Λ_{i1,i2,l} by i = (i1, i2). Remove all rows i in (Z^{(l)} ⊗ Z^{(l)}) and λ^{(l)} that correspond to i1 ≥ i2. Then, (2.3) and (2.4) can be re-written as
$$\theta^{(l)} = \tilde{C}^{(l)}g^{(l)}, \qquad a_i^{(l)} \sim \mathrm{Bernoulli}(\theta_i^{(l)}), \quad i = 1, \cdots, n(n-1)/2, \qquad (2.5)$$

where θ^{(l)} and a^{(l)} are the reductions of the vectors λ^{(l)} and b^{(l)} to the collection of indices i = (i1, i2) with 1 ≤ i1 < i2 ≤ n, and the matrix $\tilde{C}^{(l)}$ is obtained from (Z^{(l)} ⊗ Z^{(l)}) by removing the rows with i1 ≥ i2. Note that, unlike b^{(l)}, the elements $a_i^{(l)}$ and $a_{i'}^{(l')}$ are independent if i ≠ i′ or l ≠ l′.

Observe that although we removed the redundant elements from the vectors λ^{(l)} and b^{(l)}, we have not done so for the vector g^{(l)}. Indeed, if we replace the index k in $g_k^{(l)}$ corresponding to the element G_{k1,k2,l} by k = (k1, k2), then (2.5) implies that
$$\theta_{i_1,i_2}^{(l)} = \sum_{k_1=1}^{m}\sum_{k_2=1}^{m}\tilde{C}_{i_1,i_2,k_1,k_2}^{(l)}\,g_{k_1,k_2}^{(l)} = \sum_{k_1=1}^{m}\sum_{k_2=k_1}^{m}\left[\tilde{C}_{i_1,i_2,k_1,k_2}^{(l)} + \tilde{C}_{i_1,i_2,k_2,k_1}^{(l)}\,I(k_1\neq k_2)\right]g_{k_1,k_2}^{(l)}.$$
Denote the reduction of the vector g^{(l)} to the components k = (k1, k2) with 1 ≤ k1 ≤ k2 ≤ m by q^{(l)} and
$$C_{i_1,i_2,k_1,k_2}^{(l)} = \tilde{C}_{i_1,i_2,k_1,k_2}^{(l)} + \tilde{C}_{i_1,i_2,k_2,k_1}^{(l)}\,I(k_1\neq k_2), \qquad 1 \le i_1 < i_2 \le n,\ 1 \le k_1 \le k_2 \le m. \qquad (2.6)$$

Let us reflect on the structure of the matrix C^{(l)}. Recall that the matrix $\tilde{C}^{(l)}$ was obtained from the matrix (Z^{(l)} ⊗ Z^{(l)}) by removing some of the rows. Therefore, using double-indexing, we obtain that $\tilde{C}_{i_1,i_2,k_1,k_2}^{(l)} = 1$ only if i1 ∈ Ω_{k1} and i2 ∈ Ω_{k2}, so that it has exactly one 1 per row. The matrix C^{(l)} is obtained from the matrix $\tilde{C}^{(l)}$ by adding the columns corresponding to the indices (k1, k2) and (k2, k1) with k1 ≠ k2. Therefore, the matrix C^{(l)} is also a binary matrix with exactly one 1 per row and, hence, it is a clustering matrix: $C_{i_1,i_2,k_1,k_2}^{(l)} = 1$ iff i1 ∈ Ω_{k1} and i2 ∈ Ω_{k2}, or i1 ∈ Ω_{k2} and i2 ∈ Ω_{k1}, where 1 ≤ i1 < i2 ≤ n and 1 ≤ k1 ≤ k2 ≤ m. Hence, C^{(l)} ∈ M(M, N) and one has
$$a^{(l)} = \theta^{(l)} + \xi^{(l)} \quad \text{with} \quad \theta^{(l)} = C^{(l)}q^{(l)}, \qquad l = 1, \cdots, L, \qquad (2.7)$$

where θ^{(l)} ∈ R^N, q^{(l)} ∈ R^M, N = n(n − 1)/2 and M = m(m + 1)/2. Here, the components $a_i^{(l)}$ of the vector a^{(l)} are independent Bernoulli variables with $P(a_i^{(l)} = 1) = \theta_i^{(l)}$, so that the components of the vectors ξ^{(l)} are independent for different values of i or l.

Consider matrices A, Θ ∈ R^{N×L} and Q ∈ R^{M×L} with columns a^{(l)}, θ^{(l)} and q^{(l)}, respectively, and denote a = vec(A), θ = vec(Θ) and q = vec(Q). Note that the vectors a, θ ∈ R^{NL} and q ∈ R^{ML} are obtained by stacking the vectors a^{(l)}, θ^{(l)} and q^{(l)} vertically for l = 1, · · · , L. Define a block-diagonal matrix C ∈ {0, 1}^{NL×ML} with blocks C^{(l)}, l = 1, · · · , L, on the diagonal. Then, (2.7) implies that
$$a = \theta + \xi \quad \text{with} \quad \theta = Cq = C\,vec(Q), \qquad (2.8)$$

where the $a_i$ are independent Bernoulli($\theta_i$) variables, i = 1, · · · , NL. Observe that if the matrix C were known, then the equations in (2.8) would represent a regression model with independent Bernoulli errors. Moreover, note that, for every l, the matrix $(C^{(l)})^TC^{(l)} = (D^{(l)})^2$ is diagonal with $D_{k_1,k_2}^{(l)} = \sqrt{N_{k_1,k_2}}$, where $N_{k_1,k_2}$ is the number of pairs (i1, i2) of nodes such that i1 < i2 and one node is in class Ω_{k1} while the other is in class Ω_{k2}.
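As a quick sanity check of the vectorization step, the following sketch verifies the identity vec(Z G Z^T) = (Z ⊗ Z) vec(G) used in (2.3) on random inputs; the sizes and the randomly drawn membership matrix Z are illustrative choices, not prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3                                   # illustrative sizes

# Random membership matrix Z: exactly one 1 per row
z = rng.integers(0, m, size=n)
Z = np.zeros((n, m))
Z[np.arange(n), z] = 1.0

G = rng.uniform(size=(m, m))
G = (G + G.T) / 2                             # symmetric connectivity slice G_{*,*,l}

vec = lambda A: A.flatten(order='F')          # column-stacking vectorization

Lam = Z @ G @ Z.T                             # Lambda_{*,*,l} = Z G_{*,*,l} Z^T
lhs = vec(Lam)
rhs = np.kron(Z, Z) @ vec(G)                  # (Z ⊗ Z) vec(G), cf. (2.3)
assert np.allclose(lhs, rhs)
```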

3 Assumptions and oracle inequalities for the DSBM

It is reasonable to assume that the values of the connection probabilities q^{(l)} do not change dramatically from one time instant to another. In other words, we assume that, for every k = 1, · · · , M, the vector $q_k = (q_k^{(1)}, \cdots, q_k^{(L)})$ represents values of a smooth function. In order to quantify this phenomenon, we assume that the vectors q_k have a sparse representation in some orthogonal basis H ∈ R^{L×L} with H^TH = HH^T = I_L, so that the vector $Hq_k^T$ has only a few nonzero coefficients. Using the matrix Q defined in the previous section, the latter can be expressed as $\|HQ^T\|_0 = \|QH^T\|_0$ being small. Note that by Theorem 1.2.22 of Gupta and Nagar (2000), one has
$$vec(QH^T) = (H \otimes I_M)\,vec(Q) = (H \otimes I_M)\,q, \qquad (3.1)$$
where W = (H ⊗ I_M) is an orthogonal matrix such that W^TW = WW^T = I_{ML}. Denote
$$d = Wq, \quad d \in \mathbb{R}^{ML}, \qquad J \equiv J_M = \{j : d_j \neq 0\}, \quad d_{J^C} = 0. \qquad (3.2)$$

Consider a set of clustering matrices C(m, n, L) satisfying (2.1). At this point we do not impose any restrictions on the set of clustering matrices, but later on we shall consider some special cases such as fixed membership (initial groups remain valid on the whole time interval) or limited change (at most n0 nodes can change their memberships between two consecutive time points). We find m, J, d and C as a solution of the following penalized least squares optimization problem
$$(\hat m, \hat J, \hat d, \hat C) = \operatorname*{argmin}_{m, J, d, C}\ \left\{\|a - CW^Td\|^2 + \mathrm{Pen}(|J|, m)\right\} \quad \text{s.t.} \quad d_{J^c} = 0,\ C \in \mathcal{C}(m, n, L), \qquad (3.3)$$
where a is defined in (2.8), d ∈ R^{ML}, W ∈ R^{ML×ML}, M = m(m + 1)/2 and
$$\mathrm{Pen}(|J|, m) = 11\,\log(|\mathcal{C}(m, n, L)|) + \frac{11}{2}\,|J|\log\left(\frac{2^5 m^2 L}{|J|}\right). \qquad (3.4)$$

Note that since minimization is carried out also with respect to m, the optimization problem should be solved separately for every m = 1, · · · , n, yielding $\hat d_M$, $\hat C_M$ and $\hat J_M$. After that, one needs to select the value $\hat M = \hat m(\hat m + 1)/2$ that delivers the minimum in (3.3), so that
$$\hat d = \hat d_{\hat M}, \qquad \hat C = \hat C_{\hat M}, \qquad \hat J = \hat J_{\hat M}. \qquad (3.5)$$
Finally, due to (3.2), we set $\hat W = (H \otimes I_{\hat M})$ and calculate
$$\hat q = \hat W^T\hat d, \qquad \hat\theta = \hat C\hat q. \qquad (3.6)$$

We obtain $\hat\Lambda$ by packing the vector $\hat\theta$ into the tensor and taking the symmetries into account.

Denote the true value of the tensor Λ by Λ*. Also, denote by m* the true number of groups, by q* and θ* the true values of q and θ in (2.8) and by C* the true value of C. Let M* = m*(m* + 1)/2 and W* = (H ⊗ I_{M*}) be the true values of M and W, respectively. Note that the vector θ* is obtained by vectorizing Λ* and then removing the redundant entries. Then, it follows from (2.8) that
$$a = \theta^* + \xi \quad \text{with} \quad \theta^* = C^*q^* = C^*W^{*T}d^*. \qquad (3.7)$$
Due to the relation between the ℓ2 and the Frobenius norms, one has
$$\|\theta - \theta^*\|^2 \le \|\Lambda - \Lambda^*\|^2 \le 2\|\theta - \theta^*\|^2, \qquad (3.8)$$
and the following statement holds.

Theorem 1. Consider a DSBM with a true matrix of probabilities Λ* and the estimator $\hat\Lambda$ obtained according to (3.3)–(3.6). Then, for any t > 0, one has
$$P\left\{\frac{\|\hat\Lambda - \Lambda^*\|^2}{n^2L} \le \min_{\substack{m,J,q\\ C\in\mathcal{C}(m,n,L)}}\left[\frac{6\,\|CW^Td_{(J)} - \theta^*\|^2}{n^2L} + \frac{4\,\mathrm{Pen}(|J|, m)}{n^2L}\right] + \frac{38\,t}{n^2L}\right\} \ge 1 - 9e^{-t} \qquad (3.9)$$

and
$$E\left(\frac{\|\hat\Lambda - \Lambda^*\|^2}{n^2L}\right) \le \min_{\substack{m,J,q\\ C\in\mathcal{C}(m,n,L)}}\left[\frac{6\,\|CW^Td_{(J)} - \theta^*\|^2}{n^2L} + \frac{4\,\mathrm{Pen}(|J|, m)}{n^2L}\right] + \frac{342}{n^2L}, \qquad (3.10)$$

where d_{(J)} is the modification of the vector d where all elements d_j with j ∉ J are set to zero.

The proof of the Theorem is given in Section 7. Here, we just explain its idea. Note that if the values of m and C are fixed, the problem (3.3) reduces to a regression problem with a complexity penalty Pen(|J|, m). Moreover, if J is known, the optimal estimator $\hat d$ of d* is just a projection estimator. Indeed, denote Υ = CW^T and let Υ_J = (CW^T)_J be the reduction of the matrix CW^T to the columns j ∈ J. Given $\hat m$, $\hat J$ and $\hat C$, one obtains $\hat M = \hat m(\hat m + 1)/2$, $\hat W = (H \otimes I_{\hat M})$, $\hat\Upsilon = \hat C\hat W^T$ and $\hat\Upsilon_{\hat J} = (\hat C\hat W^T)_{\hat J}$. Let
$$\Pi_J = \Upsilon_J(\Upsilon_J^T\Upsilon_J)^{-1}\Upsilon_J^T \quad \text{and} \quad \hat\Pi_{\hat J} = \hat\Upsilon_{\hat J}(\hat\Upsilon_{\hat J}^T\hat\Upsilon_{\hat J})^{-1}\hat\Upsilon_{\hat J}^T \qquad (3.11)$$

b b, respectively. Then, it is easy to see be projection matrices on the column spaces of ΥJ and Υ J b is of the form bd b=Π b ba and vector d that Υ J b = (Υ b TbΥ b −1 Υ b Tb a. d J Jb) J

(3.12)

b can be obtained as a solution of the following optimization problem Hence, the values of m, b Jb and C   b m, (C, b Jb) = argmin ka − ΠJ ak2 + Pen(|J|, m) s.t. C ∈ C(m, n, L), m,J,C

where ΠJ and the penalty Pen(|J|, m) are defined in (3.11) and (3.4), respectively. After that, the proof uses arguments that are relatively standard for the penalized least squares estimation.

Note that in the right-hand sides of expressions (3.9) and (3.10), ‖CW^Td_{(J)} − θ*‖² is the bias term that quantifies how well one can estimate the true values of the probabilities θ* by blocking them together, averaging the values in each block and simultaneously setting all but |J| coefficients of the expansion of the vector q in (2.8) in the basis W to zero. The penalty constitutes the "price" for choosing too many blocks and coefficients. In particular, the second term in (3.4), (n²L)^{−1}|J| log(4m²L/|J|), is due to the need to find and estimate |J| elements of the Lm(m + 1)/2-dimensional vector. The first term, log(|C(m, n, L)|), accounts for the difficulty of clustering and is due to the application of the union bound in probability.

Theorem 1 holds for any collection C(m, n, L) of clustering matrices. Denote by Z(m, n, n0, L) the collection of clustering matrices corresponding to the situation where at most n0 nodes can change their memberships between any two consecutive time points, so that
$$|\mathcal{Z}(m, n, n_0, L)| = m^n\left[\binom{n}{n_0}m^{n_0}\right]^{L-1}; \qquad |\mathcal{Z}(m, n, 0, L)| = m^n; \qquad |\mathcal{Z}(m, n, n, L)| = m^{nL}. \qquad (3.13)$$

Note that the case of n0 = 0 corresponds to the scenario where the group memberships of the nodes are constant and do not depend on time, while the case of n0 = n means that the memberships of all nodes can change arbitrarily from one time instant to another.

Corollary 1. Consider a DSBM with a true matrix of probabilities Λ* and the estimator $\hat\Lambda$ obtained according to (3.3)–(3.6) where C(m, n, L) = Z(m, n, n0, L). Then, inequalities (3.9) and (3.10) hold with
$$\frac{\mathrm{Pen}(|J|, m)}{n^2L} = \frac{11\log m}{nL} + \frac{11\,n_0(L-1)}{n^2L}\log\left(\frac{mne}{n_0}\right) + \frac{11\,|J|}{2n^2L}\log\left(\frac{2^5 m^2L}{|J|}\right). \qquad (3.14)$$

Remark 1 (Sparse SBM). Theorem 1 provides an oracle inequality in the case of a time-independent SBM (L = 1). Indeed, in this case, by taking H = 1 and W = I_M, we obtain for any t > 0
$$E\frac{\|\hat\Lambda - \Lambda^*\|^2}{n^2} \le \min_{\substack{m,J,q\\ C\in\mathcal{M}(m,n)}}\left[\frac{6\,\|Cq_{(J)} - \theta^*\|^2}{n^2} + \frac{44\log m}{n} + \frac{22\,|J|}{n^2}\log\left(\frac{2^5 m^2}{|J|}\right)\right] + \frac{342}{n^2} \qquad (3.15)$$
and a similar result holds for the probability. Note that if |J| = m(m + 1)/2, our result coincides with the one of Gao et al. (2015). However, when |J| is small, the right-hand side of (3.15) may be smaller than C(log m/n + m²/n²) obtained in Gao et al. (2015). In addition, the estimator obtained above is adaptive to the unknown number of classes. In particular, (3.15) automatically provides an upper bound for the error when the network is sparse and only |J| pairs of classes have a nonzero probability of connection.
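For reference, the sketch below computes the penalty numerically, on the log scale, using the cardinality formula (3.13) inside (3.4); the 2^5 constant follows the reconstruction of (3.4) above, and all argument values in the example call are illustrative.

```python
import numpy as np
from math import lgamma, log

def log_binom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_card_Z(m, n, n0, L):
    # log |Z(m, n, n0, L)| per (3.13): m^n * [ C(n, n0) m^{n0} ]^{L-1}
    return n * log(m) + (L - 1) * (log_binom(n, n0) + n0 * log(m))

def penalty(J_size, m, n, n0, L):
    # Pen(|J|, m) per (3.4) with C(m, n, L) = Z(m, n, n0, L)
    return 11.0 * log_card_Z(m, n, n0, L) + 5.5 * J_size * log(2**5 * m**2 * L / J_size)

print(penalty(J_size=10, m=3, n=50, n0=2, L=20))
```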

4 The lower bounds for the risk for the DSBM

In order to prove that the estimator obtained as a solution of the optimization problem (3.3) is optimal, we need to show that the minimax lower bound for the mean squared error of any estimator of Λ is proportional to Pen(|J|, m) defined in (3.4). To be more specific, we consider the collection of clustering matrices Z(m, n, n0, L) with cardinality given by (3.13), which corresponds to the situation where at most n0 nodes can change their memberships between consecutive time instants. In this case, Pen(|J|, m) is given in (3.14). In order to derive the lower bounds for the error, we impose mild conditions on the orthogonal matrix H: for any binary vector ω ∈ {0, 1}^L one has
$$\|H^T\omega\|_\infty \le \|\omega\|_1/\sqrt{L} \quad \text{and} \quad H\,\mathbf{1} = \sqrt{L}\,e_1, \qquad (4.1)$$
where $\mathbf{1} = (1, 1, \cdots, 1)^T$ and $e_1 = (1, 0, \cdots, 0)^T$. Assumptions (4.1) are not restrictive; in fact, they are satisfied for a variety of common orthogonal transforms such as the Fourier or wavelet transforms. Let G_{m,L,s} be a collection of tensors such that G ∈ G_{m,L,s} implies that the vectorized version q of G can be written as q = W^Td with ‖d‖0 ≤ s.

Theorem 2. Let the orthogonal matrix H satisfy condition (4.1). Consider the DSBM where G ∈ G_{m,L,s} with s ≥ κm², where κ > 0 is independent of m, n and L. Denote γ = min(κ, 1/2) and assume that L ≥ 2, n ≥ 2m, $n_0 \le \min\left(\gamma n,\ \tfrac{4}{3}\,\gamma n\, m^{-1/9}\right)$ and
$$s^2\log(2LM/s) \le 68\,LMn^2. \qquad (4.2)$$
Then
$$\inf_{\hat\Lambda}\ \sup_{\substack{G\in\mathcal{G}_{m,L,s}\\ C\in\mathcal{Z}(m,n,n_0,L)}}P\left\{\frac{\|\hat\Lambda - \Lambda\|^2}{n^2L} \ge C_\gamma\left[\frac{\log m}{nL} + \frac{n_0}{n^2}\log\left(\frac{mne}{n_0}\right) + \frac{s\log(Lm^2/s)}{n^2L}\right]\right\} \ge \frac{1}{4}, \qquad (4.3)$$
where $\hat\Lambda$ is any estimator of Λ and C_γ is an absolute constant that depends on γ only.

Note that the terms log m/(nL) and n0 n^{−2} log(mne/n0) in (4.3) correspond to, respectively, the error of initial clustering and the clustering error due to membership changes. The third term, (n²L)^{−1}s log(Lm²/s), is due to nonparametric estimation and model selection. The condition that s ≥ κm² for some κ > 0 independent of m, n and L ensures that one does not have too many classes whose nodes have no interactions with each other or with members of other classes, while (4.2) is just a technical condition which ensures that the elements of the tensor Λ are probabilities. If G ∈ G_{m,L,s} and C ∈ Z(m, n, n0, L), then the lower bound for the error in (4.3) coincides, up to a constant, with the penalty in (3.14). Since the bias term is equal to zero for the true tensor of probabilities Λ*, the latter means that the estimator constructed above is minimax optimal up to a constant.
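As an aside on condition (4.1): the two requirements, as reconstructed above, can be checked numerically for a concrete basis. The sketch below uses a normalized Walsh–Hadamard matrix (L a power of two) purely as one example of an orthogonal transform with a constant first row and entries of magnitude 1/√L; the choice of this basis and of L = 8 is an illustrative assumption.

```python
import numpy as np

def hadamard(L):
    # Sylvester construction; L must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < L:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(L)                     # orthonormal rows and columns

L = 8
H = hadamard(L)
rng = np.random.default_rng(3)

assert np.allclose(H @ H.T, np.eye(L))                              # orthogonality
assert np.allclose(H @ np.ones(L), np.sqrt(L) * np.eye(L)[:, 0])    # H 1 = sqrt(L) e_1

for _ in range(100):                          # ||H^T w||_inf <= ||w||_1 / sqrt(L) for binary w
    w = rng.integers(0, 2, size=L).astype(float)
    assert np.max(np.abs(H.T @ w)) <= w.sum() / np.sqrt(L) + 1e-12
```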

5 Dynamic graphon estimation

Consider the situation where the tensor Λ is generated by a dynamic graphon f, so that Λ is given by expression (1.1), where the function f : [0, 1]^3 → [0, 1] is such that f(x, y, t) = f(y, x, t) for any t and ζ = (ζ1, · · · , ζn) is a random vector sampled from a distribution Pζ supported on [0, 1]^n. We further assume that the probabilities Λi,j,l do not change drastically from one time point to another, i.e., that f is smooth in t. We shall also assume that f is piecewise smooth in x and y.

In order to quantify those assumptions, for each (x, y) ∈ [0, 1]², we consider a vector f(x, y) = (f(x, y, t1), · · · , f(x, y, tL))^T and an orthogonal transform H used in the previous sections. We assume that the elements v_l(x, y) of the vector v(x, y) = Hf(x, y) satisfy the following assumption:

(A). There exist constants 0 = β0 < β1 < · · · < βr = 1 and ν1, ν2, K1, K2 > 0 such that for any x, x′ ∈ (β_{i−1}, β_i] and y, y′ ∈ (β_{j−1}, β_j], 1 ≤ i, j ≤ r, one has
$$[v_l(x, y) - v_l(x', y')]^2 \le K_1\left[|x - x'| + |y - y'|\right]^{2\nu_1}, \qquad (5.1)$$
$$\sum_{l=2}^{L}l^{2\nu_2}v_l^2(x, y) \le K_2. \qquad (5.2)$$

Note that, for a graphon corresponding to the DSBM model, on each of the rectangles (β_{i−1}, β_i] × (β_{j−1}, β_j], the functions v_l(x, y) are constant, so that ν1 = ∞. The sum in (5.2) starts from l = 2 since, for the majority of transforms, the term v1(x, y) returns the average of f(x, y, t) over t ∈ [0, 1] and does not need to be penalized. Moreover, if one simply considers values of l starting from zero, the conditions $\sum_{l=1}^{L-1}l^{2\nu_2}v_l^2(x, y) \le K_2$ and $\sum_{l=0}^{L-1}l^{2\nu_2}v_l^2(x, y) \le K_2$ become equivalent.

We denote the class of graphons satisfying assumptions (1.1)–(5.2) by Σ(ν1, ν2, K1, K2). Note that, since ν1, ν2, K1 and K2 in Assumption A are independent of x and y, one can simplify the optimization procedure as follows. Denote V = QH^T. Since the memberships do not change with time, the matrices C^{(l)} are independent of l, so that C^{(l)} = Z. Then, Θ = ZQ = ZVH. Denote Φ = ΘH^T = ZV. Apply the transform H to the data matrix A, obtaining the matrix X = AH^T. Choose ρ ∈ (0, 1] and remove all columns X_{∗,l} with l > L^ρ, obtaining a matrix X^{(ρ)} with EX^{(ρ)} = ZV^{(ρ)} ≡ Φ^{(ρ)}, where the matrix V^{(ρ)} has L^ρ columns. Since |J| = ML^ρ ≤ m²L^ρ, the optimization procedure (3.3) in this case can be reformulated as
$$(\hat m, \hat\rho, \hat V^{(\hat\rho)}, \hat Z) = \operatorname*{argmin}_{\substack{m,\rho,V^{(\rho)}\\ Z\in\mathcal{Z}(m,n,0,L)}}\left\{\|X^{(\rho)} - ZV^{(\rho)}\|^2 + 11\,n\log m + \frac{11}{2}\,m^2L^\rho\log(2^5L^{1-\rho})\right\} \qquad (5.3)$$

where Z(m, n, 0, L) is defined in (3.13). After that, set $\hat\Theta = \hat Z\hat V^{(\hat\rho)}H$ and obtain $\hat\Lambda$ by packing $\hat\Theta$ into a tensor. The following theorem provides a minimax upper bound for the risk of the estimator $\hat\Lambda$.

Theorem 3. Let $\hat\Lambda$ be obtained as a solution of the optimization problem (5.3) as described above. Then, for Σ ≡ Σ(ν1, ν2, K1, K2), one has
$$\sup_{f\in\Sigma}\frac{E\|\hat\Lambda - \Lambda^*\|^2}{n^2L} \le C\min_{\substack{1\le h\le n-r\\ 0\le\rho\le 1}}\left\{\frac{L^{\rho-1}}{h^{2\nu_1}} + \frac{I(\rho<1)}{L^{2\rho\nu_2+1}} + \frac{(h+r)^2(1+\log L)}{n^2L^{1-\rho}} + \frac{\log(h+r)}{nL}\right\}, \qquad (5.4)$$

where the constant C depends on ν1, ν2, K1 and K2 only.

Note that the construction of the estimator $\hat\Lambda$ does not require knowledge of ν1, ν2, K1 and K2, so the estimator is fully adaptive. On the other hand, the expression on the right-hand side of (5.4) is rather complex and hard to analyze. For this reason, we shall consider only two regimes: r = r_{n,L} ≥ 2 may depend on n or L and ν1 = ∞; or r = r0 ≥ 1 is a fixed quantity independent of n and L. The first regime corresponds to a piecewise constant (in x and y) graphon that generates the DSBM, while the second regime deals with the situation where f is a piecewise smooth function of all three arguments. In the first case, we set h = 2 and, in the second, we choose h to be a function of n and L. By minimizing the right-hand side of (5.4), we obtain the following statement.
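The transform-and-truncate step in (5.3) is straightforward to prototype. The sketch below applies H^T along the time direction, keeps the first L^ρ coefficients, block-averages over an estimated clustering, and transforms back; it uses plain k-means on the time-averaged adjacency as a computationally convenient stand-in for the exact least-squares search over membership matrices Z, so it is not the paper's estimator, and all tuning choices (m, ρ, the clustering heuristic) are illustrative assumptions.

```python
import numpy as np

def estimate_dsbm_graphon(B, m, rho, H):
    """Sketch of the transform-and-truncate step around (5.3): transform in time, keep L^rho
    coefficients, cluster nodes (k-means stand-in for the exact search over Z), block-average,
    and transform back. Not the paper's exact estimator."""
    n, _, L = B.shape
    rng = np.random.default_rng(4)
    X = B.reshape(n * n, L) @ H.T                  # X = A H^T (redundant entries kept for simplicity)
    L_rho = max(1, int(round(L ** rho)))
    X3 = X[:, :L_rho].reshape(n, n, L_rho)         # keep the first L^rho transform coefficients

    # Crude node clustering on the time-averaged adjacency (illustrative stand-in for argmin over Z)
    profile = B.mean(axis=2)
    centers = profile[rng.choice(n, m, replace=False)]
    for _ in range(20):
        z = np.argmin(((profile[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        centers = np.stack([profile[z == k].mean(axis=0) if np.any(z == k) else centers[k]
                            for k in range(m)])

    # Block-average the truncated coefficients, then invert the transform: Theta = Phi^(rho) H_{1:L^rho,:}
    V = np.zeros((m, m, L_rho))
    for k1 in range(m):
        for k2 in range(m):
            mask = np.outer(z == k1, z == k2)
            if mask.any():
                V[k1, k2] = X3[mask].mean(axis=0)
    Phi = V[z][:, z]                               # Phi[i, j, :] = V[z_i, z_j, :]
    Theta = np.tensordot(Phi, H[:L_rho, :], axes=(2, 0))
    return np.clip(Theta, 0.0, 1.0)
```

With an orthogonal L × L matrix H (for example, the Hadamard-type basis sketched earlier) and ρ ∈ (0, 1], the output is an n × n × L array of estimated connection probabilities.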

Corollary 2. Let $\hat\Lambda$ be obtained as a solution of the optimization problem (5.3) as described above. Then, for Σ ≡ Σ(ν1, ν2, K1, K2) and C independent of n and L, one has
$$\sup_{f\in\Sigma}\frac{E\|\hat\Lambda - \Lambda^*\|^2}{n^2L} \le \begin{cases} C\,\min\left\{\dfrac{1}{L}\left[\left(\dfrac{r}{n}\right)^2\log L\right]^{\frac{2\nu_2}{2\nu_2+1}};\ \left(\dfrac{r}{n}\right)^2\log L\right\} + \dfrac{C\log r}{nL}, & \text{if } r = r_{n,L};\\[2ex] C\,\min\left\{\dfrac{1}{L}\left[\dfrac{\log L}{n^2}\right]^{\frac{2\nu_1\nu_2}{(\nu_1+1)(2\nu_2+1)}};\ \left[\dfrac{\log L}{n^2}\right]^{\frac{\nu_1}{\nu_1+1}}\right\} + \dfrac{C\log n}{nL}, & \text{if } r = r_0. \end{cases} \qquad (5.5)$$

In order to assess optimality of the penalized least squares estimator obtained above, we derive lower bounds for the minimax risk over the set Σ(ν1, ν2, K1, K2). These lower bounds are constructed separately for each of the two regimes.

Theorem 4. Let the matrix H satisfy assumptions (4.1) and ν2 ≥ 1/2 in (5.2). Then, for C independent of n and L, one has
$$\inf_{\hat\Lambda}\ \sup_{f\in\Sigma(\nu_1,\nu_2,K_1,K_2)}P_\Lambda\left\{\frac{\|\hat\Lambda - \Lambda\|^2}{n^2L} \ge \Delta(n, L)\right\} \ge \frac{1}{4}, \qquad (5.6)$$
where
$$\Delta(n, L) = \begin{cases} C\,\min\left\{\dfrac{1}{L}\left[\left(\dfrac{r}{n}\right)^2\right]^{\frac{2\nu_2}{2\nu_2+1}};\ \left(\dfrac{r}{n}\right)^2\right\} + \dfrac{C\log r}{nL}, & \text{if } r = r_{n,L};\\[2ex] C\,\min\left\{\dfrac{1}{L}\left[\dfrac{1}{n^2}\right]^{\frac{2\nu_1\nu_2}{(\nu_1+1)(2\nu_2+1)}};\ \left[\dfrac{1}{n^2}\right]^{\frac{\nu_1}{\nu_1+1}}\right\} + \dfrac{C\log n}{nL}, & \text{if } r = r_0. \end{cases} \qquad (5.7)$$

It is easy to see that the value of ∆(n, L) coincides with the upper bound in (5.5) up to at most a logarithmic factor of n/r or L. In both cases, the first quantities in the minimums correspond to the situation where f is smooth enough as a function of time, so that application of the transform H improves the estimation precision by reducing the number of parameters that need to be estimated. The second quantities represent the case where one needs to keep all elements of the vector d and hence application of the transform yields no benefits. The latter can be due to the fact that ν2 is too small or L is too low. The upper and the lower bounds in Theorems 3 and 4 look somewhat similar to the ones appearing in anisotropic function estimation (see, e.g., Lepski (2015)). Note also that although in the case of a stationary graphon (L = 1) the estimation precision does not improve if ν1 > 1, this is no longer true for a dynamic graphon. Indeed, the right-hand sides in (5.7) can be significantly smaller when ν1, ν2 and L are large.

Remark 2 (DSBM and dynamic graphon). Observe that since we assume that the vector ζ is independent of t, our setup corresponds to the situation where the nodes of the network do not change their memberships in time and n0 = 0 in (3.13). Therefore, a piecewise constant graphon is just a particular case of a general DSBM, since the latter allows arbitrary temporal changes of the nodes' memberships. On the other hand, the dynamic piecewise constant graphon formulation allows one to derive specific minimax convergence rates for the model.


6 Extensions and generalizations

In the present paper we considered estimation of connection probabilities in the context of dynamic network models. Our approach is based on vectorization of the model and allows a variety of extensions.

1. (Inhomogeneous or non-smooth connection probabilities). Although assumption (5.2) essentially implies that the probabilities of connections are represented by smooth functions of time that belong to the same Sobolev class, one can easily extend our theory to the case where those probabilities have jump discontinuities and belong to different functional classes. Indeed, by generalizing condition (5.2) and assuming that H is a wavelet transform, one can accommodate this case similarly to how this was done in Klopp and Pensky (2015).

2. (Time-dependent number of nodes). One can apply the theory above even when the number of nodes in the network changes from one time instant to another. Indeed, in this case we can form a set that includes all nodes which have ever been in the network and denote their number by n. Consider a class Ω0 such that all nodes in this class have zero probability of interaction with each other or with any other node in the network. At each time instant, place all nodes that are not in the network into the class Ω0. After that, one just needs to modify the optimization procedures by placing additional restrictions that out-of-the-network nodes indeed belong to class Ω0 and that G_{0,k,l} = 0 for any k = 0, 1, 2, · · · , m and l = 1, · · · , L.

3. (Adaptivity to clustering complexity). Although, in the case of the DSBM model, our estimator is adaptive to the unknown number of classes, it requires knowledge of the extent to which nodes' memberships can change from one time instant to another. For example, if at most n0 nodes can change their memberships between two consecutive time points and n0 is a fixed quantity independent of n and m, we can replace n0 by log n, which dominates n0 if n is large enough. However, if n0 depends on n and m, development of an adaptive estimator requires additional investigation.

4. (More general dynamic graphon function). In the present paper, we consider a graphon function which is based on a time-independent random vector ζ = (ζ1, · · · , ζn) sampled from a distribution Pζ supported on [0, 1]^n, and connection probabilities are defined by (1.1). This model corresponds to the case where nodes do not change their memberships in time. It is possible, however, to consider a more diverse scenario where there exists a latent random field ζ(t) = (ζ1(t), · · · , ζn(t)) and the probability of a connection is given by Λ_{i,j,l} = f(ζi(tl), ζj(tl), tl), i, j = 1, · · · , n, l = 1, · · · , L. The latter model is much more complex, since one needs to answer the question of how assumptions about the trajectories of ζi(t) affect the behavior of f.

Acknowledgments Marianna Pensky was partially supported by National Science Foundation (NSF) grant DMS1407475.


7 Proofs

7.1 Proofs of the upper bounds for the risk

Proof of Theorem 1. Since $(\hat d, \hat C, \hat m, \hat J)$ are solutions of the optimization problem (3.3), for any m, J, C and d one has
$$\|a - \hat C\hat W^T\hat d_{(\hat J)}\|^2 + \mathrm{Pen}(|\hat J|, \hat m) \le \|a - CW^Td_{(J)}\|^2 + \mathrm{Pen}(|J|, m). \qquad (7.1)$$

For any m, J, C and d, it follows from (3.7) that
$$\|a - CW^Td_{(J)}\|^2 = \|(a - C^*W^{*T}d^*) + (C^*W^{*T}d^* - CW^Td_{(J)})\|^2 = \|\xi\|^2 + \|CW^Td_{(J)} - C^*W^{*T}d^*\|^2 + 2\xi^T(C^*W^{*T}d^* - CW^Td_{(J)}).$$
Hence, plugging the last identity into the inequality (7.1), we derive that for any m, J, C and d
$$\|\hat C\hat W^T\hat d_{(\hat J)} - C^*W^{*T}d^*\|^2 \le \|CW^Td_{(J)} - C^*W^{*T}d^*\|^2 + 2\Delta + \mathrm{Pen}(|J|, m) - \mathrm{Pen}(|\hat J|, \hat m). \qquad (7.2)$$

Here
$$\Delta = 2|\xi^T(\hat C\hat W^T\hat d_{(\hat J)} - CW^Td_{(J)})| \le \Delta_1 + \Delta_2 + \Delta_3, \qquad (7.3)$$
$$\Delta_1 = 2|\xi^T(C^*W^{*T}d^* - CW^Td_{(J)})|, \qquad \Delta_2 = 2|\xi^T(I_{NL} - \hat\Pi_{\hat J})C^*W^{*T}d^*|, \qquad \Delta_3 = 2\xi^T\hat\Pi_{\hat J}\xi, \qquad (7.4)$$
since, due to (3.11) and (3.12), one has $\hat C\hat W^T\hat d_{(\hat J)} = \hat\Pi_{\hat J}a$, where a is given by (3.7). Now, we need to find upper bounds for each of the terms in (7.4). By Lemma 1 with α = 1/2 and any t > 0, one has
$$P\left\{\Delta_1 - 0.5\,\|\xi^T(C^*W^{*T}d^* - CW^Td_{(J)})\|^2 \le 4t\right\} \ge 1 - 2e^{-t}. \qquad (7.5)$$

Note that
$$\|\hat C\hat W^T\hat d_{(\hat J)} - C^*W^{*T}d^*\|^2 = \|\hat\Pi_{\hat J}\xi\|^2 + \|(I_{NL} - \hat\Pi_{\hat J})C^*W^{*T}d^*\|^2 \ge \|(I_{NL} - \hat\Pi_{\hat J})C^*W^{*T}d^*\|^2.$$
Therefore, applying Lemma 1 with α = 1/2 and a union bound over m = 1, · · · , n, C ∈ C(m, n, L) and J with |J| = 1, · · · , ML, we derive that for any t > 0
$$P\left\{\Delta_2 - 0.5\,\|\hat C\hat W^T\hat d_{(\hat J)} - C^*W^{*T}d^*\|^2 - 4R(\hat m, \hat J, L) \le 4t\right\} \ge 1 - 6e^{-t}, \qquad (7.6)$$

where
$$R(m, J, L) = \log(|\mathcal{C}(m, n, L)|) + |J|\log\left(m^2Le/|J|\right) + 2\log(m|J|). \qquad (7.7)$$

b b and again use the Finally, in order to obtain an upper bound for ∆3 , apply Lemma 2 with A = Π J union upper bound over m = 1, · · · , n, C ∈ C(m, n, L) and J with |J| = 1, · · · , M L. Since for any projection matrix ΠJ , one has kΠJ k = 1 and kΠJ k2F = |J|, derive for any t > 0 that   3 3t b b P ∆3 − |J | − R(m, b J , L) ≤ ≥ 1 − e−t , (7.8) 2 2

where R(m, J, L) is defined in (7.7). Combining (7.2)–(7.8) and recalling that $\hat\theta = \hat C\hat W^T\hat d_{(\hat J)}$ and θ* = C*W*^T d*, we obtain
$$P\left\{\|\hat\theta - \theta^*\|^2 \le \min_{\substack{m,J,q\\ C\in\mathcal{C}(m,n,L)}}\left[3\,\|CW^Td_{(J)} - \theta^*\|^2 + 2\,\mathrm{Pen}(|J|, m)\right] + 19\,t\right\} \ge 1 - 9e^{-t}$$

provided
$$2\,\mathrm{Pen}(|J|, m) \ge 11\,R(m, J, L) + 2|J|. \qquad (7.9)$$

In order to complete the proof of (3.9), apply formula (3.8) and note that log(|C(m, n, L)|) ≥ log(|M(m, n)|) = n log m ≥ 2 log m for n ≥ 2 and 2 log(|J|) ≤ 2|J|, so that Pen(|J|, m) in (3.4) guarantees (7.9). Finally, inequality (3.10) can be proved by noting that for any random variable ζ one has $E\zeta \le \int_0^\infty P(\zeta > z)\,dz$.

Proof of Theorem 3. To prove (5.4), we use inequality (3.10) in Theorem 1. In order to find an upper bound for the bias term, we need to cluster the nodes and create an approximate connectivity tensor Q. For this purpose, let h be a positive integer and denote κ_j = 1 + [(β_j − β_{j−1})h], where [x] is the largest integer no larger than x. For k = 1, · · · , κ_j and j = 1, · · · , r, consider a set of intervals $U_{j,k} = (U_{j,k}^{(1)}, U_{j,k}^{(2)}]$ with
$$U_{j,k}^{(1)} = \beta_{j-1} + (k-1)/h, \qquad U_{j,k}^{(2)} = \min\{\beta_{j-1} + k/h,\ \beta_j\}.$$
The intervals U_{j,k} subdivide every interval (β_{j−1}, β_j] into κ_j sub-intervals of length at most 1/h, and the total number of intervals is equal to $m = \sum_{j=1}^{r}\kappa_j$.

Re-number the intervals consecutively as U1, · · · , Um and observe that, since (β_j − β_{j−1})h ≤ κ_j ≤ 1 + (β_j − β_{j−1})h, one has
$$h \le m \le h + r. \qquad (7.10)$$
The value m in (7.10) acts as a number of classes. Indeed, if ζi ∈ U_k, we place node i into class Ω_k and set Z_{i,j} = I(j = k). Let Θ* be the true tensor of connection probabilities. Set Q = (Z^TZ)^{−1}Z^TΘ* and V = QH^T. Since H is an orthogonal matrix, the bias term in the oracle inequality (3.10) is equal to
$$\|ZV^{(\rho)} - \Theta H^T\|_F^2 = \|ZV^{(\rho)} - \Phi\|_F^2 = \|ZV^{(\rho)} - \Phi^{(\rho)}\|_F^2 + \|\Phi - \Phi^{(\rho)}\|_F^2. \qquad (7.11)$$
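The interval construction and the induced class assignment are easy to make concrete; the sketch below builds the intervals U_{j,k} and assigns each node to the class whose interval contains ζ_i, with the breakpoints β and the resolution h chosen purely for illustration.

```python
import numpy as np

def build_intervals(beta, h):
    # U_{j,k} = (beta_{j-1} + (k-1)/h, min(beta_{j-1} + k/h, beta_j)], cf. the construction above
    intervals = []
    for j in range(1, len(beta)):
        kappa_j = 1 + int((beta[j] - beta[j - 1]) * h)
        for k in range(1, kappa_j + 1):
            lo = beta[j - 1] + (k - 1) / h
            hi = min(beta[j - 1] + k / h, beta[j])
            if lo < hi:
                intervals.append((lo, hi))
    return intervals

def assign_classes(zeta, intervals):
    # node i goes to class k if zeta_i falls in U_k (half-open on the left)
    z = np.zeros(len(zeta), dtype=int)
    for i, x in enumerate(zeta):
        z[i] = next(k for k, (lo, hi) in enumerate(intervals) if lo < x <= hi)
    return z

beta = [0.0, 0.3, 1.0]                         # illustrative breakpoints beta_0 < ... < beta_r
intervals = build_intervals(beta, h=4)         # h <= len(intervals) <= h + r, cf. (7.10)
zeta = np.random.default_rng(5).uniform(1e-6, 1.0, size=10)
z = assign_classes(zeta, intervals)
```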

The upper bound for the first term can be found by repeating the calculation of Gao et al. (2015) with the only difference that f is replaced by v_l and there is an additional sum over l = 1, · · · , L^ρ. Then, we obtain
$$\|ZV^{(\rho)} - \Phi^{(\rho)}\|_F^2 \le K_1 2^{2\nu_1}n^2L^\rho h^{-2\nu_1}. \qquad (7.12)$$
On the other hand, if ρ < 1, then by Assumption A,
$$\|\Phi - \Phi^{(\rho)}\|_F^2 \le \sum_{i_1,i_2=1}^{n}\ \sum_{l=L^\rho+1}^{L}v_l^2(\zeta_{i_1}, \zeta_{i_2}) \le K_2n^2L^{-2\nu_2\rho} \qquad (7.13)$$
and $\|\Phi - \Phi^{(\rho)}\|_F^2 = 0$ if ρ = 1. Now, note that for given m and ρ, one has |J| ≤ m²L^ρ, so that
$$\mathrm{Pen}(|J|, m) \le C\left(n\log m + m^2L^\rho\log(2^5L^{1-\rho})\right). \qquad (7.14)$$

Therefore, (7.10)–(7.14) yield (5.4).


Proof of Corollary 2. Note that if ν1 = ∞, one can set h = 2, so that the first term in (5.4) is equal to zero, h + r ≤ 3r and
$$\frac{E\|\hat\Lambda - \Lambda^*\|^2}{n^2L} \le C\left\{\frac{I(\rho<1)}{L^{2\rho\nu_2+1}} + \left(\frac{r}{n}\right)^2\frac{\log L}{L^{1-\rho}} + \frac{\log r}{nL}\right\}.$$
Minimizing this expression with respect to ρ ∈ [0, 1], we obtain the result in (5.5) for r = r_{n,L}. If r = r_0, then
$$\frac{E\|\hat\Lambda - \Lambda^*\|^2}{n^2L} \le C\left\{\frac{L^{\rho-1}}{h^{2\nu_1}} + \frac{I(\rho<1)}{L^{2\rho\nu_2+1}} + \frac{h^2\log L}{n^2L^{1-\rho}} + \frac{\log h}{nL}\right\}. \qquad (7.15)$$
Minimizing (7.15) with respect to h and ρ, we obtain that the values h* and $L^* = L^{\rho^*}$ delivering the minimum in (7.15) are such that $h^* \asymp (n^2/\log L)^{1/(2(\nu_1+1))}$. Hence, for some absolute constant C, one has log(h*) ≤ C log n and
$$L^* = \min\left\{L,\ C\left(n^2/\log L\right)^{\frac{\nu_1}{(\nu_1+1)(2\nu_2+1)}}\right\}.$$
Therefore, (5.5) holds for r = r_0.

7.2 Proofs of the lower bounds for the risk

Proof of Theorem 2. The error consists of two parts, the clustering error and the nonparametric estimation error. We shall consider those terms separately.

The clustering error. Without loss of generality, assume that γm and γn are integers. Assume that the connectivity tensor G does not change with l, so G_{∗,∗,l} = V is an m × m symmetric matrix. Let V be block diagonal and such that the diagonal blocks are equal to zero and the non-diagonal blocks are equal to F and F^T, respectively, so that G_{i,j} = 0 if 1 ≤ i, j ≤ (1 − γ)m or (1 − γ)m + 1 ≤ i, j ≤ m, and G_{i,(1−γ)m+j} = F_{i,j} if i = 1, · · · , (1 − γ)m, j = (1 − γ)m + 1, · · · , m. Since the components of the vectors G_{k1,k2,∗} are constant for any k1, k2, due to condition (4.1), the set J has at most γ(1 − γ)m² < s nonzero elements. Consider a collection of binary vectors ω ∈ {0, 1}^{(1−γ)m}. By the Varshamov–Gilbert Lemma (see Tsybakov (2009)), there exists a subset Ξ of those vectors such that for any ω, ω′ ∈ Ξ one has ‖ω − ω′‖_H = ‖ω − ω′‖² ≥ (1 − γ)m/8 ≥ m/16 and |Ξ| ≥ exp((1 − γ)m/8). Assume, without loss of generality, that m is large enough, so that exp((1 − γ)m/8) ≥ γm; otherwise, choose a smaller value of γ (the inequality exp((1 − γ)m/8) ≥ γm is always valid for γ ≤ 1/9). Choose γm vectors ω in Ξ, enumerate them as ω^{(1)}, · · · , ω^{(γm)} and use them as the columns of the matrix F:
$$F_{*,j} = 0.5\left(\mathbf{1} + \rho\,\omega^{(j)}\right), \qquad j = 1, \cdots, \gamma m. \qquad (7.16)$$

Then, for any j, j′ = 1, · · · , γm, we obtain $\|F_{*,j} - F_{*,j'}\|^2 \ge \rho^2m/16$. Note that for every l and k one has
$$l\log\left(\frac{k}{l}\right) \le \log\binom{k}{l} \le l\log\left(\frac{ke}{l}\right), \qquad (7.17)$$
$$\log(k!) = k\log k - k + \frac{1}{2}\log(2\pi k) + o(1), \qquad (7.18)$$

where the o(1) term is smaller than 1. Therefore,
$$n\log m + n_0(L-1)\log\left(\frac{mn}{n_0}\right) \le \log|\mathcal{Z}(m, n, n_0, L)| \le n\log m + n_0(L-1)\log\left(\frac{mne}{n_0}\right). \qquad (7.19)$$

The term n log m in (7.19) is due to the initial clustering, while the term n0(L − 1) log(mn/n0) is due to temporal changes in the clusters' memberships. In what follows, we shall utilize clustering functions z^{(l)}: [n] → [m] corresponding to clustering matrices C^{(l)} such that z^{(l)}(j) = k iff at the moment t_l node j belongs to class Ω_k, k = 1, · · · , m.

Clustering error due to initial clustering. First consider the case when the initial clustering error dominates. If m = 2 or takes a small value, the proof is almost identical to Gao et al. (2015). Hence, we shall skip this part and consider the case when m is large enough, so that γm ≥ 2. Following Gao et al. (2015), we consider clustering matrices and clustering functions independent of l, so that z^{(l)} ≡ z. Consider a sub-collection of clustering matrices F(m, n, γ) ⊂ M(m, n) such that they cluster the first n(1 − γ) nodes into the first m(1 − γ) classes uniformly and sequentially, n/m nodes in each class. The remaining γn nodes are clustered into the remaining γm classes, n/m nodes into each class. Then, by Lemma 4, log|F(m, n, γ)| ≥ γn log(γm)/4. Now, apply Lemma 3 with r = γn/32. We derive that there exists a subset S(m, n, γ) of the set F(m, n, γ) such that, for any C, C′ ∈ S(m, n, γ), one has $2\,\#\{j : z(j)\neq z'(j)\} = \|C - C'\|_H \ge \gamma n/32$. Also, by (7.52),
$$\log|S(m, n, \gamma)| \ge \frac{\gamma n\log(\gamma m)}{4} - \frac{\gamma n\log(32m\gamma e)}{32} \ge \frac{\gamma n}{16}\log(\gamma m). \qquad (7.20)$$
Let Λ and Λ′ be the tensors of probabilities corresponding to, respectively, clustering matrices C, C′ ∈ S(m, n, γ) with related clustering functions z and z′. Then, by (7.17), due to the fact that the first n(1 − γ) nodes are clustered uniformly and sequentially, we obtain
$$\|\Lambda - \Lambda'\|^2 = 2L\sum_{i=1}^{n(1-\gamma)}\ \sum_{j=n(1-\gamma)+1}^{n}(F_{z(i),z(j)} - F_{z'(i),z'(j)})^2 = \frac{2Ln}{m}\sum_{k=1}^{m(1-\gamma)}\ \sum_{j=n(1-\gamma)+1}^{n}(F_{k,z(j)} - F_{k,z'(j)})^2 = \frac{2Ln}{m}\sum_{j=n(1-\gamma)+1}^{n}\|F_{*,z(j)} - F_{*,z'(j)}\|^2 \ge \frac{Ln\rho^2}{16}\,\#\{j : z(j)\neq z'(j)\},$$
so that
$$\|\Lambda - \Lambda'\|^2 \ge 2^{-9}Ln^2\rho^2\gamma. \qquad (7.21)$$

On the other hand, if ρ ≤ 1/4, then, by Proposition 4.2 of Gao et al. (2015), we obtain that the Kullback divergence is bounded above:
$$K(P_\Lambda, P_{\Lambda'}) \le 8\,\|\Lambda - \Lambda'\|^2 \le 16\rho^2n^2L\gamma. \qquad (7.22)$$

Set ρ² = C_ρ log(γm)/(nL) and apply Theorem 2.5 of Tsybakov (2009). Observe that, due to (7.20) and (7.22), the conditions of Theorem 2.5 are satisfied if C_ρ is a small enough absolute constant. Since Ln²ρ²γ = C_ρnγ log(γm) and log(γm) ≥ C(γ) log m for some constant C(γ) dependent on γ only, we derive
$$\inf_{\hat\Lambda}\ \sup_{\substack{G\in\mathcal{G}_{m,L,s}\\ C\in S(m,n,\gamma)}}P_\Lambda\left\{\frac{\|\hat\Lambda - \Lambda^*\|^2}{n^2L} \ge \frac{C_\gamma\log m}{nL}\right\} \ge 1/4. \qquad (7.23)$$

Clustering error due to changes in the memberships. Now, we consider the case when the clustering error due to the temporal changes in memberships dominates the error of the initial clustering. Use the same construction for G and F as before. Consider the following collection of clustering matrices $F = \prod_{l=1}^{L}F_l$, where F_l is defined as follows. When l is odd, F_l contains only one matrix that clusters nodes uniformly and sequentially, i.e., the first n/m nodes go to class Ω1, the second n/m nodes go to class Ω2, and the last n/m nodes go to class Ωm. If l is even, F_l = P(m, n, n0, γ), where P(m, n, n0, γ) is the set of clustering matrices that corresponds to a perturbation of the uniform consecutive clustering with at most n0 nodes moved to different classes in the manner described below. Let k0 be an integer such that
$$k_0 \le n_0/(\gamma m) < k_0 + 1. \qquad (7.24)$$
If k0 = 0, then n0 < γm and we choose n0 clusters out of the last γm clusters, choose one element in each of those clusters, and then put those n0 elements back in such a manner that every element goes to a different cluster and no element goes back to its own cluster. If k0 ≥ 1, we choose k0 elements in each of the last γm clusters and then put each of those k0-tuples back, one tuple per cluster, so that none of the tuples goes back to its own cluster. Then, log|F| = [L/2] log|P(m, n, n0, γ)|, where [L/2] ≥ (L − 1)/2 is the largest integer not exceeding L/2 and
$$\log|P(m, n, n_0, \gamma)| = \begin{cases}\log\binom{\gamma m}{n_0} + n_0\log(n/m) + \log[(n_0 - 1)!], & \text{if } k_0 = 0;\\[1ex] \gamma m\log\binom{n/m}{k_0} + \log[(\gamma m - 1)!], & \text{if } k_0 \ge 1.\end{cases}$$

If k0 = 0, so that n0 < γm, then, by (7.18), we obtain that log|P(m, n, n0, γ)| ≥ n0 log(γm/n0) + n0 log(n/m) = n0 log(γn/n0). If n0 ≥ γm and k0 ≥ 1, then, by (7.18), we obtain log|P(m, n, n0, γ)| ≥ γmk0 log(n/(mk0)). Since k0 + 1 ≤ 2k0, we obtain that k0 ≥ n0/(2mγ). Hence, for any k0 ≥ 0,
$$\log|P(m, n, n_0, \gamma)| \ge \frac{n_0}{2}\log(\gamma n/n_0). \qquad (7.25)$$

For every even value of l, apply Lemma 3 with r = n0/40, obtaining that there exists a subset S_l(m, n, n0, γ) of the set P(m, n, n0, γ) such that, for any $C^{(l)}, C'^{(l)} \in S_l(m, n, n_0, \gamma)$, one has
$$\|C^{(l)} - C'^{(l)}\|_H \ge n_0/40, \qquad l = 2k,\ k = 1, \cdots, [L/2]. \qquad (7.26)$$

By (7.25) and Lemma 3, for every even l, one has
$$\log|S_l(m, n, n_0, \gamma)| \ge \frac{n_0}{2}\log\left(\frac{\gamma n}{n_0}\right) - \frac{n_0}{40}\log\left(\frac{80nem\gamma^2}{n_0}\right) \ge \frac{n_0}{40}\log\left(\frac{nem}{n_0}\right)$$
if $\gamma n/n_0 \ge m^{1/9}(80e^2)^{1/18}$. For odd values of l, let S_l(m, n, n0, γ) contain just one clustering matrix corresponding to the uniform sequential clustering. Now, consider the set
$$S(m, n, n_0, \gamma, L) = \prod_{l=1}^{L}S_l(m, n, n_0, \gamma) \quad \text{with} \quad \log|S(m, n, n_0, \gamma, L)| \ge \frac{(L-1)n_0}{80}\log\left(\frac{nem}{n_0}\right). \qquad (7.27)$$

Let C = (C(1) , C(2) , · · · , C(L) ) and C′ = (C′(1) , C′(2) , · · · , C′(L) ) be two sets of clustering matrices with C(l) , C′(l) ∈ Sl (m, n, n0 , γ) and let z = (z1 , · · · , zL ) and z ′ = (z1′ , · · · , zL′ ) be the corresponding clustering functions. Let Λ and Λ′ be the tensors of probabilities corresponding to sets of clustering

16

matrices C, C′ ∈ S(m, n, n0, γ, L). Then, similarly to the previous case, using (7.17), we derive
$$\|\Lambda - \Lambda'\|^2 = 2\sum_{l=1}^{[L/2]}\ \sum_{i=1}^{n(1-\gamma)}\ \sum_{j=n(1-\gamma)+1}^{n}(F_{z_{2l}(i),z_{2l}(j)} - F_{z'_{2l}(i),z'_{2l}(j)})^2 = \frac{2n}{m}\sum_{l=1}^{[L/2]}\ \sum_{k=1}^{m(1-\gamma)}\ \sum_{j=n(1-\gamma)+1}^{n}(F_{k,z_{2l}(j)} - F_{k,z'_{2l}(j)})^2 = \frac{2n}{m}\sum_{l=1}^{[L/2]}\ \sum_{j=n(1-\gamma)+1}^{n}\|F_{*,z_{2l}(j)} - F_{*,z'_{2l}(j)}\|^2 \ge \frac{n\rho^2}{8}\sum_{l=1}^{[L/2]}\|C^{(2l)} - C'^{(2l)}\|_H,$$
so that, by (7.26),
$$\|\Lambda - \Lambda'\|^2 \ge (L-1)nn_0\rho^2/1280. \qquad (7.28)$$

Again, similarly to the previous case, if ρ ≤ 1/4, then, by Proposition 4.2 of Gao et al. (2015), we obtain that the Kullback divergence is bounded above:
$$K(P_\Lambda, P_{\Lambda'}) \le 8\,\|\Lambda - \Lambda'\|^2 \le 8Lnn_0\rho^2. \qquad (7.29)$$

Set ρ² = C_ρ n^{−1} log(nem/n0), where C_ρ is an absolute constant, and apply Theorem 2.5 of Tsybakov (2009). Observe that if C_ρ is small enough, then, due to (7.27) and (7.29), the conditions of this theorem are satisfied; hence,
$$\inf_{\hat\Lambda}\ \sup_{\substack{G\in\mathcal{G}_{m,L,s}\\ C\in S(m,n,n_0,\gamma,L)}}P_\Lambda\left\{\frac{\|\hat\Lambda - \Lambda^*\|^2}{n^2L} \ge C_\gamma\,\frac{n_0}{n^2}\log\left(\frac{nem}{n_0}\right)\right\} \ge 1/4. \qquad (7.30)$$

The nonparametric estimation error. Consider sequential uniform clustering with n/m nodes in each group and group memberships remaining the same for all l = 1, · · · , L. Let Q ∈ R^{M×L} be the matrix with columns q^{(l)}, l = 1, · · · , L, defined in Section 2.2. Denote V = QH^T ∈ R^{M×L} and recall that for G ∈ G_{m,L,s}, by (3.1) and (3.2), the matrix V should have at most s nonzero entries. Let k0 = min(s/2, M). Choose k0 rows among the M rows of the matrix V and denote this set by X. If k0 = M, set X = {1, · · · , M}. We have already distributed k0 nonzero entries and have s − k0 entries left. We distribute those entries into the k0 rows V_{k,∗} where k ∈ X. Let s0 = [(s − k0)/k0] = [s/k0] − 1 with
$$s/2 \le k_0s_0 < s, \qquad (7.31)$$
where [s/k0] is the largest integer no larger than s/k0. Consider a set of binary vectors ω ∈ {0, 1}^L with exactly s0 ones in each vector. By Lemma 4.10 of Massart (2007), there exists a subset T of those vectors such that for any ω, ω′ ∈ T one has
$$\|\omega - \omega'\|_H \ge s_0/2 \quad \text{and} \quad \log|T| \ge 0.233\,s_0\log(L/s_0).$$
Denote $\tilde T = \prod_{k\in X}T_k$, where T_k is a copy of the set T corresponding to row k of the matrix V. For ω^{(k)} ∈ T_k, set
$$V_{k,*} = (\sqrt{L}/2, 0, \cdots, 0) + \rho\,\frac{m}{n}\,\omega^{(k)} \ \text{ if } k\in X, \qquad V_{k,*} = 0 \ \text{ if } k\notin X. \qquad (7.32)$$
It is easy to see that the matrix V has at most s nonzero entries, as required.

Let V and V′ be matrices corresponding to sequences ω^{(k)} and ω′^{(k)} in T_k, k ∈ X. Let Λ and Λ′ be the tensors corresponding to V and V′. Then, due to the uniform clustering and (7.31), (7.32) implies that
$$\|\Lambda - \Lambda'\|^2 \ge \left(\frac{n}{m}\right)^2\|V - V'\|_F^2 = \left(\frac{n}{m}\right)^2\sum_{k\in X}\|V_{k,*} - V'_{k,*}\|^2 \ge \rho^2\,\frac{k_0s_0}{2} \ge \frac{\rho^2s}{4}; \qquad \|\Lambda - \Lambda'\|^2 \le 4\left(\frac{n}{m}\right)^2\left(\frac{m}{n}\right)^2\rho^2\,k_0s_0 \le 4\rho^2s.$$
Set ρ² = C_ρ log(L/s0). It is easy to check that, due to assumptions (4.1) and (4.2), one has Q_{ij} ∈ [1/4, 3/4] for any i and j. Hence, by Lemma 4.2 of Gao et al. (2015), we obtain K(P_Λ, P_{Λ′}) ≤ 8‖Λ − Λ′‖² ≤ 32ρ²s. If C_ρ is small enough, the conditions of Theorem 2.5 of Tsybakov (2009) hold. Finally, observe that if s < 2M, then k0 = s/2, s0 = 1 and L/s0 = L = Lm²/m² ≥ γLm²/s. If s ≥ 2M, then k0 = M, s0 ≤ s/M and L/s0 ≥ LM/s ≥ Lm²/(2s) ≥ γLm²/s. Since, for some constant C_γ, one has log(γLm²/s) ≥ C_γ log(Lm²/s), one has
$$\inf_{\hat\Lambda}\ \sup_{\substack{G\in\mathcal{G}_{m,L,s}\\ C\in S(m,n,n_0,\gamma,L)}}P_\Lambda\left\{\frac{\|\hat\Lambda - \Lambda^*\|^2}{n^2L} \ge C_\gamma\,\frac{s}{n^2L}\log\left(\frac{Lm^2}{s}\right)\right\} \ge 1/4. \qquad (7.33)$$

Finally, in order to obtain the lower bound in (4.3), observe that max(a, b, c) ≤ a + b + c ≤ 3 max(a, b, c) for any a, b, c ≥ 0 and combine (7.23), (7.30) and (7.33).

Proof of Theorem 4. We consider the cases r = r_{n,L} and r = r_0, corresponding to the piecewise constant and the piecewise smooth graphon, separately.

Piecewise constant graphon. Assume, without loss of generality, that h = n/r is an integer. Consider a setup where the nodes are grouped into r classes and the values of the ζ_j's are fixed:
$$\zeta_{kh+i} = \beta_k + i(\beta_{k+1} - \beta_k)/h, \qquad k = 0, \cdots, r - 1,\ i = 1, \cdots, h.$$

Then, there are h nodes in each class. Let G_{k,k,l} = 0 for any k = 1, · · · , r and l = 1, · · · , L. Consider an even L0 such that 1 ≤ L0 ≤ L/2 and a set of vectors ω ∈ {0, 1}^{L0} with exactly L1 = L0/2 nonzero entries. By Lemma 4.10 of Massart (2007), there exists a subset T of those vectors such that for any ω, ω′ ∈ T one has
$$\|\omega - \omega'\|_H \ge L_0/4, \qquad \log|T| \ge 0.233\,(L_0/2)\log 2 \ge 0.08\,L_0. \qquad (7.34)$$

Denote K = {(k1, k2) : 1 ≤ k1 < k2 ≤ r} and let T_{k1,k2} be the copies of T for (k1, k2) ∈ K. Denote $\tilde T = \prod_{(k_1,k_2)\in K}T_{k_1,k_2}$ and consider a set of functions $f^{(\tilde\omega)}$ with $\tilde\omega\in\tilde T$ such that, for β_{k1−1} < x ≤ β_{k1} and β_{k2−1} < y ≤ β_{k2}, one has
$$v_1^{(\tilde\omega)}(x, y) = \sqrt{L}/2; \qquad v_l^{(\tilde\omega)}(x, y) = \rho\,\omega_{l-L_0}^{(k_1,k_2)}, \quad l = L_0 + 1, \cdots, 2L_0,\ (k_1, k_2) \in K,$$
where $\tilde\omega$ is a binary matrix with elements $\omega_l^{(k_1,k_2)}$, l = 1, · · · , L0 and (k1, k2) ∈ K. Then, Assumption (5.1) holds. In order that (5.2) is satisfied, we set
$$\rho^2 \le C_1L_0^{-(2\nu_2+1)}, \qquad C_1 \le \min(K_2 2^{1-2\nu_2}, 1/8). \qquad (7.35)$$

C1 ≤ min(K2 21−2ν2 , 1/8). 18

(7.35)

˜ and ω˜′ in T˜ . Then, Denote by Λ and Λ′ the probability tensors corresponding, respectively, to ω due to (7.34) and symmetry, kΛ − Λ′ k2 ≥ 2ρ2

r X

r X

k1 =1 k2 =k1 +1

kω (k1 ,k2 ) − ω ′(k1 ,k2 ) kH

kΛ − Λ′ k2 ≤ ρ2 r(r − 1)(n/r)2 L0 ≤ ρ2 n2 L0 .

 n 2 r



ρ2 n2 L0 ; 8

Note that one has 1/4 ≤ f (x, y, t) ≤ 3/4 provided kHT ω (k1 ,k2) k∞ ≤ 1/4 for for (k1 , k2 ) ∈ K. By Assumption (4.1), the latter is guaranteed by ρ2 L20 /L ≤ 1/4, so that, due to L ≥ 2L0 and ν2 ≥ 1/2, it is ensured by (7.35). Then, by Proposition 4.2 of Gao et al. (2015), one has K(PΛ , PΛ′ ) ≤ 8 kΛ − Λ′ k2 ≤ 8ρ2 n2 L0 ≤ log |T˜ |/8 provided ρ2 ≤ C2 (r/n)2

(7.36)

where C2 is an absolute constant. Therefore, application of Theorem 2.5 of Tsybakov (2009) yields (5.6) with ∆(n, L) = C ρ2 L0 /L where C is an absolute constant. Now, we set C32 = (C2 /C1 )2−(2ν2 +1) and consider two cases. If n ≤ C3 rLν2 +1/2 , choose L0 −(2ν +1) such that ρ2 = C1 L0 2 = C2 (r/n)2 which holds when L0 = [(C1 /C2 )(n/r)2 ]1/(2ν2 +1) . It is easy to check that L0 ≥ 2 and that L0 ≤ L/2, so that  2ν2  C  r 2 2ν2 +1 . ∆(n, L) = L n If n > C3 rLν2+1/2 , choose ρ2 = C2 (r/n)2 and set L0 = L/2. Then, (7.35) holds and ∆(n, L) = C(r/n)2 which completes the proof of the lower bound when r = rn,L . Piecewise smooth graphon. Since r = r0 is a fixed quantity, without loss of generality, we set r = 1. Let ζj = j/n, j = 1, · · · , n, be fixed. Let h be a positive integer, 1 ≤ h ≤ n, and denote δ = 1/h. Consider a kernel function F (x) such that F (x) is n1 > ν1 times continuously differentiable and for any x, x′ ∈ R and some CF > 0 supp(F ) = (−1/2; 1/2),

|F (x) − F (x′ )| ≤ CF |x − x′ |ν1 .

(7.37)

It follows from (7.37) that |F (x)| ≤ CF for any x. Denote Ψk1 ,k2 (x, y) = h−ν1 F (h(x − uk1 ))F (h(y − uk2 ))

where uk = (k − 1)δ + δ/2, k = 1, · · · , h. (7.38)

It is easy to see that Ψk1 ,k2 (x, y) = Ψk2 ,k1 (y, x) and, for different pairs (k1 , k2 ), functions Ψk1 ,k2 (x, y) have disjoint supports. Similar to the case of the piecewise constant graphon, consider an even L0 ≤ L/2 and a set of vectors ω ∈ {0, 1}L0 with exactly L0 /2 nonzero entries. By Lemma 4.10 of ′ Massart (2007), there exists a subset T of those vectors such that (7.34) Q holds for any ω, ω ∈ T . ˜ Let again Tk1 ,k2 be the copies of T for (k1 , k2 ) ∈ K and denote T = (k1 ,k2 )∈K Tk1 ,k2 where K = ˜ ∈ T˜ and l = L0 + 1, · · · , 2L0 , {(k1 , k2 ) : 1 ≤ k1 < k2 ≤ h}. Then, log |T˜ | ≥ 0.04 L0 h2 . For any ω define v1 (x, y) =



L/2;

vl (x, y) = ρ

h X

h X

k1 =1 k2 =k1 +1

19

(k ,k )

1 2 ω l−L [Ψk1 ,k2 (x, y) + Ψk2 ,k1 (x, y)]. 0

(7.39)

It is easy to see that v(x, y) = v(y, x) for any x, y ∈ [0, 1]. Now we need to check that conditions (5.1) and (5.2) hold. Note that for any x, y, x′, y′, due to (7.37), we obtain
$$|\Psi_{k_1,k_2}(x, y) - \Psi_{k_1,k_2}(x', y')| \le h^{-\nu_1}\Big[|F(h(x - u_{k_1})) - F(h(x' - u_{k_1}))|\,|F(h(y - u_{k_2}))| + |F(h(y - u_{k_2})) - F(h(y' - u_{k_2}))|\,|F(h(x' - u_{k_1}))|\Big] \le C_\psi\left[|x - x'| + |y - y'|\right]^{\nu_1},$$
where the constant C_ψ depends only on C_F and ν1. Since the functions Ψ_{k1,k2}(x, y) have disjoint supports for different pairs (k1, k2), the latter implies that (5.1) holds if ρ ≤ K1/(4C_ψ). Also, it is easy to check that, by (7.37), one has [v_l(x, y)]² ≤ C_vρ²h^{−2ν1}, where C_v depends only on C_F and ν1. Therefore,
$$\sum_{l=L_0+1}^{2L_0}l^{2\nu_2}v_l^2(x, y) \le C_v^2\rho^2h^{-2\nu_1}L_0^{2\nu_2+1}.$$

Hence, both assumptions (5.1) and (5.2) are valid provided
$$\rho \le \min\left(K_1/(4C_\psi),\ (K_2/C_v)\,h^{\nu_1}L_0^{-(\nu_2+1/2)}\right). \qquad (7.40)$$

Now, note that, by (4.1) and (7.39), since the Ψ_{k1,k2}(x, y) have disjoint supports, we derive that f(x, y, t) ∈ [1/4, 3/4] provided
$$\rho \le \sqrt{L}/(4C_v^2L_0). \qquad (7.41)$$
Then, by Proposition 4.2 of Gao et al. (2015), one has
$$K(P_\Lambda, P_{\Lambda'}) \le 4\rho^2h^{-2\nu_1}L_0\left[\sum_{i=1}^{n}\sum_{k=1}^{h}F^2(h(i/n - u_k))\right]^2.$$

Here,
$$\sum_{i=1}^{n}\sum_{k=1}^{h}F^2(h(i/n - u_k)) = \sum_{k=1}^{h}\cdots \approx n\sum_{k=1}^{h}\int_{u_k-\delta/2}\cdots$$