Minimax Capacity Loss under Sub-Nyquist Universal Sampling

Yuxin Chen, Andrea J. Goldsmith, and Yonina C. Eldar∗

arXiv:1304.7751v3 [cs.IT] 19 Aug 2014

August 20, 2014

Abstract This paper considers the capacity of sub-sampled analog channels when the sampler is designed to operate independent of instantaneous channel realizations. A compound multiband Gaussian channel with unknown subband occupancy is considered, with perfect channel state information available at both the receiver and the transmitter. We restrict our attention to a general class of periodic sub-Nyquist samplers, which subsumes as special cases sampling with periodic modulation and filter banks. We evaluate the loss due to channel-independent (universal) sub-Nyquist design through a sampled capacity loss metric, that is, the gap between the undersampled channel capacity and the Nyquist-rate capacity. We investigate sampling methods that minimize the worst-case (minimax) capacity loss over all channel states. A fundamental lower bound on the minimax capacity loss is first developed, which depends only on the band sparsity ratio and the undersampling factor, modulo a residual term that vanishes at high signal-to-noise ratio. We then quantify the capacity loss under Landau-rate sampling with periodic modulation and low-pass filters, when the Fourier coefficients of the modulation waveforms are randomly generated and independent (resp. i.i.d. Gaussian-distributed), termed independent random sampling (resp. Gaussian sampling). Our results indicate that with exponentially high probability, independent random sampling and Gaussian sampling achieve minimax sampled capacity loss in the Landau-rate and super-Landau-rate regime, respectively. While identifying a deterministic minimax sampling scheme is in general intractable, our results highlight the power of randomized sampling methods, which are optimal in a universal design sense. Along the way, we derive concentration of several log-determinant functions, which might be of independent interest.

Index Terms: Channel capacity, sub-Nyquist sampling, universal sampling, non-asymptotic random matrix, minimaxity, log-determinant, concentration of measure, universality phenomenon, compressed sensing, discrete-time sparse vector channel

1 Introduction

The maximum rate of information that can be conveyed through an analog communication channel largely depends on the sampling technique and rate employed at the receiver end. In wideband communication systems, hardware and cost limitations often preclude sampling at or above the Nyquist rate, which presents a major bottleneck in transferring wideband and energy-efficient receiver design paradigms from theory to practice. Understanding the effects of sub-Nyquist sampling upon capacity is thus crucial in circumventing this bottleneck. In practice, receiver hardware and, in particular, sampling mechanisms are typically static and hence designed based on a family of possible channel realizations. During operation, the actual channel realization will vary over this class of channels. Since the sampler is typically integrated into the hardware and difficult to change during system operation, it needs to be designed independent of instantaneous channel state information (CSI). This has no effect if the sampling rate employed is commensurate with the maximum bandwidth (or the Nyquist rate) of the channel family. However, in the sub-Nyquist sampling rate regime, the sampler design significantly impacts the information rate achievable over different channel realizations. As was shown in [1], the capacity-maximizing sub-Nyquist sampling mechanism for a given linear time-invariant (LTI) channel depends on the specific channel realization. In time-varying channels, a sampled capacity loss relative to the Nyquist-rate capacity is necessarily incurred by channel-independent (universal) sub-Nyquist sampler design. Moreover, it turns out that the capacity-optimizing sampler for a given channel structure might result in very low data rates for other channel realizations. In this paper, our goal is to explore the universal design of a sub-Nyquist sampling system that is robust against the uncertainty and variation of instantaneous channel realizations, based on sampled capacity loss as a metric. In particular, we investigate the fundamental lower limit of sampled capacity loss in an overall sense (detailed as minimax capacity loss in Section 2.3), and design a sub-Nyquist sampling system for which the capacity loss can be uniformly controlled and optimized over all possible channel realizations.

∗ Y. Chen is with the Department of Electrical Engineering and the Department of Statistics, Stanford University, Stanford, CA 94305, USA (email: [email protected]). A. J. Goldsmith is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA (email: [email protected]). Y. C. Eldar is with the Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa 32000, Israel (email: [email protected]). The contact author is Y. Chen. This work was supported in part by the NSF under grants CCF-0939370 and CIS-1320628, the AFOSR under MURI Grant FA9550-09-1-0643, and BSF Transformative Science Grant 2010505. It has been presented in part at the IEEE International Symposium on Information Theory (ISIT) 2013. Manuscript date: August 20, 2014.

1.1 Related Work

In various scenarios, sampling at or above the Nyquist rate is not necessary for preserving signal information if certain signal structures are appropriately exploited [2,3]. Take, for example, multiband signals, which reside within several subbands over a wide spectrum. If the spectral support is known, then the sampling rate necessary for perfect signal reconstruction is the spectral occupancy, termed the Landau rate [4]. Such signals admit perfect recovery when sampled at rates approaching the Landau rate, assuming appropriately chosen sampling sets (e.g. [5,6]). Inspired by recent "compressive sensing" [7–9] ideas, spectrum-blind sub-Nyquist samplers have also been developed for multiband signals [10], pulse streams [11,12], etc. These sampling-theoretic works, however, did not use capacity as a metric in the sampler design. On the other hand, the Shannon-Nyquist sampling theorem has frequently been used to investigate analog waveform channels (e.g. [13–17]). One key paradigm for determining or bounding the channel capacity is to convert the continuous-time channel into a set of parallel discrete-time channels, under the premise that sampling, when performed at or above the Nyquist rate, preserves information. In addition, the effects of oversampling upon capacity have been investigated in the presence of quantization [18,19]. However, none of these works considered the effect of reduced-rate sampling upon capacity. Another recent line of work [20] investigated the tradeoff between sparse coding and subsampling in AWGN channels, but did not consider capacity-achieving input distributions. Our recent work [1,21] established a new framework for investigating the capacity of linear time-invariant (LTI) Gaussian channels under a broad class of sub-Nyquist sampling strategies, including filter-bank and modulation-bank sampling [10,22] and, more generally, time-preserving sampling.
We showed that periodic sampling or, more simply, sampling with a filter bank, is sufficient to approach the maximum capacity among all sampling structures under a given sampling rate constraint, assuming that perfect CSI is available at both the receiver and the transmitter. Practical communication systems often involve time-varying channels, e.g. wireless slow fading channels [16,23]. Many of these channels can be modeled as a channel with state (see a detailed survey in [23, Chapter 7]), where the channel variation is captured by a state that may be fixed over a long transmission block, or, more simply, as a compound channel [24] whereby the channel realization lies within a collection of possible channels [25]. One class of compound channel models concerns multiband Gaussian channels, whereby the instantaneous frequency support active for transmission resides within several continuous intervals spread over a wide spectrum. This model naturally arises in several wideband communication systems, including time division multiple access systems and cognitive radio networks, as will be discussed in Section 2.1. However, to the best of our knowledge, no prior work has investigated, from a capacity perspective, a universal (channel-independent) sub-Nyquist sampling paradigm that is robust to channel variations in the above channel models. Finally, we note that the design of optimal sampling/sensing matrices has recently been investigated in discrete-time settings from an information-theoretic perspective. In particular, Donoho et al. [26] assert that random and band-diagonal sampling systems admit perfect signal recovery from an information-theoretically minimal number of samples. However, the optimality there was not defined with channel capacity as a metric, but instead with respect to a fundamental rate-distortion limit [27].


1.2 Main Contributions

In this paper, we consider a compound multiband channel, whereby the channel bandwidth W is divided into n continuous subbands and, at each timeframe, only k subbands are active for transmission. We consider the class of periodic sampling systems (i.e. systems consisting of a periodic preprocessor and a recurrent sampling set, detailed in Section 2.2) with period n/W and sampling rate fs = mW/n for some integer m (m ≤ n). Under this model, we define β := k/n as the band sparsity ratio, and α := m/n as the undersampling factor. The sampling mechanism is termed a Landau-rate (resp. super-Landau-rate) sampling system if fs is equal to (resp. greater than) the spectral size of the instantaneous channel support. Our contributions are as follows.

1. We derive, in Theorem 4, a fundamental lower bound on the largest sampled capacity loss (defined in Section 2) incurred by any channel-independent sampler, under both Landau-rate and super-Landau-rate sampling. This lower bound depends only on the band sparsity ratio and the undersampling factor, modulo a small residual term that vanishes as the SNR and n increase. The bound is derived by observing that at each frequency within [0, W/n], the exponential sum of the capacity loss over all states s is independent of the sampling system, except for a relatively small residual term that vanishes with SNR.

2. Theorem 5 characterizes the sampled capacity loss under a class of periodic sampling with periodic modulation (of period n/W) and low-pass filters with passband [0, W/n], when the Fourier coefficients of the modulation waveforms are randomly generated and independent (termed independent random sampling). We demonstrate that with exponentially high probability, the sampled capacity loss matches the fundamental lower bound of Theorem 4 uniformly across all channel realizations.
This implies that independent random sampling achieves the minimum worst-case (or minimax) capacity loss among all periodic sampling methods with period n/W. More concretely, an independent random sampling system achieves the minimax sampled capacity loss as long as the Fourier coefficients of the modulation waveforms are independent sub-Gaussian [28] random variables with matching moments up to the second order. This universality phenomenon occurs due to sharp concentration of spectral measures of large random matrices [29].

3. For a large portion of the super-Landau-rate regime, we quantify the sampled capacity loss under independent random sampling when the Fourier coefficients of the modulation waveforms are i.i.d. Gaussian-distributed (termed Gaussian random sampling), as stated in Theorem 6. With exponentially high probability, Gaussian random sampling achieves the minimax capacity loss among all periodic sampling systems with period n/W.

4. Similar results for a discrete-time sparse vector channel with unknown channel support follow as an immediate consequence of our analysis framework. When the number of measurements is equal to the channel support size, independent random sensing matrices are minimax in terms of channel-blind sampler design. When the sample size exceeds the channel support size, Gaussian sensing matrices achieve the minimax sampled capacity loss.

5. In order to establish the optimality of random sampling, we derive sharp concentration of several log-determinant functions for i.i.d. random matrix ensembles, which might be of independent interest for other works involving log-determinant metrics.

1.3 Organization

The remainder of this paper is organized as follows. In Section 2, we introduce our system model of compound multiband channels. A metric called sampled capacity loss and a minimax sampler are defined with respect to the sampled channel capacity. We then determine in Section 3 the minimax capacity loss that is achievable within the class of periodic sampling systems. Specifically, we develop a lower bound on the minimax capacity loss in Section 3.1. Its achievability under Landau-rate and super-Landau-rate sampling is treated in Section 3.2 and Section 3.3, respectively. Along the way, we derive concentration of several log-determinant functions in Section 4.4. Section 5.1 summarizes the key observations and implications of our results. We present in Section 6 extensions to discrete-time sparse vector channels, and discuss connections with the compressed sensing literature. Section 7 closes the paper with a short summary of our findings and potential future directions.

Table 1: Summary of Notation and Parameters

H(x) — binary entropy function, i.e. H(x) = −x log x − (1 − x) log(1 − x)
h(t), H(f) — impulse response and frequency response of the LTI analog channel
Sη(f) — power spectral density of the noise η(t)
fs, Ts — aggregate sampling rate and the corresponding sampling interval (Ts = 1/fs)
W, W0 — channel bandwidth, size of instantaneous channel support
n, m, k — number of subbands, number of sampling branches, number of simultaneously active subbands
α = fs/W = m/n, β = k/n — undersampling factor, sparsity ratio
Q, Q^w = (QQ∗)^{−1/2} Q — sampling matrix, whitened sampling matrix
L_s^Q — capacity loss associated with a sampling matrix Q given state s
A_{i∗}, A_{∗i} — ith row of A, ith column of A
card(A) — cardinality of a set A
[n] — [n] := {1, 2, · · · , n}
(A choose k), ([n] choose k) — set of all k-element subsets of A, set of all k-element subsets of [n]
W_p(n, Σ) — p × p-dimensional central Wishart distribution with n degrees of freedom and covariance matrix Σ
Z, R — set of integers, set of real numbers

1.4 Notation

We define the following two functions: log_ε(x) := log(max(ε, x)), and det_ε X := ∏_i max(λ_i(X), ε). Denote by H(β) := −β log β − (1 − β) log(1 − β) the binary entropy function, and by H({x1, · · · , xn}) := −∑_{i=1}^n x_i log x_i the more general entropy function. The standard notation f(n) = O(g(n)) means there exists a constant c (not necessarily positive) such that |f(n)| ≤ c g(n), f(n) = Θ(g(n)) means there exist constants c1 and c2 such that c1 g(n) ≤ f(n) ≤ c2 g(n), f(n) = ω(g(n)) means that lim_{n→∞} g(n)/f(n) = 0, and f(n) = o(g(n)) indicates that lim_{n→∞} f(n)/g(n) = 0. For a matrix A, we use A_{i∗} and A_{∗i} to denote the ith row and ith column of A, respectively. We let [n] denote the set {1, 2, · · · , n}. For any set A ⊆ [n], we denote by (A choose k) the set of all k-combinations of A. In particular, we write ([n] choose k) for the set of all k-element subsets of {1, 2, · · · , n}. We also use card(A) to denote the cardinality of a set A. Let W be a p × p random matrix that can be expressed as W = ∑_{i=1}^n Z_i Z_i^⊤, where the Z_i ∼ N(0, Σ) are jointly independent Gaussian vectors. Then W is said to have a central Wishart distribution with n degrees of freedom and covariance matrix Σ, denoted by W ∼ W_p(n, Σ). Our notation is summarized in Table 1.
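The clipped logarithm, clipped determinant, and binary entropy functions above are straightforward to implement. The following sketch mirrors those definitions, with a hypothetical floor ε and natural logarithms:

```python
import numpy as np

EPS = 1e-6  # a hypothetical floor ε for the clipped log and determinant

def log_eps(x, eps=EPS):
    """log_ε(x) := log(max(ε, x))."""
    return np.log(np.maximum(eps, x))

def det_eps(X, eps=EPS):
    """det_ε(X) := ∏_i max(λ_i(X), ε), for a symmetric matrix X."""
    return float(np.prod(np.maximum(np.linalg.eigvalsh(X), eps)))

def H(x):
    """Binary entropy H(x) = −x log x − (1 − x) log(1 − x), natural log."""
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

print(np.isclose(H(0.5), np.log(2)))  # → True
```

The clipping makes both functions well defined even when an eigenvalue is zero or negative, which is exactly why the paper introduces them.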

2 Problem Formulation and Preliminaries

2.1 Compound Multiband Channel

We consider a compound multiband Gaussian channel. The channel has a total bandwidth W, and is divided into n continuous subbands each of bandwidth W/n. A state s ∈ ([n] choose k) is generated, which dictates the channel support and realization.¹ Specifically, given a state s, the channel is an LTI filter with impulse response h_s(t) and frequency response H_s(f) = ∫_{−∞}^{∞} h_s(t) exp(−j2πft) dt. It is assumed throughout that there exists a general function H(f, s) such that for every f and s, H_s(f) can be expressed as

H_s(f) = H(f, s) 1_s(f),   where 1_s(f) = 1 if f lies within the subbands indexed by s, and 1_s(f) = 0 else.

A transmit signal x(t) with a power constraint P is passed through this multiband channel, which yields the channel output

r_s(t) = h_s(t) ∗ x(t) + η(t),   (1)

where η(t) is stationary zero-mean Gaussian noise with power spectral density Sη(f). We assume that perfect CSI is available at both the transmitter and the receiver. The above model subsumes as special cases the following communication scenarios.

• Time Division Multiple Access Model. In this setting the channel is shared by a set of different users. At each timeframe, one of the users is selected for transmission. The receiver (e.g. the base station) allocates a subset of subbands to the designated sender over that timeframe.

• White-Space Cognitive Radio Network. In a white-space cognitive radio network, cognitive users exploit spectrum holes unoccupied by primary users and utilize them for communications. Since the locations of the spectrum holes change over time, the spectral subbands available to cognitive users vary over time.

¹ Note that in practice, n is typically a large number. For instance, the number of subcarriers ranges from 128 to 2048 in LTE [30, 31].
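As a concrete illustration of the channel model, the sketch below builds H_s(f) = H(f, s)1_s(f) on a toy discretization; the parameters (n = 16 subbands, W = 1, a flat H(f, s) ≡ 1, and the particular state s) are hypothetical:

```python
import numpy as np

# Hypothetical discretization: n subbands of width W/n, with only the
# subbands indexed by the state s active.
n, W = 16, 1.0
s = {1, 4, 5, 11}          # an example state s ∈ ([n] choose k), with k = 4

def indicator(f, s, n=n, W=W):
    """1_s(f): 1 if f lies in a subband whose index belongs to s, else 0."""
    band = int(f // (W / n))           # index of the subband containing f
    return 1.0 if band in s else 0.0

def H_s(f, s, H=lambda f, s: 1.0):
    """H_s(f) = H(f, s) 1_s(f); H(f, s) ≡ 1 gives a flat toy channel."""
    return H(f, s) * indicator(f, s)

print(H_s(0.28, s), H_s(0.50, s))      # f = 0.28 lies in subband 4
```

Running this prints `1.0 0.0`: the first frequency falls inside an active subband, the second does not.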

2.2 Sampled Channel Capacity

We aim to design a sampler that works at rates below the Nyquist rate (i.e. the channel bandwidth W). In particular, we consider the class of periodic sampling systems, which subsumes the most widely used sampling mechanisms in practice.

2.2.1 Periodic Sampling

The class of periodic sampling systems is defined in [1, Section IV], which we restate as follows.

Definition 1 (Periodic Sampling). Consider a sampling system consisting of a preprocessor with impulse response q(t, τ) followed by a sampling set Λ = {t_k | k ∈ Z}. A linear sampling system is said to be periodic with period Tq and sampling rate fs (fs Tq ∈ Z) if for every t, τ ∈ R and every k ∈ Z, we have

q(t, τ) = q(t + Tq, τ + Tq);   t_{k + fs Tq} = t_k + Tq.   (2)

Consider a periodic sampling system P with period Tq = n/W and sampling rate fs := mW/n for some integer m. A special case consists of sampling with a combination of filter banks and periodic modulation with period n/W, as illustrated in Fig. 1(a). Specifically, the sampling system comprises m branches, where at each branch the channel output is passed through a pre-modulation filter, modulated by a periodic waveform of period Tq, and then filtered with a post-modulation filter followed by uniform sampling at rate fs/m. The capacity of a sub-sampled LTI Gaussian channel has been derived in [1, Theorem 5]. As will be shown later, in the high-SNR regime, employing water-filling power allocation harvests only a marginal capacity gain relative to equal power allocation. For this reason and for mathematical convenience, we only restate below the sampled channel capacity under uniform power allocation, which suffices for us to bound the fundamental minimax limit as well as the convergence rate with SNR. Specifically, denote by s_i and f_i the ith smallest element in s and the lowest frequency of the ith subband, respectively, and define H̄_s(f) as the k × k diagonal matrix obeying

(H̄_s(f))_{ii} = |H_s(f_{s_i} + f)| / √(Sη(f_{s_i} + f)).

Then the sampled channel capacity, when specialized to our setting, is given as follows.


Figure 1: (a) Sampling with modulation and filter banks. The channel output r(t) is passed through m branches, each consisting of a pre-modulation filter, a periodic modulator, and a post-modulation filter followed by a uniform sampler with sampling rate fs/m. (b) Sampling with a modulation bank and low-pass filters. The channel output r(t) is passed through m branches, each consisting of a modulator with modulation waveform q_i(t) and a low-pass filter with passband [0, fs/m] followed by a uniform sampler with sampling rate fs/m.

Theorem 1 (Sampled Capacity with Equal Power Allocation [1]). Consider a channel with total bandwidth W and instantaneous band sparsity ratio β := k/n. Assume perfect CSI at both the transmitter and the receiver, and equal power allocation over the active subbands. If a periodic sampler P with period n/W and sampling rate fs = (m/n)W is employed, then the sampled channel capacity at a given state s is given by

C_s^Q = (1/2) ∫_0^{W/n} log det ( I_m + (P/(βW)) Q_s^w(f) H̄_s^2(f) Q_s^{w∗}(f) ) df,   (3)

where Q^w(f) := (Q(f) Q^∗(f))^{−1/2} Q(f). Here, Q(f) is an m × n matrix that depends only on the sampling system P, and Q_s(f) denotes the submatrix consisting of the columns of Q(f) at the indices in s. In general, Q(f) varies with f. Unless otherwise specified, we call Q(·) the sampling coefficient function and Q^w(·) the whitened sampling coefficient function with respect to the sampling system P. Note that Q^w(f) Q^{w∗}(f) = I.

2.2.2 Flat Sampling Coefficient Function

A special class of periodic sampling systems concerns those whose Q(·) is flat over [0, fs/m], in which case we can use an m × n matrix Q to represent the sampling coefficient function, termed a sampling coefficient matrix. This class of sampling systems can be realized through the m-branch sampling system illustrated in Fig. 1(b). In the ith branch, the channel output is modulated by a periodic waveform q_i(t) of period n/W, passed through a low-pass filter with passband [0, fs/m], and then uniformly sampled at rate fs/m, where the Fourier transform of q_i(t) obeys F(q_i(t)) = ∑_{l=1}^{n} Q_{i,l} δ(f − lW/n). In this paper, a sampling system within this class is said to be (independent) random sampling if the entries of Q are randomly generated (and are independent). In addition, a sampling system is termed Gaussian sampling if the entries of Q are i.i.d. Gaussian distributed. It turns out that this simple class of sampling structures is sufficient to achieve overall robustness in terms of sampled capacity loss, provided that the entries of Q are sub-Gaussian with zero mean and unit variance, as will be detailed in Section 3.
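The whitening step Q^w = (QQ∗)^{−1/2} Q behind Theorem 1 can be sketched numerically. The snippet below (hypothetical dimensions, real-valued matrices for simplicity) generates two sub-Gaussian sampling coefficient matrices with zero mean and unit variance — Gaussian and Rademacher — and checks that the whitened matrix satisfies Q^w Q^{w∗} = I_m:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 12   # hypothetical numbers of branches and subbands

# Two sub-Gaussian ensembles with zero mean and unit variance.
Q_gauss = rng.standard_normal((m, n))
Q_rad = rng.choice([-1.0, 1.0], size=(m, n))

def whiten(Q):
    """Return Q^w = (Q Q*)^{-1/2} Q via an SVD: if Q = U S V^T, then
    (QQ^T)^{-1/2} Q = U V^T, i.e. the rows become orthonormal."""
    U, _, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ Vt

for Q in (Q_gauss, Q_rad):
    Qw = whiten(Q)
    assert np.allclose(Qw @ Qw.T, np.eye(m))   # Q^w Q^{w*} = I_m
print("whitened sampling matrices are row-orthonormal")
```

The SVD route avoids forming (QQ∗)^{−1/2} explicitly and is valid whenever Q has full row rank, which holds almost surely for these random ensembles.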

2.3 Universal Sampling

As was shown in [1], the optimal sampling mechanism for a given LTI channel with perfect CSI extracts a frequency set with the highest SNR and hence suppresses aliasing. Such an alias-suppressing sampler may achieve very low capacity for some channel realizations. In this paper, we desire a sampler that operates independent of the instantaneous CSI, and our objective is to design a single linear sampling system that achieves to within a minimal gap of the Nyquist-rate capacity across all possible channel realizations. The availability of CSI to the transmitter, the receiver, and the sampler is illustrated in Fig. 2.

Figure 2: At each timeframe, a state is generated from a finite set S, which dictates the channel realization Hs (f ). Both the transmitter and the receiver have perfect CSI, while the sampler operates independently of s.

2.3.1 Sampled Capacity Loss

Universal sub-Nyquist samplers suffer an information rate loss relative to the Nyquist-rate capacity. In this subsection, we formally define this metric. For any state s, when equal power allocation is performed over the active subbands, the Nyquist-rate capacity can be written as

C_s^{eq} = (1/2) ∫_0^{W/n} log det ( I_k + (P/(βW)) H̄_s^2(f) ) df,   (4)

which is a special case of (3). In contrast, if power control at the transmitter side is allowed, then the Nyquist-rate capacity is given by

C_s^{opt} = (1/2) ∫_0^{W/n} ∑_{i=1}^{k} log^+ ( ν (H̄_s(f))_{ii}^2 ) df,   (5)

where ν is determined by the equation

P = ∫_0^{W/n} ∑_{i=1}^{k} ( ν − 1/(H̄_s(f))_{ii}^2 )^+ df.   (6)

We can then formally define the sampled capacity loss as follows.

Definition 2 (Sampled Capacity Loss). For any sampling coefficient function Q(·) and any given state s, we define the sampled capacity loss without power control as

L_s^Q := C_s^{eq} − C_s^Q,

and the sampled capacity loss with optimal power control as

L_s^{Q,opt} := C_s^{opt} − C_s^Q.

These metrics quantify the capacity gaps relative to the Nyquist-rate capacity due to universal (channel-independent) sub-Nyquist sampling design. When sampling is performed at or above the Landau rate (which is equal to kW/n in our case) but below the Nyquist rate, these gaps capture the rate loss of channel-independent sampling relative to channel-optimized design, either with or without power control. For notational convenience, for an m × n matrix M, we denote by L_s^M and L_s^{M,opt} the capacity losses with respect to the sampling coefficient function Q(f) ≡ M, which is flat across [0, W/n].
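To make these definitions concrete, the sketch below evaluates C_s^{eq} (eq. (4)), C_s^{opt} via a bisection on the water-filling level ν (eqs. (5)–(6)), and C_s^Q (eq. (3)) for a toy flat channel with Sη ≡ 1 and a Gaussian sampling coefficient matrix; all parameter values are hypothetical. For a flat channel, water-filling coincides with equal power allocation, so C_s^{opt} = C_s^{eq}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: flat channel H(f, s) ≡ 1, unit noise PSD,
# one spectral period [0, W/n), Landau-rate sampling (m = k).
n, k, m = 32, 8, 8
W, P = 1.0, 100.0
beta = k / n
s = rng.choice(n, size=k, replace=False)     # an example state s
Hs = np.ones(k)                              # diagonal of H̄_s(f), constant in f
snr = P / (beta * W)

# Equal-power Nyquist-rate capacity, eq. (4); the integrand is constant in f
# for this toy channel, so the integral equals (W/n) times its value.
C_eq = (W / n) * 0.5 * np.sum(np.log1p(snr * Hs**2))

# Water-filling level ν solving eq. (6) by bisection, then C_s^opt, eq. (5).
def total_power(nu):
    return (W / n) * np.sum(np.maximum(nu - 1.0 / Hs**2, 0.0))

lo, hi = 0.0, 1.0
while total_power(hi) < P:                   # grow bracket until it covers P
    hi *= 2.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if total_power(mid) < P else (lo, mid)
nu = (lo + hi) / 2
C_opt = (W / n) * 0.5 * np.sum(np.maximum(np.log(nu * Hs**2), 0.0))

# Sampled capacity C_s^Q, eq. (3), with a Gaussian sampling matrix Q.
Q = rng.standard_normal((m, n))
U, _, Vt = np.linalg.svd(Q, full_matrices=False)
Qw = U @ Vt                                  # Q^w = (QQ*)^{-1/2} Q
Qs = Qw[:, s]                                # columns of Q^w indexed by s
C_Q = (W / n) * 0.5 * np.linalg.slogdet(
    np.eye(m) + snr * Qs @ np.diag(Hs**2) @ Qs.T)[1]

L_eq = C_eq - C_Q                            # loss without power control
L_opt = C_opt - C_Q                          # loss with optimal power control
print(f"C_eq={C_eq:.4f}  C_opt={C_opt:.4f}  L={L_eq:.4f}")
```

Since the rows of Q^w are orthonormal, the eigenvalues of Q_s Q_s^∗ lie in [0, 1], so C_s^Q ≤ C_s^{eq} and the loss is nonnegative for this flat channel.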

Figure 3: The minimax sampler vs. the sampler that maximizes the worst-case capacity, when sampling is channel-independent and performed below the Nyquist rate. The blue solid line represents the Nyquist-rate (analog) capacity, the black dotted line represents the capacity achieved by the minimax sampler, the orange dashed line illustrates the Nyquist-rate capacity minus the minimax capacity loss, and the purple dashed line corresponds to the maximum worst-case capacity.

2.3.2 Minimax Sampler

Frequently used in the theory of statistics (e.g. [32]), minimaxity is a criterion that seeks to minimize the loss function in some overall sense, defined as follows.

Definition 3 (Minimax Sampler). A sampling system associated with a sampling coefficient function Q^m that minimizes the worst-case capacity loss, that is, which satisfies

max_{s ∈ ([n] choose k)} L_s^{Q^m} = inf_{Q(·)} max_{s ∈ ([n] choose k)} L_s^Q,

is called a minimax sampler with respect to the state alphabet ([n] choose k).

The minimax criterion is of interest for designing a sampler robust to all possible channel states, that is, one achieving to within a minimal gap of the maximum capacity for all channel realizations. It aims to control the rate loss across all states in a uniform manner, as illustrated in Fig. 3. Note that the minimax sampler is in general different from the one that maximizes the lowest capacity among all states (the worst-case capacity). While the latter guarantees an optimal worst-case capacity that can be achieved regardless of which channel is realized, it may result in significant capacity loss in many states with large Nyquist-rate capacity, as illustrated in Fig. 3. In contrast, a desired minimax sampler controls the capacity loss for every single state s, and allows for robustness over all channel states with universal channel-independent sampling. It turns out that in compound multiband channels,

L_s^{Q^m} = max_{s̃ ∈ ([n] choose k)} L_{s̃}^{Q^m}   for all s,

except for some vanishingly small residual terms, as will be shown in the next section.

3 Minimax Sampled Capacity Loss

The minimax sampled capacity loss problem can be cast as minimizing max_{s∈S} L_s^Q over all sampling coefficient functions Q(f). In general, the problem is non-convex in Q(f), and hence it is difficult to identify the optimal sampling systems. Nevertheless, the minimax capacity loss can be quantified reasonably well at moderate-to-high SNR. It turns out that under both Landau-rate sampling and a large regime of super-Landau-rate sampling, the minimax capacity loss can be well approached by random sampling. Define the undersampling factor α := m/n, and recall that the band sparsity ratio is β := k/n. Our main results are summarized in the following theorem.

Theorem 2. Consider any sampling coefficient function Q(·) with undersampling factor α, and let the sparsity ratio be β. Define

SNR_min := (P/(βW)) inf_{0≤f≤W, s∈([n] choose k)} |H(f, s)|² / Sη(f),   (7)

SNR_max := (P/(βW)) sup_{0≤f≤W, s∈([n] choose k)} |H(f, s)|² / Sη(f),   (8)

and suppose that SNR_max ≥ 1.

(i) (Landau-rate sampling) If α = β (or k = m), then

inf_{Q} max_{s∈([n] choose k)} L_s^Q = (W/2) { H(β) + O( min{ √(SNR_max/n), (log n + log SNR_max)/n^{1/4} } ) + ∆_L },   (9)

where −2/√(SNR_min) ≤ ∆_L ≤ β/SNR_min.

(ii) (Super-Landau-rate sampling) Suppose that there is a small constant δ > 0 such that α − β ≥ δ and 1 − α − β ≥ δ. Then

inf_{Q} max_{s∈([n] choose k)} L_s^Q = (W/2) { H(β) − αH(β/α) + O( (log n)/n^{1/3} ) + ∆_SL },   (10)

where −2/√(SNR_min) ≤ ∆_SL ≤ β/SNR_min.

Remark 1. Note that H(·) denotes the binary entropy function. Its appearance is due to the fact that it tightly estimates the rate function of binomial coefficients.

Theorem 2 provides a tight characterization of the minimax sampled capacity loss relative to the Nyquist-rate capacity, under both Landau-rate sampling and super-Landau-rate sampling. Note that the Landau-rate regime in (i) is not a special case of the super-Landau-rate regime considered in (ii). For instance, if β > 1/2, then α + β > 1, which falls within a regime not accounted for by Theorem 2(ii). The capacity loss expressions (9) and (10) contain residual terms that vanish for large n and high SNR. In the regime where SNR_max/n ≪ 1 and SNR_min ≫ 1, these fundamental minimax limits are approximately equal to a constant modulo a vanishing residual term. Since the Nyquist-rate capacity scales as Θ(W log SNR), our results indicate that the ratio of the minimax capacity loss to the Nyquist-rate capacity vanishes at rate Θ(1/log SNR). The results remain valid at high SNR even if we allow power control at the transmitter side, as summarized in the following theorem.

Theorem 3. Consider the metric L_s^{Q,opt} with power control. Under all other conditions of Theorem 2, the bounds (9) and (10) (with L_s^Q replaced by L_s^{Q,opt}) continue to hold if ∆_L and ∆_SL are respectively replaced by some residuals ∆_L^{opt} and ∆_SL^{opt} that satisfy

−2/√(SNR_min) ≤ ∆_L^{opt}, ∆_SL^{opt} ≤ (β + A)/SNR_min,

where A is a constant defined as

A := min { ( max_{s∈([n] choose k)} ∫_0^W (|H(f, s)|²/Sη(f)) df ) / ( βW inf_{0≤f≤W, s∈([n] choose k)} |H(f, s)|²/Sη(f) ), ( sup_{0≤f≤W, s∈([n] choose k)} |H(f, s)|²/Sη(f) ) / ( inf_{0≤f≤W, s∈([n] choose k)} |H(f, s)|²/Sη(f) ) }.

Theorem 3 demonstrates that if the average-to-minimum SNR ratio SNR/SNR_min is bounded by a constant (where SNR := max_{s∈([n] choose k)} (P/(βW)) ∫_0^W (|H(f, s)|²/Sη(f)) df), then the minimax sampled capacity gap with power control remains almost the same as that without power control, to within a vanishingly small gap per unit bandwidth at high SNR and large n. Note that the constant A given in Theorem 3 is fairly conservative, and can be refined with finer tuning or algebraic techniques. Theorem 3 follows as an immediate consequence of Theorem 2 once we quantify the gap between C_s^{eq} and C_s^{opt}. In fact, the capacity benefit of power control in the high-SNR regime is no larger than O(SNR⁻¹) per unit bandwidth; see Appendix A for details. For this reason, our analysis is mainly devoted to L_s^Q, which corresponds to the capacity loss relative to the Nyquist-rate capacity with uniform power allocation. The proof of Theorem 2 involves two parts: a converse part that provides a fundamental lower bound on the minimax sampled capacity loss, and an achievability part that provides a sampling scheme approaching this bound. As we show, the class of sampling systems with periodic modulation followed by low-pass filters, as illustrated in Fig. 1(b), is sufficient to approach the minimax sampled capacity loss. Throughout the remainder of the paper, we suppose that the noise has unit power spectral density Sη(f) ≡ 1 unless otherwise specified. This incurs no loss of generality, since we can always include a noise-whitening LTI filter at the first stage of the sampling system.
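As a quick numerical illustration, the sketch below evaluates the leading-order term (W/2)[H(β) − αH(β/α)] of the minimax capacity loss in Theorem 2, which reduces to (W/2)H(β) at the Landau rate α = β; the O(·) and ∆ residuals are ignored, and the parameter values are hypothetical:

```python
import numpy as np

def H(x):
    """Binary entropy with natural logarithms."""
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

def minimax_loss(W, alpha, beta):
    """Leading-order minimax capacity loss (W/2)[H(β) − αH(β/α)] from
    Theorem 2; equals (W/2)H(β) when α = β (Landau-rate sampling) and
    vanishes when α = 1 (Nyquist-rate sampling)."""
    if alpha == beta:
        return W / 2 * H(beta)
    return W / 2 * (H(beta) - alpha * H(beta / alpha))

W, beta = 1.0, 0.25
for alpha in (0.25, 0.5, 0.75):
    print(f"alpha={alpha}: leading-order loss {minimax_loss(W, alpha, beta):.4f}")
```

The printed values decrease as α grows, matching the remark after Theorem 4 that, for a fixed β, the loss bound is decreasing in the undersampling factor.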

3.1 The Converse

We need to show that the minimax sampled capacity loss under any channel-independent sampler cannot fall below (9) and (10). This is established by the following theorem, which covers the entire regime, including the situation where α + β > 1.

Theorem 4. Consider any Riemann-integrable sampling coefficient function Q(·) with undersampling factor α := m/n. Suppose the sparsity ratio β := k/n satisfies β ≤ α ≤ 1. The minimax capacity loss can be lower bounded by
$$ \inf_{Q}\, \max_{s\in\binom{[n]}{k}} L_s^Q \;\ge\; \frac{W}{2}\left( H(\beta) - \alpha H\!\left(\frac{\beta}{\alpha}\right) \right) - \frac{2}{\sqrt{\mathrm{SNR}_{\min}}} - \frac{\log(n+1)}{n}. \tag{11} $$
For a given β, this bound is decreasing in α. Since the active channel bandwidth is smaller than the total bandwidth, the noise (even at high SNR) is scattered over the entire bandwidth. Thus, no universal sub-Nyquist sampling strategy is information preserving, and increasing the sampling rate always harvests a capacity gain.

3.2 Achievability with Landau-rate Sampling (α = β)

Consider the achievability part when the sampling rate equals the active frequency bandwidth (α = β). In general, it is very difficult to find a deterministic sampler that approaches the lower bound (11). One class of sampling methods amenable to analysis has sampling coefficient functions that are flat over [0, W/n] with randomly generated coefficients. It turns out that as n grows large, the capacity loss achieved by such random sampling approaches the lower bound (11) uniformly across all realizations. The results are stated in Theorem 5, after introducing a class of sub-Gaussian measures below.

Definition 4. A measure ν on R satisfies the logarithmic Sobolev inequality (LSI) with constant $c_{\mathrm{LS}}$ if, for any differentiable function g,
$$ \int g^2 \log \frac{g^2}{\int g^2 \,\mathrm{d}\nu}\,\mathrm{d}\nu \;\le\; 2 c_{\mathrm{LS}} \int |g'|^2 \,\mathrm{d}\nu. $$

Remark 2. A probability measure obeying the LSI possesses sub-Gaussian tails, and a large class of sub-Gaussian measures satisfies this inequality for some constant; see [29, 33] for examples. In particular, the standard Gaussian measure satisfies it with constant $c_{\mathrm{LS}} = 1$ (e.g. [34]).


Theorem 5. Let $M \in \mathbb{R}^{k\times n}$ be a random matrix whose entries $M_{ij}$ are jointly independent symmetric random variables with zero mean and unit variance. In addition, suppose that each $M_{ij}$ satisfies either of the following conditions: (a) $M_{ij}$ is bounded in magnitude by an absolute constant D; (b) the probability measure of $M_{ij}$ satisfies the LSI with a bounded constant $c_{\mathrm{LS}}$. If $\mathrm{SNR}_{\max} \ge 1$, then there exist absolute constants $c_0, c_1, C > 0$ such that
$$ \max_{s\in\binom{[n]}{k}} L_s^M \;\le\; \frac{W}{2}\left( H(\beta) + \frac{\beta}{\mathrm{SNR}_{\min}} + c_1 \min\left\{ \sqrt{\frac{\mathrm{SNR}_{\max}}{n}},\; \frac{\log n + \log \mathrm{SNR}_{\max}}{n^{1/4}} \right\} \right) \tag{12} $$
with probability exceeding $1 - C\exp(-c_0 n)$.

Theorem 5 demonstrates that independent random sampling achieves the minimax sampled capacity loss, which is $\frac{1}{2}H(\beta)$ per unit bandwidth modulo a vanishing residual term. In fact, our analysis shows that the sampled capacity loss approaches the minimax limit uniformly over all states s. Another interesting observation is a universality phenomenon: a broad class of sub-Gaussian ensembles, as long as the entries are jointly independent with matching first and second moments, suffices to generate minimax samplers.

3.3 Achievability with Super-Landau-Rate Sampling (α > β, α + β < 1)

So far we have considered sampling at a rate equal to the spectral support size. While the active transmission bandwidth is smaller than the total bandwidth, the noise (even at high SNR) is scattered over the entire bandwidth. Hence no sub-Nyquist sampling strategy preserves all information content conveyed through the noisy channel unless it knows the channel support, and one may hope that increasing the sampling rate improves the achievable information rate. The achievability result for super-Landau-rate sampling is stated in the following theorem.

Theorem 6. Let $M\in\mathbb{R}^{m\times n}$ be a Gaussian matrix whose entries $M_{ij}$ are independently drawn from $\mathcal{N}(0,1)$. Suppose that $1-\alpha-\beta \ge \varepsilon$ and $\alpha - \beta \ge \varepsilon$ for some small constant $\varepsilon > 0$. Then there exist universal constants $C, c > 0$ such that
$$ \max_{s\in\binom{[n]}{k}} L_s^M \;\le\; \frac{W}{2}\left( H(\beta) - \alpha H\!\left(\frac{\beta}{\alpha}\right) + \frac{\beta}{\mathrm{SNR}_{\min}} + \frac{c\log n}{n^{1/3}} \right) $$
with probability exceeding $1 - C\exp(-n)$.

Theorem 6 indicates that i.i.d. Gaussian sampling approaches the minimax capacity loss (which is about $\frac{1}{2}H(\beta) - \frac{1}{2}\alpha H\!\left(\frac{\beta}{\alpha}\right)$ per Hertz) to within a vanishingly small gap. As will be shown in our proof, with exponentially high probability the sampled capacity losses for all states are essentially equal and coincide with the fundamental minimax limit. In contrast to Theorem 5, we restrict our attention to Gaussian sampling, which suffices for the proof of Theorem 2.

4 Equivalent Algebraic Problems

Our main results in Section 3 are established by investigating three equivalent algebraic problems. Recall that $\frac{P}{\beta W} H_s^2 \succeq \mathrm{SNR}_{\min} I_k$. Define $Q_s^w := (QQ^*)^{-\frac{1}{2}} Q_s$. Simple manipulations yield
$$ L_s^Q = -\frac{1}{2}\int_0^{W/n} \log\det\left( I_m + \frac{P}{\beta W}\, Q_s^w(f)\, H_s^2(f)\, Q_s^{w*}(f) \right)\mathrm{d}f + \frac{1}{2}\int_0^{W/n} \log\det\left( I_k + \frac{P}{\beta W} H_s^2(f) \right)\mathrm{d}f $$
$$ = -\frac{1}{2}\int_0^{W/n} \log\det\left( I_k + \frac{P}{\beta W}\, H_s(f)\, Q_s^{w*}(f)\, Q_s^w(f)\, H_s(f) \right)\mathrm{d}f + \frac{1}{2}\int_0^{W/n} \log\det\left( \frac{P}{\beta W} H_s^2(f) \right)\mathrm{d}f + \Delta_s \tag{13} $$
$$ = -\frac{1}{2}\int_0^{W/n} \log\det\left( \frac{\beta W}{P} H_s^{-2}(f) + Q_s^{w*}(f)\, Q_s^w(f) \right)\mathrm{d}f + \Delta_s, \tag{14} $$

where $\Delta_s$ denotes a residual term. In particular, $\Delta_s$ can be bounded as
$$ 0 \le \Delta_s \le \frac{1}{\mathrm{SNR}_{\min}}. \tag{15} $$

This is an immediate consequence of the following observation: for any $k\times k$ positive semidefinite matrix A,
$$ 0 \;\le\; \frac{1}{k}\log\det(I_k + A) - \frac{1}{k}\log\det(A) \;=\; \frac{1}{k}\sum_{i=1}^{k} \log\left(1 + \frac{1}{\lambda_i(A)}\right) \;\le\; \frac{1}{\lambda_{\min}(A)}. \tag{16} $$
Recall that $\mathrm{SNR}_{\min} := \frac{P}{\beta W}\inf_{0\le f\le W} |H(f)|^2$ and $\mathrm{SNR}_{\max} := \frac{P}{\beta W}\sup_{0\le f\le W} |H(f)|^2$. Therefore, $\frac{\beta W}{P} H_s^{-2}$ can be bounded as
$$ \frac{1}{\mathrm{SNR}_{\max}} I_k \;\preceq\; \frac{\beta W}{P} H_s^{-2} \;\preceq\; \frac{1}{\mathrm{SNR}_{\min}} I_k. \tag{17} $$
This bound together with (14) makes $\det\left(\epsilon I_k + Q_s^{w*} Q_s^w\right)$ a quantity of interest (for some small $\epsilon$). In the sequel, we provide tight bounds on several algebraic problems, which in turn establish Theorems 4-6. The proofs of these results rely heavily on non-asymptotic (random) matrix theory. In particular, the achievability bounds are established based on concentration of log-determinant functions, which we also develop in this section.
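The sandwich bound (16) is easy to check numerically; the sketch below (our own illustration, not from the paper) draws a random positive definite A and verifies both inequalities:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50
G = rng.standard_normal((k, 3 * k))
A = G @ G.T / (3 * k)            # random positive definite k x k matrix

eig = np.linalg.eigvalsh(A)
gap = (np.linalg.slogdet(np.eye(k) + A)[1] - np.linalg.slogdet(A)[1]) / k
# 0 <= (1/k)[log det(I+A) - log det(A)] = (1/k) sum log(1 + 1/lambda_i) <= 1/lambda_min
print(0 <= gap <= 1 / eig.min())                     # → True
print(bool(np.isclose(gap, np.mean(np.log1p(1 / eig)))))   # → True
```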

4.1 The Converse

Note that $Q^w(f)$ has orthonormal rows. The following theorem investigates the properties of $\det\left(\epsilon I_k + B_s^* B_s\right)$ for any $m\times n$ matrix B with orthonormal rows. This, together with the Riemann integrability of Q(f), establishes Theorem 4.

Theorem 7. (1) Consider any $m\times n$ matrix B ($n \ge m \ge k$) satisfying $BB^* = I_m$, and denote by $B_s$ the $m\times k$ submatrix of B whose columns are indexed by s. Then for any $\epsilon > 0$, one has
$$ \sum_{s\in\binom{[n]}{k}} \det\left(\epsilon I_k + B_s^* B_s\right) = \sum_{l=0}^{k} \binom{m}{l}\binom{n-l}{k-l}\,\epsilon^{k-l} \tag{18} $$
$$ \le \left(1+\sqrt{\epsilon}\right)^m \binom{n+k}{k}. \tag{19} $$
(2) For any positive integer p, suppose that $B_1, \cdots, B_p$ are all $m\times n$ matrices with $B_i B_i^* = I_m$. Then
$$ \min_{s\in\binom{[n]}{k}}\; \frac{1}{np}\sum_{i=1}^{p} \log\det\left(\epsilon I_k + (B_i)_s^* (B_i)_s\right) \;\le\; \frac{1}{n}\log\binom{m}{k} - \frac{1}{n}\log\binom{n}{k} + 2\sqrt{\epsilon} \tag{20} $$
$$ \le\; \alpha H\!\left(\frac{\beta}{\alpha}\right) - H(\beta) + 2\sqrt{\epsilon} + \frac{\log(n+1)}{n}. \tag{21} $$

Note that $Q^w(f)$ has orthonormal rows for every f, and that $Q^w(f)$ is assumed Riemann integrable. Hence, for any δ > 0 we can find a sufficiently large p such that
$$ \int_0^{W/n} \log\det\left(\epsilon I_k + Q_s^{w*}(f)\, Q_s^w(f)\right)\mathrm{d}f \;\le\; \delta + \frac{W}{pn}\sum_{i=1}^{p} \log\det\left(\epsilon I_k + Q_s^{w*}\!\left(\tfrac{iW}{pn}\right) Q_s^w\!\left(\tfrac{iW}{pn}\right)\right). $$
Since δ can be arbitrarily small, applying Theorem 7 immediately yields that for any Q(·),
$$ \min_{s\in\binom{[n]}{k}} \int_0^{W/n} \log\det\left(\epsilon I_k + Q_s^{w*}(f)\, Q_s^w(f)\right)\mathrm{d}f \;\le\; W\left( \alpha H\!\left(\frac{\beta}{\alpha}\right) - H(\beta) + 2\sqrt{\epsilon} + \frac{\log(n+1)}{n} \right). $$
This together with (14), (15) and (17) leads to
$$ \inf_{Q}\, \max_{s\in\binom{[n]}{k}} L_s^Q \;\ge\; \frac{W}{2}\left( H(\beta) - \alpha H\!\left(\frac{\beta}{\alpha}\right) \right) - \frac{2}{\sqrt{\mathrm{SNR}_{\min}}} - \frac{\log(n+1)}{n}, $$
which completes the proof of Theorem 4.
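The invariance identity (18) can be verified numerically for small dimensions (our own sketch, not from the paper); it holds exactly for any matrix with orthonormal rows:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n, m, k, eps = 6, 4, 3, 0.3

# Random matrix with orthonormal rows: QR of a Gaussian matrix, transposed.
B = np.linalg.qr(rng.standard_normal((n, m)))[0].T   # m x n with B B^T = I_m

lhs = sum(
    np.linalg.det(eps * np.eye(k) + B[:, list(s)].T @ B[:, list(s)])
    for s in itertools.combinations(range(n), k)
)
rhs = sum(math.comb(m, l) * math.comb(n - l, k - l) * eps ** (k - l)
          for l in range(k + 1))
print(abs(lhs - rhs) < 1e-8)   # → True
```

The design insight is exactly the one stated in Section 5.1.1: the sum over all states is the same for every sampler, so a minimax sampler should spread the determinant evenly across states.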

4.2 Achievability (Landau-rate Sampling)

When it comes to achievability, the major step is to quantify $\det\left(\epsilon I_k + (MM^\top)^{-1} M_s M_s^\top\right)$ for every $s\in\binom{[n]}{k}$. Interestingly, this quantity can be uniformly controlled thanks to the concentration of the spectral measure of random matrices [29]. This is stated in the following theorem, which establishes achievability for Landau-rate sampling.

Theorem 8. Let $M\in\mathbb{R}^{k\times n}$ be a random matrix with independent entries, and let $0 < \epsilon < 1$ denote a constant. Under the conditions of Theorem 4, there exist absolute constants $c_0, c_1, C > 0$ such that
$$ -H(\beta) - c_1 \min\left\{ \frac{1}{\sqrt{\epsilon n}},\; \frac{\log n + \log\frac{1}{\epsilon}}{n^{1/4}} \right\} \;\le\; \min_{s\in\binom{[n]}{k}} \frac{1}{n}\log\det\left( \epsilon I_k + \left(MM^\top\right)^{-1} M_s M_s^\top \right) \tag{22} $$
$$ \le\; -H(\beta) + 2\sqrt{\epsilon} + \frac{\log(n+1)}{n} \tag{23} $$
with probability exceeding $1 - C\exp(-c_0 n)$.

Putting Theorem 8 and equations (14), (15) and (17) together establishes
$$ \max_{s\in\binom{[n]}{k}} L_s^M \;\le\; \frac{W}{2}\left( H(\beta) + O\left( \min\left\{ \sqrt{\frac{\mathrm{SNR}_{\max}}{n}},\; \frac{\log n + \log \mathrm{SNR}_{\max}}{n^{1/4}} \right\} \right) + \frac{\beta}{\mathrm{SNR}_{\min}} \right) $$
with exponentially high probability, as claimed in Theorem 5.
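A small Monte-Carlo sketch of the concentration in Theorem 8 (an illustration under Gaussian entries, not the paper's proof; the ε-regularization is omitted since M_s is square and almost surely invertible): for a Gaussian k × n matrix, (1/n) log det((MM^⊤)^{-1} M_s M_s^⊤) lands close to −H(β) for randomly drawn states s:

```python
import numpy as np
from math import log

rng = np.random.default_rng(1)
n, k = 400, 200
beta = k / n
H = -beta * log(beta) - (1 - beta) * log(1 - beta)   # binary entropy H(beta), nats

M = rng.standard_normal((k, n))
G = M @ M.T                                          # k x k, = M M^T
vals = []
for _ in range(5):
    s = rng.choice(n, size=k, replace=False)
    Ms = M[:, s]
    # (1/n) log det((M M^T)^{-1} M_s M_s^T), computed stably via two slogdets
    q = (np.linalg.slogdet(Ms @ Ms.T)[1] - np.linalg.slogdet(G)[1]) / n
    vals.append(q)
print(np.mean(vals), -H)   # the two numbers should be close
```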

4.3 Achievability (Super-Landau-rate Sampling)

Instead of studying a large class of sub-Gaussian random ensembles², the following theorem focuses on i.i.d. Gaussian matrices, which establishes the optimality of Gaussian random sampling in the super-Landau regime.

Theorem 9. Let $M\in\mathbb{R}^{m\times n}$ be an i.i.d. random matrix with $M_{ij}\sim\mathcal{N}(0,1)$. Suppose that $1-\alpha-\beta \ge \zeta$ and $\alpha-\beta\ge\zeta$ for some small constant $\zeta>0$. Then there exist absolute constants $c, C>0$ such that
$$ -H(\beta) + \alpha H\!\left(\frac{\beta}{\alpha}\right) - \frac{c\log n}{n^{1/3}} \;\le\; \min_{s\in\binom{[n]}{k}} \frac{1}{n}\log\det\left( \epsilon I_k + M_s^\top \left(MM^\top\right)^{-1} M_s \right) $$
$$ \le\; -H(\beta) + \alpha H\!\left(\frac{\beta}{\alpha}\right) + 2\sqrt{\epsilon} + \frac{\log(n+1)}{n} $$
with probability at least $1 - C\exp(-n)$.

Combining Theorem 9 with equations (14), (15) and (17) implies that
$$ \max_{s\in\binom{[n]}{k}} L_s^M \;\le\; \frac{W}{2}\left( H(\beta) - \alpha H\!\left(\frac{\beta}{\alpha}\right) + \frac{\beta}{\mathrm{SNR}_{\min}} + O\!\left(\frac{\log n}{n^{1/3}}\right) \right) $$
with exponentially high probability.
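The same experiment can be run in the super-Landau regime (again an illustrative sketch, not from the paper): for m > k, the quantity (1/n) log det(M_s^⊤ (MM^⊤)^{-1} M_s) should concentrate near αH(β/α) − H(β):

```python
import numpy as np
from math import log

def H(p):                                    # binary entropy in nats
    return -p * log(p) - (1 - p) * log(1 - p)

rng = np.random.default_rng(2)
n, m, k = 500, 200, 100                      # alpha = 0.4, beta = 0.2
alpha, beta = m / n, k / n

M = rng.standard_normal((m, n))
B = M @ M.T                                  # m x m, = M M^T
s = rng.choice(n, size=k, replace=False)
Ms = M[:, s]
T = Ms.T @ np.linalg.solve(B, Ms)            # k x k, = M_s^T (M M^T)^{-1} M_s
q = np.linalg.slogdet(T)[1] / n

limit = alpha * H(beta / alpha) - H(beta)
print(q, limit)                              # the two numbers should be close
```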

4.4 Concentration of Log-Determinant for Random Matrices

It has been shown above that the achievability bounds follow from measure concentration of certain log-determinant functions. The concentration of log-determinants has been studied in the random matrix literature as a key ingredient in establishing universality laws for spectral statistics (e.g. [35, Proposition 48]). However, those bounds only hold with overwhelming probability (i.e. with probability at least $1 - e^{-\omega(\log n)}$), which is not sharp enough for our purposes. We therefore provide sharper concentration results for log-determinants in this subsection.

² The proof argument for Landau-rate sampling cannot readily be carried over to the super-Landau regime, since $M_s$ is then a tall matrix and we cannot easily separate $M_s$ and $MM^*$.


Lemma 1. Suppose that $k/p \in (0,1]$ and $\epsilon > 0$. Consider a random matrix $A \in \mathbb{R}^{k\times p}$ whose entries $A_{ij}$ are symmetric and jointly independent with zero mean and unit variance.
(a) If the $A_{ij}$'s are bounded in magnitude by a constant D, then for any $\delta > \frac{8D\sqrt{\pi}}{\sqrt{k}}$:
$$ \mathbb{P}\left\{ \frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right) < \mathbb{E}\left[\frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right)\right] - \delta \right\} \le 4\exp\left(-\frac{k^2\delta^2}{16D^2}\right); \tag{24} $$
$$ \mathbb{P}\left\{ \frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right) > \mathbb{E}\left[\frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right)\right] + \delta \right\} \le 4\exp\left(-\frac{e\,k^2\delta^2}{32D^2}\right). \tag{25} $$
(b) If the $A_{ij}$'s satisfy the LSI with uniformly bounded constant $c_{\mathrm{LS}}$, then for any $\delta > 0$:
$$ \mathbb{P}\left\{ \left| \frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right) - \mathbb{E}\left[\frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right)\right] \right| > \delta \right\} \le 2\exp\left(-\frac{k^2\delta^2}{2c_{\mathrm{LS}}}\right). \tag{26} $$
Proof. See Appendix E.

The concentration results for $\frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right)$ are useful once we can quantify $\mathbb{E}\left[\frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right)\right]$, as established in the following lemma.

Lemma 2. Let $A\in\mathbb{R}^{k\times k}$ be a random matrix whose entries are jointly independent with zero mean and unit variance. For any $\epsilon\in\left(\frac{4}{k}, \frac{1}{e^2}\right)$, we have
$$ \mathbb{E}\left[\frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right)\right] \le \frac{1}{k}\log \mathbb{E}\left[\det\left(\epsilon I + \frac{1}{k}AA^\top\right)\right] \le \epsilon - 1 + \frac{1.5\log(ek)}{k} + 2\sqrt{\epsilon}\log\frac{1}{\epsilon}. \tag{27} $$
Additionally, under Condition (a) or (b) of Lemma 1, one has, for any $\frac{1}{k} < \epsilon \le 0.8$,
$$ \mathbb{E}\left[\frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right)\right] \ge \epsilon - 1 + \frac{\log k}{2k} - O\!\left(\frac{1}{\epsilon k}\right). \tag{28} $$
In particular, if $A_{ij}\sim\mathcal{N}(0,1)$, then for any $0 < \epsilon \le 0.8$,
$$ \mathbb{E}\left[\frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right)\right] \ge \epsilon - 1 + \frac{\log k}{2k} - \frac{2}{\epsilon k}. \tag{29} $$
Proof. See Appendix F.

One issue with Lemmas 1 and 2 is that the bounds may become useless when $\epsilon$ is extremely small. To mitigate this, we derive another concentration lower bound in the following lemma.

Lemma 3. Let $A\in\mathbb{R}^{k\times k}$ be a random matrix satisfying the assumptions of Lemma 1. There exists an absolute constant $c_{14} > 0$ such that for any $\frac{1}{\sqrt{k}} \le \tau < k$ and $0 < \epsilon < 1$,
$$ \frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}AA^\top\right) \;\ge\; \epsilon - 1 - \frac{c_{14}\,\tau^{1/4}}{k^{1/4}}\log\frac{1}{\epsilon} + \frac{\log k}{k} $$
with probability at least $1 - 4\exp(-\tau k)$.
Proof. See Appendix G.

Lemmas 1, 2, and 3 taken collectively allow us to derive the concentration of $\log\det\left(\epsilon I + \frac{1}{k}M_sM_s^\top\right)$ uniformly over all $s\in\binom{[n]}{k}$, as follows.

Lemma 4. Let $M\in\mathbb{R}^{m\times n}$ be a random matrix satisfying the conditions of Theorem 4. Then for any constant $0<\epsilon<1$,
$$ \forall s:\quad \frac{1}{n}\log\det\left(\epsilon I + \frac{1}{k}M_sM_s^\top\right) \;\ge\; \epsilon - 1 - O\left( \min\left\{ \frac{1}{\sqrt{n}},\; \frac{\log\frac{1}{\epsilon} + \log n}{n^{1/4}} \right\} \right) \tag{30} $$
holds with probability exceeding $1 - 2\left(\frac{2}{e}\right)^n$. In particular, if the entries $M_{ij}\sim\mathcal{N}(0,1)$ are jointly independent, then for any $0<\epsilon\le 0.8$,
$$ \forall s:\quad \frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}M_sM_s^\top\right) \;\ge\; \epsilon - 1 - \frac{2}{\epsilon k} - \sqrt{\frac{2}{\beta k}} \tag{31} $$
with probability at least $1 - 2e^{-n}$.

Proof. Fix $\delta = c_5\sqrt{k/\epsilon}$ for some numerical constant $c_5 > 0$ in Lemma 1. Combining Lemmas 1, 2, and 3 yields the following: under the assumptions of Lemma 1, for any small constant $\epsilon > 0$ and some sufficiently large $c_5$, we have
$$ \frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}M_sM_s^\top\right) \;\ge\; \epsilon - 1 + \frac{\log k}{2k} - O\left( \min\left\{ \frac{1}{\sqrt{n}},\; \frac{\log\frac{1}{\epsilon} + \log n}{n^{1/4}} \right\} \right) \tag{32} $$
with probability exceeding $1 - 4e^{-n}$. Since there are at most $\binom{n}{k} \le \frac{1}{2}\cdot 2^n$ different s, applying the union bound establishes (30). Similarly, for the ensemble $M_{ij}\sim\mathcal{N}(0,1)$, Lemmas 1 and 2 indicate that
$$ \frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k}M_sM_s^\top\right) \;\ge\; \epsilon - 1 + \frac{\log k}{2k} - \frac{2}{\epsilon k} - \sqrt{\frac{2}{\beta k}} $$
with probability at least $1 - 2e^{-n}$. The proof is then complete via the union bound.

Another important class of log-determinant functions takes the form $\frac{1}{n}\log\det\left(\frac{1}{n}AA^\top\right)$. The concentration of such functions for i.i.d. rectangular random matrices is characterized in the following lemmas.

Lemma 5. Let $\alpha := m/n$ be a fixed constant independent of (m, n), with $\alpha\in[\delta, 1-\delta]$ for some small constant $\delta > 0$. Let $A\in\mathbb{R}^{m\times n}$ be a random matrix whose entries are symmetric and jointly independent with zero mean and unit variance. Under Condition (a) or (b) of Lemma 1, there exist universal constants $c_7, C_7 > 0$ independent of n such that
$$ \left| \frac{1}{n}\log\det\left(\frac{1}{n}AA^\top\right) - (1-\alpha)\log\frac{1}{1-\alpha} + \alpha \right| \;\le\; \frac{1}{\sqrt{n}} \tag{33} $$
with probability exceeding $1 - C_7\exp(-c_7 n)$.
Proof. See Appendix H.

This result can be made more explicit for Gaussian ensembles as follows.

Lemma 6. Suppose that $A\in\mathbb{R}^{m\times n}$ is a random matrix with independent entries satisfying $A_{ij}\sim\mathcal{N}(0,1)$.
Suppose that there exists a small constant $\delta > 0$ such that $\alpha := \frac{m}{n} \in [\delta, 1-\delta]$.
(1) For any $\epsilon > 0$ and any $\tau > 0$,
$$ \frac{\mathrm{card}\left\{ i \;\middle|\; \lambda_i\left(\frac{1}{n}AA^\top\right) < \epsilon \right\}}{n} \;<\; \frac{\alpha\sqrt{\epsilon}}{1-\alpha-\frac{1}{n}} + 4\sqrt{\frac{\alpha\tau}{n}} \tag{34} $$
with probability exceeding $1 - 2\exp(-\tau n)$.
(2) For any $n > \max\left\{\frac{2}{1-\sqrt{\alpha}},\, 7\right\}$ and any $\tau > 0$,
$$ \frac{1}{n}\log\det\left(\frac{1}{n}AA^\top\right) \;\le\; (1-\alpha)\log\frac{1}{1-\alpha} - \alpha + \frac{2\log n}{n} + \frac{5\sqrt{\alpha}\,\tau}{\left(1-\sqrt{\alpha}-\frac{2}{n}\right)\sqrt{n}} \tag{35} $$
with probability exceeding $1 - 2\exp\left(-2\tau^2\right)$.
(3) For any $\tau > 0$ and any n exceeding a threshold depending only on $(\alpha, \tau)$ (made explicit in Appendix I),
$$ \frac{1}{n}\log\det\left(\frac{1}{n}AA^\top\right) \;>\; (1-\alpha)\log\frac{1}{1-\alpha} - \alpha - \left(\frac{2\alpha}{1-\alpha-\frac{1}{n}} + 11\sqrt{\alpha\tau}\right)\frac{\log n}{n^{1/3}} - \frac{6\alpha}{n^{2/3}} \tag{36} $$
with probability exceeding $1 - 5\exp(-\tau n)$.
Proof. See Appendix I.

The last log-determinant function considered here takes the form $\log\det\left(I + A^\top B^{-1} A\right)$ for independent random matrices A and B, as stated in the following lemma.

Lemma 7. Suppose that $m > k$ and that $m/k$ is bounded away from 1, i.e. there exists a small constant $\delta > 0$ such that $m/k \ge 1+\delta$. Let $A\in\mathbb{R}^{m\times k}$ and $B\in\mathbb{R}^{m\times m}$ be two independent random matrices such that the entries $A_{ij}\sim\mathcal{N}(0,1)$ are jointly independent and $B\sim\mathcal{W}_m(n-k, I_m)$. Then for any $\tau > \frac{1}{n^{1/3}}$,
$$ \frac{1}{n}\log\det\left(I_k + A^\top B^{-1} A\right) \;\ge\; -(\alpha-\beta)\log(\alpha-\beta) + \alpha\log\alpha + (1-\alpha-\beta)\log\left(1-\frac{\beta}{1-\alpha}\right) - \beta\log(1-\alpha) - \frac{c_8\,\tau\log n}{n^{1/3}} $$
with probability exceeding $1 - C_8\exp\left(-\tau^2 n\right)$ for some absolute constants $c_8, C_8 > 0$.
Proof. See Appendix J.
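The deterministic limit in Lemma 5 is easy to probe numerically (our own illustrative sketch, not from the paper): for a single large Gaussian draw, (1/n) log det((1/n)AA^⊤) should sit within about 1/√n of (1−α) log(1/(1−α)) − α:

```python
import numpy as np
from math import log, sqrt

rng = np.random.default_rng(3)
n, m = 900, 360                     # alpha = 0.4
alpha = m / n

A = rng.standard_normal((m, n))
val = np.linalg.slogdet(A @ A.T / n)[1] / n
limit = (1 - alpha) * log(1 / (1 - alpha)) - alpha
print(val, limit, abs(val - limit) < 1 / sqrt(n))
```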

5 Discussion

5.1 Implications of Main Results

Under both Landau-rate and super-Landau-rate sampling, the minimax capacity loss depends almost entirely on β and α. In this subsection, we summarize several key insights from the main theorems.

5.1.1 The Converse

Our analysis demonstrates that at high SNR, the loss $L_s^Q$ depends almost solely on the quantity
$$ d\left(Q(f), s, \epsilon\right) := \det\left(\epsilon I_k + Q_s^{w*}(f)\, Q_s^w(f)\right) $$
for small $\epsilon > 0$, which is approximately the exponential of the capacity loss at a given pair (s, f). In fact, the key observation underlying the proof of Theorem 7 is that for any f, the sum
$$ \sum_{s\in\binom{[n]}{k}} d\left(Q(f), s, \epsilon\right) $$
is a constant independent of the sampling coefficient function Q. In other words, at any given f, the exponential sum of the capacity loss over all states s is invariant regardless of which sampler we employ. This invariant is critical in identifying the minimax sampling method: it motivates us to seek a sampling method that achieves identical performance over all states s. Large random matrices exhibit sharp concentration of their spectral measure, and hence are natural candidates for attaining minimaxity.


Figure 4: Plot (a) illustrates $H(\beta)/2$ vs. the sparsity ratio β, which characterizes the fundamental minimax capacity loss per Hertz within a vanishing gap. Plot (b) illustrates $H(\beta)/(2\beta)$ vs. β, which corresponds approximately to the normalized capacity loss per Hertz.

5.1.2 Landau-rate Sampling

When sampling is performed at the Landau rate, the minimax capacity loss per unit Hertz is almost solely determined by the entropy function H(β). Specifically, when n and k are sufficiently large, the minimax limit depends only on the sparsity ratio β = k/n rather than on (n, k) individually. Some implications of Theorems 4 and 5 are as follows.

1. The capacity loss per unit Hertz is illustrated in Fig. 4(a). The loss vanishes as β → 1, since Nyquist-rate sampling is information preserving. The capacity loss divided by β is plotted in Fig. 4(b), providing a normalized view; the normalized loss decreases monotonically in β, indicating that the loss is more severe for sparser channels. Note that this differs from an LTI channel, for which sampling at the Landau rate suffices to preserve all information. When the channel state is uncertain, increasing the sampling rate above the Landau rate (but below the Nyquist rate) effectively increases the SNR, and hence allows more information to be harvested from the noisy sampled output.

2. The capacity loss incurred by independent random sampling meets the fundamental minimax limit for Landau-rate sampling uniformly across all states s. This reveals that, with exponentially high probability, random sampling is optimal in the sense of universal sampling design: the capacity achieved by random sampling concentrates sharply around the minimax limit uniformly over all states $s\in\binom{[n]}{k}$.

3. A universality phenomenon arising in large random matrices (e.g. [36]) implies that the minimaxity of random sampling matrices does not depend on the particular distribution of the coefficients. For a large class of sub-Gaussian measures, as long as all entries are jointly independent with matching moments up to second order, the resulting sampling mechanism is minimax with exponentially high probability.

5.1.3 Super-Landau-Rate Sampling

The random sampling analyzed in Theorem 6 involves only Gaussian random sampling, and we have not established universality results for this regime. Some implications under super-Landau-rate sampling are as follows.

1. As in the Landau-rate case, Gaussian sampling achieves the minimax capacity loss uniformly across all states s within a large super-Landau-rate regime. The capacity gap is illustrated in Fig. 5: increasing the ratio α/β shrinks the capacity gap and shifts its peak leftwards.


Figure 5: The function $\frac{1}{2}H(\beta) - \frac{\alpha}{2}H(\beta/\alpha)$ vs. the sparsity ratio β and the undersampling factor α, plotted for α/β ∈ {1, 1.25, 1.5, 2, 3, 5}. Here, $\frac{1}{2}H(\beta) - \frac{\alpha}{2}H(\beta/\alpha)$ characterizes the fundamental minimax capacity loss per Hertz within a vanishing gap.


2. Theorem 6 concerns only i.i.d. Gaussian random sampling rather than more general independent random sampling. While we conjecture that the universality phenomenon continues to hold for other jointly independent random ensembles with sub-Gaussian tails and appropriately matching moments, the mathematical analysis is trickier than in the Landau-rate sampling case.

3. The capacity gain from sampling above the Landau rate depends on the undersampling factor α as well. Specifically, the capacity benefit per unit bandwidth due to super-Landau sampling is captured by the term $\frac{1}{2}\alpha H\!\left(\frac{\beta}{\alpha}\right)$. When α → 1, the capacity loss per Hertz reduces to
$$ \frac{1}{2}H(\beta) - \frac{1}{2}\alpha H\!\left(\frac{\beta}{\alpha}\right) = 0, $$
meaning that there is effectively no capacity loss under Nyquist-rate sampling. This agrees with the fact that Nyquist-rate sampling is information preserving.

6 Connections with Discrete-Time Sparse Channels

6.1 Minimax Sampler in Discrete-Time Sparse Channels

Our results extend straightforwardly to discrete-time sparse vector channels. Specifically, consider a collection of n parallel channels. The channel input $x\in\mathbb{R}^n$ is passed through the channel and contaminated by Gaussian noise $n\sim\mathcal{N}(0, I_n)$, yielding the channel output
$$ r = Hx + n, $$
where H is a diagonal channel matrix. At each asymptotically long timeframe, a state $s\in\binom{[n]}{k}$ is generated, which dictates the set of channels available for transmission; i.e., the transmitter can only send x at indices in s. One then obtains m measurements of the channel output through a sensing matrix $Q\in\mathbb{R}^{m\times n}$; i.e., the measurements $y\in\mathbb{R}^m$ can be expressed as
$$ y = Qr = Q(Hx + n). $$
The goal is to identify a sensing matrix Q that minimizes the worst-case capacity loss over all states $s\in\binom{[n]}{k}$. If we abuse notation and use $L_s^Q$ (resp. $L_s^{Q,\mathrm{opt}}$) to denote the capacity loss at state s relative to the Nyquist-rate capacity without (resp. with) power control, then the following results are immediate.

Theorem 10. Define $\mathrm{SNR}_{\min} := \frac{P}{k}\inf_{1\le i\le n} |H_{ii}|^2$ and $\mathrm{SNR}_{\max} := \frac{P}{k}\sup_{1\le i\le n}|H_{ii}|^2$, and suppose that $\mathrm{SNR}_{\max}\ge 1$.
(i) (Landau-rate sampling) If α = β (k = m), then
$$ \inf_Q\, \max_{s\in\binom{[n]}{k}} \frac{L_s^Q}{n} = \frac{1}{2}\left\{ H(\beta) + O\left( \min\left\{ \sqrt{\frac{\mathrm{SNR}_{\max}}{n}},\; \frac{\log n + \log\mathrm{SNR}_{\max}}{n^{1/4}} \right\} \right) \right\} + \Delta_{\mathrm{L}}, \tag{37} $$
and
$$ \inf_Q\, \max_{s\in\binom{[n]}{k}} \frac{L_s^{Q,\mathrm{opt}}}{n} = \frac{1}{2}\left\{ H(\beta) + O\left( \min\left\{ \sqrt{\frac{\mathrm{SNR}_{\max}}{n}},\; \frac{\log n + \log\mathrm{SNR}_{\max}}{n^{1/4}} \right\} \right) \right\} + \Delta_{\mathrm{L}}^{\mathrm{wf}}. \tag{38} $$
(ii) (Super-Landau-rate sampling) Suppose there is a small constant δ > 0 such that α − β ≥ δ and 1 − α − β ≥ δ. Then
$$ \inf_Q\, \max_{s\in\binom{[n]}{k}} \frac{L_s^Q}{n} = \frac{1}{2}\left\{ H(\beta) - \alpha H\!\left(\frac{\beta}{\alpha}\right) + O\!\left(\frac{\log n}{n^{1/3}}\right) \right\} + \Delta_{\mathrm{SL}}, \tag{39} $$
and
$$ \inf_Q\, \max_{s\in\binom{[n]}{k}} \frac{L_s^{Q,\mathrm{opt}}}{n} = \frac{1}{2}\left\{ H(\beta) - \alpha H\!\left(\frac{\beta}{\alpha}\right) + O\!\left(\frac{\log n}{n^{1/3}}\right) \right\} + \Delta_{\mathrm{SL}}^{\mathrm{wf}}. \tag{40} $$
Here, $\Delta_{\mathrm{L}}$, $\Delta_{\mathrm{L}}^{\mathrm{wf}}$, $\Delta_{\mathrm{SL}}$ and $\Delta_{\mathrm{SL}}^{\mathrm{wf}}$ are residual terms satisfying
$$ -\frac{2}{\sqrt{\mathrm{SNR}_{\min}}} \le \Delta_{\mathrm{L}},\, \Delta_{\mathrm{SL}} \le \frac{\beta}{\mathrm{SNR}_{\min}} \qquad\text{and}\qquad -\frac{2}{\sqrt{\mathrm{SNR}_{\min}}} \le \Delta_{\mathrm{L}}^{\mathrm{wf}},\, \Delta_{\mathrm{SL}}^{\mathrm{wf}} \le \frac{\beta + A}{\mathrm{SNR}_{\min}}, $$
where A is a constant defined as
$$ A := \min\left\{ \frac{\mathrm{tr}\left(HH^*\right)}{k\,\min_{1\le i\le n} H_{ii}^2},\; \frac{\max_{1\le i\le n} H_{ii}^2}{\min_{1\le i\le n} H_{ii}^2} \right\}. $$

The key observations are that in a discrete-time sparse vector channel, the minimax capacity loss per degree of freedom again depends only on β and α modulo a vanishingly small gap. Independent random sensing matrices and i.i.d. Gaussian sensing matrices are minimax in terms of a channel-blind sensing matrix design, in the Landau-rate and super-Landau-rate regimes, respectively.

6.2 Connections with Sparse Recovery

6.2.1 Restricted Isometry Property (RIP)

Readers familiar with compressed sensing [7, 8] may naturally wonder whether the optimal sampling matrices M satisfy the restricted isometry property (RIP). The RIP constant $\delta_k$ of a matrix $M\in\mathbb{C}^{m\times n}$ is defined (e.g. [37]) as the smallest quantity such that
$$ (1-\delta_k)\,\|c\| \;\le\; \|M_s c\| \;\le\; (1+\delta_k)\,\|c\| $$
holds for any vector c and any index set s of size at most k (recall that $M_s$ is the submatrix consisting of k columns of M). This quantity measures how close the $M_s$'s are to orthonormal systems, and the existence of a small RIP constant that does not scale with (n, k, m) typically enables exact sparse recovery from noiseless measurements with the sensing matrix M [37]. Nevertheless, RIP is not necessary for approaching the minimax capacity loss. Consider the Landau-rate sampling regime: when the entries of M are independently generated under the conditions of Theorem 8, one typically has (e.g. [38])
$$ \sigma_{\min}(M_s) = O\!\left(\frac{1}{k}\right), $$
which cannot be bounded away from 0 by a constant. Conversely, there is no guarantee that a restricted isometric matrix M suffices for minimaxity: an optimal sampling matrix typically has a spectrum similar to that of an i.i.d. Gaussian matrix, but a general restricted isometric matrix need not. We note, however, that many randomized schemes for generating restricted isometric matrices are natural candidates for generating minimax samplers. As shown by our analysis, a desirable sampling matrix M requires $M_s$ to exhibit a similar spectrum over all s; many random matrices satisfying RIP have this property, and are hence minimax.
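The claim that a Landau-rate random sampler does not act as a restricted isometry is easy to see numerically (our own illustrative sketch, not from the paper): for a square k × k Gaussian submatrix, the smallest singular value of the normalized submatrix is tiny rather than bounded below by a constant:

```python
import numpy as np

rng = np.random.default_rng(4)
k = 300
Ms = rng.standard_normal((k, k))          # a square k x k submatrix of a k x n sampler
sigma_min = np.linalg.svd(Ms / np.sqrt(k), compute_uv=False).min()
# An RIP-style bound would require sigma_min >= 1 - delta_k for a small constant
# delta_k; instead sigma_min vanishes as k grows.
print(sigma_min)   # a very small number, nowhere near 1
```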

6.2.2 Necessary Sampling Rate

It is well known that there exists a spectrum-blind sampling matrix with 2k noiseless measurements that admits perfect recovery of any k-sparse signal. In the continuous-time counterpart, the minimum sampling rate for perfect recovery in the absence of noise is twice the spectral occupancy [39]. Nevertheless, this sampling rate does not allow zero capacity loss in our setting. Since the channel output is contaminated by noise and thus has bandwidth W, a spectrum-blind sampler is unable to suppress spectral content outside the active subbands, and hence suffers an information rate loss relative to the Nyquist-rate capacity. On the other hand, twice the spectral occupancy exhibits no threshold effect in our setting, as illustrated in Figure 5. This arises because the transmitter can adapt its transmitted signal to the instantaneous realization and sampling rate; for instance, the spectral support of the transmitted signal may shrink as the sampling rate decreases, thereby avoiding an inflection point on the capacity curves.

7 Conclusions

We have investigated optimal universal sampling design from a capacity perspective. To evaluate the loss due to universal sub-Nyquist sampling design, we introduced the notion of sampled capacity loss relative to the Nyquist-rate capacity, and characterized the overall robustness of a sampling design through the minimax capacity loss metric. Specifically, we determined the fundamental minimax limit on the sampled capacity loss achievable by a class of channel-blind periodic sampling systems. This minimax limit turns out to be a constant that depends only on the band sparsity ratio and the undersampling factor, modulo a residual term that vanishes as the SNR and the number of subbands grow. Our results demonstrate that, with exponentially high probability, random sampling is minimax in the sense of universal sampler design, which highlights the power of random sampling methods for channel-blind design. In addition, our results extend without difficulty to discrete-time counterparts: independent random sensing matrices are minimax for discrete-time sparse vector channels.

It remains to characterize the fundamental minimax capacity loss when sampling is performed below the Landau rate, and to determine whether random sampling is still optimal in the sub-Landau-rate regime. It would also be interesting to extend this framework to situations beyond compound multiband channels; our notion of sampled capacity loss should be useful in evaluating robustness in such scenarios, including channels with state where sparsity exists in other transform domains. Finally, for multiple access or random access channels [23], it is not yet clear how to find a channel-blind sampler that is robust over the entire capacity region.

A Proof of Theorem 3

We would like to bound the gap between $C_s^{\mathrm{wf}}$ and $C_s^{\mathrm{eq}}$. The equation determining the water level ν implies that
$$ P = \int_0^{W/n} \sum_{i=1}^{k} \left( \nu - \frac{1}{(H_s(f))_{ii}^2} \right)^{+} \mathrm{d}f \;\ge\; \int_0^{W/n} \sum_{i=1}^{k} \left( \nu - \frac{1}{(H_s(f))_{ii}^2} \right) \mathrm{d}f, $$
which in turn yields
$$ \nu \;\le\; \frac{P}{\beta W} + \frac{\int_0^{W/n} \sum_{i=1}^{k} \frac{1}{(H_s(f))_{ii}^2}\,\mathrm{d}f}{\beta W}. $$
With this bound on the water level, the capacity can be bounded above as
$$ C_s^{\mathrm{opt}} = \int_0^{W/n} \sum_{i=1}^{k} \frac{1}{2}\log^{+}\left( \nu\,(H_s(f))_{ii}^2 \right) \mathrm{d}f \;\le\; \int_0^{W/n} \sum_{i=1}^{k} \frac{1}{2}\log\left( \frac{P}{\beta W}(H_s(f))_{ii}^2 + \frac{\int_0^{W/n}\sum_{j=1}^{k} \frac{(H_s(f))_{ii}^2}{(H_s(f))_{jj}^2}\,\mathrm{d}f}{\beta W} \right) \mathrm{d}f $$
$$ \le\; \int_0^{W/n} \sum_{i=1}^{k} \frac{1}{2}\log\left( A + \frac{P}{\beta W}(H_s(f))_{ii}^2 \right) \mathrm{d}f \;=\; \frac{1}{2}\int_0^{W/n} \log\det\left( A I + \frac{P}{\beta W} H_s^2(f) \right) \mathrm{d}f, \tag{41} $$
where
$$ A := \max_{s,i}\, \frac{\int_0^{W/n} \sum_{j=1}^{k} \frac{(H_s(f))_{ii}^2}{(H_s(f))_{jj}^2}\,\mathrm{d}f}{\beta W}. $$
One can easily verify that A ≥ 1. Therefore,
$$ C_s^{\mathrm{opt}} - C_s^{\mathrm{eq}} \;\le\; \frac{1}{2}\int_0^{W/n} \log\det\left( AI + \frac{P}{\beta W}H_s^2(f) \right)\mathrm{d}f - \frac{1}{2}\int_0^{W/n} \log\det\left( I + \frac{P}{\beta W}H_s^2(f) \right)\mathrm{d}f $$
$$ = \frac{1}{2}\int_0^{W/n} \sum_{i=1}^{k} \log \frac{A + \frac{P}{\beta W}\left(H_s^2(f)\right)_{ii}}{1 + \frac{P}{\beta W}\left(H_s^2(f)\right)_{ii}}\,\mathrm{d}f \;\le\; \frac{1}{2}\int_0^{W/n} \sum_{i=1}^{k} \log \frac{A + \inf_{0\le f\le W} \frac{P}{\beta W}\frac{|H(f,s)|^2}{S_\eta(f)}}{1 + \inf_{0\le f\le W} \frac{P}{\beta W}\frac{|H(f,s)|^2}{S_\eta(f)}}\,\mathrm{d}f $$
$$ \le\; \frac{\beta W}{2}\log\left( 1 + \frac{A-1}{1+\mathrm{SNR}_{\min}} \right) \;\le\; \frac{W\beta(A-1)}{2\left(1+\mathrm{SNR}_{\min}\right)}. $$
We also observe that
$$ A = \max_{s,i}\, \frac{\int_0^{W/n}\sum_{j=1}^{k} \frac{(H_s(f))_{ii}^2}{(H_s(f))_{jj}^2}\,\mathrm{d}f}{\beta W} \;\le\; \min\left\{ \frac{\max_{s\in\binom{[n]}{k}} \int_0^{W} \frac{|H(f,s)|^2}{S_\eta(f)}\,\mathrm{d}f}{\beta W \inf_{0\le f\le W,\, s\in\binom{[n]}{k}} \frac{|H(f,s)|^2}{S_\eta(f)}},\; \frac{\sup_{0\le f\le W,\, s\in\binom{[n]}{k}} \frac{|H(f,s)|^2}{S_\eta(f)}}{\inf_{0\le f\le W,\, s\in\binom{[n]}{k}} \frac{|H(f,s)|^2}{S_\eta(f)}} \right\}. $$
Combining the above bounds and Theorem 2 completes the proof.
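The water-level bound above has a simple discrete analogue, sketched below (our own illustration, not from the paper): solve Σᵢ(ν − 1/gᵢ)⁺ = P for ν by bisection, then check ν ≤ (P + Σᵢ 1/gᵢ)/k:

```python
import numpy as np

rng = np.random.default_rng(5)
k, P = 8, 4.0
g = rng.uniform(0.2, 2.0, size=k) ** 2      # channel gains, playing the role of H_ii^2
inv_g = 1.0 / g

def total_power(nu: float) -> float:
    """Power used at water level nu: sum_i (nu - 1/g_i)^+."""
    return float(np.maximum(nu - inv_g, 0.0).sum())

# Bisection for the water level nu solving total_power(nu) = P.
lo, hi = 0.0, P + inv_g.max()
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if total_power(mid) < P else (lo, mid)
nu = (lo + hi) / 2

# Since sum (nu - 1/g_i)^+ >= sum (nu - 1/g_i), we get nu <= (P + sum 1/g_i)/k.
print(nu <= (P + inv_g.sum()) / k + 1e-9)   # → True
```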

B Proof of Theorem 7

Before proving the results, we first state two facts. Consider any $m\times m$ matrix A, and list its eigenvalues as $\lambda_1, \cdots, \lambda_m$. Define the characteristic polynomial of A as
$$ p_A(t) = \det(tI - A) = t^m - S_1(\lambda_1,\cdots,\lambda_m)\,t^{m-1} + \cdots + (-1)^m S_m(\lambda_1,\cdots,\lambda_m), $$
where $S_l(\lambda_1,\cdots,\lambda_m)$ is the $l$th elementary symmetric function of $\lambda_1,\cdots,\lambda_m$, defined as
$$ S_l(\lambda_1,\cdots,\lambda_m) := \sum_{1\le i_1 < \cdots < i_l \le m}\; \prod_{j=1}^{l} \lambda_{i_j}. $$

C Proof of Theorem 8

The key step is to control $\det\left(\epsilon I_k + (MM^\top)^{-1} M_s M_s^\top\right)$ uniformly over s. Observe that
$$ \log\det\left(\epsilon I_k + \left(MM^\top\right)^{-1} M_s M_s^\top\right) = \log\det\left(\epsilon\, MM^\top + M_s M_s^\top\right) - \log\det\left(MM^\top\right) $$
$$ \ge \log\det\left( \frac{\epsilon\,\sigma_{\min}\!\left(MM^\top\right)}{k}\, I_k + \frac{1}{k} M_s M_s^\top \right) - \log\det\left( \frac{1}{k} MM^\top \right) $$
whenever $\sigma_{\min}\left(MM^\top\right)$ is bounded away from zero, which helps separate $M_sM_s^\top$ and $MM^\top$. The behavior of the least singular value of a rectangular random matrix with independent sub-Gaussian entries has been extensively studied in the random matrix literature (e.g. [42, Theorem 3.1] and [28, Corollary 5.35]), which we cite as follows.

Lemma 8. Suppose that $m = (1-\delta)n$ for some absolute constant $\delta\in(0,1)$. Let M be an $m\times n$ real-valued random matrix whose entries are jointly independent symmetric sub-Gaussian random variables with zero mean and unit variance. Then there exist universal constants $C, c > 0$ such that
$$ \sigma_{\min}\left(MM^*\right) > Cm \tag{49} $$
with probability at least $1 - 2\exp(-cn)$. In particular, if the entries of M are i.i.d. standard Gaussian, then for any constant $0 < \xi < \sqrt{n} - \sqrt{m}$,
$$ \sigma_{\min}\left(MM^*\right) > \left(\sqrt{n} - \sqrt{m} - \xi\right)^2 $$
with probability at least $1 - \exp\left(-\xi^2/2\right)$ (this follows from [28, Proposition 5.34 and Theorem 5.32] by observing that $\sigma_{\min}(M)$ is a 1-Lipschitz function of M).

Setting m = k, we derive that with probability exceeding $1 - 2\exp(-cn)$,
$$ \log\det\left(\epsilon I_k + \left(MM^\top\right)^{-1} M_s M_s^\top\right) \ge \log\det\left( \epsilon C\, I_k + \frac{1}{k} M_s M_s^\top \right) - \log\det\left( \frac{1}{k} MM^\top \right) \tag{50} $$
holds for general independent symmetric sub-Gaussian matrices. Also, with probability at least $1 - 2e^{-\xi^2 n/2}$,
$$ \log\det\left(\epsilon I_k + \left(MM^\top\right)^{-1} M_s M_s^\top\right) \ge \log\det\left( \frac{\left(1-\sqrt{\beta}-\frac{\xi}{\sqrt{n}}\right)^2}{\beta}\,\epsilon I_k + \frac{1}{k} M_s M_s^\top \right) - \log\det\left( \frac{1}{k} MM^\top \right) \tag{51} $$
holds for i.i.d. standard Gaussian matrices and any constant $\xi\in(0,1)$.

The next step is to control $\log\det\left(\epsilon I + \frac{1}{k}M_sM_s^\top\right)$ for some small $0<\epsilon<1$. This has been characterized in Lemma 4, which indicates that
$$ \forall s\in\binom{[n]}{k}:\quad \frac{1}{k}\log\det\left(\epsilon I + \frac{1}{k} M_s M_s^\top\right) \ge \epsilon - 1 - O\left( \min\left\{ \frac{1}{\sqrt{n}},\; \frac{\log\frac{1}{\epsilon} + \log n}{n^{1/4}} \right\} \right) \tag{52} $$
with exponentially high probability. In addition, since M satisfies the assumptions of Lemma 5 with α = β, simple manipulation gives
$$ \frac{1}{n}\log\det\left(\frac{1}{k}MM^\top\right) = \frac{1}{n}\log\det\left(\frac{1}{n}MM^\top\right) + \frac{k}{n}\log\frac{n}{k} \le (1-\beta)\log\frac{1}{1-\beta} - \beta + \beta\log\frac{1}{\beta} + \frac{1}{\sqrt{n}} \tag{53} $$
with probability exceeding $1 - C_7\exp(-c_7 n)$. The results (49), (50), (52) and (53) taken collectively yield the following: under the conditions of Theorem 8,
$$ \forall s:\quad \frac{1}{n}\log\det\left(\epsilon I_k + \left(MM^\top\right)^{-1} M_s M_s^\top\right) \ge -H(\beta) + \frac{\log k}{2k} - O\left(\frac{1}{\sqrt{\epsilon n}}\right) \tag{54} $$
with probability exceeding $1 - C\exp(-cn)$ for some absolute constants $c, C > 0$. Combining this lower bound with the upper bound developed in Theorem 7 (with α = β) concludes the proof.

D Proof of Theorem 9

Our goal is to evaluate $\frac{1}{n}\log\det\left(\epsilon I_k+M_s^{\top}\left(MM^{\top}\right)^{-1}M_s\right)$ for some small $\epsilon>0$. We first define two Wishart matrices $\Xi_{\backslash s}:=\frac{1}{n}MM^{\top}-\frac{1}{n}M_sM_s^{\top}$ and $\Xi_{s}:=\frac{1}{n}M_sM_s^{\top}$. Apparently, $\Xi_{s}\sim\mathcal{W}_{m}\left(k,\frac{1}{n}I_m\right)$ and $\Xi_{\backslash s}\sim\mathcal{W}_{m}\left(n-k,\frac{1}{n}I_m\right)$. When $1-\alpha>\beta$, i.e. $n-k>m$, the Wishart matrix $\Xi_{\backslash s}$ is invertible with probability 1.

One difficulty in evaluating $\det\left(\epsilon I_k+M_s^{\top}\left(MM^{\top}\right)^{-1}M_s\right)$ is that $M_s$ and $MM^{\top}$ are not independent. This motivates us to decouple them first as follows:
$$\det\left(\epsilon I_k+M_s^{\top}\left(MM^{\top}\right)^{-1}M_s\right)=\epsilon^{k-m}\det\left(\epsilon I_m+\frac{1}{n}M_sM_s^{\top}\left(\frac{1}{n}MM^{\top}\right)^{-1}\right)$$
$$=\epsilon^{k-m}\det\left(\epsilon\cdot\frac{1}{n}MM^{\top}+\frac{1}{n}M_sM_s^{\top}\right)\det\left(\frac{1}{n}MM^{\top}\right)^{-1}$$
$$=\epsilon^{k-m}\det\left(\epsilon\Xi_{\backslash s}+(1+\epsilon)\Xi_{s}\right)\det\left(\frac{1}{n}MM^{\top}\right)^{-1}$$
$$=\epsilon^{k-m}\det\left(\epsilon I_m+(1+\epsilon)\Xi_{s}\Xi_{\backslash s}^{-1}\right)\det\left(\Xi_{\backslash s}\right)\det\left(\frac{1}{n}MM^{\top}\right)^{-1}$$
$$=\det\left(\epsilon I_k+\frac{1+\epsilon}{n}M_s^{\top}\Xi_{\backslash s}^{-1}M_s\right)\det\left(\Xi_{\backslash s}\right)\det\left(\frac{1}{n}MM^{\top}\right)^{-1}$$
or, equivalently,
$$\frac{1}{n}\log\det\left(\epsilon I_k+M_s^{\top}\left(MM^{\top}\right)^{-1}M_s\right)=\frac{1}{n}\log\det\left(\epsilon I_k+\frac{1+\epsilon}{n}M_s^{\top}\Xi_{\backslash s}^{-1}M_s\right)+\frac{1}{n}\log\det\left(\Xi_{\backslash s}\right)-\frac{1}{n}\log\det\left(\frac{1}{n}MM^{\top}\right). \qquad (55)$$

The point of developing the identity (55) is to decouple the left-hand side of (55) into 3 matrices: $M_s^{\top}\Xi_{\backslash s}^{-1}M_s$, $\Xi_{\backslash s}$ and $MM^{\top}$. In particular, since $M_s$ and $\Xi_{\backslash s}$ are jointly independent, we can examine the concentration of measure for $M_s$ and $\Xi_{\backslash s}^{-1}$ separately when evaluating $M_s^{\top}\Xi_{\backslash s}^{-1}M_s$. The second and third terms of (55) can be evaluated through Lemma 6. Specifically, Lemma 6 indicates that
$$\frac{1}{n}\log\det\left(\frac{1}{n}MM^{\top}\right)\leq-(1-\alpha)\log(1-\alpha)-\alpha+O\left(\frac{1}{\sqrt{n}}\right) \qquad (56)$$
with probability at least $1-C_{6}\exp(-2n)$ for some constant $C_{6}>0$, and that for all $s\in\binom{[n]}{k}$,
$$\frac{1}{n}\log\det\left(\Xi_{\backslash s}\right)=\frac{n-k}{n}\cdot\frac{1}{n-k}\log\det\left(\frac{n}{n-k}\Xi_{\backslash s}\right)+\frac{1}{n}\log\det\left(\frac{n-k}{n}I_m\right)$$
$$\geq(1-\beta)\left\{-\left(1-\frac{\alpha}{1-\beta}\right)\log\left(1-\frac{\alpha}{1-\beta}\right)-\frac{\alpha}{1-\beta}\right\}+\alpha\log(1-\beta)-O\left(\frac{\log n}{n^{1/3}}\right)$$
$$\geq-(1-\alpha-\beta)\log\left(1-\frac{\alpha}{1-\beta}\right)-\alpha+\alpha\log(1-\beta)-O\left(\frac{\log n}{n^{1/3}}\right) \qquad (57)$$
hold simultaneously with probability exceeding $1-C_{9}\exp(-2n)$.

Our main task then amounts to quantifying $\log\det\left(\epsilon I_k+\frac{1+\epsilon}{n}M_s^{\top}\Xi_{\backslash s}^{-1}M_s\right)$, which can be lower bounded via Lemma 7. This together with (55), (56) and (57) yields that
$$\frac{1}{n}\log\det\left(\epsilon I_k+M_s^{\top}\left(MM^{\top}\right)^{-1}M_s\right)$$
$$\geq-(\alpha-\beta)\log(\alpha-\beta)+\alpha\log\alpha+(1-\alpha-\beta)\log\left(1-\frac{\beta}{1-\alpha}\right)-\beta\log(1-\alpha)$$
$$\quad-(1-\alpha-\beta)\log\left(1-\frac{\alpha}{1-\beta}\right)-\alpha+\alpha\log(1-\beta)+(1-\alpha)\log(1-\alpha)+\alpha-O\left(\frac{\log n}{n^{1/3}}\right)$$
$$=\alpha H\left(\frac{\beta}{\alpha}\right)-H(\beta)-O\left(\frac{\log n}{n^{1/3}}\right)$$
with probability exceeding $1-C_{9}\exp(-2n)$ for some constant $C_{9}>0$. Since there are at most $\binom{n}{k}<e^{n}$ different states $s$, applying the union bound over all states completes the proof.
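The decoupling identity (55) is purely algebraic, so it can be verified exactly on random instances. The sketch below assumes numpy; the sizes $m,k,n$ are illustrative (any $n-k>m$ with full-rank matrices works).

```python
import numpy as np

# Exact check of the decoupling identity (55): for any full-rank M and eps > 0,
#   det(eps*I_k + Ms^T (M M^T)^{-1} Ms)
#     = det(eps*I_k + (1+eps)/n * Ms^T Xi_rest^{-1} Ms)
#       * det(Xi_rest) / det(M M^T / n),
# where Xi_rest = (M M^T - Ms Ms^T) / n.
rng = np.random.default_rng(1)
m, k, n, eps = 4, 3, 20, 0.1
M = rng.standard_normal((m, n))
Ms = M[:, :k]                        # columns indexed by a fixed state s
Xi_rest = (M @ M.T - Ms @ Ms.T) / n  # the Wishart part excluding state s
lhs = np.linalg.det(eps * np.eye(k) + Ms.T @ np.linalg.inv(M @ M.T) @ Ms)
rhs = (np.linalg.det(eps * np.eye(k) + (1 + eps) / n * Ms.T @ np.linalg.inv(Xi_rest) @ Ms)
       * np.linalg.det(Xi_rest) / np.linalg.det(M @ M.T / n))
```

Both sides agree to machine precision, confirming that the $\epsilon^{k-m}$ prefactors introduced and removed along the chain cancel exactly.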

E Proof of Lemma 1

Set $f(x)=\log(\epsilon+x)$. Observe that the Lipschitz constant of $g(x):=f\left(x^{2}\right)=\log\left(\epsilon+x^{2}\right)$ is bounded as
$$\forall x\geq0,\quad\left|g'(x)\right|=\frac{2x}{\epsilon+x^{2}}\leq\frac{2x}{2\sqrt{\epsilon}\,x}=\frac{1}{\sqrt{\epsilon}}.$$
When the $A_{ij}$'s are bounded in magnitude by $D$, define the following concave function
$$g(x):=\begin{cases}\log\left(\epsilon+x^{2}\right),&\text{if }x\geq\sqrt{\epsilon},\\ \frac{1}{\sqrt{\epsilon}}\left(x-\sqrt{\epsilon}\right)+\log(2\epsilon),&\text{if }0\leq x<\sqrt{\epsilon},\end{cases}$$
whose Lipschitz constant is bounded by $\frac{1}{\sqrt{\epsilon}}$. This function obeys the following interlacing bound
$$\log\left(\frac{2\epsilon}{e}+x\right)\leq g\left(\sqrt{x}\right)\leq\log(\epsilon+x),\qquad\forall x\geq0. \qquad (58)$$
Define the function
$$f(A):=\frac{1}{k}\sum_{i=1}^{k}g\left(\sqrt{\lambda_{i}\left(\frac{1}{k}AA^{\top}\right)}\right); \qquad (59)$$
then it follows from (58) that
$$\frac{1}{k}\log\det\left(\frac{2\epsilon}{e}I+\frac{1}{k}AA^{\top}\right)\leq f(A)\leq\frac{1}{k}\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right). \qquad (60)$$
One can then apply [29, Corollary 1.8(a)] to derive that for any $\delta>\frac{8D\sqrt{\pi}}{\sqrt{\epsilon}}$,
$$f(A)\geq\mathbb{E}\left[f(A)\right]-\frac{\delta}{k}\geq\mathbb{E}\left[\frac{1}{k}\log\det\left(\frac{2\epsilon}{e}I+\frac{1}{k}AA^{\top}\right)\right]-\frac{\delta}{k}$$
with probability at least $1-4\exp\left(-\frac{\epsilon}{4D^{2}}\left(\delta-\frac{4D\sqrt{\pi}}{\sqrt{\epsilon}}\right)^{2}\right)$. This together with (60) yields that for any $\delta>\frac{8D\sqrt{\pi}}{\sqrt{\epsilon}}$,
$$\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\geq kf(A)\geq\mathbb{E}\left[\log\det\left(\frac{2\epsilon}{e}I+\frac{1}{k}AA^{\top}\right)\right]-\delta$$
with probability at least $1-4\exp\left(-\frac{\epsilon\delta^{2}}{16D^{2}}\right)$. A similar argument indicates that for any $\delta>\frac{8D\sqrt{2\pi}}{\sqrt{e\epsilon}}$,
$$\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\leq kf_{\frac{e\epsilon}{2}}(A)\leq\mathbb{E}\left[\log\det\left(\frac{e\epsilon}{2}I+\frac{1}{k}AA^{\top}\right)\right]+\delta$$
with probability at least $1-4\exp\left(-\frac{e\epsilon\delta^{2}}{32D^{2}}\right)$, where $f_{\frac{e\epsilon}{2}}(\cdot)$ is defined as in (59) with $\epsilon$ replaced by $\frac{e\epsilon}{2}$.

If the $A_{ij}$'s satisfy the LSI with uniformly bounded constant $c_{\mathrm{LS}}$, then applying [29, Corollary 1.8(b)] leads to
$$\left|\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)-\mathbb{E}\left[\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\right]\right|>\delta$$
with probability at most $2\exp\left(-\frac{\epsilon\delta^{2}}{2c_{\mathrm{LS}}}\right)$, as claimed.
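The interlacing bound (58) is the crux of the sandwich in (60), and it can be checked pointwise. This sketch assumes numpy and uses an illustrative value of $\epsilon$.

```python
import numpy as np

# Pointwise verification of the interlacing bound (58):
#   log(2*eps/e + x) <= g(sqrt(x)) <= log(eps + x),  for all x >= 0,
# where g is the concave surrogate from the proof of Lemma 1.
eps = 0.05

def g(y, eps):
    # concave (linear) extension of log(eps + y^2) below y = sqrt(eps)
    if y >= np.sqrt(eps):
        return np.log(eps + y ** 2)
    return (y - np.sqrt(eps)) / np.sqrt(eps) + np.log(2 * eps)

xs = np.linspace(0.0, 5.0, 2001)
lower = np.log(2 * eps / np.e + xs)
upper = np.log(eps + xs)
mid = np.array([g(np.sqrt(x), eps) for x in xs])
gap_low = float(np.min(mid - lower))  # should be >= 0
gap_up = float(np.min(upper - mid))   # should be >= 0
```

Both gaps are nonnegative on the whole grid: the linear branch of $g$ is the tangent line of $\log(\epsilon+y^{2})$ at $y=\sqrt{\epsilon}$, which lies below that (locally convex) curve, while the lower envelope touches $g$ exactly at $x=0$.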

F Proof of Lemma 2

(1) By Jensen's inequality,
$$\frac{1}{k}\mathbb{E}\left[\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\right]\leq\frac{1}{k}\log\mathbb{E}\left[\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\right]\leq-1+\frac{1.5\log(ek)}{k}+2\sqrt{\epsilon}\log\frac{1}{\epsilon},$$
where the last inequality follows from [43, Lemma 3].

(2) The lower bound follows from the concentration inequality. Define
$$Y:=k\left(f(A)-\mathbb{E}\left[f(A)\right]\right),$$
where $f(A)$ is defined in (59). Similar to the proof of Lemma 1, applying [29, Corollary 1.8] indicates that
$$\mathbb{P}\left(|Y|>\delta\right)\leq4\exp\left(-\tilde{c}\delta^{2}\right) \qquad (61)$$
for some constant $\tilde{c}>0$. If we denote by $f_{|Y|}(\cdot)$ the probability density function of $|Y|$, then
$$\mathbb{E}\left[e^{Y}\right]\leq\mathbb{E}\left[e^{|Y|}\right]=\int_{0}^{\infty}e^{y}f_{|Y|}(y)\,\mathrm{d}y=\left[-e^{y}\mathbb{P}\left(|Y|>y\right)\right]_{0}^{\infty}+\int_{0}^{\infty}e^{y}\mathbb{P}\left(|Y|>y\right)\mathrm{d}y$$
$$\leq1+\int_{0}^{\infty}4\exp\left(y-\tilde{c}y^{2}\right)\mathrm{d}y\leq1+4\sqrt{\frac{\pi}{\tilde{c}}}\exp\left(\frac{1}{4\tilde{c}}\right).$$
Also, the inequality $kf(A)\geq\log\det\left(\frac{2\epsilon}{e}I+\frac{1}{k}AA^{\top}\right)$ allows us to derive
$$\mathbb{E}\left[\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\right]\geq k\,\mathbb{E}\left[f(A)\right]\geq\log\mathbb{E}\left[e^{kf(A)}\right]-\log\left(1+4\sqrt{\frac{\pi}{\tilde{c}}}\exp\left(\frac{1}{4\tilde{c}}\right)\right)$$
$$\geq\log\mathbb{E}\left[\det\left(\frac{2\epsilon}{e}I+\frac{1}{k}AA^{\top}\right)\right]-\log\left(1+4\sqrt{\frac{\pi}{\tilde{c}}}\exp\left(\frac{1}{4\tilde{c}}\right)\right). \qquad (62)$$
Furthermore, if we denote by $\Pi_{k}$ the permutation group of $k$ elements, then the Leibniz formula for the determinant gives
$$\det(A)=\sum_{\sigma\in\Pi_{k}}\mathrm{sgn}(\sigma)\prod_{i=1}^{k}A_{i,\sigma(i)}.$$
Taking advantage of the joint independence hypothesis yields
$$\mathbb{E}\left[\det\left(AA^{\top}\right)\right]=\mathbb{E}\left[\left(\det A\right)^{2}\right]=\sum_{\sigma\in\Pi_{k}}\mathbb{E}\left[\prod_{i=1}^{k}A_{i,\sigma(i)}^{2}\right]=\sum_{\sigma\in\Pi_{k}}\prod_{i=1}^{k}\mathbb{E}\left[A_{i,\sigma(i)}^{2}\right]=k!.$$
This taken collectively with (62) yields
$$\frac{1}{k}\mathbb{E}\left[\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\right]\geq\frac{1}{k}\log\mathbb{E}\left[\det\left(\frac{1}{k}AA^{\top}\right)\right]-\frac{1}{k}\log\left(1+4\sqrt{\frac{\pi}{\tilde{c}}}\exp\left(\frac{1}{4\tilde{c}}\right)\right)$$
$$=\frac{1}{k}\log\frac{k!}{k^{k}}-\frac{1}{k}\log\left(1+4\sqrt{\frac{\pi}{\tilde{c}}}\exp\left(\frac{1}{4\tilde{c}}\right)\right)\geq-1+\frac{\log k}{2k}-\frac{1}{k}\log\left(1+4\sqrt{\frac{\pi}{\tilde{c}}}\exp\left(\frac{1}{4\tilde{c}}\right)\right),$$
where the last inequality makes use of the well-known Stirling-type inequality $k!\geq k^{k+\frac{1}{2}}e^{-k}$, i.e. $\frac{1}{k}\log\frac{k!}{k^{k}}\geq\frac{\log k}{2k}-1$. This indicates that $\frac{1}{k}\mathbb{E}\left[\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\right]\geq-1+\frac{\log k}{2k}-O\left(\frac{1}{k}\right)$.

In particular, for the i.i.d. Gaussian ensemble $A_{ij}\sim\mathcal{N}(0,1)$ one has $\tilde{c}=\epsilon$, and hence for any $\epsilon\leq0.8$:
$$\frac{1}{k}\mathbb{E}\left[\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\right]\geq-1+\frac{\log k}{2k}-\frac{1}{k}\log\left(1+4\sqrt{\frac{\pi}{\epsilon}}\exp\left(\frac{1}{4\epsilon}\right)\right)$$
$$\geq-1+\frac{\log k}{2k}-\frac{1}{k}\log\left(\exp\left(\frac{2}{\epsilon}\right)\right)=-1+\frac{\log k}{2k}-\frac{2}{\epsilon k},$$
where the second inequality uses the fact that $1+4\sqrt{\frac{\pi}{\epsilon}}\exp\left(\frac{1}{4\epsilon}\right)\leq\exp\left(\frac{2}{\epsilon}\right)$ for any $0<\epsilon\leq0.8$.
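The Leibniz-formula computation $\mathbb{E}\left[\left(\det A\right)^{2}\right]=k!$ is easy to probe by Monte Carlo. This sketch assumes numpy and uses $k=2$ (where $\mathrm{Var}\left[(\det A)^{2}\right]=20$ keeps the sample mean stable); the identity holds for any $k$.

```python
import numpy as np

# Monte Carlo check of E[det(A A^T)] = E[(det A)^2] = k! for a k x k matrix
# with jointly independent zero-mean unit-variance entries (Gaussian here).
rng = np.random.default_rng(2)
k, trials = 2, 200_000
A = rng.standard_normal((trials, k, k))   # stacked k x k samples
dets_sq = np.linalg.det(A) ** 2
emp = float(dets_sq.mean())               # should be close to k! = 2
```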

G Proof of Lemma 3

We first attempt to estimate the number of eigenvalues of $\frac{1}{k}AA^{\top}$ not exceeding $\xi$. To this end, we consider the function
$$f_{2,\xi}(x):=\begin{cases}-\sqrt{\frac{x}{\xi}}+2,&\text{if }x\leq4\xi,\\ 0,&\text{else},\end{cases}$$
as well as the convex function
$$g_{2,\xi}(x):=f_{2,\xi}\left(x^{2}\right)=\begin{cases}-\frac{x}{\sqrt{\xi}}+2,&\text{if }x\leq2\sqrt{\xi},\\ 0,&\text{else}.\end{cases}$$
Apparently,
$$\mathbb{1}_{(0,\xi)}(x)\leq f_{2,\xi}(x)\leq2\cdot\mathbb{1}_{(0,4\xi)}(x), \qquad (63)$$
and the Lipschitz constant of $g_{2,\xi}(\cdot)$ is bounded above by $\frac{1}{\sqrt{\xi}}$.

By our assumptions, the $A_{ij}$'s are symmetric and sub-Gaussian. If we generate the random matrix $\tilde{A}$ such that $\tilde{A}_{ij}:=A_{ij}\mathbb{1}_{\left\{|A_{ij}|\leq k^{1/4}\right\}}$, then one can easily verify that
$$\mathbb{P}\left(\tilde{A}=A\right)=1-o(1),\qquad\mathbb{E}\left[\tilde{A}_{ij}\right]=0,\qquad\mathbb{E}\left[\tilde{A}_{ij}^{2}\right]=1-o(1).$$
If we denote by
$$N_{I}(A):=\mathrm{card}\left\{i\ \Big|\ \lambda_{i}\left(\frac{1}{k}AA^{\top}\right)\in I\right\}$$
for any interval $I$, then it follows from the theory of local spectral statistics [44, Theorem 4.1] that
$$N_{(0,\xi)}\left(\tilde{A}\right)\leq k\int_{0}^{\xi}\rho(x)\,\mathrm{d}x+\frac{1}{4}k\xi \qquad (64)$$
with probability at least $1-c_{5}/k^{3}$ for some constant $c_{5}>0$, where $\rho(x)$ represents the Marchenko–Pastur density
$$\rho(x):=\frac{1}{2\pi x}\sqrt{(4-x)x}\cdot\mathbb{1}_{[0,4]}(x)\leq\frac{1}{\pi\sqrt{x}}.$$
This immediately implies that
$$\frac{1}{k}\mathbb{E}\left[N_{(0,\xi)}(A)\right]\leq\frac{c_{5}}{k^{3}}+\left(1-\frac{c_{5}}{k^{3}}\right)\left(\int_{0}^{\xi}\rho(x)\,\mathrm{d}x+\frac{1}{4}\xi\right)\leq\frac{2}{\pi}\sqrt{\xi}+\frac{1}{4}\xi+\frac{c_{5}}{k^{3}}.$$
The concentration inequality [29, Corollary 1.8] ensures that for any $\delta=\Omega\left(\frac{1}{k\sqrt{\xi}}\right)$,
$$\frac{1}{k}N_{(0,\xi)}(A)\leq\frac{1}{k}\sum_{i=1}^{k}f_{2,\xi}\left(\lambda_{i}\left(\frac{1}{k}AA^{\top}\right)\right)\leq\mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k}f_{2,\xi}\left(\lambda_{i}\left(\frac{1}{k}AA^{\top}\right)\right)\right]+\delta$$
with probability exceeding $1-4\exp\left(-\tilde{c}\xi\delta^{2}k^{2}\right)$ for some constant $\tilde{c}>0$. The inequality (63) suggests that
$$\mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k}f_{2,\xi}\left(\lambda_{i}\left(\frac{1}{k}AA^{\top}\right)\right)\right]\leq\frac{2}{k}\mathbb{E}\left[N_{(0,4\xi)}(A)\right]\leq\frac{8}{\pi}\sqrt{\xi}+2\xi+\frac{2c_{5}}{k^{3}},$$
and hence for any $\tau\geq\frac{1}{\sqrt{k}}$,
$$\frac{1}{k}N_{(0,\xi)}(A)\leq\frac{8}{\pi}\sqrt{\xi}+2\xi+\frac{2c_{5}}{k^{3}}+\sqrt{\frac{\tau}{\tilde{c}k\xi}}$$
with probability at least $1-4\exp(-\tau k)$. By setting $\xi=\sqrt{\tau/k}$, one can derive that with probability exceeding $1-4\exp(-\tau k)$,
$$\frac{1}{k}N_{(0,\xi)}(A)\leq\frac{c_{11}\tau^{1/4}}{k^{1/4}}$$
for any $\frac{1}{\sqrt{k}}\leq\tau<k$, where $c_{11}>0$ is some universal constant.

The above estimates on the eigenvalue concentration allow us to derive
$$\frac{1}{k}\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\geq\frac{1}{k}\log{\det}^{\xi}\left(\frac{1}{k}AA^{\top}\right)-\frac{N_{(0,\xi)}(A)}{k}\log\frac{k}{\epsilon}$$
$$\geq\frac{1}{k}\log{\det}^{\xi}\left(\frac{1}{k}AA^{\top}\right)-\frac{c_{12}\tau^{1/4}}{k^{1/4}}\left(\log\frac{1}{\epsilon}+\log k\right) \qquad (65)$$
with exponentially high probability, where $c_{12}>0$ is another constant. Here, the function ${\det}^{\xi}(\cdot)$ is defined in (68). Now it suffices to obtain a lower estimate on $\frac{1}{k}\log{\det}^{\xi}\left(\frac{1}{k}AA^{\top}\right)$. Putting the bounds (70) and (73) together indicates that there exists a constant $c_{13}>0$ such that
$$\frac{1}{k}\log{\det}^{\xi}\left(\frac{1}{k}AA^{\top}\right)\geq-1+\frac{\log(2\pi k)}{2k}-\frac{c_{13}}{k\xi}-\frac{c_{13}\sqrt{\tau}}{\sqrt{k\xi}}\geq-1+\frac{\log(2\pi k)}{2k}-\frac{c_{13}}{k^{1/2}\tau^{1/2}}-\frac{c_{13}\tau^{1/4}}{k^{1/4}}$$
with probability exceeding $1-4\exp(-\tau k)$. This combined with (65) yields that for any $\frac{1}{\sqrt{k}}\leq\tau<k$,
$$\frac{1}{k}\log\det\left(\epsilon I+\frac{1}{k}AA^{\top}\right)\geq-1+\frac{\log(2\pi k)}{2k}-\frac{c_{14}\tau^{1/4}}{k^{1/4}}\left(\log\frac{1}{\epsilon}+\log k\right)$$
with probability exceeding $1-4\exp(-\tau k)$, as claimed.
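The eigenvalue-counting step rests on the Marchenko–Pastur integral bound $\int_{0}^{\xi}\rho(x)\,\mathrm{d}x\leq\frac{2}{\pi}\sqrt{\xi}$, which a single large sample already exhibits. This sketch assumes numpy; $k$ and $\xi$ are illustrative.

```python
import numpy as np

# Empirical check of the counting bound in the proof of Lemma 3: the fraction
# of eigenvalues of (1/k) A A^T below xi is controlled by the Marchenko-Pastur
# integral, itself bounded by (2/pi) * sqrt(xi).
rng = np.random.default_rng(3)
k, xi = 400, 0.1
A = rng.standard_normal((k, k))
evals = np.linalg.eigvalsh(A @ A.T / k)
frac = float(np.mean(evals < xi))        # empirical N_(0,xi)(A) / k
mp_bound = 2 * np.sqrt(xi) / np.pi       # integral of 1/(pi*sqrt(x)) on (0, xi)
```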

H Proof of Lemma 5

We first develop the upper bound. The Cauchy–Binet formula indicates that
$$\mathbb{E}\left[\det\left(AA^{\top}\right)\right]=\sum_{s\in\binom{[n]}{m}}\mathbb{E}\left[\det\left(A_{s}A_{s}^{\top}\right)\right],$$
where $s$ ranges over all $m$-combinations of $\{1,\cdots,n\}$, and $A_{s}$ is the $m\times m$ minor of $A$ whose columns are the columns of $A$ at indices from $s$. It has been shown in the proof of Lemma 2 that for each jointly independent $m\times m$ ensemble $A_{s}$, the determinant satisfies
$$\mathbb{E}\left[\det\left(A_{s}A_{s}^{\top}\right)\right]=m!,$$
which immediately leads to
$$\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]=\frac{1}{n^{m}}\sum_{s\in\binom{[n]}{m}}\mathbb{E}\left[\det\left(A_{s}A_{s}^{\top}\right)\right]=\binom{n}{m}\frac{m!}{n^{m}}.$$
Besides, using the well-known Stirling-type inequality
$$\sqrt{2\pi}\,m^{m+\frac{1}{2}}e^{-m}\leq m!\leq e\,m^{m+\frac{1}{2}}e^{-m},$$
one can obtain
$$\log(m!)\leq\log\left(e\,m^{m+\frac{1}{2}}e^{-m}\right)=\left(m+\frac{1}{2}\right)\log m-m+1$$
and similarly
$$\log(m!)\geq\left(m+\frac{1}{2}\right)\log m-m+\frac{1}{2}\log(2\pi).$$
These further give rise to
$$\frac{1}{n}\log\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]\leq-\frac{m}{n}\log n+\frac{\left(m+\frac{1}{2}\right)\log m}{n}-\frac{m}{n}+\frac{1}{n}+H\left(\frac{m}{n}\right)$$
$$=-\frac{m}{n}\log n+\frac{m\log m}{n}-\frac{m}{n}+\frac{2+\log m}{2n}+H\left(\frac{m}{n}\right)=(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{\log\left(e^{2}m\right)}{2n} \qquad (66)$$
and, similarly,
$$\frac{1}{n}\log\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]\geq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{\log(2\pi m)}{2n}.$$
Define
$$f_{1,\epsilon}(x):=\begin{cases}\frac{2}{\sqrt{\epsilon}}\left(\sqrt{x}-\sqrt{\epsilon}\right)+\log\epsilon,&0<x<\epsilon,\\ \log x,&x\geq\epsilon,\end{cases} \qquad (67)$$
and
$${\det}^{\epsilon}(X):=\prod_{i=1}^{m}e^{f_{1,\epsilon}\left(\lambda_{i}(X)\right)}. \qquad (68)$$
One can easily justify that the Lipschitz constant of the function
$$g_{1,\epsilon}(x):=f_{1,\epsilon}\left(x^{2}\right)=\begin{cases}\frac{2}{\sqrt{\epsilon}}\left(x-\sqrt{\epsilon}\right)+\log\epsilon,&0<x<\sqrt{\epsilon},\\ 2\log x,&x\geq\sqrt{\epsilon},\end{cases}$$
is bounded above by $\frac{2}{\sqrt{\epsilon}}$, and that $g_{1,\epsilon}(x)$ is a concave function. Also,
$${\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)=\det\left(\frac{1}{n}AA^{\top}\right)$$
holds on the event $\left\{\lambda_{\min}\left(\frac{1}{n}AA^{\top}\right)\geq\epsilon\right\}$. Set
$$Z:=\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)-\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]. \qquad (69)$$
Applying [29, Corollary 1.8] then gives
$$\mathbb{P}\left(|Z|>\tau\right)\leq4\exp\left(-\frac{\tilde{c}\tau^{2}}{\alpha}\right) \qquad (70)$$
for some constant $\tilde{c}>0$ (with $\tilde{c}=\frac{\epsilon}{8}$ when $A_{ij}\sim\mathcal{N}(0,1)$). This allows us to derive
$$\mathbb{E}\left[e^{Z}\right]\leq\mathbb{E}\left[e^{|Z|}\right]=\left[-e^{z}\mathbb{P}\left(|Z|>z\right)\right]_{z=0}^{\infty}+\int_{0}^{\infty}e^{z}\mathbb{P}\left(|Z|>z\right)\mathrm{d}z\leq1+\int_{0}^{\infty}4\exp\left(z-\frac{\tilde{c}z^{2}}{\alpha}\right)\mathrm{d}z\leq1+4\sqrt{\frac{\pi\alpha}{\tilde{c}}}\exp\left(\frac{\alpha}{4\tilde{c}}\right), \qquad (71)$$
leading to
$$\log\mathbb{E}\left[{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\leq\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]+\log\left(1+4\sqrt{\frac{\pi\alpha}{\tilde{c}}}\exp\left(\frac{\alpha}{4\tilde{c}}\right)\right) \qquad (72)$$
and hence
$$\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\geq\frac{1}{n}\log\mathbb{E}\left[{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]-O\left(\frac{1}{n}\right)\geq\frac{1}{n}\log\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]-O\left(\frac{1}{n}\right)$$
$$=(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{\log(2\pi m)}{2n}-O\left(\frac{1}{n}\right). \qquad (73)$$
In the Gaussian case, this can be more explicitly expressed as: for any $\epsilon\leq\alpha$,
$$\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\geq\frac{1}{n}\log\mathbb{E}\left[{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]-\frac{1}{n}\log\left(1+8\sqrt{\frac{2\pi\alpha}{\epsilon}}\exp\left(\frac{2\alpha}{\epsilon}\right)\right)$$
$$\geq\frac{1}{n}\log\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]-\frac{1}{n}\log\left(\exp\left(\frac{6\alpha}{\epsilon}\right)\right)\geq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{\log(2\pi m)}{2n}-\frac{6\alpha}{\epsilon n}, \qquad (74)$$
where we have made use of the fact that $1+8\sqrt{\pi x}\exp(x)\leq\exp(3x)$ for any $x\geq2$.

On the other hand, in order to develop an upper bound on $\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]$, we set $\tau=\sqrt{\frac{\alpha\log n}{\tilde{c}}}$ in the inequality (70), which reveals that with probability at least $1-4n^{-1}$,
$$|Z|=\left|\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)-\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\right|\leq\sqrt{\frac{\alpha\log n}{\tilde{c}}}$$
or, equivalently,
$$e^{-\sqrt{\frac{\alpha\log n}{\tilde{c}}}}\cdot e^{\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]}\leq{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\leq e^{\sqrt{\frac{\alpha\log n}{\tilde{c}}}}\cdot e^{\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]}$$
with probability exceeding $1-4n^{-1}$. In addition, the tail distribution of $\sigma_{\min}(A)$ satisfies [45, Theorem 1.1]
$$\mathbb{P}\left(\sigma_{\min}\left(\frac{1}{\sqrt{n}}A\right)\leq\tilde{\epsilon}\left(1-\sqrt{\alpha}\right)\right)\leq\left(\tilde{C}\tilde{\epsilon}\right)^{(1-\alpha)n}+e^{-cn}$$
for some constants $\tilde{C},c>0$. For sufficiently small $\epsilon>0$, one can thus simply write
$$\mathbb{P}\left(\lambda_{\min}\left(\frac{1}{n}AA^{\top}\right)\leq\epsilon\right)\leq C_{m}\exp\left(-c_{m}n\right)$$
for some constants $c_{m},C_{m}>0$, implying that
$$\mathbb{P}\left({\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)=\det\left(\frac{1}{n}AA^{\top}\right)\right)\geq1-C_{m}\exp\left(-c_{m}n\right). \qquad (75)$$
Hence, the union bound suggests that
$${\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)=\det\left(\frac{1}{n}AA^{\top}\right)\qquad\text{and}\qquad{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)>e^{-\sqrt{\frac{\alpha\log n}{\tilde{c}}}}\,e^{\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]}$$
hold simultaneously with probability exceeding $1-\frac{\breve{c}}{n}$ for some constant $\breve{c}>0$. Consequently,
$$\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]\geq\left(1-\frac{\breve{c}}{n}\right)e^{-\sqrt{\frac{\alpha\log n}{\tilde{c}}}}\,e^{\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]},$$
indicating that for sufficiently large $n$,
$$\frac{1}{n}\log\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]\geq\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]+\frac{\log\left(1-\frac{\breve{c}}{n}\right)}{n}-\frac{1}{n}\sqrt{\frac{\alpha\log n}{\tilde{c}}}\geq\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]-O\left(\frac{\sqrt{\log n}}{n}\right).$$
Equivalently, we have, for sufficiently large $n$ and sufficiently small constant $\epsilon>0$,
$$\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\leq\frac{1}{n}\log\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]+O\left(\frac{\sqrt{\log n}}{n}\right)\leq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{\log\left(e^{2}m\right)}{2n}+O\left(\frac{\sqrt{\log n}}{n}\right), \qquad (76)$$
where the last inequality follows from (66).

Now we are ready to develop a high-probability bound on $\frac{1}{n}\log\det\left(\frac{1}{n}AA^{\top}\right)$. By picking $\epsilon>0$ to be a sufficiently small constant, one has
$$\mathbb{P}\left(\frac{1}{n}\log\det\left(\frac{1}{n}AA^{\top}\right)\geq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{1}{\sqrt{n}}\right)\leq\mathbb{P}\left(\frac{1}{n}\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\geq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{1}{\sqrt{n}}\right)$$
$$=\mathbb{P}\left(\frac{1}{n}Z\geq\frac{1}{\sqrt{n}}-O\left(\frac{\sqrt{\log n}}{n}\right)\right)\leq C_{7}\exp\left(-c_{7}n\right)$$
for some absolute constants $c_{7},C_{7}>0$. Similarly, for sufficiently small constant $\epsilon>0$,
$$\mathbb{P}\left(\frac{1}{n}\log\det\left(\frac{1}{n}AA^{\top}\right)\leq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha-\frac{1}{\sqrt{n}}\right)$$
$$\leq\mathbb{P}\left({\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\neq\det\left(\frac{1}{n}AA^{\top}\right)\right)+\mathbb{P}\left(\frac{1}{n}\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\leq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha-\frac{1}{\sqrt{n}}\right)$$
$$\leq C_{m}\exp\left(-c_{m}n\right)+\mathbb{P}\left(\frac{1}{n}Z\leq-\frac{1}{\sqrt{n}}+O\left(\frac{\log n}{n}\right)\right)\leq\tilde{C}_{7}\exp\left(-\tilde{c}_{7}n\right) \qquad (77)$$
for some absolute constants $\tilde{c}_{7},\tilde{C}_{7}>0$. Here, (77) is a consequence of (75) and (73). This establishes Part (1) of the lemma.
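The Cauchy–Binet computation $\mathbb{E}\left[\det\left(AA^{\top}\right)\right]=\binom{n}{m}m!$ can be probed by Monte Carlo on a small rectangular ensemble. This sketch assumes numpy; $m=2$, $n=4$ are illustrative.

```python
import numpy as np
from math import comb, factorial

# Monte Carlo check of the Cauchy-Binet step in the proof of Lemma 5:
# E[det(A A^T)] = C(n, m) * m! for an m x n matrix with i.i.d. zero-mean
# unit-variance entries, hence E[det(A A^T / n)] = C(n, m) * m! / n^m.
rng = np.random.default_rng(4)
m, n, trials = 2, 4, 200_000
A = rng.standard_normal((trials, m, n))
dets = np.linalg.det(A @ np.transpose(A, (0, 2, 1)))
emp = float(dets.mean())               # should be close to C(4, 2) * 2! = 12
theory = comb(n, m) * factorial(m)
```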

I Proof of Lemma 6

(1) In order to obtain a lower bound on $\log\det\left(\frac{1}{n}AA^{\top}\right)$, we first attempt to estimate the number of eigenvalues of $\frac{1}{n}AA^{\top}$ that are smaller than some value $\epsilon>0$, i.e. $\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{[0,\epsilon]}\left(\lambda_{i}\left(\frac{1}{n}AA^{\top}\right)\right)$. Since the indicator function $\mathbb{1}_{[0,\epsilon]}(\cdot)$ entails discontinuity points, we define an upper bound on $\mathbb{1}_{[0,\epsilon]}(\cdot)$ as
$$f_{2,\epsilon}(x):=\begin{cases}1,&\text{if }0\leq x\leq\epsilon;\\ -x/\epsilon+2,&\text{if }\epsilon<x\leq2\epsilon;\\ 0,&\text{else}.\end{cases}$$
For any $\epsilon>0$, one can easily verify that
$$f_{2,\epsilon}(x)\leq\frac{\epsilon}{x},\qquad\forall x\geq0$$
(this follows from $\frac{\epsilon}{x}\geq1$ for $0\leq x\leq\epsilon$ as well as the bound $\frac{\epsilon}{x}+\frac{x}{\epsilon}-2\geq2\sqrt{\frac{\epsilon}{x}\cdot\frac{x}{\epsilon}}-2=0$), and hence
$$\sum_{i=1}^{n}f_{2,\epsilon}\left(\lambda_{i}\left(\frac{1}{n}AA^{\top}\right)\right)\leq\sum_{i=1}^{m}\frac{\epsilon}{\lambda_{i}\left(\frac{1}{n}AA^{\top}\right)}=\epsilon\,\mathrm{tr}\left(\left(\frac{1}{n}AA^{\top}\right)^{-1}\right).$$
This further gives
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}f_{2,\epsilon}\left(\lambda_{i}\left(\frac{1}{n}AA^{\top}\right)\right)\right]\leq\frac{\epsilon}{n}\mathbb{E}\left[\mathrm{tr}\left(\left(\frac{1}{n}AA^{\top}\right)^{-1}\right)\right]=\frac{\epsilon m}{n-m-1}=\frac{\epsilon\alpha}{1-\alpha-\frac{1}{n}},$$
which follows from the property of inverse Wishart matrices (e.g. [46, Theorem 2.2.8]). Clearly, the Lipschitz constant of the function
$$g_{2,\epsilon}(x):=f_{2,\epsilon}\left(x^{2}\right)=\begin{cases}1,&\text{if }0\leq x\leq\sqrt{\epsilon};\\ -x^{2}/\epsilon+2,&\text{if }\sqrt{\epsilon}<x\leq\sqrt{2\epsilon};\\ 0,&\text{else}\end{cases}$$
is bounded above by $\sqrt{8/\epsilon}$. Applying [29, Corollary 1.8(b)] then yields that for any $\delta>0$,
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{[0,\epsilon]}\left(\lambda_{i}\left(\frac{1}{n}AA^{\top}\right)\right)>\frac{\epsilon\alpha}{1-\alpha-\frac{1}{n}}+\delta\right)\leq\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}f_{2,\epsilon}\left(\lambda_{i}\left(\frac{1}{n}AA^{\top}\right)\right)>\frac{\epsilon\alpha}{1-\alpha-\frac{1}{n}}+\delta\right)$$
$$\leq\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}f_{2,\epsilon}\left(\lambda_{i}\left(\frac{1}{n}AA^{\top}\right)\right)>\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}f_{2,\epsilon}\left(\lambda_{i}\left(\frac{1}{n}AA^{\top}\right)\right)\right]+\delta\right)\leq2\exp\left(-\frac{\epsilon\delta^{2}n^{2}}{16\alpha}\right).$$
In other words, for any $\tau>0$,
$$\frac{\mathrm{card}\left\{i\ \big|\ \lambda_{i}\left(\frac{1}{n}AA^{\top}\right)<\epsilon\right\}}{n}<\frac{\epsilon\alpha}{1-\alpha-\frac{1}{n}}+\frac{4\sqrt{\alpha\tau}}{\sqrt{\epsilon}}\cdot\frac{1}{\sqrt{n}} \qquad (78)$$
with probability exceeding $1-2\exp(-\tau n)$, as claimed. By setting $\epsilon=n^{-1/3}$, one has
$$\frac{\mathrm{card}\left\{i\ \big|\ \lambda_{i}\left(\frac{1}{n}AA^{\top}\right)<n^{-1/3}\right\}}{n}<\left(\frac{\alpha}{1-\alpha-\frac{1}{n}}+4\sqrt{\alpha\tau}\right)\frac{1}{n^{1/3}} \qquad (79)$$
with probability at least $1-2\exp(-\tau n)$.

(2) Lemma 8 asserts that if $\sqrt{n}>\frac{2}{1-\sqrt{\alpha}}$, then
$$\lambda_{\min}\left(\frac{1}{n}AA^{\top}\right)\geq\left(1-\sqrt{\alpha}-\sqrt{\frac{2}{n}}\right)^{2}$$
with probability at least $1-e^{-1}$. Conditional on this event, we have
$${\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)=\det\left(\frac{1}{n}AA^{\top}\right)$$
for the numerical value $\epsilon=\left(1-\sqrt{\alpha}-\sqrt{\frac{2}{n}}\right)^{2}$. Additionally, the bound (70) under Gaussian ensembles can be explicitly written as
$$\mathbb{P}\left(|Z|>\tau\right)\leq2\exp\left(-\frac{\epsilon\tau^{2}}{8\alpha}\right), \qquad (80)$$
indicating that
$$\mathbb{P}\left(\left|\frac{1}{n}\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)-\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\right|>\sqrt{\frac{4\alpha}{\epsilon n}}\right)<2\exp\left(-\frac{n}{2}\right) \qquad (81)$$
and
$$\mathbb{P}\left(\left|\frac{1}{n}\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)-\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\right|>\frac{4\tau}{n}\sqrt{\frac{\alpha}{\epsilon}}\right)<2\exp\left(-2\tau^{2}\right). \qquad (82)$$
Putting these together yields that for any $n>\max\left\{\left(\frac{2}{1-\sqrt{\alpha}}\right)^{2},6\right\}$,
$${\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)=\det\left(\frac{1}{n}AA^{\top}\right)\qquad\text{and}\qquad{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)>e^{-\sqrt{\frac{4\alpha n}{\epsilon}}}\,e^{\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]}$$
simultaneously hold with probability exceeding $1-2\exp\left(-\frac{n}{2}\right)-e^{-1}>\frac{1}{n}$. Since ${\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)$ is non-negative, this gives
$$\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]\geq\frac{1}{n}\cdot e^{-\sqrt{\frac{4\alpha n}{\epsilon}}}\,e^{\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]},$$
and thereby for any $n>\max\left\{\left(\frac{2}{1-\sqrt{\alpha}}\right)^{2},7\right\}$,
$$\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\leq\frac{1}{n}\log\mathbb{E}\left[\det\left(\frac{1}{n}AA^{\top}\right)\right]+\sqrt{\frac{4\alpha}{\epsilon n}}+1.5\,\frac{\log n}{n}$$
$$\leq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{\log\left(e^{2}m\right)+2\log n}{2n}+\frac{2\sqrt{\alpha}}{1-\sqrt{\alpha}-\sqrt{\frac{2}{n}}}\cdot\frac{1}{\sqrt{n}}$$
$$<(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{2\log n}{n}+\frac{2\sqrt{\alpha}}{1-\sqrt{\alpha}-\sqrt{\frac{2}{n}}}\cdot\frac{1}{\sqrt{n}},$$
where the second inequality follows from (66) and the value $\epsilon=\left(1-\sqrt{\alpha}-\sqrt{\frac{2}{n}}\right)^{2}$. Putting this and (82) together gives that for any $n\geq4$ and $\tau\geq1$,
$$\frac{1}{n}\log\det\left(\frac{1}{n}AA^{\top}\right)\leq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{2\log n}{n}+\frac{2\sqrt{\alpha}}{1-\sqrt{\alpha}-\sqrt{\frac{2}{n}}}\cdot\frac{1}{\sqrt{n}}+\frac{4\tau}{n}\cdot\frac{\sqrt{\alpha}}{1-\sqrt{\alpha}-\sqrt{\frac{2}{n}}}$$
$$\leq(1-\alpha)\log\frac{1}{1-\alpha}-\alpha+\frac{2\log n}{n}+\frac{5\sqrt{\alpha}\,\tau}{1-\sqrt{\alpha}-\sqrt{\frac{2}{n}}}\cdot\frac{1}{\sqrt{n}}$$
with probability exceeding $1-2\exp\left(-2\tau^{2}\right)$.

(3) On the other hand, the inequality (79) indicates that for any $n$ obeying $\alpha n^{1/3}>\frac{\alpha}{1-\alpha-\frac{1}{n}}+4\sqrt{\alpha\tau}$,
$$\mathbb{P}\left(\lambda_{\max}\left(\frac{1}{n}AA^{\top}\right)\leq\frac{1}{n^{1/3}}\right)\leq\mathbb{P}\left(\frac{\mathrm{card}\left\{i\ \big|\ \lambda_{i}\left(\frac{1}{n}AA^{\top}\right)<n^{-1/3}\right\}}{n}\geq\left(\frac{\alpha}{1-\alpha-\frac{1}{n}}+4\sqrt{\alpha\tau}\right)\frac{1}{n^{1/3}}\right)\leq2e^{-\tau n},$$
since on the event $\left\{\lambda_{\max}\left(\frac{1}{n}AA^{\top}\right)\leq n^{-1/3}\right\}$ all $m=\alpha n$ eigenvalues fall below $n^{-1/3}$. Also, it follows from [47, Theorem 4.5] that for any $n\geq\frac{6.414}{1-\alpha}\cdot e^{\frac{\tau}{1-\alpha}}$,
$$\mathbb{P}\left(\frac{\lambda_{\max}\left(AA^{\top}\right)}{\lambda_{\min}\left(AA^{\top}\right)}>n^{2}\right)\leq\frac{1}{\sqrt{2\pi}}\left(\frac{6.414}{(1-\alpha)n}\right)^{(1-\alpha)n}\leq\frac{1}{\sqrt{2\pi}}e^{-\tau n}.$$
The above two bounds taken collectively imply that for any $n>\max\left\{\left(\frac{1}{1-\alpha-\frac{1}{n}}+4\sqrt{\frac{\tau}{\alpha}}\right)^{3},\ \frac{6.414}{1-\alpha}e^{\frac{\tau}{1-\alpha}}\right\}$,
$$\mathbb{P}\left(\lambda_{\min}\left(\frac{1}{n}AA^{\top}\right)\leq\frac{1}{n^{7/3}}\right)\leq\mathbb{P}\left(\lambda_{\max}\left(\frac{1}{n}AA^{\top}\right)\leq\frac{1}{n^{1/3}}\right)+\mathbb{P}\left(\lambda_{\max}\left(\frac{1}{n}AA^{\top}\right)>\frac{1}{n^{1/3}}\ \text{and}\ \lambda_{\min}\left(\frac{1}{n}AA^{\top}\right)\leq\frac{1}{n^{7/3}}\right)$$
$$\leq\mathbb{P}\left(\lambda_{\max}\left(\frac{1}{n}AA^{\top}\right)\leq\frac{1}{n^{1/3}}\right)+\mathbb{P}\left(\frac{\lambda_{\max}\left(AA^{\top}\right)}{\lambda_{\min}\left(AA^{\top}\right)}>n^{2}\right)\leq2e^{-\tau n}+\frac{1}{\sqrt{2\pi}}e^{-\tau n}<3e^{-\tau n}$$
for any $n\geq2$. Consequently, when $\epsilon=n^{-1/3}$, each eigenvalue below $\epsilon$ contributes at most $\log\left(\epsilon n^{7/3}\right)=\log n^{2}$ to the gap $\log{\det}^{\epsilon}-\log\det$ on the above event, whence
$$\frac{1}{n}\log\det\left(\frac{1}{n}AA^{\top}\right)\geq\frac{1}{n}\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)-\frac{\mathrm{card}\left\{i\ \big|\ \lambda_{i}\left(\frac{1}{n}AA^{\top}\right)<n^{-1/3}\right\}}{n}\cdot\log n^{2}$$
$$\geq\frac{1}{n}\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)-\left(\frac{2\alpha}{1-\alpha-\frac{1}{n}}+8\sqrt{\alpha\tau}\right)\frac{\log n}{n^{1/3}}$$
with probability exceeding $1-3\exp(-\tau n)$. Making use of (80) yields that for $\epsilon=n^{-1/3}$,
$$\mathbb{P}\left(\left|\frac{1}{n}\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)-\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]\right|>\frac{\sqrt{8\alpha\tau}}{n^{1/3}}\right)<2\exp\left(-\tau n\right).$$
Putting the above two bounds together implies that with probability exceeding $1-5\exp(-\tau n)$,
$$\frac{1}{n}\log\det\left(\frac{1}{n}AA^{\top}\right)\geq\frac{1}{n}\mathbb{E}\left[\log{\det}^{\epsilon}\left(\frac{1}{n}AA^{\top}\right)\right]-\frac{\sqrt{8\alpha\tau}}{n^{1/3}}-\left(\frac{2\alpha}{1-\alpha-\frac{1}{n}}+8\sqrt{\alpha\tau}\right)\frac{\log n}{n^{1/3}}$$
$$>(1-\alpha)\log\frac{1}{1-\alpha}-\alpha-\frac{6\alpha}{n^{2/3}}-\left(\frac{2\alpha}{1-\alpha-\frac{1}{n}}+11\sqrt{\alpha\tau}\right)\frac{\log n}{n^{1/3}},$$
where the last line makes use of (74) with $\epsilon=n^{-1/3}$ (so that $\frac{6\alpha}{\epsilon n}=\frac{6\alpha}{n^{2/3}}$) and $\sqrt{8\alpha\tau}+8\sqrt{\alpha\tau}\leq11\sqrt{\alpha\tau}$.
1 AA> n


J Proof of Lemma 7

Write the singular value decomposition of $A$ as $A=U_{A}\begin{bmatrix}\Sigma_{A}\\ 0\end{bmatrix}V_{A}^{\top}$, where $\Sigma_{A}$ is a diagonal matrix containing all $k$ singular values of $A$. One can then write
$$\log\det\left(I_k+A^{\top}B^{-1}A\right)=\log\det\left(I_k+V_{A}\left[\Sigma_{A}\ \ 0\right]U_{A}^{\top}B^{-1}U_{A}\begin{bmatrix}\Sigma_{A}\\ 0\end{bmatrix}V_{A}^{\top}\right)=\log\det\left(I_k+\Sigma_{A}\left[\tilde{B}^{-1}\right]_{[k]}\Sigma_{A}\right)$$
$$\geq\log\det\left(\frac{1}{n}\Sigma_{A}^{2}\right)-\log\det\left(\frac{1}{n}\left[\tilde{B}^{-1}\right]_{[k]}^{-1}\right), \qquad (83)$$
where $\tilde{B}=U_{A}^{\top}BU_{A}\sim\mathcal{W}_{m}\left(n-k,U_{A}^{\top}U_{A}\right)=\mathcal{W}_{m}\left(n-k,I_{m}\right)$ from the rotational invariance of the Wishart distribution. Here, $\left[\tilde{B}^{-1}\right]_{[k]}$ denotes the leading $k\times k$ minor consisting of the matrix elements of $\tilde{B}^{-1}$ in rows and columns from 1 to $k$, which is independent of $\Sigma_{A}$ by Gaussianity.

Note that $\frac{1}{n}\log\det\left(\frac{1}{n}\Sigma_{A}^{2}\right)=\frac{1}{n}\log\det\left(\frac{1}{n}A^{\top}A\right)$. Then Lemma 6 implies that for any $\tau>\frac{1}{n^{1/3}}$,
$$\frac{1}{n}\log\det\left(\frac{1}{n}A^{\top}A\right)=\frac{m}{n}\cdot\frac{1}{m}\log\det\left(\frac{1}{m}A^{\top}A\right)+\frac{1}{n}\log\det\left(\frac{m}{n}I_{k}\right)$$
$$\geq\alpha\left\{-\left(1-\frac{\beta}{\alpha}\right)\log\left(1-\frac{\beta}{\alpha}\right)-\frac{\beta}{\alpha}\right\}+\beta\log\alpha-\frac{c_{8}\tau\log n}{n^{1/3}}$$
$$=-(\alpha-\beta)\log\frac{\alpha-\beta}{\alpha}-\beta+\beta\log\alpha-\frac{c_{8}\tau\log n}{n^{1/3}}=-(\alpha-\beta)\log(\alpha-\beta)-\beta+\alpha\log\alpha-\frac{c_{8}\tau\log n}{n^{1/3}} \qquad (84)$$
with probability exceeding $1-C_{8}\exp\left(-\tau^{2}n\right)$ for some absolute constants $C_{8},c_{8}>0$. On the other hand, it is well known (e.g. [46, Theorem 2.3.3]) that for a Wishart matrix $\tilde{B}\sim\mathcal{W}_{m}\left(n-k,I_{m}\right)$, the inverse of the leading minor of $\tilde{B}^{-1}$ also follows a Wishart distribution, that is, $\left[\tilde{B}^{-1}\right]_{[k]}^{-1}\sim\mathcal{W}_{k}\left(n-m,I_{k}\right)$. Applying Lemma 6 again yields that
$$\frac{1}{n}\log\det\left(\frac{1}{n}\left[\tilde{B}^{-1}\right]_{[k]}^{-1}\right)=\frac{n-m}{n}\cdot\frac{1}{n-m}\log\det\left(\frac{1}{n-m}\left[\tilde{B}^{-1}\right]_{[k]}^{-1}\right)+\frac{1}{n}\log\det\left(\frac{n-m}{n}I_{k}\right)$$
$$\leq(1-\alpha)\left\{-\left(1-\frac{\beta}{1-\alpha}\right)\log\left(1-\frac{\beta}{1-\alpha}\right)-\frac{\beta}{1-\alpha}\right\}+\beta\log(1-\alpha)+\frac{c_{9}\tau}{\sqrt{n}}$$
$$=-(1-\alpha-\beta)\log\left(1-\frac{\beta}{1-\alpha}\right)-\beta+\beta\log(1-\alpha)+\frac{c_{9}\tau}{\sqrt{n}}$$
holds with probability exceeding $1-C_{9}\exp\left(-\tau^{2}n\right)$ for some universal constants $C_{9},c_{9}>0$. Combining (83), (84) and the bound above suggests that for any $\tau>\frac{1}{n^{1/3}}$,
$$\frac{1}{n}\log\det\left(I_k+A^{\top}B^{-1}A\right)\geq-(\alpha-\beta)\log(\alpha-\beta)+\alpha\log\alpha+(1-\alpha-\beta)\log\left(1-\frac{\beta}{1-\alpha}\right)-\beta\log(1-\alpha)-\frac{c_{10}\tau\log n}{n^{1/3}} \qquad (85)$$
with probability exceeding $1-C_{10}\exp\left(-\tau^{2}n\right)$ for some constants $C_{10},c_{10}>0$.
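The distributional fact quoted from [46, Theorem 2.3.3] — that the inverse of the leading $k\times k$ minor of $\tilde{B}^{-1}$ is again Wishart, with $n-m$ degrees of freedom — can be checked through its mean, $\mathbb{E}\left[\mathcal{W}_{k}(n-m,I_{k})\right]=(n-m)I_{k}$. This sketch assumes numpy; the sizes are illustrative.

```python
import numpy as np

# Monte Carlo check: if B ~ W_m(n - k, I_m), then the inverse of the leading
# k x k minor of B^{-1} is W_k(n - m, I_k)-distributed (equivalently, it is
# the Schur complement B_11 - B_12 B_22^{-1} B_21); its mean is (n - m) I_k.
rng = np.random.default_rng(6)
m, k, n, trials = 5, 2, 30, 4000
acc = np.zeros((k, k))
for t in range(trials):
    G = rng.standard_normal((m, n - k))
    B = G @ G.T                              # a W_m(n - k, I_m) sample
    acc += np.linalg.inv(np.linalg.inv(B)[:k, :k])
emp_mean = acc / trials                      # should be close to (n - m) I_k
```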

References

[1] Y. Chen, Y. C. Eldar, and A. J. Goldsmith, "Channel capacity under sub-Nyquist nonuniform sampling," IEEE Transactions on Information Theory, vol. 60, no. 8, August 2014.
[2] P. L. Butzer and R. L. Stens, "Sampling theory for not necessarily band-limited functions: A historical overview," SIAM Review, vol. 34, no. 1, pp. 40–53, 1992.
[3] M. Mishali and Y. C. Eldar, "Sub-Nyquist sampling: Bridging theory and practice," IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 98–124, Nov. 2011.
[4] H. Landau, "Necessary density conditions for sampling and interpolation of certain entire functions," Acta Mathematica, vol. 117, pp. 37–52, 1967.
[5] C. Herley and P. W. Wong, "Minimum rate sampling and reconstruction of signals with arbitrary frequency support," IEEE Transactions on Information Theory, vol. 45, no. 5, pp. 1555–1564, Jul. 1999.
[6] R. Venkataramani and Y. Bresler, "Optimal sub-Nyquist nonuniform sampling and reconstruction for multiband signals," IEEE Transactions on Signal Processing, vol. 49, no. 10, pp. 2301–2313, Oct. 2001.
[7] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[8] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
[9] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
[10] M. Mishali and Y. C. Eldar, "From theory to practice: Sub-Nyquist sampling of sparse wideband analog signals," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 2, pp. 375–391, Apr. 2010.
[11] T. Blu, P.-L. Dragotti, M. Vetterli, P. Marziliano, and L. Coulot, "Sparse sampling of signal innovations," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 31–40, Mar. 2008.
[12] K. Gedalyahu, R. Tur, and Y. Eldar, "Multichannel sampling of pulse streams at the rate of innovation," IEEE Transactions on Signal Processing, vol. 59, no. 4, pp. 1491–1504, April 2011.
[13] M. Medard, "The effect upon channel capacity in wireless communications of perfect and imperfect knowledge of the channel," IEEE Transactions on Information Theory, vol. 46, no. 3, pp. 933–946, May 2000.
[14] M. Medard and R. G. Gallager, "Bandwidth scaling for fading multipath channels," IEEE Transactions on Information Theory, vol. 48, no. 4, pp. 840–852, Apr. 2002.
[15] P. Bello, "Characterization of randomly time-variant linear channels," IEEE Transactions on Communications Systems, vol. 11, no. 4, pp. 360–393, Dec. 1963.
[16] E. Biglieri, J. Proakis, and S. Shamai, "Fading channels: information-theoretic and communications aspects," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2619–2692, Oct. 1998.
[17] G. D. Forney and G. Ungerboeck, "Modulation and coding for linear Gaussian channels," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2384–2415, Oct. 1998.
[18] S. Shamai, "Information rates by oversampling the sign of a bandlimited process," IEEE Transactions on Information Theory, vol. 40, no. 4, pp. 1230–1236, Jul. 1994.
[19] T. Koch and A. Lapidoth, "Increased capacity per unit-cost by oversampling," Sep. 2010. [Online]. Available: http://arxiv.org/abs/1008.5393
[20] M. Peleg and S. Shamai, "On sparse sensing of coded signals at sub-Landau sampling rates," IEEE 27th Convention of Electrical and Electronics Engineers in Israel (IEEEI), pp. 1–5, Nov. 2012.
[21] Y. Chen, Y. C. Eldar, and A. J. Goldsmith, "Shannon meets Nyquist: Capacity of sampled Gaussian channels," IEEE Transactions on Information Theory, vol. 59, no. 8, August 2013.
[22] A. Papoulis, "Generalized sampling expansion," IEEE Transactions on Circuits and Systems, vol. 24, no. 11, pp. 652–654, Nov. 1977.
[23] A. El Gamal and Y. H. Kim, Network Information Theory. Cambridge University Press, 2011.
[24] D. Blackwell, L. Breiman, and A. J. Thomasian, "The capacity of a class of channels," The Annals of Mathematical Statistics, vol. 30, no. 4, pp. 1229–1241, 1959.
[25] M. Effros, A. Goldsmith, and Y. Liang, "Generalizing capacity: New definitions and capacity theorems for composite channels," IEEE Transactions on Information Theory, vol. 56, no. 7, pp. 3069–3087, July 2010.
[26] D. Donoho, A. Javanmard, and A. Montanari, "Information-theoretically optimal compressed sensing via spatial coupling and approximate message passing," Dec. 2011. [Online]. Available: http://arxiv.org/abs/1112.0708
[27] Y. Wu and S. Verdu, "Optimal phase transitions in compressed sensing," IEEE Transactions on Information Theory, vol. 58, no. 10, pp. 6241–6263, Oct. 2012.
[28] R. Vershynin, "Introduction to the non-asymptotic analysis of random matrices," Compressed Sensing, Theory and Applications, pp. 210–268, 2012.
[29] A. Guionnet and O. Zeitouni, "Concentration of the spectral measure for large matrices," Electronic Communications in Probability, vol. 5, pp. 119–136, 2000.
[30] A. Ghosh, J. Zhang, R. Muhamed, and J. G. Andrews, Fundamentals of LTE. Pearson Education, 2010.
[31] "LTE in a nutshell: White paper on physical layer," 2010.
[32] E. L. Lehmann and G. Casella, Theory of Point Estimation. New York: Springer, 1998.
[33] M. Raginsky and I. Sason, "Concentration of measure inequalities in information theory, communications and coding," arXiv preprint arXiv:1212.4663, 2012.
[34] P. Massart and J. Picard, Concentration Inequalities and Model Selection, ser. Lecture Notes in Mathematics. Springer, 2007.
[35] T. Tao and V. Vu, "Random matrices: Universality of local spectral statistics of non-Hermitian matrices," June 2012. [Online]. Available: arxiv.org/pdf/1206.1893.pdf
[36] ——, "Random covariance matrices: Universality of local statistics of eigenvalues," The Annals of Probability, vol. 40, no. 3, pp. 1285–1315, 2012.
[37] E. Candes and T. Tao, "Decoding by linear programming," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[38] T. Tao, Topics in Random Matrix Theory, ser. Graduate Studies in Mathematics. Providence, Rhode Island: American Mathematical Society, 2012.
[39] M. Mishali and Y. Eldar, "Blind multiband signal reconstruction: Compressed sensing for analog signals," IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 993–1009, March 2009.
[40] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[41] M. Aoki, New Approaches to Macroeconomic Modeling. Cambridge University Press, 1998.
[42] A. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann, "Smallest singular value of random matrices and geometry of random polytopes," Advances in Mathematics, vol. 195, no. 2, pp. 491–523, 2005.
[43] Y. Chen, A. J. Goldsmith, and Y. C. Eldar, "Backing off from infinity: Tight performance bounds for large random vector channels," arXiv preprint arXiv:1312.2574, 2014.
[44] K. Wang, "Random covariance matrices: Universality of local statistics of eigenvalues up to the edge," Random Matrices: Theory and Applications, vol. 1, no. 1, p. 1150005, 2012.
[45] M. Rudelson and R. Vershynin, "Smallest singular value of a random rectangular matrix," Communications on Pure and Applied Mathematics, vol. 62, no. 12, pp. 1707–1739, 2009.
[46] Y. Fujikoshi, V. V. Ulyanov, and R. Shimizu, Multivariate Statistics: High-Dimensional and Large-Sample Approximations, ser. Wiley Series in Probability and Statistics. Hoboken, New Jersey: Wiley, 2010.
[47] Z. Chen and J. J. Dongarra, "Condition numbers of Gaussian random matrices," SIAM Journal on Matrix Analysis and Applications, vol. 27, no. 3, pp. 603–620, July 2005.