Know Your Customer: Multi-armed Bandits with Capacity Constraints

arXiv:1603.04549v1 [cs.LG] 15 Mar 2016
Ramesh Johari∗
Vijay Kamble†
Yash Kanoria‡
Abstract

A wide range of resource allocation and platform operation settings exhibit the following two simultaneous challenges: (1) service resources are capacity constrained; and (2) clients’ preferences are not perfectly known. To study this pair of challenges, we consider a service system with heterogeneous servers and clients. Server types are known, and there is a fixed capacity of servers of each type. Clients arrive over time, with types initially unknown and drawn from some distribution. Each client sequentially brings N jobs before leaving. The system operator assigns each job to some server type, resulting in a payoff whose distribution depends on the client and server types. Our main contribution is a complete characterization of the structure of the optimal policy for maximizing the rate of payoff accumulation. Such a policy must balance three goals: (i) earning immediate payoffs; (ii) learning client types to increase future payoffs; and (iii) satisfying the capacity constraints. We construct a policy that has provably optimal regret (to leading order as N grows large). Our policy has an appealingly simple three-phase structure: a short type-“guessing” phase, a type-“confirmation” phase that balances payoffs with learning, and finally an “exploitation” phase that focuses on payoffs. Crucially, our approach employs the shadow prices of the capacity constraints in the assignment problem with known types as “externality prices” on the servers’ capacity.
1 Introduction
Our paper considers the interaction between two central operational challenges for almost any platform: on the one hand, new users are continuously arriving, and the platform’s goal is to learn the preferences of these new users quickly enough to ensure satisfactory outcomes for them. On the other hand, learning preferences typically requires actually serving customers and observing the outcomes. This is a complex undertaking because there may be limited inventory available with which to serve customers. Thus the platform faces a fundamental
∗Stanford University ([email protected])
†Stanford University ([email protected])
‡Columbia Business School ([email protected])
exploration-exploitation tradeoff. It can either “exploit” existing knowledge, serving users with the goal of maximizing surplus given current information; or it can “explore” to gain additional knowledge, serving users with the explicit goal of learning preference information. In this paper we address the correct balance between these activities in the presence of inventory constraints.

There are many examples where learning and inventory constraints interact; as one motivating example, consider the case of online fashion retailers such as Stitch Fix and Trunk Club. When a new client arrives, she has preferences regarding desired clothing, some of which are declared as part of a profile, but many of which are qualitative. For simplicity, imagine that there are two client types: some want professional clothing (“business type”), and others want casual clothing (“casual type”); but the platform does not know which is which when the client first arrives. In addition, suppose that the platform has two types of shirts available: high-end blue shirts that are equally at home in either a business or casual setting, and red shirts that are decidedly casual. Finally, suppose that the blue shirts are preferred by both types of users to the red shirts, and the platform knows this.

What does the platform do? If there are no inventory constraints the decision is simple: send any new incoming customer blue shirts, without attempting to learn her type. But now suppose inventory is constrained; in particular, suppose blue shirts are in short supply. This means some users will not be able to receive their preferred choice: clearly, in this case the platform is better off sacrificing some of the surplus of the casual type clients, and sending them red shirts instead. But in order to do this, the platform needs to know the client is of the casual type!
This can only be learned by sending red shirts to users; in the process, regret is incurred on any customers of the business type, as they are likely to return clothes they do not want.

The stylized problem described here is an instance of a general phenomenon that arises in a wide array of service systems. To varying degrees, the same challenge can be found not only in other online retail settings, but also in online labor markets (such as Upwork), marketplaces for goods (such as eBay or Etsy), and markets for local tasks (such as TaskRabbit). Working with a model that captures the essential features of the resource allocation problem described above, our paper both characterizes the aggregate payoff the platform can achieve, and delivers an algorithm that achieves it.

The model we consider has two “sides”: clients and servers (e.g., the different types of inventory above). Clients arrive and depart over time, and sequentially bring jobs during their lifetime. For each job, the platform can select an available server to match the job; once the match is made, a (random) payoff is generated and observed by the platform, and the server (temporarily) becomes unavailable. (In the retail example above, this might be the time required to restock the item.) The platform learns about the client’s preferences based on the observed payoffs; the overall goal is to maximize aggregate surplus from matches.

We make three assumptions to simplify our algorithmic development. First, servers are typically known better by the platform than clients, since servers have many more observable transactions from which to infer type; this is the case for all the examples above. Therefore we assume server types are known, while client types must be learned. Second, we assume that
the platform centrally controls matching; and third, we assume that there are no strategic considerations. These modeling choices allow us to focus on the algorithmic challenge of learning and matching while meeting capacity constraints.

Our main contribution is a simple and practical scheme to maximize aggregate surplus when client types must be learned in the presence of capacity constraints. A key to our approach is to leverage shadow prices on the capacity constraints in the optimization problem with known client types; these prices are quite naturally “externality prices” on servers. Thus we use the following design approach: for each server, we lower rewards according to the externality prices, and execute a learning algorithm individually on a client-by-client basis, ignoring the capacity constraint. Note that the resulting problem at the individual client level becomes a multi-armed bandit (MAB) problem [1].

Nominally, this approach should be effective in managing capacity, since the externality prices exactly account for the capacity constraint in the problem with known types. However, it quickly becomes apparent that externality pricing alone is insufficient as a solution. Informally, the issue is that there can be significant capacity violations unless matching is carefully directed: indifferences between servers occur generically even when client types are known, and a capacity-agnostic MAB algorithm would not know how to resolve them appropriately. For this reason, despite the correct externality prices, the allocation of clients to servers can still be very different from the optimal (and in particular feasible) allocation when client types are known.
Our scheme addresses this challenge by separating the resource allocation problem into three phases: “guessing” the client type, “confirming” the client type (balancing learning and surplus maximization), and “exploitation.” Most of our algorithmic contribution lies in careful design of the confirmation and exploitation phases. For the former, we optimally trade off learning and payoff maximization, but largely ignore capacity constraints (except for the effect of the externality prices). In the exploitation phase, we carefully design the allocation algorithm to minimize capacity violations. In particular, our design ensures that exploitation accounts for 1 − o(1) of all jobs (as the number of jobs grows), so that any capacity violations remain o(1), and we show how to eliminate these costlessly by slightly modifying the tie-breaking rules in the exploitation phase.

The scheme we develop is also particularly appealing from a practical standpoint, because it suggests two very natural changes platforms could make painlessly to dramatically improve their inventory management. First, compute externality prices for each type of server. While this might seem computationally complex, in platforms such as those described above, reasonable proxies abound; for example, Stitch Fix might employ a monotone function of the remaining inventory as a proxy for the externality price on a given item. Second, separating the guessing and confirmation phases from the exploitation phase is natural in an online platform: most platforms have an “onboarding” period for each user, during which exploration of the user’s type is reasonable; our algorithmic design suggests a viable approach to balancing surplus maximization, capacity constraints, and learning during the onboarding period.

As platforms continue to scale, they will increasingly face the challenge of integrating
new users while managing their limited inventory. If they allocate indiscriminately while ignoring capacity constraints, the constraints will bind and the platform will be stocked out, with large opportunity costs in lost transactions; on the other hand, without learning quickly enough about new clients, the platform may leave significant revenue on the table. Since all online market platforms focus on growth through acquisition of new users, properly meeting the challenge we study here is fundamental to their success. Our work is a significant step towards providing practical algorithmic approaches to this challenge with provably optimal performance. More broadly, as we discuss further in the conclusions, we believe our work can form the foundation for understanding a range of open issues at the intersection of learning and matching.

The remainder of the paper is organized as follows. After discussing related work in Section 2, we present our model in Section 3. In Section 4, we outline the optimization problem of interest to the platform (in a continuum limit where the mass of clients and servers grows large). In Section 5 we outline the straightforward idea of employing externality prices. In Section 6, we present our algorithm and main theorem regarding its performance. In particular, we show that our algorithm attains optimal regret in the limit where each client brings many jobs, and sketch its proof. We conclude in Section 7.
2 Related literature
Our problem setting combines two challenges that have each been extensively studied on their own in the literature: (1) maximizing reward while learning; and (2) a dynamic assignment problem of clients to servers, when there is no type uncertainty. We now describe each of these lines of work in turn, as well as their relationship to our contribution.

Maximizing reward while learning is exemplified by the multi-armed bandit (MAB) problem [19, 26, 10]. The learning problem in our setting corresponds to a MAB problem that has been studied in [1] (see [2] for a generalized model). In the standard stochastic MAB model introduced by [30], there are a fixed number of arms that yield a random reward on each pull with an unknown distribution, and the problem is to define an adaptive policy for choosing the arms that minimizes regret (the difference between the policy’s expected average reward and the expected reward of the best arm). The optimal regret in this case was shown to be O((log N)/N), and several policies that achieve this bound are now known [11, 6]. In the particular problem studied in [1], the uncertainty in the rewards of the different arms is finitely parametrized, with different arms being optimal for different parameter values. This introduces correlation in the rewards of the arms, and depending on certain identifiability conditions, the optimal regret is either O(1/N) or O((log N)/N).

The dynamic assignment problem with known types plays a key role in our analysis, because we use it to compute “externality prices” on the servers that help correctly account for capacity constraints. To compute these prices, an important step is to pass to a continuum scaling [21, 32, 33] of a dynamic assignment model with stochastic arrivals. In this fluid limit, when types are known, the dynamic assignment problem simplifies to a certain linear program that is closely related to the static planning problem in stochastic processing
networks; see [9] and references therein. In the static planning problem corresponding to our setting with known types, a finite number of types of jobs arrive at certain rates into the system, to be processed by a finite number of types of servers in some order, each of which has a fixed service rate. The static planning problem in [9] focuses on characterizing the set of arrival rate vectors that can be supported by the system. In our case, there is a (client and server type dependent) payoff generated from each match, and the problem is to allocate jobs to servers to maximize the rate of value generation, subject to server capacity constraints. In this form, the problem can also be viewed as a version of the assignment problem due to Shapley and Shubik [38], in which the resources (the service rates) are divisible. We also note that a related class of dynamic resource allocation problems, online bipartite matching, is well studied in the computer science community; see [35] for a survey.

A different class of models capturing MAB problems with constraints, called “bandits with knapsacks,” has been studied in recent years. [15] formalized the basic model, which subsumed several related problems that had appeared in the literature on revenue management under demand uncertainty [17, 18, 37, 40, 12] and budgeted dynamic procurement [14, 39]. The formulation is the same as the classical MAB problem due to Lai and Robbins, with the difference that every pull of an arm depletes a vector of resources that are limited in supply. Subsequently, after a series of extensions [16, 3], the model has been generalized to the contextual bandit setting of [31] and the linear contextual bandits setting of [20], where the reward functions and the constraints are respectively concave and convex in the space of outcomes [5, 4]. There is considerable difference between our setting and these models.
These settings consider a single MAB problem over a fixed time horizon, with observable contexts arriving sequentially. Our setting, on the other hand, can be seen as a system where different types of MAB problems arrive over time at certain (known) rates, coupled through resource constraints that must be satisfied on average across all these arriving problems. These types correspond to the types of the arriving clients, which are unknown at arrival but belong to a finite set, while the reward distribution for each client type and server pair is known. Hence, as mentioned earlier, the uncertainty in each arriving bandit problem is finitely parameterized, which is again a departure from the kind of bandit problems considered in these models.

We conclude by discussing some other directions of work that are related to this paper. Our work has particular relevance to two-sided dynamic matching markets, where clients and/or resources arrive over time and allocations must be made online [7, 8, 13, 27]. There have been some recent works that consider multi-period matching in the stable-marriage setting, in which the agents’ preferences can evolve over time [28, 22, 29]. The main contribution of these works is the analysis of dynamic notions of stability in such settings. [23] study a two-sided dating market, where men and women date each other repeatedly to learn about their preferences over time. They model the situation as a two-sided multi-armed bandit problem and empirically study the asymptotic stability properties of some natural matching mechanisms combined with a simple ε-greedy learning scheme. A recent work [25] also studies matching with learning, mediated by a central platform; a key difference
from our work is that they have no constraints on the number of matches per agent. Agent incentives are considered, and optimal matching rules are studied under profit maximization and welfare maximization. Finally, a recent work [34] studies a pure learning problem in a setting very similar to ours with capacity constraints on each type of server/expert. Experts purely serve to help learn the label of each job (corresponding to our clients). The paper characterizes the set of arrival rates that can be supported by the given service capacity, while achieving the learning objective on each job. The analysis and result are related to our own work in that they are asymptotic, and use a (different) linear program in developing a near optimal assignment policy (albeit with o(1) violations of capacity).
3 The model
We consider a benchmark model of a centralized two-sided platform that dynamically learns about its user population and makes matches over time. The details of our model are as follows.

1. Clients, servers, and jobs. For convenience we adopt the terminology of clients and servers to describe the two sides of the market. As noted in the introduction, we assume a fixed set of servers whose types are known; arrivals and type uncertainty are captured on the client side of the market. Let C denote the set of possible client types and S denote the set of possible server types. Clients of type i arrive at rate ρ(i), and there are n(s) servers of type s.1 When a client arrives, her type i ∈ C is initially unknown. Each client brings N jobs sequentially (before leaving). The marketplace matches each arriving job to an available server type. Upon being matched with a job from a client of type i, a server of type s becomes unavailable for a time distributed Exponential(µ(s)), after which the server is again available to match.2 We also allow a job to not be matched to any server, implying that it is lost. In this case, we will formally think of that job as being directed to an additional server type κ that has an unbounded number of servers, i.e., n(κ) = ∞. W.l.o.g., assume that κ is included in S. We use µ̄(s) = n(s)µ(s) to denote the effective service rate of server type s, i.e., the maximum rate at which servers of type s can provide service. We define ρ(i) = ρ0(i)/N for some base arrival rates ρ0(i), so that in effect the total arrival rate of jobs in the system is Σ_{i∈C} ρ0(i). This allows us to consider matching problems parameterized by N and study their solutions as N varies. Note that ρ0(i) = N ρ(i) is the overall rate of arrival of jobs to the system from client type i.

1 As noted in the following subsection, we work in a continuum model, meaning that a mass ρ(i) of clients of type i arrive per unit time, and similarly, n(s) is the mass of servers of type s.
2 Note that in general, µ may depend on both the server and the client type. However, making µ depend only on s ensures that the busy time is not indicative of the client type, which simplifies the analysis.
2. Payoffs. The payoff from assigning a job to a server depends only on the type i of the client bringing the job and the type s of the server. In particular, we assume that the payoff is an independent Bernoulli(A(i, s)) random variable, for some matrix A; a job that goes unmatched earns zero payoff (i.e., A(i, κ) = 0 for all i ∈ C). Thus, there is uncertainty in payoff both because the type i is uncertain, and because of randomness in payoffs conditioned on the pair of types. Here we see the central benefit of exploration: learning about a client’s type is valuable because it helps generate higher payoffs from jobs that the same client brings in the future. Note that we think of payoff as accruing to the system overall, without considering how it is divided between the client, the server, and the platform.

3. What does the platform know? Our knowledge assumptions on the platform are motivated by settings where there is already a reasonable installed user base, so that the emphasis is not on the “cold start” problem. In this setting, aggregate statistics about the platform are known: in particular, we assume the platform knows n(s) and µ(s) for each server type s, as well as the arrival rate ρ(i) of each client type i. In addition, we assume the platform knows the client lifetime number of jobs N. Finally, we assume the compatibility matrix A is known – this applies to a platform that has already learned how payoffs depend on client and server types, but does not know the type of a new client when she brings her first job. This reflects the most salient issue for a platform with a reasonable installed user base: it is relatively easier to estimate how compatible different types are with each other (indeed, platforms already do this well on a regular basis) than to correctly determine the type of any specific new user.

The central challenge in this model is the following.
Since clients are arriving and departing, and the platform does not know the type of any individual client, the market must constantly learn the types of new clients in order to maximize future payoffs generated by those clients’ jobs. We develop an optimal matching policy to balance exploration and exploitation, and thus maximize long-run aggregate payoff.
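To make the model concrete, the following minimal simulation sketch uses made-up numbers echoing the introduction’s blue/red shirt example (the matrix A, the routing distribution, and all parameters are hypothetical; server capacity dynamics are omitted):

```python
import random

# Hypothetical payoff matrix A(i, s): two client types (0 = business,
# 1 = casual) and two server types (0 = blue shirts, 1 = red shirts).
A = [[0.9, 0.2],   # business clients strongly prefer blue
     [0.8, 0.7]]   # casual clients like both, blue slightly more

def serve_client(client_type, routing, N, seed=0):
    """Match each of a client's N jobs to a server type drawn from the
    fixed distribution `routing`, collecting Bernoulli(A(i, s)) payoffs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(N):
        s = rng.choices(range(len(routing)), weights=routing)[0]
        total += 1 if rng.random() < A[client_type][s] else 0
    return total

# With the type known and no capacity constraint, every job would go to
# blue shirts; a casual client then earns A[1][0] = 0.8 per job on average.
payoff = serve_client(client_type=1, routing=[1.0, 0.0], N=1000)
```

When blue-shirt capacity binds, the platform would instead want to route casual clients to red shirts, which requires learning their type first; that tension is exactly what the rest of the paper formalizes.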
4 The dynamic assignment problem: A continuum limit
In this section we precisely state the optimization problem we study. We use a continuum (fluid) scaling [21, 32, 33] to simplify analysis of our model, and obtain guidance on the design of optimal matching strategies for the platform. This corresponds to the limit of the system in which the number of servers and the client arrival rates ρ0 (i) are both simultaneously
scaled up to ∞.3,4 Further, we consider stationary ergodic assignment policies, i.e., policies such that the system state has stationary long-run behavior. We measure the performance of a policy by the steady-state rate of payoff accumulation.

Consider a stationary ergodic policy. The policy induces a function π from the history of a client (who arrives while the system is in steady state) to a distribution over the type of server her t-th job will be assigned to. The history Ht consists of, for each of the previous t − 1 jobs, the server type it was assigned to and the payoff that was generated. Formally, any policy for routing N jobs from each client induces a steady-state mapping π : (S × {0, 1})^{t−1} → ∆(S) for all t = 1, 2, . . . , N. Clearly, any π further induces a fraction ξπ(i, s) of the jobs generated by a client of type i that are served by a server of type s. Then the rate at which servers of type s serve jobs from clients of type i is ρ(i)N ξπ(i, s) = ρ0(i)ξπ(i, s). The capacity constraint can now be captured as follows:

    Σ_{i∈C} ρ0(i) ξπ(i, s) ≤ µ̄(s)   ∀ s ∈ S.   (1)
The expected payoff that results from jobs associated with a single client of type i under policy π is given by Σ_{s∈S} ξπ(i, s)A(i, s). Since jobs arrive from clients of type i at a rate of ρ(i)N = ρ0(i), the overall steady-state rate of payoff accumulation is

    Σ_{i∈C} ρ0(i) Σ_{s∈S} ξπ(i, s)A(i, s).
Our objective is to find a policy π that maximizes this rate. We focus on policies that allocate jobs from a client to servers based only on the history of that client; this choice is without loss of generality, as we now show. In general, a policy can base its assignment decision for a job not just on the history of the particular client, but on the entire history (including the current state) of the system. Notice, however, that for any π induced by some stationary policy, there is a very simple policy that induces the same π and satisfies the capacity constraints, but makes assignment decisions based only on the history of the client. The simple policy assigns the t-th job of a client with history Ht to a server type drawn from the distribution π(Ht). Since (1) holds, we deduce that a
3 Our continuum model is the correct limiting description of the finite system because we think of the number of servers of each type as scaling up, as opposed to the service rate scaling up. In the latter case, there is a positive probability that a job finds all the servers of a particular type occupied even in the limit, irrespective of whether the servers are operating at full capacity. In our model, with the number of servers scaling up, the probability that a job finds all the servers of a particular type occupied vanishes in the limit. This holds even when all the servers are operating at full capacity, so that the fraction of time an individual server idles can be zero.
4 Strictly speaking, in the continuum limit we interpret A(i, s) not as the expected payoff from a single job (which would lead to infinite total payoff), but as the payoff from a unit mass of jobs of client type i assigned to server type s. This is just an appropriate “scaling” of the payoff per job to arrive at the desired limit, and does not alter the learning problem for a single client. Hence, we will continue to think of the payoff from a single job as Bernoulli(A(i, s)) in the context of learning client types.
server of the drawn type will be available with probability 1 in steady state. It follows that the policy induces the same π (and hence produces the same rate of payoff accumulation). Thus, it is convenient to identify each π with the corresponding simple policy, and to restrict our study (at no cost in terms of our objective) to the class Π^N of such policies, which choose a server type for each job based only on the history of the client.5

Let ξπ = [ξπ(i, s)]_{C×S}. This is a (right) stochastic matrix, since each row sums to 1. We call it the routing matrix corresponding to a policy π. Let

    Ξ^N = { ξπ : π ∈ Π^N } ⊆ [0, 1]^{|C|×|S|}

be the set of achievable routing matrices (when each client brings N jobs). We start by summarizing a few properties of Ξ^N in order to provide intuition for the structure of this set. For simplicity, we consider a case with just two client types C = {i, i′} in describing these properties.

1. Ξ^N is a convex set, since any convex combination of ξ̄ ∈ Ξ^N and ξ̄′ ∈ Ξ^N can be achieved by appropriate randomization between the corresponding policies. In fact, Ξ^N is a convex polytope (see Proposition 8.8 in the Appendix).
2. The projection of Ξ^N onto the dimensions ξ(i, ·) is a subset of the probability simplex. It is identical to the simplex in the absence of capacity constraints.
3. If types i and i′ cannot be distinguished (because they generate the same payoffs), then we must have ξ(i, ·) = ξ(i′, ·).
4. If types i and i′ can be distinguished, then ξ(i, ·) ≠ ξ(i′, ·) is possible. In the absence of capacity constraints, Ξ^N converges as N → ∞ to the product of two probability simplices, one for i and another for i′.

Now we can state the optimization problem we solve. The overall problem of choosing a policy π to maximize the rate of accumulation of payoff is:

    W^N = max  Σ_{i∈C} ρ0(i) Σ_{s∈S} ξ(i, s)A(i, s)                    (2)
          s.t. Σ_{i∈C} ρ0(i)ξ(i, s) ≤ µ̄(s)  ∀ s ∈ S;
               ξ ∈ Ξ^N.

Since Ξ^N is a convex polytope, this is a linear program, albeit a complex one. The complexity of this problem is hidden in the complexity of the set Ξ^N, which includes all possible routing matrices that can be obtained using policies in Π^N.6 The remainder of our paper is devoted to solving this problem and characterizing its value, by considering an asymptotic regime where N → ∞.
5 Notice that our argument also establishes that allowing waiting is not beneficial: we can assume that jobs must be served immediately or be dropped, with no effect on the rate of payoff accumulation.
6 Notice in particular that a policy allows for independently choosing Ω(2^N) distributions over S.
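As a concrete illustration of how a policy in Π^N induces a routing matrix, the sketch below (all parameters hypothetical) estimates the row ξπ(i, ·) for a toy explore-then-commit policy by simulating many independent clients and recording the fraction of their N jobs sent to each server type:

```python
import random

# Hypothetical payoff matrix A(i, s) and client lifetime N.
A = [[0.9, 0.2],
     [0.8, 0.7]]
N = 50

def explore_then_commit(history):
    """A toy policy: try each server type once, then commit to the one
    with the higher total observed payoff. Returns a distribution over
    server types given the client's history of (server, payoff) pairs."""
    if len(history) == 0:
        return [1.0, 0.0]
    if len(history) == 1:
        return [0.0, 1.0]
    best = max(range(2), key=lambda s: sum(p for (srv, p) in history if srv == s))
    return [1.0 if s == best else 0.0 for s in range(2)]

def estimate_xi(client_type, policy, trials=2000, seed=1):
    """Monte Carlo estimate of xi_pi(client_type, .): the long-run
    fraction of a client's jobs routed to each server type."""
    rng = random.Random(seed)
    counts = [0, 0]
    for _ in range(trials):
        history = []  # each client starts fresh: xi is a per-client quantity
        for _ in range(N):
            dist = policy(history)
            s = rng.choices([0, 1], weights=dist)[0]
            payoff = 1 if rng.random() < A[client_type][s] else 0
            history.append((s, payoff))
            counts[s] += 1
    total = sum(counts)
    return [c / total for c in counts]

xi_row = estimate_xi(client_type=0, policy=explore_then_commit)
```

For client type 0 (payoffs 0.9 vs 0.2), most of the mass of ξπ(0, ·) lands on server type 0, consistent with the intuition that a good policy’s routing matrix concentrates on high-payoff servers as N grows.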
5 Our approach: Pricing the capacity constraints
In this section we introduce prices on each server type (externality prices), derived from the problem of maximizing the rate of payoff accumulation when client types are known. These prices are exactly the optimal dual variables corresponding to the capacity constraints in this optimization problem. We will now describe the problem with known types, and then our approach of incorporating these prices into the problem with unknown types (studied in the limit of large N ).
5.1 Known client types
Let D denote the set of achievable routing matrices when client types are known, defined as:

    D = { x ∈ R^{|C|×|S|} : x(i, s) ≥ 0;  Σ_{s∈S} x(i, s) = 1 }.   (3)
The steady-state payoff maximization problem with known client types is the following:

    V∗ = max  Σ_{i∈C} ρ0(i) Σ_{s∈S} x(i, s)A(i, s)                    (4)
         s.t. Σ_{i∈C} ρ0(i)x(i, s) ≤ µ̄(s)  ∀ s ∈ S;
              x ∈ D.

This linear program is a special case of the “static planning problem” that arises frequently in the operations literature (see, e.g., [9]). The optimal solution is a routing matrix x∗, which is a stochastic matrix (i.e., all row sums are one). Note that an optimal policy (with known types) can easily be constructed from x∗: the policy simply routes each job from a client of type i to a server drawn independently from the distribution x∗(i, ·).

Now let (p∗(s))_{s∈S} be the shadow prices corresponding to the server capacity constraints in V∗. Formally, these are obtained from an optimal solution to the dual of problem V∗. Intuitively, these shadow prices capture the negative externality per unit of jobs assigned to server type s. Note that both p∗ and x∗ can be easily computed given the primitives. Our next proposition shows that these prices are uniquely determined under a mild condition, given as follows.

Definition 1. We say that arrival rates (ρ0(i))_{i∈C} and service rates (µ̄(s))_{s∈S} satisfy the generalized imbalance condition if there is no pair of non-empty subsets of clients and servers (C′, S′) such that the total job arrival rate of C′ exactly matches the total effective service rate of S′. Formally,

    Σ_{i∈C′} ρ0(i) ≠ Σ_{s∈S′} µ̄(s)   ∀ ∅ ≠ C′ ⊆ C, ∅ ≠ S′ ⊆ S.
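Since the condition quantifies over all pairs of non-empty subsets, it can be verified directly by enumeration for small instances. A sketch with hypothetical rates:

```python
from itertools import combinations

def generalized_imbalance(rho0, mu_bar, tol=1e-9):
    """Check Definition 1: no non-empty subset of client arrival rates
    sums (up to tol) to the sum of some non-empty subset of effective
    service rates. Enumeration is exponential, fine for small instances."""
    for k in range(1, len(rho0) + 1):
        for c_sub in combinations(rho0, k):
            for m in range(1, len(mu_bar) + 1):
                for s_sub in combinations(mu_bar, m):
                    if abs(sum(c_sub) - sum(s_sub)) < tol:
                        return False
    return True

assert generalized_imbalance([1.0, 2.0], [1.7, 2.5])      # condition holds
assert not generalized_imbalance([1.0, 2.0], [3.0, 0.4])  # 1.0 + 2.0 = 3.0
```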
The generalized imbalance condition holds generically.7 Note that this condition does not depend on the expected payoff matrix A. We have the following result.

Proposition 5.1. Under the generalized imbalance condition, the server shadow prices p∗ are uniquely determined. The optimal externality-adjusted payoff of user i is:

    U(i) = max_{s∈S} A(i, s) − p∗(s);                    (5)

and the set of optimal server types for user i is:

    S(i) = arg max_{s∈S} A(i, s) − p∗(s).                (6)
A standard duality argument demonstrates that in an optimal solution (with known types), jobs from client type i are directed towards server types that generate the maximum expected payoff after adjusting for the shadow price of the server type. Formally, it can be shown that x∗(i, ·) is supported on S(i).

Remark 2. Under generalized imbalance, if there is at least one server type that is used to capacity (in some x∗), then there is at least one client type i such that S(i) is not a singleton: thus optimal routing requires properly allocating jobs between different servers that are optimal for the same client (in other words, optimal tie-breaking is necessary), even given the externality prices.
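On a small instance, p∗ and the quantities U(i) and S(i) of Proposition 5.1 can be computed by solving the linear program (4) with an off-the-shelf solver and reading off the duals of the capacity constraints. The sketch below uses hypothetical rates and payoffs and assumes SciPy’s HiGHS backend; conveniently, the instance also exhibits the tie described in Remark 2, since both servers become optimal for client type 0 after price adjustment:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance: two client types, two server types; the unmatched
# option kappa is omitted since server capacity is ample.
rho0 = np.array([1.0, 1.0])          # job arrival rates rho0(i)
mu_bar = np.array([0.8, 2.1])        # effective service rates mu_bar(s)
A = np.array([[0.9, 0.2],
              [0.8, 0.7]])           # expected payoffs A(i, s)

# Variables x(i, s), flattened row-major; maximize by minimizing negation.
c = -(rho0[:, None] * A).ravel()
# Each row of x is a distribution over server types (the set D, eq. (3)).
A_eq = np.kron(np.eye(2), np.ones((1, 2)))
b_eq = np.ones(2)
# Capacity: sum_i rho0(i) x(i, s) <= mu_bar(s) for each server type s.
A_ub = np.kron(rho0.reshape(1, 2), np.eye(2))

res = linprog(c, A_ub=A_ub, b_ub=mu_bar, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, 1), method="highs")
p_star = -res.ineqlin.marginals      # shadow prices p*(s) from the dual
U = (A - p_star).max(axis=1)         # externality-adjusted payoffs, eq. (5)
```

Here the scarce server (capacity 0.8 against total demand 2.0) earns a positive price while the slack server’s price is zero, and row 0 of A − p∗ has two maximizers, so S(0) is not a singleton.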
5.2 Unknown client types: The large N regime
We now return to problem (2). We quantify the performance of a policy for problem (2) (that is, any policy in Π^N that respects the capacity constraints) in terms of its regret, which we define as the amount by which its rate of accumulation of payoff falls short of the maximum rate V∗ achievable with known client types (problem (4)). We will say that a policy is regret-optimal if it achieves the smallest possible regret to leading order, asymptotically in N.

As noted above, we study (2) in the large N regime, where each client brings many jobs to the system. Informally, in this limit, the platform should learn each client’s type early in her lifetime; therefore it is natural to expect the prices p∗(s) to be “close” to the optimal shadow prices in (2). This observation motivates our approach: eliminate the capacity constraints in (2) by replacing A(i, s) with the “externality-adjusted payoffs” A(i, s) − p∗(s). The idea is that if the p∗(s) were the correct optimal dual variables, then the resulting solution, so long as any tie-breaking is done appropriately to meet capacity constraints, would in fact
Footnote 7: The set of job arrival and effective service rate vectors for which the condition holds is open and dense in R_{++}^{|C|+|S|}, where R_{++} denotes the set of strictly positive real numbers.
be optimal for the problem with unknown client types. This alteration yields the following problem:

maximize_{π∈ΠN}  Σ_{i∈C, s∈S} ρ(i) ξπ(i, s) (A(i, s) − p∗(s)).   (7)
However, since p∗(s) may not be the correct dual variables (they were computed from an optimization problem where client types are known), solving the preceding problem will in general lead to capacity violations (this is also related to Remark 2 above). In fact, if there is an optimal solution to the preceding problem that ensures that the capacities are not exceeded, and also ensures that the servers that were utilized to their full capacity in the solution to problem (4) are fully utilized in (7) as well, then (a) p∗(s) would indeed be the optimal dual variables in the original problem (2), and (b) this solution would be optimal in the original problem (2) (see footnote 8). Our main result develops a regret-optimal algorithm for the preceding problem (7), where regret-optimal here means that the algorithm matches, up to the leading-order term (asymptotically in N), the gap between the optimal value of this problem and that of the same problem with known types. The algorithm further ensures that capacities are not violated, and that the servers that were fully utilized in the solution to problem (4) remain fully utilized. These two properties imply that the algorithm is regret-optimal in the original problem (2). The problem of developing such an algorithm appears to closely resemble a multi-armed bandit problem [2], with the main departure being the need to satisfy these capacity utilization requirements.
6
A regret-optimal algorithm
In this section we present a regret-optimal algorithm to solve (2). We begin the section with an example that illustrates the main difficulties our algorithm addresses. For convenience, we define the absolute regret of a policy to be the expected total loss in payoff for one client over N jobs, relative to the payoff achievable if the client type were known. (Hence, the absolute regret is simply N times the regret.)
6.1
Illustrative example
We study a simple example from four perspectives: known or unknown client types, with and without capacity constraints. In the example there are 3 client types, a, b and c, bringing in jobs at rates ρ0(a) = ρ0(b) = ρ0(c) = 0.5. There are three server types 1, 2 and 3, whose effective service rates are μ̄(1) = 0.7, μ̄(2) = 0.7 and μ̄(3) = 0.2. The matrix of reward probabilities is shown in Figure 1. We first consider this model without capacity constraints on the servers [2]. The optimal routing policy in this case, if the client type is known, is to route jobs from type a to server
Footnote 8: This is easy to see from the optimality conditions.
Clients (ρ0)    Server 1 (μ̄ = 0.7)    Server 2 (μ̄ = 0.7)    Server 3 (μ̄ = 0.2)
a (0.5)                0.8                    0.6                    0.1
b (0.5)                0                      0.6                    0.2
c (0.5)                0                      0.6                    0.5

Figure 1: The A matrix for an example with three client types, a, b, and c, and three server types 1, 2 and 3.
type 1, and the jobs from b and c to server type 2. As a consequence, even if the client type is not known, an adaptive allocation policy need not invest any effort to distinguish between client types b and c: both have a common optimal server type, namely type 2. But when the client type is unknown, the optimal algorithm does need to distinguish client type a from the other two types, since their optimal servers differ. With server 2, however, the probability of reward is the same for all three types, so it cannot help in making this distinction. Hence any optimal policy must send some jobs to either server 1 or server 3 in order to distinguish the client types, and thus in the event that the client is of type b or type c, some loss relative to the case of known types is unavoidable. This aspect of the example illustrates that there may be "difficult" (yet necessary) type-distinction requirements that force the algorithm to incur regret in the short term in order to maximize long-run average payoff. Note that an Ω(1) absolute regret is unavoidable whenever no single server type is optimal for all client types, since for any policy there will be at least one client type for which an expected Ω(1) number of jobs (out of N) are sent to a suboptimal server type. However, in our example, the difficulty of distinguishing the class {b, c} from a is more serious: it turns out that an absolute regret of Ω(log N) is unavoidable. To get a sense of why this is true, imagine a policy that initially sends a small number k of jobs (incurring Θ(k) absolute regret) to server types 1 and/or 3 and, based on the realized payoffs, estimates (with confidence 1 − exp(−O(k))) that the client type is in {b, c}. Based on this estimate, the policy can choose to send all future jobs from that client to server type 2.
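The tradeoff just described, between the Θ(k) exploration cost and the exp(−O(k)) misclassification probability, can be sketched numerically; the constant c = 1 in the error exponent below is an arbitrary illustrative choice.

```python
import math

def absolute_regret(k, N, c=1.0):
    """Exploration cost Theta(k) plus N times the misclassification
    probability ~ exp(-c*k); c is an illustrative constant."""
    return k + N * math.exp(-c * k)

N = 10_000
best_k = min(range(1, 100), key=lambda k: absolute_regret(k, N))
# The minimizer scales like log(N)/c, in line with the Theta(log N)
# absolute-regret scaling discussed here.
```

For N = 10,000 the minimizer is k = 9 ≈ log N, with total absolute regret of about 10.2, i.e., Θ(log N).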
However, this means that there will be no further learning about the client type, and the expected contribution to the absolute regret is about N times the probability that the true client type is actually a, i.e., N exp(−O(k)). Combining, we see that the total absolute regret is at least Θ(k) + N exp(−Θ(k)), which is minimized at k = Θ(log N), so Θ(log N) absolute regret is the best achievable. Now we consider the model with capacity constraints at the servers, as in Figure 1. It is straightforward to check that the optimal routing policy with known types is as follows: all
the a jobs are sent to server type 1; b jobs are split between server types 2 and 3 as [0.4 + x, 0]; and c jobs are split between server types 2 and 3 as [0.3 − x, 0.2], for any x ∈ [0, 0.1]. The dual optimal prices for the server types are unique: p∗(1) = 0, p∗(2) = 0.6 and p∗(3) = 0.5 (see footnote 9). Note that server type 2, which is optimal for client types b and c in the unconstrained problem, does not have adequate capacity to serve all jobs arriving to it. Therefore some of the jobs of client type c have to be served by server 3 to avoid violating capacity constraints. As a result, with unknown client types and the specified capacity constraints, the optimal policy must invest effort in distinguishing between client types b and c, in contrast to the problem without capacity constraints. We conclude by showing that even introducing the externality prices is not enough to prevent capacity violations. In particular, suppose we account for the externality imposed by the capacity constraints by using the externality-adjusted payoffs A(i, s) − p∗(s), cf. problem (7). Next, we employ a "black-box" implementation of any optimal multi-armed bandit algorithm for the resulting learning problem (such as the one proposed in [1]) on a client-by-client basis, without any modification to accommodate capacity constraints. Such an algorithm would never use server type 3 (since this server is weakly dominated by server types 1 and 2), and thus will never distinguish client types b and c. This leads to a loss in the rate of payoff accumulation, as well as possible violation of the capacity of server 2. This phenomenon is general, cf. Remark 2, and demonstrates why simply introducing the externality-adjusted payoffs is insufficient to solve the general capacity-constrained problem (2).
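The known-types optimum and the dual prices quoted above can be recovered by solving the routing linear program for this example directly. The sketch below is our own formulation, using SciPy: it maximizes the rate of payoff subject to the capacity and conservation constraints, and reads off the capacity shadow prices from the LP duals.

```python
import numpy as np
from scipy.optimize import linprog

rho = np.array([0.5, 0.5, 0.5])      # arrival rates of client types a, b, c
mu = np.array([0.7, 0.7, 0.2])       # effective service rates (capacities)
A = np.array([[0.8, 0.6, 0.1],       # reward probabilities from Figure 1
              [0.0, 0.6, 0.2],
              [0.0, 0.6, 0.5]])

n_c, n_s = A.shape
c = -A.flatten()                      # maximize payoff <=> minimize -payoff
# Capacity: sum_i z[i, s] <= mu[s], where z[i, s] is the job-rate flow
# of client type i to server type s.
A_ub = np.zeros((n_s, n_c * n_s))
for s in range(n_s):
    A_ub[s, s::n_s] = 1.0
# Conservation: sum_s z[i, s] = rho[i] for every client type i.
A_eq = np.kron(np.eye(n_c), np.ones((1, n_s)))

res = linprog(c, A_ub=A_ub, b_ub=mu, A_eq=A_eq, b_eq=rho, method="highs")
V_star = -res.fun                     # optimal rate of payoff accumulation
p_star = -res.ineqlin.marginals       # shadow prices of the capacities
```

This recovers V∗ = 0.92 and p∗ = (0, 0.6, 0.5), matching the prices stated above; by Proposition 5.1 the prices are unique here, so the solver's choice of optimal basis does not matter.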
6.2
Guess, confirm, exploit
We now generalize the insights from the preceding example to motivate and describe our approach. Our central object of study is a policy π∗ that focuses on regret minimization with respect to the externality-adjusted payoffs of each client. Our algorithm design is characterized by three "phases": When a new client arrives, the policy first enters a guessing phase, used to form an initial conjecture about the client type; this is followed by a confirmation phase, used to confirm (or reject) the initial guess, and in particular, as the preceding example suggests is essential, to distinguish it from other client types; and finally, an exploitation phase, focused on payoff maximization for the confirmed client type, while ensuring capacity is not violated. We describe each of these phases next; a formal definition of the policy π∗ comes in the following subsection. Recall that S(i) is the set of server types that maximize the externality-adjusted payoff A(i, s) − p∗(s) for client type i. 1. Phase 1: Guessing. This is a short phase in which we try to form a quick guess of the client type. The server types are chosen in a round-robin fashion (so that all types can
Footnote 9: For a moment, suppose type a is removed from the example. Then the need to distinguish b from c for capacity reasons leads only to an Ω(1) regret, since one only needs to differentiate between types b and c enough to in effect achieve the desired routing. If the estimated type is c and jobs are sent to server type 3, then the estimate continues to be refined. On the other hand, if server type 2 is used, there is no accumulation of regret anyway, since type 2 is optimal for both b and c.
be distinguished from each other) for a fixed O(√(log N)) number of jobs, so that at the end of this phase the type of the client is guessed to be some i, with probability of error exp(−Ω(√(log N))). 2. Phase 2: Confirmation. Suppose that we have guessed the client type to be i. The second phase seeks to confirm this guess. In particular, the algorithm must compare i to each other type i′. There are two key cases to consider here, as we now discuss.

• Case 1 (An easy type pair): A(i, s) ≠ A(i′, s) for at least one s ∈ S(i). In this case, the server s can be used to distinguish between types i and i′ without incurring any regret if the true type is confirmed to be i. In the example above, server 1 can be used to distinguish client type a from client types b and c, without incurring regret if the true client type is a. To cover this case the algorithm directs Θ(log N) jobs (in expectation) uniformly to all the servers in S(i), based on which we either confirm once and for all that the type is i and move to the exploitation phase, or we learn that we made a mistake and go back to the guessing phase.

• Case 2 (A difficult type pair): A(i, s) = A(i′, s) for all s ∈ S(i) and S(i) ∩ S(i′) = ∅. This is the most challenging case to address. We refer to a pair (i, i′) satisfying these conditions as a difficult type pair. In general, if none of the server types in S(i) allow us to distinguish between i and i′, and all the servers in S(i) are strictly suboptimal for i′, then any policy that achieves small regret must send some jobs to other servers that allow us to make this distinction. This process must incur a loss in the event that the true client type is i. In the example above (with externality-adjusted payoffs), the pair (b, c) is an example of a difficult type pair. In this case we need to send some jobs to other servers that allow i to be distinguished from i′.
Note that in choosing such servers, there are two objectives to consider: minimizing the regret incurred from choosing a given server (based on the guessed type i), and maximizing the rate at which that server helps us distinguish i from i′. Recall that U(i) is the optimal externality-adjusted payoff of user i (cf. (5)). The loss relative to this optimum for each server s is U(i) − [A(i, s) − p∗(s)]. On the other hand, for any server s that can distinguish between i and i′, the expected number of jobs needed to distinguish them with probability of error less than 1/N is (log N)/I(i, i′|s), where I(i, i′|s) is the Kullback–Leibler (KL) divergence between the reward distribution Bernoulli(A(i, s)) and the reward distribution Bernoulli(A(i′, s)) (see footnote 10). Thus the optimal server is the one that minimizes the following relative loss:

(U(i) − [A(i, s) − p∗(s)]) / I(i, i′|s).

In our algorithm design, we simultaneously compare i against all other i′ for which (i, i′) is a difficult type pair. This requires efficiently choosing (a distribution over)
Footnote 10: The KL divergence between a Bernoulli(A(i, s)) and a Bernoulli(A(i′, s)) distribution is defined as I(i, i′|s) = A(i, s) log[A(i, s)/A(i′, s)] + (1 − A(i, s)) log[(1 − A(i, s))/(1 − A(i′, s))].
suboptimal servers, and a generalization of the preceding relative loss metric to the case of multiple difficult type pairs.

• Case 3 (No need to distinguish): A(i, s) = A(i′, s) for all s ∈ S(i) and S(i) ∩ S(i′) ≠ ∅. In this case, no server in S(i) can be used to distinguish i from i′, but S(i) has nonempty intersection with S(i′). It is easily shown that this implies S(i) ⊆ S(i′): if s ∈ S(i) ∩ S(i′) and A(i, s) = A(i′, s), then it is immediately seen that i and i′ have the same optimal externality-adjusted payoff for s, and hence for every other server type in S(i). But then every server in S(i) must also be optimal for i′. In this case, the algorithm need not invest jobs in trying to distinguish i from i′: even if the true type happened to be i′, no regret is incurred by supposing that the true type is i. Note that in principle, if the true type is i′ but the algorithm believes it to be i, it may not allocate jobs to servers so as to manage the capacity constraints effectively; e.g., in the example above, S(b) ⊂ S(c), so if the algorithm guessed b but the true type were c, it would mistakenly fail to send some jobs to server 3. We address this issue by careful adjustment of the routing matrix in the exploitation phase, to ensure that it does not result in capacity violations (or loss of payoff). 3. Phase 3: Exploitation. In this phase, jobs are directed to the optimal server types for the confirmed client type i according to a particular routing matrix y∗. This matrix is carefully constructed to ensure that the capacity constraints are not violated overall. The idea is to construct y∗ to closely resemble an optimal solution x∗ to the problem with known types (cf. (4)).
The routing matrix y∗ needs two key properties: first, it should correct for errors relative to x∗ that occur during routing in the guessing and confirmation phases; and second, it should account for the fact that the confirmed type may be incorrect with some probability (e.g., in Case 1 of the confirmation phase, described above). Also, y∗ should share the support of x∗. We show in Proposition 8.4 that such a y∗ can indeed be chosen under the generalized imbalance condition. We refer to a type i as a difficult type if there is at least one i′ such that (i, i′) is a difficult type pair. The presence of difficult types makes the optimal regret scale as Θ(log N/N) [1]. This is because for any difficult type i, some effort must be spent distinguishing i from all i′ such that (i, i′) is a difficult type pair, cf. Section 6.1. On the other hand, if there is no difficult type, then the optimal regret scaling is O(fN/N) for any fN = ω(1) (this follows immediately from the proof of our main theorem). The reason is that only O(fN) jobs from a client of type i need to be sent to servers not in S(i). Nominally, we might expect that generically, for any values of the A matrix, difficult type pairs do not exist: the definition requires exact equality in payoffs. However, this extreme interpretation is somewhat misleading, an artifact of considering the absolute limit where N → ∞. For example, if A(i, s) = 0.5 and A(i′, s) = 0.6, it would take approximately 1/(0.1)² = 100 jobs to distinguish i and i′ using s, which may be comparable to (or more likely larger than) practical values of N; e.g., in a platform like Upwork, clients may not bring more than a few jobs per month. Therefore s is practically unable to distinguish between i and i′. For this reason, we consider a full understanding of
the general setting with difficult types to be the more relevant scenario to study. Formally, we consider the case where at least one difficult type pair exists, so that there is a tradeoff between myopic payoffs and learning.
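The quantities driving this tradeoff, the Bernoulli KL divergence I(i, i′|s) from footnote 10 and the Case 2 relative-loss metric, can be sketched numerically. The function names here are ours; the 0.5-versus-0.6 rewards are the hard-to-distinguish pair from the discussion above.

```python
import math

def bernoulli_kl(p, q):
    """I(i, i'|s): KL divergence between Bernoulli(p) and Bernoulli(q).
    Assumes p and q lie strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def relative_loss(U_i, A_is, p_s, kl):
    """Case 2 server score: (U(i) - [A(i,s) - p*(s)]) / I(i, i'|s)."""
    return (U_i - (A_is - p_s)) / kl

# Rewards 0.5 vs 0.6 are close, so the KL gap is tiny and roughly
# log(N) / I jobs are needed for a 1/N error probability.
I = bernoulli_kl(0.5, 0.6)
jobs_needed = math.log(10_000) / I
```

Here I ≈ 0.0204, so distinguishing these two reward levels with error probability 1/N at N = 10,000 takes roughly 450 jobs, larger than the plausible per-client job count for many platforms.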
6.3
Formal definition of the policy π ∗
In this section we provide a formal definition of the policy π∗, based on the discussion above. We require a few additional pieces of notation. For each client type i, let Cd(i) be the set of types i′ such that (i, i′) is a difficult type pair, and let C′ be the set of difficult types. Recall that these are the client types described in Case 2 of the discussion of the confirmation phase above. Next, for client type i, let Ce(i) be the set of types that can be distinguished from i using a server type in S(i):

Ce(i) = {i′ ∈ C : ∃ s ∈ S(i) s.t. A(i, s) ≠ A(i′, s)}.   (8)

The subscript "e" refers to the fact that these are types i′ from which it is easy to distinguish i. Recall that these are the client types described in Case 1 of the discussion of the confirmation phase above. Next, let S̄(i) = S \ S(i) be the set of all server types that are suboptimal for client type i. Finally, note that since each pair of client types is distinguishable by some server type, there exist integers Ks ≥ 0 for each s ∈ S such that

Σ_{s∈S} Ks A(i, s) ≠ Σ_{s∈S} Ks A(i′, s)

for any i ≠ i′ ∈ C. Let ψ = min_{i≠i′∈C} |Σ_{s∈S} Ks (A(i, s) − A(i′, s))|. The policy π∗ is formally defined as follows.
1. Phase 1: Guessing. Fix an arbitrary order of the server types. For 2√(log N) (Σ_{s∈S} Ks)²/ψ² rounds, in each round, allot the client to server s for Ks jobs in the fixed order. Let p̄ be the empirical average reward per round, and let i∗ ∈ arg min_{i∈C} |p̄ − Σ_{s∈S} Ks A(i, s)| (see footnote 11).
2. Phase 2: Confirmation. We separate this phase into two subphases. If the guessed type i∗ is a difficult type, we try to distinguish i∗ from all types in Cd(i∗); this is called confirmation phase A. Next, if there are any remaining types i′ ∈ Ce(i∗), we try to distinguish i∗ from them; this is called confirmation phase B.

(a) Phase A: Difficult types. This phase is only executed for difficult types; if Cd(i∗) is empty, the algorithm directly proceeds to confirmation phase B. Let QN = N log N. Further, let

α∗ ∈ arg min_{α∈Δ(S̄(i∗))} max_{i′∈Cd(i∗)}  [ Σ_{s∈S̄(i∗)} αs (U(i∗) − [A(i∗, s) − p∗(s)]) ] / [ Σ_{s∈S̄(i∗)} αs I(i∗, i′|s) ].
Footnote 11: All the results still hold if √(log N) here is replaced by any f(N) = Ω(log log N) ∩ o(log N).
Let the time in this phase be labeled n = 1, 2, .... At each time n, a server in S̄(i∗) is chosen according to the distribution α∗. Let the server type chosen at time n be sn and the outcome be Xn. For any i ∈ C and s ∈ S, let p(X, i, s) = A(i, s) 1{X=1} + (1 − A(i, s)) 1{X=0}. Let

Λn(i, i′) = Π_{t=1}^{n} p(Xt, i, st) / p(Xt, i′, st),

i.e., the likelihood ratio of the observations under types i and i′. Finally, let Λn(i) = min_{i′∈C} Λn(i, i′), and let Λn^{Cd(i)}(i) = min_{i′∈Cd(i)} Λn(i, i′). Confirmation phase A continues until one of the following conditions is satisfied:

i. Λn^{Cd(i∗)}(i∗) ≥ QN, in which case we say that i∗ is confirmed and move on to confirmation phase B.
ii. Λn(i∗) ≤ 1/QN, in which case we declare the confirmation phase a failure and go back to the guessing phase.

(b) Phase B: Easy types. Let i∗ be the client type confirmed in phase A, and let the time in this phase be labeled n = 1, 2, .... At each stage n, choose a server in S(i∗) uniformly at random. Keep track of the likelihood ratio

Λn^{Ce(i∗)}(i∗) = min_{i′∈Ce(i∗)} Λn(i∗, i′).

Confirmation phase B continues until one of the following two conditions holds:

i. Λn^{Ce(i∗)}(i∗) > QN, in which case we move on to the exploitation phase.
ii. Λn^{Ce(i∗)}(i∗) ≤ 1/QN, in which case we declare the confirmation phase a failure and go back to the guessing phase.

3. Phase 3: Exploitation. For each job, choose a server s ∈ S(i∗) with probability y∗(i∗, s), where y∗ is a specified routing matrix (cf. Proposition 8.4 in the Appendix).
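The stopping rules in the confirmation phase are sequential likelihood-ratio tests. A minimal single-rival simulation of such a test is sketched below; the reward probabilities, the single probe server, and the function name are our own illustrative choices, not values from the paper.

```python
import math
import random

def confirm(a_guess, a_rival, true_p, Q_N, rng):
    """Run the likelihood-ratio test Lambda_n(i, i') until it exits
    the interval (1/Q_N, Q_N). Returns (confirmed, jobs used)."""
    lam, n = 1.0, 0
    while 1.0 / Q_N < lam < Q_N:
        x = rng.random() < true_p       # reward of the n-th probe job
        # p(X, i, s) = A(i, s) if X = 1, else 1 - A(i, s)
        lam *= (a_guess if x else 1 - a_guess) / (a_rival if x else 1 - a_rival)
        n += 1
    return lam >= Q_N, n

rng = random.Random(0)
N = 10_000
confirmed, n_jobs = confirm(a_guess=0.5, a_rival=0.6, true_p=0.5,
                            Q_N=N * math.log(N), rng=rng)
```

When the true type matches the guess, the test confirms with probability roughly 1 − 1/QN, after on the order of log(QN)/I(i, i′|s) jobs, consistent with the (log N)/I(i, i′|s) count from Case 2.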
6.4
Main result
We conclude by presenting our main theorem, analyzing the performance of the policy described above. We begin with a relatively straightforward result that provides some intuition. When a single client brings a large number of jobs, we expect that this eases the burden of learning; in fact, asymptotically as N → ∞, it allows the same rate of payoff accumulation as if the type of each client were known from the start. In terms of problem (2), we expect limN→∞ ΞN = D, where recall that

D = { x ∈ R^{|C|×|S|} : x(i, s) ≥ 0; Σ_{s∈S} x(i, s) = 1 }   (9)
is the set of achievable routing matrices when client types are known. Indeed, under the mild requirement that no two rows of the expected payoff matrix A are identical, we establish that
limN→∞ ΞN = D; see Proposition 8.7 in the Appendix for details, where we also show how a simple family of naive "explore then exploit" policies achieves the optimal performance with known types (V∗, cf. problem (4)) asymptotically as N → ∞. The problem with these simple, naive policies is that they incur high regret, precisely because they do not account for the cases described in both our example and the description of the policy π∗. To that end, the following theorem is our main result: it shows that the policy π∗ described above not only achieves a payoff of V∗ in the limit as N → ∞, it also has optimal regret to leading order in N.

Theorem 6.1. Suppose that the generalized imbalance condition holds (Definition 1). Suppose also that no two rows of A are identical, and that at least one difficult type pair (defined in Section 6.2) exists. Let W^N(π∗) denote the rate of payoff accumulation of the policy π∗ described above. Then there is a constant C = C(ρ0, μ̄, A) ∈ (0, ∞) such that:

V∗ − W^N(π∗) = C log N/N + o(log N/N).

Moreover, policy π∗ achieves the optimal regret to leading order, i.e., we have

V∗ − W^N = C log N/N + o(log N/N), or equivalently, W^N − W^N(π∗) = o(log N/N).

We precisely characterize the constant C in the Appendix (see Propositions 8.1 and 8.6).

Proof sketch. The critical ingredient in the proof of the theorem is the following relaxed optimization problem. In this problem there are no capacity constraints, but capacity violations are charged at the prices p∗ from the optimization problem with known client types:

W^N_{p∗} = max_{ξ∈ΞN}  Σ_{i∈C} ρ0(i) Σ_{s∈S} ξ(i, s) A(i, s)  −  Σ_{s∈S} p∗(s) [ Σ_{i∈C} ρ0(i) ξ(i, s) − μ̄(s) ].   (10)
Now, given that there is at least one pair of difficult client types, there is an upper bound on the performance of any policy in this problem, expressed relative to V∗ [1]:

V∗ − W^N_{p∗} ≥ C log N/N + o(log N/N),

where C > 0 is precisely the constant appearing in the theorem. By a standard duality argument, we know that W^N_{p∗} ≥ W^N, and hence this bound holds for W^N as well (see Proposition 8.1). Let W^N_{p∗}(π∗) denote the value attained by the policy π∗ in this problem. Then there are two key steps in the proof.
1. First, we show that our policy π∗ achieves near-optimal performance in problem (10), i.e., V∗ − W^N_{p∗}(π∗) = C log N/N + o(log N/N). This is shown in Proposition 8.2. To gain some intuition for this result, first note that the difference between the values V∗ and W^N_{p∗} is the difference between:

Σ_{i∈C} ρ0(i) max_{s∈S} [A(i, s) − p∗(s)],   (11)

and

max_{ξ∈ΞN} Σ_{i∈C} ρ0(i) Σ_{s∈S} ξ(i, s) [A(i, s) − p∗(s)].   (12)
Now an optimal policy for (12) has to try to choose the optimal server type in S(i) = arg max_{s∈S} [A(i, s) − p∗(s)] for each client type i, but without knowing what the type is; this is exactly problem (7), the unconstrained MAB problem discussed in the example with externality-adjusted payoffs. As we discussed in the example, in order to ensure an absolute regret of O(log N) relative to the problem where the client types are known, for each difficult type i one needs to distinguish that type from all the other types that form a difficult pair with it, with confidence at least 1 − 1/N. Further, there is an optimal "confirmation" policy for making this distinction that appropriately routes jobs to the suboptimal servers S̄(i). This policy strikes the right balance between the average number of jobs required to make these distinctions and the average per-job regret incurred with respect to U(i). The resulting optimal regret is Ci log N for each type i, where Ci > 0 can be exactly characterized; on the event that the true type is i, this regret is unavoidable, which results in the O(log N) regret overall. Now the challenge for any policy that tries to achieve this bound is to attain these type-dependent minimal regrets without a priori knowing the true type. This is where the short guessing phase of length o(log N) helps. After we form an initial guess of the underlying type, the optimal confirmation policy for the guessed type can be used to confirm that type later in the confirmation phase, thus ensuring the optimal regret up to the leading order term. 2.
In the next part of the proof, we show that, using the generalized imbalance condition, for large enough N we can design a routing matrix y∗ (which depends on N) to be used in the exploitation phase of the policy π∗, with the following two properties:

(a) Σ_{i∈C} ρ(i) ξπ∗(i, s) − μ̄(s) = 0 for any server s such that p∗(s) > 0; and
(b) Σ_{i∈C} ρ(i) ξπ∗(i, s) − μ̄(s) ≤ 0 for every other server.

This means that with this choice of y∗, the algorithm ensures that capacity constraints are not violated, and that the servers that were fully utilized in the solution to problem (4), i.e., under the routing matrix x∗, remain fully utilized. This is shown in Proposition 8.4.
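Conditions (a) and (b) amount to a simple feasibility check on the aggregate loads induced by a routing matrix. A minimal sketch of that check, applied to one feasible known-types routing for the Section 6.1 example (the routing values below are our own choice, consistent with the optimum there):

```python
import numpy as np

def check_capacity_conditions(rho, xi, mu, p_star, tol=1e-9):
    """(a): servers with p*(s) > 0 are exactly full;
    (b): all other servers are not overloaded."""
    slack = rho @ xi - mu              # sum_i rho(i) xi(i, s) - mu(s)
    priced = p_star > 0
    return bool(np.all(np.abs(slack[priced]) <= tol)
                and np.all(slack[~priced] <= tol))

rho = np.array([0.5, 0.5, 0.5])
mu = np.array([0.7, 0.7, 0.2])
p_star = np.array([0.0, 0.6, 0.5])
# One feasible optimal routing: a -> server 1; b splits between
# servers 1 and 2; c splits between servers 2 and 3.
xi = np.array([[1.0, 0.0, 0.0],
               [0.2, 0.8, 0.0],
               [0.0, 0.6, 0.4]])
ok = check_capacity_conditions(rho, xi, mu, p_star)
```

Here servers 2 and 3 (the priced servers) carry loads exactly 0.7 and 0.2, while server 1 carries 0.6 < 0.7, so both conditions hold.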
The key point here is that at the end of the confirmation phase of the policy, the type of the client is learned with confidence at least 1 − o(1), and this fact, coupled with the generalized imbalance condition, is sufficient to ensure that an appropriate choice of y∗ will correct (almost) all the deviations from x∗'s capacity utilizations that may have occurred in the guessing and confirmation phases, as well as the deviations that will occur in the exploitation phase owing to the fact that the type is not learned perfectly. One can see this in our example. Suppose that our policy initially guessed that the client type is b, and further that it has distinguished b from a in the confirmation phase. At this point, since b and c have a common optimal server type 2, the policy will not try to distinguish them any further (see Case 3 in the informal description of our algorithm) and enters the exploitation phase, directing all remaining jobs to server 2. But since there is a (vanishing) chance that the true type was actually c, some of whose jobs are supposed to be routed to server 3 under the optimal policy, we make up for the possible under-utilization of server 3 by sending a larger fraction of jobs to server 3 than required by the optimal policy x∗ whenever we actually confirm that the client type is c. The result shows that such corrections can always be done for large enough N, assuming that the generalized imbalance condition holds. These two results together ensure that the policy π∗ is feasible in problem (2), and further, that the rate of payoff accumulation

Σ_{i∈C} ρ0(i) Σ_{s∈S} ξπ∗(i, s) A(i, s)

is near-optimal (since this is precisely W^N_{p∗}(π∗)). The combination of these facts gives us the result.
7
Conclusion
As noted in the introduction, this work suggests a fruitful algorithmic approach with potentially very wide-ranging application, from retail to online matching markets. There are certainly several generalizations of the model that would enrich the class of applications where our analysis applies. Two in particular are important open directions: (1) generalizing the model of types beyond the finite type model considered here (e.g., clients and servers may be characterized by features in a vector-valued space, with compatibility determined by the inner product between feature vectors); and (2) generalizing the model to allow two-sided uncertainty (i.e., where the types of newly arriving servers must also be simultaneously learned). Our conjecture is that the same combination of externality pricing with an algorithmic design that prevents capacity violations should be applicable even in these more general settings, with an appropriate modeling framework. Finally, we note two further directions that are likely to require substantial additional technical analysis beyond this paper. First, recall that we assumed the expected surplus from a match between a client type and server type (i.e., the A matrix) is known to the
platform. This reflects the first-order concern of most platforms, where aggregate knowledge is available but learning individual user types quickly is challenging. It may be of interest to study how A can be efficiently learned by the platform. Our problem appears to be a discrete analog of the problem of learning a mixture of (multidimensional) Gaussians, which can be achieved using polynomial-sized data and polynomial computation [36] (see footnote 12). Second, our model does not contain any analysis of strategic behavior by clients or servers. A first strategic model might consider the fact that clients or servers are less likely to return to the platform after several bad experiences; this would dramatically alter the multi-armed bandit model, and force the algorithm to be more conservative in order to retain users. More generally, users may consider strategic manipulation to, e.g., improve the matches made on their behalf by the algorithm. Both the modeling and analysis of these strategic behaviors remain important challenges for future work.
References

[1] Rajeev Agrawal, Demosthenis Teneketzis, and Venkatachalam Anantharam. 1989a. Asymptotically efficient adaptive allocation schemes for controlled iid processes: finite parameter space. IEEE Transactions on Automatic Control 34, 3 (1989), 258–267.

[2] Rajeev Agrawal, Demosthenis Teneketzis, and Venkatachalam Anantharam. 1989b. Asymptotically efficient adaptive allocation schemes for controlled Markov chains: finite parameter space. IEEE Transactions on Automatic Control 34, 12 (1989), 1249–1259.

[3] Shipra Agrawal and Nikhil R. Devanur. 2014. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation. ACM, 989–1006.

[4] Shipra Agrawal and Nikhil R. Devanur. 2015. Linear contextual bandits with global constraints and objective. arXiv preprint arXiv:1507.06738 (2015).

[5] Shipra Agrawal, Nikhil R. Devanur, and Lihong Li. 2015. Contextual bandits with global constraints and objective. arXiv preprint arXiv:1506.03374 (2015).

[6] Shipra Agrawal and Navin Goyal. 2011. Analysis of Thompson sampling for the multi-armed bandit problem. arXiv preprint arXiv:1111.1797 (2011).

[7] Mohammad Akbarpour, Shengwu Li, and Shayan Oveis Gharan. 2014. Dynamic matching market design. Available at SSRN 2394319 (2014).
Footnote 12: If a fixed number Ns of jobs from each client is sent to each server type s, then the resulting joint distribution of the total payoff realized from each server type s is a mixture (indexed by i) over a product measure of binomials (Binomial(Ns, A(i, s)))_{s∈S}. This should already allow for learning A efficiently, unless the discrete case is very different from its continuous analog (a mixture of spherical Gaussians; e.g., see Dasgupta [24]).
[8] Ross Anderson, Itai Ashlagi, David Gamarnik, and Yash Kanoria. 2015. A dynamic model of barter exchange. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 1925–1933.
[9] Baris Ata, Sunil Kumar, and others. 2005. Heavy traffic analysis of open processing networks with complete resource pooling: asymptotic optimality of discrete review policies. The Annals of Applied Probability 15, 1A (2005), 331–391.
[10] J.-Y. Audibert and R. Munos. 2011. Introduction to Bandits: Algorithms and Theory. In ICML.
[11] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.
[12] Moshe Babaioff, Shaddin Dughmi, Robert Kleinberg, and Aleksandrs Slivkins. 2015. Dynamic pricing with limited supply. ACM Transactions on Economics and Computation 3, 1 (2015), 4.
[13] Mariagiovanna Baccara, SangMok Lee, and Leeat Yariv. 2015. Optimal dynamic matching. Available at SSRN 2641670 (2015).
[14] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Yaron Singer. 2012. Learning on a budget: posted price mechanisms for online procurement. In Proceedings of the 13th ACM Conference on Electronic Commerce. ACM, 128–145.
[15] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. 2013. Bandits with knapsacks. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on. IEEE, 207–216.
[16] Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. 2014. Resourceful contextual bandits. In Proceedings of The 27th Conference on Learning Theory. 1109–1134.
[17] Omar Besbes and Assaf Zeevi. 2009. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research 57, 6 (2009), 1407–1420.
[18] Omar Besbes and Assaf Zeevi. 2012. Blind network revenue management. Operations Research 60, 6 (2012), 1537–1550.
[19] Sébastien Bubeck and Nicolo Cesa-Bianchi. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5, 1 (2012), 1–122.
[20] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. 2011. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics. 208–214.
[21] Jim G. Dai. 1995. On positive Harris recurrence of multiclass queueing networks: a unified approach via fluid limit models. The Annals of Applied Probability (1995), 49–77.
[22] Ettore Damiano and Ricky Lam. 2005. Stability in dynamic matching markets. Games and Economic Behavior 52, 1 (2005), 34–53.
[23] Sanmay Das and Emir Kamenica. 2005. Two-sided bandits and the dating market. In Proceedings of the 19th International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc., 947–952.
[24] Sanjoy Dasgupta. 1999. Learning mixtures of Gaussians. In Foundations of Computer Science, 1999. 40th Annual Symposium on. IEEE, 634–644.
[25] Daniel Fershtman and Alessandro Pavan. 2015. Dynamic matching: experimentation and cross subsidization. Technical Report. Citeseer.
[26] John Gittins, Kevin Glazebrook, and Richard Weber. 2011. Multi-armed Bandit Allocation Indices. John Wiley & Sons.
[27] Ming Hu and Yun Zhou. 2015. Dynamic matching in a two-sided market. Available at SSRN (2015).
[28] Sangram V. Kadam and Maciej H. Kotowski. 2015. Multi-period matching. Technical Report. Harvard University, John F. Kennedy School of Government.
[29] Morimitsu Kurino. 2005. Credibility, efficiency, and stability: A theory of dynamic matching markets. (2005).
[30] Tze Leung Lai and Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 1 (1985), 4–22.
[31] John Langford and Tong Zhang. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems. 817–824.
[32] Constantinos Maglaras and Assaf Zeevi. 2003. Pricing and capacity sizing for systems with shared resources: Approximate solutions and scaling relations. Management Science 49, 8 (2003), 1018–1038.
[33] Constantinos Maglaras and Assaf Zeevi. 2005. Pricing and design of differentiated services: Approximate analysis and structural insights. Operations Research 53, 2 (2005), 242–262.
[34] Laurent Massoulie and Kuang Xu. 2016. On the capacity of information processing systems. (2016). Unpublished.
[35] Aranyak Mehta. 2012. Online matching and ad allocation. Foundations and Trends in Theoretical Computer Science 8, 4 (2012), 265–368.
[36] Ankur Moitra and Gregory Valiant. 2010. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on. IEEE, 93–102.
[37] Denis Sauré and Assaf Zeevi. 2013. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management 15, 3 (2013), 387–404.
[38] Lloyd S. Shapley and Martin Shubik. 1971. The assignment game I: The core. International Journal of Game Theory 1, 1 (1971), 111–130.
[39] Adish Singla and Andreas Krause. 2013. Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In Proceedings of the 22nd International Conference on World Wide Web. 1167–1178.
[40] Zizhuo Wang, Shiming Deng, and Yinyu Ye. 2014. Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research 62, 2 (2014), 318–331.
[41] Huasen Wu, R. Srikant, Xin Liu, and Chong Jiang. 2015. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In Advances in Neural Information Processing Systems. 433–441.
8 Appendix

8.1 Proof of Theorem 6.1
We will first show the following lower bound on the difference between $NV^*$ and $W^N$.

Proposition 8.1. Suppose that $P^*$ is a singleton, and $\{p^*(s)\}$ are the unique optimal prices in the matching problem with known client types. Then,
$$\limsup_{N\to\infty} \frac{NV^* - W^N}{\log N} \;\geq\; \sum_{i\in\mathcal{C}} \rho_0(i) \min_{\alpha\in\Delta(\mathcal{S}(i))} \max_{i'\in\mathcal{C}_d(i)} \frac{U(i) - \sum_{s\in\mathcal{S}(i)} \alpha_s [A(i,s) - p^*(s)]}{\sum_{s\in\mathcal{S}(i)} \alpha_s I(i,i'|s)}.$$
Proof. Consider the following relaxed problem:
$$W_{p^*}^N = \max_{\xi\in\Xi^N} \sum_{i\in\mathcal{C}} \rho(i) \sum_{s\in\mathcal{S}} \xi(i,s)A(i,s) - \sum_{s\in\mathcal{S}} p^*(s)\Big[\sum_{i\in\mathcal{C}} \rho(i)\xi(i,s) - \bar{\mu}(s)\Big]. \quad (13)$$
By a standard duality argument, we know that $W_{p^*}^N \geq W^N$. The optimal policy in this problem is a solution to
$$\mathrm{maximize}_{\pi\in\Pi^N} \; \sum_{i\in\mathcal{C},\, s\in\mathcal{S}} \rho(i)\,\xi_{\pi}(i,s)\big(A(i,s) - p^*(s)\big). \quad (14)$$
Then from Theorem 3.1 in [1], we know that
$$\limsup_{N\to\infty} \frac{NV^* - W_{p^*}^N}{\log N} \;\geq\; \sum_{i\in\mathcal{C}} \rho_0(i) \min_{\alpha\in\Delta(\mathcal{S}(i))} \max_{i'\in\mathcal{C}_d(i)} \frac{U(i) - \sum_{s\in\mathcal{S}(i)} \alpha_s [A(i,s) - p^*(s)]}{\sum_{s\in\mathcal{S}(i)} \alpha_s I(i,i'|s)}.$$
The result then follows from the fact that $W^N \leq W_{p^*}^N$.

Let $W_{p^*}^N(\pi^*)$ be the value attained by the defined policy in optimization problem (10). We will prove an upper bound on the difference between $NV^*$ and $W_{p^*}^N(\pi^*)$. Note that the difference between the values of the two problems is the same as the difference between
$$\sum_{i\in\mathcal{C}} \rho_0(i) \sum_{s\in\mathcal{S}} \xi_{\pi^*}(i,s)\big(A(i,s) - p^*(s)\big) \quad\text{and}\quad \sum_{i\in\mathcal{C}} \rho_0(i)U(i).$$
The result is the following.

Proposition 8.2. Consider the policy $\pi^*$ such that the routing matrix $y$ used in the exploitation phase satisfies $y(i,\cdot) \in \Delta(\mathcal{S}(i))$. Then,
$$\limsup_{N\to\infty} \frac{NV^* - W_{p^*}^N(\pi^*)}{\log N} \;\leq\; \sum_{i\in\mathcal{C}} \rho_0(i) \min_{\alpha\in\Delta(\mathcal{S}(i))} \max_{i'\in\mathcal{C}_d(i)} \frac{U(i) - \sum_{s\in\mathcal{S}(i)} \alpha_s [A(i,s) - p^*(s)]}{\sum_{s\in\mathcal{S}(i)} \alpha_s I(i,i'|s)}.$$
In order to prove this proposition, we need the following result, which follows from Theorem 4.1 in [1]. For two numbers $a$ and $b$, let $a \wedge b \triangleq \min(a,b)$.

Lemma 8.3. Let $X_1, X_2, \ldots$ be i.i.d. random variables, where $X_n$ is the outcome of choosing a server type $s \in \mathcal{S}$ according to a distribution $\alpha \in \Delta(\mathcal{S})$. Suppose $i \in \mathcal{C}$ and $B \subset \mathcal{C}$ are such that
$$\sum_{s\in\mathcal{S}} \alpha_s E_i\Big[\log\frac{p(X,i,s)}{p(X,i',s)}\Big] = \sum_{s\in\mathcal{S}} \alpha_s I(i,i'|s) > 0$$
for each $i' \in B$. Let $\Lambda_n^B(i) = \min_{i'\in B} \Lambda_n(i,i')$. Then:
1. $$\limsup_{N\to\infty} \frac{E_i\big[\inf\{n \geq 0 : \Lambda_n^B(i) \geq Q_N\} \wedge N\big]}{\log N} \;\leq\; \frac{1}{\min_{i'\in B} \sum_{s\in\mathcal{S}} \alpha_s I(i,i'|s)}\,,$$
2. $$P_{i'}\big(\Lambda_n^B(i) \geq Q_N \text{ for some } n \leq N\big) \;\leq\; \frac{1}{Q_N} \quad \text{for each } i' \in B.$$
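Lemma 8.3 is the engine behind the confirmation phases: jobs are routed according to $\alpha$, and the likelihood ratio $\Lambda_n(i,i')$ against each alternative is tracked until the smallest of them clears the threshold $Q_N$. A minimal Python sketch of this stopping rule, assuming Bernoulli payoffs with success probabilities `A[i][s]` and tracking $\log \Lambda_n$ for numerical stability (`confirm_type`, `serve`, and the dictionary layout are illustrative assumptions, not the paper's implementation):

```python
import math
import random

def confirm_type(i_hat, alts, A, alpha, Q_N, max_jobs, serve):
    """Run jobs until Lambda_n^B(i_hat) = min_{i' in alts} Lambda_n(i_hat, i')
    exceeds Q_N (equivalently, its log exceeds log Q_N), or the budget
    max_jobs is exhausted. serve(s) returns the Bernoulli payoff of one job on
    server type s; A[i][s] is the success probability under type i."""
    log_lam = {ip: 0.0 for ip in alts}   # log Lambda_n(i_hat, i') per alternative
    servers = list(alpha)
    weights = [alpha[s] for s in servers]
    for n in range(1, max_jobs + 1):
        s = random.choices(servers, weights)[0]   # sample a server type from alpha
        x = serve(s)                              # observe a payoff in {0, 1}
        for ip in alts:                           # update each log-likelihood ratio
            p1 = A[i_hat][s] if x else 1 - A[i_hat][s]
            p0 = A[ip][s] if x else 1 - A[ip][s]
            log_lam[ip] += math.log(p1 / p0)
        if min(log_lam.values()) >= math.log(Q_N):  # every alternative rejected
            return True, n                          # confirmed after n jobs
    return False, max_jobs                          # budget exhausted, not confirmed
```

Consistent with part 1 of the lemma, the expected number of jobs before confirmation scales like $\log Q_N$ divided by the smallest weighted KL divergence $\sum_s \alpha_s I(i,i'|s)$ over the alternatives.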
We are now ready to prove Proposition 8.2.

Proof. Let $X$ be the random variable denoting the true type of the client. Let $R(i)$ denote the expected absolute regret on the event $\{X=i\}$, defined as
$$R(i) = N \max_{s\in\mathcal{S}}\big[A(i,s) - p^*(s)\big] - N \sum_{s\in\mathcal{S}} \xi_{\pi^*}(i,s)\big[A(i,s) - p^*(s)\big].$$
We will refer to this quantity simply as regret; we want to find an upper bound on $R(i)$.

First consider the event that the guessing phase guessed the type $i$ correctly. By a standard application of the Hoeffding bound, if $i^*$ is the guessed type at the end of the guessing phase, then
$$P(i^* \neq i) \leq \exp(-\sqrt{\log N}).$$
Let $A_i$ be the event that the guessing phase guessed the type $i$ correctly. On the event $\{X=i\} \cap A_i$, confirmation phase A fails to confirm $i$, and we go back to the guessing phase, if
$$\Lambda_n^{\mathcal{C}_d(i)}(i) \leq \frac{1}{Q_N}\,,$$
and confirmation phase B fails to confirm $i$, and we go back to the guessing phase, if
$$\Lambda_n^{\mathcal{C}_e(i)}(i) \leq \frac{1}{Q_N}\,,$$
the probability of each of which is bounded above by $1/Q_N$ by Lemma 8.3. Hence the probability that either of the two phases fails to confirm $i$ is bounded above by $2/Q_N$. Regret is incurred in the confirmation phase at the expected rate of
$$\sum_{s\in\mathcal{S}(i)} \alpha_s \big([A(i,s(i)) - p^*(s(i))] - [A(i,s) - p^*(s)]\big),$$
and the expected duration of the confirmation phase, regardless of whether it confirmed $i$ or not, is bounded above by $E_i\big[\inf\{n \geq 0 : \Lambda_n^{\mathcal{C}_d(i)}(i) \geq Q_N\} \wedge N\big]$. Thus on the event $\{X=i\} \cap A_i$, the expected regret is upper bounded by
$$O(\sqrt{\log N}) + \frac{2}{Q_N}R(i) + \sum_{s\in\mathcal{S}(i)} \alpha_s\big([A(i,s(i)) - p^*(s(i))] - [A(i,s) - p^*(s)]\big)\, E_i\big[\inf\{n \geq 0 : \Lambda_n^{\mathcal{C}_d(i)}(i) \geq Q_N\} \wedge N\big].$$
Now let $A_i^c$ be the event that the guessing phase did not guess the type $i$ correctly, and consider the event $\{X=i\} \cap A_i^c$. Let $i^*$ be the type that was guessed instead. If $i \in \mathcal{C}_d(i^*)$, then from Lemma 8.3 the confirmation phase fails to reject $i^*$ with probability at most $1/Q_N$, in which case one incurs a regret of $O(N)$; otherwise we return to the guessing phase after an expected time of at most
$$E_i\Big[\inf\Big\{n \geq 0 : \Lambda_n(i^*,i) \leq \frac{1}{Q_N}\Big\} \wedge N\Big].$$
In the case that $i \notin \mathcal{C}_d(i^*)$, there are two possibilities:

• One possibility is that $i \in \mathcal{C}_e(i^*)$. If $i$ is such that $A(i,s) \neq A(i^*,s)$ for some $s \in \mathcal{S}(i^*)$ with $\alpha_s^* > 0$, then the expected time spent in confirmation phase A is at most
$$E_i\Big[\inf\Big\{n \geq 0 : \Lambda_n(i^*,i) \leq \frac{1}{Q_N}\Big\} \wedge N\Big] = O(\log N)$$
by Lemma 8.3. Else, if $A(i,s) = A(i^*,s)$ for all $s \in \mathcal{S}(i^*)$ with $\alpha_s^* > 0$, then the expected time spent in confirmation phase A is at most
$$E_{i^*}\big[\inf\{n \geq 0 : \Lambda_n^{\mathcal{C}_d(i^*)}(i^*) \geq Q_N\} \wedge N\big] = O(\log N),$$
since in this case the likelihoods of events in confirmation phase A under $i$ and $i^*$ are identical. Thus in either case, the expected time spent in confirmation phase A is $O(\log N)$. Also, in either case, the expected time spent in confirmation phase B is at most
$$E_i\Big[\inf\Big\{n \geq 0 : \Lambda_n(i^*,i) \leq \frac{1}{Q_N}\Big\} \wedge N\Big] = O(\log N).$$
Further, both the confirmation phase and the pre-exploitation phase fail to reject $i^*$ with probability at most $1/Q_N$, in which case one incurs a regret of $O(N)$; otherwise we return to the guessing phase.

• The other possibility is that $i \in \mathcal{C} \setminus \big(\mathcal{C}_e(i^*) \cup \mathcal{C}_d(i^*) \cup \{i^*\}\big)$. In this case, $A(i,s) = A(i^*,s)$ for all $s \in \mathcal{S}(i^*)$, and further $\mathcal{S}(i^*) \cap \mathcal{S}(i) \neq \emptyset$ (Case 3 in the description of $\pi^*$). This implies that in fact $\mathcal{S}(i^*) \subseteq \mathcal{S}(i)$, and hence the optimal allocation policy for $i$ incurs no regret if the guessed type is $i^*$. Then either $i^*$ is rejected in the confirmation phase and we return to the guessing phase, or $i^*$ is not rejected, in which case no further regret is incurred, since the optimal policy for $i^*$ is optimal for $i$ as well. If $i$ is such that $A(i,s) \neq A(i^*,s)$ for some $s \in \mathcal{S}(i^*)$ with $\alpha_s^* > 0$, then the expected time spent in confirmation phase A is at most
$$E_i\Big[\inf\Big\{n \geq 0 : \Lambda_n(i^*,i) \leq \frac{1}{Q_N}\Big\} \wedge N\Big] = O(\log N).$$
Else, if $A(i,s) = A(i^*,s)$ for all $s \in \mathcal{S}(i^*)$ with $\alpha_s^* > 0$, then the expected time spent in confirmation phase A is at most
$$E_{i^*}\big[\inf\{n \geq 0 : \Lambda_n^{\mathcal{C}_d(i^*)}(i^*) \geq Q_N\} \wedge N\big] = O(\log N).$$
No regret is incurred in confirmation phase B (or the exploitation phase), so its duration does not matter. In either of the two cases, the confirmation phases A and B together fail to confirm $i^*$, and we return to the guessing phase, with probability no more than $1/Q_N$.

Thus overall, on the event $\{X=i\} \cap A_i^c$, the expected regret is bounded above by
$$O(\sqrt{\log N}) + O(\log N) + \frac{1}{Q_N}\big(R(i) + O(N)\big).$$
Hence, on the event $\{X=i\}$, the total expected regret $R(i)$ is bounded above as:
$$R(i) \leq O(\sqrt{\log N}) + \sum_{s\in\mathcal{S}(i)} \alpha_s\big([A(i,s(i)) - p^*(s(i))] - [A(i,s) - p^*(s)]\big)\, E_i\big[\inf\{n \geq 0 : \Lambda_n^{\mathcal{C}_d(i)}(i) \geq Q_N\} \wedge N\big]$$
$$+\; \frac{1}{Q_N}R(i) + \exp(-\sqrt{\log N})\Big(O(\log N) + R(i) + \frac{1}{Q_N}O(N)\Big). \quad (15)$$
Now, as a direct consequence of Lemma 8.3, the following holds for confirmation phase A:
$$\limsup_{N\to\infty} \frac{E_i\big[\inf\{n \geq 0 : \Lambda_n^{\mathcal{C}_d(i)}(i) \geq Q_N\} \wedge N\big]}{\log N} \;\leq\; \frac{1}{\min_{i'\in\mathcal{C}_d(i)} \sum_{s\in\mathcal{S}(i)} \alpha_s^* I(i,i'|s)}. \quad (16)$$
This is because $\alpha^*$ is such that $\min_{i'\in\mathcal{C}_d(i)} \sum_{s\in\mathcal{S}(i)} \alpha_s^* I(i,i'|s) > 0$: if
$$\min_{i'\in\mathcal{C}_d(i)} \sum_{s\in\mathcal{S}(i)} \alpha_s^* I(i,i'|s) = 0,$$
then it would follow that for some $i' \in \mathcal{C}_d(i)$, $A(i,s) = A(i',s)$ for all $s \in \mathcal{S}$, which means that for this $i'$, every $s \in \mathcal{S}(i)$ satisfies $s \in \mathcal{S}(i')$, contradicting the definition of $\mathcal{C}_d(i)$. Thus we have
$$\limsup_{N\to\infty} \frac{R(i)}{\log N} \;\leq\; \min_{\alpha\in\Delta(\mathcal{S}(i))} \max_{i'\in\mathcal{C}_d(i)} \frac{U(i) - \sum_{s\in\mathcal{S}(i)} \alpha_s [A(i,s) - p^*(s)]}{\sum_{s\in\mathcal{S}(i)} \alpha_s I(i,i'|s)}.$$
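For completeness, here is a sketch of the algebra taking the recursion (15) to the bound on $R(i)$; the grouping of terms is our reconstruction of the omitted step:

```latex
% Collecting the R(i) terms in (15):
\Big(1 - \frac{1}{Q_N} - e^{-\sqrt{\log N}}\Big) R(i)
  \;\le\; O(\sqrt{\log N})
  \;+\; \sum_{s\in\mathcal{S}(i)} \alpha^*_s\big([A(i,s(i))-p^*(s(i))]-[A(i,s)-p^*(s)]\big)\,
        E_i\big[\inf\{n \ge 0 : \Lambda_n^{\mathcal{C}_d(i)}(i)\ge Q_N\}\wedge N\big]
  \;+\; e^{-\sqrt{\log N}}\Big(O(\log N)+\frac{O(N)}{Q_N}\Big).
```

Dividing by $\log N$, the prefactor on the left tends to $1$, the first and last terms are $o(\log N)$ provided $Q_N$ grows fast enough that $N e^{-\sqrt{\log N}}/Q_N = o(\log N)$ (an assumption on the threshold schedule), and the middle term is controlled by (16), yielding the stated ratio.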
Proposition 8.4. Suppose that the generalized imbalance condition is satisfied. Consider any optimal routing matrix $x^*$. Then in the policy $\pi^*$ for the $N$-job problem, for any $N$ large enough, one can choose a routing matrix $y^*$ such that $y^*(i,\cdot) \in \Delta(\mathcal{S}(i))$ and that satisfies:
1. $\sum_{i\in\mathcal{C}} \rho_0(i)\xi_{\pi^*}(i,s) = \bar{\mu}(s)$ for any $s$ such that $\sum_{i\in\mathcal{C}} \rho_0(i)x^*(i,s) = \bar{\mu}(s)$, and
2. $\sum_{i\in\mathcal{C}} \rho_0(i)\xi_{\pi^*}(i,s) < \bar{\mu}(s)$ for any other $s$.

We remark that the $y^*$ we construct satisfies $\|y^* - x^*\| = o(1)$. In order to prove this proposition, we will need the following lemma.

Lemma 8.5. Suppose that the generalized imbalance condition in (1) is satisfied. Consider any feasible routing matrix $[x(i,s)]_{\mathcal{C}\times\mathcal{S}}$, and any server type $s$ such that $\sum_{i\in\mathcal{C}} \rho_0(i)x(i,s) = \bar{\mu}(s)$. Then there is a path with the following properties:
• One end point is server type $s$.
• The other end point is a server type whose capacity is under-utilized (it is permitted to be κ).
• Every client type and server type on the path in between is operating at capacity (for a client type, all its jobs are being served).
• For every undirected edge on the path, there is a positive rate of jobs routed on that edge in $x$.

Proof. Consider a bipartite graph with server types as nodes on one side and client types on the other, with an edge between a server type $s'$ and a client type $i$ if $x(i,s') > 0$. Consider the connected component of server type $s$ in this graph, and suppose it includes no server type that is under-utilized. Then the arrival rate of jobs from the client types in the connected component exactly matches the total effective service rate of the server types in the component. But this contradicts the generalized imbalance condition. Hence there exists an under-utilized server type $s'$ that can be reached from $s$. Take any path from $s$ to $s'$, traverse it starting from $s$, and terminate it the first time it hits any under-utilized server type.

Proof of Proposition 8.4.
For a given routing matrix $[y(i,s)]_{\mathcal{C}\times\mathcal{S}}$, let $\xi_{\pi^*(y)}(i,s)$ be the resulting fraction of jobs directed from client type $i$ to server type $s$. In the course of this proof, we will suppress the subscript $\pi^*(y)$. Clearly, there exist $\varepsilon_{i'}(i,s)$ for each $i \in \mathcal{C}$, $i' \in \mathcal{C}\cup\{0\}$, $s \in \mathcal{S}$ such that
$$\xi(i,s) = \varepsilon_0(i,s) + \big(1 - \varepsilon_i(i,s)\big)y(i,s) + \sum_{i'\in\mathcal{C}\setminus\{i\}} \varepsilon_{i'}(i,s)\,y(i',s). \quad (17)$$
The $\varepsilon$'s depend on the guessing and confirmation phases but not on $y$. (In particular, $\varepsilon_0$ arises from the overall routing contribution of the guessing and confirmation phases, and the $\varepsilon_{i'}$'s arise from the small likelihood that a client who is confirmed as type $i$ is actually some other type.) A key fact that we will use is that all $\varepsilon$'s are uniformly bounded by $o(1)$.

Let $S_{x^*} = \{s : \sum_{i\in\mathcal{C}} \rho_0(i)x^*(i,s) = \bar{\mu}(s)\}$ and $S_{\pi^*} = \{s : \sum_{i\in\mathcal{C}} \rho_0(i)\xi_{\pi^*}(i,s) = \bar{\mu}(s)\}$. We want to find a $y$ such that $y(i,\cdot) \in \Delta(\mathcal{S}(i))$ for all $i \in \mathcal{C}$ (call $(i,s)$ a "permissible edge" in the bipartite graph between clients and servers if $s \in \mathcal{S}(i)$), and such that:
• For each $s \in S_{x^*}$ we also have $s \in S_{\pi^*}$, i.e., $S_{x^*} \subseteq S_{\pi^*}$.
• $\|y - x^*\| = o(1)$.
Note that the two bullets together imply the proposition, since $\|\xi - x^*\| = o(1)$ from Eq. (17), and this leads to $\sum_{i\in\mathcal{C}} \rho_0(i)\xi(i,s) = \sum_{i\in\mathcal{C}} \rho_0(i)x^*(i,s) + o(1) < \bar{\mu}(s)$ for all $s \in \mathcal{S}\setminus S_{x^*}$, for large enough $N$.

The requirement in the first bullet can be written as a set of linear equations using Eq. (17). Here we write $y$ (and later also $x^*$) as a column vector with $|\mathcal{C}||\mathcal{S}|$ elements:
$$By + \hat{\varepsilon} = (\bar{\mu}(s))_{s\in S_{x^*}}.$$
Here $\|\hat{\varepsilon}\| = o(1)$, and the matrix $B$ can be written as $B = B_0 + B_\varepsilon$, where $B_0$ has 1's in the columns corresponding to dimensions $(\cdot,s)$ and 0's everywhere else, and $\|B_\varepsilon\| = o(1)$. Expressing $y$ as $y = x^* + z$, we are left with the following equation for $z$:
$$Bz = -(B_\varepsilon x^* + \hat{\varepsilon}), \quad (18)$$
using the fact that $B_0 x^* = (\bar{\mu}(s))_{s\in S_{x^*}}$ by the definitions of $B_0$ and $S_{x^*}$. We will look for a solution to this underdetermined set of equations with a specific structure: we want $z$ to be a linear combination of flows along the $|S_{x^*}|$ paths obtained from Lemma 8.5, one path $\lambda_s$ for each $s \in S_{x^*}$. Each $\lambda_s$ can be written as a column vector with $+1$'s on the odd edges (including the edge incident on $s$) and $-1$'s on the even edges. Let $\Lambda = [\lambda_s]_{s\in S_{x^*}}$ be the path matrix. Then $z$ with the desired structure can be expressed as $\Lambda\eta$, where $\eta$ is the vector of flows along each of the paths. Now note that $Bz = (B_0 + B_\varepsilon)\Lambda\eta = (I + B_\varepsilon\Lambda)\eta$; here we deduced $B_0\Lambda = I$ from the fact that $\lambda_s$ is a path with $s$ as one end point, and a client type or a server type not in $S_{x^*}$ as the other end point. Our system of equations thus reduces to
$$(I + B_\varepsilon\Lambda)\eta = -(B_\varepsilon x^* + \hat{\varepsilon}).$$
Since $\|B_\varepsilon\| = o(1)$, the coefficient matrix is $o(1)$ away from the identity and hence invertible, and we deduce that this system of equations has a unique solution $\eta^*$ satisfying $\|\eta^*\| = o(1)$. This yields $z^* = \Lambda\eta^*$, which is also of size $o(1)$ and supported on permissible edges, since each of the paths is supported on permissible edges (Lemma 8.5). Thus we finally obtain $y^* = x^* + z^*$, possessing all the desired properties. Notice that the (permissible) edges on which $y^*$ differs from $x^*$ had strictly positive values in $x^*$ by Lemma 8.5, and hence this is also the case in $y^*$ for large enough $N$.
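The path construction of Lemma 8.5 is simply a graph search through positive-flow edges of the bipartite routing graph. A simplified Python sketch (`find_relief_path` is a hypothetical name; it stops at the first under-utilized server type, so intermediate server types are saturated, but it does not check that intermediate client types are fully served):

```python
from collections import deque

def find_relief_path(x, rho0, mu_bar, start):
    """BFS from a saturated server type `start` through edges carrying positive
    flow (x[i][s] > 0) to an under-utilized server type. Returns the alternating
    server/client path as (kind, name) pairs, or None if no such path exists
    (impossible under generalized imbalance, per Lemma 8.5)."""
    # realized load on each server type under routing x and arrival rates rho0
    load = {s: sum(rho0[i] * x[i].get(s, 0.0) for i in x) for s in mu_bar}
    parent = {('server', start): None}
    queue = deque([('server', start)])
    while queue:
        node = queue.popleft()
        kind, name = node
        if kind == 'server' and name != start and load[name] < mu_bar[name] - 1e-9:
            path, cur = [], node           # under-utilized server reached:
            while cur is not None:         # walk parents back to `start`
                path.append(cur)
                cur = parent[cur]
            return list(reversed(path))
        if kind == 'server':               # neighbors: clients routing flow to it
            nbrs = [('client', i) for i in x if x[i].get(name, 0.0) > 0]
        else:                              # neighbors: servers it routes flow to
            nbrs = [('server', s) for s, v in x[name].items() if v > 0]
        for nb in nbrs:
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return None
```

Pushing a small flow $\eta$ alternately along such a path (adding on odd edges, subtracting on even ones) changes the load only at its two endpoints, which is exactly the role the vectors $\lambda_s$ play in the proof above.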
Proposition 8.6. Suppose that the generalized imbalance condition is satisfied. Consider the policy $\pi^*$ with the routing matrix $y^*$, and let $W^N(\pi^*)$ be the value attained by this policy in optimization problem (2). Then
$$\limsup_{N\to\infty} \frac{NV^* - W^N(\pi^*)}{\log N} \;\leq\; \sum_{i\in\mathcal{C}} \rho_0(i) \min_{\alpha\in\Delta(\mathcal{S}(i))} \max_{i'\in\mathcal{C}_d(i)} \frac{U(i) - \sum_{s\in\mathcal{S}(i)} \alpha_s [A(i,s) - p^*(s)]}{\sum_{s\in\mathcal{S}(i)} \alpha_s I(i,i'|s)}.$$

Proof. From Proposition 8.4 it follows that the policy $\pi^*$ is feasible in problem (2), and further
$$W_{p^*}^N(\pi^*) = \sum_{i\in\mathcal{C}} \rho(i) \sum_{s\in\mathcal{S}} \xi_{\pi^*}(i,s)A(i,s) - \sum_{s\in\mathcal{S}} p^*(s)\Big[\sum_{i\in\mathcal{C}} \rho(i)\xi_{\pi^*}(i,s) - \bar{\mu}(s)\Big] \quad (19)$$
$$= \sum_{i\in\mathcal{C}} \rho(i) \sum_{s\in\mathcal{S}} \xi_{\pi^*}(i,s)A(i,s), \quad (20)$$
where the second equality follows from the fact that if $p^*(s) > 0$, then $\sum_{i\in\mathcal{C}} \rho(i)x^*(i,s) - \bar{\mu}(s) = 0$ by complementary slackness, and hence from Proposition 8.4 we obtain that $\sum_{i\in\mathcal{C}} \rho(i)\xi_{\pi^*}(i,s) - \bar{\mu}(s) = 0$ as well for these $s$. Thus the policy $\pi^*$ is feasible and accumulates payoff at rate $W_{p^*}^N(\pi^*)$ in problem (2), and the result follows from Proposition 8.2.
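The shadow prices $p^*(s)$ used throughout these bounds are the optimal dual variables of the known-types matching LP. As an illustration (a sketch, not the paper's code; `dual_prices` is a hypothetical helper), the dual from the proof of Proposition 5.1 below can be solved directly as a small linear program:

```python
import numpy as np
from scipy.optimize import linprog

def dual_prices(A, rho0, mu_bar):
    """Solve: minimize sum_s mu_bar[s] p(s) + sum_i rho0[i] v(i)
    subject to p(s) + v(i) >= A[i, s], p >= 0, v >= 0.
    Variables are ordered as [p(s_1..s_S), v(i_1..i_C)]."""
    C, S = A.shape
    c = np.concatenate([mu_bar, rho0])        # objective: server prices, client values
    # one constraint -p(s) - v(i) <= -A[i, s] per (i, s) pair
    A_ub = np.zeros((C * S, S + C))
    b_ub = np.zeros(C * S)
    for i in range(C):
        for s in range(S):
            r = i * S + s
            A_ub[r, s] = -1.0                 # coefficient of p(s)
            A_ub[r, S + i] = -1.0             # coefficient of v(i)
            b_ub[r] = -A[i, s]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x[:S], res.x[S:]               # (prices p, values v)
```

By strong LP duality the optimal dual objective equals the optimal rate of payoff in the known-types matching problem, and complementary slackness ties $p^*(s) > 0$ to server types whose capacity is tight, exactly as used in (20) above.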
8.2 Other proofs
Proof of Proposition 5.1. Denote the effective service rate of server type $s$ by $\bar{\mu}(s) = n(s)\mu(s)$. The dual to problem (4) can be written as
$$\begin{aligned} \text{minimize} \quad & \sum_{s\in\mathcal{S}} \bar{\mu}(s)p(s) + \sum_{i\in\mathcal{C}} \rho_0(i)v(i) \\ \text{subject to} \quad & p(s) + v(i) \geq A(i,s) \quad \forall i \in \mathcal{C},\, s \in \mathcal{S}, \\ & p(s) \geq 0 \quad \forall s \in \mathcal{S}, \\ & v(i) \geq 0 \quad \forall i \in \mathcal{C}. \end{aligned}$$
The dual variables are $(P,V)$, where the "server prices" are $P = (p(s))_{s\in\mathcal{S}}$ and the "client values" are $V = (v(i))_{i\in\mathcal{C}}$.

We will prove the result by contradiction. Suppose there are multiple dual optima, and let $D$ be the set of dual optima. Let $S'$ be the set of server types whose prices take multiple values in $D$. Formally,
$$S' = \{s \in \mathcal{S} : p(s) \text{ takes multiple values in } D\}. \quad (21)$$
Similarly, let $C'$ be the set of client types whose values take multiple values in $D$. Formally,
$$C' = \{i \in \mathcal{C} : v(i) \text{ takes multiple values in } D\}. \quad (22)$$
For each $s \in S'$, we immediately deduce that there exists a dual optimum with $p(s) > 0$, and hence the capacity constraint of server type $s$ is tight in all primal optima. Similarly, we deduce that for each $i \in C'$, all arriving jobs from client type $i$ are served, i.e., $\sum_{s\in\mathcal{S}} x(i,s) = 1$. By assumption, we have
$$\sum_{i\in C'} \rho_0(i) \neq \sum_{s\in S'} \bar{\mu}(s).$$
Suppose the left-hand side is larger than the right (the complementary case can be handled similarly). Take any primal optimum $x^*$. The server types in $S'$ do not have enough capacity to serve all clients in $C'$, hence there must be some client type $i \in C'$ and a server type $s \notin S'$ such that $x^*(i,s) > 0$. Since $s \notin S'$, the price $p(s)$ takes a unique optimal value in $D$; call this value $p^*(s)$. Let the largest and smallest values of $v(i)$ in $D$ be $v^{\max}(i)$ and $v^{\min}(i)$. By complementary slackness, we know that
$$v^{\max}(i) + p^*(s) = A(i,s) = v^{\min}(i) + p^*(s) \;\Rightarrow\; v^{\max}(i) = v^{\min}(i).$$
But since $i \in C'$ we must have $v^{\max}(i) > v^{\min}(i)$. Thus we have obtained a contradiction.

Proposition 8.7. Suppose that no two rows in $A$ are identical. Then
$$\sup_{x\in D}\, \inf_{y\in\Xi^N} \|x - y\| = O\Big(\frac{\log N}{N}\Big).$$

Proof. It is clear that $\Xi^N \subseteq D$. We will find an inner approximation $\tilde{\Xi}^N$ to $\Xi^N$, i.e., $\tilde{\Xi}^N \subseteq \Xi^N$, such that $\tilde{\Xi}^N$ converges to $D$ in an appropriate sense as $N$ goes to infinity. To define this approximation, suppose that in the learning problem corresponding to a fixed $N$, one starts off with a learning phase of a fixed length $O(\log N)$, in which each server type $s$ is presented to the client $O_s$ times (where $O_s = O(\log N)$ is fixed a priori), so that after this phase the type of the client is known with a probability of error at most $O(1/N)$. This allows us to relate the problem to the one in which the client type is known. Suppose that after this phase, the probability that a client of type $b$ is correctly identified is $p(b)$, and the probability that she is misidentified as some other type $b'$ is $p(b,b')$. Note that since no two rows in $A$ are identical, $p(b,b') = O(1/N)$ for all $b \neq b'$. Let $d(b,s)$ denote the expected number of times that a client who has been identified as being of type $b$ (correctly or incorrectly) is directed towards server type $s$ after the learning phase, i.e., from job $O_s + 1$ till the $N$-th job, and let $\bar{d}(b,s) = d(b,s)/N$. Then one can attain every $\xi$ in the following set:
$$\tilde{\Xi}^N = \Big\{\xi \in \mathbb{R}^{|B|\times|S|} : \bar{d}(b,s) \geq 0;\ \sum_{s\in\mathcal{S}} \xi(b,s) = 1; \quad (23)$$
$$\xi(b,s) = \frac{O_s}{N} + p(b)\bar{d}(b,s) + \sum_{b'\neq b} p(b,b')\bar{d}(b',s)\Big\}. \quad (24)$$
Now, since $\bar{d}(b,s) \leq 1$ and $p(b,b') = O(1/N)$, we can express the above set as
$$\tilde{\Xi}^N = \Big\{\xi \in \mathbb{R}^{|B|\times|S|} : \bar{d}(b,s) \geq 0;\ \sum_{s\in\mathcal{S}} \xi(b,s) = 1; \quad (25)$$
$$\xi(b,s) = \bar{d}(b,s) + O\Big(\frac{\log N}{N}\Big)\Big\}. \quad (26)$$
This in turn is the same as
$$\tilde{\Xi}^N = \Big\{\xi \in \mathbb{R}^{|B|\times|S|} : \xi(b,s) \geq O\Big(\frac{\log N}{N}\Big);\ \sum_{s\in\mathcal{S}} \xi(b,s) = 1\Big\}. \quad (27)$$
Note that by construction, $\tilde{\Xi}^N \subseteq \Xi^N$. But we can now see that $\tilde{\Xi}^N$ converges to $D$ in the sense that
$$\sup_{x\in D}\, \inf_{y\in\tilde{\Xi}^N} \|x - y\| = O\Big(\frac{\log N}{N}\Big),$$
and hence
$$\sup_{x\in D}\, \inf_{y\in\Xi^N} \|x - y\| = O\Big(\frac{\log N}{N}\Big)$$
as well.

Proposition 8.8. The set $\Xi^N$ is a convex polytope.

Proof. For the purpose of this proof, let
$$\bar{\Xi}^N = \{N\xi : \xi \in \Xi^N\}.$$
We will show that $\bar{\Xi}^N$ is a polytope, from which the result will follow, using an induction argument. We represent each point in $\bar{\Xi}^N$ as a $|\mathcal{C}|\times|\mathcal{S}|$ matrix $(\xi(i,s))_{|\mathcal{C}|\times|\mathcal{S}|}$. Let the client types in $\mathcal{C}$ be labeled $i_1, \ldots, i_{|\mathcal{C}|}$ and the server types in $\mathcal{S}$ be labeled $s_1, \ldots, s_{|\mathcal{S}|}$.

Now clearly $\bar{\Xi}^0 = \{(0)_{|\mathcal{C}|\times|\mathcal{S}|}\}$, which is a convex polytope. We will show that if $\bar{\Xi}^N$ is a convex polytope, then $\bar{\Xi}^{N+1}$ is one as well, and hence the result will follow. To do so, we decompose the assignment problem with $N+1$ jobs into the first job and the remaining $N$ jobs. A policy in the $(N+1)$-jobs problem is a choice of a randomization over the server types in $\mathcal{S}$ for the first job and, depending on which server type was chosen and whether a reward was obtained, a choice of a point in $\bar{\Xi}^N$ to be achieved over the remaining $N$ jobs. Each such policy gives a point in $\bar{\Xi}^{N+1}$. Suppose that $\eta_1 \in \Delta(\mathcal{S})$ is the randomization chosen for job 1, and let $R(s,1) \in \bar{\Xi}^N$ and $R(s,0) \in \bar{\Xi}^N$ be the points chosen to be achieved from job 2 onwards, depending on the server type $s$ that was chosen and on whether a reward was obtained; i.e., $R(\cdot,\cdot)$ is a mapping from $\mathcal{S}\times\{0,1\}$ to the set $\bar{\Xi}^N$. Then this policy achieves the following point in the $(N+1)$-jobs problem:
$$\begin{pmatrix} \eta_1(s_1) & \eta_1(s_2) & \cdots & \eta_1(s_{|\mathcal{S}|}) \\ \vdots & \vdots & \ddots & \vdots \\ \eta_1(s_1) & \eta_1(s_2) & \cdots & \eta_1(s_{|\mathcal{S}|}) \end{pmatrix} + \sum_{s\in\mathcal{S}} \eta_1(s)\big(\mathrm{Diag}[A(\cdot,s)]R(s,1) + \mathrm{Diag}[\bar{A}(\cdot,s)]R(s,0)\big),$$
where
$$\mathrm{Diag}[A(\cdot,s)] = \begin{pmatrix} A(i_1,s) & 0 & \cdots & 0 \\ 0 & A(i_2,s) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A(i_{|\mathcal{C}|},s) \end{pmatrix}$$
and
$$\mathrm{Diag}[\bar{A}(\cdot,s)] = \begin{pmatrix} 1-A(i_1,s) & 0 & \cdots & 0 \\ 0 & 1-A(i_2,s) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1-A(i_{|\mathcal{C}|},s) \end{pmatrix}.$$
And thus we have
$$\bar{\Xi}^{N+1} = \Bigg\{\begin{pmatrix} \eta_1(s_1) & \cdots & \eta_1(s_{|\mathcal{S}|}) \\ \vdots & \ddots & \vdots \\ \eta_1(s_1) & \cdots & \eta_1(s_{|\mathcal{S}|}) \end{pmatrix} + \sum_{s\in\mathcal{S}} \eta_1(s)\big(\mathrm{Diag}[A(\cdot,s)]R(s,1) + \mathrm{Diag}[\bar{A}(\cdot,s)]R(s,0)\big) : \eta_1 \in \Delta(\mathcal{S}),\ R(\cdot,\cdot) \in \bar{\Xi}^N\Bigg\}.$$
Let $\mathbf{1}_s$ be the $|\mathcal{C}|\times|\mathcal{S}|$ matrix with ones along the column corresponding to server type $s$ and all other entries 0. Then the set
$$\mathcal{J}(s) = \big\{\mathbf{1}_s + \mathrm{Diag}[A(\cdot,s)]R(s,1) + \mathrm{Diag}[\bar{A}(\cdot,s)]R(s,0) : R(s,\cdot) \in \bar{\Xi}^N\big\}$$
is a convex polytope, being a linear combination of two convex polytopes followed by an affine shift. It is easy to see that $\bar{\Xi}^{N+1}$ is just the set of convex combinations of the polytopes $\mathcal{J}(s)$ for $s \in \mathcal{S}$, and hence $\bar{\Xi}^{N+1}$ is a convex polytope as well.
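The induction step above is constructive and can be sketched numerically. A hypothetical helper (`next_polytope_vertices`, not from the paper) that maps a vertex list of $\bar{\Xi}^N$ to candidate vertices of $\bar{\Xi}^{N+1}$; since $\bar{\Xi}^{N+1}$ is the convex hull of the union of the $\mathcal{J}(s)$, its vertices are among the images of extreme choices of $R(s,\cdot)$:

```python
import numpy as np
from itertools import product

def next_polytope_vertices(A, verts):
    """Induction step of Proposition 8.8 (sketch): given a vertex list `verts`
    of the polytope Xi-bar^N (as |C| x |S| arrays), generate the candidate
    vertices of Xi-bar^{N+1} by forming, for each server type s, the set
    J(s) = 1_s + Diag[A(., s)] R1 + Diag[1 - A(., s)] R0 over extreme R1, R0."""
    C, S = A.shape
    cands = []
    for s in range(S):
        one_s = np.zeros((C, S))
        one_s[:, s] = 1.0                 # the matrix 1_s: job 1 sent to server s
        D1 = np.diag(A[:, s])             # Diag[A(., s)]: reward observed
        D0 = np.diag(1.0 - A[:, s])       # Diag[Abar(., s)]: no reward observed
        for R1, R0 in product(verts, verts):
            cands.append(one_s + D1 @ R1 + D0 @ R0)
    return cands
```

For instance, starting from $\bar{\Xi}^0 = \{0\}$, the candidates for $\bar{\Xi}^1$ are exactly the matrices $\mathbf{1}_s$, one per server type, matching the base case of the induction.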