Top-down Statistical Modeling and Big Data


Peter W. Glynn
Stanford University

Joint work with Z. Zheng, X. Zhang, and H. Honnappa

i-SIM Workshop, Durham University, UK, August 1, 2017


Outline of Talk

General Comments on Big Data/Simulation

Top-down Statistical Modeling


Management Science in the 20th Century

- Data gathered only when necessary
- Computing limited and expensive
- “Humans in the loop” on all decisions
- OR modeling viewed as a technical competency


Management Science Today

- Data everywhere
- Computation everywhere
- More and more real-time decision-making (“algorithmic decision-making”)
- Analytics viewed by CEOs as a core competency


Why analytics matters: e.g. Netflix competition
- Winner decreased RMSE by 10%

Why incremental improvements matter:
- no capital costs
- implementation costs close to 0
- any improvements translate immediately to bottom line
- effect particularly important in low margin industries (e.g. retail, airlines, etc.)


The OR/MS Focus on Decision-specific Models

- Descriptive
- Predictive
- Prescriptive


Descriptive Models

- Used extensively in design:
  - developing new wireless standards
  - market design
  - service architectures
  - etc.
- Often stylized models
- No data/no ML
- Simulation will continue to play a key role in this setting


Predictive Models

- Used in the presence of data
- When lots of “representative” data is available, ML is applicable and powerful
- Key distinction:
  - OR/MS: Build the known “physics/economic principles” into the model
  - ML: Use more generic statistical models to better predict


Lots of Representative Data
- Online behavior
- Internet of Things
- High frequency trading
- ML a dominant toolset

Modest/Little Representative Data
- Financial time series/longer-term predictions
- Risk exposure predictions
- Extremes in weather (e.g. hurricanes)
- Climate change
- Fisheries, wildlife management
- Clinical trials
- etc.

Scenario Analysis
- 2008 financial crisis


Different Managerial Time Scales for Predictions

Very short time scale (internet)
- simple predictive models; better models adaptively updated

Longer time scales (managing/ordering inventory)
- more complex online predictive models
- simulation a possible technology
- could be real-time or offline
- adaptation could be present

Long time scales (risk management, e.g. VaR)
- simulation more dominant
- offline computation


Challenges for ML Prediction

- Why is the model predicting a given outcome?
  - Related ethical issues: ML may effectively be predicting on the basis of legally prohibited criteria
- Building/recognizing “hard constraints” on performance/safety/risk
- Large consequences for a bad prediction
- Building “physics/economic principles” into the predictor (e.g. gaming effects)


Prescriptive Modeling

- “What if” analysis generally involves a system for which little or no data yet exists
- OR/MS modeling will continue to play a key role


Two Decision Modes

Long-term decisions
- capacity issues
- design issues related to staffing
- etc.

Medium-term decisions
- managing inventory
- routing decisions
- scheduling decisions
- staffing decisions
- etc.


A Common Theme

Going forward, there may be more opportunity to use simulation in real-time settings to make better medium-term decisions (i.e., non-instantaneous decision consequences)


With more data/higher quality data, we have the opportunity to “up our game” regarding the integration of data into simulation models:

- To develop specialized knowledge of what features of data have an impact on decisions
  - Use of MLEs to estimate tail probabilities
- To implement optimization that, from the outset, recognizes data uncertainties and generates robust solutions


“Top-down Modeling”

We should take our limit theorems seriously!

$n^{1/2}(Q_n(\cdot) - q(\cdot)) \Rightarrow Z(\cdot)$ in the many-server setting...

$Z(\cdot)$ is built from a Gaussian process describing covariance structure/time-of-day effects at the order of the service times


An Example

An Israeli call center having an hourly arrival rate varying between 20 and 120 customers per hour.


An Experiment

- Take trace data and run through a many-server queue (500 servers; iid exponential service times)
- Take trace data and split into intervals of length $x$: redistribute the trace arrivals in each such interval as iid uniforms. Then, run the modified traffic through the same queue
- Compare performance using synchronized service times
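The redistribution step of this experiment can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `redistribute_uniform` and the toy arrival times are assumptions, and the queue simulation itself is left out.

```python
import random

def redistribute_uniform(arrivals, x, rng):
    """Replace each arrival by an iid uniform draw within its own length-x interval."""
    out = []
    for t in arrivals:
        k = int(t // x)                       # index of the interval [k*x, (k+1)*x) containing t
        out.append(k * x + x * rng.random())  # uniform position inside that same interval
    return sorted(out)

rng = random.Random(0)
trace = [0.1, 0.2, 0.25, 1.7, 2.05, 2.9]      # made-up arrival times, not the call center data
modified = redistribute_uniform(trace, x=1.0, rng=rng)
```

The interval counts are preserved exactly, so any difference in queue performance between `trace` and `modified` comes only from the arrival pattern inside intervals of length `x`.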


[Figure slides: 11:00 AM - 11:30 AM; 3:00 PM - 3:30 PM; 11:00 AM - 11:30 AM (deterministic); 3:00 PM - 3:30 PM (deterministic)]

As is well known, call center data tend to be over-dispersed relative to a Poisson model:

$\operatorname{var} N(t) \gg E N(t)$

A Simple Fix:

$N_i(\cdot) = \mathcal{N}_i\left(\Lambda_i \int_0^{\cdot} \lambda(s)\,ds\right)$

(introduced by Ward Whitt)
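A small simulation can illustrate why a random multiplier on the integrated rate produces over-dispersion. The sketch below makes several assumptions that are not in the talk: the multiplier is Gamma distributed with mean 1, the integrated daily rate is the made-up value 50, and the outer process is a unit-rate Poisson process. The helper names are hypothetical.

```python
import random
from statistics import mean, variance

def poisson_count(mu, rng):
    """Number of points of a unit-rate Poisson process in [0, mu]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mu:
        count += 1
        t += rng.expovariate(1.0)
    return count

def daily_counts(days, integrated_rate, rng, shape=None):
    """Daily totals; the multiplier is 1 if shape is None, else Gamma with mean 1."""
    counts = []
    for _ in range(days):
        busyness = 1.0 if shape is None else rng.gammavariate(shape, 1.0 / shape)
        counts.append(poisson_count(busyness * integrated_rate, rng))
    return counts

rng = random.Random(1)
plain = daily_counts(2000, 50.0, rng)              # ordinary Poisson days
mixed = daily_counts(2000, 50.0, rng, shape=10.0)  # days with a random multiplier
ratio_plain = variance(plain) / mean(plain)        # close to 1 for Poisson counts
ratio_mixed = variance(mixed) / mean(mixed)        # well above 1 under the random multiplier
```

With a multiplier of variance $v$ and integrated rate $m$, the daily count has mean $m$ and variance $m + m^2 v$, so the variance-to-mean ratio grows with $m$.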


Another Experiment

- Estimate $\lambda(\cdot)$ from the trace data
- Take trace data and run through a queue
- Take trace data and redistribute all the trace arrivals for a given day as iid draws from the distribution $\int_0^t \lambda(s)\,ds \Big/ \int_0^{24} \lambda(s)\,ds$
- Run the modified trace through the queue
- Compare
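Drawing arrival times from the normalized intensity can be done by inverting the cumulative rate. The sketch below assumes, purely for illustration, that the estimated intensity is piecewise constant by hour; the rate profile and the function name `sample_arrival_times` are made up.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def sample_arrival_times(hourly_rates, n, rng):
    """Draw n iid times on [0, 24) with density lambda(t) / (integral of lambda over the day)."""
    cum = list(accumulate(hourly_rates))   # cum[h] = integral of lambda over [0, h + 1]
    total = cum[-1]
    times = []
    for _ in range(n):
        u = rng.random() * total           # uniform level in [0, total)
        h = bisect_right(cum, u)           # first hour whose cumulative mass exceeds u
        before = cum[h - 1] if h > 0 else 0.0
        times.append(h + (u - before) / hourly_rates[h])
    return sorted(times)

rng = random.Random(4)
rates = [20] * 8 + [120] * 8 + [20] * 8    # made-up hourly profile between 20 and 120
times = sample_arrival_times(rates, 6000, rng)
busy_fraction = sum(1 for t in times if 8 <= t < 16) / len(times)
```

Under this profile the busy block carries 960 of the 1280 expected daily arrivals, so `busy_fraction` should be near 0.75.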


[Figure slides: 9:00 AM - 9:30 AM; 12:00 PM - 12:30 PM; 3:00 PM - 3:30 PM; 6:00 PM - 6:30 PM]

What could lead to such differences in “trace predictions”?

One possibility: Auto-correlation in interval counts at operationally relevant time scales
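One way to look for such auto-correlation is to bin the arrivals and compute the sample autocorrelation of the counts at a few lags. This is a generic sketch with hypothetical helper names and toy data, not the analysis behind the slides.

```python
def interval_counts(arrivals, width, horizon):
    """Count arrivals in consecutive intervals of the given width over [0, horizon)."""
    counts = [0] * int(horizon / width)
    for t in arrivals:
        k = int(t // width)
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts

def autocorrelation(xs, lag):
    """Sample autocorrelation of xs at the given lag (biased estimator)."""
    n = len(xs)
    m = sum(xs) / n
    denom = sum((x - m) ** 2 for x in xs)
    num = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(n - lag))
    return num / denom

counts = interval_counts([0.1, 0.2, 1.5, 2.9], width=1.0, horizon=3.0)
alternating = autocorrelation([1, 3, 1, 3, 1, 3], lag=1)
```

For a Poisson process with a known deterministic rate, counts in disjoint intervals are independent, so persistent positive autocorrelation after removing time-of-day effects points to non-Poisson structure.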


How can this be? We know that there are many limit theorems that support the use of (non-stationary) Poisson process models


Superposition of Independent Point Processes

$N_1, N_2, \ldots$ iid point processes with stationary stochastic intensities $\lambda_1, \lambda_2, \ldots$

$\hat{N}_n(\cdot) = \sum_{i=1}^{n} N_i(\cdot)$

$\hat{N}_n(t, t + \cdot/n] \Rightarrow N_\infty(\cdot\, E\lambda_i(0))$ in $D[0, \infty)$


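Superposition of Point Processes

$\left\| P(\hat{N}_n \in \cdot) - P(N_\infty \in \cdot) \right\|_{\mathcal{F}_{t_n}} = \begin{cases} 0, & t_n = o(1/\sqrt{n}) \\ c, & t_n \sim a/\sqrt{n} \\ 1, & t_n \sqrt{n} \to \infty \end{cases}$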

Again, non-Poisson structure manifests itself over mesoscopic and macroscopic time scales


We now discuss “top-down models” that can be used to model traffic that exhibits auto-correlation at operationally meaningful time scales.


Autoregressive Poisson Processes

Self-excited Poisson Time Series Models:

$N_{n+1} = \mathrm{Poisson}_{n+1}(a_1 N_n + \cdots + a_p N_{n-p+1} + b)$

where the $\mathrm{Poisson}_i(\cdot)$’s are independent unit-rate Poisson processes.
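The recursion can be simulated directly by evaluating a fresh unit-rate Poisson process at the current conditional mean. The following is a minimal sketch with hypothetical function names; the parameter values and the zero initial history are assumptions chosen for illustration.

```python
import random

def poisson_count(mu, rng):
    """Number of points of a unit-rate Poisson process in [0, mu]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mu:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_ar_poisson(a, b, n_steps, rng):
    """Simulate counts whose conditional mean is b + a[0]*(last count) + ... + a[p-1]*(p-th last)."""
    p = len(a)
    history = [0] * p                      # last p counts, most recent last (assumed start)
    path = []
    for _ in range(n_steps):
        mu = b + sum(a[i] * history[-1 - i] for i in range(p))
        nxt = poisson_count(mu, rng)       # independent unit-rate process at each step
        path.append(nxt)
        history = history[1:] + [nxt]
    return path

rng = random.Random(3)
path = simulate_ar_poisson(a=[0.5], b=5.0, n_steps=20000, rng=rng)
sample_mean = sum(path) / len(path)
```

Taking expectations in the recursion shows that, when the coefficients sum to less than 1, the stationary mean is $b / (1 - \sum_i a_i)$, which is 10 for the values used here.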


MLE Convex Program Formulation

With observations $N_1, N_2, \ldots$, the MLE problem is

$\max_{a_1, \ldots, a_p} \sum_{n=p+1} \left( N_n \log\left( \sum_{i=1}^{p} a_i N_{n-i} + b \right) - \sum_{i=1}^{p} a_i N_{n-i} - b \right)$

s.t. $b \ge 0$, $a_i \ge 0$, for $i = 1, 2, \ldots, p$.

Highly tractable convex program
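For $p = 1$ the program is small enough to solve without a general convex solver. The sketch below is not the authors' method; it uses a shortcut that is an assumption of this example. At any optimum, complementary slackness gives $a\,\partial_a \ell + b\,\partial_b \ell = 0$, which forces the fitted means to sum to the observed counts, so $b = \bar{y} - a\bar{x}$ and the search reduces to bisection in $a$ on a concave one-dimensional function. The function names and simulated data are hypothetical.

```python
import random

def poisson_count(mu, rng):
    """Number of points of a unit-rate Poisson process in [0, mu]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mu:
        count += 1
        t += rng.expovariate(1.0)
    return count

def fit_ar1_poisson(series, iterations=50):
    """MLE of (a, b) in N_{n+1} = Poisson(a*N_n + b), subject to a >= 0, b >= 0."""
    x = series[:-1]                        # regressors N_n
    y = series[1:]                         # responses N_{n+1}
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    lo, hi = 0.0, y_bar / x_bar            # on this segment b = y_bar - a*x_bar stays >= 0

    def slope(a):
        # derivative of the log-likelihood along the line b = y_bar - a*x_bar
        b = y_bar - a * x_bar
        return sum(yn * (xn - x_bar) / (a * xn + b) for xn, yn in zip(x, y))

    for _ in range(iterations):            # bisection: slope is decreasing in a by concavity
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0:
            lo = mid
        else:
            hi = mid
    a_hat = 0.5 * (lo + hi)
    return a_hat, max(y_bar - a_hat * x_bar, 0.0)

rng = random.Random(2)
series = [10]
for _ in range(5000):                      # simulated data with a = 0.5, b = 5
    series.append(poisson_count(0.5 * series[-1] + 5.0, rng))
a_hat, b_hat = fit_ar1_poisson(series)
```

For general $p$ the same objective would normally be handed to a convex optimization routine; the one-dimensional reduction only applies to the first-order model.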


Estimation and Hypothesis Test

In $p = 1$ setting

- First-order Poisson autoregressive model: $N_{n+1} = \mathrm{Poisson}_{n+1}(a N_n + b)$
- MLE estimate $\hat{a}$, $\hat{b}$ through convex program
- Likelihood ratio test for $a^* > 0$


Applied Probability Properties

In $p = 1$ setting

$E \exp(\theta X_\infty) = E \exp((a^* X_\infty + b^*)(e^\theta - 1))$,

from which equilibrium moments can be explicitly computed

Poisson’s equation

$E_x g_k(X_1) - g_k(x) = -(x^k - E X_\infty^k)$

can be explicitly solved for $g_k$

So, martingale CLT allows us to compute $\sigma_k$ for which

$n^{-1/2} \left( \sum_{i=0}^{n-1} X_i^k - n E X_\infty^k \right) \Rightarrow \sigma_k N(0, 1)$

as $n \to \infty$
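As a worked illustration of how the transform identity yields equilibrium moments (a sketch, assuming $0 \le a^* < 1$ so that the equilibrium has finite first and second moments), differentiate both sides in $\theta$ and set $\theta = 0$:

$$E X_\infty = a^* E X_\infty + b^* \quad\Longrightarrow\quad E X_\infty = \frac{b^*}{1 - a^*}.$$

Differentiating twice and setting $\theta = 0$ gives

$$E X_\infty^2 = E\big[(a^* X_\infty + b^*)^2\big] + E\big[a^* X_\infty + b^*\big],$$

which rearranges to $\operatorname{var} X_\infty = (a^*)^2 \operatorname{var} X_\infty + E X_\infty$, so that

$$\operatorname{var} X_\infty = \frac{b^*}{(1 - a^*)\,(1 - (a^*)^2)}.$$

The variance exceeds the mean whenever $a^* > 0$, so the equilibrium counts are over-dispersed relative to a Poisson distribution.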


We can also explicitly compute the functions $c(\gamma)$ and $w(\gamma)$ for which

$M_n = \exp\left( c(\gamma) X_n + \gamma \sum_{i=0}^{n-1} X_i - n w(\gamma) \right)$

is an exponential martingale

So, we can explicitly compute the large deviations tail asymptotics for $\sum_{i=0}^{n-1} X_i$


If $(X_n : n \ge 0)$ is the traffic fed into a queue, we can explicitly compute the Cramér-Lundberg constant $\theta^*$ for which

$\frac{1}{x} \log P(Q_\infty > x) \to -\theta^*$

as $x \to \infty$

If $\tilde{X}_a(t) = (1 - a) X_{\lfloor t/(1-a) \rfloor}$, then $(\tilde{X}_a(t) : t \ge 0)$ converges weakly to $(Y(t) : t \ge 0)$ as $a \nearrow 1$, where $Y$ satisfies

$dY(t) = (-Y(t) + b)\,dt + \sqrt{Y(t)}\,dB(t)$,

a CIR process
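A quick consistency check on this scaling, offered as a sketch rather than a proof of the weak convergence: in the first-order model, the conditional mean and variance of the next count are both $a X_n + b$, which gives the equilibrium moments $E X_\infty = b/(1-a)$ and $\operatorname{var} X_\infty = b/((1-a)(1-a^2))$ for $0 \le a < 1$. The scaled process therefore has

$$E \tilde{X}_a = (1 - a)\,\frac{b}{1 - a} = b, \qquad \operatorname{var} \tilde{X}_a = (1 - a)^2\,\frac{b}{(1 - a)(1 - a^2)} = \frac{b}{1 + a} \;\to\; \frac{b}{2} \quad \text{as } a \nearrow 1.$$

These agree with the stationary law of the CIR process with unit mean-reversion rate, long-run level $b$, and unit volatility coefficient, which is a Gamma distribution with mean $b$ and variance $b/2$.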


More Generalizations

- To deal with time-of-day effects
- To deal with potential trends
- To incorporate daily “busyness” factor


What time scales matter?

Single-server system:

$(1 - \rho)^{\alpha} W(t/(1 - \rho)^{\beta}) \Rightarrow Z(t)$ as $\rho \nearrow 1$;

$\alpha$, $\beta$, and $Z$ all depend on the qualitative structure of the traffic

Buffer (of size $K$) overflow probabilities:

$P(W(\infty) > K) \sim \alpha K^{-\beta} e^{-\eta K^{\gamma}}$ as $K \to \infty$

Conclusions

- The use of data should focus on the decision to be made...
- Known limit theorems inform us as to the key features of the data that must be captured by our models...
- A number of data sets indicate that top-down autocorrelated models capture the right effects
- Autoregressive Poisson processes...

A new decision-oriented way to think about data analysis in the simulation/management science areas!
