Statistical Methods and Analysis Techniques in Experimental Physics
• fitting with constraints • kernel density estimation
Course FS11 (ETH + UZH), Lecture 9, 19.4.2011
Christoph Grab
Summary of last lecture
"Statistics show that of those who contract the habit of eating, very few survive." (G. B. Shaw)
Method of Least Squares
Goal: from a set of measurements (y_i ± σ_i), find the best estimate of the unknown parameters a in a function f(x|a) describing the data. Assume the data y_i are independent and Gaussian distributed, with errors σ_i.
Calculate the chi-square χ²(a) with parameters [a] for a model function f(x|a), which predicts a y-value for every x-value. The best estimate of the parameters [a] is obtained by minimizing χ²(a):

    χ²(a) = Σ_{i=1}^{N} [y_i − f(x_i; a)]² / σ_i²

Residuals = deviations from the "calculated values", weighted by the errors!
The sum obeys a χ² distribution with N − p degrees of freedom, where N is the number of data points and p the number of parameters to determine in f(x|a).
Note: we have taken the errors σ_i of the different points into account for weighting.
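The χ² above can be evaluated directly. A minimal numpy sketch (the toy data and the model f(x|a) = a·x are invented for illustration, not taken from the lecture):

```python
import numpy as np

def chi2(y, sigma, f_vals):
    """chi^2(a) = sum_i [(y_i - f(x_i; a)) / sigma_i]^2 : residuals weighted by errors."""
    r = (np.asarray(y) - np.asarray(f_vals)) / np.asarray(sigma)
    return float(np.sum(r**2))

# toy data: y_i measured at x_i with errors sigma_i; model f(x|a) = a*x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
sigma = np.array([0.1, 0.1, 0.2])

a = 2.0
print(chi2(y, sigma, a * x))  # each residual is ~1 sigma here, so chi^2 ~ 3
```

Scanning or minimizing this function over a then yields the best estimate â.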
Example: straight line

    χ²(a) = Σ_{i=1}^{N} [y_i − f(x_i; a)]² / σ_i²

[Figure: straight-line fit f(x_i|a) through data points y_i; the differences y_i − f(x_i|a) are the so-called residuals.]
Method of Least Squares: LSQ
The procedure to find the parameter â which minimizes χ² is to take the derivative dχ²/da = 0:

    ∂χ²/∂a = ∂/∂a Σ_{i=1}^{N} [y_i − f(x_i; a)]² / σ_i²
           = −2 Σ_{i=1}^{N} (1/σ_i²) [y_i − f(x_i; a)] ∂f(x_i; a)/∂a = 0

• If a is a p-component vector a = (a_1, …, a_p), then solve the p equations simultaneously: the "normal equations".
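For a straight line f(x|a) = a₁ + a₂x the two normal equations can be solved in closed form using the standard weighted sums. A sketch, assuming numpy (data values invented):

```python
import numpy as np

def fit_line_weighted(x, y, sigma):
    """Solve the two normal equations for f(x|a) = a1 + a2*x analytically.
    Weights are w_i = 1/sigma_i^2."""
    w = 1.0 / np.asarray(sigma)**2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    D = S * Sxx - Sx**2
    a1 = (Sxx * Sy - Sx * Sxy) / D   # intercept
    a2 = (S * Sxy - Sx * Sy) / D     # slope
    return a1, a2

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # exactly y = 1 + 2x
sigma = np.ones_like(y)
a1, a2 = fit_line_weighted(x, y, sigma)
print(a1, a2)  # recovers intercept 1 and slope 2
```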
LSQ - Connection to Likelihood
Given a data set y_i(x_i) with x_i known exactly, i = 1..N: interpret the y_i as measurements of N independent Gaussian random variables, distributed around the theory function f(x_i|a), which is assumed known. Assuming large enough statistics, the errors on y_i are Gaussian. Then the probability to measure a particular y_i for given (x_i, a) is:

    p(y_i | a) = 1/(√(2π) σ_i) · exp( −[y_i − f(x_i; a)]² / (2σ_i²) )

and the likelihood of the whole data set is

    L(y | a) = Π_i p(y_i | a)

Maximize the likelihood by minimizing −ln L:

    −ln L(a, y) = (1/2) Σ_{i=1}^{N} [y_i − f(x_i; a)]² / σ_i² + Σ_i ln(√(2π) σ_i)
                = χ²(a)/2 + const.

The constant term does not depend on a and is irrelevant for the minimization.
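The equivalence can be checked numerically: differences of −2 ln L between two parameter hypotheses equal the corresponding χ² differences, because the a-independent term cancels. A small sketch with invented numbers:

```python
import numpy as np

def neg2_lnL(y, sigma, f_vals):
    """-2 ln L for independent Gaussian measurements around f(x_i; a),
    including the a-independent log terms."""
    y, sigma, f_vals = map(np.asarray, (y, sigma, f_vals))
    return float(np.sum(((y - f_vals) / sigma)**2 + np.log(2 * np.pi * sigma**2)))

y = np.array([2.1, 3.9])
sigma = np.array([0.1, 0.2])
f1 = np.array([2.0, 4.0])   # model prediction for parameter value a1
f2 = np.array([2.2, 4.1])   # model prediction for parameter value a2

chi2_1 = float(np.sum(((y - f1) / sigma)**2))
chi2_2 = float(np.sum(((y - f2) / sigma)**2))

# the two differences agree: the constant term drops out
print(neg2_lnL(y, sigma, f1) - neg2_lnL(y, sigma, f2), chi2_1 - chi2_2)
```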
LSQ: linear matrix case (recap)
In case the function f is linear in a (not necessarily in x), the normal equations are exactly solvable.
Linear: f(x|a) = Σ_r c_r(x) a_r, or f = Ca with C_ir = c_r(x_i)
  y = column vector of N measured data (y_j, j = 1..N)
  a = column vector of n parameters (a_i, i = 1..n)
  C = N×n rectangular matrix (fixed)
  r = y − Ca = y − f = column vector of N residuals (j = 1..N)
  V[y] = N×N covariance matrix; W = V[y]⁻¹ = weight matrix
LSQ principle: minimize with respect to a:

    χ²(a) = (y − Ca)ᵀ V⁻¹ (y − Ca) = rᵀ W r

Note: f(x|a) = a₁ + a₂x + a₃x² + a₄√x is linear in a! But f(x|a) = a₁ + sin(x + a₂) is not!
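In the linear case the minimum has the closed form â = (CᵀWC)⁻¹CᵀWy, with covariance matrix (CᵀWC)⁻¹. A numpy sketch for the straight-line example (data invented):

```python
import numpy as np

def lsq_linear(C, y, V):
    """Exact solution of the linear LSQ problem
    chi^2 = (y - C a)^T V^{-1} (y - C a):
    a_hat = (C^T W C)^{-1} C^T W y with W = V^{-1}; also returns cov(a_hat)."""
    W = np.linalg.inv(V)
    cov = np.linalg.inv(C.T @ W @ C)
    a_hat = cov @ C.T @ W @ y
    return a_hat, cov

# straight line f(x|a) = a1 + a2*x  ->  columns c1(x) = 1, c2(x) = x
x = np.array([0.0, 1.0, 2.0, 3.0])
C = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x                      # perfect data
V = np.diag(np.full(4, 0.01))          # uncorrelated errors, sigma = 0.1
a_hat, cov = lsq_linear(C, y, V)
print(a_hat)
```

A correlated V (off-diagonal elements) is handled by exactly the same formula.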
LSQ: linear correlated case
In the linear case with correlated measurements, the normal equations ∂χ²/∂a_i = 0 for N data points and n coefficients a_i (n < N) remain exactly solvable, using the full covariance matrix V[y].

Why constraints? Examples:
Energy and momentum are conserved in a reaction ("4C fits").
So-called "beam constraints": e.g. at e⁺e⁻ colliders, the total energy in the event (s = E²_CMS) is given by the beams.
Particle decays: the mass of the mother particle is known once the decay is identified, e.g. Ks → ππ.
First look at linear constraints. If needed, one can linearise the problem. Numerically, non-linear constraints are no longer a real issue.
Example: Tau-lepton decays
The goal of the constrained fit is to make optimal use of the experimental data to determine the τ branching fractions. A total of 119 conventional branching fractions are listed, plus some 28 upper limits. There are 31 basis modes, whose sum is constrained to 1. The constrained fit has a χ² of 95.7 for 100 degrees of freedom [prob(χ², n) = 0.60].
Inclusion of constraints
Idea: reformulate the χ² minimization as an unconstrained one, then fit as before! Constraints can be included in different ways.
Elimination method: use the constraint equations to eliminate a number of variables analytically, then fit the data in terms of the reduced number of unknowns.
Method of Lagrangian multipliers: increase the number of unknowns in the fit by adding a set of Lagrangian multipliers, one for each constraint equation.
Pragmatic approaches: mimicking very precise measurements; iterative fitting, fixing parameters one by one... useful in unstable conditions.
Example: Angles in a Triangle
Measure the 3 angles y_i of a triangle, with an identical error σ = 1° for all of them (for the sake of argument):
    y₁ = 63°,  y₂ = 34°,  y₃ = 85°
Wanted: estimates n_i of the true values of the angles.
Observation: Σ y_i − 180° = 2°: two degrees off!
LSQ fit: find the solution to

    χ²(n₁, n₂, n₃) = Σ_{i=1}^{3} (y_i − n_i)² / σ_i² = min

with the constraint   Σ_i n_i − 180° = 0.
Example Triangle: Elimination method
Elimination method: use the constraint to eliminate one n. Eliminate n₃, substitute into χ², and minimize with respect to the remaining two variables n₁ and n₂:

    χ²(n₁, n₂) = (y₁ − n₁)²/σ₁² + (y₂ − n₂)²/σ₂² + (y₃ − (180 − n₁ − n₂))²/σ₃² = min

Find the solutions:

    n̂₁ = (1/3)(180 + 2y₁ − y₂ − y₃) = 62⅓°
    n̂₂ = (1/3)(180 − y₁ + 2y₂ − y₃) = 33⅓°

and from the constraint:

    n̂₃ = 180 − n̂₁ − n̂₂ = 84⅓°

Finding: the improvement comes from subtracting the measured excess of 2° from all three observations equally (−2/3° each). No surprise.
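The elimination-method solution can be reproduced numerically. A sketch with the slide's numbers:

```python
import numpy as np

y = np.array([63.0, 34.0, 85.0])   # measured angles, sigma = 1 deg each
excess = y.sum() - 180.0           # = 2 deg

# elimination method: n3 = 180 - n1 - n2; minimizing chi^2(n1, n2) gives
n1 = (180.0 + 2 * y[0] - y[1] - y[2]) / 3.0
n2 = (180.0 - y[0] + 2 * y[1] - y[2]) / 3.0
n3 = 180.0 - n1 - n2               # from the constraint

print(n1, n2, n3)  # each observation shifted by -excess/3 = -2/3 deg
```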
J.-L. Lagrange
Joseph-Louis Lagrange, born Giuseppe Lodovico Lagrangia (Turin, 1736 – 1813); later lived in Berlin and Paris.
Succeeded Euler as the director of mathematics at the Prussian Academy of Sciences in Berlin; first professor of analysis at the École Polytechnique in Paris.
Many important contributions: algebra, astronomy, classical and analytical mechanics (e.g. Mécanique Analytique, the Euler-Lagrange equations), calculus of variations.
Introduced the method of Lagrangian multipliers.
Principle of Lagrange Multipliers
Find the parameter â which minimizes χ²:

    χ²(a) = Σ_{i=1}^{N} [y_i − f(x_i; a)]² / σ_i²

• Add a constraint in the form of an equation (which may depend on a):

    g(x, a) = c

• Re-define χ² including the constraint, adding a parameter λ (add one λ per constraint):

    χ²(a, λ) = Σ_{i=1}^{N} [y_i − f(x_i; a)]² / σ_i² + 2λ (g(x, a) − c)

• Find the solutions for both (â, λ̂) by taking derivatives with respect to the two variables:

    ∂χ²/∂a = −2 Σ_{i=1}^{N} (1/σ_i²) [y_i − f(x_i; a)] ∂f(x_i; a)/∂a + 2λ ∂g(x; a)/∂a = 0
    ∂χ²/∂λ = 2 [g(x; a) − c] = 0
Triangle: Lagrangian Multiplier (1)
Lagrangian multiplier method: reformulate the problem to an equivalent form with a Lagrangian multiplier λ:

    χ²(n₁, n₂, n₃, λ) = Σ_{i=1}^{3} (y_i − n_i)² / σ_i² + 2λ (Σ_{i=1}^{3} n_i − 180) = min

This is a linear LSQ problem for the four unknowns n_i, λ. The 4 normal equations ∂χ²/∂n_i = 0 and ∂χ²/∂λ = 0 become:

    0 = ∂χ²/∂n_i = −2 (y_i − n_i)/σ_i² + 2λ      for i = 1, 2, 3     (I–III)
    0 = ∂χ²/∂λ  = 2 (Σ_{i=1}^{3} n_i − 180)                          (IV)
Triangle: Lagrangian Multiplier (2)
This can be solved analytically for λ (sum Eqs. I–III and use Eq. IV):

    0 = −(2/σ²) (Σ_{i=1}^{3} y_i − 180) + 6λ    which leads to    λ̂ = (1/(3σ²)) (Σ_{i=1}^{3} y_i − 180)

Then the estimates for the angles become (with Eqs. I–III):

    n̂_j = y_j − σ² λ̂ = y_j − (1/3) (Σ_{i=1}^{3} y_i − 180)

We find the same result: the measured excess of 2° is subtracted from all three observations equally. The symmetry in all three indices is obvious.
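Since the problem is linear, the four normal equations can be solved as one linear system. A numpy sketch with the slide's numbers:

```python
import numpy as np

y = np.array([63.0, 34.0, 85.0])
sig2 = 1.0                          # sigma^2, identical for all angles

# unknowns (n1, n2, n3, lam): the normal equations read
#   n_i + sig2*lam = y_i   (i = 1, 2, 3)
#   n1 + n2 + n3   = 180   (constraint)
A = np.array([[1, 0, 0, sig2],
              [0, 1, 0, sig2],
              [0, 0, 1, sig2],
              [1, 1, 1, 0   ]], dtype=float)
b = np.array([y[0], y[1], y[2], 180.0])
n1, n2, n3, lam = np.linalg.solve(A, b)
print(n1, n2, n3, lam)  # lam = (sum(y) - 180) / (3*sig2) = 2/3
```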
Comments on Lagrangian Multipliers
In general, the method of Lagrange multipliers is useful if the number of constraints is not too large. The number of Lagrange multipliers equals the number of constraints. All variables are treated symmetrically. The introduced factor of 2 is convention; it drops out after taking the derivatives. Linear conditions can be solved analytically; non-linear conditions need to be solved iteratively, using reasonable starting values for the parameters to converge. [For a more explicit treatment see e.g. V. Blobel.]
Generalisation: Fitting with constraints
General case of χ² minimization with a set of additional constraints H(y) = 0:
- linearize first (first-order Taylor expansion) around some given expansion point y_ex:

    H(y) ≈ H(y_ex) + (∂H/∂y)(y − y_ex) = D Δy − d = 0

D: matrix of derivatives, one row per constraint equation (n equations × n parameters); d: vector of values of the constraints.
Function to minimize with respect to (y, λ):

    χ² = (y_ref − y)ᵀ V⁻¹ (y_ref − y) + 2λᵀ (D Δy − d) = min

In the linearized case, the minimization problem has an analytical solution, independent of the constraint equations. One may need to iterate if the initial expansion point is too far off the minimum.
Example of constraints without error
Consider fitting charged-particle tracks in a collider experiment, coming from one common event vertex.
1) The standard way is to assign a certain combination of measured hits to one "track" and fit them to a "helical track" using some standard parametrisation.
2) This can be improved: if I know that the particle originates from the primary interaction, I can require that the track comes from (0,0,0). For this, just ADD (0,0,0) as one additional "measured" hit.
3) Improve further, i.e. make it more realistic: add an error to the origin (0,0,0).
4) Improve again by replacing the origin (0,0,0) with a realistic average run vertex, i.e. the beam spot.
Example of constraints: mimicking very precise measurements
Consider finding the origin of an interaction in a collider experiment, the so-called event (or primary) vertex, based on a number of measured tracks. Using information from the beam, an expected distribution of the reactions in the interaction region is known: the locus of collisions, given in terms of a "beam profile". E.g. at HERA: close to Gaussian in x and y, with σ_x = 150 μm and σ_y = 25 μm; in z it is a flattened non-Gaussian distribution with a length of some 30 cm.
In addition, a "run vertex" can be determined by averaging over many events in the run. All this information produces a "razor-like" shape of the interaction region, with errors.
LSQ Fitting with Weighted Events
Events may be weighted: i.e. the weight w_i assigned to a particular event i is equal to the inverse probability of observing this event. Weights may come from: prescales from hardware trigger levels (L1, ..., HLT); online selection processes (skims, ...); weights in the Monte Carlo simulation process, e.g. for steeply falling distributions one may generate N events separately in different regions, but with relative weights.
Problem: how can these individually weighted events be taken properly into account in the fit? See Barlow 93, Zech 95.
LSQ Fitting with Weighted Events
Note: weights between different events should not be TOO different, or else curious results may be obtained. BUT: this procedure is no longer exact. Exactness is only reached in the limiting case n → ∞. Judgement is the key!
LSQ Fits: Residuals and Pulls
Goodness-of-fit checks:
Residuals:   ε_i = y_i − n_i   (measurement minus fitted value)
Pulls (or stretch functions):   z_i = ε_i / σ_i
If the observations are uncorrelated and the overall estimation problem is sufficiently linear, then we expect the pulls to follow a Gaussian G(0,1) distribution. Plot the pull distribution and compare with G(0,1).
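A quick sketch of the pull check, using pseudo-measurements generated around a known model (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
f_true = 5.0                       # model value, taken as the fitted value here
sigma = np.full(n, 0.5)            # measurement errors
y = rng.normal(f_true, sigma)      # pseudo-measurements around the model

residuals = y - f_true
pulls = residuals / sigma          # z_i = (y_i - n_i) / sigma_i

# for a healthy fit the pulls are G(0,1): mean ~ 0, width ~ 1
print(pulls.mean(), pulls.std())
```

A pull width clearly above 1 would indicate underestimated errors (or a bad model); below 1, overestimated errors.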
Some notes about fitting/using weighted events:
• Avoid data sets with largely varying weights.
• If you need weights, be careful when combining events.
• If you have weights, be extremely careful when quoting errors on "combined samples".
How to fit multiple signals and background
The different sources are the decays; assume they all have the same weight.
Binned ML fit without weights (1)
Observed data events d_i in bins i = 1…n, from sources j = 1…m.
p_j = fraction of source j (= scale factor): these are the parameters searched for.
The distribution of the sources is given by MC templates, with A_ji = expected number of MC events of source j in bin i, interpreted as Poisson means (the templates fluctuate due to limited statistics) of the generated MC events a_ji.
The predicted number of data events in bin i is:

    f_i = Σ_{j=1}^{m} p_j A_ji

The data are assumed Poisson distributed:

    P(d_i | f_i) = f_i^{d_i} e^{−f_i} / d_i!

Estimate the fractions p_j (searched) and the A_ji (uninteresting) by an ML fit:

    −ln L = Σ_{i=1}^{n} ( f_i − d_i ln f_i ) + Σ_{i=1}^{n} Σ_{j=1}^{m} ( A_ji − a_ji ln A_ji )
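A minimal sketch of such a binned ML fit for one scale factor p over a fixed background, neglecting the finite-MC term above (i.e. taking the A_ji as exact). The templates and data are invented, and the minimum is found by a simple grid scan:

```python
import numpy as np

def neg_lnL(p, d, A, B):
    """-ln L (up to constants) for Poisson data d_i with prediction
    f_i = p*A_i + B_i. The finite-MC (template fluctuation) term is neglected."""
    f = p * A + B
    return float(np.sum(f - d * np.log(f)))

A = np.array([10.0, 20.0, 30.0])   # signal template (MC expectation per bin)
B = np.array([5.0, 5.0, 5.0])      # fixed background per bin
d = 2.0 * A + B                    # pseudo-data generated with p_true = 2

grid = np.linspace(0.5, 4.0, 3501)
p_hat = grid[np.argmin([neg_lnL(p, d, A, B) for p in grid])]
print(p_hat)  # close to the true value 2.0
```

In practice one would use a proper minimizer and include the A_ji nuisance terms (Barlow-Beeston); the grid scan only illustrates the likelihood shape.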
How to fit multiple signals and background
The different sources are the decays, with different branching ratios; different efficiencies mean they all have different weights.
LSQ Fitting with Weighted Events (opt)
Example: a histogram with n_i = observed number of events in bin i, f_i = predicted number of events, and w_ij = inverse detection probability of event j in bin i. Defining

    n_i' = Σ_{j=1}^{n_i} w_ij ,    D_i = (1/n_i) Σ_{j=1}^{n_i} w_ij ,    f_i' = f_i D_i

the various alternatives for the χ², differing in the assumed errors (!), are:

    χ²₁ = Σ_{i=1}^{N} (n_i' − f_i)² / f_i
    χ²₂ = Σ_{i=1}^{N} (n_i' − f_i)² / n_i'
    χ²₃ = Σ_{i=1}^{N} (n_i' − f_i)² / f_i'

1) "exact", assuming a Poisson distribution for n_i
2) "approximate": substitute the variance by n_i' instead of f_i
3) using the estimated detection efficiency
Note: 3) is equivalent to 1) for identical weights.
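The three alternatives can be compared on a toy histogram (weights and predictions invented; the definitions of n_i', D_i and f_i' follow the slide):

```python
import numpy as np

def chi2_variants(weights_per_bin, f):
    """Three chi^2 alternatives for weighted events:
    n_i' = sum of weights in bin i, D_i = average weight, f_i' = f_i * D_i."""
    n_prime = np.array([np.sum(w) for w in weights_per_bin])
    D = np.array([np.mean(w) for w in weights_per_bin])
    f_prime = f * D
    r2 = (n_prime - f)**2
    return (float(np.sum(r2 / f)),        # 1) variance from prediction f_i
            float(np.sum(r2 / n_prime)),  # 2) variance from observation n_i'
            float(np.sum(r2 / f_prime)))  # 3) using the estimated efficiency

weights = [np.array([1.0, 1.0, 2.0]), np.array([1.5, 0.5])]  # per-event weights
f = np.array([3.0, 2.5])                                     # predictions per bin
c1, c2, c3 = chi2_variants(weights, f)
print(c1, c2, c3)

# sanity check: with identical unit weights, variant 3 reduces to variant 1
unit = [np.ones(4), np.ones(2)]
u1, u2, u3 = chi2_variants(unit, f)
```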
Literature on fitting etc.
R. Barlow, C. Beeston: "Fitting using finite Monte Carlo samples", Computer Physics Communications 77 (1993) 219-228; http://lss.fnal.gov/archive/other/man-hep-93-1.pdf
G. Zech: "Comparing Statistical Data to Monte Carlo Simulation", DESY Report 95-113; http://ccdb4fs.kek.jp/cgi-bin/img/allpdf?199508210
A. G. Frodesen, O. Skjeggestad, H. Toefte: "Probability and Statistics in Particle Physics", 1979, p. 295
Lendermann et al., note on combining weighted events in experiments: http://arxiv.org/abs/0901.4118
Kernel density estimation
Kernel density estimation is a way to estimate the PDF of a random variable x in a smooth way – without knowing the true PDF.
Kernel density estimation
Definition: A kernel is a non-negative, real-valued, integrable function K, used to estimate the density function (PDF) of a random variable, satisfying the following requirements:

    K(u) ≥ 0                 positive
    K(−u) = K(u)             symmetric around 0
    ∫ K(u) du = 1            normalised (makes it a PDF)
    ∫ u K(u) du = 0          expectation value is 0
    ∫ u² K(u) du < ∞         finite variance
Non-parametric density estimation
Consider the simplest case of a few measured points: the simplest non-parametric density estimate is a histogram. We need to determine the bin width and the bin end-points.
[Figure: histogram of the data points with bins 2-2.5, 2.5-3, 3-3.5, 3.5-4, …; bin width = 0.5]
Non-parametric density estimation
Same data points, same bin width, but shifting the bin centres gives a different picture: the distribution is NOT smooth.
[Figure: histogram with shifted bins 2.25-2.75, 2.75-3.25, …; bin width = 0.5]
Idea of kernel density estimation
Kernel density estimation is a way to estimate the PDF of a random variable x in a smooth way. It is determined by the kernel K, which defines the weighting around each data point (centred at the data point itself), and the bandwidth b, which defines the width of the weighting. Sum the kernels centred around every data point:
[Figure: sample of events → apply a kernel to each event → summed kernels for all events = "pdf"]
Other example: kernel density estimation
Six data points are measured. The kernel density estimate f̂(x) is obtained by placing a Gaussian at each data point, taking the sum of all 6 Gaussians, and dividing by 6 (= number of Gaussians). The variance of the Gaussians was set to 0.5.
Note: where the points are denser, the density estimate has higher values.
[Figure: six Gaussians (red) and their sum (blue).]
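A sketch of this construction (the six data values below are illustrative, not taken from the slide's figure):

```python
import numpy as np

def gaussian_kde(x, data, b):
    """f_hat(x) = (1/(n*b)) * sum_i K((x - x_i)/b) with a Gaussian kernel K."""
    u = (np.atleast_1d(x)[:, None] - np.asarray(data)[None, :]) / b
    return (np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)).sum(axis=1) / (len(data) * b)

data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]   # six data points (illustrative)
b = np.sqrt(0.5)                            # kernel variance 0.5, as on the slide
xs = np.linspace(-6.0, 10.0, 801)
f_hat = gaussian_kde(xs, data, b)

dx = xs[1] - xs[0]
print((f_hat * dx).sum())   # the estimate integrates to ~1, like a PDF
```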
Non-parametric density estimation
Generalized kernel density estimation = a procedure to estimate an unknown PDF in a non-parametric way.
Let x_i, i = 1…n, be a set of independent and identically distributed samples of a random variable x. The relative occurrence 1/n of data point x_i in the sample is seen as a probability density; smear/smooth this probability density around x_i according to a certain pattern (= kernel). The width of the smoothing around x_i is called the bandwidth b.
The general expression for the kernel density approximation f̂ of the data's probability density function f, with kernel K(u) and u = (x − x_i)/b, is:

    f̂_n^K(x) = (1/(nb)) Σ_{i=1}^{n} K( (x − x_i)/b )
Finite kernel choices
Several choices for finite kernels K(u):
Rectangular kernel:
    K_R(u) = 0.5               for −1 ≤ u ≤ 1, else 0
Triangular kernel:
    K_T(u) = 1 − |u|           for −1 ≤ u ≤ 1, else 0
Bi-square (or quartic) kernel:
    K_B(u) = (15/16)(1 − u²)²  for −1 ≤ u ≤ 1, else 0
Finite kernel choices
The Epanechnikov kernel, one of the most often used kernels:

    K(u) = (3/4)(1 − u²)       for −1 ≤ u ≤ 1, else 0
Infinite kernels
Another often-used, infinite kernel is the Gaussian kernel:

    K_G(u) = (1/√(2π)) exp(−u²/2)
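The kernel requirements from the definition (norm 1, mean 0) can be verified numerically for these choices. A numpy sketch:

```python
import numpy as np

def K_rect(u):  return np.where(np.abs(u) <= 1, 0.5, 0.0)
def K_tri(u):   return np.where(np.abs(u) <= 1, 1.0 - np.abs(u), 0.0)
def K_epan(u):  return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)
def K_gauss(u): return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

# crude numerical integration on a symmetric grid
u = np.linspace(-8.0, 8.0, 16001)
du = u[1] - u[0]
norms, means = {}, {}
for K in (K_rect, K_tri, K_epan, K_gauss):
    k = K(u)
    norms[K.__name__] = float((k * du).sum())      # should be ~1
    means[K.__name__] = float((u * k * du).sum())  # should be ~0
print(norms, means)
```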
Comments on density estimation
A nearly optimal choice of kernel is the Epanechnikov kernel, since it yields efficient estimates (see below). The Gaussian kernel is also good and frequently used.
In practice, the choice of the bandwidth parameter b is much more important than the kernel choice:
Too small a bandwidth b: the density fluctuates a lot, with many local maxima/minima ("undersmoothing").
Too large a bandwidth: information in the data is lost, substructures are washed out ("oversmoothing").
Example: bandwidth influence
[Figure: a) histogram-like box-kernel density estimate, centred at the bins; b) and c) Gaussian kernel, but under- or oversmoothed.]
Example: bandwidth influence
[Figure: kernel density estimation of 100 normally distributed random numbers, using different smoothing bandwidth values b.]
Example: Old Faithful
• Old Faithful geyser in Yellowstone Park
• The eruption period is regular, but the duration of an eruption is irregular.
• A "textbook" example for applying non-parametric kernel density estimation.
Choice of bandwidth b?
Task: how to choose the optimal bandwidth?
Solution: choose the bandwidth b which minimises the so-called AMISE = Asymptotic Mean Integrated Squared Error.
In general, the AMISE depends on the true underlying density (which we don't have!), so we need to estimate the AMISE from our data as well. This means that the chosen bandwidth is an estimate of an asymptotic approximation. BUT: this particular choice of bandwidth recovers all the important features whilst maintaining smoothness.
Choice of optimal bandwidth b (opt)
Choose the optimal bandwidth: use the mean integrated squared error (MISE) to quantify the accuracy of the estimate (f = unknown PDF to estimate; f̂ = kernel density estimate of the data):

    MISE(b) = E[ ∫ ( f̂(x) − f(x) )² dx ]

For kernel density estimation with kernel K(z), the MISE can be approximated as:

    MISE(b) ≈ (b⁴/4) σ_K⁴ ∫ f''(x)² dx  +  (1/(nb)) ∫ K(z)² dz

where the first term is the (squared) bias, dominant for large b, and the second is the variance, dominant for small b. The bias depends on the second derivative of f and disappears for b → 0: there is a tradeoff.
Choice of optimal bandwidth b
The bandwidth b is optimal if the MISE is minimal, dMISE/db = 0:

    b_opt = [ (1/n) · ∫ K(z)² dz / ( σ_K⁴ ∫ f''(x)² dx ) ]^{1/5}

with σ_K² = ∫ z² K(z) dz for the kernel K(z).
The optimal b depends on the sample size n and on f'' of the true PDF. But f is unknown and may not even exist analytically.
Rule of thumb: replace the unknown f by a reference distribution (e.g. a Gaussian) and compute b_opt from that.
Plug-in approach: start with a best initial guesstimate of b, calculate ∫ f''(x)² dx from the resulting estimate, then calculate b again; iterate towards b_opt.
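For a Gaussian reference density, the rule of thumb reduces to Silverman's well-known formula b_opt ≈ 1.06 σ̂ n^(−1/5). A sketch (the factor 1.06 comes from evaluating ∫f''² for a Gaussian f; it is standard, but not spelled out on the slide):

```python
import numpy as np

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth for a Gaussian kernel: assume the true f is
    roughly Gaussian, so integral(f''^2) can be computed analytically,
    giving b_opt ~= 1.06 * sigma_hat * n^(-1/5)."""
    data = np.asarray(data)
    return 1.06 * data.std(ddof=1) * len(data)**(-0.2)

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, 100)
b = silverman_bandwidth(sample)
print(b)   # roughly 1.06 * 100^(-1/5) ~ 0.42 for a unit-sigma sample
```

A robust variant replaces σ̂ by min(σ̂, IQR/1.34) to guard against heavy tails.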
Example: optimal bandwidth
For our histogram example the optimal smoothing parameter is found to be a bandwidth of b ≈ 0.25.
Now we have: a smooth estimate, with no bin end-points; the result depends only on the bandwidth.
Adaptive kernel density estimation
The bandwidth b is no longer constant: increase (decrease) b in regions with a small (high) data population. This preserves small features in high-statistics regions and minimizes jitter in low-statistics regions.

    f̂_n^K(x) = (1/n) Σ_{i=1}^{n} (1/b_i) K( (x − x_i)/b_i )

Static kernel: the widths of all Gaussians are identical. Adaptive kernel: the width of each Gaussian depends on the local density of events.
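One common way to choose the local bandwidths b_i is a two-pass scheme: a fixed-bandwidth pilot estimate, then each b_i scaled by the inverse square root of the pilot density at x_i. This particular (Abramson-style) scheme is an assumed illustration, not taken from the slide:

```python
import numpy as np

def adaptive_kde(x, data, b0):
    """Two-pass adaptive KDE sketch: pilot estimate with fixed bandwidth b0,
    then local bandwidths b_i = b0 * sqrt(g / f_pilot(x_i)),
    with g the geometric mean of the pilot density at the data points."""
    data = np.asarray(data)

    def fixed_kde(pts, b):
        u = (np.atleast_1d(pts)[:, None] - data[None, :]) / b
        return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum(axis=1) / (len(data) * b)

    pilot = fixed_kde(data, b0)
    g = np.exp(np.mean(np.log(pilot)))
    b_i = b0 * np.sqrt(g / pilot)         # wider kernels where data are sparse

    u = (np.atleast_1d(x)[:, None] - data[None, :]) / b_i[None, :]
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi) / b_i[None, :]
    return k.sum(axis=1) / len(data)

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 200)
xs = np.linspace(-6.0, 6.0, 1201)
f_hat = adaptive_kde(xs, data, 0.4)
print((f_hat * (xs[1] - xs[0])).sum())   # still normalised to ~1
```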