Noise reduction in genome-wide perturbation screens using linear mixed-effect models

Supplementary Materials

Danni Yu 1, John Danku 2, Ivan Baxter 3, Sungjin Kim 4, Olena K. Vatamaniuk 4, David E. Salt 2,5 and Olga Vitek 1,6

June 18, 2011

1 Department of Statistics, Purdue University, West Lafayette, IN, USA
2 School of Biological Sciences, University of Aberdeen, Aberdeen, UK
3 USDA-ARS Plant Genetics Research Unit, Donald Danforth Plant Science Center, MO, USA
4 Department of Crop and Soil Sciences, Cornell University, Ithaca, NY, USA
5 Bindley Bioscience Center, Discovery Park, Purdue University, West Lafayette, IN, USA
6 Department of Computer Science, Purdue University, West Lafayette, IN, USA

We describe the existing statistical methods for the interpretation of perturbation screens in the Results section. Implementations of these methods are available in the open-source Bioconductor packages RNAither [4] and cellHTS2 [2, 1]. Below, $X_{gkp}$ denotes the raw phenotype of replicate $k$ of mutant $g$ on plate $p$.

1 Currently used methods: sample-based normalization

B score accounts for additive effects of rows and columns within plates. In the notation above, the method specifies a linear model

$$X_{gkp} = \mu_p + R_{i,p} + C_{j,p} + \varepsilon_{ij,gkp}, \tag{1}$$

where $\mu_p$ is the average phenotype of the plate, $R_{i,p}$ and $C_{j,p}$ are the systematic plate-specific artifacts of row $i$ and column $j$, and $\varepsilon_{ij,gkp}$ is the non-systematic error following a normal distribution with a mean equal to zero. Parameters in Eq. (1) are estimated separately for each plate, from all measurements on the plate, using a robust alternative to sample averages (i.e. Tukey median polish). The residual measurements of the phenotype are then obtained as

$$r_{gkp} = X_{gkp} - [\hat{\mu}_p + \hat{R}_{i,p} + \hat{C}_{j,p}]. \tag{2}$$

Finally, the B score for the $k$th replicate of mutant $g$ is calculated by standardizing its residual by a robust estimate of the plate-specific spread of the residuals of all samples

$$\mathrm{Bscore}_{gkp} = \frac{r_{gkp}}{\mathrm{median}(\,|r_{gkp} - \mathrm{median}(r_{gkp})|\,)_p}. \tag{3}$$
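As an illustration of Eqs. (1)-(3), the following is a minimal Python sketch of a B score computation for a single plate. It is not the RNAither or cellHTS2 implementation: the function names `median_polish` and `b_scores`, the fixed number of sweeps `n_iter`, and the representation of a plate as a list of rows are all assumptions made for this example.

```python
from statistics import median


def median_polish(plate, n_iter=10):
    """Tukey median polish of one plate given as a list of rows.

    Returns (overall, row_eff, col_eff, resid) such that
    plate[i][j] = overall + row_eff[i] + col_eff[j] + resid[i][j].
    The fixed number of sweeps is an assumption of this sketch.
    """
    n_rows, n_cols = len(plate), len(plate[0])
    resid = [row[:] for row in plate]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the row medians out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the column medians out of the residuals.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


def b_scores(plate):
    """Residuals of Eq. (2) divided by their plate-wide MAD, as in Eq. (3)."""
    _, _, _, resid = median_polish(plate)
    flat = [x for row in resid for x in row]
    center = median(flat)
    mad = median(abs(x - center) for x in flat)
    if mad == 0:
        raise ValueError("MAD of the residuals is zero; B score is undefined")
    return [[x / mad for x in row] for row in resid]
```

By construction, the median absolute deviation of the returned scores on a plate equals 1 whenever the MAD of the residuals is positive.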

Z score expresses the distance between the mutant's phenotype and the plate mean in units of the plate-specific standard deviation

$$\mathrm{Zscore}_{gkp} = \frac{X_{gkp} - \bar{X}_{\cdot\cdot p}}{s_{\cdot\cdot p}}, \tag{4}$$

where $\bar{X}_{\cdot\cdot p}$ and $s_{\cdot\cdot p}$ are respectively the mean and the standard deviation of the raw phenotypes of all samples in plate $p$.

Plate-wise median normalization (pmNorm) is defined as

$$\mathrm{pmNorm}_{gkp} = \frac{X_{gkp}}{\mathrm{median}(X_{gkp})_p}. \tag{5}$$
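Eqs. (4) and (5) can be sketched in a few lines of Python. The function names `z_scores` and `pm_norm` are hypothetical, each call is assumed to receive the raw phenotypes of a single plate, and the use of the sample standard deviation (denominator n − 1) is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, median, stdev


def z_scores(plate_values):
    """Eq. (4): center by the plate mean, scale by the plate standard deviation."""
    m = mean(plate_values)
    s = stdev(plate_values)  # sample standard deviation (n - 1), an assumption
    return [(x - m) / s for x in plate_values]


def pm_norm(plate_values):
    """Eq. (5): divide every phenotype by the plate median."""
    med = median(plate_values)
    return [x / med for x in plate_values]
```

Both functions operate plate by plate, so a full screen would be normalized by calling them once per plate.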

Quantile normalization assumes that the empirical distributions of the phenotypes are the same across plates. It normalizes the phenotype $X_{gkp}$ by applying the transformation

$$r_{gkp} = F^{-1}(G(X_{gkp})), \tag{6}$$

where $G$ is the empirical distribution of each plate, and $F$ is the averaged distribution of sample quantiles across all plates.
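One common way to realize the transformation in Eq. (6) is rank-based: the k-th smallest value of every plate is replaced by the average of the k-th smallest values across plates. The sketch below follows that recipe under two simplifying assumptions that are not stated in the text: all plates hold the same number of samples, and ties are broken by position. The function name `quantile_normalize` is hypothetical.

```python
def quantile_normalize(plates):
    """Rank-based quantile normalization of equally sized plates (a sketch)."""
    n = len(plates[0])
    if any(len(p) != n for p in plates):
        raise ValueError("this sketch assumes plates of equal size")
    sorted_plates = [sorted(p) for p in plates]
    # Reference quantiles (the role of F^{-1}): mean of the k-th order statistics.
    reference = [sum(sp[k] for sp in sorted_plates) / len(plates) for k in range(n)]
    normalized = []
    for p in plates:
        # The rank of each value within its plate plays the role of G.
        order = sorted(range(n), key=lambda idx: p[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized
```

After this step every plate has exactly the same set of values, only arranged according to the original within-plate ranks.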

2 Currently used methods: control-based normalization

Normalized percent inhibition (NPI) is defined as

$$\mathrm{NPI}_{gkp} = \frac{\bar{c}^{+}_{p} - X_{gkp}}{\bar{c}^{+}_{p} - \bar{c}^{-}_{p}}, \tag{7}$$

where $\bar{c}^{+}_{p}$ and $\bar{c}^{-}_{p}$ are the mean raw phenotypes of positive and negative controls in plate $p$.

Percent of control mean (pocMean) is defined as

$$\mathrm{pocMean}_{gkp} = \frac{X_{gkp}}{\bar{c}^{+}_{p}} \times 100, \tag{8}$$

where $\bar{c}^{+}_{p}$ is the mean of the positive controls in plate $p$.

Percent of control median (pocMed) is defined as

$$\mathrm{pocMed}_{gkp} = \frac{X_{gkp}}{\tilde{c}^{-}_{p}} \times 100, \tag{9}$$

where $\tilde{c}^{-}_{p}$ is the median of the negative controls in plate $p$.
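The three control-based normalizations in Eqs. (7)-(9) differ only in which control summary they use. A minimal Python sketch follows; the function names and the argument layout (one phenotype `x` plus the lists of positive and negative control phenotypes from the same plate) are assumptions made for illustration.

```python
from statistics import mean, median


def npi(x, pos, neg):
    """Eq. (7): normalized percent inhibition of phenotype x on one plate."""
    c_pos, c_neg = mean(pos), mean(neg)
    return (c_pos - x) / (c_pos - c_neg)


def poc_mean(x, pos):
    """Eq. (8): percent of the mean of the positive controls."""
    return x / mean(pos) * 100


def poc_med(x, neg):
    """Eq. (9): percent of the median of the negative controls."""
    return x / median(neg) * 100
```

With this definition of NPI, a phenotype equal to the positive-control mean maps to 0 and a phenotype equal to the negative-control mean maps to 1.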

3 Currently used methods: detection of hits

Denote by $v_{gkp}$ the phenotype of the $k$th replicate of mutant $g$ in plate $p$, normalized with any of the above methods, and consider testing $H_0$: the phenotype of a mutant is consistent with a pre-defined value of interest $c$, against $H_a$: the phenotype of the mutant differs from $c$ more systematically than expected by random chance. Depending on the experiment, $c$ can be the normalized phenotype of a control, or the median normalized phenotype of all mutants in the screen.

Student T statistic is based on summarization by the sample average, and on variance estimation by the sample variance. The test statistic is defined as

$$T_g = \frac{\bar{v}_{g\cdot\cdot} - c}{\sqrt{s_g^2 / n_g}}, \quad \text{where } s_g^2 = \frac{1}{n_g - 1} \sum_{k=1}^{n_g} (v_{gkp} - \bar{v}_{g\cdot\cdot})^2, \tag{10}$$

where $n_g$ is the number of replicates of mutant $g$ in plate $p$.

Moderated T statistic was originally proposed in the context of gene expression microarrays [5], and improves upon the estimate of variance $s_g^2$ for experiments with a small number of replicates. The approach assumes a Scaled Inverse Chi-square prior distribution of the error variance $\sigma^2_{\varepsilon_g}$ (or, equivalently, an Inverse Gamma distribution), and uses an Empirical Bayes procedure to derive the test statistic. Formally, the approach assumes

$$\frac{1}{\sigma^2_{\varepsilon_g}} \overset{iid}{\sim} \frac{1}{d_0 s_0^2}\, \chi^2_{d_0} = \mathrm{Gamma}\left(\frac{d_0}{2}, \frac{d_0 s_0^2}{2}\right), \tag{11}$$

and the moderated T statistic is

$$T_g = \frac{\bar{v}_{g\cdot\cdot} - c}{\sqrt{\tilde{s}_g^2 / n_g}}, \quad \text{where } \tilde{s}_g^2 = \frac{(n_g - 1)\, s_g^2 + d_0 s_0^2}{(n_g - 1) + d_0}. \tag{12}$$

$d_0$ and $s_0^2$ in the expression above are respectively the degrees of freedom parameter and the scale parameter of the prior distribution, which are estimated empirically from the entire collection of mutants in the screen. In other words, the joint analysis of all the mutants provides additional information on the variation, and is equivalent to a prior dataset with estimated variance $s_0^2$ based on $d_0$ degrees of freedom.
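To make the shrinkage in Eq. (12) concrete, here is a small Python sketch of both statistics for a single mutant. The function names are hypothetical, and the prior parameters `d0` and `s0_sq` are taken as given inputs; this sketch does not implement the Empirical Bayes step that estimates them from all mutants in the screen.

```python
from math import sqrt
from statistics import mean, variance


def student_t(values, c):
    """Eq. (10): ordinary one-sample T statistic against the reference value c."""
    n = len(values)
    return (mean(values) - c) / sqrt(variance(values) / n)


def moderated_t(values, c, d0, s0_sq):
    """Eq. (12): T statistic with the variance shrunk toward the prior value s0_sq.

    d0 and s0_sq are assumed to have been estimated elsewhere.
    """
    n = len(values)
    s_sq = variance(values)
    s_tilde_sq = ((n - 1) * s_sq + d0 * s0_sq) / ((n - 1) + d0)
    return (mean(values) - c) / sqrt(s_tilde_sq / n)
```

Setting `d0 = 0` recovers the ordinary statistic, while a large `d0` makes the denominator depend almost entirely on the prior variance `s0_sq`.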

4 Rows and columns of the plate have negligible effect on the quantitative ionomic phenotypes

The following quality control procedure reproduces Figure 4 in reference [3]. We specify the additive model

$$X_{ijp} = \mu_p + R_{i,p} + C_{j,p} + \varepsilon_{ijp}, \tag{13}$$

for all the samples in a plate, separately for each plate. Here $X_{ijp}$ denotes the raw univariate phenotype of the sample in row $i$, column $j$ on plate $p$, and $R_{i,p}$ and $C_{j,p}$ denote the row- and column-specific deviations of the phenotypes from the overall mean. Parameter estimates $\hat{R}_{i,p}$ and $\hat{C}_{j,p}$ are obtained using Tukey two-way Median Polish, a robust alternative to least squares-based estimation. The quality control metrics for the plate are then defined as the sample variances of $\hat{R}_{i,p}$ and $\hat{C}_{j,p}$ relative to the sample variance of the residuals $\hat{\varepsilon}_{ij,p}$

$$\frac{\sum_{i=1}^{8} \left(\hat{R}_{i,p} - \sum_{i=1}^{8} \hat{R}_{i,p}/8\right)^2 / (8 - 1)}{\sum_{i=1}^{8} \sum_{j=1}^{12} \left(\hat{\varepsilon}_{ij,p} - \sum_{i=1}^{8} \sum_{j=1}^{12} \hat{\varepsilon}_{ij,p}/96\right)^2 / 95} \tag{14}$$

and

$$\frac{\sum_{j=1}^{12} \left(\hat{C}_{j,p} - \sum_{j=1}^{12} \hat{C}_{j,p}/12\right)^2 / (12 - 1)}{\sum_{i=1}^{8} \sum_{j=1}^{12} \left(\hat{\varepsilon}_{ij,p} - \sum_{i=1}^{8} \sum_{j=1}^{12} \hat{\varepsilon}_{ij,p}/96\right)^2 / 95}. \tag{15}$$

Fig. 1 shows the distributions of the metrics across plates for all elements and all three screens. A ratio of variance(effects) to variance(residuals) equal to 1 is marked in the plots as a reference. The plots indicate a general absence of row effects, and some larger column effects. The column effect is not surprising since, as shown in Fig. 1 of the main manuscript, in this experiment the columns of the plates hold more diverse biological samples than the rows. In other words, the columns are more strongly confounded with the perturbations. Since the column effects appear only for a subset of the elements, and since all the metrics are below 5 (i.e. much smaller than the median value of 40 in [3]), the row and column effects were omitted from the subsequent modeling for these datasets. As discussed in the main manuscript, the proposed noise reduction methodology is equally applicable with and without these effects.
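The quality control metrics of Eqs. (14) and (15) are ratios of sample variances, which can be sketched as follows. The function name `effect_variance_ratios` is hypothetical, and the row effects, column effects and residuals are assumed to come from a median polish fit of Eq. (13) computed elsewhere.

```python
from statistics import variance


def effect_variance_ratios(row_eff, col_eff, resid):
    """Return (row ratio, column ratio) as in Eqs. (14) and (15).

    row_eff and col_eff are the estimated row and column effects of one plate;
    resid is the matrix of residuals given as a list of rows.
    """
    flat = [e for row in resid for e in row]
    # Sample variance uses the n - 1 denominator, i.e. 95 for an 8 x 12 plate.
    var_resid = variance(flat)
    return variance(row_eff) / var_resid, variance(col_eff) / var_resid
```

A ratio near or below 1 indicates that the spread of the estimated effects is no larger than the spread of the residuals.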

5 Detection of hits

Similarly to (Efron, 2008), we apply a transformation to the test statistic to ensure that the sampling distribution under $H_0$ is close to the Standard Normal, i.e.

$$Z_g = \frac{D_g - \mathrm{median}(D_g)}{\mathrm{median}(\,|D_g - \mathrm{median}(D_g)|\,) \cdot C}, \tag{16}$$

where $C = 1/\Phi^{-1}(3/4) \approx 1.4826$ is a normalizing constant for a robust unbiased estimation of the scale. Fig. 2 illustrates the sampling distributions of $Z_g$ with all dimensions combined, for the three screens in the main manuscript. When the assumptions of the proposed model and of the estimation procedure are satisfied, the sampling distribution of $Z_g$ under $H_0$ is approximately Standard Normal. As can be seen from the figure, the Standard Normal distribution approximates the center of the histogram well, which indicates that the data present no gross departures from the assumptions.

Fig. 3 shows the result of fitting the two-group model of (Efron, 2008) to the test statistics of all mutants in the KO screen, combined across all the dimensions of the multivariate phenotypes. The raw phenotypes were normalized as described in legends (a)-(g), and standardized with the moderated T statistic. The figures show fairly narrow sampling distributions of the test statistics, with many outlying values, which yield a large number of candidate hits. This pattern is due, in part, to the under-estimation of the variation.
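The robust standardization in Eq. (16) can be sketched with the Python standard library alone. The function name `robust_z` is hypothetical, and `d` is assumed to be the list of test statistics $D_g$ for all mutants; the constant is computed from the Standard Normal quantile function rather than hard-coded.

```python
from statistics import NormalDist, median

# Normalizing constant C = 1 / Phi^{-1}(3/4), approximately 1.4826.
C = 1 / NormalDist().inv_cdf(0.75)


def robust_z(d):
    """Eq. (16): center by the median, scale by C times the MAD."""
    med = median(d)
    mad = median(abs(x - med) for x in d)
    return [(x - med) / (mad * C) for x in d]
```

Because both the center and the scale are medians, a moderate number of true hits in the tails has little influence on the standardization of the remaining mutants.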

[Figure 1, three panels: (a) KO screen, presence of column/row effects in KO, 14 dimensions; (b) KOd screen, presence of column/row effects in KOd, 14 dimensions; (c) OE screen, presence of column/row effects in OE, 17 dimensions. X axis: inorganic elements (Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66; the OE screen additionally includes As75, Cl35 and Se82). Y axis: variance(effect)/variance(residual), from 0 to 20. Legend: var(column)/var(residual), var(row)/var(residual).]

Figure 1: Distribution of relative row and column effects in Eq. (14) and Eq. (15), summarized over plates, separately for each ionomic screen in the main manuscript, and each inorganic element. Horizontal lines indicate the ratio of 1.

[Figure 2, three histogram panels of $Z_g$ shown over the range −4 to 4: (a) KO screen, all elements; (b) KOd screen, all elements; (c) OE screen, all elements. Each panel overlays the N(0, 1) curve and the fitted curve. Annotations recovered from the panels: thresholds (−3.413, 3.443), 148 (13.13%) of 1127 mutants; thresholds (−3.734, 3.686), 964 (16.63%) of 5798 mutants; thresholds (−3.158, 3.293), 1303 (26.38%) of 4940 mutants; ranges [−10.789, 8.591], [−12.624, 40.407] and [−9.449, 33.694].]

Figure 2: Determination of hits in three perturbation screens in the main manuscript. The histograms show the sampling distributions of $Z_g$ in Eq. (16), combined across all dimensions of the multivariate phenotype. The dashed line shows the Standard Normal distribution fitted to the center of the distribution, and the green line shows the fit to the histogram according to the two-group model in (Efron, 2008). Magenta triangles indicate the thresholds of $Z_g$, which control the FDR at 0.05.

[Figure 3, seven histogram panels: (a) Bscore; (b) Zscore; (c) Plate-wise median; (d) pocMean; (e) pocMed; (f) NPI; (g) quantile. Y axis: Frequency. Each panel is annotated with the thresholds, the number of mutants beyond the thresholds out of 4972 (ranging across panels from 3359 (67.56%) to 4885 (98.25%)), and the MLE and CME estimates of delta, sigma and p0.]

Figure 3: Result of fitting a two-group model by (Efron, 2008) to the test statistics of all mutants in the KO screen, combined across all the dimensions of the multivariate phenotypes. The raw phenotypes were normalized as described in legends (a)-(g), and standardized with the moderated T statistic. Score cutoffs were chosen to control the False Discovery Rate at 0.05.

6 Profile plots before and after the proposed noise reduction procedure

Figs. 4, 5 and 6 show profile plots of the evaluation controls before and after the proposed noise reduction procedure, as in Fig. 4 (a) and (b) of the main manuscript, for all screens. The evaluation controls were used neither for normalization nor for variance estimation. The figures illustrate the effectiveness of the proposed method in reducing the noise.

[Figure 4, four panels: (a) YLR396C, before normalization, correlation 0.558; (b) YLR396C, noise reduction with BY4741-YDL227C, correlation 0.968; (c) YPR065W, before normalization, correlation 0.457; (d) YPR065W, noise reduction with BY4741-YDL227C, correlation 0.971. X axis: Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66. Y axis: raw Z statistic in (a), (c); nocZrr in (b), (d).]

Figure 4: KO screen. Profile plots of the standardized phenotypes of the controls, which have not been used for normalization or standardization. X axis: inorganic elements. Y axis: (a), (c) raw and (b), (d) normalized and standardized phenotypes. Each line represents the phenotype of the control in one plate.

[Figure 5, four panels: (a) YLR396C, before normalization, correlation 0.408; (b) YLR396C, noise reduction with BY4743-YDL227C, correlation 0.963; (c) YPR065W, before normalization, correlation 0.033; (d) YPR065W, noise reduction with BY4743-YDL227C, correlation 0.940. X axis: Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66. Y axis: raw Z statistic in (a), (c); nocZrr in (b), (d).]

Figure 5: KOd screen. Profile plots of the standardized phenotypes of the controls, which have not been used for normalization or standardization. X axis: inorganic elements. Y axis: (a), (c) raw and (b), (d) normalized and standardized phenotypes. Each line represents the phenotype of the control in one plate.

[Figure 6, four panels: (a) YBR290W, before normalization, correlation 0.039; (b) YBR290W, noise reduction with YMR243C-YDL227C, correlation 0.962; (c) YGL008C, before normalization, correlation 0.03; (d) YGL008C, noise reduction with YMR243C-YDL227C, correlation 0.961. X axis: As75, Ca44, Cd111, Cl35, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Se82, Zn66. Y axis: raw Z statistic in (a), (c); nocZrr in (b), (d).]

Figure 6: OE screen. Profile plots of the standardized phenotypes of the controls, which have not been used for normalization or standardization. X axis: inorganic elements. Y axis: (a), (c) raw and (b), (d) normalized and standardized phenotypes. Each line represents the phenotype of the control in one plate.

7 Profile plots after normalization with sample-based B score, and summarization with moderated T statistic

Fig. 7 shows profile plots of the evaluation controls for all screens, after normalization with sample-based B score, and summarization with moderated T, as in Fig. 4(c) of the main manuscript.

[Figure 7, six panels with their average correlations: (a) YLR396C in KO, 0.640; (b) YPR065W in KO, 0.720; (c) YLR396C in KOd, 0.889; (d) YPR065W in KOd, 0.825; (e) YBR290W in OE, 0.491; (f) YGL008C in OE, 0.331. X axis: inorganic elements. Y axis: modT_Bscore.]

Figure 7: Profile plots of evaluation controls. Moderated T, after normalization with B score.

8 Profile plots after normalization with control-based NPI and summarization with moderated T statistic

Fig. 8 shows profile plots of the evaluation controls for all screens, after normalization with control-based NPI score, and summarization with moderated T, as in Fig. 4(c) of the main manuscript.

[Figure 8, six panels with their average correlations: (a) YLR396C in KO, 0.438; (b) YPR065W in KO, 0.508; (c) YLR396C in KOd, 0.689; (d) YPR065W in KOd, 0.686; (e) YBR290W in OE, 0.759; (f) YGL008C in OE, 0.595. X axis: inorganic elements. Y axis: modT_NPI.]

Figure 8: Profile plots of evaluation controls. Moderated T, after normalization with NPI score.

9 Relative contribution of the normalization and summarization steps to the overall accuracy

Average pairwise Pearson correlation between plates

        (1) KO screen         (2) KOd screen        (3) OE screen
Row     YLR396C   YPR065W     YLR396C   YPR065W     YBR290W   YGL008C
1       0.968     0.971       0.963     0.940       0.962     0.961
2       0.904     0.901       0.895     0.869       0.911     0.917
3       0.777     0.742       0.824     0.783       0.716     0.725
4       0.831     0.797       0.743     0.739       0.705     0.684
5       0.202     0.207       0.176     0.243       0.160     0.292

[The step-indicator columns of the table ("Normalization and summarization steps", with the headers "Normalize: B and P", "(B) growth rate" and "Estimate: $\sigma^2_{B_0}$ and $\sigma^2_{P_0}$") contain four, three and two 'X' marks respectively; their row positions are not recoverable from the extracted text.]

Table 1: Pearson correlation of normalized and summarized profiles between plates, for two positive controls which have not been previously used for normalization or standardization. Higher values indicate better noise reduction. 'X' indicates the applied normalization and variance estimation steps. The first row corresponds to the proposed approach. The table summarizes the contributions of various analysis steps to the accuracy of the results. In the example datasets, normalization with respect to the covariate (growth rate) and estimation of the residual variance terms ($\sigma^2_{B_0}$ and $\sigma^2_{P_0}$) make a stronger contribution to the reduction of the noise than the batch- and plate-wise normalization.
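The summary reported in Table 1, an average pairwise Pearson correlation between plates, can be computed as in the following Python sketch. The function names are hypothetical, and each profile is assumed to be the vector of summarized phenotypes of one control on one plate, ordered by inorganic element.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_pairwise_correlation(profiles):
    """Mean of the Pearson correlations over all pairs of plates."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)
```

Values close to 1 mean that the control shows nearly the same profile on every plate, which is the sense in which higher values indicate better noise reduction.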

10 Additional supplementary materials

Additional supplementary figures, shown in separate files, illustrate raw and normalized phenotypes for the four controls in KO, KOd and OE screens.

Additional Supplementary 1 Boxplots of the ionomic phenotypes of the controls for each element and each screen, separately for each plate, as in Fig. 1(c) and Fig. 2 of the main manuscript. Raw phenotypes are shown at the top of each page. Phenotypes normalized with BY4741 for KO, BY4743 for KOd, and YMR243C for OE, are shown at the bottom of each page.

Additional Supplementary 2 Boxplots of the ionomic phenotypes of the controls for each element and each screen, separately for each plate, after normalization with sample-based B score.

Additional Supplementary 3 Boxplots of the ionomic phenotypes of the controls for each element and each screen, separately for each plate, after normalization with sample-based Z score.

Additional Supplementary 4 Boxplots of the ionomic phenotypes of the controls for each element and each screen, separately for each plate, after normalization with control-based NPI score.

References

[1] M. Boutros, L. Brás, F. Hahne, and W. Huber. End-to-end analysis of cell-based screens: from raw intensity readings to the annotated hit list. Documentation for the Bioconductor package cellHTS2, 2009.
[2] Michael Boutros, Lígia P Brás, and Wolfgang Huber. Analysis of cell-based RNAi screens. Genome Biology, 2006.
[3] Nathalie Malo, James A Hanley, Sonia Cerquozzi, Jerry Pelletier, and Robert Nadon. Statistical practice in high-throughput screening data analysis. Nature Biotechnology, 24:167–175, 2006.
[4] Nora Rieber, Bettina Knapp, Roland Eils, and Lars Kaderali. RNAither, an automated pipeline for the statistical analysis of high-throughput RNAi screens. Bioinformatics (Oxford, England), 25:678–9, March 2009.
[5] G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, 2004.