Noise reduction in genome-wide perturbation screens using linear mixed-effect models
Supplementary Materials
Danni Yu 1, John Danku 2, Ivan Baxter 3, Sungjin Kim 4, Olena K. Vatamaniuk 4, David E. Salt 2,5 and Olga Vitek 1,6
June 18, 2011
1 Department of Statistics, Purdue University, West Lafayette, IN, USA
2 School of Biological Sciences, University of Aberdeen, Aberdeen, UK
3 USDA-ARS Plant Genetics Research Unit, Donald Danforth Plant Science Center, MO, USA
4 Department of Crop and Soil Sciences, Cornell University, Ithaca, NY, USA
5 Bindley Bioscience Center, Discovery Park, Purdue University, West Lafayette, IN, USA
6 Department of Computer Science, Purdue University, West Lafayette, IN, USA
We describe the existing statistical methods for the interpretation of perturbation screens in the Results section. Implementations of these methods are available in the open-source Bioconductor packages RNAither [4] and cellHTS2 [2, 1]. Below, $X_{gkp}$ denotes the raw phenotype of replicate k of mutant g on plate p.
1 Currently used methods: sample-based normalization
B score accounts for additive effects of rows and columns within plates. In the notation above, the method specifies a linear model
\[ X_{gkp} = \mu_p + R_{i,p} + C_{j,p} + \varepsilon_{ij,gkp}, \tag{1} \]
where $\mu_p$ is the average phenotype of the plate, $R_{i,p}$ and $C_{j,p}$ are the systematic plate-specific artifacts of row i and column j, and $\varepsilon_{ij,gkp}$ is the non-systematic error following a normal distribution with a mean equal to zero. Parameters in Eq. (1) are estimated separately for each plate, from all measurements on the plate, using a robust alternative to sample averages (i.e. Tukey median polish). The residual measurements of the phenotype are then obtained as
\[ r_{gkp} = X_{gkp} - [\hat{\mu}_p + \hat{R}_{i,p} + \hat{C}_{j,p}]. \tag{2} \]
Finally, the B score for the kth replicate of mutant g is calculated by standardizing its residual by a robust estimate of the plate-specific spread of the residuals of all samples
\[ \mathrm{Bscore}_{gkp} = \frac{r_{gkp}}{\mathrm{median}(|r_{gkp} - \mathrm{median}(r_{gkp})|)_p}. \tag{3} \]
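The B score computation in Eqs. (1)-(3) can be sketched as follows. This is a minimal illustration in plain Python and not the implementation used in RNAither or cellHTS2; the function names `median_polish` and `b_scores`, the fixed number of polishing iterations, and the simulated plate are assumptions made for the example.

```python
import random
from statistics import median


def median_polish(plate, n_iter=10):
    """Tukey two-way median polish of one plate (list of rows).

    Returns (overall, row_effects, col_effects, residuals), so that
    plate[i][j] == overall + row_effects[i] + col_effects[j] + residuals[i][j].
    The fixed iteration count is an assumption of this sketch.
    """
    n_rows, n_cols = len(plate), len(plate[0])
    resid = [list(map(float, row)) for row in plate]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        for i in range(n_rows):            # sweep out row medians
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        m = median(col_eff)                # re-center column effects
        overall += m
        col_eff = [c - m for c in col_eff]
        for j in range(n_cols):            # sweep out column medians
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)                # re-center row effects
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


def b_scores(plate):
    """Eq. (3): residual divided by the plate-wide MAD of the residuals."""
    _, _, _, resid = median_polish(plate)
    flat = [r for row in resid for r in row]
    center = median(flat)
    mad = median(abs(r - center) for r in flat)
    if mad == 0:
        raise ValueError("residual MAD is zero; the B score is undefined")
    return [[r / mad for r in row] for row in resid]


# Simulated 8 x 12 plate (hypothetical data): additive trends plus noise.
rng = random.Random(0)
plate = [[100 + 2 * i - 3 * j + rng.gauss(0, 1) for j in range(12)]
         for i in range(8)]
overall, row_eff, col_eff, resid = median_polish(plate)
scores = b_scores(plate)
```

As in Eq. (3), the denominator is the unscaled median absolute deviation, so the B scores of a plate have a median absolute deviation of 1 by construction.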
Z score accounts for the distance between the mutant's phenotype and the plate mean, in units of plate-specific standard deviation
\[ \mathrm{Zscore}_{gkp} = \frac{X_{gkp} - \bar{X}_{\cdot\cdot p}}{s_{\cdot\cdot p}}, \tag{4} \]
where $\bar{X}_{\cdot\cdot p}$ and $s_{\cdot\cdot p}$ are respectively the mean and the standard deviation of the raw phenotypes of all samples in plate p.
Plate-wise median normalization (pmNorm) is defined as
\[ \mathrm{pmNorm}_{gkp} = \frac{X_{gkp}}{\mathrm{median}(X_{gkp})_p}. \tag{5} \]
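As a small worked illustration of Eqs. (4) and (5), the sketch below applies both normalizations to the values of a single plate. The function names and the three example values are hypothetical, and the plate is represented simply as a flat list of raw phenotypes.

```python
from statistics import mean, median, stdev


def z_scores(plate_values):
    """Eq. (4): center by the plate mean, scale by the plate standard deviation."""
    m, s = mean(plate_values), stdev(plate_values)
    return [(x - m) / s for x in plate_values]


def pm_norm(plate_values):
    """Eq. (5): divide every value by the plate median."""
    med = median(plate_values)
    return [x / med for x in plate_values]


# Hypothetical raw phenotypes of one plate.
example_plate = [2.0, 4.0, 6.0]
z = z_scores(example_plate)       # plate mean 4, standard deviation 2
pm = pm_norm(example_plate)       # plate median 4
```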
Quantile normalization assumes that the empirical distributions of the phenotypes are the same across plates. It normalizes the phenotype $X_{gkp}$ by applying the transformation
\[ r_{gkp} = F^{-1}(G(X_{gkp})), \tag{6} \]
where G is the empirical distribution of each plate, and F is the averaged distribution of sample quantiles across all plates.
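The transformation in Eq. (6) can be illustrated with a rank-based sketch. It assumes, for simplicity, that all plates contain the same number of samples, so that G reduces to the rank of a value within its plate and F^{-1} to the across-plate average of the sorted values at that rank. The function name and the example values are hypothetical.

```python
def quantile_normalize(plates):
    """Map each value to the across-plate average of the values at its rank."""
    n = len(plates[0])
    if any(len(p) != n for p in plates):
        raise ValueError("this sketch assumes plates of equal size")
    sorted_plates = [sorted(p) for p in plates]
    # F^{-1}: reference quantiles, averaged across plates at each rank.
    reference = [sum(sp[r] for sp in sorted_plates) / len(plates)
                 for r in range(n)]
    normalized = []
    for p in plates:
        # G: rank of each value within its own plate (ties broken by position).
        order = sorted(range(n), key=lambda idx: p[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical plates with three samples each.
plates = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]
normalized = quantile_normalize(plates)
```

After the transformation, every plate has the same set of values (here 1.5, 3.5 and 7.0), arranged according to the original within-plate ranks.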
2 Currently used methods: control-based normalization
Normalized percent inhibition (NPI) is defined as
\[ \mathrm{NPI}_{gkp} = \frac{\bar{c}^{+}_p - X_{gkp}}{\bar{c}^{+}_p - \bar{c}^{-}_p}, \tag{7} \]
where $\bar{c}^{+}_p$ and $\bar{c}^{-}_p$ are the mean raw phenotypes of positive and negative controls in plate p.
Percent of control mean (pocMean) is defined as
\[ \mathrm{pocMean}_{gkp} = \frac{X_{gkp} \times 100}{\bar{c}^{+}_p}, \tag{8} \]
where $\bar{c}^{+}_p$ is the mean of the positive controls in plate p.
Percent of control median (pocMed) is defined as
\[ \mathrm{pocMed}_{gkp} = \frac{X_{gkp} \times 100}{\tilde{c}^{-}_p}, \tag{9} \]
where $\tilde{c}^{-}_p$ is the median of the negative controls in plate p.
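The three control-based normalizations in Eqs. (7)-(9) differ only in which control summary they use. The sketch below makes this explicit for one plate; the function names and the example control values are hypothetical.

```python
from statistics import mean, median


def npi(x, pos, neg):
    """Eq. (7): normalized percent inhibition for one plate."""
    c_pos, c_neg = mean(pos), mean(neg)
    return [(c_pos - v) / (c_pos - c_neg) for v in x]


def poc_mean(x, pos):
    """Eq. (8): percent of the mean of the positive controls."""
    c_pos = mean(pos)
    return [v * 100 / c_pos for v in x]


def poc_med(x, neg):
    """Eq. (9): percent of the median of the negative controls."""
    c_neg = median(neg)
    return [v * 100 / c_neg for v in x]


# Hypothetical plate: positive controls average 12, negative controls 3.
pos_controls = [10.0, 14.0]
neg_controls = [2.0, 3.0, 4.0]
samples = [12.0, 3.0, 7.5]
npi_values = npi(samples, pos_controls, neg_controls)
poc_mean_values = poc_mean(samples, pos_controls)
poc_med_values = poc_med(samples, neg_controls)
```

With this definition, a sample equal to the positive control mean has NPI 0, and a sample equal to the negative control mean has NPI 1.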
3 Currently used methods: detection of hits
Denote by $v_{gkp}$ the phenotype of the kth replicate of mutant g in plate p, normalized with any of the above methods, and consider testing H0: the phenotype of a mutant is consistent with a pre-defined value of interest c, against Ha: the phenotype of the mutant differs from c more systematically than expected by random chance. Depending on the experiment, c can be the normalized phenotype of a control, or the median normalized phenotype of all mutants in the screen.
Student T statistic is based on the summarization by sample average, and on variance estimation by sample variance. The test statistic is defined as
\[ T_g = \frac{\bar{v}_{g\cdot\cdot} - c}{\sqrt{s_g^2 / n_g}}, \quad \text{where } s_g^2 = \frac{1}{n_g - 1} \sum_{k=1}^{n_g} (v_{gkp} - \bar{v}_{g\cdot\cdot})^2, \tag{10} \]
where $n_g$ is the number of replicates of mutant g in plate p.
Moderated T statistic was originally proposed in the context of gene expression microarrays [5], and improves upon the estimate of variance $s_g^2$ for experiments with a small number of replicates. The approach assumes a Scaled Inverse Chi-square prior distribution of the variance $\sigma^2_{\varepsilon g}$ (or, equivalently, an Inverse Gamma distribution), and uses an Empirical Bayes procedure to derive the test statistic. Formally, the approach assumes
\[ \frac{1}{\sigma^2_{\varepsilon g}} \overset{iid}{\sim} \frac{1}{d_0 s_0^2} \chi^2_{d_0} = \mathrm{Gamma}\left(\frac{d_0}{2}, \frac{d_0 s_0^2}{2}\right), \tag{11} \]
and the moderated T statistic is
\[ T_g = \frac{\bar{v}_{g\cdot\cdot} - c}{\sqrt{\tilde{s}_g^2 / n_g}}, \quad \text{where } \tilde{s}_g^2 = \frac{(n_g - 1) s_g^2 + d_0 s_0^2}{(n_g - 1) + d_0}. \tag{12} \]
$d_0$ and $s_0^2$ in the expression above are respectively the degrees of freedom parameter and the scale parameter of the prior distribution, which are estimated empirically from the entire collection of mutants in the screen. In other words, the joint analysis of all the mutants provides additional information on the variation, and is equivalent to a prior dataset with estimated variance $s_0^2$ based on $d_0$ degrees of freedom.
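The difference between Eqs. (10) and (12) is only in the variance estimate, which a short sketch makes concrete. Here the prior parameters `d0` and `s0_sq` are taken as given; their Empirical Bayes estimation from all mutants is not shown. The function names and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def student_t(v, c=0.0):
    """Eq. (10): sample average standardized by the sample variance."""
    n = len(v)
    return (mean(v) - c) / sqrt(variance(v) / n)


def moderated_t(v, d0, s0_sq, c=0.0):
    """Eq. (12): the sample variance is shrunk towards the prior scale s0_sq.

    d0 and s0_sq are assumed to be already estimated from the whole screen.
    """
    n = len(v)
    s_tilde_sq = ((n - 1) * variance(v) + d0 * s0_sq) / ((n - 1) + d0)
    return (mean(v) - c) / sqrt(s_tilde_sq / n)


# Hypothetical mutant with three replicates: mean 2, sample variance 1.
replicates = [1.0, 2.0, 3.0]
t_plain = student_t(replicates)
# Prior worth d0 = 2 degrees of freedom with scale s0_sq = 4:
# the moderated variance is (2 * 1 + 2 * 4) / (2 + 2) = 2.5.
t_mod = moderated_t(replicates, d0=2, s0_sq=4.0)
```

In this example the sample variance underestimates the prior scale, so the moderated statistic is smaller in magnitude than the Student T statistic. With `d0 = 0` the two statistics coincide.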
4 Rows and columns of the plate have negligible effect on the quantitative ionomic phenotypes
The following quality control procedure reproduces Figure 4 in reference [3]. We specify the additive model
\[ X_{ijp} = \mu_p + R_{i,p} + C_{j,p} + \varepsilon_{ijp}, \tag{13} \]
for all the samples in a plate, separately for each plate. Here $X_{ijp}$ denotes the raw univariate phenotype of the sample in row i, column j on the plate p, and $R_{i,p}$ and $C_{j,p}$ denote the row- and column-specific deviations of the phenotypes from the overall mean. Parameter estimates $\hat{R}_{i,p}$ and $\hat{C}_{j,p}$ are obtained using Tukey two-way Median Polish, a robust alternative to least squares-based estimation. The quality control metrics for the plate are then defined as the sample variances of $\hat{R}_{i,p}$ and $\hat{C}_{j,p}$ relative to the sample variance of the residuals $\hat{\varepsilon}_{ij,p}$
\[ \frac{\sum_{i=1}^{8} \left(\hat{R}_{i,p} - \sum_{i=1}^{8} \hat{R}_{i,p}/8\right)^2 / (8 - 1)}{\sum_{i=1}^{8} \sum_{j=1}^{12} \left(\hat{\varepsilon}_{ij,p} - \sum_{i=1}^{8} \sum_{j=1}^{12} \hat{\varepsilon}_{ij,p}/96\right)^2 / 95} \tag{14} \]
and
\[ \frac{\sum_{j=1}^{12} \left(\hat{C}_{j,p} - \sum_{j=1}^{12} \hat{C}_{j,p}/12\right)^2 / (12 - 1)}{\sum_{i=1}^{8} \sum_{j=1}^{12} \left(\hat{\varepsilon}_{ij,p} - \sum_{i=1}^{8} \sum_{j=1}^{12} \hat{\varepsilon}_{ij,p}/96\right)^2 / 95}. \tag{15} \]
Fig. 1 shows the distributions of the metrics across plates for all elements and all three screens. Horizontal lines in the plots mark a ratio of variance(effects) over variance(residuals) equal to 1 as a reference. The plots indicate the general absence of row effects, and some larger column effects. The column effect is not surprising since, as shown in Fig. 1 of the main manuscript, in this experiment columns of the plates have more diverse biological samples than rows. In other words, the columns have a stronger confounding with the perturbations. Since the column effects appear only for a subset of the elements, and since all the metrics are below 5 (i.e. much smaller than the median value of 40 in [3]), the row and column effects were omitted from the subsequent downstream modeling for these datasets. As discussed in the main manuscript, the proposed noise reduction methodology is equally applicable with and without these effects.
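The quality control metrics in Eqs. (14) and (15) are ratios of sample variances, and can be sketched as below. The function assumes that the row effects, column effects and residuals of one plate have already been estimated (e.g. by median polish); its name and the small example values are hypothetical. For an 8 x 12 plate, the sample variances use the denominators 7, 11 and 95 of the equations.

```python
from statistics import variance


def effect_ratios(row_eff, col_eff, resid):
    """Eqs. (14) and (15): var(row)/var(residual) and var(column)/var(residual).

    row_eff and col_eff are the estimated effects of one plate, and
    resid is the matching table of residuals (list of rows).
    """
    flat = [e for row in resid for e in row]
    var_resid = variance(flat)          # sample variance, n - 1 denominator
    return variance(row_eff) / var_resid, variance(col_eff) / var_resid


# Hypothetical 2 x 3 plate: var(row) = 2, var(column) = 4, var(residual) = 1.2.
row_effects = [1.0, -1.0]
col_effects = [0.0, 2.0, 4.0]
residuals = [[1.0, -1.0, 1.0], [-1.0, 1.0, -1.0]]
row_ratio, col_ratio = effect_ratios(row_effects, col_effects, residuals)
```

A ratio near or below 1 indicates that the systematic row or column variation is comparable to the residual variation on that plate.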
5 Detection of hits
Similarly to (Efron, 2008), we apply a transformation to the test statistic to ensure that the sampling distribution under H0 is close to the Standard Normal, i.e.
\[ Z_g = \frac{D_g - \mathrm{median}(D_g)}{\mathrm{median}(|D_g - \mathrm{median}(D_g)|) \cdot C}, \tag{16} \]
where $C = 1/\Phi^{-1}(3/4) \approx 1.4826$ is a normalizing constant for a robust unbiased estimation of the scale. Fig. 2 illustrates the sampling distributions of $Z_g$ with all dimensions combined, for the three screens in the main manuscript. When the assumptions of the proposed model and of the estimation procedure are satisfied, the sampling distribution of $Z_g$ under H0 is approximately Standard Normal. As can be seen from the figure, the Standard Normal distribution approximates the center of the histogram well, which indicates that the data present no gross departures from the assumptions.
Fig. 3 shows the result of fitting a two-group model by (Efron, 2008) to the test statistics of all mutants in the KO screen, combined across all the dimensions of the multivariate phenotypes. The raw phenotypes were normalized as described in legends (a)-(g), and standardized with the moderated T statistic. The figures show fairly narrow sampling distributions of the test statistics, with many outlying values, which yield a large number of candidate hits. This pattern is due, in part, to the under-estimation of the variation.
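The robust standardization in Eq. (16) can be sketched as follows. The sketch only covers the transformation itself; the two-group model fit and the FDR thresholds are not shown. The function name and the example statistics are hypothetical.

```python
from statistics import NormalDist, median

# Normalizing constant of Eq. (16): 1 / Phi^{-1}(3/4), approximately 1.4826.
C = 1 / NormalDist().inv_cdf(0.75)


def robust_z(d):
    """Eq. (16): center by the median, scale by C times the MAD."""
    med = median(d)
    mad = median(abs(x - med) for x in d)
    return [(x - med) / (mad * C) for x in d]


# Hypothetical test statistics with one outlying value.
statistics_d = [1.0, 2.0, 3.0, 4.0, 100.0]
z_values = robust_z(statistics_d)   # median 3, MAD 1
```

Because both the center and the scale are based on medians, the single outlying value in the example does not inflate the scale, and it remains far from zero after the transformation.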
[Figure 1 plots. Panels: (a) KO screen, presence of column/row effects in KO, 14 dimensions; (b) KOd screen, presence of column/row effects in KOd, 14 dimensions; (c) OE screen, presence of column/row effects in OE, 17 dimensions. X axis: inorganic elements (Ca44, Cd111, Co59, Cu65, Fe57, K39, Mg25, Mn55, Mo95, Na23, Ni60, P31, S34, Zn66 in KO and KOd; additionally As75, Cl35 and Se82 in OE). Y axis: variance(effect)/variance(residual), from 0 to 20. Legend: var(column)/var(residual), var(row)/var(residual).]
Figure 1: Distribution of relative row and column effects in Eq. (14) and Eq. (15), summarized over plates, separately for each ionomic screen in the main manuscript, and each inorganic element. Horizontal lines indicate the ratio of 1.
[Figure 2 plots. Panels: (a) KO screen, all elements; (b) KOd screen, all elements; (c) OE screen, all elements. X axis: Zg, from −4 to 4. Legend: histogram of all elements, N(0, 1), fitted curve. In-plot annotations, one per panel: thresholds (−3.413, 3.443), 148 (13.13%) of 1127 mutants; thresholds (−3.734, 3.686), 964 (16.63%) of 5798 mutants; thresholds (−3.158, 3.293), 1303 (26.38%) of 4940 mutants.]
Figure 2: Determination of hits in three perturbation screens in the main manuscript. The histograms show the sampling distributions of $Z_g$ in Eq. (16), combined across all dimensions of the multivariate phenotype. The dashed line shows the Standard Normal distribution fitted to the center of the distribution, and the green line shows the fit to the histogram according to the two-group model in (Efron, 2008). Magenta triangles indicate the thresholds of $Z_g$, which control the FDR at 0.05.
[Figure 3 plots. Panels: (a) Bscore; (b) Zscore; (c) Plate-wise median; (d) pocMean; (e) pocMed; (f) NPI; (g) quantile. Y axis: Frequency. Each panel is annotated with the score thresholds, the number of candidate hits out of 4972 mutants (between 67.56% and 98.25% across panels), and the MLE and CME estimates of delta, sigma and p0.]
Figure 3: Result of fitting a two-group model by (Efron, 2008) to the test statistics of all mutants in the KO screen, combined across all the dimensions of the multivariate phenotypes. The raw phenotypes were normalized as described in legends (a)-(g), and standardized with the moderated T statistic. Score cutoffs were chosen to control the False Discovery Rate at 0.05.
6 Profile plots before and after the proposed noise reduction procedure
Figs. 4, 5 and 6 show profile plots of the evaluation controls before and after the proposed noise reduction procedure, as in Fig. 4 (a) and (b) of the main manuscript, for all screens. The evaluation controls were not used for normalization or for variance estimation. The figures illustrate the effectiveness of the proposed method in reducing the noise.
[Figure 4 plots. Panels: (a) YLR396C, before normalization, correlation 0.558; (b) YLR396C, noise reduction with BY4741-YDL227C, correlation 0.968; (c) YPR065W, before normalization, correlation 0.457; (d) YPR065W, noise reduction with BY4741-YDL227C, correlation 0.971. X axis: inorganic elements (Ca44 to Zn66). Y axis: raw Z statistic in (a), (c); nocZrr in (b), (d).]
Figure 4: KO screen. Profile plots of the standardized phenotypes of the controls, which have not been used for normalization or standardization. X axis: inorganic elements. Y axis: (a), (c) raw and (b), (d) normalized and standardized phenotypes. Each line represents the phenotype of the control in one plate.
[Figure 5 plots. Panels: (a) YLR396C, before normalization, correlation 0.408; (b) YLR396C, noise reduction with BY4743-YDL227C, correlation 0.963; (c) YPR065W, before normalization, correlation 0.033; (d) YPR065W, noise reduction with BY4743-YDL227C, correlation 0.940. X axis: inorganic elements (Ca44 to Zn66). Y axis: raw Z statistic in (a), (c); nocZrr in (b), (d).]
Figure 5: KOd screen. Profile plots of the standardized phenotypes of the controls, which have not been used for normalization or standardization. X axis: inorganic elements. Y axis: (a), (c) raw and (b), (d) normalized and standardized phenotypes. Each line represents the phenotype of the control in one plate.
[Figure 6 plots. Panels: (a) YBR290W, before normalization, correlation 0.039; (b) YBR290W, noise reduction with YMR243C-YDL227C, correlation 0.962; (c) YGL008C, before normalization, correlation 0.03; (d) YGL008C, noise reduction with YMR243C-YDL227C, correlation 0.961. X axis: inorganic elements (As75 to Zn66). Y axis: raw Z statistic in (a), (c); nocZrr in (b), (d).]
Figure 6: OE screen. Profile plots of the standardized phenotypes of the controls, which have not been used for normalization or standardization. X axis: inorganic elements. Y axis: (a), (c) raw and (b), (d) normalized and standardized phenotypes. Each line represents the phenotype of the control in one plate.
7 Profile plots after normalization with sample-based B score, and summarization with moderated T statistic
Fig. 7 shows profile plots of the evaluation controls for all screens, after normalization with sample-based B score, and summarization with moderated T, as in Fig. 4(c) of the main manuscript.
[Figure 7 plots. Panels: (a) YLR396C in KO, average correlation 0.640; (b) YPR065W in KO, average correlation 0.720; (c) YLR396C in KOd, average correlation 0.889; (d) YPR065W in KOd, average correlation 0.825; (e) YBR290W in OE, average correlation 0.491; (f) YGL008C in OE, average correlation 0.331. X axis: inorganic elements. Y axis: modT_Bscore.]
Figure 7: Profile plots of evaluation controls. Moderated T, after normalization with B score.
8 Profile plots after normalization with control-based NPI and summarization with moderated T statistic
Fig. 8 shows profile plots of the evaluation controls for all screens, after normalization with control-based NPI score, and summarization with moderated T, as in Fig. 4(c) of the main manuscript.
[Figure 8 plots. Panels: (a) YLR396C in KO, average correlation 0.438; (b) YPR065W in KO, average correlation 0.508; (c) YLR396C in KOd, average correlation 0.689; (d) YPR065W in KOd, average correlation 0.686; (e) YBR290W in OE, average correlation 0.759; (f) YGL008C in OE, average correlation 0.595. X axis: inorganic elements. Y axis: modT_NPI.]
Figure 8: Profile plots of evaluation controls. Moderated T, after normalization with NPI score.
9 Relative contribution of the normalization and summarization steps to the overall accuracy

Normalization and summarization steps ('X' marks as extracted; blank cells are not recoverable):
  Normalize: B and P(B)                               X X X X
  Normalize: growth rate                              X X X
  Estimate: $\sigma^2_{B0}$ and $\sigma^2_{P0}$       X X

Average pairwise Pearson correlation between plates
        (1) KO screen         (2) KOd screen        (3) OE screen
Row     YLR396C   YPR065W     YLR396C   YPR065W     YBR290W   YGL008C
1       0.968     0.971       0.963     0.940       0.962     0.961
2       0.904     0.901       0.895     0.869       0.911     0.917
3       0.777     0.742       0.824     0.783       0.716     0.725
4       0.831     0.797       0.743     0.739       0.705     0.684
5       0.202     0.207       0.176     0.243       0.160     0.292
Table 1: Pearson correlation of normalized and summarized profiles between plates, for two positive controls which have not been previously used for normalization or standardization. Higher values indicate better noise reduction. 'X' indicates the applied normalization and variance estimation steps. The first row corresponds to the proposed approach.
The table summarizes the contributions of various analysis steps to the accuracy of the results. In the example datasets, normalization with respect to the covariate (growth rate) and estimation of residual variance terms ($\sigma^2_{B0}$ and $\sigma^2_{P0}$) make a stronger contribution to the reduction of the noise than the batch- and plate-wise normalization.
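The accuracy measure reported in Table 1, the average pairwise Pearson correlation between the profiles of a control on different plates, can be sketched as follows. Each profile is the vector of standardized phenotypes of the control across elements on one plate. The function names and the three short example profiles are hypothetical and are not taken from the screens.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two profiles of equal length."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_pairwise_correlation(profiles):
    """Mean Pearson correlation over all pairs of plate profiles."""
    return mean([pearson(a, b) for a, b in combinations(profiles, 2)])


# Three hypothetical plate profiles of one control over three elements:
# the first two agree perfectly, the third is reversed.
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
avg_cor = average_pairwise_correlation(profiles)
```

In this toy example the three pairwise correlations are 1, −1 and −1, so the average is −1/3. Values close to 1, as in the first row of Table 1, indicate that the profiles of the control are reproducible across plates.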
10 Additional supplementary materials
Additional supplementary figures, shown in separate files, illustrate raw and normalized phenotypes for the four controls in KO, KOd and OE screens.
Additional Supplementary 1 Boxplots of the ionomic phenotypes of the controls for each element and each screen, separately for each plate, as in Fig. 1(c) and Fig. 2 of the main manuscript. Raw phenotypes are shown at the top of each page. Phenotypes normalized with BY4741 for KO, BY4743 for KOd, and YMR243C for OE, are shown at the bottom of each page.
Additional Supplementary 2 Boxplots of the ionomic phenotypes of the controls for each element and each screen, separately for each plate, after normalization with sample-based B score.
Additional Supplementary 3 Boxplots of the ionomic phenotypes of the controls for each element and each screen, separately for each plate, after normalization with sample-based Z score.
Additional Supplementary 4 Boxplots of the ionomic phenotypes of the controls for each element and each screen, separately for each plate, after normalization with control-based NPI score.
References
[1] M. Boutros, L. Brás, F. Hahne, and W. Huber. End-to-end analysis of cell-based screens: from raw intensity readings to the annotated hit list. Documentation for the Bioconductor package cellHTS2, 2009.
[2] Michael Boutros, Lígia P. Brás, and Wolfgang Huber. Analysis of cell-based RNAi screens. Genome Biology, 2006.
[3] Nathalie Malo, James A. Hanley, Sonia Cerquozzi, Jerry Pelletier, and Robert Nadon. Statistical practice in high-throughput screening data analysis. Nature Biotechnology, 24:167–175, 2006.
[4] Nora Rieber, Bettina Knapp, Roland Eils, and Lars Kaderali. RNAither, an automated pipeline for the statistical analysis of high-throughput RNAi screens. Bioinformatics (Oxford, England), 25:678–9, March 2009.
[5] G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, 2004.