Mass Spectra Alignments and their Significance
Sebastian Böcker (1), Hans-Michael Kaltenbach (2)
(1) Technische Fakultät, Universität Bielefeld
(2) NRW Int'l Graduate School in Bioinformatics and Genome Research, Universität Bielefeld
CPM 2005
Overview
- Mass Spectrometry in Proteomics
- Protein Identification via MS
- Alignment of Spectra
- Score Significance
- Conclusion
Proteins
Biology: Proteins are directed polymers of 20 different amino acids.
[Figure: an example protein drawn as a chain of amino acid letters]
Mathematics: Proteins are strings over an alphabet Σ.
Mass Spectrometry
Mass Spectrometry in Bioscience: Mass spectrometry measures the masses and quantities of the molecules in a sample. It is widely used in the biosciences to identify proteins and other biomolecules.
Fragmentation of peptides
Problem: Solely measuring the mass of a protein is not sufficient for identification.
[Figure: the intact protein chain with its spectrum, and the same chain broken into pieces with the spectrum of the pieces; both plots show abundance against mass]
Idea: Break up the protein into smaller pieces in a deterministic way. The spectrum of these pieces is called a fingerprint of the protein.
Peptide Mass Fingerprints
Enzymatic cleavage example: An enzyme cuts the amino acid sequence after each letter K.
[Figure: the amino acid chain before and after cleavage at each K]
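The cleavage rule can be sketched in a few lines of Python. The function name `cleave` and its default cleavage character are illustrative choices, not part of the slides; the input is the sequence used for the artificial spectrum.

```python
def cleave(sequence, cleavage_char="K"):
    """Cut `sequence` after every occurrence of `cleavage_char`."""
    fragments, current = [], ""
    for residue in sequence:
        current += residue
        if residue == cleavage_char:
            fragments.append(current)
            current = ""
    if current:  # trailing piece that does not end in the cleavage character
        fragments.append(current)
    return fragments


print(cleave("GTDSTNKDMKASTAKAKQIT"))
# ['GTDSTNK', 'DMK', 'ASTAK', 'AK', 'QIT']
```

The five pieces are the peptides annotated in the artificial spectrum.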
Peptide Mass Fingerprints
[Figure: Artificial spectrum of GTDSTNKDMKASTAKAKQIT; x-axis: mass (200 to 700), y-axis: relative abundance (0.70 to 1.00)]
Annotated peaks (peptide / mass):
AK / 199.3618
QIT / 343.3801
DMK / 374.4614
ASTAK / 458.5151
GTDSTNK / 703.7071
Real Mass Spectrum (PMF peaks annotated)
Processing the spectrum
Peak extraction: Spectra are summarized into peak lists, but extracting peaks is inherently difficult.
Problem: Peak lists are never correct
- Inaccurate calibration
- Sample contamination
- Peak detection
- ...
Identification
Protein Identification w/ PMF
- Isolate many copies of ONE protein
- Digest it into specific smaller fragments (Mass Fingerprint)
- Make a mass spectrum of these fragments
- Compare spectrum to all predicted mass spectra from DB
[Figure: workflow. Mass fingerprint via mass spectrometry -> peaklist. Mass fingerprint via in-silico fragmentation of the database sequences (AVKKPPTVHIIT..., KVVGTASILLYV..., VVNMTREEEASD..., QEVFGGTELLPP..., PLMKKRPHGTFD..., ..., KLMMMTGERDFG..., HILKMLVFDSAQ...) -> peaklist. Both peaklists -> peaklist comparison -> score + significance]
Comparing Two Peak Lists
Peaklists and Empty Peaks: Let $S_m$, $S_p$ be an extracted and a predicted peaklist. Let $\varepsilon$ denote a special gap peak.
Scoring Scheme: Each assignment between the two peak lists can be scored:
\[
\mathrm{score}(S_p, S_m) =
\underbrace{\sum_{\text{matched } i,j} \mathrm{score}(i, j)}_{\text{matched peaks}}
+ \underbrace{\sum_{\text{missing}} \mathrm{score}(i, \varepsilon)}_{\text{missing peaks}}
+ \underbrace{\sum_{\text{additional}} \mathrm{score}(\varepsilon, j)}_{\text{additional peaks}}
\]
Matching peaklists
Matching
- One-to-one peak matching
- Peak matchings should not cross
- Any peak must be matched either to a peak or to the gap peak
- Matching score mainly based on mass difference but can include other features
Best matching: Using such scoring schemes, the best peaklist matching can be computed using standard global alignment.
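A minimal sketch of such a global alignment over two sorted peak lists, in the style of Needleman-Wunsch. The function name, the three scoring callbacks and the numbers in the usage example are hypothetical illustrations, not the authors' scoring scheme.

```python
def align_peaklists(predicted, measured, match, gap_pred, gap_meas):
    """Best non-crossing one-to-one matching of two sorted peak lists.

    match(i, j)  scores a predicted mass i against a measured mass j,
    gap_pred(i)  scores a predicted peak matched to the gap (missing peak),
    gap_meas(j)  scores a measured peak matched to the gap (additional peak).
    Returns (score, alignment); None stands for the gap peak.
    """
    n, m = len(predicted), len(measured)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):
        D[a][0] = D[a - 1][0] + gap_pred(predicted[a - 1])
    for b in range(1, m + 1):
        D[0][b] = D[0][b - 1] + gap_meas(measured[b - 1])
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            p, q = predicted[a - 1], measured[b - 1]
            D[a][b] = max(D[a - 1][b - 1] + match(p, q),
                          D[a - 1][b] + gap_pred(p),
                          D[a][b - 1] + gap_meas(q))

    # Traceback: recover one optimal alignment from the filled table.
    alignment, a, b = [], n, m
    while a > 0 or b > 0:
        p = predicted[a - 1] if a > 0 else None
        q = measured[b - 1] if b > 0 else None
        if a > 0 and b > 0 and D[a][b] == D[a - 1][b - 1] + match(p, q):
            alignment.append((p, q))
            a, b = a - 1, b - 1
        elif a > 0 and D[a][b] == D[a - 1][b] + gap_pred(p):
            alignment.append((p, None))
            a -= 1
        else:
            alignment.append((None, q))
            b -= 1
    alignment.reverse()
    return D[n][m], alignment


# Hypothetical scoring: reward small mass differences, penalize gaps.
score, alignment = align_peaklists(
    [1000, 1235, 1700], [1000, 1230, 1500],
    match=lambda i, j: 1.0 - abs(i - j) / 10.0,
    gap_pred=lambda i: -0.5,
    gap_meas=lambda j: -0.5,
)
print(score)      # 0.5
print(alignment)  # [(1000, 1000), (1235, 1230), (None, 1500), (1700, None)]
```

Because only the three moves of global alignment are allowed, the matching is one-to-one and cannot cross, and every peak ends up paired with either a peak or the gap peak.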
Scoring scheme example: Peak counting
Peak counting score
\[
\mathrm{score}(i, j) =
\begin{cases}
1 & |\mathrm{mass}(i) - \mathrm{mass}(j)| \le \delta \\
0 & \text{else}
\end{cases}
\qquad
\mathrm{score}(i, \varepsilon) = \mathrm{score}(\varepsilon, j) = 0
\]
Example: $\delta = 10$, $S_m = \{1000, 1230, 1500\}$ and $S_p = \{1000, 1235, 1700\}$
Alignment
S_p   1000   1235   ε      1700
S_m   1000   1230   1500   ε
$\mathrm{score}(S_p, S_m) = (1 + 1) + 0 + 0 = 2$.
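The worked example can be checked with a short dynamic program. This is a sketch specialized to the peak counting score, where gaps cost nothing; the function name is illustrative.

```python
def peak_counting_score(predicted, measured, delta):
    """Score of the best alignment under the peak counting scheme."""
    m = len(measured)
    prev = [0] * (m + 1)  # gap scores are 0, so the first row is all zeros
    for p in predicted:
        cur = [0] * (m + 1)
        for b in range(1, m + 1):
            hit = 1 if abs(p - measured[b - 1]) <= delta else 0
            cur[b] = max(prev[b - 1] + hit,  # match p with measured[b - 1]
                         prev[b],            # p is a missing peak
                         cur[b - 1])         # measured[b - 1] is additional
        prev = cur
    return prev[m]


S_p = [1000, 1235, 1700]
S_m = [1000, 1230, 1500]
print(peak_counting_score(S_p, S_m, delta=10))  # 2
```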
Estimating the score distribution
Problem: The score distribution depends on
- Measured spectrum
- Sequence length
- Mass and probability of characters
Estimation techniques
- Different null-models: Sampling against spectra or sampling against sequences
- Sampling against sequences: Random or DB sequences both take a long time
- Estimation of moments: Works with certain classes of distributions
Score distribution
Claim: In most useful cases, the score distribution for fixed string length can be well approximated by a normal distribution and is then determined by expectation and variance. Missing and additional scores are usually very small compared to matches.
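Under the normal approximation, a p-value follows directly from the two moments. The sketch below assumes the expectation and variance have already been estimated; the function name and the numbers in the usage line are made up for illustration.

```python
import math


def normal_p_value(score, mean, variance):
    """P(X >= score) for X ~ Normal(mean, variance)."""
    z = (score - mean) / math.sqrt(variance)
    # Upper tail of the standard normal via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical moments of the null score distribution and an observed score.
print(normal_p_value(score=4.0, mean=1.2, variance=0.8))
```

A small p-value indicates that the observed score is unlikely to arise from a random sequence of the same length.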
Computing moments
Main Idea: Probability of a peak corresponds to probability of a fragment of same mass in peptide.
- Discretize masses by scaling and rounding
- Compute probability of fragment of length l with mass ≠ m
- Compute probability of string of length L to have no fragment of peak mass m
- Can all be done in preprocessing
- Estimate moments
- Compute p-value
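The discretization step can be sketched as follows. The scale factor of 10 (a resolution of 0.1 mass units) is an assumption chosen for illustration; the slides do not state the value used.

```python
def discretize(mass, scale=10):
    """Map a real-valued mass to an integer by scaling and rounding."""
    return int(round(mass * scale))


# Peak masses taken from the artificial spectrum of GTDSTNKDMKASTAKAKQIT.
for mass in (199.3618, 343.3801, 374.4614, 458.5151, 703.7071):
    print(mass, "->", discretize(mass))
```

Integer masses are what make the table-based computations of the following steps possible, since they can then be indexed by mass.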
Fragment probability
Weighted Alphabet: We call the tuple $(\Sigma, \mu)$ with mass function $\mu : \Sigma \to \mathbb{N}$ an (integer) weighted alphabet. Define $\mu(s) := \sum_{k=1}^{|s|} \mu(s_k)$.
Fragments: Let $x$ be the cleavage character and $\Sigma_x = \Sigma \setminus \{x\}$. The number of fragments of length $l$ with mass $m$ is then given by
\[
c[l, m] = \sum_{\sigma \in \Sigma_x,\ \mu(\sigma) \le m} c[l - 1, m - \mu(\sigma)]
\]
and for uniform character distribution we get the probability
\[
r[l, m] = 1 - \frac{c[l, m]}{|\Sigma_x|^{l}}
\]
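The recurrence for c[l, m] fills a table row by row. The sketch below assumes the base case c[0, 0] = 1 (the empty fragment has mass 0), which the slide does not state explicitly, and uses a hypothetical toy alphabet with two non-cleavage characters.

```python
def fragment_tables(masses, max_len, max_mass):
    """Return tables (c, r) for l = 0..max_len and m = 0..max_mass.

    masses: integer masses of the non-cleavage characters (Sigma_x).
    c[l][m] counts strings over Sigma_x of length l with mass m;
    r[l][m] = 1 - c[l][m] / |Sigma_x|**l under a uniform distribution.
    """
    k = len(masses)
    c = [[0] * (max_mass + 1) for _ in range(max_len + 1)]
    c[0][0] = 1  # assumed base case: the empty fragment has mass 0
    for l in range(1, max_len + 1):
        for m in range(max_mass + 1):
            c[l][m] = sum(c[l - 1][m - mu] for mu in masses if mu <= m)
    r = [[1.0 - c[l][m] / k ** l for m in range(max_mass + 1)]
         for l in range(max_len + 1)]
    return c, r


# Hypothetical toy alphabet: two non-cleavage characters of mass 1 and 2.
c, r = fragment_tables([1, 2], max_len=3, max_mass=6)
print(c[2][3], r[2][3])  # 2 0.5
```

Two of the four strings of length 2 have mass 3, so the probability that a fragment of length 2 does not have mass 3 is 0.5.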
Probability in Strings
Main idea: We compute prob. of string having NO fragment of mass m. Then the very first fragment must not have mass m and the following string must have no fragment of mass m. Iterate.
[Figure: the example string GTDSTNKDMKASTAKAKQIT, labelled p[L,m], with the 1st cleavage site at position l; the first fragment is labelled r[l-1,m-µ(K)] and the remaining suffix p[L-l,m]]
The prob. of $s \in \Sigma^L$ to have NO fragment of mass $m$ is given by
\[
\bar{p}[L, m] = r[L, m] \times P(\text{no cleavage at all})
+ \sum_{l=1}^{L} \underbrace{r[l - 1, m - \mu(x)] \times P(\text{first cleavage at } l)}_{\text{first frag.}} \times \underbrace{\bar{p}[L - l, m]}_{\text{suffix left}}
\]
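The recurrence can be evaluated bottom-up over the string length. The sketch below makes assumptions that go beyond the slide: characters are i.i.d. and uniform over the whole alphabet, so P(no cleavage at all) = (k/n)^L and P(first cleavage at l) = (k/n)^(l-1) / n with k = |Σ_x| and n = k + 1, and r is taken to be 1 for negative masses. All names and the toy alphabet are hypothetical.

```python
def no_fragment_prob(masses, cleave_mass, L, m):
    """Return [p_bar[0], ..., p_bar[L]] for a fixed peak mass m >= 1.

    masses:      integer masses of the non-cleavage characters (Sigma_x)
    cleave_mass: integer mass of the cleavage character x
    """
    k = len(masses)  # |Sigma_x|
    n = k + 1        # |Sigma|, including the cleavage character

    # c[l][w]: number of strings over Sigma_x of length l with mass w.
    c = [[0] * (m + 1) for _ in range(L + 1)]
    c[0][0] = 1
    for l in range(1, L + 1):
        for w in range(m + 1):
            c[l][w] = sum(c[l - 1][w - mu] for mu in masses if mu <= w)

    def r(l, w):
        """Probability that a fragment body of length l does not have mass w."""
        if w < 0:
            return 1.0  # no string has negative mass
        return 1.0 - c[l][w] / k ** l

    q = k / n  # probability that a single character is not x
    p_bar = [0.0] * (L + 1)
    for length in range(L + 1):
        total = r(length, m) * q ** length  # no cleavage at all
        for l in range(1, length + 1):      # first cleavage at position l
            first_frag = r(l - 1, m - cleave_mass) * q ** (l - 1) / n
            total += first_frag * p_bar[length - l]
        p_bar[length] = total
    return p_bar


# Toy alphabet: characters of mass 1 and 2, cleavage character of mass 3.
p_bar = no_fragment_prob([1, 2], cleave_mass=3, L=4, m=3)
print(p_bar[1], p_bar[2])  # approximately 0.667 and 0.444
```

For length 2 the value can be checked by hand: of the nine strings over the toy alphabet, only four contain no fragment of mass 3, which gives 4/9.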
Expected match score of a peak
[Figure: score distribution over m/z [Da] with a score threshold; extracted peaks M1, M2, M3 with their supports U1, U2, U3]
The expected value of extracted peak $j$ with support $U_j$ is
\[
E(\mathrm{matchscore}(j)) = \sum_{m \in U_j} p[L, m] \times \mathrm{score}(\mathrm{mass}(j), m)
\]
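The expectation is a weighted sum over the support of the peak. In the sketch below, p[L, m] is read as the probability that a string of length L contains a fragment of mass m; this reading, the function name and the probabilities in the usage example are assumptions for illustration.

```python
def expected_match_score(peak_mass, support, p_fragment, score):
    """Sum of p[L, m] * score(mass(j), m) over all masses m in the support.

    p_fragment maps an integer mass m to p[L, m]; masses that are absent
    from the mapping are treated as having probability 0.
    """
    return sum(p_fragment.get(m, 0.0) * score(peak_mass, m) for m in support)


def peak_counting(a, b, delta=2):
    """Peak counting score with a hypothetical tolerance of 2 mass units."""
    return 1 if abs(a - b) <= delta else 0


# Made-up fragment probabilities for a peak at mass 1000.
p_fragment = {998: 0.1, 1000: 0.2, 1003: 0.05}
print(expected_match_score(1000, range(995, 1006), p_fragment, peak_counting))
```

Only the masses 998 and 1000 lie within the tolerance, so the result is approximately 0.3.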
Conclusion
Main features
- Scoring schemes allow very flexible identification routines
- Computation of significance is database independent
- Extension to other cleavage schemes possible
- Extension to nonuniform alphabets and to isotope masses straightforward