Mass Spectra Alignments and their Significance

Sebastian Böcker (1), Hans-Michael Kaltenbach (2)

(1) Technische Fakultät, Universität Bielefeld
(2) NRW Int'l Graduate School in Bioinformatics and Genome Research, Universität Bielefeld

Böcker, Kaltenbach

Mass Spectra Alignments

CPM 2005

Overview

- Mass Spectrometry in Proteomics
- Protein Identification via MS
- Alignment of Spectra
- Score Significance
- Conclusion


Proteins

Biology: Proteins are directed polymers of 20 different amino acids.

[Figure: a protein drawn as a chain of amino acid letters; the letters are those of the example sequence GTDSTNKDMKASTAKAKQIT]

Mathematics: Proteins are strings over an alphabet Σ.


Mass Spectrometry

Mass Spectrometry in Bioscience: Mass spectrometry measures the masses and quantities of the molecules in a sample. It is widely used in the biosciences to identify proteins and other biomolecules.


Fragmentation of peptides

Problem: Measuring only the mass of a protein is not sufficient for identification.

[Figure: two mass spectra (abundance vs. mass), one of the intact protein and one of its fragments]

Idea: Break up the protein into smaller pieces in a deterministic way. The spectrum of these pieces is called a fingerprint of the protein.


Peptide Mass Fingerprints

Enzymatic cleavage example: An enzyme cuts the amino acid sequence after each letter K.

[Figure: the example sequence split after each K]
GTDSTNK | DMK | ASTAK | AK | QIT
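The cleavage rule from this slide can be sketched in a few lines of Python. This is only an illustration: the function name `cleave` and its parameters are hypothetical, and real enzymes follow more detailed rules than "cut after every K".

```python
from typing import List


def cleave(sequence: str, cleavage_char: str = "K") -> List[str]:
    """Split a sequence after every occurrence of the cleavage character."""
    fragments: List[str] = []
    current: List[str] = []
    for residue in sequence:
        current.append(residue)
        if residue == cleavage_char:
            # The cleavage character stays at the end of its fragment.
            fragments.append("".join(current))
            current = []
    if current:
        # Trailing fragment that does not end in the cleavage character.
        fragments.append("".join(current))
    return fragments


print(cleave("GTDSTNKDMKASTAKAKQIT"))
# ['GTDSTNK', 'DMK', 'ASTAK', 'AK', 'QIT']
```

The five fragments are the ones annotated in the artificial spectrum of the same sequence.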

Peptide Mass Fingerprints

[Figure: Artificial spectrum of GTDSTNKDMKASTAKAKQIT; x-axis: Mass (200 to 700), y-axis: Rel. Abundance (0.70 to 1.00)]

Annotated peaks (fragment / mass):
AK / 199.3618
QIT / 343.3801
DMK / 374.4614
ASTAK / 458.5151
GTDSTNK / 703.7071

Real Mass Spectrum (PMF peaks annotated)


Processing the spectrum

Peak extraction: Spectra are summarized into peak lists, but extracting peaks is inherently difficult.

Problem: Peak lists are never correct
- Inaccurate calibration
- Sample contamination
- Peak detection
- ...


Identification

Protein Identification with PMF
- Isolate many copies of ONE protein
- Digest it into specific smaller fragments (Mass Fingerprint)
- Make a mass spectrum of these fragments
- Compare the spectrum to all predicted mass spectra from the database

[Figure: workflow diagram. The mass fingerprint obtained via mass spectrometry and the mass fingerprints obtained via in-silico fragmentation of the database sequences (AVKKPPTVHIIT..., KVVGTASILLYV..., VVNMTREEEASD..., QEVFGGTELLPP..., PLMKKRPHGTFD..., ..., KLMMMTGERDFG..., HILKMLVFDSAQ...) are turned into peak lists; the peak list comparison yields Score + Significance.]


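Comparing Two Peak Lists

Peaklists and Empty Peaks: Let $S_m$, $S_p$ be an extracted and a predicted peaklist. Let $\varepsilon$ denote a special gap peak.

Scoring Scheme: Each assignment between the two peak lists can be scored:

\[
\operatorname{score}(S_p, S_m) =
\underbrace{\sum_{\text{matched } i,j} \operatorname{score}(i, j)}_{\text{matched peaks}}
+ \underbrace{\sum_{\text{missing } i} \operatorname{score}(i, \varepsilon)}_{\text{missing peaks}}
+ \underbrace{\sum_{\text{additional } j} \operatorname{score}(\varepsilon, j)}_{\text{additional peaks}}
\]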


Matching peaklists

Matching
- One-to-one peak matching
- Peak matchings should not cross
- Any peak must be matched either to a peak or to the gap peak
- Matching score mainly based on mass difference but can include other features

Best matching: Using such scoring schemes, the best peaklist matching can be computed using standard global alignment.


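Scoring scheme example: Peak counting

Peak counting score:

\[
\operatorname{score}(i, j) =
\begin{cases}
1 & |\operatorname{mass}(i) - \operatorname{mass}(j)| \le \delta \\
0 & \text{else}
\end{cases}
\qquad
\operatorname{score}(i, \varepsilon) = \operatorname{score}(\varepsilon, j) = 0
\]

Example: $\delta = 10$, $S_m = \{1000, 1230, 1500\}$ and $S_p = \{1000, 1235, 1700\}$

Alignment:
S_p   1000   1235   ε      1700
S_m   1000   1230   1500   ε

$\operatorname{score}(S_p, S_m) = (1 + 1) + 0 + 0 = 2$.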
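The best non-crossing, one-to-one matching under such a scoring scheme can be found with the usual global alignment recurrence. The following is a minimal sketch of that idea, not the authors' implementation: the names `align_peaklists`, `peak_counting_score` and `GAP` are hypothetical, and peaks are represented by their masses only.

```python
from typing import Callable, List, Optional, Sequence, Tuple

GAP = None  # stands in for the gap peak epsilon
Peak = Optional[float]


def peak_counting_score(delta: float) -> Callable[[Peak, Peak], float]:
    """Peak counting: 1 for two masses within delta, 0 otherwise and for gaps."""
    def score(i: Peak, j: Peak) -> float:
        if i is GAP or j is GAP:
            return 0.0
        return 1.0 if abs(i - j) <= delta else 0.0
    return score


def align_peaklists(predicted: Sequence[float], measured: Sequence[float],
                    score: Callable[[Peak, Peak], float]
                    ) -> Tuple[float, List[Tuple[Peak, Peak]]]:
    n, m = len(predicted), len(measured)
    # best[a][b]: best score for the first a predicted and first b measured peaks
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):
        best[a][0] = best[a - 1][0] + score(predicted[a - 1], GAP)
    for b in range(1, m + 1):
        best[0][b] = best[0][b - 1] + score(GAP, measured[b - 1])
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            best[a][b] = max(
                best[a - 1][b - 1] + score(predicted[a - 1], measured[b - 1]),
                best[a - 1][b] + score(predicted[a - 1], GAP),   # missing peak
                best[a][b - 1] + score(GAP, measured[b - 1]),    # additional peak
            )
    # Traceback; ties are resolved in favour of gaps, then matches.
    pairs: List[Tuple[Peak, Peak]] = []
    a, b = n, m
    while a > 0 or b > 0:
        if a > 0 and best[a][b] == best[a - 1][b] + score(predicted[a - 1], GAP):
            pairs.append((predicted[a - 1], GAP))
            a -= 1
        elif b > 0 and best[a][b] == best[a][b - 1] + score(GAP, measured[b - 1]):
            pairs.append((GAP, measured[b - 1]))
            b -= 1
        else:
            pairs.append((predicted[a - 1], measured[b - 1]))
            a -= 1
            b -= 1
    pairs.reverse()
    return best[n][m], pairs


total, alignment = align_peaklists([1000, 1235, 1700], [1000, 1230, 1500],
                                   peak_counting_score(10))
print(total)      # 2.0
print(alignment)  # [(1000, 1000), (1235, 1230), (None, 1500), (1700, None)]
```

With the peak lists from the example, the sketch returns the score 2 and the alignment shown on the slide. Other alignments reach the same score, so the tie-breaking order in the traceback is a design choice.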


Estimating the score distribution

Problem: The score distribution depends on
- Measured spectrum
- Sequence length
- Mass and probability of characters

Estimation techniques
- Different null-models: sampling against spectra or sampling against sequences
- Sampling against sequences: random or DB sequences both take a long time
- Estimation of moments: works with certain classes of distributions


Score distribution

Claim: In most useful cases, the score distribution for fixed string length can be well approximated by a normal distribution and is then determined by expectation and variance. Missing and additional scores are usually very small compared to matches.
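Under the normal approximation, a p-value follows directly from the expectation and the variance. A minimal sketch, assuming the p-value is the upper tail probability of observing a score at least as large as the given one; the function name `normal_p_value` is hypothetical.

```python
import math


def normal_p_value(score: float, mean: float, variance: float) -> float:
    """Upper-tail probability P(X >= score) for X ~ Normal(mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (score - mean) / math.sqrt(variance)
    # The complementary error function gives the normal upper tail.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# A score two standard deviations above the mean:
print(normal_p_value(score=9.0, mean=5.0, variance=4.0))
```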


Computing moments

Main Idea: The probability of a peak corresponds to the probability of a fragment of the same mass in the peptide.
- Discretize masses by scaling and rounding
- Compute the probability of a fragment of length l with mass ≠ m
- Compute the probability of a string of length L to have no fragment of peak mass m
- Can all be done in preprocessing
- Estimate moments
- Compute p-value
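The first step, discretizing masses, could look as follows. The scale factor of 10 (a resolution of 0.1 mass units) and the name `discretize` are assumptions made for illustration; the slides do not state a value.

```python
def discretize(mass: float, scale: float = 10.0) -> int:
    """Map a real-valued mass to an integer by scaling and rounding."""
    return int(round(mass * scale))


# Peak masses taken from the artificial spectrum of GTDSTNKDMKASTAKAKQIT.
for mass in (199.3618, 343.3801, 374.4614, 458.5151, 703.7071):
    print(mass, "->", discretize(mass))
```

A larger scale factor gives finer mass resolution at the cost of larger tables in the later steps.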


Fragment probability

Weighted Alphabet: We call the tuple $(\Sigma, \mu)$ with mass function $\mu : \Sigma \to \mathbb{N}$ an (integer) weighted alphabet. Define $\mu(s) := \sum_{k=1}^{|s|} \mu(s_k)$.

Fragments: Let $x$ be the cleavage character and $\Sigma_x = \Sigma \setminus \{x\}$. The number of fragments of length $l$ with mass $m$ is then given by

\[
c[l, m] = \sum_{\sigma \in \Sigma_x,\ \mu(\sigma) \le m} c[l - 1, m - \mu(\sigma)]
\]

and for uniform character distribution we get the probability that a fragment of length $l$ does not have mass $m$:

\[
r[l, m] = 1 - \frac{c[l, m]}{|\Sigma_x|^{l}}
\]
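The recurrence for the fragment counts can be tabulated bottom-up. The sketch below makes two assumptions that the slide does not spell out: the base case is the empty string with mass 0, and the toy alphabet with its integer masses is purely hypothetical. The function names are illustrative.

```python
from typing import Dict, List


def fragment_counts(masses: Dict[str, int], cleavage_char: str,
                    max_len: int, max_mass: int) -> List[List[int]]:
    """c[l][m]: number of strings over Sigma_x with length l and mass m."""
    sigma_x = [mu for ch, mu in masses.items() if ch != cleavage_char]
    c = [[0] * (max_mass + 1) for _ in range(max_len + 1)]
    c[0][0] = 1  # assumed base case: the empty string has mass 0
    for l in range(1, max_len + 1):
        for m in range(max_mass + 1):
            c[l][m] = sum(c[l - 1][m - mu] for mu in sigma_x if mu <= m)
    return c


def fragment_miss_probability(c: List[List[int]], sigma_x_size: int,
                              l: int, m: int) -> float:
    """r[l, m] = 1 - c[l, m] / |Sigma_x|^l under uniform characters."""
    if m < 0:
        return 1.0  # no string has negative mass
    return 1.0 - c[l][m] / sigma_x_size ** l


toy_masses = {"A": 2, "B": 3, "K": 5}  # hypothetical integer masses
c = fragment_counts(toy_masses, cleavage_char="K", max_len=3, max_mass=10)
print(c[2][5])                                  # 2, the strings AB and BA
print(fragment_miss_probability(c, 2, 2, 5))    # 0.5
```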


Probability in Strings

Main idea: We compute the probability of a string having NO fragment of mass m. Then the very first fragment must not have mass m, and the following string must have no fragment of mass m. Iterate.

[Figure: the example string GTDSTNKDMKASTAKAKQIT with the first cleavage site at position l; the first fragment contributes r[l−1, m−µ(K)], the remaining suffix contributes p̄[L−l, m], and together they give p̄[L, m]]

The probability of $s \in \Sigma^L$ having NO fragment of mass $m$ is given by

\[
\bar{p}[L, m] = r[L, m] \times P(\text{no cleavage at all})
+ \sum_{l=1}^{L} \underbrace{r[l - 1, m - \mu(x)]}_{\text{first frag.}} \times P(\text{first cleavage at } l) \times \underbrace{\bar{p}[L - l, m]}_{\text{suffix left}}
\]
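The recurrence can be checked numerically on a toy alphabet. The sketch below assumes independent, uniformly distributed characters over Σ, so that P(no cleavage at all) = (|Σ_x|/|Σ|)^L and P(first cleavage at l) = (|Σ_x|/|Σ|)^(l−1) / |Σ|. These two expressions, the toy masses and all names are assumptions made for illustration. A brute-force enumeration over all strings serves as a cross-check.

```python
from functools import lru_cache
from itertools import product

MASSES = {"A": 2, "B": 3, "K": 5}   # hypothetical integer masses
CLEAVE = "K"                        # cleavage character x
SIGMA_X = [ch for ch in MASSES if ch != CLEAVE]
N, NX = len(MASSES), len(SIGMA_X)


@lru_cache(maxsize=None)
def count(l: int, m: int) -> int:
    """c[l, m]: strings over Sigma_x of length l and mass m."""
    if m < 0:
        return 0
    if l == 0:
        return 1 if m == 0 else 0
    return sum(count(l - 1, m - MASSES[ch]) for ch in SIGMA_X)


def r(l: int, m: int) -> float:
    """Probability that a fragment of length l does not have mass m."""
    return 1.0 - count(l, m) / NX ** l


@lru_cache(maxsize=None)
def p_bar(L: int, m: int) -> float:
    """Probability that a random string of length L has no fragment of mass m."""
    total = r(L, m) * (NX / N) ** L                  # no cleavage at all
    for l in range(1, L + 1):
        first_cleavage_at_l = (NX / N) ** (l - 1) / N
        total += r(l - 1, m - MASSES[CLEAVE]) * first_cleavage_at_l * p_bar(L - l, m)
    return total


def p_bar_brute_force(L: int, m: int) -> float:
    """Same probability by enumerating all |Sigma|^L strings."""
    without = 0
    for s in product(MASSES, repeat=L):
        frag_masses, current, is_open = [], 0, False
        for ch in s:
            current += MASSES[ch]
            is_open = True
            if ch == CLEAVE:
                frag_masses.append(current)
                current, is_open = 0, False
        if is_open:
            frag_masses.append(current)
        if m not in frag_masses:
            without += 1
    return without / N ** L


print(p_bar(4, 7), p_bar_brute_force(4, 7))
```

The enumeration is exponential in L and only useful as a sanity check; the recurrence needs a number of steps quadratic in L for each mass.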

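Expected match score of a peak

[Figure: score distribution over m/z [Da] with a score threshold; extracted peaks M1, M2, M3 with supports U1, U2, U3]

The expected match score of an extracted peak $j$ with support $U_j$ is

\[
E(\operatorname{matchscore}(j)) = \sum_{m \in U_j} p[L, m] \times \operatorname{score}(\operatorname{mass}(j), m)
\]

Here $p[L, m]$ presumably denotes the probability that a string of length $L$ has a fragment of mass $m$, i.e. $1 - \bar{p}[L, m]$.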
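The sum over the support is easy to express in code. In this sketch the occurrence probabilities p[L, m] are passed in as a precomputed mapping, and the example values and the scoring function are made up for illustration; none of the names come from the slides.

```python
from typing import Callable, Iterable, Mapping


def expected_match_score(peak_mass: int,
                         support: Iterable[int],
                         occurrence_prob: Mapping[int, float],
                         score: Callable[[int, int], float]) -> float:
    """Sum of p[L, m] * score(mass(j), m) over all masses m in the support U_j."""
    return sum(occurrence_prob[m] * score(peak_mass, m) for m in support)


# Hypothetical values: discretized masses and their occurrence probabilities.
p_values = {99: 0.1, 100: 0.2, 101: 0.1}


def within_one(a: int, b: int) -> float:
    # Peak-counting style score with a tolerance of one discretized unit.
    return 1.0 if abs(a - b) <= 1 else 0.0


print(expected_match_score(100, [99, 100, 101], p_values, within_one))
```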


Conclusion

Main features
- Scoring schemes allow very flexible identification routines
- Computation of significance is database independent
- Extension to other cleavage schemes possible
- Extension to nonuniform alphabets and to isotope masses straightforward
