
BIOINFORMATICS APPLICATIONS NOTE

Vol. 21 no. 12 2005, pages 2912–2913 doi:10.1093/bioinformatics/bti434

Sequence analysis

Visualizing profile–profile alignment: pairwise HMM logos

Benjamin Schuster-Böckler∗ and Alex Bateman

The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK

Received on February 8, 2005; revised on March 29, 2005; accepted on March 31, 2005
Advance Access publication April 12, 2005

ABSTRACT

Summary: The availability of advanced profile–profile comparison tools, such as PRC or HHsearch, demands sophisticated visualization tools that are not presently available. We introduce an approach built upon the concept of HMM logos. The method illustrates the similarities between pairs of protein family profiles in an intuitive way. Two HMM logos, one for each profile, are drawn one above the other. The aligned states are then highlighted and connected.

Availability: A web interface for creating pairwise HMM logos online is available at http://www.sanger.ac.uk/Software/analysis/logomat-p. Furthermore, software developers may download a Perl package that includes methods for creating pairwise HMM logos locally.

Contact: [email protected]

INTRODUCTION

The problem of profile–profile comparison has a long history and has recently received renewed attention (Söding, 2005; Lyngsø et al., 1999; Madera, 2005; Edgar and Sjölander, 2004a). This is a result of the growing number of well characterized protein families in databases, such as Pfam (Bateman et al., 2004). Because they incorporate information about the properties of the entire family, profile–profile methods have been shown to significantly increase sensitivity compared with profile–sequence comparison (Edgar and Sjölander, 2004b). Several different concepts for profile–profile comparison have been reported. We focus on the visualization of HMM–HMM alignments.

The algorithms behind all currently available HMM alignment programs are very similar. Newer approaches differ mainly in the details of the scoring function and in the transitions that are taken into account. The approach is to find a sequence of state-to-state pairings that maximizes the probability of both HMMs emitting the same sequence (frequently called the co-emission probability). This can be done efficiently by creating a pair HMM (Durbin et al., 1998; Söding, 2005) from the two source HMMs and using the standard forward or Viterbi algorithm to search for an optimal solution.

Nevertheless, the raw output of the alignment tools can be difficult to understand. From the state-to-state pairings alone, it is not immediately obvious which features the two protein families have in common. Our aim was to develop a graphical representation of HMM–HMM alignments that resolves this issue.
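As an illustration of the state-to-state pairing idea described above, the following Python sketch aligns the match states of two toy profiles. It is not the algorithm used by PRC or HHsearch: transitions are replaced by a fixed gap penalty (an assumption made for brevity), insert and delete states are ignored, and the names column_score, align_states and gap are hypothetical. The column score is the log of the background-normalized co-emission sum, a form in the spirit of the scoring described by Söding (2005).

```python
import math

def column_score(p, q, background):
    """Log-odds (bits) that two states emit the same residue."""
    co_emission = sum(p[a] * q[a] / background[a] for a in background)
    return math.log2(co_emission)

def align_states(profile_a, profile_b, background, gap=1.0):
    """Local Viterbi-style alignment of match states.

    Returns (best_score, pairs), where pairs holds 1-based (i, j) state pairings.
    The fixed gap penalty stands in for real transition probabilities (assumption).
    """
    n, m = len(profile_a), len(profile_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = column_score(profile_a[i - 1], profile_b[j - 1], background)
            candidates = [
                (0.0, None),                              # start a new local alignment
                (score[i - 1][j - 1] + s, (i - 1, j - 1)),  # pair state i with state j
                (score[i - 1][j] - gap, (i - 1, j)),        # skip a state in profile_a
                (score[i][j - 1] - gap, (i, j - 1)),        # skip a state in profile_b
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the best cell, keeping only diagonal (paired) steps.
    pairs = []
    cell = best_cell
    while cell is not None and back[cell[0]][cell[1]] is not None:
        i, j = cell
        prev = back[i][j]
        if prev == (i - 1, j - 1):
            pairs.append((i, j))
        cell = prev
    pairs.reverse()
    return best, pairs

def peaked(letter, alphabet="ACDE", high=0.85):
    """Toy emission distribution concentrated on one residue."""
    low = (1.0 - high) / (len(alphabet) - 1)
    return {a: (high if a == letter else low) for a in alphabet}

background = {a: 0.25 for a in "ACDE"}
profile_a = [peaked(x) for x in "ACD"]
profile_b = [peaked(x) for x in "CDE"]
best, pairs = align_states(profile_a, profile_b, background)
print(pairs)
```

In this toy case the two profiles share the states peaked on C and D, so these are the states a pairwise logo would frame and connect.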

∗To whom correspondence should be addressed.


FEATURES

Pairwise HMM Logos can currently be accessed in two different ways. First, they can be created online at http://www.sanger.ac.uk/Software/analysis/logomat-p. Second, they can be constructed locally by downloading and installing the Perl sources. In the near future, pairwise HMM Logos will also be added to the Pfam website.

A typical pairwise HMM Logo is shown in Figure 1. We designed pairwise HMM Logos to look as similar to HMM Logos as possible, which should make them easier to understand for users accustomed to HMM Logos. We therefore draw two HMM Logos, one for each aligned family. Individual aligned states are framed and connected by a block. Unaligned states are shaded in grey. In a local alignment, positions before the first and after the last aligned state are not shown. A brief summary of the features of simple HMM logos is given in the caption to Figure 1. A more detailed description can be found in Schuster-Böckler et al. (2004).

In our previous work (Schuster-Böckler et al., 2004), we introduced the HMM Perl package. It provides generalized methods to access and modify HMMs. Emission and transition probabilities are stored and retrieved as multidimensional matrices using PDL, the Perl Data Language. HMMER files can be parsed and written. The package also allows the creation of HMM logos from profile HMMs. To this existing framework we added a class called HMM::Alignment, which works as an abstraction layer for the HMM alignment program PRC (Madera, 2005, http://supfam.mrc-lmb.cam.ac.uk/PRC/). It can parse and write PRC output and can run PRC directly if it is installed on the system. As it integrates into the HMM package, it takes HMM::Profile objects, HMMER files, Pfam IDs or combinations thereof as arguments for creating alignment objects.
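The drawing rules described above (aligned states framed, unaligned states grey, flanking states omitted in a local alignment) can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not the code of the HMM Perl package; the function name classify_states and the labels "framed", "grey" and "hidden" are hypothetical.

```python
def classify_states(n_states, aligned, local=True):
    """Assign a display label to each state 1..n_states of one HMM.

    aligned: set of state positions of this HMM that appear in the alignment.
    local:   if True, states outside the aligned region are not drawn.
    """
    first = min(aligned) if aligned else None
    last = max(aligned) if aligned else None
    labels = {}
    for pos in range(1, n_states + 1):
        if pos in aligned:
            labels[pos] = "framed"   # aligned: framed and connected by a block
        elif local and (first is None or pos < first or pos > last):
            labels[pos] = "hidden"   # outside the local alignment: not shown
        else:
            labels[pos] = "grey"     # unaligned state inside the shown region
    return labels

# Hypothetical state pairings (state in first HMM, state in second HMM).
pairs = [(2, 1), (3, 2), (5, 3)]
aligned_first = {i for i, _ in pairs}
aligned_second = {j for _, j in pairs}

print(classify_states(6, aligned_first))
print(classify_states(4, aligned_second))
```

For the first HMM, state 4 lies between aligned states and is therefore shaded grey, while states 1 and 6 fall outside the local alignment and are omitted.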

REQUIREMENTS

On-the-fly creation of pairwise HMM Logos from HMMER files, multiple sequence alignments or Pfam IDs is available from the website http://www.sanger.ac.uk/Software/analysis/logomat-p. Uploaded HMMs are aligned directly using PRC. Multiple alignments in ClustalW, MSF or SELEX format are first converted to HMMs using HMMER, and the resulting HMMs are then aligned. The plain PRC output can be downloaded separately.

Local installation of the HMM Perl package requires the PDL and Imager packages to be installed on the system, together with a working PRC binary. Both Perl packages can be downloaded from http://www.cpan.org. PRC is available from http://supfam.mrc-lmb.cam.ac.uk/PRC/. This software was tested against PRC version 1.5.2.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]


Fig. 1. Alignment of the Toxin_7 Pfam family against the Toxin_9 Pfam family. For each family, an HMM logo is drawn. The numbers above and below each logo show state positions in the HMM. The overall height of each letter stack represents the information content; the relative height of each letter corresponds to its emission probability. The column width denotes the relative contribution of the state, i.e. the product of the probability that the state is traversed and the expected number of self transitions for the respective state. This accounts for the varying length of insertions. Insert states are drawn in red. Frequently, their relative contribution is very small, making them hard to see. In this figure, narrow insert states can be found, e.g. at positions 27 and 28 of the Toxin_7 family. The aligned states in each HMM are framed and connected by a block. Omitted states are shaded in grey.
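The quantities named in the caption can be made concrete with a small Python sketch. Several choices here are assumptions rather than statements about the HMM Perl package: information content is computed as relative entropy in bits against a background that defaults to uniform, and the "expected number of self transitions" is read as the expected number of visits 1/(1 - t) for a self-transition probability t, so that a match state (t = 0) contributes a factor of 1. All function names are hypothetical.

```python
import math

def information_content(emissions, background=None):
    """Relative entropy (bits) of an emission distribution against a background."""
    if background is None:
        # Assumption: uniform background over the alphabet.
        background = {a: 1.0 / len(emissions) for a in emissions}
    return sum(p * math.log2(p / background[a])
               for a, p in emissions.items() if p > 0)

def letter_heights(emissions, background=None):
    """Each letter takes a share of the stack height equal to its emission probability."""
    total = information_content(emissions, background)
    return {a: p * total for a, p in emissions.items()}

def column_width(hit_probability, self_transition=0.0):
    """Relative contribution: P(state is traversed) times expected number of visits."""
    expected_visits = 1.0 / (1.0 - self_transition)
    return hit_probability * expected_visits

emissions = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
print(information_content(emissions))   # stack height in bits
print(letter_heights(emissions))        # per-letter heights
print(column_width(0.1, 0.5))           # a rarely used insert state stays narrow
```

With these definitions an insert state that is seldom entered remains narrow even when its self-transition probability is high, which matches the caption's remark that insert states are often hard to see.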

ACKNOWLEDGEMENTS

We would like to thank Martin Madera and Robert Finn for valuable information about theoretical and practical aspects of PRC. Johannes Söding kindly answered numerous questions about his HHsearch algorithm. The authors are grateful for the valuable suggestions and corrections made by the reviewers. B.S.-B. is funded by the Wellcome Trust.

REFERENCES

Bateman,A. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge, UK.
Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
Eddy,S.R. (2001) HMMER User’s Guide: Biological Sequence Analysis Using Profile Hidden Markov Models, Version 2.2. Washington University School of Medicine, http://hmmer.wustl.edu.
Edgar,R.C. and Sjölander,K. (2004a) COACH: profile–profile alignment of protein families using hidden Markov models. Bioinformatics, 20, 1309–1318.
Edgar,R.C. and Sjölander,K. (2004b) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20, 1301–1308.
Lyngsø,R. et al. (1999) Metrics and similarity measures for hidden Markov models. Proc. Int. Conf. Intell. Syst. Mol. Biol., 1999, 178–186.
Madera,M. (2005) PRC—the profile comparer.
Schneider,T.D. and Stephens,R. (1990) Sequence logos: A new way to display consensus sequences. Nucleic Acids Res., 18, 6097–6100.
Schuster-Böckler,B., Schultz,J. and Rahmann,S. (2004) HMM Logos for visualization of protein families. BMC Bioinformatics, 5, 7.
Söding,J. (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics, 21, 951–960.
