2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 1 Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework** Consortium leader PETER PAZMANY CATHOLIC UNIVERSITY Consortium members SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund *** **Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben ***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg. PETER PAZMANY CATHOLIC UNIVERSITY SEMMELWEIS UNIVERSITY sote_logo.jpg dk_fejlec.gif INFOBLOKK 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 2 Peter Pazmany Catholic University Faculty of Information Technology INTRODUCTION TO BIOINFORMATICS CHAPTER 3 Sequence Alignment Algorithms www.itk.ppke.hu (BEVEZETÉS A BIOINFORMATIKÁBA ) (Szekvencia illesztési algoritmusok) András Budinszky Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 3 Sequence Alignment It is a process of comparing two (pairwise sequence alignment) or more (multiple sequence alignment) DNA or protein sequences. The sequences are arranged to discover similarities that could show a functional, structural or evolutionary relationships between the sequences. Similarity means a degree of match at corresponding positions of the sequences. Similarity is usually a consequence of homology but it could occur by chance (when comparing short sequences). www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 4 Types of Pairwise Alignment Global Alignment: It attempts to align every position in the entire sequences and determine the measure of their similarity from end to end in each sequence. It is usually used with sequences of approximately the same length. Local Alignment: It attempts to align sections of the sequences (“islands”, conserved regions) with significant similarity. It can be used for sequences of quite different length. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 5 Methods for Pairwise Alignment -Dot plot (matrix) analysis -Dynamic programming algorithm -Word or k-tuple methods Note: Each method has its strengths and weaknesses, and all three pairwise methods have difficulty with highly repetitive sequences, especially if the number of repetitions differ. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 6 Dot Plot (matrix) Analysis It is a graphical method. It is the simplest one and should be the primary method considered for pairwise sequence alignment. It creates a two-dimensional matrix. One of the sequences is written along the top row and the other along the leftmost column of the matrix. A dot is placed at any point where the characters match and the rest of the points are left blank. Matching sections of the sequences are shown as diagonals of dots. It works best if it uses value thresholds. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 7 Dot Matrix Programs A number of them available: DOTTER (http://sonnhammer.sbc.su.se/Dotter.html) with interactive features COMPARE and DOTPOLT (Genetics Computer Group) EMBOSS suite (http://emboss.sourceforge.net/): -dotmatcher (align sequences using a scoring matrix) -dottup (finds common words in sequences) -dotplot (finds common patterns in sequences) www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 8 Finding sequence repeats A special use of dot matrix: aligning a sequence with itself: The main diagonal shows the alignment with itself. Other lines show repetitive patterns within the sequence. www.itk.ppke.hu 200px-Zinc-finger-dot-plot.png Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 9 Biological Background Evolutionarily related DNA or protein sequences have mutations: -substitutions -insertions or deletions. When aligning sequences we can allow: -mismatch (corresponding to substitution) -gap insertion (corresponding to insertion or deletion) The second one is called indels. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 10 Measures for Sequence Similarities Hamming distance: number of positions differ in the two strings. Note: It is not to useful to compare DNA or protein sequences because it considers only substitution mutations. Levenshtein distance: minimum number of editing operations needed to transform one sequence into the other, where the editing operations are insertion, deletion and substitution. Note: A given editing sequence corresponds to a unique pairwise alignment, but the reverse is not true. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 11 Example for Measures s1 = TATAT s2 = ATATA Hamming distance = 5 Levenshtein distance = 2 (step 1: insert an ‘A’ in front of s1 step 2: delete the ‘T’ at the end of s1) www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 12 Intro to Dynamic Programming Solution Construct a grid where characters of one sequence index the rows, and characters of the other index the columns. Any path through the grid from the top left to the bottom right corner corresponds to an alignment. Each segment in a path corresponds -an indel (if its direction is down or side-way) -a match or a substitution (if its direction is diagonal). We need to find the “optimal” path assuming that each segment has an associated cost. Related problem: Manhattan Tourist Problem (MTP). www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 13 Find a path with the most number of attractions (*) in the Manhattan grid going from an upper West-side corner (Start) to a lower East-side corner (Finish) of Manhattan and traveling only eastward and southward. * * * * * * * * * * Start * Finish * www.itk.ppke.hu Manhattan Tourist Problem (MTP) Introduction to bioinformatics: Sequence Alignment Algorithms MTP: Exhaustive (Brute Force) Solution Generate ALL possible paths in the grid. Output the best path as solution. Guaranteed to find optimal solution. It is tractable if graph is not large. Not feasible for even a moderately sized graph. www.itk.ppke.hu 2010.11.15 14 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 Introduction to bioinformatics: Sequence Alignment Algorithms At every vertex, choose the adjacent edge with the highest weight. Easily achievable in polynomial time, but is unlikely to give the optimal solution, especially for larger graphs! 3 4 3 3 1 2 1 2 2 3 2 6 1 1 7 4 5 1 7 3 3 9 3 2 Start Finish 12v 28 MTP: A Greedy Solution www.itk.ppke.hu 2010.11.15 15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 16 MTP: Dynamic Programming Solution Store at each vertex the “value” of the optimal path (si,j ) leading to that vertex. Initialize s0,0to 0. Now computing values in 1strow and 1stcolumn (s0,j and si,0for all i and j) is easy. Finally we can compute the rest of the si,jvalues as www.itk.ppke.hu si-1,j + weight of edge between (i-1,j) and (i,j) si,j-1 + weight of edge between (i,j-1) and (i,j) si,j= max Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 17 Generalized MTP How to handle diagonal streets of Manhattan (like e.g. Broadway)? The only difference is that each vertex has not two but three neighbors: www.itk.ppke.hu si-1,j + weight of edge between (i-1,j-1) and (i,j) si,j-1 + weight of edge between (i,j-1) and (i,j) si,j= max si-1,j + weight of edge between (i-1,j) and (i,j) Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 18 Travelling the Grid The only additional issue is that one must decide on the order in which visit the vertices. By the time a vertex is analyzed, the values for all its predecessors (neighbors) should be computed –otherwise we are in trouble. The graph should be cycle free (DAG –Directed Acyclic Graph). We need to traverse the vertices in a so-called topological order. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 19 Topological Order for MTP www.itk.ppke.hu a) b) 3 different strategies: a) Column by column b) Row by row c) Along diagonals c) Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 20 Actual Optimal Route: Backtracking The discussed algorithm computes the value of the optimal path leading to ‘Finish’. However, we need to get the actual routing as well. We can take up a second (trace-back) matrix and in each of it’s cells we store the neighbor that was used to get the max value for the associated vertex. Then after finished computing the values, we can backtrack from cell (n, m) to cell (0, 0) of the trace-back matrix to recreate the optimal route. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 21 Run Time Comparison Exhaustive (brute force) solution It take too long –O(n) = 2n Greedy solution It is extremely fast –O(n) = n Not acceptable because it usually misses the optimal solution. Dynamic programming solution It is fast –O(n) = n2 It always finds an optimal solution. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 22 Back to DP Solution of Alignment We can apply the DP solution of MTP (using the alignment matrix). We need to use a scoring mechanism for assigning value to each possible path. Let us introduce a simple scoring schema: +1: premium for matches (on diagonal edges) -µ: penalty for mismatch (on diagonal edges) -.: penalty for indel (on non-diagonal edges) Value of a path: match# –µ(mismatch#) –.(indel#) www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 23 Scoring Matrix Studies of mutations show that the different substitutions do not have the same frequency. Therefore it is preferable to create a scoring matrix based on substitution probabilities and use the appropriate value from this matrix when computing a mismatch . To generalize scoring, a (4+1) x(4+1) scoring matrix can be used. The addition of extra column/line is to handle indels (that is, to include the score for comparison of a gap character “-”). In the case of an amino acid sequence alignment, the scoring matrix would be of (20+1)x(20+1) size (PAM, BLOSUM). www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 24 PAM250 matrix (developed by Dayhoff) www.itk.ppke.hu pam250.jpg Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 25 BLOSUM62 Matrix (developed by Henikoff) www.itk.ppke.hu blosum62.jpg Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 26 Comparing PAM and BLOSUM PAM matrices: List the likelihood of change from one amino acid to another in homologous protein sequences and during evolution. Based on a mutational model of evolution that assumes the changes occur according to a Markov process (each change at a site is independent of previous changes at that site) BLOSUM matrices: Based on an implicit evolutionary model and use the scores of local similarity of sections in the BLOCKS database www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 27 Affine Gap Penalties In evolution a series of kindels is often the result of a single event rather than a series of ksingle mutation events. Therefore using a fixed penalty .for every elements of a series of consecutive indels is too severe. More accurate to use a score for a gap of length x: -(.+.x) where .>0 is the penalty for introducing a gap (gap opening penalty) and .>0 is the penalty for extending a gap (gap extension penalty) www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 28 Analyzing DP Solution Advantages: It is guaranteed to find an optimal alignment given a particular scoring matrix. It is very fast when we need to compare only two sequences. Disadvantage: In large-scale database searches in particularly since a large proportion of the sequences from the database will have essentially no significant match with the query sequence, and it would take intolerable amount of time. . www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 29 More Issues with DP Solution Several different alignments of two DNA or protein sequences may have the highest score. Most sequence alignment programs provide only one optimal alignment. In some cases additional alignments may have scores that are only somewhat lower than the optimal one. These suboptimal alignments could sometimes be biologically more meaningful than the optimal one(s). Some programs (e.g. LALIGN) can provide these additional alignments. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 30 Global vs. Local Alignment Global alignment Includes all of the sequences. Uses the DP algorithm as describe on previous slides. Each matrix position can have positive, negative or 0 scores. Local alignment Includes only those parts of the sequences that provide a high-scoring alignment Uses the same DP with a modification: when a score gets negative at a matrix position, then the value is changed to 0 (terminating any alignment up to that point). www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 31 Word and k-tuple Methods These are rapid methods that are used when dynamic programming is not fast enough. They apply a heuristic approach and do not necessarily find the optimal alignment. In the process of aligning two sequences they -first search for identical short subsequences (so-called words or k-tuples) -and then join these words into an alignment using dynamic programming method. The algorithms FASTA and BLAST are based on this approach and their detailed discussion is in Chapter 4. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 32 References to DP Solution Global alignment Needleman-Wunsch algorithm Needleman, Wunsch, 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-53 Local alignment Smith-Waterman algorithm Smith, Waterman, 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-97 www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 33 Multiple Sequence Alignment Comparing multiple sequences and trying to discover similarities between them. A faint similarity between two sequences becomes significant if present in many. Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal. Multiple alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. It can also be useful in genome sequencing. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 34 Visualization of Multiple Sequence Alignment Visualization by software tools can illustrate mutation such as point mutations (appearing as differing characters) and insertion/deletion mutations (indels, appearing as hyphens). www.itk.ppke.hu ClustalW_aln.gif First 90 positions of a protein multiple sequence alignment from several organisms, generated with ClustalX(Windows interface for a ClustalWmultiple sequence alignment) Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 35 Types of Multiple Alignment Just as at pairwise alignments, we could have Global alignment –attempts to align the entire sequences that participates in the process Local alignment –looks for well conserved regions in the sequences www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 36 Relationship Between Pairwise and Multiple Sequence Alignments From an optimal multiple alignment, we can infer pairwise alignments between every pairs of sequences, but they are not necessarily the optimal alignments. We have even more difficulties with the reverse problem; in some cases pairwise alignments cannot be combined into multiple alignments. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 37 Scoring of Multiple Sequence Alignments There are different ways to evaluate (score) multiple sequence alignments: -number of exact matches (only those columns count that have the same character in each sequence; it has limited value –useful only for very similar sequences) -entropy score (see details on next slide) -sum of pairs (SP, the sum of the scores of all possible pairwise alignments) www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 38 Entropy Score Determine the frequencies of occurrence of each letter in each column of the sequences. Compute entropy of each column: Entropy for a multiple alignment is the sum of entropies of its columns: www.itk.ppke.hu .over all columns.X=A,T,G,CpX log pX Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 39 Methods for Multiple Alignment -Extending the pairwise sequence alignment -Progressive alignment of the sequences -Iterative methods -Genetic algorithm -Hidden Markov Models (HMM) Note: Multiple sequence alignment algorithms are computationally difficult to produce and most real-life problems are NP-complete and therefore heuristics are used. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 40 Extending Pairwise Alignment www.itk.ppke.hu Start Finish For 3 sequences it is easy: use a 3-D “Manhattan Cube”, with each axis a sequence to align. For global alignments, find the optimal path from Start to Finish. Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 41 Architecture of the 3-D alignment www.itk.ppke.hu (i-1,j-1,k-1) (i,j-1,k-1) (i,j-1,k) (i-1,j-1,k) (i-1,j,k) (i,j,k) (i-1,j,k-1) (i,j,k-1) Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 42 Algorithm for Extending Pairwise Alignment -For each vertex it computes the maximum value considering allneighbors (predecessors): -There are 7 neighbors for 3 sequences, and generally 2k-1 neighbors for ksequences. -A k-dimensional scoring matrix is needed for ksequences. www.itk.ppke.hu sx= max of sy+ weight of vertex (y, x) where y . Predecessors(x) Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 43 Run Time for Extending Pairwise Alignment -For three sequences of length n, the run time is quite acceptable 7n3; O(n3). -For ksequences, if we use a k-dimensional Manhattan, the run time is (2k-1)(nk); O(2knk). -Thus extending the pairwise sequence alignment for larger number of sequences is impractical since the running time is exponentially grows. -Therefore it is rarely used for more than three or four sequences. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 44 Progressive Alignment of Sequences 1. Greedy approach: -Select (with pairwise alignment) the pair of sequences with the highest similarity value (as seed) -Merge them together into a so-called profile and replace them with the resulting sequence -Repeat the process on the reduced multiple alignment of k-1 sequences Note: It may go off-track by choosing a spuriously strong pairwise alignment (that is, a bad seed). www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 45 Example for Greedy Approach Sequences u1and u3are combined into a profile and replaced. www.itk.ppke.hu u1= ACg/tTACg/tTACg/cT… u2= TTAATTAATTAA… u4 … ….uk= CCGGCCGGCCGG… u1= ACGTACGTACGT… u2= TTAATTAATTAA… u3= ACTACTACTACT… … uk= CCGGCCGGCCGG k-1 k Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 46 Progressive Alignment of Sequences 2. Improved approach: CLUSTALW -Performs pairwise alignments on all possible pairs of the sequences (this could use a rapid k-tuple solution like FASTA). -Based on the alignment scores it produces a phylogenetic tree using the so-called neighbor-joining method. -Aligns the sequences using the pairwise dynamic programming algorithm, guided by the phylogenetic relationships indicated by the tree, inserting gaps as necessary. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 47 Problems with Progressive Alignment The major problem is that the final resulting multi-alignment heavily depends on the choice of the initial pairwise alignment (that is, errors of initial choice will propagate the result). This problem is more serious when the initial choice is between more distantly related sequences. Choice of suitable scoring matrix and gap penalties affect the result. Previous alignment information is lost when sequences are merged into profiles. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 48 Iterative Methods for Multiple Alignment This method attempts to correct these problems: -Repeatedly realigns subgroups of sequences and then aligns these subgroups into a global alignment. -Continues the iteration while the sum of the alignment scores for each pair of sequences (“overall score”, SP) in the multiple alignment can be improved. -Number of such programs exist (MultiAlin, PRRP, DIALIGN). www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 49 Genetic Algorithms They are general type of data mining algorithms. Main representative is SAGA (Sequence Alignment by Genetic Algorithm): -Creates an initial (random) set of 100 multi segment alignments (msa) as G0 -Selects some msa-s (“parents”) that best fit to generate offspring msa-s for next generation (Gk+1) -Evaluate the fitness of the population of Gk+1(using an objective function, a measure of multiple alignment quality) -If population is not stabilized then stop else generate next G. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 50 Comments on SAGA During “breeding” (creation of next generation) typically 50% of the fittest individuals from the previous generation are kept and the rest is replaced with the generated offspring sequences to form the new generation. As stabilization criteria, SAGA checks if unable to make improvement for some specified number of generations (typically 100). There is no valid proof that the optimum can be reached, even in an infinite amount of time. SAGA is fairly slow for large test cases (with >20 or so sequences) www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 51 Hidden Markov Models It is a probabilistic model that assigns likelihoods to possible combinations of gaps, matches, and mismatches and determines the most likely MSA or set of possible MSAs: -It is initiated with a directed acyclic graph (DAG) known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns and the estimates of transition probabilities. -Sequences to be aligned are used as training data set and the DAG (representation of HMM) is readjusted accordingly. -The trained model provides the most likely path for each sequence and thus the msa for the entire set of sequences. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 52 Pros and Cons for HMM Advantages -Offer significant improvements in computational speed especially for sequences with overlapping subsequences. -Has strong foundation in probability theory -No sequence ordering is needed. -Guesses of gap penalties are not needed. -Can produce the highest-scoring output (msa), but can also provide a set of possible alignments that can then be evaluated for biological significance. -Can be used for both global and local alignments. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 53 Pros and Cons for HMM (continued) Advantages -Can be used for both global and local alignments. -Experimentally derived information can also be used. Disadvantages -At least 20 sequences (and in some special cases many more) are needed for training purpose. -The success of applying HMM significantly depends on providing an appropriate initial model (e.g. should properly capture the expected amino acid frequencies in proteins). www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 54 Multiple Sequence Alignment Programs -ClustalW Higgins, Thompson, Gibson, 1996. Using Clustal for multiple sequence alignment. Methods Enzymol. 366:383-402 http://www.clustal.org/ -SAGA Notredame, Higgins, 1996. Sequence Alignment by Genetic Algorithm Nucleic Acid Research, 24:1515-24 www.tcoffee.org/Projects_home_page/saga_home_page.html www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 55 Multiple Sequence Alignment Programs (continued) -Sequence Alignment and Modeling Software (SAM) Krogh et al., 1994. Hidden Markov models in computational biology. J. Mol. Biol. 235:1501-31 http://compbio.soe.ucsc.edu/sam.html -HMMER Eddy, 1998. Profile hidden Markov models. Bioinformatics 14: 755-63 http://hmmer.janelia.org/ www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 56 Problems with Multiple Alignment Multidomain proteins evolve not only through point mutations but also through domain duplications and domain recombination. Although multiple sequence alignment is a 30 year old problem, there were no multiple sequence alignment approaches for aligning rearranged sequences (i.e., multi-domain proteins with shuffled domains) prior to 2002. Often impossible to align all protein sequences throughout their entire length. www.itk.ppke.hu Introduction to bioinformatics: Sequence Alignment Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 57 History of Multiple Sequence Alignment 1975 Sankoff Formulated multiple alignment problem and gave dynamic programming solution 1988 Carrillo-Lipman Branch and Bound approach for MSA 1990 Feng-Doolittle Progressive alignment 1994 Thompson-Higgins-Gibson-ClustalW Most popular multiple alignment program 1998 Morgenstern et al.-DIALIGN Segment-based multiple alignment 2000 Notredame-Higgins-Heringa-T-coffee Using the library of pairwise alignments 2004 MUSCLE www.itk.ppke.hu