2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

1

Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework**
Consortium leader
PETER PAZMANY CATHOLIC  UNIVERSITY
Consortium members
SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER
The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund ***

**Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben
***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg.

PETER PAZMANY
CATHOLIC UNIVERSITY

SEMMELWEIS
UNIVERSITY

sote_logo.jpg
dk_fejlec.gif
INFOBLOKK

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

2

Peter Pazmany Catholic University  
Faculty of Information Technology

INTRODUCTION TO BIOINFORMATICS

CHAPTER 3
Sequence Alignment Algorithms

www.itk.ppke.hu

(BEVEZETÉS A BIOINFORMATIKÁBA )

(Szekvencia illesztési algoritmusok)

András Budinszky


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

3

Sequence Alignment
It is a process of comparing two (pairwise sequence alignment) or more (multiple sequence alignment) DNA or protein sequences. 
The sequences are arranged to discover similarities that could show a functional, structural or evolutionary relationships between the sequences.
Similarity means a degree of match at corresponding positions of the sequences.
Similarity is usually a consequence of homology but it could occur by chance (when comparing short sequences).  

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

4

Types of Pairwise Alignment
Global Alignment:
It attempts to align every position in the entire sequences and determine the measure of their similarity from end to end in each sequence.
It is usually used with sequences of approximately the same length. 
Local Alignment:
It attempts to align sections of the sequences (“islands”, conserved regions) with significant similarity.
It can be used for sequences of quite different length.  

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

5

Methods for Pairwise Alignment
-Dot plot (matrix) analysis
-Dynamic programming algorithm
-Word or k-tuple methods
Note: Each method has its strengths and weaknesses, and all three pairwise methods have difficulty with highly repetitive sequences, especially if the number of repetitions differ.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

6

Dot Plot (matrix) Analysis
It is a graphical method.
It is the simplest one and should be the primary method considered for pairwise sequence alignment.
It creates a two-dimensional matrix.
One of the sequences is written along the top row and the other along the leftmost column of the matrix.
A dot is placed at any point where the characters match and the rest of the points are left blank.
Matching sections of the sequences are shown as diagonals of dots.
It works best if it uses value thresholds.  

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

7

Dot Matrix Programs 
A number of them available:
DOTTER (http://sonnhammer.sbc.su.se/Dotter.html) with interactive features
COMPARE and DOTPOLT (Genetics Computer Group)
EMBOSS suite (http://emboss.sourceforge.net/):
-dotmatcher (align sequences using a scoring matrix)
-dottup (finds common words in sequences)
-dotplot (finds common patterns in sequences)

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

8

Finding sequence repeats
A special use of dot matrix: aligning a sequence with itself:
The main diagonal shows the alignment with itself.
Other lines show repetitive patterns within the sequence.

www.itk.ppke.hu

200px-Zinc-finger-dot-plot.png

Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

9

Biological Background
Evolutionarily related DNA or protein sequences have mutations:
-substitutions
-insertions or deletions.
When aligning sequences we can allow:
-mismatch (corresponding to substitution)
-gap insertion (corresponding to insertion or deletion)
The second one is called indels.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

10

Measures for Sequence Similarities
Hamming distance: number of positions differ in the two strings.
Note: It is not to useful to compare DNA or protein sequences because it considers only substitution mutations.
Levenshtein distance: minimum number of editing operations needed to transform one sequence into the other, where the editing operations are insertion, deletion and substitution.
Note: A given editing sequence corresponds to a unique pairwise alignment, but the reverse is not true.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

11

Example for Measures
s1 = TATAT
s2 = ATATA
Hamming distance = 5
Levenshtein distance = 2
(step 1: insert an ‘A’ in front of s1
step 2: delete the ‘T’ at the end of s1)  

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

12

Intro to Dynamic Programming Solution
Construct a grid where characters of one sequence index the rows, and characters of the other index the columns.
Any path through the grid from the top left to the bottom right corner corresponds to an alignment.
Each segment in a path corresponds
-an indel (if its direction is down or side-way)
-a match or a substitution (if its direction is diagonal).
We need to find the “optimal” path assuming that each segment has an associated cost.
Related problem: Manhattan Tourist Problem (MTP).  

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

13

Find a path with the most number of attractions (*) in the Manhattan grid going from an upper West-side corner (Start) to a lower East-side corner (Finish) of Manhattan and traveling only eastward and southward.

*

*

*

*

*

*

*

*

*

*


Start

*

Finish


*

www.itk.ppke.hu

Manhattan Tourist Problem (MTP)


Introduction to bioinformatics: Sequence Alignment Algorithms

MTP: Exhaustive (Brute Force) Solution
Generate ALL possible paths in the grid.
Output the best path as solution.
Guaranteed to find optimal solution.
It is tractable if graph is not large.
Not feasible for even a moderately sized graph.

www.itk.ppke.hu

2010.11.15

14

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 


Introduction to bioinformatics: Sequence Alignment Algorithms

At every vertex, choose the adjacent edge with the highest weight.
Easily achievable in polynomial time, but is unlikely to give the optimal solution, especially for larger graphs!


3

4

3

3

1

2

1

2

2

3

2

6

1

1

7

4

5

1

7

3

3

9

3

2


Start

Finish

12v 28


MTP: A Greedy Solution

www.itk.ppke.hu

2010.11.15

15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

16

MTP: Dynamic Programming Solution
Store at each vertex the “value” of the optimal path (si,j ) leading to that vertex.
Initialize s0,0to 0.
Now computing values in 1strow and 1stcolumn (s0,j and si,0for all i and j) is easy.
Finally we can compute the rest of the si,jvalues as 

www.itk.ppke.hu


si-1,j + weight of edge between (i-1,j) and (i,j)

si,j-1 + weight of edge between (i,j-1) and (i,j)


si,j= max


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

17

Generalized MTP
How to handle diagonal streets of Manhattan (like e.g. Broadway)?
The only difference is that each vertex has not two but three neighbors: 

www.itk.ppke.hu


si-1,j + weight of edge between (i-1,j-1) and (i,j)

si,j-1 + weight of edge between (i,j-1) and (i,j)

si,j= max

si-1,j + weight of edge between (i-1,j) and (i,j)


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

18

Travelling the Grid
The only additional issue is that one must decide on the  order in which visit the vertices.
By the time a vertex is analyzed, the values for all its predecessors (neighbors) should be computed –otherwise we are in trouble.
The graph should be cycle free (DAG –Directed Acyclic Graph).
We need to traverse the vertices in a so-called topological order.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

19

Topological Order for MTP

www.itk.ppke.hu


a)

b)

3 different strategies:
a) Column by column
b) Row by row
c) Along diagonals

c)


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

20

Actual Optimal Route: Backtracking
The discussed algorithm computes the value of the optimal path leading to ‘Finish’.
However, we need to get the actual routing as well.
We can take up a second (trace-back) matrix and in each of it’s cells we store the neighbor that was used to get the max value for the associated vertex.
Then after finished computing the values, we can backtrack from cell (n, m) to cell (0, 0) of the trace-back matrix to recreate the optimal route. 

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

21

Run Time Comparison
Exhaustive (brute force) solution
It take too long –O(n) = 2n
Greedy solution
It is extremely fast –O(n) = n
Not acceptable because it usually misses the optimal solution.
Dynamic programming solution
It is fast –O(n) = n2
It always finds an optimal solution.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

22

Back to DP Solution of Alignment
We can apply the DP solution of MTP (using the alignment matrix).
We need to use a scoring mechanism for assigning value to each possible path.
Let us introduce a simple scoring schema:
+1: premium for matches (on diagonal edges)
-µ: penalty for mismatch (on diagonal edges)
-.: penalty for indel (on non-diagonal edges)
Value of a path:
match# –µ(mismatch#) –.(indel#)

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

23

Scoring Matrix
Studies of mutations show that the different substitutions do not have the same frequency.
Therefore it is preferable to create a scoring matrix based on substitution probabilities and use the appropriate value from this matrix when computing a mismatch .
To generalize scoring, a (4+1) x(4+1) scoring matrix can be used. The addition of extra column/line is to handle indels (that is, to include the score for comparison of a gap character “-”).  
In the case of an amino acid sequence alignment, the scoring matrix would be of (20+1)x(20+1) size (PAM, BLOSUM).

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

24

PAM250 matrix (developed by Dayhoff)

www.itk.ppke.hu

pam250.jpg

Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

25

BLOSUM62 Matrix (developed by Henikoff)

www.itk.ppke.hu

blosum62.jpg

Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

26

Comparing PAM and BLOSUM
PAM matrices:
List the likelihood of change from one amino acid to another in homologous protein sequences and during evolution.
Based on a mutational model of evolution that assumes the changes occur according to a Markov process (each
change at a site is independent of previous changes at that
site) 
BLOSUM matrices:
Based on an implicit evolutionary model and use the scores of local similarity of sections in the BLOCKS database

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

27

Affine Gap Penalties
In evolution a series of kindels is often the result of a single event rather than a series of ksingle mutation events.
Therefore using a fixed penalty .for every elements of a series of consecutive indels is too severe.
More accurate to use a score for a gap of length x: 
-(.+.x)
where   .>0 is the penalty for introducing a gap  
(gap opening penalty)    and
.>0 is the penalty for extending a gap
(gap extension penalty)

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

28

Analyzing DP Solution
Advantages:
It is guaranteed to find an optimal alignment given a particular scoring matrix.
It is very fast when we need to compare only two sequences.
Disadvantage:
In large-scale database searches in particularly since a large proportion of the sequences from the database will have essentially no significant match with the query sequence, and it would take intolerable amount of time. .

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

29

More Issues with DP Solution
Several different alignments of two DNA or protein sequences may have the highest score.
Most sequence alignment programs provide only one optimal alignment.
In some cases additional alignments may have scores that are only somewhat lower than the optimal one.
These suboptimal alignments could sometimes be biologically more meaningful than the optimal one(s).
Some programs (e.g. LALIGN) can provide these additional alignments. 

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

30

Global vs. Local Alignment
Global alignment
Includes all of the sequences.
Uses the DP algorithm as describe on previous slides.
Each matrix position can have positive, negative or 0 scores.
Local alignment
Includes only those parts of the sequences that provide a high-scoring alignment
Uses the same DP with a modification: when a score gets negative at a matrix position, then the value is changed to 0 (terminating any alignment up to that point).

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

31

Word and k-tuple Methods
These are rapid methods that are used when dynamic programming is not fast enough.
They apply a heuristic approach and do not necessarily find the optimal alignment.
In the process of aligning two sequences they
-first search for identical short subsequences (so-called words or k-tuples)
-and then join these words into an alignment using dynamic programming method.
The algorithms FASTA and BLAST are based on this approach and their detailed discussion is in Chapter 4. 

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

32

References to DP Solution
Global alignment
Needleman-Wunsch algorithm
Needleman, Wunsch, 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol.  48: 443-53
Local alignment
Smith-Waterman algorithm
Smith, Waterman, 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-97

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

33

Multiple Sequence Alignment
Comparing multiple sequences and trying to discover similarities between them.
A faint similarity between two sequences becomes significant if present in many.
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.
Multiple alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees.
It can also be useful in genome sequencing.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

34

Visualization of Multiple Sequence Alignment
Visualization by software tools can illustrate mutation such as point mutations (appearing as differing characters) and insertion/deletion mutations (indels, appearing as hyphens).

www.itk.ppke.hu

ClustalW_aln.gif
First 90 positions of a protein multiple sequence alignment from several organisms, generated with ClustalX(Windows interface for a ClustalWmultiple sequence alignment)


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

35

Types of Multiple Alignment
Just as at pairwise alignments, we could have  
Global alignment –attempts to align the entire sequences that participates in the process
Local alignment –looks for well conserved regions in the sequences

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

36

Relationship Between Pairwise and Multiple Sequence Alignments
From an optimal multiple alignment, we can infer pairwise alignments between every pairs of sequences, but they are not necessarily the optimal alignments.
We have even more difficulties with the reverse problem; in some cases pairwise alignments cannot be combined into multiple alignments.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

37

Scoring of Multiple Sequence Alignments
There are different ways to evaluate (score) multiple sequence alignments:
-number of exact matches (only those columns count that have the same character in each sequence; it has limited value –useful only for very similar sequences)
-entropy score (see details on next slide)
-sum of pairs (SP, the sum of the scores of all possible pairwise alignments)

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

38

Entropy Score
Determine the frequencies of occurrence of each letter in each column of the sequences.
Compute entropy of each column:
Entropy for a multiple alignment is the sum of entropies of its columns:

www.itk.ppke.hu


.over all columns.X=A,T,G,CpX log pX


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

39

Methods for Multiple Alignment
-Extending the pairwise sequence alignment
-Progressive alignment of the sequences
-Iterative methods
-Genetic algorithm
-Hidden Markov Models (HMM)
Note: Multiple sequence alignment algorithms are  computationally difficult to produce and most real-life problems are NP-complete and therefore heuristics are used.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

40

Extending Pairwise Alignment

www.itk.ppke.hu

Start

Finish

For 3 sequences it is easy: use a 3-D “Manhattan Cube”, with each axis a sequence to align. 
For global alignments, find the optimal path from Start to Finish.


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

41

Architecture of the 3-D alignment

www.itk.ppke.hu


(i-1,j-1,k-1)

(i,j-1,k-1)

(i,j-1,k)

(i-1,j-1,k)

(i-1,j,k)

(i,j,k)

(i-1,j,k-1)

(i,j,k-1)


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

42

Algorithm for Extending Pairwise Alignment
-For each vertex it computes the maximum value considering allneighbors (predecessors):
-There are 7 neighbors for 3 sequences, and generally 2k-1 neighbors for ksequences.
-A k-dimensional scoring matrix is needed for ksequences.

www.itk.ppke.hu

sx= max of sy+ weight of vertex (y, x) where 
y . Predecessors(x)


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

43

Run Time for Extending Pairwise Alignment
-For three sequences of length n, the run time is quite acceptable 7n3; O(n3).
-For ksequences, if we use a k-dimensional Manhattan, the run time is (2k-1)(nk); O(2knk).
-Thus extending the pairwise sequence alignment for larger number of sequences is impractical since the running time is exponentially grows.
-Therefore it is rarely used for more than three or four sequences.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

44

Progressive Alignment of Sequences 1.
Greedy approach:
-Select (with pairwise alignment) the pair of sequences with the highest similarity value (as seed)
-Merge them together into a so-called profile and replace them with the resulting sequence
-Repeat the process on the reduced multiple alignment of k-1 sequences
Note: It may go off-track by choosing a spuriously strong pairwise alignment (that is, a bad seed).

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

45

Example for Greedy Approach
Sequences u1and u3are combined into a profile and replaced.

www.itk.ppke.hu

u1= ACg/tTACg/tTACg/cT…
u2= TTAATTAATTAA…
u4 …
….uk= CCGGCCGGCCGG…

u1= ACGTACGTACGT…
u2= TTAATTAATTAA…
u3= ACTACTACTACT…
…
uk= CCGGCCGGCCGG


k-1

k


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

46

Progressive Alignment of Sequences 2.
Improved approach: CLUSTALW
-Performs pairwise alignments on all possible pairs of the sequences (this could use a rapid k-tuple solution like FASTA).
-Based on the alignment scores it produces a phylogenetic tree using the so-called neighbor-joining method.
-Aligns the sequences using the pairwise dynamic programming algorithm, guided by the phylogenetic relationships indicated by the tree, inserting gaps as necessary.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

47

Problems with Progressive Alignment
The major problem is that the final resulting multi-alignment heavily depends on the choice of the initial pairwise alignment (that is, errors of initial choice will propagate the result). This problem is more serious when the initial choice is between more distantly related sequences.
Choice of suitable scoring matrix and gap penalties affect  the result.
Previous alignment information is lost when sequences are merged into profiles.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

48

Iterative Methods for Multiple Alignment
This method attempts to correct these problems:
-Repeatedly realigns subgroups of sequences and then aligns these subgroups into a global alignment.
-Continues the iteration while the sum of the alignment scores for each pair of sequences (“overall score”, SP) in the multiple alignment can be improved.
-Number of such programs exist (MultiAlin, PRRP, DIALIGN).

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

49

Genetic Algorithms
They are general type of data mining algorithms.
Main representative is SAGA (Sequence Alignment by Genetic Algorithm):
-Creates an initial (random) set of 100 multi segment alignments (msa) as G0
-Selects some msa-s (“parents”) that best fit to generate offspring msa-s for next generation (Gk+1)
-Evaluate the fitness of the population of Gk+1(using an objective function, a measure of multiple alignment quality)
-If population is not stabilized then stop else generate next G.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

50

Comments on SAGA
During “breeding” (creation of next generation) typically 50% of the fittest individuals from the previous generation are kept and the rest is replaced with the generated offspring sequences to form the new generation.
As stabilization criteria, SAGA checks if unable to make improvement for some specified number of generations (typically 100).
There is no valid proof that the optimum can be reached, even in an infinite amount of time.
SAGA is fairly slow for large test cases (with >20 or so sequences)

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

51

Hidden Markov Models
It is a probabilistic model that assigns likelihoods to possible combinations of gaps, matches, and mismatches and determines the most likely MSA or set of possible MSAs:
-It is initiated with a directed acyclic graph (DAG) known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns and the estimates of transition probabilities.
-Sequences to be aligned are used as training data set and the DAG (representation of HMM) is readjusted accordingly.
-The trained model provides the most likely path for each sequence and thus the msa for the entire set of sequences.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

52

Pros and Cons for HMM
Advantages
-Offer significant improvements in computational speed especially for sequences with overlapping subsequences.
-Has strong foundation in probability theory
-No sequence ordering is needed.
-Guesses of gap penalties are not needed.
-Can produce the highest-scoring output (msa), but can also provide a set of possible alignments that can then be evaluated for biological significance.
-Can be used for both global and local alignments.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

53

Pros and Cons for HMM (continued)
Advantages
-Can be used for both global and local alignments.
-Experimentally derived information can also be used.
Disadvantages
-At least 20 sequences (and in some special cases many more) are needed for training purpose.
-The success of applying HMM significantly depends on providing an appropriate initial model (e.g. should properly capture the expected amino acid frequencies in proteins).

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

54

Multiple Sequence Alignment Programs
-ClustalW
Higgins, Thompson, Gibson, 1996. Using Clustal for multiple sequence alignment. 
Methods Enzymol. 366:383-402
http://www.clustal.org/
-SAGA
Notredame, Higgins, 1996. Sequence Alignment by Genetic Algorithm
Nucleic Acid Research, 24:1515-24
www.tcoffee.org/Projects_home_page/saga_home_page.html

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

55

Multiple Sequence Alignment Programs
(continued)
-Sequence Alignment and Modeling Software (SAM)
Krogh et al., 1994. Hidden Markov models in computational biology. J. Mol. Biol. 235:1501-31
http://compbio.soe.ucsc.edu/sam.html
-HMMER
Eddy, 1998. Profile hidden Markov models.
Bioinformatics 14: 755-63
http://hmmer.janelia.org/

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

56

Problems with Multiple Alignment
Multidomain proteins evolve not only through point mutations but also through domain duplications and domain recombination.
Although multiple sequence alignment is a 30 year old problem, there were no multiple sequence alignment approaches for aligning rearranged sequences (i.e., multi-domain proteins with shuffled domains) prior to 2002.
Often impossible to align all protein sequences throughout their entire length.

www.itk.ppke.hu


Introduction to bioinformatics: Sequence Alignment Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

57

History of Multiple Sequence Alignment
1975 Sankoff
Formulated multiple alignment problem and gave dynamic programming solution
1988 Carrillo-Lipman
Branch and Bound approach for MSA
1990 Feng-Doolittle
Progressive alignment
1994 Thompson-Higgins-Gibson-ClustalW
Most popular multiple alignment program
1998 Morgenstern et al.-DIALIGN
Segment-based multiple alignment
2000 Notredame-Higgins-Heringa-T-coffee
Using the library of pairwise alignments
2004 MUSCLE 

www.itk.ppke.hu