2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 1 Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework** Consortium leader PETER PAZMANY CATHOLIC UNIVERSITY Consortium members SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund *** **Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben ***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg. PETER PAZMANY CATHOLIC UNIVERSITY SEMMELWEIS UNIVERSITY sote_logo.jpg dk_fejlec.gif INFOBLOKK 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 2 Peter Pazmany Catholic University Faculty of Information Technology INTRODUCTION TO BIOINFORMATICS CHAPTER 12 Phylogenetic Tree Algorithms www.itk.ppke.hu (BEVEZETÉS A BIOINFORMATIKÁBA ) (Filogenetikai fa algoritmusok) András Budinszky Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 3 Definitions A phylogenetic treeor evolutionary tree is a type of two-dimensional graph illustrating the inferred evolutionary relationships among various organisms or genes (including paralogs) based upon similarities and differences in their physical and/or genetic characteristics. The outer nodes (or leaves) represent existing (observed) entities and the inner nodes represent (usually hypothetical) ancestors that are not part of the data. Phylogenic prediction is trying to construct such trees based on available evidence (similarities and differences of characteristics, these days typically DNA, RNA and protein sequences). www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 4 Morphological vs. Molecular Analysis The morphological phylogenetics collects representative measurements for each of the phenotypic characteristics of the species into a matrix. The types of phenotypic data assembled depend on the individual species being compared; they may involve measurements of average body size, lengths or sizes of particular bones or other physical features, or even behavioral manifestations. The molecular phylogenetics uses data that are immediate and discretely defined: distinct nucleotide sequences of DNA or RNA, and distinct amino acid sequences of proteins. However, defining homology is not easy due to the difficulties of multi-sequence alignment. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 5 History of Phylogenetic Trees 1840Early representations of branchingphylogenetic trees include a "Paleontological chart" showing the geological relationships among plants and animals (Hitchcock) 1859Publishing of the book “The Origin of Species”. This book produced one of the first illustrations and crucially popularized the notion of an evolutionary tree based on anatomical features. However, using relatively subjective observations were often inconclusive, and sometimes the conclusion are incorrect (Darwin). www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 6 History of Phylogenetic Trees (cont) 1963Publishing the notion of a so-called "molecular clock and constructing the molecular phylogenies based on evolutionary information held by DNA (Zuckerkandl & Pauling) 1967Publishing the first phylogenetic algorithm in the Science magazine starting the molecular phylogenetics (Fitch and Margoliash) www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 7 History of Phylogenetic Trees (cont) 1985Construction of evolutionary tree of humans (based on mtDNA, mitochondrial DNA with about 16,500 base point). This showed the so-called Out-of-Africa hypothesis (Wilson) 1986Creation 100 distinct trees that were also consistent with the mtDNA data suggestion the incorrectness of the Out-of-Africa hypothesis (Templeton). www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 8 Characteristics of Trees As phylogeny trees binary trees are used; that makes sense based on evolutionary steps. There are two different types of trees: rooted (when one of the nodes is special) and unrooted one. Unrooted tree is used when either the position of the root is of no interest or its position is difficult to determine. www.itk.ppke.hu rooted tree unrooted tree Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 9 Main Methods for Phylogenetic Predictions There are usually one of the three following methods is being used to construct a phylogenetic tree that best fits the observations: • Parsimony method –this approach assumes an implicit model of evolution (parsimony) and thus attempts to minimize the overall number of mutations for the entire tree. • Distance-based method –this is simplest to implement, but does not assume an evolutionary model. • Likelihood-based method –this is the most complex one and needs an explicit evolutionary model. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 10 Parsimony Method This method looks for the tree with the lowest possible so-called parsimony score(sum of cost of all mutations in the tree). The method is also referred to as the minimum evolution method. The base data for the comparison of the different species could be morphological but –nowadays –it is molecular. Its logic is based on Occam’s razor principle. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 11 Occam (or Ockham) Razor Principle This principle comes from a 14th-century English logician, theologian and Franciscan friar Father William of Ockham. He wrote that "entities must not be multiplied beyond necessity”. This is also phrased as "plurality should not be assumed without necessity” (or as usually paraphrased: “keep it simple, stupid”). Nevertheless, in a scientific method, Occam's razor should not be considered an irrefutable principle of logic. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 12 Parsimony Method, Algorithm Steps: A. Selects those positions (columns) that are “informative” for parsimony analysis. A position is informative if at least 2 different characters occur on that position and each one appears at least in two sequences (there). B. For every informative positiona) Generates all possible unrooted trees with the given species (e.g., sequences of the same gene) at the leaves. b) Determines which of these trees require the smallest number of changes (see details of this step later, at Fitch algorithm) . C. The tree with the smallest total changes will be returned. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 13 Example for Parsimony Method • There are 3 possible unrooted tree configurations. • Positions 2. is not informative because it has only character ‘A’. • Positions 4., 5. and 7. are not informative because they have characters that appears only once in the position. • Positions 1., 3, and 6 are informative. www.itk.ppke.hu 1. 2. 3. 4. 5. 6. 7. 1. C A G A G A A 2. C A A T A A G 3. G A A C A G C 4. G A G G A G A Species character positions Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 14 Parsimony Score for Position 1 in the Example Here are three possible trees to be considered: 122 scores for the trees of position 1 www.itk.ppke.hu C G G 1. G G C 3. 4. 2. C G G 1. G G C 4. 3. 2. C C C 1. G G G 2. 4. 3. Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 15 Fitch Algorithm It is used to determine the associated parsimony score for a given tree configuration (on a given character position). Steps of the algorithm (for convenience, the algorithm considers the tree rooted and works its way up from leaves to root): A. For every leaf node i, assigns zero as initial parsimony score (Pi), and the character on that position as the only element in the possible base assignments set (Si). www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 16 Fitch Algorithm Steps (continued): A. For every parent node kwith children i and j, computes Pkand Skas• If Si. Sjis empty, then Pk= Pi+ Pj+ 1, Sk= SiSj • If Si. Sjis not empty, then Pk= Pi+ Pj, Sk= Si. Sj B. Returns Prootas the parsimony score of this tree. Note:This algorithm is executed for every tree configuration at every informative position. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 17 Weighted Parsimony Method On the previous slide in the formula of computing the parsimony score we have used value 1 to score each change (mutation). As we have discussed this at the alignment problems, studies of mutations show that the different mutations do not have the same frequency. Therefore it makes sense to use a scoring matrix when computing the parsimony score. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 18 Difficulties with Parsimony Method In the example we had only four sequences and thus three possible unrooted tree configurations. In general, the number of possible unrooted trees for nleaves is (2n–3)! (n –2)!2n-2 Clearly this makes this algorithm NP-hard and for larger nvalues heuristics must be applied. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 19 Distance-Based Method This method explicitly based on a measure of "genetic distance" between the sequences provided by an MSA (multiple sequence alignment). The first step of the method is finding the distances between all pairs of sequences and place these distances into a matrix. Then the goal of the method is to identify a tree that fits the matrix; that is, places the sequences on the leaves and creates internal vertices with such lengths that adding up the lengths between each sequence pair, the distance measures in the matrix are all matched. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 20 Requirements/Definitions The distances should meet the following three criteria: • Symmetry: dij= dji for each i, j • Distinguishability: dij. 0 if, and only if i. j • Triangle inequality: dij. dik+ dki for each i, j, k. A tree is additive if the distance between any pair of leaves is the sum of distances between those leaves and the first node they share on the tree. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 21 Testing for additivity In order to decide whether an additive tree fitting a given distance matrix D can be constructed, the following so-called four-point condition for additive trees could be tested: www.itk.ppke.hu D corresponds to an additive tree if and only if for any of the four sequences (labeled 1, 2, 3, 4) two of the sums d12+ d34, d13+ d24, d14+ d14, are equal and greater than or equal to the third. Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 22 Additive/Non-Additive Examples additive matrix/tree non-additive matrix www.itk.ppke.hu A B C D A 0 5 8 7 B 0 9 8 C 0 3 D 0 A B C D A 0 3 3 3 B 0 5 3 C 0 3 D 0 A B C 3 4 2 D 1 2 ? Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 23 Fitch-Margoliash Algorithm Steps: A. Finds the most closely related sequence pair in the distance matrix (say A and B), and makes a temporary distance matrix with A and B, and the rest of the sequences combined (as X). B. Calculates the average distance from A and B to the node X. C. From this special, 3-node case, computes the direct branches leading to A and B (see a follow-up slide). www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 24 Fitch-Margoliash Algorithm Steps (continued): D. Reduces the original distance matrix by deleting rows and columns for A and B and inserting a new row and column for a composite (AB) entity with averaged distances (e.g. the distance from C to (AB) is the average distance of C to A and C to B). E. Repeats steps A. through D. until a matrix is reduced to a 2 by 2 size. Note:Step A. assumes that the closest sequence pairs are not necessarily neighbors, so the algorithm should be repeated starting with another pair and check if get the same (correct) tree. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 25 Solving the Special 3-Node Case distance(AB) = a+ b distance(AX) = a+ x distance(BX) = b+ x Note: This computation is used in step C. of the F-M algorithm www.itk.ppke.hu A B combined node (X) b x a This is solved for aand b to get the direct branches to A and B Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 26 Finding Neighboring Leaves Closest leaves aren’t necessarily neighbors. Counter example: A and B are neighbors, but (dAB= 13) > (dBD= 12) Finding neighbors could be difficult task. www.itk.ppke.hu A B C 3 4 8 D 5 3 Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 27 None-Additive Matrix If there is no additive tree fitting a given distance matrix D(that is, one of the quadruplets do not satisfy the 4-point condition), then we look for a tree Tthat approximates Dthe best: Squared Error: .i,j(dij(T) –Dij)2 Squared error is a measure of the quality of the fit between distance matrix and the tree: we want to minimize it. Finding the best approximation tree Tfor a non-additive distance matrix D is NP-hard. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 28 Neighbor-Joining Algorithm Very much like the Fitch-Margoliash algorithm except the choice of sequence pair (step A.) is controlled. Developed by Naruya Saitou and Masatoshi Nei in 1987. Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves. Works well not only for additive distance matrices but for others as well. The general neighbor-joining is available from ftp.virginia.edu/pub/fasta/GNJ www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 29 Maximum Likelihood Approach This method uses standard statistical techniques for inferring probability distributions to assign probabilities to particular possible phylogenetic trees. The method requires an a priory substitution model to assess the probability of particular mutations. Roughly speaking, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as having a lower probability. It is similar to the parsimony method in the sense that the analysis is performed on each column of the multiple sequence alignment. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 30 Maximum Likelihood, Algorithm Steps: A. Generates all possible trees for the given sequences. B. Places the characters at the first (next) position of the sequences on the leaves of all these trees. C. Select first (next) tree (Ti). D. Generates all variations of Ti (a set of Ti’) by placing all different patterns of changes at the inner nodes. E. Looks up the probability of change at each of the branching in each Ti’tree in a probability table of an applied evolutionary model. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 31 Maximum Likelihood, Algorithm Steps (continued): F. Calculates the probability of pattern change defined in each Ti’tree by multiplying the branch probabilities. G. Sums the probabilities of all Ti’trees to get the probability of Ti. H. Repeats steps C. through G for each tree T. I. Picks the highest probability value to the determine the most probable tree for this position. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 32 Maximum Likelihood, Algorithm Steps (continued): J. Repeats steps B. through I. for each position of the aligned sequences. K. The most likely tree will be the one that provides the highest overall probability at all of the positions found by summing the position probabilities for each tree. Note:Due to the fact that the maximum likelihood method considers all possible trees, it is only feasible for a small number of sequences. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 33 Problems with Building Phylogenetic Trees Some of the assumptions used by the various algorithms are violated but not quite addresses yet: • Evolution on the different sequence positions happens according to the same stochastic model • Not true; for example third positions evolve faster than first • Sequence positions evolve independently of one another• Not true for example in sequences with secondary structures • Sequences are aligned• Usually gaps are removed before building trees, however, the effects of this is hard to assess. www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 34 Phylogenetic Tree Software PHYLIP It is a free package of programs for inferring phylogenies. It is distributed as source code, documentation files, and a number of different types of executables. The program includes parsimony, distance matrix, and likelihood methods as well. (Felsenstein, 1989) http://evolution.genetics.washington.edu/phylip.html PAUP It has been the most widely used software package for the inference of evolutionary trees. Originally it used parsimony, but current version (4.0) extended with maximum likelihood and distance methods. (Swofford) http://paup.csit.fsu.edu/ www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 35 Phylogenetic Tree Software (cont) MacClade It is a program for phylogenetic analysis. Its analytical strength is in studies of character evolution. It runs only on Mac platform. (David & Wayne Maddison) http://macclade.org/macclade.html MESQUITE It is designed to analyze comparative data about organisms. Its emphasis is on phylogenetic analysis, but some of its modules concern population genetics, while others do non-phylogenetic multivariate analysis. (also David & Wayne Maddison) http://mesquiteproject.org/ www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 36 Phylogenetic Tree Software (cont) MEGA (now MEGA5) It is an integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses. It is a multi-threaded Windows application and runs on all releases of Microsoft Windows operating system. It is the most quoted phylogenetic program package. http://www.megasoftware.net/ www.itk.ppke.hu Introduction to bioinformatics: Phylogenetic Tree Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 37 References Tree-of-life web site http://tolweb.org/tree/phylogeny.html Felsenstein’sPhylogenetic Program Directory http://evolution.genetics.washington.edu/phylip.html UT Austin Phylogenetics Lab http://kristin.csres.utexas.edu/ WoeseLab http://www.life.uiuc.edu/micro/woese.html www.itk.ppke.hu