2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 1 Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework** Consortium leader PETER PAZMANY CATHOLIC UNIVERSITY Consortium members SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund *** **Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben ***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg. PETER PAZMANY CATHOLIC UNIVERSITY SEMMELWEIS UNIVERSITY sote_logo.jpg dk_fejlec.gif INFOBLOKK 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 2 Peter Pazmany Catholic University Faculty of Information Technology INTRODUCTION TO BIOINFORMATICS CHAPTER 4 Similarity searching and the BLAST algorithm www.itk.ppke.hu (BEVEZETÉS A BIOINFORMATIKÁBA ) (hasonlóságkeresésés a BLASTalgoritmus) Sándor Pongor 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 2 Introduction to bioinformatics:Similarity Searching and BLAST Lecture outline Similarity searching principles and main steps, Sequence similarity, PAM and BLOSUM matrices Alignment types (local, global, exhaustive, heuristic) FASTA (briefly) BLAST (principle) Significance calculation BLAST refinements 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 3 2011.10.09.. Introduction to bioinformatics:Similarity Searching and BLAST What similarity searching is.. Given a query and a database, find the entry in the database that is most similar to the query in terms of a numerical similarity measure (distance, similarity score, etc.) In contrast: retrieval looks for an exact match to the query. Is John in the list? Retrieval: Yes/No, based on exact matching. Similarity search: His brother Joe Brown is. So we can classifyJohn into the Brown family, based on approximate matching. 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 4 2011.10.09.. Introduction to bioinformatics:Similarity Searching and BLAST The importance of similarity 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 5 2011.10.09.. Similar protein’s name (ID): Joe. Class: Brown family Introduction to bioinformatics:Similarity Searching and BLAST The use of similarity 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 6 2011.10.09.. Starting stage: Query and DB are in the same format (the search format) and we have a similarity measure. STEPS: 1. Comparequery with all entries in the DB and register similarity score. Store results above some threshold (cutoff) 2.Calculate significanceof the score 3. Rankentries according to similarity score or significance (top list) 4. Reportthe best hit (usually after some simple statistics, e.g. if it is higher than a threshold…) Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 7 2011.10.09.. We use a version of the edit distance and a specific substitution matrix (Dayhoff, BLOSUM, etc.) Exhaustive algorithms (Dynamic programming, Needleman Wunsch, Smith-Waterman) are expensive, We use heuristics that make use of the properties of biological sequences (FASTA, BLAST) Biological heuristics include a) local similarities are dense, b) similar regions are near each other, c) low complexity sequences excluded, etc. Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 8 2011.10.09.. Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 9 2011.10.09.. The score S is a sum of costsassigned to identities and mismatches, minus a penalty for gaps.Costs are stored in the substitution matrix Preliminaries HSP, high scoring segment pair Sequence similarity score Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 10 2011.10.09.. A simple example (without gaps): For a match/mismatch we look up the value in the substitution matrix. The matrix is a lookup table… Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 11 2011.10.09.. Substitution matricesin details The susbstitution matrix (also called scoring matrix) contains costs for amino acid identities and substitutions in an alignment. It is a 20x20symmetrical matrix that can be constructed from pairwise alignments of related sequences “Related” means either a) evolutionary relatedness described by an “approved” evolutionary tree (Dayhoff’s PAM matrices) b) any sequence similarity as described in the PROSITE database (Hennikoffs BLOSUM matrices) Groups of related sequences can be organized into a multiple alignment for calculation of the matrix elements. Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 12 2011.10.09.. Calculation of scoring matrices from multiple alignments. ASDEAKLVV | ATDDAKLSI || ASDEERITV Matrix elements are calculated from the observed and expected frequencies (“log odds” principle). E.g. for S/T(indicated by red): The values are calculated from many (not just one) multiple alignments. The log odds values in the matrix are then normalized to a range (e.g. -5 to +15) depending on the application Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 13 2011.10.09.. PAM matrices Percent Accepted Mutation: Unit of evolutionary change for protein sequences [Dayhoff78]. Calculated from related sequences organized into “accepted” evolutionary trees (71 trees, 1572 exchange [only]) 20x20 matrix, columns add up to the no of cases observed. All entries ×104 Converted into scoring matrix by log-odds and scaling Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 14 2011.10.09.. Pam_1 = 1% of amino acids mutate Pam_30 = (Pam_1)30 (matrix multiplication) PAM 250 (the higher the numbers the higher the divergence) Note: chemically similar amino acids are near each other … small polar basic large aromatic Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 15 2011.10.09.. BLOSUM matrices PAM uses evolutionarily related sequences, so they may not apply to divergent proteins Henikoff constructed the BLOSUM (BLOck SUbstitution Matrix) series in the same way, but using short blocks of divergent sequences taken from the PROSITE database of multiple alignments. No “grand theory” involved… In BLOSUM 62, the sequences are less than 62% identical. The higher the number the less divergent the proteins (in contrast to PAM). The most popular matrices today (they are deduced from much more alignments than PAM..) Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 16 2011.10.09.. Many other matrices possible Unitary matrix: 1 if the characters are identical (diagonal elements), zero otherwise… Such matrices are used for DNA... Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 17 2011.10.09.. Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 18 2011.10.09.. Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 19 2011.10.09.. Heuristic Sequence Alignment Why? With the Dynamic Programming algorithm, time is proportional to the product of the lengths of the two sequences. This is too slow for genome analysis.... There are two methods that are at least 50-100 times faster than dynamic programming (FASTA and BLAST) Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 20 2011.10.09.. Dynamic Programming: computational method that provide the mathematically optimalalignment fortwo sequences, and ascoring system. Heuristic Methods(e.g. BLAST, FASTA) prune the search space so they provide only aproximately best aligments. For related sequeces DP and heuristics give the same solution. For distantly related sequence the alignments differ... Restricting the search space:a) Only search the selected sequences; b) Only scan some portions of the sequences (a part of the dynamic programming matrix) Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 21 2011.10.09.. The practical trick: Represent sequences as n-character words and positions. Transform the query or dbase into a hash table (list of words and positions). Makes things faaaast… Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 22 2011.10.09.. 1 2 3 4 (kis word size) 4 Steps of FASTA Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 23 2011.10.09.. BLAST algorithm in 4 steps Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 24 2011.10.09.. Note: BLAST is faster than FASTA) because the word occurrences in the dbase are pre-computedin a hash table Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 25 2011.10.09.. Originally (BLAST1) the aligned regions (HSPs, high scoring pairs) were extended until the score went negative. The above, “2 hits” requirement exists since BLAST2, and increased accuracy dramatically. Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 26 2011.10.09.. The p-value is the probability of observing data at least as extreme as that being observed. This area is the p-value Density function (integral=1.0) Statistical significance Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 27 2011.10.09.. Introduction to bioinformatics:Similarity Searching and BLAST Significance: 1) The probability of finding a score by chance (p-value) ; 2) The number of times you expect to find a score >= a certain value by chance (E-value). (the smaller, the better) You can estimate pby making a histogram of chance (random) scores, linearizing it and reading pfrom the linear curve. An engineer’s guide to significance 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 28 2011.10.09.. An engineer’s guide to significanceA typical distribution of scores S S N Frequency (No of times found in dbase) Chance similarities (random score) Non-random similarities (biologically meaningful scores) including best hits 1) Compare a sequence with the database and make histogram 2) Almost always: biologically meaningful scores are a negligible minority : ~the whole distribution is dominated by random scores Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 29 2011.10.09.. p% S Linearize: try log-lin, lin-log, log-log transformations 1) Draw % histogram of chance similarities 2) Fit straight line 3) Read significance (y value) of a score S (x value) from curve. p% S Estimation of significance from an unkown distribution Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 30 2011.10.09.. An engineer’s guide to significanceEstimating significance from an unknown distribution Where to get distribution data: 1) comparison with real sequences, omitting largest scores. 2) Using simulated, random-shuffled sequences. Neither is “correct” but both work quite well Usually one has to extrapolate quite far since large S values are rare (red line)… True, but there is no other way. ? p% S Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 31 2011.10.09.. Just as the sum of many independent identically distributed (i.i.d) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution (EVD). Since use maximal alignments between query and db sequences, EVD is applicable! Alignment Scores follow Extreme Value distribution Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 32 2011.10.09.. The Karlin-Altschul statisticsis based on Extreme Value distribution, the expected no of HSP-s, with score at least S is where mis the query length, nis the database length, Kand .are constants. m nis called the search space. Important formulas Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 33 2011.10.09.. The raw score Sdepends on the scoring system (matrix), Kand .. The normalized bit-score S’is more “portable” The probability Pof finding at least one HSP with score >=S is where E (right) is calculated by the Karlin-Altschul formula Further important formulas Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 34 2011.10.09.. 1) For a database of N sequences, E= p x N 2) For real protein sequence similarities, pvalues are very very small. E-values are bigger, but for P<0.000001, P and E are practically identical… 3) Local alignment without gaps: –Theoretical work: Karlin-Altschul statistics: Extreme Value Distribution –Local alignments with gaps: –Empirical studies (shuffled sequences) :Extreme Value Distribution. 3 facts to remember Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 35 2011.10.09.. •Every BLAST "hit" has a score, x, derived from the substitution matrix. •Parameters for the EVD have been previously calculatedand stored for m(the length of the database ) and n(the length of the query). • Now we can get P(S.x), which is our "p-value" •To get the expected number of times this score will occur over the whole database, we multiply by m.This is the “e-value”you see reported in BLAST. How does BLAST calculate E-values? Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 36 2011.10.09.. Increasing BLAST specificity: removal of aspecific (biased composition) regions Repetitive sequences will aspecifically match with many queries CSGSCTECT seq_1 CCCGCCGCC seq_2 Sequence complexity is an empirical measure, proportional to the number of words (of arbitrary length) necessary to reproduce a sequence. Seq_2 is of low complexitybecause it can be rewritten using CC and CG only. Low complexity regions have a biased composition, they are often very repetitive. SGSGSGS, GGGGG etc. Low complexity regions can be removed replaced by XXX so that they will not take part in the alignment (SEG program of John Wootton). Has threshold parameters… Problem: some interesting sequences ARE of low complexity SEG Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 37 2011.10.09.. Introduction to bioinformatics:Similarity Searching and BLAST Multiple alignments are much more informative than simple alignments or similarity scores A multiple alignment of length ncan be transformed into a frequency matrix of n x 20, which can be used as a query (in BLAST or in dynamic programming) The PSI BLAST program can iteratively build such a matrix and use it in more and more specific searches. PSI-BLAST Increasing BLASTspecificity: iterative, position specific scoring 1 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 38 2011.10.09.. Introduction to bioinformatics:Similarity Searching and BLAST Large checker board 1) Multiple alignment of npositions (arbitrary no. mof sequences) 1,2…………..n Large checker board 1,2…………..n 1 .. 20 2) 20 x n position specific frequency matrix. Each cell is the % frequency of occurrence of an aa in that position. Large checker board MSGCCGSR db entry query 3) Use the frequency matrix as a query 1 .. m PSI-BLAST Increasing BLASTspecificity: iterative, position specific scoring 2 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 39 2011.10.09.. Increasing BLASTspecificity: iterative, position specific scoring 3 Large checker board MSGCCGSR db entry query Comparing amin acid M of the entry with position 1 of the query yields a score S1 where the sum goes through the amino acids, fiis the element of the frequency matrix and bM,iis the element of the BLOSUM matrix for M and amino acid i This is the ~same as using BLOSUM or PAM as a lookup table, so the alignment can be carried out by the same algorithm (BLAST, DP) !!! (fivalues are like weights ) PSI-BLAST iteratively includes new sequences into the multiple alignment 1,2…………..n 1 .. 20 PSI-BLAST Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 40 2011.10.09.. BLAST: BasicLocal Alignment Tool 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 41 2011.10.09.. BLASTing with protein sequence queries: blastp = Compares a protein sequence with a protein database. If you want to find something about the function of your protein, use blastp to compare your protein with other proteins contained in the databases tblastn = Compares a protein sequence with a nucleotide database. If you want to discover new genes encoding proteins, use tblastn to compare. your protein with DNA sequences translated into their six possible reading frames 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 42 2011.10.09.. BLASTing with protein sequence queries: At the NCBI BLAST server: URL: http://www.ncbi.nlm.nih.gov/BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 43 2011.10.09.. European BLAST services: EXPASY Switzerlannd http://www.expasy.org/tools/blast/ EBI Hinxton UK http://www.ebi.ac.uk/Tools/sss/ BLASTing with protein sequence queries: 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 44 2011.10.09.. What you should know Similarity searching, main steps Sequence alignment (global, local, exhaustive, heuristic) FASTA algorithm BLAST algorithm Reading significance from linearized histogram BLAST statistics (E-value, p-value), how is BLAST calculating them… Refinements (SEG, PSI-BLAST) Different kinds of BLAST programs… Introduction to bioinformatics:Similarity Searching and BLAST 2011.10.09.. TÁMOP –4.1.2-08/2/A/KMR-2009-0006 45 2011.10.09..