2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

1

Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework**
Consortium leader
PETER PAZMANY CATHOLIC  UNIVERSITY
Consortium members
SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER
The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund ***

**Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben
***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg.

PETER PAZMANY
CATHOLIC UNIVERSITY

SEMMELWEIS
UNIVERSITY

sote_logo.jpg
dk_fejlec.gif
INFOBLOKK

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

2

Peter Pazmany Catholic University  
Faculty of Information Technology

INTRODUCTION TO BIOINFORMATICS

CHAPTER 4
Similarity searching and the BLAST algorithm

www.itk.ppke.hu

(BEVEZETÉS A BIOINFORMATIKÁBA )

(hasonlóságkeresésés a BLASTalgoritmus)

Sándor Pongor

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

2


Introduction to bioinformatics:Similarity Searching and BLAST

Lecture outline
Similarity searching principles and main steps, 
Sequence similarity, PAM and BLOSUM matrices
Alignment types (local, global, exhaustive, heuristic)
FASTA (briefly)
BLAST (principle)
Significance calculation 
BLAST refinements

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

3

2011.10.09..


Introduction to bioinformatics:Similarity Searching and BLAST

What similarity searching is.. 
Given a query and a database, find the entry in the database that is most similar to the query in terms of a numerical similarity measure (distance, similarity score, etc.)
In contrast: retrieval looks for an exact match to the query.
Is John in the list? Retrieval: Yes/No, based on exact matching. Similarity search: His brother Joe Brown is. So we can classifyJohn into the Brown family, based on approximate matching.

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

4

2011.10.09..


Introduction to bioinformatics:Similarity Searching and BLAST

The importance of similarity

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

5

2011.10.09..


Similar protein’s name (ID): Joe. 
Class: Brown family


Introduction to bioinformatics:Similarity Searching and BLAST

The use of similarity

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

6

2011.10.09..


Starting stage: Query and DB are in the same format (the search format) and we have a similarity measure.
STEPS:
1. Comparequery with all entries in the DB and register similarity score. Store results above some threshold (cutoff)
2.Calculate significanceof the score 
3. Rankentries according to similarity score or significance (top list)
4. Reportthe best hit (usually after some simple statistics, e.g. if it is higher than a threshold…)

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

7

2011.10.09..


We use a version of the edit distance and a specific substitution matrix (Dayhoff, BLOSUM, etc.)
Exhaustive algorithms (Dynamic programming, Needleman Wunsch, Smith-Waterman) are expensive, 
We use heuristics that make use of the properties of biological sequences (FASTA, BLAST)
Biological heuristics include a) local similarities are dense, b) similar regions are near each other, c) low complexity sequences excluded, etc.

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

8

2011.10.09..


Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

9

2011.10.09..


The score S is a sum of costsassigned to identities and mismatches, minus a penalty for gaps.Costs are stored in the substitution matrix

Preliminaries


HSP, high scoring segment pair


Sequence similarity score

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

10

2011.10.09..


A simple example (without gaps):


For a match/mismatch we look up the value in the substitution matrix. The matrix is a lookup table…

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

11

2011.10.09..


Substitution matricesin details
The susbstitution matrix (also called  scoring matrix) contains costs for amino acid identities and substitutions in an alignment. 
It is a 20x20symmetrical matrix that can be constructed from pairwise alignments of related sequences
“Related” means either
a) evolutionary relatedness described by an “approved” evolutionary tree (Dayhoff’s PAM matrices)
b) any sequence similarity as described in the PROSITE database (Hennikoffs BLOSUM matrices)
Groups of related sequences can be organized into a multiple alignment for calculation of the matrix elements.

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

12

2011.10.09..


Calculation of scoring matrices
from multiple alignments. 

ASDEAKLVV
|
ATDDAKLSI
||
ASDEERITV

Matrix elements are calculated from the observed and expected frequencies (“log odds” principle).  E.g. for  S/T(indicated by red):


The values are calculated from many (not just one) multiple alignments. The log odds values in the matrix are then normalized to a range (e.g. -5 to +15) depending on the application

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

13

2011.10.09..


PAM matrices
Percent Accepted Mutation:  Unit of evolutionary change for protein sequences [Dayhoff78].
Calculated from related sequences organized into “accepted” evolutionary trees (71 trees, 1572 exchange [only])
20x20 matrix, columns add up to the no of cases observed.


All entries ×104

Converted into scoring matrix by log-odds and scaling

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

14

2011.10.09..


Pam_1 = 1% of amino acids mutate
Pam_30 = (Pam_1)30  (matrix multiplication)


PAM 250
(the higher the numbers the higher the divergence) 

Note: chemically similar amino acids are near each other …

small

polar

basic

large

aromatic


Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

15

2011.10.09..


BLOSUM matrices
PAM uses evolutionarily related sequences, so they may not apply to divergent proteins
Henikoff constructed the BLOSUM (BLOck SUbstitution Matrix) series in the same way, but using short blocks of divergent sequences taken from the PROSITE database of multiple alignments. No “grand theory” involved…
In BLOSUM 62, the sequences are less than 62% identical. The higher the number the less divergent the proteins (in contrast to PAM).
The most popular matrices today (they are deduced from much more alignments than PAM..)

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

16

2011.10.09..


Many other matrices possible 
Unitary matrix: 1 if the characters are identical (diagonal elements), zero otherwise…
Such matrices are used for DNA...

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

17

2011.10.09..


Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

18

2011.10.09..


Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

19

2011.10.09..


Heuristic Sequence Alignment Why?
With the Dynamic Programming algorithm, time is proportional to the product of the lengths of the two sequences. This is too slow for genome analysis....
There are two methods that are at least 50-100 times faster than dynamic programming (FASTA and BLAST)

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

20

2011.10.09..


Dynamic Programming: computational method that provide the mathematically optimalalignment fortwo sequences, and ascoring system.
Heuristic Methods(e.g. BLAST, FASTA) prune the search space so they provide only aproximately best aligments. For related sequeces DP and heuristics give the same solution. For distantly related sequence the alignments differ...
Restricting the search space:a) Only search the selected sequences; b) Only scan some portions of the sequences (a part of the dynamic programming matrix)

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

21

2011.10.09..


The practical trick: Represent sequences as n-character words and positions. Transform the query or dbase into a hash table (list of words and positions). Makes things faaaast…


Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

22

2011.10.09..


1

2

3

4

(kis word size)

4 Steps of FASTA

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

23

2011.10.09..


BLAST algorithm in 4 steps

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

24

2011.10.09..


Note: BLAST is faster than FASTA) because the word occurrences in the dbase are pre-computedin a hash table

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

25

2011.10.09..


Originally (BLAST1) the aligned regions (HSPs, high scoring pairs) were extended until the score went negative. The above, “2 hits” requirement exists since BLAST2, and increased accuracy dramatically.


Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

26

2011.10.09..


The p-value is the probability of observing data at least as extreme as that being observed.


This area is the p-value

Density function (integral=1.0)


Statistical significance

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

27

2011.10.09..


Introduction to bioinformatics:Similarity Searching and BLAST

Significance: 1) The probability of finding a score by chance (p-value) ; 2) The number of times you expect to find a  score >= a certain value by chance (E-value). (the smaller, the better)
You can estimate pby making a histogram of chance (random) scores, linearizing it and reading pfrom the linear curve.

An engineer’s guide to significance

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

28

2011.10.09..


An engineer’s guide to significanceA typical distribution of scores S


S


N

Frequency (No of times found in dbase)


Chance similarities (random score)

Non-random similarities (biologically meaningful scores) including best hits


1) Compare a sequence with the database and make histogram
2)  Almost always: biologically meaningful scores are a negligible minority : ~the whole distribution is dominated by random scores

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

29

2011.10.09..


p%

S


Linearize: 
try log-lin, lin-log, log-log transformations

1) Draw % histogram of chance similarities

2) Fit straight line

3) Read significance (y value) of a score S (x value) from curve. 


p%

S


Estimation of significance from an unkown distribution


Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

30

2011.10.09..


An engineer’s guide to significanceEstimating significance from an unknown distribution


Where to get distribution data: 1) comparison with real sequences, omitting largest scores. 2) Using simulated, random-shuffled sequences. Neither is “correct” but both work quite well
Usually one has to extrapolate quite far since large S values are rare (red line)… True, but there is no other way. 


?

p%

S

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

31

2011.10.09..


Just as the sum of many  independent identically distributed (i.i.d) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution (EVD).  Since use maximal alignments between query and db sequences, EVD is applicable!


Alignment Scores follow Extreme Value distribution

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

32

2011.10.09..


The Karlin-Altschul statisticsis based on Extreme Value distribution, the expected no of HSP-s, with score at least S is
where mis the query length, nis the database length, Kand .are constants. m nis called the search space.


Important formulas

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

33

2011.10.09..


The raw score Sdepends on the scoring system (matrix), Kand .. The normalized bit-score S’is more “portable”
The probability Pof finding at least one HSP with score >=S is 
where E (right) is calculated by the Karlin-Altschul formula


Further important formulas

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

34

2011.10.09..


1) For a database of N sequences, E= p x N
2) For real protein sequence similarities, pvalues are very very small. E-values are bigger, but for P<0.000001, P and E are practically identical…
3) Local alignment without gaps:
–Theoretical work: Karlin-Altschul statistics: Extreme Value Distribution
–Local alignments with gaps:
–Empirical studies (shuffled sequences) :Extreme Value Distribution.

3 facts to remember

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

35

2011.10.09..


•Every BLAST "hit" has a score, x, derived from the substitution matrix.

•Parameters for the EVD have been previously calculatedand stored for m(the length of the database ) and n(the length of the query).

•
Now we can get P(S.x), which is our "p-value"

•To get the expected number of times this score will occur over the whole database, we multiply by m.This is the “e-value”you see reported in BLAST.


How does BLAST calculate E-values? 

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

36

2011.10.09..


Increasing BLAST specificity: removal of aspecific (biased composition) regions

Repetitive sequences will aspecifically match with many queries
CSGSCTECT  seq_1
CCCGCCGCC  seq_2
Sequence complexity is an empirical measure, proportional to the number of words (of arbitrary length) necessary to reproduce a sequence. Seq_2 is of low complexitybecause it can be rewritten using CC and CG only.
Low complexity regions have a biased composition, they are often very repetitive. SGSGSGS, GGGGG etc.
Low complexity regions can be removed replaced by XXX so that they will not take part in the alignment (SEG program of John Wootton). Has threshold parameters…
Problem: some interesting sequences ARE of low complexity

SEG

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

37

2011.10.09..


Introduction to bioinformatics:Similarity Searching and BLAST

Multiple alignments are much more informative than simple alignments or similarity scores
A multiple alignment of length ncan be transformed into a frequency matrix of n x 20, which can be used as a query (in BLAST or in dynamic programming)
The PSI BLAST program can iteratively build such a matrix and use it in more and more specific searches.

PSI-BLAST

Increasing BLASTspecificity: iterative, position specific scoring 1

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

38

2011.10.09..


Introduction to bioinformatics:Similarity Searching and BLAST

Large checker board


1) Multiple alignment of npositions (arbitrary no. mof sequences)

1,2…………..n

Large checker board
1,2…………..n

1
..
20

2) 20 x n position specific frequency matrix. Each cell is the % frequency of occurrence  of an aa in that position.

Large checker board
MSGCCGSR     db entry

query

3) Use the frequency matrix as a query


1
..
m

PSI-BLAST

Increasing BLASTspecificity: iterative, position specific scoring 2

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

39

2011.10.09..


Increasing BLASTspecificity: iterative, position specific scoring 3

Large checker board
MSGCCGSR     db entry

query

Comparing  amin acid M of the entry with position 1 of the query yields a score S1
where the sum goes through the amino acids, fiis the element of the frequency matrix and bM,iis the element of the BLOSUM matrix for M and amino acid i This is the ~same as using BLOSUM or PAM as a lookup table, so the alignment can be carried out by the same algorithm (BLAST, DP) !!! (fivalues are like weights )


PSI-BLAST iteratively includes new sequences into the multiple alignment


1,2…………..n

1
..
20

PSI-BLAST

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

40

2011.10.09..


BLAST: BasicLocal Alignment Tool

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

41

2011.10.09..


BLASTing with protein sequence queries:
blastp = Compares a protein sequence with a protein database. If you want to find something about the function of your protein, use blastp to compare your protein with other proteins contained in the databases
tblastn = Compares a protein sequence with a nucleotide database. If you want to discover new genes encoding proteins, use tblastn to compare. your protein with DNA sequences translated into their six possible reading
frames

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

42

2011.10.09..


BLASTing with protein sequence queries:
At the NCBI BLAST server:
URL: http://www.ncbi.nlm.nih.gov/BLAST


2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

43

2011.10.09..


European BLAST services:
EXPASY Switzerlannd http://www.expasy.org/tools/blast/
EBI Hinxton UK
http://www.ebi.ac.uk/Tools/sss/

BLASTing with protein sequence queries:

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

44

2011.10.09..


What you should know

Similarity searching, main steps
Sequence alignment (global, local, exhaustive, heuristic)
FASTA algorithm
BLAST algorithm
Reading significance from linearized histogram
BLAST statistics (E-value, p-value), how is BLAST calculating them…
Refinements (SEG, PSI-BLAST)
Different kinds of BLAST programs…

Introduction to bioinformatics:Similarity Searching and BLAST

2011.10.09..

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

45

2011.10.09..