2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 1 Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework** Consortium leader PETER PAZMANY CATHOLIC UNIVERSITY Consortium members SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund *** **Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben ***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg. PETER PAZMANY CATHOLIC UNIVERSITY SEMMELWEIS UNIVERSITY sote_logo.jpg dk_fejlec.gif INFOBLOKK 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 2 Peter Pazmany Catholic University Faculty of Information Technology INTRODUCTION TO BIOINFORMATICS CHAPTER 11 Motif Finding Algorithms www.itk.ppke.hu (BEVEZETÉS A BIOINFORMATIKÁBA ) (Motívum kereső algoritmusok) András Budinszky Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 3 Motivation To discover and understand the mechanisms that regulate gene expression is a major challenge in biology. An important task in this challenge is to identify regulatory elements, especially the binding sites in DNA for transcription factors. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 4 Definitions Signalsare DNA or RNA sequence patterns that are “recognized” proteins or other molecules. A binding site of the transcription factor in the promoter region is such a signal. A transcription factor can bind to several binding sites in the promoter regions of different genes to make these genes co-regulated, and such binding sites should have common patterns called motifs. The task of motif finding is to locate these substrings of common patters in multiple sequences. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 5 The Challenge of Motif Finding The sequence of the motif is unknown. Its location relative to the genes start is also unknown. It can be located anywhere within the regulatory regions of the sequences, and the regulatory region typically stretching 100-1000 bpupstream of the transcriptional start site. It may vary slightly across different regulatory regions since non-essential bases could mutate. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 6 The Motif Finding Problem Given • tnumber of DNA sequences, each contains n nucleotides, • a fixed but unknown nucleotide sequence M (the motif) of length l, • the integer value d. Furthermore, each of the tDNA sequences contains a planted variant of M with at most dpoint substitution starting at unknown positions s(s1, s2, …, st). The MFP is to determine these starting positions (s). www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 7 Motif Finding Problem, Example www.itk.ppke.hu DNA sequences (t) s1 s2 st s5 s4 s3 start indices of motifs (s) genes … … genes (length l) Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 8 Consensus Sequence Once we know the starting positions, we can generate a consensus sequence; one that on each of its l position contains the nucleotide with the highest occurrence frequency in the corresponding positions in the t sequences . We can think of the consensus sequence as an “ancestor” motif, from which the modified motifs in the sequences mutated. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 9 Consensus Sequence, Example www.itk.ppke.hu T A GT G A T T AA AT G A T G C TA T G A CT C A AT AGT T C A AT TCT T A 1 440 1 30 0 C 30 0 0 0 1 1 0 G 0 0 1 0 31 0 1 T 1 1 0 51 0 4 4 C A AT G A T T alignment of motifs frequency matrix consensus sequence Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 10 Evaluation of Motifs Alignment We can introduce two evaluators as measure of alignment between the consensus sequence and the motifs in the sequences: A. The sum of the highest occurrence frequencies (an indicator of the total similarities between the “ancestor” motif and the “implantations”): TSim(s, DNA) = whereDNA is the set of sequences, lis the length of the motif, and s is the vector of the starting points www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 11 Evaluation of Motifs Alignment (cont) B. The sum of the Hamming distances (indicator of the total distances between the “ancestor” motif and the “implantations”): SDist(s, DNA) = whereDNA is the set of sequences, s is the vector of the starting points, tis the number of sequences, and c is the consensus sequence Note:Hamming distance is the number of point mutations (differences). www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 12 Evaluation Example www.itk.ppke.hu T A GT G A T T AA AT G A T G C TA T G A CT C A AT AGT T C A AT TCT T A 1 440 1 30 0 C 30 0 0 0 1 1 0 G 0 0 1 0 31 0 1 T 1 1 0 51 0 4 4 C A AT G A T T TSim(s, DNA) = 3 + 4 + 4 + 5 + 3 + 3 + 4 + 4 = 30 SDist(s, DNA) = 2 + 2 + 2 + 2 + 2 = 20 Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 13 Formal Definition of Motif Finding Problem Given: • a tx nmatrix of DNA (a set of DNA sequences), and • an integerl(the length of the pattern to find) Find the starting points (a vector s) of a set of subsequences of length l(l-mers), one in each sequence of DNA, that maximizes the value of total similarity (TSim(s,DNA)). . www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 14 Another Formal Definition of Motif Finding Given: • a tx nmatrix of DNA (a set of DNA sequences), and • an integerl(the length of the pattern to find) Find a sequence of length l(the so-called median string) that minimizes its total distance (SDist(s,DNA)) to the set of sequences in DNA. Note: This problem is also known as Median String Problem. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 15 Equivalence of MFP and MSP The two problems (motif finding problem and median string problems) are computationally equivalent, because TSim(s, DNA) + SDist(s, DNA) = t *l, and thus maximizing TSim corresponds to minimizing SDist. These problems are NP-hard and three categories of algorithms are known for solving them: A. Brute-force B. Clique search C. Heuristic search www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 16 Motif Finding Problem, Brute-Force Algorithm The idea of the algorithm is that it computes TSim for each possible starting point combination and picks the best. A. Initialize topSimto 0 B. For each starting point combination B1. compute TSim B2. if TSim > topSimthen saveTSim in topSimand current start points in motif C. Return motif . www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 17 Median String Problem, Brute-Force Algorithm The idea of the algorithm is that it computes SDistfor each possible median string (all combinations with length of l)and picks the best. A. Initialize minDistto . B. For each possible median string B1. compute SDist B2. if SDist< minDistthen saveSDistin minDistand current median string in median C. Return median . www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 18 Run Time Analysis Motif Finding: • Varying (n -l+ 1)positions in each of tsequences, we have (n -l+ 1)tstarting positions. • For each starting positions, TSim makes l*toperations, so complexity is l t(n –l+ 1)t=O(l tnt) • For a good size problem (for example t = 20, n = 600, and l= 15) this would take unacceptable amount of time. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 19 Run Time Analysis Median String Search: • Varying the median string candidates, we have 4lcombinations • For each median string candidate, minDistmakes t(n –l+ 1)operations, so complexity is t(n –l+ 1) 4l=O(nt4l) • For a good size problem this would also take unacceptable amount of time. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 20 Branch and Bound Improvement With the branch-and-bound technique, some improvement of the Motif Finding Algorithm can be achieved: • While iterating the starting position combination, if the TSim value for a substring of length k < l(a prefix of the motif being checked) is hopelessly small (that is, -with an optimistic assumption –even if the rest of the characters all match, and therefore they contribute the largest possible value to Tsim, it will be smaller than topSim), then the algorithm can trim the search space and directly skip to the next prefix combination. Similarly, branch-and-bound can be used for MSP algorithm. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 21 Motif Finding Problem, Greedy Algorithm Steps: A. Find the two closest l -mersin the first 2 sequences and form a 2 x lalignment matrixwith TSim(s,2,DNA). B. Initialize ito 3. C. For each of the rest of t-2sequences C1. find an l-merin sequence imaximizing TSim(s,i,DNA) using the already constructed (i-1) x lalignment matrix for the first (i-1)sequences C2. increment iby 1 This is the CONSENSUS algorithm that sacrifices optimal solution for speed. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 22 Motif Finding Problem, Clique Search Algorithm Steps: A. Construct a t-partitegraph as– Each partite belongs to one of the DNA sequences and contains vertices representing substring of length l (a total of (n-l+1). – Two vertices in different partites are connected by an edge if the Hamming distance between the two corresponding substrings is at most 2d. B. Find a clique of size t in this graph. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 23 Motif Finding Problem, Heuristic Search Algorithm The principle idea of these types of algorithms: A. First they find a set of subsequences of length l with high probability of being the motif. B. Then they refine these sequences by some local searching techniques. Note:These algorithms may solve problems that would take access amount of time with a brute-force algorithm, there is no guarantee that the motif can be found. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 24 Motif Finding Problem, Voting Algorithm Steps: A. Create two hash tables Voteand Record with entry values 0. B. Create an empty set (this will collect the motifs found). C. For each sequence in DNA (iruns from 1 to t) C1. For each l-length substring of it (from pos 1 to (n-l+1)) C11. Create all variants that is at most d distance away from that substring C12. For each variant C121. If Record[H(variant)] . i then add 1 to Vote[H(variant)] set Record[H(variant)] to i (Note: H(s) is a hashing function) www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 25 Motif Finding Problem, Voting Algorithm (cont) Steps (continued): D. For each l-length substring of sequence 1 (from position 1 to (n-l+1)) D1. Create all variants that is at most d distance away from that substring D2. For each variant D21. If Vote[H(variant)] = t (that is, being a variant in allsequences) thenincludethis variant into result set C. Note: It runs faster than the brute-force algorithm (O(nt(3l)d), but its space requirement increases exponentially with d. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 26 Voting Algorithm with Projection The voting algorithm discussed on the previous two slides exhaustive search algorithm time and space permits (see note at the end of previous slide). The algorithm can be modified to a heuristic version which can work much larger l and d (when l > 15 and d > 5) as well. Instead of all positions, it considers only l’of the lpositions (a projection) of the motif. Based on the voting results on these l’ positions, the motif of length lcan with high probability be found. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 27 Motif Databases There are various databases that contain known motifs which can be searched for: • Different JASPER databases (CORE, PHYLOFACTS, FAM, POLII, CNE, SPLICE). Collections of transcription factor DNA-binding preferences, modeled as matrices. The only databases where the data can be used with no restrictions (open-source). • TRANSFAC. Contains data on transcription factors, their experimentally-proven binding sites, and regulated genes. Its broad compilation of binding sites allows the derivation of positional weight matrices. • PROSITE. Primarily a protein database, but it contains protein motifs. www.itk.ppke.hu Introduction to bioinformatics: Motifs Finding Algorithms 2010.11.15 TÁMOP –4.1.2-08/2/A/KMR-2009-0006 28 Motif Finding Programs CONSENSUSHertz, Stromo(1989) GibbsDNALawrence et al (1993) MEMEBailey, Elkan(1995) WINNOWERPevzner, Sze(2000) RandomProjectionsBuhler, Tompa(2002) MULTIPROFILER Keich, Pevzner(2002) MITRAEskin, Pevzner(2002) Pattern BranchingPrice, Pevzner(2003) Improved VotingChin, Leung (2005) www.itk.ppke.hu