2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

1

Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework**
Consortium leader
PETER PAZMANY CATHOLIC  UNIVERSITY
Consortium members
SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER
The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund ***

**Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben
***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg.

PETER PAZMANY
CATHOLIC UNIVERSITY

SEMMELWEIS
UNIVERSITY

sote_logo.jpg
dk_fejlec.gif
INFOBLOKK

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

2

Peter Pazmany Catholic University  
Faculty of Information Technology

INTRODUCTION TO BIOINFORMATICS

CHAPTER 7
DNA/Protein Sequencing Algorithms

www.itk.ppke.hu

(BEVEZETÉS A BIOINFORMATIKÁBA )

(DNS és fehérje szekvenálási algoritmusok)

András Budinszky


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

3

Definitions
DNA sequencing refers to methods for determining the order of the nucleotide (adenine, guanine, cytosine, and thymine) in a DNA molecule.
Protein sequencing refers to methods for identifying the amino acid sequence of a protein.
These two processes are playing a central role in basic biological research and in numerous applied fields such as diagnostic, biotechnology, system biology, drug development, forensic biology, etc.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

4

History of DNA Sequencing
1953Discovery of the structure of the DNA double helix (Watson & Crick)
1972Development of recombinant DNA technology (permits isolation of defined fragments of DNA
1972-6 Sequence of the first complete gene and the complete genome of bacteriophage MS2 (Friers)
1977Sequencing by chemical degradation (Gilbert)
Sequencing with chain-terminating inhibitors (Sanger)
1984Decipher the complete DNA sequence of the Epstein-Barr virus, 170 kb.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

5

History of DNA Sequencing (cont)
1987Marketing the first automated sequencing machine (Applied Biosystems)
1988Sequencing by hybridization suggested as an alternative sequencing method
Sequencing with chain-terminating inhibitors (Sanger)
1991Sequencing of human expressed sequence tags (ESTs) begins (Craig Verter’slab).
Light directed polymer synthesis developed (Steve Fodor)
1994Affymetrixdevelops first 64-kb DNA microarray

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

6

History of DNA Sequencing (cont)
1995Publish the first complete genome of a free-living organism (bacterium Haemophilus influenzae, 1,830,137 bases, Craig Venter)
1996Introducing pyrosequencing (sequencing by synthesis, Nyren)
2001A draft sequence of the human genome is published (Nature, Science)
2004Markets a parallelized version of pyrosequencing  machine (454 Life Sciences, first version reduced costs  6-fold compared to automated Sanger sequencing)

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

7

Shotgun Sequencing
A method used for sequencing long DNA strands.
Sequences are randomly subdivided into millions of smaller fragments by cutting with restriction enzymes or by shearing with mechanical forces.
About the first 500 –700 nucleotides from each small fragments are sequenced (“read”) by the Sanger method.
Multiple overlapping reads are obtained by performing several rounds of this fragmentation and sequencing.
Finally, the overlapping ends of different reads are assembled by computer program into a continuous sequence (see next slides)

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

8

Assembling the Fragments
In computational problem sense, the assembly task can be defined as the Shortest Superstring Problem (SSP).
SSP is looking for the shortest string which contains each member of a given set of strings. 
SSP has relevance in other areas such as data compression, sparse matrix compression.
A greedy algorithm (finds only an approximation)
•
picks those two strings that overlap in the most characters

•
merges them

•
repeats this until only one string left (a superstring).


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

9

Shortest Superstring Problem (SSP)
Unfortunately, finding the optimal solution for this problem is NP-hard (that is, the problem cannot be solved in polynomial time).
Proof:
We can show that SSP corresponds Traveling Salesman Problem(TSP), which is known to be  NP-complete.
In order to facilitate the proof, we will introduce Hamiltonian cycle/path on graphs (and meanwhile we cover Eulerian cycle/path needed for another topic).

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

10

Königsberg Bridge Problem
The city of Königsberg had two islands which were connected to each other and the mainland by seven bridges.
People tried to find a way to walk all seven bridges without crossing a bridge twice.

www.itk.ppke.hu

npo000001
Finally, in 1735 –using a graph –Euler proved that the problem has no solution. This was actually the foundation of graph theory. 


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

11

Abstract Definition of Königsberg Bridge Problem 

www.itk.ppke.hu


mainland

mainland

island

island


Reformulation of the problem in abstract terms: in a graph the vertices represent the islands and the mainland and the edges stand for the bridges.
Then a path needs to be found which crosses every bridge exactly once.


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

12

Eulerian Cycle/Path
Eulerian cycle in a graph is one that visits every edge exactly once.
•
Such cycle exists if and only if the degree of each vertex is even.


Note:The degree of a vertex is the number of edges touching it.
Eulerian path in a graph is one that visits every edge but the start and end vertices do not have to be the same.
•
Such cycle exists if and only if the graph contains zero or two (start and end) vertices of odd degree.


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

13

Algorithm for Finding an Eulerian Cycle
A.
Starting from an arbitrary vertex “walk” along unused edges until the start vertex is reached.

B.
If the Eulerian cycle has not been constructed yet, then there must be a vertex (v) along this route which has an untraversed edge. Execute step A again starting from vertex v.

C.
Combine this new route with the previous one into a single cycle through vertex v.


This algorithm is linear in time. 

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

14

Hamiltonian Cycle/Path
Hamiltonian cycle in a graph is one that visits every vertex exactly once.
Hamiltonian path in a graph is one that visits every vertex but the start and end vertices do not have to be the same.

www.itk.ppke.hu

Unfortunately, the problem of constructing such cycle or path is NP-complete.
Originally it was defined on a game (Icosian, in a dodecahedron) invented by Hamilton in 1857


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

15

Correspondence between SSP and TSP
Constructing a graph for SSP:
•
Vertices represent the nstrings s1, s2,…., sn

•
Edges are drawn between such vertex pair siand sjfor which prefix of sjmatches suffix of si; the length of the edge should be equal with the number of overlapping characters.


Now SSP is to find the longest path which visits every vertex exactly once.
This is exactly the same as the Traveling Salesman Problem (shortest and longest reversed) which is also NP-hard.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

16

An Example for SSP to TSP
Given s= AGTATCG segment.
SSP
TCG
TAT
AGT
AGTATCG
GTA  
ATC

www.itk.ppke.hu

AGTATCG

TSP


AGT

TAT

GTA

TCG

ATC


2

2

2

2


1

1

1


0

1


1


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

17

Summary of the Shotgun Sequencing
Systems using this method work in three phases:
•
Overlap –Generate potentially overlapping reads. Find the best match between the suffix of one read and the prefix of another. Correcterrors using multiple local alignment.

•
Layout –Merge reads into contigs and those into supercontigs. Repeats are major problems.

•
Consensus –Derive the DNA sequence and correct read errors.


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

18

Sequencing by Hybridization (SBH)
A non-enzymatic method that uses a DNA microarray.
The microarray is prepared by attaching all possible DNA probes of length in a systematic order.
Copies of a DNA fragment to be sequenced is fluorescently labeled.
The dyed DNA fragments are hybridized to the array.
DNA fragments hybridize with those probes that are complementary to substrings of length of the fragments.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

19

Sequencing by Hybridization (SBH)
Using a spectroscopic detector, it is determined which probes hybridize to the DNA fragment.
In this way we get the  -mercomposition of the target DNA fragment.
Finally, apply a combinatorial algorithm to reconstruct the sequence of the target DNA fragment from the -mercomposition (spectrum(s, l) –the unorderedset of all possible  (n – + 1)  -mersof a string sof length n).
Commercial system Affimetrix and Complete Genomics Inc. use this technology.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

20

An Example for spectrum(s, l) 
Given s= AGTATCG segment.
Since elements of spectrum(s, l) are unordered by definition, all of the following are equivalent representations of  spectrum (s, 3): 
{AGT, GTA, TAT, ATC, TCG}
{ATC, AGT, TAT, GTA, TCG} 
{AGT, ATC, GTA, TAT, TCG}
It is customary to use the  lexicographically maximal   representation as the canonical one (here the 3rdone).
Note: Different sequences may have the same spectrum.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

21

Solving SHB with a Hamiltonian Path
Constructing a directed graph (DAG) of SBH for a given spektrum(s,):
•
Vertices represent the -mersof the spektrum.

•
Edges are drawn between each vertex pair siand sjfor which prefix of sjoverlaps suffix of siin the length of  -1. Thisedge will be si›sj.


The Hamiltonian path in this graph provides the solution of theSHB  problem therefore it is an NP-hard solution.
Note:Multiple path (that is, multiple potential solution) might exist.
.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

22

An Example for Hamiltonian Path Approach
Given spectrum= {AGT, ATC, CGT, GTA, TAT, TCG}.
AGTATCGTATATTCGCGT
Path visits every vertex once.
Note: Multiple paths (and thus solution) might exist.

www.itk.ppke.hu

AGTATCGT


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

23

Solving SHB with an Eulerian Path
Constructing a directed graph (DAG) of SBH for a given spektrum(s,):
•
Vertices represent the (-1)-mersof the spektrum.

•
Edges are drawn between each vertex pair siand sjfor which there exists an -merxsuch thatthe first -1 nucleotides of x matches qand the last  -1 nucleotides of x matches p. Thisedge will be si›sj.


The Eulerian path in this graph provides the solution of theSHB  problem therefore it is a linear solution.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

24

An Example for Eulerian Path Approach
Given spectrum= {AGT, ATC, CGT, GTA, TAT, TCG}.
Vertices ((-1)-mers): {AG, AT, CG, GT, TA, TC}.
Edges (correspond to -mers): e.g. AG›GT belongs to AGT.
Path visits every edge once. 

www.itk.ppke.hu

AGTATCGT


AG

GT

TA

AT

TC


CG


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

25

Difficulties with SBH
It is difficult to differentiate between probes hybridized with perfect matches and the ones with 1 or 2 mismatches.
This problem can be decreased with longer l-mers, but array size increases exponentially in l; however,array size is limited with current technology.
The bottom line is that SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future.
This technology has largely been displaced by Sequencing by Synthesis based methods.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

26

High-throughput Sequencing Technologies 
They are the so-called “next-generation” sequencing technologies.
They parallelize the sequencing and produce vast amount of sequences at once.
•
Massively Parallel Signature Sequencing (MPSS), developed by Lynx Therapeutics. Later merged with Solexaand led to the development of sequencing by synthesis (see slides 28-29)

•
DNA NanoballSequencing, developed by Complete Genomics. Short sequences of DNA are determined from each DNA nanoballand its major difficulty is the mapping of short reads to a reference genome.


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

27

More Next-Generation Technologies 
•
PolonySequencing, developed in the laboratory of George Church at Harvard. It combined an in vitro paired-tag library with emulsion PCR, an automated microscope, and ligation-based sequencing chemistry.

•
Illumina(Solexa), developed a technology based on reversible dye-terminators.

•
SOLiDSequencing, developed by Applied Biosystems, uses sequencing by ligation. It has incorporated Polonysequencing.


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

28

Sequencing by Synthesis 
The method allows sequencing of a single strand of DNA by synthesizing the complementary strand along it, one base pair at a time, and detecting which base was actually added at each step.
The template DNA is immobile and solutions of A, C, G, and T nucleotides are sequentially added and removed from the reaction. 
Light is produced only when one of the nucleotide solution complements the first unpaired base of the template.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

29

Sequencing by Synthesis (cont)
Limitation of the method is that the lengths of individual reads of DNA sequence are in the neighborhood of 300-500 nucleotides, shorter than the 800-1000 obtainable with chain termination methods (e.g. Sanger sequencing).
This can make the process of genome assembly more difficult, particularly for sequences containing a large amount of repetitive DNA.
Pyrosequencing AB (later Biotage, Qiagen) commercialized the process.
GS FLX, the latest pyrosequencing platform by 454 Life Sciences can generate 400 million nucleotide data in a 10 hour run.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

30

History of Protein Sequencing
1934Bergman degradation
late 1940s Edmandegradation reaction (used for many years as the predominant method)
1958Insulin sequencing by an enzymatic digestion process(Sanger, his first Nobel prize)
1989Protein sequence by tandem mass spectrometer (MS/MS)
Electrospray ionization, ESI (Fenn, Nobel prize in 2002)
Matrix-assisted laser desorption/ionization.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

31

Determining Amino Acid Composition
It is often desirable to know just the unorderedamino acid composition of a protein before trying to determine the actual sequence.
This knowledge can be used to help the discovery of errors in the sequencing process or to distinguish between ambiguous results.
Steps of process:
•
Hydrolysis –break up protein into its constituent amino acids (applying heat of 100-110 C0for 24+ hours)

•
Separation –get the amino acid components


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

32

Protein Sequencing by MS/MS 
Steps of process:
•
Break the protein into peptides (using proteases, e.g. trypsin).

•
Break down the peptides into fragment ions in a Tandem Mass Spectrometer (MS/MS).

•
The mass spectrometer accelerates the fragmented ions; heavier ions accelerate slower than lighter ones.

•
Thus the spectrometer measures mass/chargeratio of an ion, and produces a spectrum.

•
This spectrum is then used by a computer program  attempting to determine the amino acid sequence.


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

33

Processing Spectrum of MS/MS
There are two major approaches:
•
De novo peptide sequencing


This is performed without using prior knowledge of any amino acid sequence. It is the process of assigning amino acids from peptide fragment masses of the protein. 
•
Database search


This a protein identification process which uses prior knowledge of amino acids stored in a database, and attempts to find similarity between the spectrum provided by MS/MS and the spectrum of the proteins in the database.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

34

De Novo Peptide Sequencing
Constructing a directed spectrum graph for a spectrum produced by MS/MS:
•
Vertices:


Since a mass sin an MS/MS spectrum was created by one of the ion types from .={.1, .2,…, .k}, a set of potential masses of the original partial peptide and –correspondingly –a set of vertices are generated for each value s of the spectrum: 
V(s) = {s+.1, s+.2, …, s+.k}

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

35

De Novo Peptide Sequencing (cont)
•
Vertices (continued):


Therefore the complete of set of vertices for the spectrum graph:
{initial vertex}.V(s1) .V(s2) .... .V(sm) .{terminal vertex}
•
Edges:


For each vertex pair with mass difference of amino acid A, a directed edge (from smaller to larger mass) labeled w/ A is drawn.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

36

Using the Spectrum Graph
The task is to find paths from initial vertex to terminal vertex.
There could be multiple such paths in the labeled spectrum graph.
Each path represents an amino acid sequence (read out from the labels).
A probability that peptide P represented by a sequence would produce the received spectrum S can be computed as
p(P,S) = .s.Sp(P, s) 
where p(P, s) is the probability of peak s.
The peptide with the highest probability can be chosen as the most likely sequence.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

37

Pros and Cons of De Novo Sequencing
Advantages:
–
Gets the sequences that are not necessarily in the database.

–
An additional similarity search step using these sequences may identify the related proteins in the database.

–
It is the best method for database search results validation. False positives are virtually eliminated this way.


Disadvantages:
–
Requires higher quality data.

–
Often contains errors.


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

38

De Novo Sequencer Implementations
Lutefisk
Johnson & Taylor, 1997; 19% peptide accuracy
http://www.hairyfatguy.com/lutefisk/
SHERENGA
Dancik et. al., 1999; 29% peptide accuracy
Peaks
Ma et. al. 2003; 25% peptide accuracy
www.bioinformaticssolutions.com/products/peaks/index.php
PepNovo
Frank & Pevzner, 2005; 30% peptide accuracy
http://proteomics.ucsd.edu/Software/PepNovo.html

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

39

Database Search
Steps of the algorithm:
A.
Evaluates protein sequences from a database to compile the list of peptides that could result from each protein.

B.
Determines the set of candidate peptide sequences that could meaningfully be compared to the spectrum by including only those which are near the mass of the observed peptide ion.

C.
Projects a theoretical tandem mass spectrum for each candidate peptide.


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

40

Database Search
Steps of the algorithm (continued):
D.
Compares these theoretical spectra to the observed tandem mass spectrum by the use of cross correlation (a measure of similarity of two waveforms, here the two spectra). 

E.
The candidate sequence with the best matching theoretical tandem mass spectrum is reported as the best identification for this spectrum.


Note: The algorithm works real well, if the protein did not go through multiple posttranslational modifications.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

41

Post-Translational Modifications 
Proteins –while involved in metabolic regulation –are subject to a large number of modifications.
Almost all protein sequences are post-translationally modified and about 200 types of modifications of amino acid residues are known.
A peptide fragment of a multiple times post-translationally modified protein produces a significantly different spectrum and therefore the above described identification algorithm will not find match with the spectrum of the original (unmodified) peptide derived from the database. 

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

42

Virtual Database Search
Possible modification of the original algorithm:
In step B. not only determines the base-line set of candidate peptide sequences, but it also generates candidate peptides from all different possible multi posttranslationally modified version of the proteins.
The rest of the steps are the same.
Note: Leads to an unmanageable large combinatorial problem.

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

43

Another Approach 
•
Another possible modification of the original algorithm to handle the multi posttranslational modifications:•
As an additional input to the algorithm the maximum number of allowed posttranslational modifications is also specified.

•
Instead of generating peptides from all possible modified proteins, it generates them only from the base proteins in the database (same as step B. in the original search algorithm).

•
Then in step D. (when comparing spectra) considers adjustments on the theoretical spectra (using dynamic programming).


www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

44

Database Search Implementations
SEQUEST
Yates & Eng, 1994; it is a complete system and one of the first database search programs
http://fields.scripps.edu/?q=content/software
Mascot
Pappin & Perkins, 1999; it is a software search engine that uses MS data to identify protein from primary sequence databases
http://www.matrixscience.com/
Peaks
Ma et. al. 2003; it is a system that has also a database search software
www.bioinformaticssolutions.com/products/peaks/index.php

www.itk.ppke.hu


Introduction to bioinformatics: DNA/Protein Sequencing Algorithms 

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

45

Database Search Implementations
X! Tandem
GPM, 2009; an open source software that can match tandem mass spectra with peptide sequences; simple-to-use, sophisticated application programming interface. 
http://www.thegpm.org/tandem/
X!! Tandem
GPM, 2009; a parallel, high performance version of X!Tandem that has been parallelized via MPI to run on clusters or other non-shared memory multiprocessors . 
http://wiki.thegpm.org/wiki/X!!Tandem

www.itk.ppke.hu