2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

1

Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial* framework**
Consortium leader
PETER PAZMANY CATHOLIC  UNIVERSITY
Consortium members
SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER
The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund ***

**Molekuláris bionika és Infobionika Szakok tananyagának komplex fejlesztése konzorciumi keretben
***A projekt az Európai Unió támogatásával, az Európai Szociális Alap társfinanszírozásával valósul meg.

PETER PAZMANY
CATHOLIC UNIVERSITY

SEMMELWEIS
UNIVERSITY

sote_logo.jpg
dk_fejlec.gif
INFOBLOKK

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

2

Peter Pazmany Catholic University  
Faculty of Information Technology

INTRODUCTION TO BIOINFORMATICS

CHAPTER 9
Gene Prediction Algorithms

www.itk.ppke.hu

(BEVEZETÉS A BIOINFORMATIKÁBA )

(Gén predikció)

András Budinszky


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

3

Definition of Gene
As it has been discussed in Chapter 1, the gene is a segment of the DNA molecule that encodes the information required for the synthesis of a gene product (protein or RNA).
The region that encodes amino acid sequence of a protein sometimes referred to as „protein-gene”.
In addition, there are regulatory sequences in a genome as well, that guide and control the gene expression (promoters, enhancers, operators, terminators, etc.), therefore we could also consider these sequences integral parts of a gene.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

4

Gene Prediction Problem
Its primary task is to find stretches of a DNA sequence that is responsible for protein coding.
It may also include looking for other functional elements such as regulatory regions. 
Gene finding is one of the first and most important steps in understanding the genome of a species after it has been sequenced.
While earlier days gene prediction was done by experimentation, today –thanks to powerful methods –it is considered a computational problem.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

5

Components of “Protein-Gene”
Codon:
-triplet of nucleotides that codes exactly one amino acid    in a protein
-43 = 64 possible codons
-20 possible amino acid as final product
-redundant coding
-someamino acids are coded by more than one codon (max. six, e.g. leucine)
-includes one start (ATG) and three stop codons (TAA, TAG, TGA)

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

6

Genetic Code Table

www.itk.ppke.hu

1
Each codon encodes one kind of amino acid.
Each amino acid is encoded by one or more codons.
The 3rdposition of codons is the most likely to vary, for a given amino acid.


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

7

Components of “Protein-Gene” (continued)
Exons:
-during splicing they are joined into a continuous piece of mRNA
-it is often misused to refer only to coding sequences for the final protein. This is incorrect, since non-coding exons also exist. 
-they are relevant only for eukaryotic genes
Introns:
-intervening, non-coding segments in eukaryotic genes
-could be much longer than exons
-during splicing they are cut out
Signals:
-donor sites (usually GT, beginning of introns)
-acceptor sites (AG, end of introns)

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

8

Gene Structure (in eukaryotes) 
Most of the regions outside the genes (intergenic regions) are ignored by gene finders, though some of them look for important subsequences of regulatory functions (like promoters, enhancers).
Note: Prokaryotes do not have introns, only one contiguous exon.

www.itk.ppke.hu

ATG…

…….…

…TGA

exon

exon

exon

intron

intron

start codon

stop codon

donor sites

acceptor sites

GT…….AG

GT…....AG

Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

9

Gene Prediction for Prokaryotes
Since prokaryotes have contiguous exons, the most common approach is using an intrinsic (ab initio) method.
That is, looking for ORFs as exon in all reading frames and then decide if the found punitive exons are really exons.
An ORF (Open Reading Frame) is a subsegment that starts from the Start codon (ATG) and ends at one of the Stop codons (TAA, TAG, TGA).
There are six possible frames: 3 upstream direction (starting at positions 0, 1, or 2) and 3 downstream direction.
ORFs in different frames may overlap.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

10

Example for Reading Frames
Original sequence:
5'3' atgcccaagctgaaatgcgtagaggggttttcatcatttgaggacgatgtataa
Three reading frames in one direction:
1atgcccaagctgaaatgcgtagag gggttttcatcatttgag gacgat gtataaMPKLKCVEGFSSFEDDV* 2tgcccaagctgaaatgcgtag aggggttttcat catttgaggacgatgtat CPS*NA* RGFHHLRTMY3gcccaagctgaaatgcgtagaggggttttcatcatttgaggacgatgtataAQAE#RRGVFII*GRCI
where A, C, D, …, Y amino acids codes; # start and * stop codons 

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

11

Verifying an ORF as Protein-Coding Exon
Various methods can be used to collect evidence:
-Checking the length of the ORF which should (usually) be above a threshold value. 
(Typically a protein-coding exon is at least 100 base pair long and since 3 of the 64 possible codons in the genetic code are stop codons, one would expect to see a stop codon approximately in every 20-25 codons, or 60-75 base pairs in a random sequence.)
This may not always true because some genes (e.g. some neural and immune system genes) are relatively short.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

12

Verifying an ORF as Protein-Coding Exon (cont)
-Applying statistical verification. 
Check nucleotide composition and especially (G+C) content (introns being more A/T-rich than exons, especially in plants).
Check codon composition. (Amino acids are typically coded by more than one codon, but the occurrence of “synonyms” are (very) different in frequency –see Codon Usage for Human Genes on next slide. Furthermore expected codon occurrence isdifferent in coding and non-coding regions.)
Hexamer frequency can also be typical and reliable.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

13

Codon Usage in Human Genes

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

14

Codon Usage in Mouse Genes (fragment)

www.itk.ppke.hu

AA  codon  /1000 Prob. 
Ser  TCG    4.31  0.05
Ser  TCA   11.44  0.14
Ser  TCT   15.70  0.19
Ser  TCC   17.92  0.22
Ser  AGT   12.25  0.15
Ser  AGC   19.54  0.24
Pro  CCG    6.33  0.11
Pro  CCA   17.10  0.28
Pro  CCT   18.31  0.30
Pro  CCC   18.42  0.31
.....

AA  codon  /1000 Prob. 
Leu  CTG   39.95  0.40
Leu  CTA    7.89  0.08
Leu  CTT   12.97  0.13
Leu  CTC   20.04  0.20
Ala  GCG    6.72   0.10
Ala  GCA   15.80  0.23
Ala  GCT   20.12  0.29
Ala  GCC   26.51  0.38 
Gln  CAG   34.18  0.75
Gln  CAA   11.51  0.25
.....


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

15

Verifying an ORF as Protein-Coding Exon (cont)
-Looking for a typical ribosome-binding site (searching for a Shine-Dalgarnosequence in front of the putative protein coding sequence).
-Looking for a typical promoter (if consensus promoter sequences for the given organism are known, check for the presence of a similar in the upstream region).
-Checking if the ORF in question encodes a protein that is similar to previously described ones (searching the protein database for homologs of the given sequence).
Note: This last one is an integrated (intrinsic and extrinsic) method.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

16

Verifying an ORF as Protein-Coding Exon (cont)
-Using Hidden Markov Model (HMM, Principles see next slide).
HMM must be trained on known sequences of the same species to setup statistical rules to evaluate unknowns.
So take known genes encoding known proteins to use as input into the program.
These model genes will then provide statistics on codon bias, codon pairs, etc.
Depends on accuracy of model genes (poor data can obscure important rules). 

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

17

Principles of HMM
Probabilistic parameters of a HMM(example)x—statesy—possible observationsa—state transition probabilitiesb—output probabilities

www.itk.ppke.hu

HiddenMarkovModel.png

Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

18

Ab initio Gene Prediction for Eukaryotes
This kind of methods of gene finding for eukaryotes, especially in complex organisms like humans, is considerably more challenging for several reasons:
-The promoter and other regulatory signals are more     complex and less well-understood, making them more difficult to reliably recognize. (Two classic examples of signals identified by eukaryotic gene finders are CpGislands and binding sites for a poly(A) tail.
-Eukaryotes have exons and introns, and typically exons are much shorter than introns. The length of exons (usually less than 200 and can be as short as twenty to thirty) makes it much more difficult to detect periodicities and other known content properties of protein-coding DNA. 

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

19

Ab initio Gene Prediction for Eukaryotes (cont)
Finding potential exon/intron boundaries (so-called splice sites) can rely on looking for donor sites (GT) and acceptor sites (AT).
The exons can be checked with the statistical methods listed for prokaryotic exon verification. 
Another type of refinement is often needed that consists of estimating several gene models according to the G+C content of the genomic sequence.
Many currently existing programs use two types of content sensors: one for coding sequences and one for noncoding sequences, i.e. introns, UTRs and intergenic regions.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

20

Ab initio Gene Prediction for Eukaryotes (cont)
There are some non-canonical splice signals (other than GT and AT), but they are rare; unfortunately current programs cannot handle them. 
The quality of eukaryotic gene prediction achieved by different programs show a rather gloomy picture of numerous errors in exon/intron recognition. Even the best tools correctly predict only ~40% of the genes.
The most serious errors come from genes with long introns, which may be predicted as intragenic sequences.
Sequencing errors in the analyzed sequence affects ORF prediction, and frameshift corrections was found to substantially improve the overall quality of gene prediction. 

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

21

Extrinsic Gene Prediction for Eukaryotes
Closely-related organisms may have similar genes, therefore genome of one species may be compared to genes in some closely-related species, and a sufficient similarity between genomic sequence regions and a protein or DNA sequence present in a database can be exploited.
Basic tools for detecting sufficient similarities between sequences are local alignment methods (like the optimal Smith–Waterman algorithm, or the fast heuristic approaches such as FASTA and BLAST.
Then the best (optimal) chaining of the found local alignment subsequences (putative exons) should be produced (see later slide) 

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

22

Sources for Similarity Search
1.
Protein sequences that can be found in databases such as SwissProtor PIRAs. In this case –before similarity search -the program should convert each protein sequence to a family of possible coding DNA sequences by reverse translation.

2.
Transcripts, sequenced as cDNAseither in the classical way for targeted individual genes with high coverage sequencing of the complete clone or as expressed sequence tags (ESTs), which are one shot sequences from a whole cDNAlibrary or applying RNA-seq, a use of high-throughput technologies.

3.
Genomic DNA (under the assumption that coding sequences are more conserved than non-coding ones). Nevertheless, in this case the similarity may not cover entire coding exons, but be limited to the most conserved part of them. 


www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

23

Pros/Cons for Similarity-Based Approach
Advantages:
Predictions rely on accumulated preexisting biological data, thus they should produce biologically relevant predictions.
A single match is enough to detect the presence of a gene.
Disadvantages:
Obvious weakness is that nothing will be found if the database does not contain a sufficiently similar sequence.
Small exons are easily missed.
Some databases may contain information of poor quality.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

24

Exon Chaining Problems
The local alignment returns a set of putative exons (“islands” in the sequence searched for a gene).
Each putative exon can be represented with a triplet (the two end positions and the similarity score returned by the alignment algorithm. 
We need to find the maximum non-overlapping set of these exons (that is, the chain of exons that has the maximum weight).
This can be solved with a dynamic programming method.

www.itk.ppke.hu

•Name the 4 core data types of bioinformatics! (4)


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

25

An Example for Exon Chaining

www.itk.ppke.hu

4

3

9

9

15

5

7


0

2

3

5

6

11

13

16

20

25

27

28

30

32


Total: 17

Total: 20


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

26

DP Solution of the Exon Chaining
Given:N putative exon as (l, r, w)
where l and r are begin and end positions
w is score returned by the local alignment
Goal:Chain of non-overlapping no adjacent exons with the maximum weight.
Work data structures:
Weighted directed graph G with
vertices: one for each begin and end position
edges: one for each exon (from l to r) with weight w
one from each vertex to the one representing the next higher position with weight 0
A vector with 2*N elements (one for each vertex/position)  

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

27

DP Solution of Exon Chaining (continued)
Initialize all elements of the vector to 0
For each vector elements vi(from 2ndto last)
If an edge (e) ends in the vertex that virepresents
then
j = index of the element that belongs to left end of edge e
w = weight of edge e
vi  = max {vj+ w, vi-1}
else
vi  = vi-1
At the end, we have the total weight in v2Nand with backtracking we can get the optimal exon chain.

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

28

Gene Finding Programs
GeneMark
It is documented as the most accurate prokaryotic gene finder. It use the Hidden Markov models and exists in separate variants for gene prediction in prokaryotic, eukaryotic, and viral DNA sequences 
http://opal.biology.gatech.edu/genemark/
Glimmer
It is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. It uses interpolated Markov models (IMMs) to identify the coding regions. 
http://cbcb.umd.edu/software/glimmer/

www.itk.ppke.hu


Introduction to bioinformatics: Gene Prediction Algorithms

2010.11.15

TÁMOP –4.1.2-08/2/A/KMR-2009-0006 

29

Gene Finding Programs
GenScan
It uses a complex probabilistic model of the gene structure. Its high speed and accuracy make it the method of choice for the initial analysis of large stretches of eukaryotic genomic DNA.
http://genes.mit.edu/GENSCANinfo.html
GenBuilder
It performs ab initiogene prediction using numerous parameters, such as GC content, hexon frequencies, splicing site data, CpG islands, repetitive elements, and others.
http://www.itb.cnr.it/sun/webgene/

www.itk.ppke.hu