2011.10.09. TÁMOP-4.1.2-08/2/A/KMR-2009-0006

Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial framework

Consortium leader: PETER PAZMANY CATHOLIC UNIVERSITY
Consortium members: SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER

The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund.

Peter Pazmany Catholic University, Faculty of Information Technology
www.itk.ppke.hu

INTRODUCTION TO BIOINFORMATICS
CHAPTER 5: Sequencing and manipulating of DNA
Péter Gál

DNA sequencing

The development of modern DNA sequencing techniques prompted the birth of bioinformatics. As the efficiency of DNA sequencing improved, the size of the nucleotide databases increased rapidly. DNA sequencing was, and still is, the major source of biological data analysed by bioinformatics. DNA sequencing has partially replaced protein sequencing, since it is much harder to sequence a protein than its gene or cDNA. The continuous improvement of sequencing techniques is a challenge for bioinformatics.

In 1972, a major achievement of DNA sequencing was to determine the sequence of a 174-bp-long DNA molecule. This was featured on the front page of the prestigious scientific journal Nature.
At that time the sequencing of the entire human genome (3.2 × 10^9 bp) would have taken more than one million years. Now this task can be accomplished in a few weeks. Sequencing the human genome (the Human Genome Project) was one of the greatest scientific achievements of the 20th century. To date the genomes of at least 6000 different species have been sequenced and the data have been deposited in the databases.

Landmarks in the Human Genome Project

1953 Watson and Crick published the structure of the double-stranded DNA molecule (Nature, April 25th).

"We wish to suggest a structure for the salt of deoxyribose nucleic acid (D.N.A.). This structure has novel features which are of considerable biological interest."

"It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material."

These are the most famous understatements of the 20th century.

1975 The advent of the modern DNA sequencing techniques. At that time two methods of DNA sequencing were developed independently. Both methods generate DNA fragments that represent the entire sequence. The fragments are resolved by polyacrylamide gel electrophoresis under denaturing conditions. The Maxam-Gilbert method uses chemical modification of the bases, while the Sanger method uses a DNA polymerase enzyme to copy fragments from the template DNA strand. The Sanger method was for a long time the leading method for DNA sequencing.

1977 The bacteriophage φX174 was sequenced. This 5.4 kbp DNA represents the first complete genome ever sequenced.
1981 The DNA of the human mitochondrion was sequenced. It consists of 16569 bp.

1984 The first sequence of a human virus, the Epstein-Barr virus (one of the eight known human herpes viruses), was determined. The virus genome is 172281 bp long.

1990 The International Human Genome Project was launched. At the beginning this international project planned to sequence the human genome in 15 years.

1991 Craig Venter identified active genes in the genome by sequencing the initial portions of complementary DNA (cDNA). cDNA is made from mRNA by means of reverse transcription. Every short piece of cDNA can be considered a tag of the whole gene, hence the name Expressed Sequence Tags (ESTs).

1992 The complete low-resolution linkage map of the human genome was determined.

1992 The Caenorhabditis elegans (nematode worm) sequencing project was launched. In the same year two new research centers were established. In the framework of the Human Genome Project the Sanger Centre for large-scale genomic sequencing was established in Hinxton, UK. Craig Venter established The Institute for Genomic Research (TIGR) for the commercial application of the sequence information derived from the genome sequencing projects. The major aim was to identify genes that encode potential drug targets for the pharmaceutical industry.

1995 The first bacterial genome was sequenced by TIGR. The Haemophilus influenzae genome, the first genome of a free-living organism (1.8 million base pairs), was sequenced.
The same year the genome of Mycoplasma genitalium, the smallest genome of a self-replicating organism (582970 bp), was also sequenced by Craig Venter’s company.

1996 The high-resolution map of the human genome was established. The resolution of the map is approximately 600 000 bp. The same year the first complete eukaryotic genome, the genome of the yeast Saccharomyces cerevisiae, was sequenced.

1998 Craig Venter’s company, Celera, announced that they would finish the sequencing of the human genome by 2001. In response, the sponsors of the Human Genome Project increased the funding of the Sanger Centre. The same year the complete genome sequence of the nematode worm Caenorhabditis elegans was published.

1999 At the end of this year Celera Genomics announced that they had determined the genome sequence of Drosophila melanogaster. The sequence data were released in spring 2000.

1999 The Human Genome Project announced its plan to finish the working draft of the human genome (90% of genes sequenced to >95% accuracy) by 2001.

December 1, 1999 The sequence of the first complete human chromosome was published.

June 26, 2000 Joint announcement (Prime Minister of the United Kingdom, Tony Blair, and President of the United States, Bill Clinton) of the completion of the draft of the human genome. Scientists from the Human Genome Project and from Celera Genomics were also present.

2003 Completion of the high-quality human genome sequence by the public consortium.

DNA sequencing methods

Maxam-Gilbert method

Principle: to generate four sets of labeled fragments from the DNA to be sequenced by chemical reactions.
Figure: the labeled fragments of pAATCGACT (pA, pAA, pAAT, pAATC, pAATCG, pAATCGA, pAATCGAC, pAATCGACT) are distributed among four base-specific reactions (A reaction, T reaction, C reaction, G reaction).

Figure: separation of the labeled fragments by gel electrophoresis (lanes A, T, C, G; direction of the electrophoresis from - to +). The sequence is read from the gel: A T C G A C T ?

Maxam-Gilbert sequencing was the method of choice for sequencing shorter DNA molecules, especially chemically synthesized oligonucleotides. The main drawbacks of the method are the need for purified, homogeneous DNA and the use of toxic chemicals.

The Sanger method uses basically the same principle as the Maxam-Gilbert method (i.e. generation of base-specific fragments); however, it generates the base-specific DNA fragments by enzymatic synthesis of the complementary DNA strand. To stop the synthesis of the DNA strand at a specific point it uses dideoxynucleoside triphosphates (ddNTPs).

Dideoxynucleoside triphosphates (ddNTPs) are deoxynucleoside triphosphate (dNTP) analogues that have no OH group at the 3’ position of the sugar.

Figure: nucleotide structure (triphosphate P-P-P, CH2, sugar ring, base) with a substituent X at the 3’ position. X = OH: dNTP; X = H: ddNTP.

The Sanger method of DNA sequencing uses a DNA polymerase enzyme to synthesize the complementary DNA strand. DNA polymerases cannot initiate the synthesis of a new DNA strand on their own. They need a primer oligonucleotide that hybridizes to the single-stranded DNA template; the DNA polymerase then adds each incoming nucleotide to the 3’ OH of the new DNA strand.
When a dideoxynucleotide is incorporated, there is no 3’ OH, and consequently the synthesis of the new DNA strand terminates. If we use each of the four ddNTPs in a separate reaction (with a certain ratio of dNTP to ddNTP) we can generate all possible DNA fragments.

Template   3’ AATCGACT 5’
Primer     5’ TT 3’

Reaction   Terminated fragments
ddATP      5’ TTA 3’, 5’ TTAGCTGA 3’
ddTTP      5’ TTAGCT 3’
ddCTP      5’ TTAGC 3’
ddGTP      5’ TTAG 3’, 5’ TTAGCTG 3’

After electrophoresis (lanes A, T, C, G) the sequence of the complementary DNA strand is read: 5’ AGCTGA 3’

The Sanger method became the most popular method of DNA sequencing, since it works with partially purified double-stranded DNA (e.g. recombinant plasmid DNA) and it can be readily automated. The automated Sanger sequencing method was the workhorse of the Human Genome Project. Nowadays the next generation sequencing (NGS) methods are gradually replacing the Sanger method.

Frederick Sanger was awarded two Nobel Prizes in Chemistry, one for protein sequencing (1958) and one for DNA sequencing (1980). (At the time of writing, only four people had received two Nobel Prizes.)

Automation of the Sanger DNA sequencing

Figure: electrophoresis of fluorescent-dye-labeled DNA fragments on a capillary column, with laser source, laser beam and detector (direction of the electrophoresis from - to +).

Figure: computer-generated result (chromatogram) of the automated Sanger DNA sequencing.
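The chain-termination principle of the Sanger method can be illustrated with a small toy model. The sketch below is only an illustration under simplifying assumptions (every possible terminated fragment is produced exactly once, and the gel is read by sorting fragments by length); the function names are hypothetical and are not part of the original course material. It uses the template 3’ AATCGACT 5’ and the primer 5’ TT 3’ from the Sanger example.

```python
# Toy model of Sanger chain termination (hypothetical helper names).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(template_3to5):
    """Return the new strand (written 5'->3') copied from a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def sanger_fragments(template_3to5, primer):
    """Group every chain-terminated fragment by the ddNTP that ended it."""
    product = complementary_strand(template_3to5)
    if not product.startswith(primer):
        raise ValueError("primer does not hybridize to the 3' end of the template")
    reactions = {"A": [], "T": [], "C": [], "G": []}
    # Each fragment is the primer extended up to one terminating base.
    for end in range(len(primer) + 1, len(product) + 1):
        fragment = product[:end]
        reactions[fragment[-1]].append(fragment)
    return reactions


def read_gel(reactions):
    """Read the sequence from shortest to longest fragment, as on a gel."""
    fragments = sorted((f for lane in reactions.values() for f in lane), key=len)
    return "".join(fragment[-1] for fragment in fragments)


reactions = sanger_fragments("AATCGACT", "TT")
print(reactions)
print(read_gel(reactions))  # AGCTGA
```

The ddATP lane contains TTA and TTAGCTGA, the ddGTP lane TTAG and TTAGCTG, and reading the terminal bases in order of increasing fragment length gives 5’ AGCTGA 3’.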
Next generation sequencing

Recently new methods of DNA sequencing have appeared that are an order of magnitude faster than the Sanger method. Moreover, they are cheaper and have deep enough coverage to look at the genomes of individuals or even tissues. They make it possible to read the sequences of several hundred million DNA molecules at the same time.

The rise of next generation sequencing is a challenge for bioinformatics. The first task is to assemble a whole sequence (e.g. a chromosome) from tens of millions of short sequencing fragments.

Figure: a typical output of an NGS experiment.

Steps of a high-throughput sequencing experiment

1.) Sample preparation: the large DNA molecule is fragmented into small single-stranded pieces.
2.) The individual DNA molecules are immobilized on small beads.
3.) The synthesis of the complementary strand takes place on the bead. Every incorporation of a nucleotide gives a flash of fluorescent light which is characteristic of the given nucleotide.
4.) The process is monitored by an extremely sensitive digital camera. The camera with the data processing device is capable of monitoring millions of beads, that is, the synthesis of millions of individual DNA molecules.

The primary data are the pictures with the millions of fluorescent flashes, which can be followed online in real time. The colors of the flashes are then translated into DNA sequences (secondary data) that correspond to the complementary strand of the DNA immobilized on the bead.
Finally, the sequence of the original full-length DNA molecule has to be assembled from the tens of millions of short sequencing fragments.

It took ten years and approximately 300 million dollars for the Human Genome Project to sequence the entire human genome in the last century. Now an individual’s genome can be sequenced within a few weeks for several thousand dollars. The aim is to reduce the price below one thousand dollars.

Strategies for genome sequencing

1.) Systematic sequencing: mapping the DNA, sequencing relatively long pieces of DNA, and then assembling the entire DNA molecule (chromosome) by using the map. It is a slow and labor-intensive but reliable method.
2.) Shotgun sequencing: fragmenting the large DNA molecule into small pieces, sequencing the fragments, and assembling the whole sequence by using the overlapping sequences.

Problems with shotgun sequencing:

We have to read at least 10 times more nucleotides than the length of the original DNA in order to achieve a gap-free sequence.
The repetitive sequences make the assembly ambiguous. (We need a map anyway.)
We need a lot of computing power (memory and running time).

It works perfectly with small (prokaryotic) genomes. In the case of eukaryotic genomes the high proportion of repetitive sequences causes problems.

Figure: procedure of shotgun sequencing (sequence, chromatogram, computer, assembled sequence).
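The assembly step of shotgun sequencing can be sketched with a greedy overlap merge. This is a deliberately simplified illustration, not the algorithm used by real assemblers: it assumes error-free reads from one strand, no read fully contained in another, and no repeats longer than the minimum overlap. The example genome and reads are made up, and all function names are hypothetical.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of contigs that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


def coverage(reads, genome_length):
    """Average number of times each nucleotide of the genome has been read."""
    return sum(len(read) for read in reads) / genome_length


genome = "ATGGCGTACCTGAAT"  # made-up example sequence
reads = ["TACCTGAAT", "ATGGCGTA", "GCGTACCT"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGAAT']
print(coverage(reads, len(genome)))
```

With repetitive sequences the longest overlap is no longer unique, which is exactly the ambiguity mentioned in the list of problems; the quadratic pairwise comparison also hints at why memory and running time become limiting for large genomes.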
DNA cloning

Much of the DNA sequence data found in the databases comes from cloned DNA molecules. Usually cloning and/or amplification of a DNA (or RNA) molecule precedes sequencing. Cloning is the prerequisite for obtaining certain types of sequence information, such as cDNA sequences and EST libraries.

Alternative names of DNA cloning: recombinant DNA technology; genetic engineering.

DNA cloning is basically the manipulation of DNA by special enzymes. The most important enzymes are the restriction endonucleases. These enzymes are of bacterial origin; they recognize 4-6 nucleotide-long sequences within double-stranded DNA and cut the DNA chain. Several hundred restriction endonucleases are known. We can create the restriction map of a DNA molecule, which is also called the physical map of the DNA.

The products of the restriction digestion (restriction fragments) can be inserted into suitable carrier DNA molecules (vectors), and then we can seal the DNA chains with the DNA ligase enzyme. The resulting artificial DNA molecule is called recombinant DNA. The recombinant DNA can be introduced into suitable host cells (e.g. Escherichia coli), where it can be amplified up to millions of copies. The bacterial colony that contains the copies of a distinct recombinant DNA molecule is called a clone. The term "clone" is also commonly used for the recombinant DNA itself. The recombinant "clone" contains enough DNA for any standard analysis (e.g. sequencing).
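Restriction digestion and the resulting restriction (physical) map can be sketched in a few lines. The example below is a minimal sketch that treats the DNA as a linear molecule and looks only at the top strand; the recognition sequences and cut positions of EcoRI (G^AATTC), HindIII (A^AGCTT) and AluI (AG^CT) are well established, while the test sequence and the function names are made up for illustration.

```python
# Recognition sequence and cut offset within the site (top strand, 5'->3').
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC, sticky ends
    "HindIII": ("AAGCTT", 1),  # A^AGCTT, sticky ends
    "AluI": ("AGCT", 2),       # AG^CT, blunt ends
}


def cut_positions(sequence, enzyme):
    """Return the positions where the enzyme cuts: a simple restriction map."""
    site, offset = ENZYMES[enzyme]
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + offset)
        start = sequence.find(site, start + 1)
    return positions


def digest(sequence, enzyme):
    """Cut a linear DNA molecule and return the restriction fragments."""
    fragments, previous = [], 0
    for position in cut_positions(sequence, enzyme):
        fragments.append(sequence[previous:position])
        previous = position
    fragments.append(sequence[previous:])
    return fragments


dna = "TTGAATTCCAGCTAAGCTTGG"  # made-up example sequence
for name in ENZYMES:
    print(name, cut_positions(dna, name), digest(dna, name))
```

Collecting the cut positions of several enzymes on the same molecule gives the restriction map; a circular molecule such as a plasmid would need the first and last fragments joined, which this sketch does not do.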
Restriction endonucleases

AluI      5’…AGCT…3’
          3’…TCGA…5’

EcoRI     5’…GAATTC…3’
          3’…CTTAAG…5’

HindIII   5’…AAGCTT…3’
          3’…TTCGAA…5’

AluI produces blunt ends. EcoRI and HindIII produce "sticky" ends.

Recombinant DNA

Molecule A   5’…GGATCC…3’
             3’…CCTAGG…5’

Molecule B   5’…GGATCC…3’
             3’…CCTAGG…5’

Digest with the same restriction endonuclease, BamHI:

5’…G        GATCC…3’
3’…CCTAG        G…5’

Mix, seal with DNA ligase:

5’…GGATCC…3’
3’…CCTAGG…5’   Recombinant DNA

Complementary DNA (cDNA)

Since eukaryotic mRNAs do not contain introns, there is a direct relationship between their nucleotide sequence and the amino acid sequence of the proteins they encode. RNA molecules, however, are chemically unstable and are therefore not suitable for cloning purposes.

Solution: making a DNA copy of the mRNA molecule. Reverse transcriptase enzymes can use RNA templates to synthesize DNA. The resulting DNA molecule is called complementary (copy) DNA or simply cDNA. cDNA is suitable for cloning purposes and can be used for recombinant production of eukaryotic proteins in prokaryotic hosts (bacteria).

Synthesis of cDNA

5’ ...AAAA 3’   eukaryotic mRNA
      TTTT 5’   oligo dT primer

Reverse transcriptase + dNTPs:

5’ ...AAAA 3’   RNA-DNA hybrid
3’ ...TTTT 5’

RNase H, second strand synthesis, S1 nuclease:

5’ ...AAAA 3’   double-stranded cDNA
3’ ...TTTT 5’

The most frequently used host system in recombinant DNA technology is the K12 strain of Escherichia coli.
It is a Gram-negative bacterium that is well characterized at the molecular level. It is suitable for propagating recombinant DNA molecules. The most frequently used vectors for introducing foreign DNA into E. coli cells are plasmids and bacteriophages. Eukaryotic cells (yeast, insect cells, mammalian cells, etc.) are also used for the expression of recombinant proteins, but E. coli is always the system of choice for the manipulation of DNA.

Vectors

Plasmids: extra-chromosomal genetic elements in bacteria. The size of a plasmid is typically 1-200 kbp. They are circular, double-stranded DNA molecules. They can replicate independently from the bacterial chromosome. The copy number of a plasmid inside the bacterial cell is determined by the origin of replication. It can range from a few copies up to several hundred molecules. Modern plasmid vectors contain many artificial DNA segments that facilitate the cloning procedure (e.g. polycloning site, antibiotic resistance, single-stranded DNA replication origin, LacZ gene for blue-white selection, etc.).

Bacteriophages: the viruses of bacteria. They can replicate independently inside the bacterial cell and they tolerate the presence of foreign DNA inserts.

Advantages over plasmids: they have their own efficient way of entering the cell, so the transformation process is very efficient. The foreign DNA insert can be much bigger than in the case of plasmids (8-24 kbp).

Disadvantage: the laboratory process is more difficult and expensive.

The most frequently used bacteriophage vectors are the λ-phage and the M13 phage.
The M13 phage is the vector for special applications, such as preparing single-stranded DNA and displaying recombinant proteins on the surface of the virus for in vitro selection (phage display).

Other vectors: cosmids, BACs (bacterial artificial chromosomes), YACs (yeast artificial chromosomes). They are suitable for cloning large pieces of DNA:

Vector                                  Insert size
Cosmid                                  30-45 kbp
BAC (bacterial artificial chromosome)   120-300 kbp
YAC (yeast artificial chromosome)       250-400 kbp

In order to identify the cloned DNA we can use several methods, including colony hybridization, restriction mapping, direct sequencing and polymerase chain reaction (PCR).

PCR is the cell-free method of DNA amplification. We can clone DNA without using a vector and host cells. PCR resembles in vivo DNA replication. Any DNA sequence can be amplified. The only restriction is that we have to know the short flanking sequences around the target DNA. At present PCR is the most frequently used method for DNA cloning.

Polymerase chain reaction

DNA to be amplified (n molecules) + primers, dNTPs, heat-stable DNA polymerase
Denaturation, 95 °C, 20 s: denatured, single-stranded DNA (2n molecules)
Hybridization, 60 °C, 30 s: DNA-primer hybrids (2n molecules)
Polymerization, 72 °C, 2 min: replicated double-stranded DNA (2n molecules)
Repeat the cycle 20-30 times.

The power of the PCR

Cycles   Copies of DNA molecules
1        2
2        4
4        16
10       1024
15       32768
20       1048576
25       33554432
30       1073741824
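The doubling behind the numbers above is easy to reproduce. The following sketch assumes ideal amplification, i.e. every molecule is copied in every cycle, which real reactions only approximate; the function names are hypothetical.

```python
def pcr_copies(cycles, initial_molecules=1):
    """Number of DNA molecules after the given number of ideal PCR cycles."""
    return initial_molecules * 2 ** cycles


def cycles_needed(target_copies, initial_molecules=1):
    """Smallest number of ideal cycles that yields at least `target_copies`."""
    cycles = 0
    while pcr_copies(cycles, initial_molecules) < target_copies:
        cycles += 1
    return cycles


for cycles in (1, 2, 4, 10, 15, 20, 25, 30):
    print(cycles, pcr_copies(cycles))

print(cycles_needed(1_000_000))  # 20
```

Starting from a single molecule, 20 ideal cycles already give more than a million copies and 30 cycles more than a billion, which is why 20-30 cycles are usually sufficient.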
PCR

Primer design is the most critical step of PCR. Critical parameters are: length, melting point, cross hybridization, and secondary structural elements in the DNA. At present many software tools are available for PCR primer design.

One real limitation: we have to know the sequence flanking the DNA to be amplified. This limitation, however, can be overcome in certain cases: e.g. we can design primers for the vector to amplify the cloned DNA fragment, we can use oligo dT primers at the 3’ end of the cDNA, we can synthesize a homopolymer region on the 3’ end of the DNA, etc.

Cloning in the genomic age

The first step is to find the sequence of the gene to be cloned in a database (e.g. GenBank). Knowing the sequence we have several choices:
• Many genes can be purchased from biotech companies, ligated into a vector.
• We can design primers to amplify the DNA. In this case we need suitable template DNA.
• We can synthesize the gene. The advantages of artificial genes are that we can design the restriction map of the DNA and we can tune the codon usage for recombinant protein expression.

Due to recombinant DNA technology and the rapid development of modern sequencing methods, the size of the (nucleotide) sequence databases is expanding very rapidly.

Genome sequences
Gene finding / annotation
Prediction of the gene functions / functional genomics
Network modelling / systems biology
Problem: we have a plethora of predicted potential gene sequences from the recently sequenced genomes; however, we do not know for sure which ones really encode proteins, let alone the structure and the function of the encoded proteins.

Even in the case of a "simple" organism there are many genes with unidentified function (orphan genes).

Example: the genome of the yeast (Saccharomyces cerevisiae) contains more than 6000 genes. Before genome sequencing approximately 2000 genes had been characterized experimentally. Another 2000 genes were not known before; however, their functions can be predicted by homology. The remaining 2000 genes do not show any homology with other known genes (orphan genes), therefore their function is elusive.