2011.10.09. TÁMOP –4.1.2-08/2/A/KMR-2009-0006

Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial framework
Consortium leader: PETER PAZMANY CATHOLIC UNIVERSITY
Consortium members: SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER
The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund.

Peter Pazmany Catholic University, Faculty of Information Technology
www.itk.ppke.hu

INTRODUCTION TO BIOINFORMATICS
CHAPTER 2: Knowledge representation and core data-types
Sándor Pongor

What will we speak about?
• Core elements.
• Systems theory of biological knowledge representation.
• Core data-types: sequences, 3D structures, networks, texts, plus database records as a summary.

Bioinformatics is interdisciplinary
COGNITIVE SCIENCES, MOLECULAR BIOLOGY, COMPUTER SCIENCE

What is particular in bioinformatics?
• The variety of objects: molecular structures, metabolic pathways, regulatory networks AND their databases.
• A few methods: analysis and use of similarity.
• The complexity of biological knowledge (and NOT so much the quantity of data...).

The same molecule has many different representations
• Sequences: MARTKQTARK STGGKAPRKQ LATKAARKSA
• Extended sequences (e.g. disulfide topology): CIPKWNRCGPKMDGVPCCEPYTCTSDYYGNCS
• Cartoons of domains or secondary structures
• Symbolic diagrams (e.g. hydrophobicity plots, helical circle diagrams)
• Simplified 3D cartoons
• 3D structures

Systems theory, structure and function
Structure and function are concepts of systems theory.
The Viennese biologist Ludwig von Bertalanffy (1901-1972) founded general systems theory to explain commonalities of biological and environmental phenomena. It is now used in many fields (social systems, company organization, military).
Advantage: qualitative explanations, generalization power, abstraction.
Disadvantage: contains little mathematical or quantitative foundation.

What are systems?
• Any part of reality that can be approximately separated from the environment (by a boundary). A community in an environment.
• Consist of interacting parts.
• Interact with the environment (inputs, outputs).
• System models are generalizations of reality.
• Have a structure that is defined by parts and processes.
• Parts have functional as well as structural relationships between each other.

Systems theory explains the variety of molecular descriptions
Abstract example:
• A system of moving particles
• Populated positions and a boundary
• Structure: entities and relationships
• Form

General definitions for structure and function
Structure is an approximately constant spatio-temporal arrangement of elements or properties.
A molecular structure is a subset of this: a constant (spatio-temporal) arrangement of elements (e.g. atoms) and relationships (e.g. bonds).
Substructure: a part of a structure.
Function is a role played within a system. A system's function is its role played within a higher system (hierarchical description).

Systems explain various phenomena as repetition (recurrence)
• External repetition: the same substructure in different systems.
• Internal repetition: the same substructure within the same system.

SYSTEM EXAMPLES
System | Entities | Relationships
a) General examples
Molecules | Atoms | Atomic interactions (chemical bonds)
Assemblies | Proteins, DNA | Molecular contacts
Metabolic pathways | Enzymes | Chemical reactions (substrates/products)
Genetic networks | Genes | Co-regulation
b) Examples for proteins
Protein sequence | Amino acids | Sequential vicinity
Protein structure | Atoms | Chemical bonds
Protein structure (simplified) | Secondary structures | Sequential and topological vicinity
Backbone structure (fold) | Cα atoms | Peptide bonds

Core data-types
A very large number of descriptions can be built from the various entities and relationships. We select a few of them:
• Biological sequences (character strings built from the amino acid alphabet [20 letters] or the nucleotide alphabet [4 letters])
• 3D structures (atoms with x, y, z coordinates, chemical bonds)
• Networks (generalized descriptions, e.g. a node can be a gene, an edge can be a regulatory link)
• Texts (e.g. PubMed abstracts)
• Database records: we discuss them as a standard way to store the core data.

Core data-types: SEQUENCES, 3-D, NETWORKS, TEXT
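As a minimal sketch of the first core data-type, the snippet below treats a biological sequence as a character string over a fixed alphabet and reports which alphabet a given string fits. The names `PROTEIN_ALPHABET`, `DNA_ALPHABET` and `classify_sequence` are illustrative assumptions, not taken from the slides or from any library; the protein example string is the one shown earlier in the chapter.

```python
# Hypothetical helper names, for illustration only.
# The alphabets are the 20 standard amino acids and the 4 DNA nucleotides.
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")
DNA_ALPHABET = frozenset("ACGT")


def classify_sequence(seq: str) -> str:
    """Return the alphabet that the string fits: nucleotide, protein or unknown."""
    letters = set(seq.upper())
    if not letters:
        return "empty"
    # A, C, G and T are also amino acid letters, so the smaller alphabet
    # is tested first; a pure ACGT string is reported as nucleotide.
    if letters <= DNA_ALPHABET:
        return "nucleotide"
    if letters <= PROTEIN_ALPHABET:
        return "protein"
    return "unknown"


print(classify_sequence("MARTKQTARK"))
print(classify_sequence("ACGTTGCA"))
print(classify_sequence("HELLO WORLD"))
```

The relations between the characters (sequential vicinity) are not stored anywhere: they are implicit in the order of the string.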

BIOLOGICAL SEQUENCES
Biological sequences are character strings built from the amino acid alphabet [20 letters] or the nucleotide alphabet [4 letters].

SEQUENCES
Model: chemical structure of proteins (far too complicated for large molecules).
Description: character strings. Characters denote amino acids (the relations, i.e. sequential vicinity, are implicit!).
Simplified and/or extended (annotated) forms of visualization.
(Figure labels: covalent, IFPPVPGP, enzyme, binding site.)

Biological sequences as language
• Sequences are like texts written in an unknown language.
• There are imperfect analogies to human language and coded messages: we can talk about a "language metaphor".
• Analysis tools (exact and approximate string matching [= alignment]) were originally developed for texts.
• The theory of computer languages (Chomsky) can be applied to biological sequences.

qfinetdttvivtwtpprarivgyrltvgllseegdepqyldlpstatsvnipdllpgrkytvnvyeiseegeqnlilstsqttapdappdptvdqvddtsivvrwsrprapitgyrivyspsvegsstelnlpetansvtlsdlqpgvqynitiyaveenqestpvfiqqettgvprsdkvppprdlqfvevtdvkitimwtppespvtgyrvdvipvnlpgehgqrlpvsrntfaevtglspgvtyhfkvfavnqgreskpltaqqatkldaptnlqfinetdttvivtwtpprarivgyrltvgltrggqpkqynvgpaasqyplrnlqpgseyavslvavkgnqqsprvtgvfttlqplgsiphyntevtettivitwtpaprigfklgvrpsqggeaprevtsesgsivvsgltpgveyvytisvlrdgqer

3D STRUCTURES
3D structures are atoms with x, y, z coordinates and chemical bonds. For macromolecules we typically simplify them into larger blocks, backbone or surface representations.
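The description of a 3D structure as atoms with coordinates, and its simplification to a backbone, can be sketched as follows. This is only an illustration under stated assumptions: the `Atom` class, the function names and all coordinates are invented, and reducing the structure to one Cα atom per residue is one common simplification, not necessarily the one used in the lecture figures.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    name: str     # atom name, e.g. "N", "CA" (C-alpha), "C", "O"
    residue: int  # position of the residue in the sequence
    x: float
    y: float
    z: float


def backbone_trace(atoms):
    """Keep only the C-alpha atoms, ordered by residue: a simplified backbone."""
    return sorted((a for a in atoms if a.name == "CA"), key=lambda a: a.residue)


def consecutive_distances(trace):
    """Distances between neighbouring atoms of the trace."""
    return [
        dist((a.x, a.y, a.z), (b.x, b.y, b.z))
        for a, b in zip(trace, trace[1:])
    ]


# Made-up coordinates for two residues, for illustration only.
atoms = [
    Atom("N", 1, 0.0, 0.0, 0.0),
    Atom("CA", 1, 1.0, 0.0, 0.0),
    Atom("C", 1, 2.0, 1.0, 0.0),
    Atom("N", 2, 3.0, 1.5, 0.0),
    Atom("CA", 2, 4.0, 4.0, 0.0),
]

trace = backbone_trace(atoms)
print([a.residue for a in trace])
print(consecutive_distances(trace))
```

The full atom list plays the role of the detailed model, while the trace is a simplified description of the same molecule.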

Van 't Hoff (1852-1911)
1898: Chimie dans l'espace
The Dutch chemist (Nobel prize 1901) discovered that some phenomena in chemistry need a 3D description. Before that, there was no notion of the 3D nature of molecules.
Object metaphor: the analogy with objects (collisions, movements, no overlap in space) is obvious but imperfect. Nevertheless, it profoundly influences our thinking about atoms.

"This figure is purely diagrammatic. The two ribbons symbolize the two phosphate-sugar chains, and the horizontal rods the pairs of bases holding the chains together. The vertical line marks the fibre axis." (Watson, Crick, 1953)
Macromolecules are so complex that only their simplified views make visual sense. The double helix was shown in a simplified form already in the first, epoch-making publication.

Molecular models today are more an art than a science. There are established methods of visualization for macromolecules (backbones, surfaces, color codes etc.).

3D structures
Model: 3D chemical structures.
Description: 3D coordinates (x_i, y_i, z_i), i = 1, ..., n.
Simplified and/or extended (annotated) visualization: backbone (main chain), surface.

NETWORKS
Networks are the most generalized entity-relationship models, applicable to any system (e.g. a node can be a gene, an edge can be a regulatory link). There are strong analogies with mathematical graphs and weak analogies with social systems ("social metaphor").
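A network in this entity-relationship sense can be sketched as a small directed graph in which genes are nodes and regulatory links are edges carrying a sign (up or down regulation). The gene names and the function names below are placeholders chosen for illustration, not real genes or a library API.

```python
# Each edge is (regulator, target, sign): +1 = up-regulation, -1 = down-regulation.
# Gene names are placeholders, not real genes.
edges = [
    ("geneA", "geneB", +1),
    ("geneA", "geneC", -1),
    ("geneB", "geneC", +1),
]


def build_network(edge_list):
    """Build an adjacency mapping: regulator -> {target: sign}."""
    network = {}
    for regulator, target, sign in edge_list:
        network.setdefault(regulator, {})[target] = sign
        network.setdefault(target, {})  # make sure every gene is a node
    return network


def regulators_of(network, gene):
    """Return the genes that regulate `gene`, with the sign of each link."""
    return {reg: targets[gene] for reg, targets in network.items() if gene in targets}


network = build_network(edges)
print(sorted(network))
print(regulators_of(network, "geneC"))
```

The same adjacency mapping would serve for other systems mentioned in the chapter (for example species and predator/prey relations), which is what makes the network the most general of the core data-types.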

Small molecules: classical graphs
The first network models were the chemical formulas applied in the 19th century. Much of early graph theory was inspired by chemical formulas.
(Figure: chemical formulas by Loschmidt, 1861; Crum Brown, 1861; Kekulé, 1865; Cayley, 1872; Van 't Hoff, 1898.)

Networks of genomes
Today we apply networks to all biological problems, from the molecular level (top left) to the ecological level (bottom right: a food network with species as nodes and predator/prey relations as edges).

Transcription regulatory networks have genes as nodes and up (+) and down (-) regulatory relations as edges. (Figure: S. cerevisiae.)

TEXTS (article abstracts in PubMed)
Scientific texts are written in human language. They contain encoded annotations (abbreviated citations, postal addresses etc.) and specific language (molecular names, chemical formulas etc.). There are strong analogies with human semantics.

Keyword collections, ontologies, etc.
• Scientific texts have a strict or close to strict structure, similar to database records.
• The meaning of scientific texts is at present not machine-readable.
• Auxiliary information (author and journal names, or annotations such as keywords) is machine-readable.

Model: ??
(none)
Description: structured files (records, fields), standardized language.
Simplified and/or extended visualization. (Figure: database screenshot.)

The core data-types are all entity-relationship descriptions. The entities and relationships have to be formally defined, either as concept hierarchies (simplified) or as ontologies that contain descriptions + rules.

DATABASE RECORD
Putting the core data into database records

Biomolecular databases in a nutshell
• They contain one molecule per record. Sequence databases are the most developed.
• The main part of the record is the structural description, which is typically a sequence or a structure.
• In addition, records contain an annotation part, which is a collection of various information: functional descriptions, cross-references, and also structural descriptions (information assigned to parts of the structure). So annotation duplicates certain aspects of the molecules.
• As a result, a sequence database is a complex object that can be handled with dedicated programs (parsers).

Annotation of (sequence) data means assigning global and local descriptors to a molecule
• Global descriptors, e.g. function.
• Local descriptors, e.g. binding sites, domains.
• Annotation requires database searching and knowledge of "biology" (chemistry, medicine...).
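To make the role of a parser concrete, here is a minimal sketch that reads one record from a made-up flat-file layout: a two-letter field code, a space, then the value, with `//` closing the record. The layout and the field codes `ID`, `DE`, `FT` and `SQ` are assumptions chosen for illustration; they do not reproduce the exact format of any real database.

```python
def parse_record(text):
    """Parse one record of the invented flat-file layout into a dictionary."""
    record = {"annotations": [], "sequence": ""}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":  # end-of-record marker (assumed)
            break
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "ID":    # global: identifier
            record["id"] = value
        elif code == "DE":  # global descriptor: free-text description
            record["description"] = value
        elif code == "FT":  # local descriptor: kind, start, end (1-based)
            kind, start, end = value.split()
            record["annotations"].append((kind, int(start), int(end)))
        elif code == "SQ":  # structural description: the sequence itself
            record["sequence"] += value.replace(" ", "")
    return record


example = """\
ID EXAMPLE_1
DE hypothetical protease
FT BINDING 3 5
SQ MRNGG TT
//
"""

record = parse_record(example)
print(record)
```

The structural description (the sequence) ends up in a machine-readable field, while the description line stays as human-readable text, mirroring the two parts of a record described above.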

Generalized annotation
If we take a theoretical topology, the number line, and assign amino acids to it, we obtain a sequence. We can carry on assigning local descriptors or global descriptors, and we end up creating a database record of a structure. This is a database-centric view of a structure.

Theoretical topology (number line): 1 2 3 4 5 6 7 8 9 10 11 ...
Assigning amino acids to positions = sequence: M R N G G T T ...
Local descriptors: secondary structures (e.g. α-helix), hydrophobicity or other numerical properties
Global descriptor: function (protease)

CORE DATA-TYPES OF BIOINFORMATICS
• Molecular structure is a model, an abstract, mental representation that can be described with the tools of systems theory.
• Concepts of system, structure, function. Structure is an ensemble of elements and relations.
• 4 core data-types (models): sequence, 3D, network and text.
• Models are represented by computers with dedicated data structures, images and/or in a narrative form.
• Simplified and extended (annotated) descriptions.
• Database records contain the core data-types in machine-readable form and annotations in mostly human-readable forms.
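The generalized annotation idea (positions on a number line, amino acids assigned to positions, then local and global descriptors) can be sketched as a small data structure. The class and method names are hypothetical, and the helix positions are invented for illustration; only the sequence fragment MRNGGTT and the function "protease" come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class LocalDescriptor:
    start: int  # 1-based position on the number line, inclusive
    end: int    # inclusive
    kind: str   # e.g. "secondary structure", "binding site"
    value: str


@dataclass
class Record:
    sequence: str  # amino acids assigned to positions 1..n
    global_descriptors: dict = field(default_factory=dict)
    local_descriptors: list = field(default_factory=list)

    def annotate(self, start, end, kind, value):
        """Attach a local descriptor to a stretch of positions."""
        if not (1 <= start <= end <= len(self.sequence)):
            raise ValueError("positions must lie on the sequence")
        self.local_descriptors.append(LocalDescriptor(start, end, kind, value))

    def descriptors_at(self, position):
        """All local descriptors that cover a given position."""
        return [d for d in self.local_descriptors if d.start <= position <= d.end]


record = Record("MRNGGTT", global_descriptors={"function": "protease"})
record.annotate(2, 5, "secondary structure", "helix")  # positions are invented

print(record.sequence[2 - 1:5])  # residues covered by the helix annotation
print(record.descriptors_at(3))
print(record.global_descriptors["function"])
```

Global descriptors apply to the whole molecule, whereas each local descriptor is tied to an interval of the number line, which is why both can live in the same record without conflict.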