Bioinformatics and Data Analysis

Bioinformatics is an interdisciplinary field that combines general biology, molecular biology, cybernetics, genetics, chemistry, computer science, mathematics and statistics. It takes a computational approach to large-scale biological problems that require the analysis of large amounts of data. The field is mainly concerned with studying and developing computational methods for obtaining, analyzing, storing, organizing and visualizing biological data.

Computational biology is often mentioned in a similar context. This area focuses on the development of algorithms and the mathematical modeling of social, behavioral and biological systems. Bioinformatics is considered an area within computational biology that focuses mainly on the statistical processing of biological data. The two also differ in perspective: bioinformaticians are biologists who specialize in using computing systems and tools to solve biological problems, whereas computational biologists are computer scientists, mathematicians, statisticians and engineers who develop the tools for such calculations.

In a broad sense, bioinformatics means working with any kind of biological data, including the study of electron micrographs, keyword searches in the biological literature, and so on. Considered as a set of approaches and methods for working with data, it includes, depending on the type of technical problem:

  • Development of algorithms and programs for more efficient work with data;
  • Storing and transferring information or working with databases.

However, bioinformatic methods of analysis are also inextricably linked to the many scientific fields in which they are used to answer specific biological questions.

From this point of view, the main directions can be distinguished by the objects under study:

  • Sequence bioinformatics;
  • Expression analysis;
  • Structural bioinformatics;
  • Exploring cellular organization;
  • Systems biology.

Each of these areas has its own standard data types, processing methods, bioinformatics algorithms and databases.

Bioinformatics uses the methods of applied mathematics, statistics and computer science, and is applied in biochemistry, biophysics, ecology and other fields. The most frequently used tools and technologies in this area are the programming languages Python, R, Java, C# and C++; the markup language XML; the database query language SQL; the parallel computing platform CUDA; MATLAB, a numerical computing package and the programming language of the same name; and spreadsheets.

Introduction

Bioinformatics has become an important part of many areas of biology. Bioinformatic methods of analysis make it possible to interpret large amounts of experimental data, which was practically impossible before the development of this field. For example, experimental molecular biology often uses bioinformatics techniques such as image and signal processing. In the field of genetics and genomics, bioinformatics assists in the functional annotation of genomes, detection and analysis of mutations. An important task is to study gene expression and ways of its regulation. In addition, bioinformatics tools make it possible to compare genomic data, which is a prerequisite for studying the principles of molecular evolution.

In general terms, bioinformatics helps to analyze and catalog biochemical pathways and networks, which are an important part of systems biology. In structural biology, it assists in the modeling of DNA, RNA and protein structures, as well as molecular interactions.

Recent advances in biological data processing have led to significant changes in biomedicine. Thanks to bioinformatics, scientists can identify the molecular mechanisms underlying both hereditary and acquired diseases, which helps in developing effective treatments and more accurate diagnostic tests. Pharmacogenetics, the line of research that predicts the effectiveness and adverse effects of drugs in individual patients, is also based on bioinformatic methods.

An important role of bioinformatics also lies in the analysis of biological literature and the development of biological and genetic ontologies for organizing biological data.

Main research areas

Genetic sequence analysis

Since the phage Phi-X174 was sequenced in 1977, the DNA sequences of a growing number of organisms have been determined and stored in databases. These data are used to determine the sequences of proteins and regulatory sites. Comparing genes within a species or between species can reveal similarities in protein function or relationships between species (this is how phylogenetic trees are constructed). As the amount of data has grown, manual sequence analysis has long since become impossible. Computer programs are now used to search the genomes of thousands of organisms, comprising billions of base pairs. Such programs can match (align) similar DNA sequences in the genomes of different species; these sequences often have similar functions, and the differences between them arise from small mutations such as substitutions of individual nucleotides, insertions, and deletions.

A variant of this alignment is used in the sequencing process itself. The shotgun sequencing technique (used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae) does not yield a complete nucleotide sequence but the sequences of short DNA fragments, each about 600-800 nucleotides long. The ends of the fragments overlap and, when properly aligned, give the complete genome. This method produces sequencing results quickly, but assembling the fragments can be quite challenging for large genomes. In the project to sequence the human genome, assembly took several months of computer time. Today the method is used for almost all genomes, and genome assembly algorithms remain one of the most pressing problems in bioinformatics.
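The idea of assembling overlapping fragments can be illustrated with a minimal sketch. It assumes error-free reads from a single strand and uses a simple greedy strategy; the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical choices made for this illustration, and real assemblers use far more elaborate methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more; keep separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    fragments = ["TGCAATCG", "ATGGCGT", "GCGTGCA"]
    print(greedy_assemble(fragments))
```

The greedy choice is easy to follow but can be misled by repeated regions, which is one reason assembly becomes hard for large genomes.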

Another application of computer sequence analysis is the automatic search for genes and regulatory sequences in a genome. Not all nucleotides in a genome encode proteins. For example, in the genomes of higher organisms, large segments of DNA do not encode proteins, and their functional role is unknown. Developing algorithms that identify the protein-coding regions of a genome is an important task of modern bioinformatics.
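One of the simplest ways to look for candidate protein-coding regions is to scan for open reading frames (ORFs): stretches that begin with a start codon and run in frame to a stop codon. The sketch below is only an illustration under strong assumptions (forward strand only, ATG as the sole start codon, no introns); the function name `find_orfs` and the `min_codons` parameter are hypothetical, and practical gene finders rely on statistical models rather than this rule alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs on the forward strand.

    `start` is the index of the A in ATG; `end` is exclusive and includes
    the stop codon. `min_codons` counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the frame and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGGCCTAAGG"
    for begin, end in find_orfs(sequence):
        print(begin, end, sequence[begin:end])
```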

Bioinformatics helps link genomic and proteomic projects, for example by using DNA sequences to identify proteins.
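The basic step that connects a DNA sequence to a protein sequence is translation with the genetic code. The following sketch builds the standard genetic code table from its conventional compact string form (bases ordered T, C, A, G) and translates a coding sequence; the names `CODON_TABLE` and `translate`, and the use of "X" for unrecognized codons, are assumptions made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, to_stop: bool = True) -> str:
    """Translate a coding DNA sequence read in frame from its first base."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

A predicted protein sequence obtained this way can then be compared with sequences in protein databases.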

Genome annotation

In the context of genomics, annotation is the process of marking genes and other features in a DNA sequence. The first genome annotation software system was created in 1995 by Owen White at The Institute for Genomic Research, the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae. Dr. White built a system for finding genes (stretches of DNA that specify the sequence of a particular polypeptide or functional RNA), tRNAs and other features, and for making the first assignments of function to those genes. Most modern genome annotation systems work in a similar way, but the programs available for analyzing genomic DNA, such as GeneMark, which was used to find protein-coding genes in Haemophilus influenzae, are constantly changing and improving.

Computational evolutionary biology

Evolutionary biology studies the origin and emergence of species, as well as their change over time. Computer science helps evolutionary biologists in several ways. It allows them to:

  • study the evolution of a large number of organisms by measuring changes in their DNA, and not just in structure or physiology;
  • compare entire genomes (see BLAST), which allows the study of more complex evolutionary events, such as gene duplication and horizontal gene transfer, and the prediction of factors important in bacterial speciation;
  • build computer models of populations to predict the behavior of the system over time;
  • track publications containing information on a large number of species.
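Measuring changes in DNA, as mentioned in the first item of the list, can be illustrated with the simplest distance measure: the proportion of aligned positions at which two sequences differ (often called the p-distance). This is a minimal sketch that assumes the sequences are already aligned to equal length; the function name `p_distance` and the choice to skip gapped positions are assumptions of this example, and evolutionary studies usually apply corrections on top of this raw value.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


if __name__ == "__main__":
    print(p_distance("ACGTACGT", "ACGTACGA"))
```

A matrix of such pairwise distances is a typical input for distance-based tree-building methods.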

The area of computer science that uses genetic algorithms is often confused with computational evolutionary biology, but the two are not necessarily related. Work on genetic algorithms uses specialized software to improve algorithms and computations, and is based on evolutionary principles such as replication, diversification through recombination or mutation, and survival under selection.
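To make the distinction concrete, here is a minimal sketch of a genetic algorithm applied to a toy optimization problem that has nothing to do with biological data: maximizing the number of 1 bits in a bit string. Everything in it (the function name `evolve`, the parameter values, one-point recombination, keeping the best individual unchanged) is an assumption chosen for illustration.

```python
import random


def evolve(genome_len: int = 20, pop_size: int = 30, generations: int = 40,
           mutation_rate: float = 0.02, seed: int = 0) -> list[int]:
    """Run a toy genetic algorithm and return the best fitness per generation."""
    rng = random.Random(seed)
    fitness = sum  # fitness of a bit string = number of 1 bits
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = [max(map(fitness, population))]
    for _ in range(generations):
        # Selection: the fitter half of the population survives.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # Replication: the best individual is copied unchanged.
        children = [survivors[0][:]]
        while len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_len)  # one-point recombination
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        history.append(max(map(fitness, population)))
    return history


if __name__ == "__main__":
    best_per_generation = evolve()
    print(best_per_generation[0], best_per_generation[-1])
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next.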

Biodiversity assessment

The biological diversity of an ecosystem can be defined as the complete genetic makeup of a given environment, comprising all the species living in it, whether that environment is a biofilm in an abandoned mine, a drop of sea water, a handful of soil, or the entire biosphere of planet Earth. Databases are used to collect species names, descriptions, distribution ranges and genetic information. Specialized software is used to search, visualize and analyze this information and, more importantly, to share it with others. Computer simulations model phenomena such as population dynamics, or calculate the overall genetic health of a crop in agronomy. One of the most promising possibilities in this area is the analysis of DNA sequences or complete genomes of endangered species: the results of nature's genetic experiment can be stored in a computer and used again in the future, even if these species become completely extinct.
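As an illustration of what a population dynamics simulation can look like in its simplest form, the sketch below iterates a discrete logistic growth model, in which a population grows quickly while small and levels off near the carrying capacity of its environment. The model choice, the function name `logistic_growth` and all parameter values are assumptions made for this example, not something prescribed by the text.

```python
def logistic_growth(n0: float, rate: float, capacity: float,
                    steps: int) -> list[float]:
    """Simulate N(t+1) = N(t) + rate * N(t) * (1 - N(t) / capacity)."""
    sizes = [float(n0)]
    for _ in range(steps):
        n = sizes[-1]
        sizes.append(n + rate * n * (1 - n / capacity))
    return sizes


if __name__ == "__main__":
    trajectory = logistic_growth(n0=10, rate=0.5, capacity=1000, steps=30)
    for generation in (0, 5, 10, 20, 30):
        print(generation, round(trajectory[generation], 1))
```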

Methods for assessing other components of biodiversity, namely taxa (primarily species) and ecosystems, often fall outside the scope of bioinformatics. At present, the mathematical foundations of bioinformatic methods for taxa are treated within phenetics, or numerical taxonomy. Methods for analyzing the structure of ecosystems are the domain of specialists in fields such as systems ecology and biocenometry.

Basic bioinformatics programs

  • ACT (Artemis Comparison Tool) – genomic analysis;
  • Arlequin – analysis of population genetic data;
  • Bioconductor – a large-scale FLOSS project providing many separate packages for bioinformatics research;
  • BioEdit – editor for multiple nucleotide and amino acid sequence alignment;
  • BioNumerics – a commercial universal software package;
  • BLAST – search for related sequences in the database of nucleotide and amino acid sequences;
  • Clustal – multiple nucleotide and amino acid sequence alignment;
  • DnaSP – DNA sequence polymorphism analysis;
  • FigTree – phylogenetic tree editor;
  • Genepop – population genetic analysis;
  • Genetix – population genetic analysis (the program is available only in French);
  • JalView – editor for multiple alignment of nucleotide and amino acid sequences;
  • MacClade – commercial software for interactive evolutionary data analysis;
  • MEGA – Molecular Evolutionary Genetic Analysis;
  • Mesquite – comparative biology software written in Java;
  • Muscle – multiple alignment of nucleotide and amino acid sequences. Faster and more accurate than ClustalW;
  • PAUP – phylogenetic analysis using the parsimony method (and other methods);
  • PHYLIP – phylogenetic software package;
  • Phylo_win – phylogenetic analysis. The program has a graphical interface;
  • PopGene – analysis of genetic diversity of populations;
  • Populations – population genetic analysis;
  • PSI Protein Classifier – generalization of the results obtained using the PSI-BLAST program;
  • Seaview – phylogenetic analysis (GUI);
  • Sequin – sequence deposition at GenBank, EMBL, DDBJ;
  • SPAdes – assembler of bacterial genomes;
  • SplitsTree – a program for building phylogenetic trees;
  • T-Coffee – multiple progressive alignment of nucleotide and amino acid sequences. More sensitive than ClustalW / ClustalX;
  • UGENE – a free Russian-language tool for multiple alignment of nucleotide and amino acid sequences, phylogenetic analysis, annotation, and work with databases;
  • Velvet – genome assembler;
  • ZENBU – summarizing the results.

Structural bioinformatics includes the development of algorithms and programs for predicting the spatial structure of proteins.

Research topics in structural bioinformatics

  • X-ray structural analysis (X-ray crystallography) of macromolecules;
  • Quality indicators of a macromolecular model built from X-ray diffraction data;
  • Algorithms for calculating the surface of a macromolecule;
  • Algorithms for finding the hydrophobic core of a protein molecule;
  • Algorithms for finding the structural domains of proteins;
  • Spatial alignment of protein structures;
  • Structural classifications of protein domains (SCOP and CATH);
  • Molecular dynamics.
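A quantity that appears in several of these topics, in particular when comparing protein structures, is the root-mean-square deviation (RMSD) between corresponding atoms. The sketch below computes it for two coordinate sets that are assumed to be already superimposed and listed in the same atom order; finding the optimal superposition is a separate step that is not shown. The function name `rmsd` and the input format (lists of x, y, z tuples) are assumptions of this example.

```python
import math

Point = tuple[float, float, float]


def rmsd(coords_a: list[Point], coords_b: list[Point]) -> float:
    """Root-mean-square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
    reference = [(0.0, 0.1, 0.0), (1.5, -0.1, 0.0), (3.0, 0.1, 0.0)]
    print(round(rmsd(model, reference), 3))
```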

Why study bioinformatics?

Modern biology deals with gigantic amounts of data that the old methods either cannot be applied to or simply cannot process. This is where bioinformatics comes in. In a general sense, bioinformatics is the use of computational, mathematical and statistical methods to solve biological problems. Biological research today is very diverse, and a number of new sciences have appeared, the so-called “omics” (genomics, transcriptomics, proteomics, metabolomics, and others), many of which have already taken their place in modern biology. There are also fully interdisciplinary areas, such as systems biology, which aims to combine everything into a single picture by studying and modeling interactions in living systems. Bioinformatics, like biology, has a fairly wide range of methods and subfields.

If you are a biologist, you have probably already faced, or will soon face, a problem that requires bioinformatics to solve, for example:

  • genome assembly;
  • finding and studying gene functions;
  • gene expression prediction;
  • prediction of protein function;
  • search for genomic variants and associated phenotypes;
  • neonatal screening for genetic diseases;
  • questions of evolutionary and comparative biology, modeling of evolution;
  • and even drug development.

By studying bioinformatics, biologists gain practical skills in programming, statistical analysis, data processing and visualization of results. They also learn to speak the same language as technical specialists, to formulate problems correctly, to check the solutions obtained, and to work more effectively with programmers and mathematicians on improving and creating convenient, easy-to-use bioinformatics programs.

For computer scientists and mathematicians, bioinformatics is an opportunity:

  • to apply computer science knowledge to an interesting and extensive subject area, biology and evolution;
  • to solve interesting and complex algorithmic problems;
  • to create software tools for biologists and physicians;
  • to help solve some of humanity's most important problems, those related to human health, quality of life and longevity;
  • to make a real contribution to the development of the life sciences in general.

For both groups, bioinformatics helps to significantly accelerate professional growth and offers interesting, complex and non-trivial scientific problems.