
Bioinformatics /ˌbaɪ.oʊˌɪnfərˈmætɪks/ is an interdisciplinary field that develops methods and software tools for understanding biological data. It combines biology, computer science, information engineering, mathematics and statistics to analyse and interpret biological data, and it is also used for in silico analysis of biological queries using mathematical and statistical techniques.

Bioinformatics includes biological studies that use computer programming as part of their methodology, as well as specific analysis pipelines that are used repeatedly, particularly in genomics. Common uses of bioinformatics include the identification of candidate genes and single nucleotide polymorphisms (SNPs). Such identification is often performed with the aim of better understanding the genetic basis of disease, unique adaptations, desirable properties (especially in agricultural species), or differences between populations. In a less formal way, bioinformatics also seeks to understand the organizational principles within nucleic acid and protein sequences, called proteomics.


Bioinformatics has become an essential part of many areas of biology. In experimental molecular biology, bioinformatics techniques such as image and signal processing allow useful results to be extracted from large amounts of raw data. In genetics, it helps to sequence and annotate genomes and their observed mutations. It plays a role in text mining of the biological literature and in the development of biological and gene ontologies to organize and query biological data. It also plays a role in the analysis of gene and protein expression and regulation. Bioinformatics tools help to compare, analyze and interpret genetic and genomic data and, more generally, to understand the evolutionary aspects of molecular biology. At a more integrative level, it helps to analyze and catalogue the biological pathways and networks that are an essential part of systems biology. In structural biology, it helps in the simulation and modeling of DNA, RNA, proteins and biomolecular interactions.


Sequences of genetic material are frequently used in bioinformatics and are easier to manage with computers than manually.

Computers became essential in molecular biology when protein sequences became available after Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences manually proved impractical. One pioneer in the field was Margaret Oakley Dayhoff, who compiled one of the first protein sequence databases, originally published as books, and pioneered methods of sequence alignment and molecular evolution. Another early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 and whose comprehensive volumes of antibody sequences were released with Tai Te Wu between 1980 and 1991.


In order to study how normal cellular activities are altered in different disease states, biological data must be combined to form a complete picture of these activities. Therefore, the field of bioinformatics has evolved in such a way that the most urgent task is now the analysis and interpretation of various types of data. Among these are nucleotide and amino acid sequences, protein domains and protein structures. The actual data analysis and interpretation process is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:

Relationship with other fields

Bioinformatics is similar to, but distinct from, biological computing, while it is often considered synonymous with computational biology. Biological computing uses bioengineering and biology to build biological computers, whereas bioinformatics uses computing to better understand biology. Both bioinformatics and computational biology involve the analysis of biological data, in particular DNA, RNA and protein sequences. Bioinformatics has grown explosively since the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology.

Analyzing biological data to produce meaningful information involves writing and running software programs that use algorithms from graph theory, artificial intelligence, soft computing, data mining, image processing and computer simulation. These algorithms in turn depend on theoretical foundations such as discrete mathematics, control theory, systems theory, information theory and statistics.

DNA Sequencing

Before sequences can be analyzed, they must be obtained from a data bank such as GenBank. DNA sequencing is still a non-trivial problem, as the raw data can be noisy or affected by weak signals. Base-calling algorithms have been developed for the various experimental approaches to DNA sequencing.

Sequence assembly

Most DNA sequencing techniques produce short sequence fragments that must be assembled to obtain complete sequences of genes or genomes. For example, the so-called shotgun sequencing technique (which was used by The Institute for Genomic Research (TIGR) to sequence the first bacterial genome, Haemophilus influenzae)[19] generates the sequences of several thousand small DNA fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The ends of these fragments overlap and, when correctly aligned by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing produces sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome, it can take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly usually contains many gaps that need to be filled later. Shotgun sequencing is the method of choice for almost all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
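The overlap-and-merge idea described above can be illustrated with a toy greedy assembler. This is only a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats); the function names overlap and greedy_assemble and the example reads are hypothetical, and real genome assemblers use far more sophisticated graph-based methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical toy reads whose ends overlap by four bases.
print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))  # ['ATGGCGTACGTTA']
```

When no pair of fragments overlaps by at least min_len bases, the sketch stops and returns several contigs, which mirrors the gaps mentioned above.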

Genome annotation

In the context of genomics, annotation is the process of marking genes and other biological features in a DNA sequence. This process needs to be automated because most genomes are too large to annotate by hand, and because there is a desire to annotate as many genomes as possible now that the rate of sequencing is no longer a bottleneck. Annotation is made possible by the fact that genes have recognizable start and stop regions, although the exact sequence found in these regions can vary from one gene to another.
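As a rough illustration of how recognizable start and stop regions can be exploited, the sketch below scans the three forward reading frames for open reading frames that begin with ATG and end at the first in-frame stop codon. It is a simplified assumption-laden example (standard genetic code, forward strand only, no introns or alternative start codons), not the method of any particular annotation system; find_orfs and its min_codons parameter are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand open reading frames.

    Coordinates are 0-based, `start` inclusive and `end` exclusive, and the
    returned sequence includes the stop codon. `min_codons` is the minimum
    number of codons before the stop codon, counting the ATG itself.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # resume looking for the next start codon
    return sorted(orfs)


# Hypothetical toy sequence containing one short reading frame.
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```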

The first description of a comprehensive genome annotation system was published in 1995 [19] by the team at The Institute for Genomic Research that performed the first complete sequencing and analysis of the genome of a free-living organism, the bacterium Haemophilus influenzae.[19] Owen White designed and built a software system to identify the genes encoding all proteins, transfer RNAs and ribosomal RNAs (and other sites) and to make initial functional assignments. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA, such as the GeneMark program trained and used to find protein-coding genes in Haemophilus influenzae, are constantly changing and improving.

To pursue the goals that the Human Genome Project left unfinished when it closed in 2003, the National Human Genome Research Institute in the United States launched a new project. The so-called ENCODE project is a collaborative data collection of the functional elements of the human genome that uses next-generation DNA sequencing technologies and genomic tiling arrays, technologies able to generate large amounts of data automatically at a dramatically reduced cost per base but with the same accuracy (base call error) and fidelity (assembly error).

Computational evolutionary biology

Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Computer science has assisted evolutionary biologists by enabling researchers to:

track the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone,

compare entire genomes, which allows the study of more complex evolutionary events, such as gene duplication, horizontal gene transfer and the prediction of factors important in bacterial speciation,

build complex computational population genetics models to predict the outcome of the system over time,[20]

track and share information on an increasing number of species and organisms.
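One simple way to measure changes in DNA between two organisms is to count the fraction of differing sites in a pair of aligned sequences (the p-distance) and then correct it for multiple substitutions at the same site. The sketch below uses the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), as one assumed choice of substitution model; the function names and the example sequences are hypothetical.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x != "-" and y != "-"  # skip alignment gaps
    ]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences that differ at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the raw p-distance, because some sites may have changed more than once.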

Future work endeavours to reconstruct the now more complex tree of life.

The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are not necessarily related.

Comparative genomics

The core of comparative genome analysis is establishing the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A variety of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion.[21] Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, which often lead to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who draw on a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.

Many of these studies are based on the detection of sequence homology to assign sequences to protein families.[22]

Pan genomics

Pan genomics is a concept introduced in 2005 by Tettelin and Medini that eventually took root in bioinformatics. The pan genome is the complete gene repertoire of a particular taxonomic group: although initially applied to closely related strains of a species, it can be applied in a wider context such as genus, phylum, etc. It is divided into two parts: the core genome, a set of genes common to all genomes under study (often housekeeping genes vital for survival), and the dispensable/flexible genome, a set of genes present in only one or some of the genomes under study. The bioinformatics tool BPGA can be used to characterize the pan genome of bacterial species.[23]
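The split into core and dispensable genes can be expressed directly with set operations. The following is a minimal sketch that assumes each genome has already been reduced to a set of gene identifiers (in practice, deciding which genes are "the same" across genomes requires homology clustering); the function name pan_genome and the strain contents are made up for illustration and are unrelated to how BPGA works internally.

```python
def pan_genome(genomes: dict[str, set[str]]) -> dict[str, set[str]]:
    """Split the gene repertoire of a group of genomes into its parts.

    `genomes` maps a genome name to the set of gene identifiers it carries.
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)  # every gene seen in any genome
    core = set.intersection(*gene_sets) if gene_sets else set()  # genes in all genomes
    return {"pan": pan, "core": core, "dispensable": pan - core}


# Hypothetical gene content for three strains.
strains = {
    "strain_A": {"gyrA", "recA", "rpoB", "blaTEM"},
    "strain_B": {"gyrA", "recA", "rpoB"},
    "strain_C": {"gyrA", "recA", "rpoB", "tetA"},
}
result = pan_genome(strains)
print(sorted(result["core"]))         # ['gyrA', 'recA', 'rpoB']
print(sorted(result["dispensable"]))  # ['blaTEM', 'tetA']
```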

Genetics of disease

With the advent of next-generation sequencing, we are obtaining enough sequence data to map the genes of complex diseases such as infertility,[24] breast cancer[25] or Alzheimer's disease.[26] Genome-wide association studies are a useful approach for pinpointing the mutations responsible for such complex diseases.[27] Through these studies, thousands of DNA variants associated with similar diseases and traits have been identified.[28] Furthermore, the possibility of using genes for prognosis, diagnosis or treatment is one of the most essential applications. Many studies discuss both promising ways to choose the genes to use and the problems and pitfalls of using genes to predict the presence or prognosis of disease.[29]
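At a single variant, the basic calculation behind an association study can be sketched as a chi-square test on a 2x2 table of allele counts in cases and controls. This is only an illustrative sketch under simplifying assumptions (one variant, no correction for multiple testing, population structure or covariates); the function name allelic_association and the counts are hypothetical.

```python
import math


def allelic_association(case_alt: int, case_ref: int,
                        ctrl_alt: int, ctrl_ref: int) -> tuple[float, float, float]:
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi-square statistic, p-value, odds ratio).
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return chi2, p_value, odds_ratio


# Hypothetical allele counts: the alternate allele is more common in cases.
chi2, p_value, odds_ratio = allelic_association(60, 40, 40, 60)
print(chi2, p_value, odds_ratio)
```

In a genome-wide study this test would be repeated for a very large number of variants, so the significance threshold has to be made much stricter than for a single test.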

In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Extensive sequencing efforts are used to identify previously unknown point mutations in a variety of cancer genes. Bioinformaticians continue to produce specialized automated systems to manage the volume of sequence data produced, and they create new algorithms and software to compare sequencing results with the growing collection of human genome sequences and germline polymorphisms. New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single nucleotide polymorphism arrays to detect known point mutations. Together, these detection methods measure several hundred thousand sites across the genome, and when used in high throughput to measure thousands of samples, they generate terabytes of data per experiment. Here, too, the enormous amounts and new types of data create new opportunities for bioinformaticians. The data are often found to contain considerable variability, or noise, and therefore hidden Markov model and change-point analysis methods are being developed to infer real copy number changes.
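To make the hidden Markov model idea concrete, the sketch below decodes the most likely sequence of copy number states (loss, neutral, gain) from noisy log2 intensity ratios using the Viterbi algorithm. All model parameters (state means, noise level, probability of staying in a state) are assumptions chosen for illustration rather than values estimated from data, and the function names are hypothetical.

```python
import math

STATES = ("loss", "neutral", "gain")
# Assumed mean log2 ratio per state: one copy lost, two copies, three copies.
MEANS = {"loss": -1.0, "neutral": 0.0, "gain": 0.58}
SIGMA = 0.3   # assumed standard deviation of the measurement noise
STAY = 0.9    # assumed probability of remaining in the same state


def log_gauss(x: float, mean: float, sigma: float) -> float:
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)


def viterbi_copy_number(log_ratios: list[float]) -> list[str]:
    """Most likely state path under the assumed three-state model."""
    if not log_ratios:
        return []
    log_stay = math.log(STAY)
    log_move = math.log((1 - STAY) / (len(STATES) - 1))
    log_init = math.log(1 / len(STATES))
    score = {s: log_init + log_gauss(log_ratios[0], MEANS[s], SIGMA) for s in STATES}
    back = []  # back[t][s] is the best predecessor of state s at step t + 1
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + (log_stay if p == s else log_move))
            trans = log_stay if prev == s else log_move
            new_score[s] = score[prev] + trans + log_gauss(x, MEANS[s], SIGMA)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical noisy probe measurements along one chromosome.
print(viterbi_copy_number([0.0, 0.1, -0.1, 0.6, 0.5, 0.7, 0.6, -1.0, -0.9, -1.1]))
```

The high probability of staying in the same state is what smooths over isolated noisy probes: a single outlier is usually cheaper to explain as noise than as two state changes.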