Graduate student and postdoctoral leaders from the Early Career Scientist Committees of the GSA.

Written by members of the GSA Early Career Scientist Communication and Outreach Subcommittee: Angel F. Cisneros Caballero, Université Laval; Adelita Mendoza, PhD, Washington University; Narjes Alfuraiji, University of Manchester; Anna Bajur, Max Planck Institute of Molecular Cell Biology and Genetics

During the current global pandemic, public attention is increasingly falling on the process of drug discovery and development. How exactly do we find new treatments? And what does it take to bring them to the clinic? One powerful tool in this process that often escapes notice is bioinformatics—the use of computational resources to answer biological questions.

Exponential increases in computational power have revolutionized the way we do science. Over time, this has created entirely new fields of research, since we can now analyze more data efficiently and explore more complex algorithms and models1. Bioinformatics is one of the fields made possible by this technological achievement, and it has been critical for many recent scientific advances2.

Bioinformatics comprises two interdisciplinary sub-fields that interface with computer science, mathematics, and biology. One is the research and development of the tools that scientists need to build the models modern biology requires. The other is computational biology, which is dedicated to answering basic biological questions.

Bioinformatics is not just an academic field; it has many clinical applications. For example, we now have the technology to sequence genomes and identify genes involved in diseases, such as cancers. However, we can only do it accurately by looking at short segments at a time. Sequencing an organism’s genome becomes like a giant puzzle with thousands of pieces, and only bioinformatic methods allow us to assemble the pieces. 
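To make the puzzle analogy concrete, here is a minimal sketch of one simple assembly idea: repeatedly merge the two fragments that overlap the most. The reads and the sequence they come from are made up for illustration, and the function names are our own. Real assemblers are far more sophisticated, since they must cope with sequencing errors, repeated regions, and millions of reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: keep the pieces as separate contigs
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Made-up short reads taken from the made-up sequence "ATGGCGTACGTTAGC".
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these three toy reads, the pieces chain together into a single contig that matches the sequence they were drawn from.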

Bioinformatics can also be used to guide drug design experiments and maximize the chances of finding active molecules. This new knowledge can eventually be used to develop therapies and vaccines to save human lives. Here, we will look at some examples of how we can use bioinformatics to discover molecular signposts for particular biological processes. These signs are known as biomarkers, and they are important in all types of clinical research. We will then take a closer look at how bioinformatics can use this information to come up with an application, such as a drug.  

Biomarkers of regeneration

Humans do not have the ability to regenerate limbs after amputation, but certain animals have this extraordinary ability, including planarian flatworms and axolotls. To understand these strong regenerative capabilities, scientists study fruit flies, flatworms, axolotls, and zebrafish. These species are powerful model systems to study tissue regeneration after amputation or damage. As in most biological fields, modern-day bioinformatics techniques are playing a key role in understanding how the genome responds to injury. 

Regeneration requires a real-time genomic response, which can be studied by looking at which genes are activated or repressed in individual cells with single-cell RNA sequencing. A recent study from Fincher et al. identified flatworm genes that were active after injury by analyzing all messenger RNA (the transcriptome) of individual lineage precursor cells with Drop-seq. This technique isolates single cells in droplets so that they can be separately analyzed and compared. This method is so powerful that researchers were able to detect the transcriptome from cell types with frequencies as low as ~10 cells per animal3.

Bioinformatic analyses allowed the cells to be clustered by gene expression into groups corresponding to different tissue types, which then allowed the researchers to build an atlas of the genes expressed after injury.
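The idea of grouping cells by their expression profiles can be sketched with a basic k-means clustering. This is only an illustration of the principle, not the method used in the study: the expression values, the three marker genes, and the choice of starting centroids are all invented, and real single-cell pipelines normalize the data and typically use more elaborate clustering approaches.

```python
import math


def distance(u, v):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def mean_profile(profiles):
    """Average expression of each gene across a group of cells."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def kmeans(cells, initial_centroids, n_iter=10):
    """Assign each cell to its nearest centroid, then update the centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda k: distance(cell, centroids[k]))
            for cell in cells
        ]
        for k in range(len(centroids)):
            members = [c for c, label in zip(cells, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = mean_profile(members)
    return labels, centroids


# Hypothetical expression of three marker genes (columns) in six cells (rows).
cells = [
    [9.0, 0.5, 0.2],
    [8.5, 0.8, 0.1],
    [0.3, 7.9, 0.4],
    [0.6, 8.4, 0.2],
    [8.8, 0.2, 0.5],
    [0.1, 8.1, 0.3],
]
# Assumption: start from two cells that look different from each other.
labels, centroids = kmeans(cells, [cells[0], cells[2]])
print(labels)
```

Cells that share a label have similar expression profiles, which is the starting point for deciding which cell type each cluster represents.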

In another example, Vizcaya-Molina et al. identified novel enhancers that regulate gene activation during different phases of recovery from injury in developing fruit flies. The researchers looked for accessible regions in the DNA (which are associated with higher gene activation) using a technique called ATAC sequencing. They confirmed that the expression of some genes changed in response to injury, and they then wanted to know if those genes had common functions. With the help of bioinformatic databases, they found that many of those genes belonged to signaling pathways involved in cell growth and differentiation4.
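One common way to ask whether a set of genes is enriched for a particular pathway is a hypergeometric test: how likely is it to see at least this many pathway genes in the set purely by chance? The sketch below shows the calculation. It is an assumption that a test of this kind applies here, since the text above does not say which statistics the authors used, and all the numbers are invented.

```python
from math import comb


def enrichment_p_value(n_genome, n_pathway, n_selected, n_overlap):
    """Probability of at least `n_overlap` pathway genes among the selected genes."""
    total = comb(n_genome, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_genome - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10,000 genes in the genome, 100 annotated to a pathway,
# 200 genes responding to injury, 12 of which belong to the pathway.
p_value = enrichment_p_value(10_000, 100, 200, 12)
print(p_value)
```

By chance, only about two of the 200 selected genes would be expected to fall in this pathway, so an overlap of 12 gives a very small p-value.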

A study by Goldman et al. uncovered the genetic regulatory program that responds to injured cardiomyocytes in zebrafish. Inaccessible regions of DNA are tightly wrapped around proteins called histones. The researchers looked at profiles of H3.3, a replacement histone that indicates transcriptional accessibility, to uncover gene regulatory elements involved in heart regeneration. This method allowed them to identify genes that were upregulated in response to injury. Later, during cardiomyocyte regeneration, they found an enrichment of enhancer elements that were “open” for transcription and then identified the specific sequence involved in regeneration5.

These examples show that bioinformatics helps to unlock the mysteries of the genes that regulate regeneration after injury. Bioinformatic techniques can be used to monitor the real-time genomic response in individual cells and to probe accessible regions of the DNA in several organisms that are capable of regeneration. The greater computational power that bioinformatics provides will allow scientists to ask new questions that are important to the field of regeneration.

Biomarkers of virulence factors

Bioinformatic tools are also important in finding biomarkers of infectious disease virulence, which can be appealing drug targets. For instance, we can look for specific genes that drive the pathogenicity of a given microorganism, such as yeast. To do this, we can design strains that lack particular genes and evaluate whether this makes them less pathogenic. Testing a large number of yeast strains is typically performed using competitive growth methodologies6. For example, Han et al. evaluated the growth of each mutant strain under controlled conditions of direct competition with other mutants, thus reducing the time and cost associated with screening each one individually. This enabled screening of a large number of strains to identify a drug target.

Functional genomics has also been used to identify drug targets in the pathogenic fungus Candida albicans with the C. albicans fitness test (CaFT). In this test, each isolate is assigned a unique identifier (barcode) that can be tracked computationally to detect differences in fitness among heterozygote isolates. This enabled the researchers to screen for loss of gene function in the presence of antifungal agents, from which they identified the mechanism of action of novel compounds7.

Competitive fitness profiling was also used to evaluate the relative fitness of large pools of Aspergillus fumigatus mutants, using a non-genetically barcoded library of mutants, to identify those that are involved in virulence8. As a result, the researchers reduced the total number of animals that are usually required to perform virulence screening. Tn-Seq is another technique, used to assess the contribution of genes to fitness in Streptococcus pneumoniae. However, instead of deleting the gene, Tn-Seq inserts additional DNA within the gene9.

As with the barcoded deletion strains, changes in mutant frequency can be used to compare the fitness of the different mutants. By looking at which mutants grow most poorly, scientists can identify which genes are the most essential and consider them as potential drug targets. This is of particular interest in drug discovery programs, since identifying the genes that are responsible for or involved in pathogenicity is crucial to designing and developing a novel therapy.
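The comparison of mutant frequencies can be sketched as follows: count how often each mutant's barcode is seen before and after competitive growth, then compute the log2 change in its share of the pool. The mutant names and read counts are invented, and the pseudocount is an assumption added so that mutants that vanish entirely do not cause a division by zero. Published pipelines use more careful normalization and statistics.

```python
import math


def relative_fitness(before, after, pseudocount=1):
    """log2 change in each mutant's share of the pool after competition."""
    total_before = sum(before[m] + pseudocount for m in before)
    total_after = sum(after.get(m, 0) + pseudocount for m in before)
    scores = {}
    for mutant in before:
        freq_before = (before[mutant] + pseudocount) / total_before
        freq_after = (after.get(mutant, 0) + pseudocount) / total_after
        scores[mutant] = math.log2(freq_after / freq_before)
    return scores


# Hypothetical barcode read counts for four mutants in a pooled culture.
before = {"mutant_A": 1000, "mutant_B": 1000, "mutant_C": 1000, "mutant_D": 1000}
after = {"mutant_A": 1500, "mutant_B": 1400, "mutant_C": 1050, "mutant_D": 50}

scores = relative_fitness(before, after)
# Most depleted mutants first: these point to genes needed for growth.
ranked = sorted(scores, key=scores.get)
print(ranked)
```

In this toy pool, mutant_D almost disappears during competition, so the gene it lacks would be the first candidate to examine as a potential drug target.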

Drug design

Once we have found the optimal drug target, we can turn to bioinformatics again to help us find a drug for it. A classic approach is to generate millions of molecules experimentally, test them, and record the ones that have an effect. However, this method is very time-consuming and resource-intensive, while the number of effective molecules can be low. Instead, we can use our models of molecular interactions to test molecules computationally and only test experimentally the ones that are predicted to be effective. This allows us to narrow down the set of molecules to test in an experiment while maximizing the chance of success. Indeed, Doman et al. showed that computational tests increase the efficiency of these experiments. When they screened a large library of molecules, only 0.02% of their tests were positive. However, when they used a computational analysis to evaluate only the ones predicted to be effective, 35% of their tests were positive10. Thus, virtual screening saves considerable time and money by reducing the number of assays while increasing the hit rate. In fact, there are several examples of drugs found through computational screening that have been approved by the FDA. These include dorzolamide to treat glaucoma, captopril to treat hypertension, and saquinavir to treat HIV11. Moreover, these approaches are being used in the context of the current COVID-19 pandemic to find potential new treatments.
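A quick calculation shows what the two hit rates quoted above for Doman et al. mean in practice. The variable names are our own, and the figures are the rounded percentages given in the text.

```python
# Hit rates as quoted in the text (rounded percentages).
random_hit_rate = 0.02 / 100   # screening the whole library
docking_hit_rate = 35 / 100    # testing only computationally selected molecules

# How many times more likely a tested molecule is to be a hit.
fold_enrichment = docking_hit_rate / random_hit_rate

# Average number of assays needed to find one hit with each strategy.
assays_per_hit_random = 1 / random_hit_rate
assays_per_hit_docking = 1 / docking_hit_rate

print(round(fold_enrichment), round(assays_per_hit_random), round(assays_per_hit_docking, 1))
```

With these rounded figures, a computationally selected molecule is roughly 1,750 times more likely to be a hit, which corresponds to about 3 assays per hit instead of about 5,000.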

All potential drugs should be subjected to multiple stages of evaluation to assess their safety, first in preclinical tests with model organisms and then in clinical studies in humans. Despite the promise of computational methods to help identify active molecules, most candidate molecules fail these clinical studies because of unwanted side effects. Thus, one of the newest endeavors in the field is the use of machine learning to predict how likely a given molecule is to be toxic. Machine learning is a set of tools that find trends in known data to predict the results of future observations12.

Currently, these methods look at databases of molecules to extract their physical properties and the health concerns associated with them. Then, they build models that link those properties to health concerns to derive general rules. These approaches have been very successful, with some models being able to identify toxic compounds with up to 95% accuracy.
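As a toy illustration of deriving a general rule from known data, the sketch below searches for the single property threshold that best separates toxic from non-toxic molecules in a small training set. Everything here is hypothetical: the two descriptors, their values, and the toxicity labels are invented, and real models draw on many more properties and far larger curated databases.

```python
def fit_rule(samples, labels):
    """Find the single (feature, threshold, direction) rule with the fewest errors."""
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            for toxic_if_above in (True, False):
                predictions = [
                    (s[feature] >= threshold) == toxic_if_above for s in samples
                ]
                errors = sum(p != y for p, y in zip(predictions, labels))
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, toxic_if_above)
    return best[1:]


def predict(rule, sample):
    """Apply a learned rule to a new molecule. True means predicted toxic."""
    feature, threshold, toxic_if_above = rule
    return (sample[feature] >= threshold) == toxic_if_above


# Invented training data: [descriptor_1, descriptor_2] for six known molecules,
# with made-up labels (True = toxic).
samples = [
    [180.0, 1.2], [250.0, 0.8], [310.0, 2.1],
    [420.0, 4.9], [290.0, 5.3], [200.0, 4.6],
]
labels = [False, False, False, True, True, True]

rule = fit_rule(samples, labels)
print(rule)
print(predict(rule, [350.0, 5.0]), predict(rule, [350.0, 1.0]))
```

In this invented data set, the second descriptor separates the two groups perfectly, so the learned rule flags any new molecule whose value for it is at or above the threshold.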

Gaining access to greater computational power has allowed us to pursue new questions and develop further techniques to address them. This has had a notable impact on diverse fields, from basic science to applications in the clinic. The future of bioinformatics will certainly be exciting, as it will likely produce more and more results that have an impact on our daily lives.



  1. Edgar, T. W. & Manz, D. O. Research Methods for Cyber Security. (Syngress, 2017).
  2. Gauthier, J., Vincent, A. T., Charette, S. J. & Derome, N. A brief history of bioinformatics. Brief. Bioinform. (2018). doi:10.1093/bib/bby063
  3. Fincher, C. T., Wurtzel, O., de Hoog, T., Kravarik, K. M. & Reddien, P. W. Cell type transcriptome atlas for the planarian Schmidtea mediterranea. Science 360, (2018).
  4. Vizcaya-Molina, E. et al. Damage-responsive elements in Drosophila regeneration. Genome Res. 28, 1852–1866 (2018).
  5. Goldman, J. A. et al. Resolving Heart Regeneration by Replacement Histone Profiling. Dev. Cell 40, 392–404.e5 (2017).
  6. Han, T. X., Xu, X.-Y., Zhang, M.-J., Peng, X. & Du, L.-L. Global fitness profiling of fission yeast deletion strains by barcode sequencing. Genome Biol. 11, R60 (2010).
  7. Xu, D. et al. Genome-wide fitness test and mechanism-of-action studies of inhibitory compounds in Candida albicans. PLoS Pathog. 3, e92 (2007).
  8. Macdonald, D. et al. Inducible Cell Fusion Permits Use of Competitive Fitness Profiling in the Human Pathogenic Fungus Aspergillus fumigatus. Antimicrob. Agents Chemother. 63, (2019).
  9. Solaimanpour, S., Sarmiento, F. & Mrázek, J. Tn-seq explorer: a tool for analysis of high-throughput sequencing data of transposon mutant libraries. PLoS One 10, e0126070 (2015).
  10. Doman, T. N. et al. Molecular docking and high-throughput screening for novel inhibitors of protein tyrosine phosphatase-1B. J. Med. Chem. 45, 2213–2221 (2002).
  11. Sliwoski, G., Kothiwale, S., Meiler, J. & Lowe, E. W. Computational Methods in Drug Discovery. Pharmacol. Rev. 66, 334–395 (2014).
  12. Yang, H., Sun, L., Li, W., Liu, G. & Tang, Y. In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts. Front. Chem. 6, 30 (2018).

