Bioinformatics—a scientific discipline that aims to curate, analyze, and distribute biological data—is facing a crisis: a deluge of data is overwhelming laboratories and existing infrastructure.
Biologists, especially those working in genome sciences, have recognized the importance of big data: in just two decades, the number of genome sequences has increased 10,000-fold (from 180,000 to 1.8 billion genomes) and the number of sequenced bases has increased 25,000-fold (from 640 million to 16 trillion bases). Such a rich collection of genome sequences rivals the famed Library of Alexandria, which held roughly half a million scrolls and was established in the third century BCE.
Similar to the ancient Library of Alexandria, mystery shrouds the genomic library of today. Specifically, unraveling how these 1.8 billion genomes encode organismal complexity and its components, even in “simple” organisms like bacteria, remains a grand challenge. So, what stops us from understanding the link between the data we generate and their biological meaning? One major hurdle is both a challenge and an opportunity.
The necessary infrastructure, supercomputers and widely distributed analytical pipelines for processing ever-increasing datasets, is lacking. As the number of available genomes continues to increase, even as this article is being read, scalable solutions are needed. Cloud-based platforms promise to overcome this hurdle and usher in a new era of understanding in the biosciences. Here, we provide an overview of the major hurdles the field faces and describe how cloud-based infrastructure may be the silver lining for a rapidly growing discipline.
The data deluge
Biology generates massive amounts of data every year: almost 40 petabytes, roughly equivalent to the entire written works of humankind, in all languages, since the beginning of recorded history. Nor are these data simple text files; the data generated in biological studies are diverse. There are genome sequences, transcript and protein abundances, growth curves, species presence and abundance in specific environments, and imaging data, to name just a few.
One major challenge is that heterogeneous data types are often stored in different formats, require different suites of software for processing and analysis, generate different output file formats, and may require additional software for creating human-interpretable representations of the data. The number of data types (and amount of data) will continue to rise with the advent of new technologies. Curating, storing, and distributing colossal datasets in diverse formats will require innovative solutions.
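To make the heterogeneity concrete, here is a toy sketch (the file contents, contig names, and gene names are made up) showing that even two of the simplest data types, a genome sequence in FASTA format and a transcript-abundance table in CSV, already call for different parsing logic; a real study multiplies this across dozens of formats and software suites.

```python
# Toy illustration: two small, hypothetical datasets in different formats.
import csv
import io

fasta_text = """>contig_1 example genome fragment
ACGTACGTGGCCTTAA
>contig_2
TTGACCGGAA
"""

abundance_csv = """gene,sample_1,sample_2
geneA,152,98
geneB,17,43
"""

# FASTA: headers start with '>', sequences follow on subsequent lines.
sequences = {}
current_id = None
for line in fasta_text.splitlines():
    if line.startswith(">"):
        current_id = line[1:].split()[0]
        sequences[current_id] = ""
    elif current_id is not None:
        sequences[current_id] += line.strip()

# CSV: tabular records with named columns, parsed with an entirely different tool.
abundances = list(csv.DictReader(io.StringIO(abundance_csv)))

print({name: len(seq) for name, seq in sequences.items()})  # {'contig_1': 16, 'contig_2': 10}
print(abundances[0]["gene"], abundances[0]["sample_1"])      # geneA 152
```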
One solution is collaboration between academic institutions and bioindustry. Specifically, the latter may have established computational infrastructure that exceeds what is available to some academic groups; for example, the Broad Institute of MIT and Harvard uses cloud-based platforms to distribute data generated by diverse research consortia.
Cloud analytics
In the future, all analysis and interpretation of biological data will be done using cloud analytics. Because cloud resources vastly exceed those of the personal computer, desktops and laptops are shifting from analysis hubs to portals that link researchers to cloud architectures. For academic labs, this will drive down hardware costs, because a personal computer will only need enough computing power to maintain a stable connection to the cloud. That means inexpensive laptops, tablets, and even Raspberry Pis can act as portals to the cloud. Academic labs will no longer face other costs and headaches, such as maintaining and managing their own computing infrastructure.
Major research institutions have already migrated to cloud-based architectures. For example, the European Bioinformatics Institute uses Amazon Web Services’ Elastic Compute Cloud. Following increased demand, there are now numerous providers of cloud-based platforms: Rackspace, VMware, IBM, and Microsoft, among others. With the threat of slashed budgets for scientific research, these services are likely to become even more prominent in academia.
Overcoming (bioinformatics) supply chain issues
Despite advantages in data storage and analytic capacity, a major complexity remains: developing the toolkits and analytical workflows needed to carry out analyses. Let's say a cancer biologist wants to investigate the genomic and transcriptomic signatures associated with pancreatic cancer. The researcher likely wants to automate the complete analysis, creating an end-to-end bioinformatic workflow that turns raw data into meaningful results. Doing so requires multiple steps and the handling of diverse data formats. Suppose the researcher completed this herculean task by developing in-house software and a data management system. It would be an amazing feat, but how would it help a biologist studying, for example, colon cancer using a similar analysis for their experiment?

This raises an issue of scale. Emailing codebases and describing workflows by hand can work for a few people, not many. However, platforms like GitHub give developers a cloud-based way to distribute code, and distribution hubs like PyPI, Bioconda, and Bioconductor further help to disseminate software packages across the globe. User-friendly platforms like Galaxy, the CLC Workbench from Qiagen, and the console from LatchBio help researchers seamlessly stitch together software and more easily share workflows, as sketched below. Taken together, these advances make it easier for scientists to share their cloud-based work, leading to lower lab costs and a more accessible field of bioinformatics.
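As a sketch of what "end-to-end" means here, consider a minimal, hypothetical pipeline written as composable Python functions. The step names and logic are illustrative stand-ins, not the tools a real cancer-genomics workflow would use, but packaging such a module and publishing it through a hub like PyPI or Bioconda is precisely what would let a colon-cancer lab reuse a pancreatic-cancer lab's work unchanged.

```python
# A minimal, hypothetical pipeline sketch: the step names and logic are toy
# stand-ins, not real genomics tools. The point is the structure, in which
# each step is a reusable function that another lab could import unchanged.


def quality_filter(raw_reads, min_length=50):
    """Toy stand-in for a read-filtering step (e.g., trimming and length filtering)."""
    return [read for read in raw_reads if len(read) >= min_length]


def align(reads, reference):
    """Toy stand-in for an alignment step; a real workflow would call an aligner."""
    return [{"read": read, "aligned_to": reference} for read in reads]


def summarize(alignments):
    """Toy stand-in for a downstream summary (e.g., per-gene expression counts)."""
    return {"n_aligned": len(alignments)}


def run_pipeline(raw_reads, reference):
    """Chain the steps end to end. Publishing this module on PyPI or Bioconda
    is what would let another lab rerun the same analysis on their own data."""
    return summarize(align(quality_filter(raw_reads), reference))


if __name__ == "__main__":
    demo_reads = ["ACGT" * 20, "ACG"]        # one passing read, one too short
    print(run_pipeline(demo_reads, "chr1"))  # {'n_aligned': 1}
```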
A bright future or dark days?
In the future, bioinformatics workflows will be available to academic and citizen scientists alike. With intuitively designed platforms, students in high school, or even elementary school, could conduct bioinformatic research. Imagine that: middle-school science fairs could feature analyses of terabytes of data. That is amazing! For readers skeptical of these claims, we urge you to consider the history of the microscope. In its early days, microscopy required niche skill sets in lens manufacturing and engineering, making microscopes a rare commodity. Since then, microscope manufacturing has improved, lowering costs and allowing the masses to become microscopists. Case in point: a Stanford research group invented the Foldscope, a paper microscope that magnifies 140x and costs less than a dollar. Bioinformatics is in the midst of the same revolution. With the appropriate distribution of tools and access portals to cloud-based infrastructures, everyone in the world can become a bioinformatician. Widely accessible resources, however, will pose new challenges.
In summary, as bioinformatics transitions to cloud-based infrastructures, researchers across the globe will find themselves empowered to conduct experiments. Without careful consideration of these challenges, bioinformatics will stagnate or fail to uphold the tenets of scientific rigor and integrity. With such consideration, however, the field can steer the ongoing revolution toward an exciting and productive era of cloud-based computing, broadening the accessibility of bioinformatics research. The future of bioinformatics research is in the cloud. And behind the clouds, the sun is shining.
Jacob L. Steenwyk is a post-doctoral fellow in the laboratory of Howard Hughes Medical Institute Investigator Dr. Nicole King at the University of California, Berkeley. He studies genome function and evolution in animals and fungi and develops software for the life sciences.
Kyle Giffin is the co-founder and COO of LatchBio, a cloud infrastructure platform used by biotech companies and labs across the world. Previously at Berkeley, he studied computational & cognitive neuroscience, data science, and entrepreneurship, before leaving school to start Latch.