New Bolotie method can handle the barrage of sequencing data that posed a problem for conventional recombination algorithms.

Humanity has faced many pandemics throughout history, but never before have we tackled an active pandemic while so well equipped with genetic technology. In fact, when SARS-CoV-2 struck, the genetics community produced so much sequencing data so quickly that existing software couldn’t handle it all. To study how the viral genome was changing, researchers at Johns Hopkins University created new software capable of processing tens of thousands of individual genome sequences. Their results, published in GENETICS, identified 225 likely instances of recombination, a type of genetic swapping between different variants.

“This pandemic has shown us that genomics and genetics can play a very deciding role in how quickly we can understand a problem that’s been unknown to us,” says Ales Varabyou, an author on the study.

As COVID-19 spread, researchers around the world sought to understand how the SARS-CoV-2 virus was evolving as it moved through the population. Documenting how the virus changed over time could shed light not only on where it might have come from, but also on how best to stop it.

“People were searching to see if new variants emerged that would have an effect on susceptibility or vaccine efficacy,” says Christopher Pockrandt, another author on the study.

How variants arise

Viruses reproduce inside the cells of an infected individual before being transmitted to a new host. As the cell makes new copies of the viral genome, errors regularly creep in. Most of these changes don’t make much difference in the ability of the virus to infect people, so they don’t attract our attention. Some changes make the virus less efficient at spreading, and those variants tend to die out. Other changes catch on, however, and these may soon become a significant fraction of the samples taken from patients. This is where new named variants come from.

Even within a single named variant, like “beta” or “delta,” there’s a considerable amount of variation. “Pick two random people on the street that are infected with SARS-CoV-2,” says Pockrandt. “Even if they have the same variant, they most likely will not have exactly the same genomic sequence.” If an unlucky person happens to get infected by two different versions of SARS-CoV-2 at the same time, the cellular machinery that’s cranking out copies of the virus can mix up the two viruses and create an entirely new viral sequence containing sections from each of the two originals. This type of change is called recombination, and that’s what the Johns Hopkins team was searching for.

So many genomes, so little time

Existing software can scan viral genome sequences looking for the telltale signs of a recombination event. But in the past, the genome sequences being scanned had major differences. Often the viral samples were collected months or years apart, allowing more changes to accumulate. In the case of SARS-CoV-2, the genetics community swung into action immediately, and the widespread availability of inexpensive, rapid sequencing meant that new genomes were being produced daily.

“We had so many SARS-CoV-2 genomes sequenced in such a short time that when we wrote up the paper [in October 2020], there were already 300,000 genomes sequenced and assembled,” says Pockrandt. Besides the sheer volume of data to be processed, the researchers faced the problem that many of these sequences were very similar to one another. “Let’s say you sample all the positive people in a region and sequence the SARS-CoV-2 samples. If you do that again two weeks later, there will be very little change,” Pockrandt says. “A lot of the software relies on much higher sequence divergence to be able to detect those recombination events.”

That’s why the researchers decided to write their own software.

Creating Bolotie

A recombination event involves two “parent” viruses that get remixed into a third, “offspring” virus. The purpose of the software is to compare all the existing sequences and establish relationships between them—like a family tree. Previous approaches analyzed triads, comparing sequences to see whether two variants could be the “parents” of the third.

“If you have 100,000 sequences, there’s no way you will ever be able to process all this data,” says Pockrandt. “It works out to something like a quadrillion instances you’d have to check. So that was the challenge.”
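The scale Pockrandt describes can be checked with a few lines of arithmetic. The sketch below is only an illustration: the article does not say exactly how a triad is counted, so it shows three plausible conventions (the function name and dictionary keys are hypothetical). All three land within an order of magnitude of a quadrillion.

```python
from math import comb

def triad_counts(n_genomes):
    """Count three-genome comparisons under three counting conventions."""
    unordered = comb(n_genomes, 3)        # distinct sets of three genomes
    with_offspring_role = 3 * unordered   # any of the three could be the offspring
    ordered = n_genomes ** 3              # loose upper bound: all ordered triples
    return {
        "unordered": unordered,
        "with_offspring_role": with_offspring_role,
        "ordered": ordered,
    }

for name, count in triad_counts(100_000).items():
    print(f"{name}: {count:.2e}")
```

Because every convention grows roughly with the cube of the number of genomes, doubling the dataset multiplies the work by about eight.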

Bolotie takes a different approach. A genome consists of a string of nucleotide bases, and each location can be one of four possible bases. The genomes can be grouped into four “clades” based on which nucleotide base is present at certain positions in the genome. Members of a given clade are all slightly different, but the differences within a clade are smaller than the differences between clades. Instead of analyzing every possible set of three genomes, Bolotie searches for recombination events between clades.

“We simplify the problem a little bit,” Varabyou says. “We look for interclade recombinations. We never say that a recombination happened specifically between this genome and that genome, but that it happened between some two genomes of these two clades.”
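To make the clade-level idea concrete, here is a toy sketch in Python. It is not Bolotie's actual algorithm, which the article does not describe in detail. The consensus sequences, the `min_gain` threshold, and all function names are invented for illustration, and the genomes are assumed to be already aligned and of equal length. The sketch asks whether a genome is explained better by one clade, or by one clade on the left of a breakpoint and a different clade on the right.

```python
from itertools import permutations

# Hypothetical consensus sequence for each clade (toy data, not real genomes).
CLADE_CONSENSUS = {
    "clade1": "ACGTACGTACGT",
    "clade2": "ATGTCCGTTCGA",
    "clade3": "ACGAACTTACCT",
}

# Positions where the clades disagree; only these help tell clades apart.
INFORMATIVE = [
    i for i, column in enumerate(zip(*CLADE_CONSENSUS.values()))
    if len(set(column)) > 1
]

def match_vector(genome, clade):
    """1 where the genome agrees with the clade at an informative position."""
    consensus = CLADE_CONSENSUS[clade]
    return [int(genome[i] == consensus[i]) for i in INFORMATIVE]

def classify(genome, min_gain=2):
    """Label a genome as single-clade or as a candidate inter-clade recombinant."""
    single = {clade: sum(match_vector(genome, clade)) for clade in CLADE_CONSENSUS}
    best_clade = max(single, key=single.get)

    best_split = None  # (score, left clade, right clade)
    for left, right in permutations(CLADE_CONSENSUS, 2):
        left_hits = match_vector(genome, left)
        right_hits = match_vector(genome, right)
        for breakpoint in range(1, len(INFORMATIVE)):
            score = sum(left_hits[:breakpoint]) + sum(right_hits[breakpoint:])
            if best_split is None or score > best_split[0]:
                best_split = (score, left, right)

    # Flag only if two clades explain the genome clearly better than one.
    if best_split[0] - single[best_clade] >= min_gain:
        return ("candidate recombinant", best_split[1], best_split[2])
    return ("single clade", best_clade)

# A toy genome built from the left half of clade1 and the right half of clade2.
mosaic = CLADE_CONSENSUS["clade1"][:6] + CLADE_CONSENSUS["clade2"][6:]
print(classify(mosaic))
```

As in Varabyou's description, the result names a pair of clades, never a specific pair of parent genomes. The work also scales with the number of genomes times the number of clade pairs, rather than with the number of genome triads.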

Tracking the current pandemic, and future ones

After analyzing the 300,000 viral genomes, the team identified 225 potential recombination events, suggesting that recombination in SARS-CoV-2 is more common than previously reported. Still, most of the recombinant viruses did not establish a strong presence in the population. “One of the important findings is that recombination is not a very widespread event,” says Varabyou. “Over time, the number of recombinations did not start suddenly increasing exponentially, but stayed pretty constant at a low level.”

The new software will help detect future recombination events and could be a valuable tool for tracking the spread of different variants. It could also improve tracking of other disease-causing viruses, such as HIV or influenza.

“I hope this phenomenon of mass sequencing and data availability will only grow with time,” Varabyou says. He points out that although the HIV pandemic has been ongoing for 40 years, nowhere near as many HIV genomes have been sequenced as for SARS-CoV-2. “This tool was designed to answer this one specific question early on, but it’s a very interesting direction to take once things settle down and we start thinking about the future.”


Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie

Ales Varabyou, Christopher Pockrandt, Steven L Salzberg, Mihaela Pertea

GENETICS, Volume 218, Issue 3, July 2021, iyab074

Caroline Seydel is an independent science writer based in Los Angeles, CA. She has an MS in genetics from Stanford University. Her writing has appeared in Nature Biotechnology, Genetic Engineering News, and
