Today’s guest post was contributed by Caitlan Rossi, a scientific and medical writer. Her work can be found at caitlanrossi.com.
Even the most advanced genome sequencing technology faces the threat of read contamination. Samples from the wild often contain multiple organisms, including nontarget sequences that can compromise the quality of the assembly and downstream analyses. When reference-based decontamination methods don’t get the job done, how can geneticists learn to separate sequences from different sources to prevent misinterpretation?
New research in G3: Genes|Genomes|Genetics reports on what could become one of the leading tools for quality control in long-read genomic data. Author Claudia Weber uses data from the Darwin Tree of Life project, the goal of which is to sequence 70,000 eukaryotic genomes, to develop strategies to combat the computational challenges of screening long-read sequencing data for contamination.
Generally, to separate individual species from sequencing data, geneticists rely on databases containing reference genomes. But not every species has a suitable reference, especially eukaryotic organisms, and even when reference assemblies are conveniently available, many contain contaminants. With so many external factors at play, Weber showed that the key to better sequence filtering may be looking inward—to sequence composition.
Seeing that different species have heterogeneous sequence composition, Weber capitalized on these inherent differences. Using a variational autoencoder (VAE), a specialized machine-learning approach, she projected sequence composition into 2D. She also developed a technique to more effectively estimate coverage of various sequences without needing to assemble them first. Weber’s 2D representations allowed her to successfully identify and separate sequences from different sources, even when reference data were unavailable or had gaps. Some sequences stood out due to their distinct composition, making it easier to detect cobionts—symbionts, parasites, and the like—as well as contaminants. The results from this novel tool were consistent with the yield of a reference-based decontamination pipeline.
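To make "sequence composition" concrete, here is a minimal sketch of how a read could be turned into a composition profile before any projection step. It assumes the profile is a normalized count of 4-base words (tetranucleotides), with each word merged with its reverse complement. That choice, and the helper names `composition_vector` and `distance`, are illustrative assumptions rather than details taken from Weber's paper. The VAE that projects such profiles into 2D is not shown.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Merge a k-mer with its reverse complement, since strand is arbitrary."""
    return min(kmer, reverse_complement(kmer))


def composition_vector(seq, k=4):
    """Normalized canonical k-mer frequencies (hypothetical helper)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    # Fixed, sorted key order so vectors from different reads are comparable.
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    total = sum(counts.values())
    return [counts[key] / total if total else 0.0 for key in keys]


def distance(a, b):
    """Euclidean distance between two composition vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    at_rich = "ATTAATATTTAAATAT" * 20
    gc_rich = "GCCGGCGCCCGGGCGC" * 20
    same_read_other_strand = reverse_complement(at_rich)

    v_at = composition_vector(at_rich)
    v_gc = composition_vector(gc_rich)
    v_rc = composition_vector(same_read_other_strand)

    print("vector length:", len(v_at))
    print("same read, other strand:", distance(v_at, v_rc))
    print("AT-rich vs GC-rich:", distance(v_at, v_gc))
```

With k=4, each profile has 136 entries. A read and its reverse complement get identical profiles, while reads with very different base composition end up far apart. Differences of this kind are what a 2D projection can then make visible.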
The study illustrates how a composition-based approach can work across taxa, drawing information from insects, fish, green algae, and plants. While these findings are primarily based on hundreds of high-quality HiFi read sets from insects, which were not highly contaminated, the method can still be extrapolated to more complex samples. A composition-based cobiont identification strategy is similarly useful among animals, plants, and fungi.
Using sequence composition to differentiate between organisms may become the genomic go-to when no closely related, accurately labeled references are at hand; indeed, the VAE can retrieve cobionts without a reference assembly. Where geneticists struggle to taxonomically assign sequences from these unrepresented or undersampled organisms, a composition-based alternative, with its limited dependence on reference datasets, scales well and can process even large genomes. Annotating the 2D representations with added detail, such as estimated coverage, allows for even quicker analysis, but visualizing sequences in this way can offer valuable information on the contents of a sample even when only the most basic genomic features are available.
Weber notes that her tool works best as part of an integrated approach, combined with reference-based labels, and is not meant for automated binning or classification. In addition to working around gaps in databases, these low-dimensional representations can also highlight errors, drawing attention to assignments that seem inconsistent with their embeddings. Ultimately, Weber’s tool not only avoids computational bottlenecks and informs assembly, but it also leaves geneticists with a sensible approach: when genomic resources are low, take advantage of heterogeneity between species.
References
Disentangling cobionts and contamination in long-read genomic data using sequence composition
Claudia C Weber
G3: Genes|Genomes|Genetics. November 2024; 14(11).
DOI: 10.1093/g3journal/jkae187