Today’s guest post was contributed by Caitlan Rossi, a scientific and medical writer. Her work can be found at caitlanrossi.com.
Advances in technology have allowed geneticists to sequence a slew of unique animal, plant, and fungal species over the past thirty years. Public databases currently house tens of thousands of eukaryotic genome assemblies, but relatively few include an estimate of the total genome size for their respective species. Genome size (or C-value) varies widely, even at the species level, largely due to noncoding DNA, which is often dismissed as “junk” DNA. The standard metrics used to characterize assemblies don’t get at size and chromosome number, the fundamental structure of genomes. Without this foundational information, a new study in GENETICS asks: “Are our genome assemblies good enough?”
To determine whether existing assemblies match the estimated genome size for their corresponding species, author Carl Hjelmen designed an R script to pull information from four NCBI databases: Assembly, BioSample, Sequence Read Archive (SRA), and Taxonomy. Starting from the >40,000 available eukaryotic genome assemblies, he analyzed the ~15,000 animal, plant, and fungal genomes that had existing size estimates. He also used karyotype databases to determine the haploid chromosome number for mammals, dipterans, coleopterans, amphibians, and polyneopterans.
Taking into account kingdom, sequencing platform, and common assembly statistics, Hjelmen devised a metric called “proportional difference from genome size” to determine how closely a given assembly length matched the estimated genome size. If the assembly was within 10% of the estimate, he considered it “good.”
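To make the idea concrete, here is a minimal sketch of how such a metric could be computed. The formula (assembly length minus estimated genome size, divided by the estimate) is an assumption based on the metric's name rather than a formula quoted from the paper, and the function names and example numbers are hypothetical.

```python
def proportional_difference(assembly_bp: int, estimated_genome_bp: int) -> float:
    """Signed difference between assembly length and estimated genome size,
    as a fraction of the estimate. ASSUMED formula, not taken from the paper."""
    if estimated_genome_bp <= 0:
        raise ValueError("estimated genome size must be positive")
    return (assembly_bp - estimated_genome_bp) / estimated_genome_bp


def is_good_assembly(assembly_bp: int, estimated_genome_bp: int,
                     tolerance: float = 0.10) -> bool:
    """True when the assembly falls within `tolerance` (10% by default)
    of the estimated genome size, in either direction."""
    return abs(proportional_difference(assembly_bp, estimated_genome_bp)) <= tolerance


# Hypothetical species with an estimated genome size of 1.2 Gb.
estimate = 1_200_000_000
short_assembly = 900_000_000     # 25% smaller than the estimate
close_assembly = 1_150_000_000   # roughly 4% smaller than the estimate

print(proportional_difference(short_assembly, estimate))  # -0.25
print(is_good_assembly(short_assembly, estimate))         # False
print(is_good_assembly(close_assembly, estimate))         # True
```

Under this reading, a negative value means the assembly is shorter than the estimate, which is the direction most of the flagged assemblies fell in.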
He found that almost half of the assemblies analyzed differed from the genome size estimate for their species by more than 10%. Most were smaller than the estimates, suggesting that some assemblies are missing information. The larger the genome, the more dramatic the deviation tended to be, which wasn’t surprising considering that larger eukaryotic genomes often carry more of that so-called “junk” DNA. (Nongenic DNA, a friendlier way to describe the regions of the genome that don’t code for proteins, might turn out to be more informative than its reputation would suggest, Hjelmen points out.)
Hjelmen also discovered a positive relationship between late-replicating heterochromatin and assembly/genome size deviation. When genomes contained more heterochromatin, the assembly was more likely to be missing DNA; he argues that this “lost information” should be highly sought after when studying populations and their health. And though the results were modest, long-read technologies appeared more likely to assemble genomes near that 10% cutoff.
This study points out the limitations of widely used genome metrics like N50 (which narrowly measures contiguity) and BUSCO score (which describes completeness of core sets of genes). To shrink this analytic gap, Hjelmen proposes a new structural metric: “PN50,” or proportional N50 value, which contextualizes N50 values by relating them to estimated genome size and haploid chromosome number. Adding PN50 to the current mix of metrics could increase the rigor of genome research, offering insight into the less-studied structural components of assemblies and supporting universal assembly comparison.
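For readers who want to see how these contiguity metrics fit together, the sketch below computes a standard N50 and then one plausible form of a proportional N50. The N50 calculation follows the usual definition. The PN50 formula shown (N50 divided by the expected average chromosome size, that is, estimated genome size divided by haploid chromosome number) is an assumption drawn from the description above, so check the paper for the exact definition. All function names and numbers are hypothetical.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Standard N50: the length of the contig/scaffold at which the running
    total, taken from longest to shortest, reaches half the assembly length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def pn50(lengths: Sequence[int], estimated_genome_bp: int,
         haploid_chromosome_number: int) -> float:
    """ASSUMED proportional N50: N50 relative to the expected average
    chromosome size (estimated genome size / haploid chromosome number)."""
    if estimated_genome_bp <= 0 or haploid_chromosome_number <= 0:
        raise ValueError("genome size and chromosome number must be positive")
    expected_chromosome_bp = estimated_genome_bp / haploid_chromosome_number
    return n50(lengths) / expected_chromosome_bp


# Toy assembly (arbitrary units): total length 300, so N50 is reached at 150.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 70

# Hypothetical species: estimated genome size 400, haploid chromosome number 2.
print(pn50(contigs, estimated_genome_bp=400, haploid_chromosome_number=2))
```

Under this assumed form, a value approaching 1 would suggest that a typical contig or scaffold is close to chromosome scale, which a raw N50 cannot show on its own because it ignores how large the genome and its chromosomes are expected to be.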
References
Genome size and chromosome number are critical metrics for accurate genome assembly assessment in Eukaryota
Carl E Hjelmen
GENETICS. August 2024; 227 (4).
DOI: 10.1093/genetics/iyae099