This post has been republished with author permission. The original post, published by the Oxford Big Data Institute, is available here.

Researchers at the Big Data Institute and colleagues have developed a new method for understanding the relationships between different DNA sequences and where they come from.

This information has widespread applications, from understanding the development of viruses, such as SARS-CoV-2, the strain of coronavirus that causes COVID-19, to precision medicine, an approach to disease treatment and prevention that takes into account individual genetic information. The study is published in GENETICS and is the featured paper in the September 2024 edition.

Genetics is rapidly becoming part of our everyday lives. Nearly every week sees another newspaper headline about genetics and human ancestry, with huge datasets of DNA sequences routinely generated and used for medical study.

We can make sense of this genomic big data by working out the historical process that created it: in other words, where the DNA sequences came from. If we take a small section of someone’s DNA, we know it must have come from one of their two parents, and before that from one of their four grandparents, and so on. This means we can represent the history of different sections of DNA by tracing them backwards through time.

If we do this for a large set of DNA sequences from different people, we can build up a set of genetic “family trees,” a genealogy of DNA sequences. This grand network of inheritance is sometimes called an ancestral recombination graph (ARG). Previous work by the same research group has shown that such networks can be used not only to illuminate the history of our genome, but also to compress DNA data and speed up genetic analyses.

Dr Yan Wong, lead author and evolutionary geneticist at the Big Data Institute, said, “There has been surprisingly little consensus on exactly how to represent such an ancestral recombination graph on a computer. In this study, we outline a simple and efficient encoding of genetic genealogies in which each ancestor can be thought of as a fragmentary length of DNA, or ‘ancestral genome’ at some point in the past. The history of today’s genetic sequences is traced back through those ancestral genomes, keeping track of which chunks of DNA were inherited from which ancestors.”
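To make the idea concrete, here is a minimal Python sketch of how such a genome-to-genome record might look. It is an illustration only, not the study's software or file format: the `Edge` class, the `trace_ancestors` function, and the toy genome IDs and coordinates are all hypothetical. Each record says that a child genome inherited one interval of DNA from one parent genome, and a single position can then be traced back through its ancestors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One inheritance record (hypothetical layout, for illustration only)."""
    left: int    # start of the inherited interval (inclusive)
    right: int   # end of the inherited interval (exclusive)
    parent: int  # ID of the ancestral genome
    child: int   # ID of the genome that inherited the interval


def trace_ancestors(edges, sample, position):
    """Follow one genomic position backwards in time from a sample genome.

    Returns the list of genome IDs visited, starting with the sample.
    Assumes the records form an acyclic graph and that each child has at
    most one parent covering any given position.
    """
    lineage = [sample]
    current = sample
    while True:
        parent = next(
            (e.parent for e in edges
             if e.child == current and e.left <= position < e.right),
            None,
        )
        if parent is None:
            return lineage
        lineage.append(parent)
        current = parent


# Toy genealogy over a 100-base region (made-up values).
# Genome 0 is a recombinant: its left half came from genome 2,
# its right half from genome 3. Genomes 2 and 3 both descend from 4.
edges = [
    Edge(0, 50, parent=2, child=0),
    Edge(50, 100, parent=3, child=0),
    Edge(0, 100, parent=3, child=1),
    Edge(0, 100, parent=4, child=2),
    Edge(0, 100, parent=4, child=3),
]

print(trace_ancestors(edges, sample=0, position=10))
print(trace_ancestors(edges, sample=0, position=75))
```

In this toy example the two positions in genome 0 follow different routes back to the same older ancestor, which is what recombination looks like when inheritance is recorded interval by interval.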

By using this simple scheme, recording genome-to-genome transmission of information, the study shows that the same genetic ancestry can be stored to different degrees of precision. This means relationships between different DNA sequences can be represented without having to know or guess the precise timing of joins and splits that underlie the true history of inheritance. The researchers also show that their description of genetic inheritance is flexible enough to deal with the wide variety of different methods that researchers currently use to reconstruct genetic history.

The approach allows scientists to store and analyze large amounts of genetic data on a standard laptop, and it generalizes to any species of life on earth. For example, it forms the basis of a “unified genealogy” of over 7,000 publicly available whole human genome sequences that the researchers released previously. They are currently creating a genetic genealogy of millions of SARS-CoV-2 genomes, collected over the span of the coronavirus pandemic, which will allow analysis of the recent history of the virus, pinpointing the emergence of novel mixed (or “recombinant”) strains.

Dr Wong added, “We hope that this formal standard for how to represent genetic genealogies can help to unify the field of genetic history and make it easier for scientists to analyze, share and compare results. This will be crucial as we move into an era of genomic medicine, where genetic data will be used to diagnose and treat diseases, and where understanding the history of our genomes will be key to understanding our health and ancestry.”
