From sequence to centimeters: predicting height from genomes
Machine learning and access to ever-expanding databases improve genomic prediction of human traits.
In theory, a scientist could predict your height using just your genome sequence. In practice, though, this is still the stuff of science fiction. It’s not only your genes that affect height—environment also plays a role—but the larger problem is that height is affected by tens of thousands of individual genetic variations. This is also true of other complex traits, such as susceptibility to particular diseases. To get closer to accurate genomic prediction of human traits, geneticists are using new approaches to harness the vast amounts of sequence data becoming available. In GENETICS, Lello et al. describe a machine learning approach to the problem that allowed them to make predictions within a few centimeters of reality.
“To me, genomic prediction is the actual decoding of the genome,” says senior author Stephen Hsu from Michigan State University. A theoretical physicist by training, Hsu explains that his lab became interested in the problem of genomic prediction several years ago as the cost of genotyping continued to drop and more datasets became available. They had previously argued that complex traits such as height could be predicted given enough data. The release of nearly 500,000 UK Biobank genotypes gave them an opportunity to test this hypothesis.
A genomic prediction approach is quite different from the more familiar genome-wide association study (GWAS). GWAS methods test each SNP one at a time, looking for statistically significant contributions to the phenotype. In contrast, genomic prediction makes use of all SNPs at once in trying to build the best possible predictors.
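To make the one-SNP-at-a-time idea concrete, here is a minimal Python sketch of a GWAS-style scan on simulated data. Everything in it is an illustrative assumption rather than anything taken from Lello et al.: the function names (`simulate`, `marginal_scan`), the toy cohort of 400 people and 30 SNPs, the three "causal" SNPs, and the use of a plain Pearson correlation as a stand-in for a proper association test.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_people=400, n_snps=30, causal=(0, 1, 2), seed=0):
    """Toy cohort (hypothetical): genotypes coded 0/1/2, and a phenotype
    that depends only on the SNPs listed in `causal`, plus random noise."""
    rng = random.Random(seed)
    genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_snps)]
                 for _ in range(n_people)]
    phenotype = [sum(row[j] for j in causal) + rng.gauss(0, 1)
                 for row in genotypes]
    return genotypes, phenotype


def marginal_scan(genotypes, phenotype):
    """Score each SNP on its own, ignoring every other SNP."""
    n_snps = len(genotypes[0])
    return [pearson([row[j] for row in genotypes], phenotype)
            for j in range(n_snps)]


genotypes, phenotype = simulate()
scores = marginal_scan(genotypes, phenotype)
# Rank SNPs by the strength of their individual association.
ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
print("Strongest single-SNP associations:", ranked[:5])
```

Each SNP gets its own score, computed without reference to the others. A genomic predictor instead fits all SNPs in a single model.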
The authors took the Biobank genotype and phenotype data and used a type of regression to identify the combination of SNPs that, taken together, best correlate with the trait of interest. Since only a subset of SNPs influences each trait (even the thousands of loci that control height are only a tiny fraction of the total number of SNPs identified), they also introduced a penalization factor that prevents the model from including unneeded SNPs. They were essentially trying to solve an optimization problem: identify the smallest number of variables (i.e., SNPs) that will allow for the best prediction of the outcome (i.e., the trait).
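This kind of optimization can be sketched as a penalized regression fitted to all SNPs at once. The example below assumes an L1 (LASSO-style) penalty, one common choice that pushes the coefficients of unneeded variables to exactly zero, and solves it by coordinate descent. It is a toy illustration, not the authors' code: the simulated cohort, the function names, and the penalty value of 0.25 are all assumptions made for the example.

```python
import random


def center(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def soft_threshold(value, penalty):
    """Shrink `value` toward zero; anything within `penalty` becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(columns, y, penalty, n_sweeps=50):
    """Minimize (1/2n) * ||y - X b||^2 + penalty * sum(|b_j|).

    `columns` holds one centered list per SNP; `y` is the centered trait.
    """
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)  # y - X b, with b starting at zero
    scale = [sum(v * v for v in col) / n for col in columns]
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            if scale[j] == 0.0:
                continue  # SNP with no variation carries no information
            # Covariance of SNP j with the residual, adding back its own fit.
            rho = sum(c * r for c, r in zip(col, residual)) / n + scale[j] * beta[j]
            new_beta = soft_threshold(rho, penalty) / scale[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new_beta
    return beta


def simulate(n_people=400, n_snps=40, causal=(0, 1, 2), seed=1):
    """Hypothetical toy cohort: only the SNPs in `causal` affect the trait."""
    rng = random.Random(seed)
    genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_snps)]
                 for _ in range(n_people)]
    trait = [sum(row[j] for j in causal) + rng.gauss(0, 1) for row in genotypes]
    return genotypes, trait, causal


genotypes, trait, causal = simulate()
columns = [center([row[j] for row in genotypes])
           for j in range(len(genotypes[0]))]
beta = lasso_coordinate_descent(columns, center(trait), penalty=0.25)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("SNPs kept by the penalized fit:", selected)
```

Raising `penalty` keeps fewer SNPs in the model and lowering it keeps more. In practice the value would be tuned on data held out from the fit rather than fixed in advance.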
Having developed their algorithm, the authors then put it to the test. They constructed models for height, heel bone density, and educational attainment, and found that the algorithm worked well, particularly for height: predicted and actual heights had a correlation of nearly 0.65, and predictions were usually within a few centimeters of the true value. “Our predictor actually captures almost all the heritability that we could expect to find,” says Hsu.
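The two figures quoted here, a correlation and a typical error in centimeters, are standard ways of scoring a predictor against measured values. The short Python sketch below shows how each could be computed. The six heights are invented purely for illustration and are not data from the study, and the helper names are assumptions.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_absolute_error(predicted, actual):
    """Average size of the miss, in the same units as the inputs."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical heights in centimeters (made up for this example).
actual_cm = [158.0, 163.5, 170.2, 175.0, 181.3, 168.4]
predicted_cm = [161.0, 165.0, 168.0, 172.5, 177.0, 170.0]

r = pearson(predicted_cm, actual_cm)
mae = mean_absolute_error(predicted_cm, actual_cm)
print(f"correlation = {r:.2f}, mean absolute error = {mae:.1f} cm")

# Squaring a correlation gives the share of trait variance the predictor
# accounts for; for the reported value of about 0.65 that is roughly 0.42.
variance_explained = 0.65 ** 2
```

A meaningful score of this kind is computed on people who were not used to build the predictor, so that it reflects prediction rather than memorization.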
With enough data, Hsu believes, accurate genomic prediction for complex traits will no longer be sci-fi. As more and more genotypes become available, he expects that this approach could be applied to most traits in as little as five years.
Accurate Genomic Prediction of Human Height
Louis Lello, Steven G. Avery, Laurent Tellier, Ana I. Vazquez, Gustavo de los Campos, Stephen D. H. Hsu
Genetics October 2018 210: 477-497; https://doi.org/10.1534/genetics.118.301267