Genome-wide prediction and association studies offer a powerful approach to connecting genotype to phenotype at a large scale, but performing genomic analyses in humans raises privacy concerns that complicate data sharing. In a study published in the March issue of GENETICS, Zhao and colleagues extend an existing encryption approach, offering a secure way to perform genomic analyses without compromising confidentiality.

In whole-genome analyses, such as genomic prediction and genome-wide association studies (GWAS), researchers use statistical methods to compare genetic variants across many genomes, calculate genetic effects, and estimate heritability. Linear mixed models allow testing for associations with both continuous traits, such as height, blood pressure, and body mass index, and binary phenotypes, such as disease status. Information about covariates like age, sex, and family origin is critical for assessing confounding effects that originate from demographic factors. Linear mixed model analysis also helps account for genetic relatedness among individuals, which is necessary to strengthen statistical inference for discoveries made from genomic data.

Because of inherent privacy and intellectual property concerns, direct sharing of raw genotype and phenotype data is often prohibited, particularly in human research. Researchers therefore first anonymize sensitive information such as individual ID numbers, sex, disease status, family relations between individuals, and other covariates before performing any calculations.

So then, in a research landscape that values open-access data principles like FAIR (findable, accessible, interoperable, and reusable), how can population geneticists make their data widely available without compromising the privacy of the individuals in question?

Several data encryption approaches that obscure sensitive information have been developed. One of them, the homomorphic encryption for genotype and phenotype (HEGP) method, encrypts genotype, phenotype, and covariate data in a way that cannot be linked back to the original identifiers, thus maintaining data privacy. However, HEGP had only been proposed for single-marker regression in GWAS using linear mixed models. Zhao et al. therefore extended HEGP for wider application in genome-to-phenome analyses and demonstrated that it can be effectively applied to many popular mixed models for genomic analyses of quantitative traits, beyond single-marker regression.

The authors used the HEGP scheme to perform linear mixed model analysis directly on encrypted data, with no need to decrypt the data first. They successfully estimated random effects originating from covariates, and the estimates matched those obtained from the original sample data.
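Why can a mixed model be fitted without decrypting anything? The following Python sketch illustrates the general idea under stated assumptions and is not the authors' code. It assumes that encryption amounts to multiplying the phenotypes, covariates, and genotypes by the same secret random orthogonal matrix, and it fixes the variance ratio `lam` in advance rather than estimating it. The data are simulated, and every name here (`random_orthogonal`, `fit`, `lam`, the toy sizes) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (simulated, not from the study)
n, m = 50, 200                                          # individuals, markers
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
Z = rng.integers(0, 3, size=(n, m)).astype(float)       # genotypes coded 0/1/2
y = (X @ np.array([1.0, 0.5])
     + Z @ rng.normal(scale=0.1, size=m)
     + rng.normal(size=n))                              # simulated phenotype


def random_orthogonal(size, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(size, size)))
    return q


def fit(y, X, Z, lam=10.0):
    """Fit y = X b + Z u + e with a fixed variance ratio lam (an assumption).

    Returns generalized least squares estimates of the covariate effects b
    and best linear unbiased predictions of the marker effects u.
    """
    V = Z @ Z.T + lam * np.eye(len(y))   # phenotypic covariance up to a scale factor
    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    u = Z.T @ Vinv @ (y - X @ beta)
    return beta, u


# "Encrypt": rotate every data object with the same secret key P
P = random_orthogonal(n, rng)
y_enc, X_enc, Z_enc = P @ y, P @ X, P @ Z

beta_plain, u_plain = fit(y, X, Z)
beta_enc, u_enc = fit(y_enc, X_enc, Z_enc)

print("covariate effects agree:", np.allclose(beta_plain, beta_enc))
print("marker effects agree:   ", np.allclose(u_plain, u_enc))
```

Because the key satisfies P'P = I, it cancels out of every product that enters the estimating equations, so the analyst gets the same answers while seeing only rotated rows that no longer correspond to individual people.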

They also demonstrated the HEGP method’s usefulness for analyzing genotype and phenotype data from multiple studies. In genomics, certain traits are difficult and expensive to measure, which often leads to studies with small sample sizes. Researchers usually need to analyze multiple underpowered studies together to increase statistical power. Zhao et al. showed that their HEGP extension can combine multiple datasets for joint genomic analyses while preserving data confidentiality.
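How might separately encrypted datasets be pooled? The Python sketch below is a hedged illustration and not the authors' implementation. It assumes each study rotates its own data with its own private random orthogonal key and that the analyst simply stacks the encrypted pieces. Ridge regression with a fixed penalty `lam` stands in for a mixed model, the intercept is omitted for brevity, and all data and names (`simulate_study`, `ridge_marker_effects`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 30                                        # markers shared by all studies
true_effects = rng.normal(scale=0.2, size=m)  # simulated marker effects


def random_orthogonal(size, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(size, size)))
    return q


def simulate_study(n):
    """Simulate genotypes (0/1/2) and a phenotype for n individuals."""
    Z = rng.integers(0, 3, size=(n, m)).astype(float)
    y = Z @ true_effects + rng.normal(size=n)
    return y, Z


def ridge_marker_effects(y, Z, lam=5.0):
    """Ridge estimates of marker effects: (Z'Z + lam * I)^-1 Z'y."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)


# Three small studies, each too small to be well powered on its own
studies = [simulate_study(n) for n in (20, 35, 15)]

# Each study encrypts locally with its own key; keys are never shared
encrypted = []
for y, Z in studies:
    P = random_orthogonal(len(y), rng)
    encrypted.append((P @ y, P @ Z))

# The analyst stacks whatever was received
y_all = np.concatenate([y for y, _ in studies])
Z_all = np.vstack([Z for _, Z in studies])
y_enc_all = np.concatenate([y for y, _ in encrypted])
Z_enc_all = np.vstack([Z for _, Z in encrypted])

pooled_plain = ridge_marker_effects(y_all, Z_all)
pooled_enc = ridge_marker_effects(y_enc_all, Z_enc_all)

print("pooled estimates agree:", np.allclose(pooled_plain, pooled_enc))
```

Stacking separately rotated studies is equivalent to rotating the pooled data with one block-diagonal key, which is itself orthogonal, so the joint estimates are unchanged even though no study ever reveals its raw data or its key.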

In conclusion, geneticists now have an encryption method for genomic analyses that lets them perform the necessary statistical analyses without disclosing sensitive information, easing the privacy concerns that complicate data sharing.


Sejal Davla is a freelance science writer and data scientist with expertise in neuroscience and genetics. She is a motivated storyteller and works on projects at the intersection of science, data, and policy.
