Unbalanced Sample Size Introduces Spurious Correlations to Genome-Wide Heterozygosity Analyses

Li Liu, Richard J. Caselli

Research output: Contribution to journal › Article › peer-review



Excess of heterozygosity (H) is a widely used measure of the genetic diversity of a population. As high-throughput sequencing and genotyping data become readily available, it has been applied to investigating the associations of genome-wide genetic diversity with human diseases and traits. However, these studies often report contradictory results. In this paper, we present a meta-analysis of five whole-exome studies to examine the association of H scores with Alzheimer's disease. We show that the mean H score of a group is not associated with disease status, but it is associated with sample size. Across all five studies, the group with more samples has a significantly lower H score than the group with fewer samples. To remove potential confounders in empirical data sets, we perform computer simulations to create artificial genomes controlled for the number of polymorphic loci, the sample size, and the allele frequency. Analyses of these simulated data confirm the negative correlation between the sample size and the H score. Furthermore, we find that genomes with a large number of rare variants also have inflated H scores. Together, these biases can lead to spurious associations between genetic diversity and the phenotype of interest. Based on these findings, we recommend that studies balance the sample sizes when using genome-wide H scores to assess the genetic diversity of different populations, which should help improve the reproducibility of future research.
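The sample-size effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes one particular definition of the H score, namely (observed - expected) / expected heterozygosity summed over loci, with the expected value 2p(1 - p) computed from allele frequencies estimated within the same group and without a small-sample correction. The function names, the number of loci, the group sizes, and the allele-frequency range are all hypothetical choices made for illustration.

```python
import random


def simulate_genotypes(n_samples, freqs, rng):
    """Draw diploid genotypes (0, 1, or 2 alternate alleles per locus)
    under Hardy-Weinberg equilibrium for the given allele frequencies."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_samples)
    ]


def excess_heterozygosity(genotypes):
    """Return (observed - expected) / expected heterozygosity over all loci.

    ASSUMPTION of this sketch: expected heterozygosity is 2*p*(1-p) with p
    estimated from the same group, with no small-sample correction.
    """
    n = len(genotypes)
    n_loci = len(genotypes[0])
    observed = 0.0
    expected = 0.0
    for j in range(n_loci):
        column = [g[j] for g in genotypes]
        p_hat = sum(column) / (2 * n)           # within-group allele frequency
        observed += sum(1 for g in column if g == 1) / n
        expected += 2 * p_hat * (1 - p_hat)     # uncorrected 2pq
    return (observed - expected) / expected


rng = random.Random(0)
# Both groups share the same true allele frequencies, so there is no real
# difference in genetic diversity between them.
freqs = [rng.uniform(0.05, 0.5) for _ in range(5000)]
h_small = excess_heterozygosity(simulate_genotypes(20, freqs, rng))
h_large = excess_heterozygosity(simulate_genotypes(200, freqs, rng))
print(f"n=20: H={h_small:.4f}   n=200: H={h_large:.4f}")
```

Under these assumptions the uncorrected estimate 2p(1 - p) underestimates the true expected heterozygosity by a factor of 1 - 1/(2n), so the score is biased upward by roughly 1/(2n - 1). The smaller group therefore tends to show the higher H score even though both groups are drawn from identical allele frequencies, which is the kind of spurious difference that balancing sample sizes is meant to avoid.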

Original language: English (US)
Pages (from-to): 197-202
Number of pages: 6
Journal: Human Heredity
Issue number: 4-5
State: Published - Jul 1 2020


Keywords

  • Alzheimer's disease
  • Excess of heterozygosity
  • Genetic diversity
  • Genome analysis
  • Sample size bias

ASJC Scopus subject areas

  • Genetics
  • Genetics (clinical)

