Zebra: Static and Dynamic Genome Cover Thresholds with Overlapping References

Daniel Hakim, Stephen Wandro, Karsten Zengler, Livia S. Zaramela, Brent Nowinski, Austin Swafford, Qiyun Zhu, Se Jin Song, Antonio Gonzalez, Daniel McDonald, Rob Knight

Research output: Contribution to journal › Article › peer-review

2 Scopus citations


Assigning taxonomy remains a challenging topic in microbiome studies, due largely to ambiguity of reads which overlap multiple reference genomes. With the Web of Life (WoL) reference database hosting 10,575 reference genomes and growing, the percentage of ambiguous reads will only increase. The resulting artifacts create both the illusion of co-occurrence and a long tail end of extraneous reference hits that confound interpretation. We introduce genome cover, the fraction of reference genome overlapped by reads, to distinguish these artifacts. We show how to dynamically predict genome cover by read count and examine our model in Staphylococcus aureus monoculture. Our modeling cleanly separates both S. aureus and true contaminants from the false artifacts of reference overlap. We next introduce saturated genome cover, the true fraction of a reference genome overlapped by sample contents. Genome cover may not saturate for low abundance or low prevalence bacteria. We assuage this worry with examination of a large human fecal data set. By compositing the metric across like samples, genome cover saturates even for rare species. We note that it is a threshold on saturated genome cover, not genome cover itself, which indicates a spurious reference hit or distant relative. We present Zebra, a method to compute and threshold the genome cover metric across like samples, a recurrence to estimate genome cover and confirm saturation, and provide guidance for choosing cover thresholds in real world scenarios. Standalone genome cover and integration into Woltka are available: https://github.com/biocore/zebra_filter, https://github.com/qiyunzhu/woltka.
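As a rough illustration of the metric defined above, the following Python sketch computes genome cover as the fraction of a reference genome overlapped by aligned reads, and composites it across like samples by pooling their alignment intervals before merging. It is a minimal sketch under stated assumptions, not the Zebra implementation: the function names, the half-open `(start, end)` interval representation, and the toy coordinates are all hypothetical. The `expected_cover` helper uses the classical Lander-Waterman expectation for uniformly placed reads as a stand-in; it is not the recurrence presented in the paper.

```python
from math import exp


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval when the new one overlaps or abuts it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def genome_cover(intervals, genome_length):
    """Fraction of the reference genome overlapped by at least one read."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


def composite_cover(samples, genome_length):
    """Genome cover after pooling read intervals from several like samples."""
    pooled = [interval for sample in samples for interval in sample]
    return genome_cover(pooled, genome_length)


def expected_cover(read_count, read_length, genome_length):
    """Lander-Waterman expectation (an assumption, not Zebra's recurrence):
    expected covered fraction if reads land uniformly at random."""
    return 1.0 - exp(-read_count * read_length / genome_length)


# Hypothetical alignments of two samples against a 1,000 bp reference.
GENOME_LENGTH = 1000
sample_a = [(0, 100), (50, 150)]   # two overlapping reads
sample_b = [(400, 500)]            # one read elsewhere on the genome

cover_a = genome_cover(sample_a, GENOME_LENGTH)
cover_b = genome_cover(sample_b, GENOME_LENGTH)
cover_all = composite_cover([sample_a, sample_b], GENOME_LENGTH)

# An illustrative static threshold; the value 0.2 is arbitrary.
THRESHOLD = 0.2
keep_reference = cover_all >= THRESHOLD
```

In this toy case neither sample alone passes the illustrative threshold, but the composited cover does, which mirrors the point that cover can approach saturation across like samples even when each individual sample is sparse. Comparing an observed cover against an expectation for the same read count is one way to flag references whose reads pile up in a small region, under the uniform-placement assumption noted above.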

Original language: English (US)
Issue number: 5
State: Published - Sep 2022


Keywords

  • metagenomics
  • microbiome
  • read filtering

ASJC Scopus subject areas

  • Microbiology
  • Physiology
  • Biochemistry
  • Ecology, Evolution, Behavior and Systematics
  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computer Science Applications

