TY - JOUR
T1 - Using aggregated AI detector outcomes to eliminate false positives in STEM student writing
AU - Hyatt, Jon Philippe K.
AU - Bienenstock, Elisa Jayne
AU - Firetto, Carla M.
AU - Woods, Elizabeth R.
AU - Comus, Robert C.
N1 - Publisher Copyright:
© 2025 The Authors. Licensed under Creative Commons Attribution CC-BY-NC 4.0.
PY - 2025
Y1 - 2025
N2 - Generative artificial intelligence (AI) large language models have become sufficiently accessible and user-friendly to assist students with course work, studying tactics, and written communication. AI-generated writing is almost indistinguishable from human-derived work. Instructors must rely on intuition/experience and, recently, assistance from online AI detectors to help them distinguish between student- and AI-written material. Here, we tested the veracity of AI detectors for writing samples from a fact-heavy, lower-division undergraduate anatomy and physiology course. Student participants (n = 190) completed three parts: a hand-written essay answering a prompt on the structure/function of the plasma membrane; creating an AI-generated answer to the same prompt; and a survey seeking participants’ views on the quality of each essay as well as general AI use. Randomly selected (n = 50) participant-written and AI-generated essays were blindly uploaded onto four AI detectors; a separate and unique group of randomly selected essays (n = 48) was provided to human raters (n = 9) for classification assessment. For the majority of essays, human raters and the best-performing AI detectors (n = 3) similarly identified their correct origin (84–95% and 93–98%, respectively) (P > 0.05). Approximately 1.3% and 5.0% of the essays were detected as false positives (human writing incorrectly labeled as AI) by AI detectors and human raters, respectively. Surveys generally indicated that students viewed the AI-generated work as better than their own (P < 0.01). Using AI detectors in aggregate reduced the likelihood of detecting a false positive to nearly 0%, and this strategy was validated against human rater-labeled false positives. Taken together, our findings show that AI detectors, when used together, become a powerful tool to inform instructors.
NEW & NOTEWORTHY We show how online artificial intelligence (AI) detectors can assist instructors in distinguishing between human- and AI-written work for written assignments. Although individual AI detectors may vary in their accuracy for correctly identifying the origin of written work, they are most effective when used in aggregate to inform instructors when human intuition gets it wrong. Using AI detectors for consensus detection reduces the false positive rate to nearly zero.
AB - Generative artificial intelligence (AI) large language models have become sufficiently accessible and user-friendly to assist students with course work, studying tactics, and written communication. AI-generated writing is almost indistinguishable from human-derived work. Instructors must rely on intuition/experience and, recently, assistance from online AI detectors to help them distinguish between student- and AI-written material. Here, we tested the veracity of AI detectors for writing samples from a fact-heavy, lower-division undergraduate anatomy and physiology course. Student participants (n = 190) completed three parts: a hand-written essay answering a prompt on the structure/function of the plasma membrane; creating an AI-generated answer to the same prompt; and a survey seeking participants’ views on the quality of each essay as well as general AI use. Randomly selected (n = 50) participant-written and AI-generated essays were blindly uploaded onto four AI detectors; a separate and unique group of randomly selected essays (n = 48) was provided to human raters (n = 9) for classification assessment. For the majority of essays, human raters and the best-performing AI detectors (n = 3) similarly identified their correct origin (84–95% and 93–98%, respectively) (P > 0.05). Approximately 1.3% and 5.0% of the essays were detected as false positives (human writing incorrectly labeled as AI) by AI detectors and human raters, respectively. Surveys generally indicated that students viewed the AI-generated work as better than their own (P < 0.01). Using AI detectors in aggregate reduced the likelihood of detecting a false positive to nearly 0%, and this strategy was validated against human rater-labeled false positives. Taken together, our findings show that AI detectors, when used together, become a powerful tool to inform instructors.
NEW & NOTEWORTHY We show how online artificial intelligence (AI) detectors can assist instructors in distinguishing between human- and AI-written work for written assignments. Although individual AI detectors may vary in their accuracy for correctly identifying the origin of written work, they are most effective when used in aggregate to inform instructors when human intuition gets it wrong. Using AI detectors for consensus detection reduces the false positive rate to nearly zero.
KW - anatomy
KW - physiology
KW - undergraduate
UR - http://www.scopus.com/inward/record.url?scp=105003121391&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105003121391&partnerID=8YFLogxK
U2 - 10.1152/advan.00235.2024
DO - 10.1152/advan.00235.2024
M3 - Article
C2 - 40105702
AN - SCOPUS:105003121391
SN - 1043-4046
VL - 49
SP - 486
EP - 495
JO - Advances in Physiology Education
JF - Advances in Physiology Education
IS - 2
ER -