TY - JOUR
T1 - Bi-directional recurrent neural network models for geographic location extraction in biomedical literature
AU - Magge, Arjun
AU - Weissenbacher, Davy
AU - Sarker, Abeed
AU - Scotch, Matthew
AU - Gonzalez-Hernandez, Graciela
N1 - Funding Information:
Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under grant number R01AI117011 to MS and GG. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Funding Information:
AM designed and trained the neural networks, ran the experiments, performed the error analysis, and wrote most of the manuscript. DW and AS reviewed, restructured and contributed many sections and revisions of the manuscript. MS and GG provided overall guidance on the work and edited the final manuscript. The authors would also like to acknowledge Karen OConnor, Megan Rorison and Briana Trevino for their efforts in the annotation processes. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The authors are also grateful to ASU-BMI’s computing resources used for conducting the experiments in the paper.
Publisher Copyright:
© 2018 The Authors.
PY - 2019
Y1 - 2019
N2 - Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. Insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) in the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, disambiguation accuracy of 91% and an overall resolution F1 score of 0.88 that are significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
KW - Deep Learning
KW - Named Entity Recognition
KW - Natural Language Processing
KW - Toponym Detection
KW - Toponym Disambiguation
KW - Toponym Resolution
UR - http://www.scopus.com/inward/record.url?scp=85062760249&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062760249&partnerID=8YFLogxK
M3 - Conference article
C2 - 30864314
AN - SCOPUS:85062760249
SN - 2335-6928
VL - 24
SP - 100
EP - 111
JO - Pacific Symposium on Biocomputing
JF - Pacific Symposium on Biocomputing
IS - 2019
T2 - 24th Pacific Symposium on Biocomputing, PSB 2019
Y2 - 3 January 2019 through 7 January 2019
ER -