Bi-directional recurrent neural network models for geographic location extraction in biomedical literature

Arjun Magge; Davy Weissenbacher; Abeed Sarker; Matthew Scotch; Graciela Gonzalez-Hernandez

Research output: Contribution to journal › Conference article › peer-review

Abstract

Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence and disambiguating the locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations, and of their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F₁ score of 0.94, a disambiguation accuracy of 91%, and an overall resolution F₁ score of 0.88, all significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
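
The abstract mentions disambiguation "using population heuristics" without giving details. A minimal sketch of one common form of that idea follows: among all gazetteer entries sharing a detected toponym's name, pick the most populous one. This is an illustration under stated assumptions, not the authors' implementation. The `Candidate` type, the `GAZETTEER` dictionary, and the `disambiguate` function are hypothetical names, and the coordinates and population figures are rough placeholder values for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    """One gazetteer entry that a toponym string could refer to (hypothetical type)."""
    name: str
    country: str
    latitude: float
    longitude: float
    population: int


# Hypothetical in-memory gazetteer keyed by lower-cased toponym.
# Coordinates and populations are rough placeholders, not authoritative data.
GAZETTEER: Dict[str, List[Candidate]] = {
    "paris": [
        Candidate("Paris", "France", 48.86, 2.35, 2_000_000),
        Candidate("Paris", "United States", 33.66, -95.56, 25_000),
    ],
}


def disambiguate(
    toponym: str, gazetteer: Dict[str, List[Candidate]] = GAZETTEER
) -> Optional[Candidate]:
    """Return the most populous candidate for a toponym, or None if unknown."""
    candidates = gazetteer.get(toponym.strip().lower(), [])
    if not candidates:
        return None
    # Population heuristic: assume the largest place is the intended one.
    return max(candidates, key=lambda c: c.population)


best = disambiguate("Paris")
if best is not None:
    print(best.country, best.latitude, best.longitude)
```

The heuristic is deliberately simple: it ignores document context, so it resolves every mention of a name to the same place. It only works downstream of a detection step, which is why detection quality affects overall resolution.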

Original language	English (US)
Pages (from-to)	100-111
Number of pages	12
Journal	Pacific Symposium on Biocomputing
Volume	24
Issue number	2019
State	Published - 2019
Event	24th Pacific Symposium on Biocomputing, PSB 2019 - Kohala Coast, United States; Duration: Jan 3 2019 → Jan 7 2019

Keywords

Deep Learning
Named Entity Recognition
Natural Language Processing
Toponym Detection
Toponym Disambiguation
Toponym Resolution

ASJC Scopus subject areas

General Medicine

Cite this

@article{54c7e4fc8e09412a853c1b2a82f7d307,

title = "Bi-directional recurrent neural network models for geographic location extraction in biomedical literature",

abstract = "Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. Insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) in the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, disambiguation accuracy of 91% and an overall resolution F1 score of 0.88 that are significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.",

keywords = "Deep Learning, Named Entity Recognition, Natural Language Processing, Toponym Detection, Toponym Disambiguation, Toponym Resolution",

author = "Arjun Magge and Davy Weissenbacher and Abeed Sarker and Matthew Scotch and Graciela Gonzalez-Hernandez",

note = "Funding Information: Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under grant number R01AI117011 to MS and GG. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Funding Information: AM designed and trained the neural networks, ran the experiments, performed the error analysis, and wrote most of the manuscript. DW and AS reviewed, restructured and contributed many sections and revisions of the manuscript. MS and GG provided overall guidance on the work and edited the final manuscript. The authors would also like to acknowledge Karen OConnor, Megan Rorison and Briana Trevino for their efforts in the annotation processes. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The authors are also grateful to ASU-BMI{\textquoteright}s computing resources used for conducting the experiments in the paper. Publisher Copyright: {\textcopyright} 2018 The Authors.; 24th Pacific Symposium on Biocomputing, PSB 2019 ; Conference date: 03-01-2019 Through 07-01-2019",

year = "2019",

language = "English (US)",

volume = "24",

pages = "100--111",

journal = "Pacific Symposium on Biocomputing",

issn = "2335-6928",

publisher = "World Scientific Publishing Co., Inc.",

number = "2019",

}

TY - JOUR

T1 - Bi-directional recurrent neural network models for geographic location extraction in biomedical literature

AU - Magge, Arjun

AU - Weissenbacher, Davy

AU - Sarker, Abeed

AU - Scotch, Matthew

AU - Gonzalez-Hernandez, Graciela

N1 - Funding Information: Research reported in this publication was supported by the National Institute of Allergy and Infectious Diseases (NIAID) of the National Institutes of Health (NIH) under grant number R01AI117011 to MS and GG. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Funding Information: AM designed and trained the neural networks, ran the experiments, performed the error analysis, and wrote most of the manuscript. DW and AS reviewed, restructured and contributed many sections and revisions of the manuscript. MS and GG provided overall guidance on the work and edited the final manuscript. The authors would also like to acknowledge Karen OConnor, Megan Rorison and Briana Trevino for their efforts in the annotation processes. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The authors are also grateful to ASU-BMI’s computing resources used for conducting the experiments in the paper. Publisher Copyright: © 2018 The Authors.

PY - 2019

Y1 - 2019

AB - Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. Insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) in the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, disambiguation accuracy of 91% and an overall resolution F1 score of 0.88 that are significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.

KW - Deep Learning

KW - Named Entity Recognition

KW - Natural Language Processing

KW - Toponym Detection

KW - Toponym Disambiguation

KW - Toponym Resolution

UR - http://www.scopus.com/inward/record.url?scp=85062760249&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062760249&partnerID=8YFLogxK

M3 - Conference article

C2 - 30864314

AN - SCOPUS:85062760249

SN - 2335-6928

VL - 24

SP - 100

EP - 111

JO - Pacific Symposium on Biocomputing

JF - Pacific Symposium on Biocomputing

IS - 2019

T2 - 24th Pacific Symposium on Biocomputing, PSB 2019

Y2 - 3 January 2019 through 7 January 2019

ER -
