Efficient extraction of protein-protein interactions from full-text articles

Jörg Hakenberg; Robert Leaman; Nguyen Ha Vo; Siddhartha Jonnalagadda; Ryan Sullivan; Christopher Miller; Luis Tari; Chitta Baral; Graciela Gonzalez

doi:10.1109/TCBB.2010.51

Research output: Contribution to journal › Article › peer-review

30 Scopus citations

Abstract

Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein-named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).

Original language	English (US)
Article number	5473210
Pages (from-to)	481-494
Number of pages	14
Journal	IEEE/ACM Transactions on Computational Biology and Bioinformatics
Volume	7
Issue number	3
DOIs	https://doi.org/10.1109/TCBB.2010.51
State	Published - 2010

Keywords

Biology and genetics
bioinformatics (genome or protein) databases
text analysis

ASJC Scopus subject areas

Biotechnology
Genetics
Applied Mathematics

Access to Document

10.1109/TCBB.2010.51

Cite this

@article{2d59984d8a18477fae150a7563b9a466,

title = "Efficient extraction of protein-protein interactions from full-text articles",

abstract = "Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein-named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. 
All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).",

keywords = "Biology and genetics, bioinformatics (genome or protein) databases, text analysis",

author = "J{\"o}rg Hakenberg and Robert Leaman and {Ha Vo}, Nguyen and Siddhartha Jonnalagadda and Ryan Sullivan and Christopher Miller and Luis Tari and Chitta Baral and Graciela Gonzalez",

note = "Funding Information: Graciela Gonzalez, Robert Leaman, Christopher Miller, and Ryan Sullivan acknowledge support from the Science Foundation Arizona grant CAA 0277-08, the Arizona Alzheimer{\textquoteright}s Disease Data Management Core under NIH Grant NIA P30 AG-19610, and the State of Arizona Alzheimer{\textquoteright}s Disease Research Consortium. Parts of this research (Chitta Baral, Nguyen Ha Vo, Luis Tari, and J{\"o}rg Hakenberg) were funded by the grants from US National Science Foundation (NSF) 0412000, SFAZ CAA 0289-08, and NSF OCI 0950440. J{\"o}rg Hakenberg thanks the Fulton School of Engineering for support.",

year = "2010",

doi = "10.1109/TCBB.2010.51",

language = "English (US)",

volume = "7",

pages = "481--494",

journal = "IEEE/ACM Transactions on Computational Biology and Bioinformatics",

issn = "1545-5963",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "3",

}

TY - JOUR

T1 - Efficient extraction of protein-protein interactions from full-text articles

AU - Hakenberg, Jörg

AU - Leaman, Robert

AU - Ha Vo, Nguyen

AU - Jonnalagadda, Siddhartha

AU - Sullivan, Ryan

AU - Miller, Christopher

AU - Tari, Luis

AU - Baral, Chitta

AU - Gonzalez, Graciela

N1 - Funding Information: Graciela Gonzalez, Robert Leaman, Christopher Miller, and Ryan Sullivan acknowledge support from the Science Foundation Arizona grant CAA 0277-08, the Arizona Alzheimer’s Disease Data Management Core under NIH Grant NIA P30 AG-19610, and the State of Arizona Alzheimer’s Disease Research Consortium. Parts of this research (Chitta Baral, Nguyen Ha Vo, Luis Tari, and Jörg Hakenberg) were funded by the grants from US National Science Foundation (NSF) 0412000, SFAZ CAA 0289-08, and NSF OCI 0950440. Jörg Hakenberg thanks the Fulton School of Engineering for support.

PY - 2010

Y1 - 2010

N2 - Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein-named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. 
All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).

AB - Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein-named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. 
All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).

KW - Biology and genetics

KW - bioinformatics (genome or protein) databases

KW - text analysis

UR - http://www.scopus.com/inward/record.url?scp=77955439196&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77955439196&partnerID=8YFLogxK

U2 - 10.1109/TCBB.2010.51

DO - 10.1109/TCBB.2010.51

M3 - Article

C2 - 20498514

AN - SCOPUS:77955439196

SN - 1545-5963

VL - 7

SP - 481

EP - 494

JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics

JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics

IS - 3

M1 - 5473210

ER -
