TY - JOUR
T1 - Mining e-cigarette adverse events in social media using Bi-LSTM recurrent neural network with word embedding representation
AU - Xie, Jiaheng
AU - Liu, Xiao
AU - Zeng, Daniel Dajun
N1 - Funding Information:
This work is supported by the US National Institutes of Health (grant no. 1R01DA037378-01) and National Science Foundation (grant nos. IIS-1553109 and IIS-1552860).
Publisher Copyright:
© The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved.
PY - 2018/1/1
Y1 - 2018/1/1
AB - Objective: Recent years have seen increased worldwide popularity of e-cigarette use. However, the risks of e-cigarettes are underexamined. Most e-cigarette adverse event studies have achieved low detection rates due to limited subject sample sizes in the experiments and surveys. Social media provides a large data repository of consumers' e-cigarette feedback and experiences, which are useful for e-cigarette safety surveillance. However, it is difficult to automatically interpret the informal and nontechnical consumer vocabulary about e-cigarettes in social media. This issue hinders the use of social media content for e-cigarette safety surveillance. Recent developments in deep neural network methods have shown promise for named entity extraction from noisy text. Motivated by these observations, we aimed to design a deep neural network approach to extract e-cigarette safety information in social media. Methods: Our deep neural language model utilizes word embedding as the representation of text input and recognizes named entity types with the state-of-the-art Bidirectional Long Short-Term Memory (Bi-LSTM) Recurrent Neural Network. Results: Our Bi-LSTM model achieved the best performance compared to 3 baseline models, with a precision of 94.10%, a recall of 91.80%, and an F-measure of 92.94%. We identified 1591 unique adverse events and 9930 unique e-cigarette components (ie, chemicals, flavors, and devices) from our research testbed. Conclusion: Although the conditional random field baseline model had slightly better precision than our approach, our Bi-LSTM model achieved much higher recall, resulting in the best F-measure. Our method can be generalized to extract medical concepts from social media for other medical applications.
KW - Bi-LSTM
KW - Deep neural network
KW - E-cigarette adverse event
KW - Recurrent neural network
KW - Word embedding
UR - http://www.scopus.com/inward/record.url?scp=85040535426&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85040535426&partnerID=8YFLogxK
U2 - 10.1093/jamia/ocx045
DO - 10.1093/jamia/ocx045
M3 - Article
C2 - 28505280
AN - SCOPUS:85040535426
SN - 1067-5027
VL - 25
SP - 72
EP - 80
JO - Journal of the American Medical Informatics Association
JF - Journal of the American Medical Informatics Association
IS - 1
M1 - ocx045
ER -