TY - JOUR
T1 - Mitigating the Impact of Data Sampling on Social Media Analysis and Mining
AU - Xu, Kuai
AU - Wang, Feng
AU - Wang, Haiyan
AU - Wang, Yufang
AU - Zhang, Ying
N1 - Funding Information:
Manuscript received August 13, 2019; revised December 30, 2019; accepted January 23, 2020. Date of publication February 18, 2020; date of current version April 3, 2020. This work was supported in part by the National Science Foundation under Grant ATD-1737861, in part by the Humanities and Social Sciences Research, Ministry of Education of China, under Grant 18YJCZH184, and in part by the Tianjin Natural Science Foundation under Grant 19JCQNJC14800. (Corresponding author: Kuai Xu.) Kuai Xu, Feng Wang, and Haiyan Wang are with the School of Mathematical and Natural Sciences, Arizona State University, Glendale, AZ 85306 USA (e-mail: kuai.xu@asu.edu).
Publisher Copyright:
© 2014 IEEE.
PY - 2020/4
Y1 - 2020/4
N2 - The last decade has witnessed the explosive growth of online social media in users and content. Due to the unprecedented scale and the cascading power of the underlying social networks, social media has created a new paradigm for sharing information, broadcasting breaking news, and reporting real-time events by any user from anywhere at any time. Many popular social media sites, including Twitter, provide streaming data services through standard APIs to the broad researcher and developer communities. Given the sheer data volume, rapid velocity, and feature variety of online social media, these sites often supply only a sampled set of streaming data, rather than the full data set, to reduce the resource cost of computation, storage, and network bandwidth. In light of the substantial impact of sampling in the Twitter data stream, this article explores a combination of spectral clustering, locality-sensitive hashing (LSH), latent Dirichlet allocation (LDA) topic modeling, and differential equation modeling to mitigate the impact of sampling on social media data analysis, in particular on detecting real-world events and predicting information diffusion. Our extensive experiments demonstrate that our proposed method is able to effectively detect real-time emerging events and accurately predict the cascading pattern of these events from the 1% sampled Twitter data stream. To the best of our knowledge, this article is the first effort to introduce a systematic methodology to study and mitigate the impact of data sampling on social media analysis and mining.
AB - The last decade has witnessed the explosive growth of online social media in users and content. Due to the unprecedented scale and the cascading power of the underlying social networks, social media has created a new paradigm for sharing information, broadcasting breaking news, and reporting real-time events by any user from anywhere at any time. Many popular social media sites, including Twitter, provide streaming data services through standard APIs to the broad researcher and developer communities. Given the sheer data volume, rapid velocity, and feature variety of online social media, these sites often supply only a sampled set of streaming data, rather than the full data set, to reduce the resource cost of computation, storage, and network bandwidth. In light of the substantial impact of sampling in the Twitter data stream, this article explores a combination of spectral clustering, locality-sensitive hashing (LSH), latent Dirichlet allocation (LDA) topic modeling, and differential equation modeling to mitigate the impact of sampling on social media data analysis, in particular on detecting real-world events and predicting information diffusion. Our extensive experiments demonstrate that our proposed method is able to effectively detect real-time emerging events and accurately predict the cascading pattern of these events from the 1% sampled Twitter data stream. To the best of our knowledge, this article is the first effort to introduce a systematic methodology to study and mitigate the impact of data sampling on social media analysis and mining.
KW - Big data
KW - Data sampling
KW - Social media analysis
UR - http://www.scopus.com/inward/record.url?scp=85079903238&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85079903238&partnerID=8YFLogxK
U2 - 10.1109/TCSS.2020.2970602
DO - 10.1109/TCSS.2020.2970602
M3 - Article
AN - SCOPUS:85079903238
SN - 2329-924X
VL - 7
SP - 546
EP - 555
JO - IEEE Transactions on Computational Social Systems
JF - IEEE Transactions on Computational Social Systems
IS - 2
M1 - 9001215
ER -