Extracting unknown words from Sina Weibo via data clustering

Kai Lei; Weiyang Zhang; Kai Zhang; Kuai Xu

doi:10.1109/ICC.2015.7248483

School of Mathematical and Natural Sciences (SMNS)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Sina Weibo, a Twitter-like microblogging site that attracts over 240 million monthly active users who tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. Because many users coin new and innovative words in their tweets and comments, it is necessary to extract these emerging words, which do not yet exist in current Chinese vocabularies or dictionaries. To this end, this paper proposes a novel method, based on data clustering of Weibo users and tweets, for extracting unknown words from Weibo tweets and comments. Specifically, relying on the similarity of the users who post the tweets, we apply hierarchical clustering to divide Weibo data into distinct groups (e.g., sports, news stories, movies) before extraction. Compared with extraction from unclustered Weibo data, our experimental results demonstrate that the proposed data clustering scheme improves the recall and accuracy of extracting unknown Chinese words from tweets and comments.
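
The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which user-similarity measure or linkage criterion the paper uses, so the Jaccard similarity, the average linkage, the `threshold` value, and the per-user feature sets below are all assumptions made for illustration, not the authors' method. The unknown-word extraction that would then run on each group is not shown.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_users(features, threshold=0.3):
    """Agglomerative (average-linkage) clustering of users.

    features: dict of user id -> set of hypothetical feature tokens
    (for example, topics or accounts a user interacts with).
    Merging stops once no pair of clusters has average similarity
    of at least `threshold` (an assumed, illustrative cut-off).
    """
    clusters = [[user] for user in sorted(features)]

    def linkage(c1, c2):
        # Average pairwise similarity between members of two clusters.
        sims = [jaccard(features[u], features[v]) for u in c1 for v in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def group_tweets(tweets, clusters):
    """Map each cluster index to the tweets posted by its users."""
    owner = {user: idx for idx, members in enumerate(clusters) for user in members}
    groups = {idx: [] for idx in range(len(clusters))}
    for user, text in tweets:
        if user in owner:
            groups[owner[user]].append(text)
    return groups


# Toy, made-up data: two sports-leaning and two movie-leaning users.
features = {
    "u1": {"nba", "football", "olympics"},
    "u2": {"nba", "football", "tennis"},
    "u3": {"cinema", "oscars", "trailers"},
    "u4": {"cinema", "oscars", "actors"},
}
clusters = cluster_users(features)
print(clusters)  # [['u1', 'u2'], ['u3', 'u4']]
```

Each group returned by `group_tweets` could then be passed separately to a word-extraction routine, which is the "cluster before extraction" idea the abstract describes.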

Original language	English (US)
Title of host publication	2015 IEEE International Conference on Communications, ICC 2015
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	1182-1187
Number of pages	6
ISBN (Electronic)	9781467364324
DOIs	https://doi.org/10.1109/ICC.2015.7248483
State	Published - Sep 9 2015
Event	IEEE International Conference on Communications, ICC 2015 - London, United Kingdom
Duration	Jun 8 2015 → Jun 12 2015

Publication series

Name	IEEE International Conference on Communications
Volume	2015-September
ISSN (Print)	1550-3607

Other

Other	IEEE International Conference on Communications, ICC 2015
Country/Territory	United Kingdom
City	London
Period	6/8/15 → 6/12/15

ASJC Scopus subject areas

Computer Networks and Communications
Electrical and Electronic Engineering


Cite this

Lei, K., Zhang, W., Zhang, K., & Xu, K. (2015). Extracting unknown words from Sina Weibo via data clustering. In 2015 IEEE International Conference on Communications, ICC 2015 (pp. 1182-1187). Article 7248483 (IEEE International Conference on Communications; Vol. 2015-September). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICC.2015.7248483

Extracting unknown words from Sina Weibo via data clustering. / Lei, Kai; Zhang, Weiyang; Zhang, Kai et al.
2015 IEEE International Conference on Communications, ICC 2015. Institute of Electrical and Electronics Engineers Inc., 2015. p. 1182-1187 7248483 (IEEE International Conference on Communications; Vol. 2015-September).

Lei, K, Zhang, W, Zhang, K & Xu, K 2015, Extracting unknown words from Sina Weibo via data clustering. in 2015 IEEE International Conference on Communications, ICC 2015, 7248483, IEEE International Conference on Communications, vol. 2015-September, Institute of Electrical and Electronics Engineers Inc., pp. 1182-1187, IEEE International Conference on Communications, ICC 2015, London, United Kingdom, 6/8/15. https://doi.org/10.1109/ICC.2015.7248483

@inproceedings{631b68b62d2349688d60e1fc39c2159e,
  title = "Extracting unknown words from Sina Weibo via data clustering",
  abstract = "Sina Weibo, a Twitter-like microblogging site that attracts over 240 million monthly active users who tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. Because many users coin new and innovative words in their tweets and comments, it is necessary to extract these emerging words, which do not yet exist in current Chinese vocabularies or dictionaries. To this end, this paper proposes a novel method, based on data clustering of Weibo users and tweets, for extracting unknown words from Weibo tweets and comments. Specifically, relying on the similarity of the users who post the tweets, we apply hierarchical clustering to divide Weibo data into distinct groups (e.g., sports, news stories, movies) before extraction. Compared with extraction from unclustered Weibo data, our experimental results demonstrate that the proposed data clustering scheme improves the recall and accuracy of extracting unknown Chinese words from tweets and comments.",
  author = "Kai Lei and Weiyang Zhang and Kai Zhang and Kuai Xu",
  note = "Publisher Copyright: {\textcopyright} 2015 IEEE.; IEEE International Conference on Communications, ICC 2015 ; Conference date: 08-06-2015 Through 12-06-2015",
  year = "2015",
  month = sep,
  day = "9",
  doi = "10.1109/ICC.2015.7248483",
  language = "English (US)",
  series = "IEEE International Conference on Communications",
  publisher = "Institute of Electrical and Electronics Engineers Inc.",
  pages = "1182--1187",
  booktitle = "2015 IEEE International Conference on Communications, ICC 2015",
}

TY  - GEN
T1  - Extracting unknown words from Sina Weibo via data clustering
AU  - Lei, Kai
AU  - Zhang, Weiyang
AU  - Zhang, Kai
AU  - Xu, Kuai
PY  - 2015/9/9
Y1  - 2015/9/9
N2  - Sina Weibo, a Twitter-like microblogging site that attracts over 240 million monthly active users who tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. Because many users coin new and innovative words in their tweets and comments, it is necessary to extract these emerging words, which do not yet exist in current Chinese vocabularies or dictionaries. To this end, this paper proposes a novel method, based on data clustering of Weibo users and tweets, for extracting unknown words from Weibo tweets and comments. Specifically, relying on the similarity of the users who post the tweets, we apply hierarchical clustering to divide Weibo data into distinct groups (e.g., sports, news stories, movies) before extraction. Compared with extraction from unclustered Weibo data, our experimental results demonstrate that the proposed data clustering scheme improves the recall and accuracy of extracting unknown Chinese words from tweets and comments.
AB  - Sina Weibo, a Twitter-like microblogging site that attracts over 240 million monthly active users who tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. Because many users coin new and innovative words in their tweets and comments, it is necessary to extract these emerging words, which do not yet exist in current Chinese vocabularies or dictionaries. To this end, this paper proposes a novel method, based on data clustering of Weibo users and tweets, for extracting unknown words from Weibo tweets and comments. Specifically, relying on the similarity of the users who post the tweets, we apply hierarchical clustering to divide Weibo data into distinct groups (e.g., sports, news stories, movies) before extraction. Compared with extraction from unclustered Weibo data, our experimental results demonstrate that the proposed data clustering scheme improves the recall and accuracy of extracting unknown Chinese words from tweets and comments.
UR  - http://www.scopus.com/inward/record.url?scp=84953729067&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84953729067&partnerID=8YFLogxK
U2  - 10.1109/ICC.2015.7248483
DO  - 10.1109/ICC.2015.7248483
M3  - Conference contribution
AN  - SCOPUS:84953729067
T3  - IEEE International Conference on Communications
SP  - 1182
EP  - 1187
BT  - 2015 IEEE International Conference on Communications, ICC 2015
PB  - Institute of Electrical and Electronics Engineers Inc.
T2  - IEEE International Conference on Communications, ICC 2015
Y2  - 8 June 2015 through 12 June 2015
ER  -
