Extracting unknown words from Sina Weibo via data clustering

Kai Lei, Weiyang Zhang, Kai Zhang, Kuai Xu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations


Sina Weibo, a Twitter-like microblogging site that attracts over 240 million monthly active users who tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. Because many users coin new words in their tweets and comments, it is necessary to extract these emerging words, which do not yet appear in existing Chinese vocabularies or dictionaries. To this end, this paper proposes a novel method for extracting unknown words from Weibo tweets and comments, based on data clustering of Weibo users and tweets. Specifically, relying on the similarity of the users who post the tweets, we apply hierarchical clustering to divide Weibo data into distinct groups (e.g., sports, news stories, movies) before extraction. Compared with extraction from unclustered Weibo data, our experimental results demonstrate that the proposed data clustering scheme improves the recall and accuracy of extracting unknown Chinese words from tweets and comments.
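The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the user-similarity measure, the linkage rule, or the stopping criterion, so the Jaccard similarity over hypothetical per-user interest tags, the average linkage, and the threshold of 0.5 below are all assumptions, as are the function names and the toy data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_users(features, threshold):
    """Average-linkage agglomerative clustering of users.

    features: dict mapping user id -> set of feature tokens
              (a hypothetical representation, assumed for illustration).
    Merging stops once no pair of clusters has average similarity
    of at least `threshold`.
    """
    clusters = [[user] for user in sorted(features)]

    def linkage(c1, c2):
        sims = [jaccard(features[u], features[v]) for u in c1 for v in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def group_tweets(tweets, clusters):
    """Split (user, text) pairs into one list of texts per user cluster."""
    index = {user: k for k, cluster in enumerate(clusters) for user in cluster}
    groups = [[] for _ in clusters]
    for user, text in tweets:
        groups[index[user]].append(text)
    return groups


# Toy data, invented for illustration only.
features = {
    "u1": {"nba", "football"},
    "u2": {"nba", "football", "tennis"},
    "u3": {"film", "actor"},
    "u4": {"film", "actor", "oscar"},
}
tweets = [("u1", "t1"), ("u3", "t2"), ("u2", "t3"), ("u4", "t4")]

clusters = cluster_users(features, threshold=0.5)
groups = group_tweets(tweets, clusters)
```

Under these assumptions, each resulting group of tweets would then be passed separately to whatever unknown-word extractor is in use, so that candidate words are scored against text from users with similar interests rather than against the whole corpus.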

Original language: English (US)
Title of host publication: 2015 IEEE International Conference on Communications, ICC 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 6
ISBN (Electronic): 9781467364324
State: Published - Sep 9 2015
Event: IEEE International Conference on Communications, ICC 2015 - London, United Kingdom
Duration: Jun 8 2015 → Jun 12 2015

Publication series

Name: IEEE International Conference on Communications
ISSN (Print): 1550-3607


Other: IEEE International Conference on Communications, ICC 2015
Country/Territory: United Kingdom

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Electrical and Electronic Engineering

