Enriching short text representation in microblog for clustering

Jiliang Tang, Xufei Wang, Huiji Gao, Xia Hu, Huan Liu

Research output: Contribution to journal › Article › peer-review

76 Scopus citations


Social media websites allow users to exchange short texts such as tweets via microblogs and user status in friendship networks. Their limited length, pervasive abbreviations, and coined acronyms and words exacerbate the problems of synonymy and polysemy, and bring about new challenges to data mining applications such as text clustering and classification. To address these issues, we dissect some potential causes and devise an efficient approach that enriches data representation by employing machine translation to increase the number of features from different languages. Then we propose a novel framework which performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively in terms of effectiveness on two social media datasets from Facebook and Twitter. With its significant performance improvement, we further investigate potential factors that contribute to the improved performance.
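The two ideas in the abstract, enlarging the feature space with features from other languages and then reducing it by matrix factorization, can be illustrated with a small sketch. This is not the authors' framework: plain non-negative matrix factorization (NMF) with multiplicative updates stands in for the paper's factorization, the French strings are hand-written stand-ins for machine translation output, and the helper names `bag_of_words` and `nmf` are hypothetical.

```python
import numpy as np

def bag_of_words(docs):
    """Build a document-term count matrix and its sorted vocabulary."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for tok in d.lower().split():
            X[i, index[tok]] += 1.0
    return X, vocab

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Plain multiplicative-update NMF: X is approximated by W @ H, with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Assumption: toy short texts, each paired with a pre-computed translation.
english = ["gr8 match 2nite", "luv this song", "match was great", "new song out"]
french = ["super match ce soir", "adore cette chanson", "quel super match",
          "nouvelle chanson sortie"]

X_en, _ = bag_of_words(english)
X_fr, _ = bag_of_words(french)

# Enriched representation: one row per text, columns from both languages.
X = np.hstack([X_en, X_fr])

# Reduce the enlarged feature space to k latent dimensions.
W, H = nmf(X, k=2)

# One simple way to read clusters off the factorization:
# assign each text to its strongest latent dimension.
labels = W.argmax(axis=1)
```

In this sketch the translated features give abbreviated texts such as "gr8 match 2nite" extra columns that they share with "match was great", which is the kind of overlap the enrichment step is meant to create.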

Original language: English (US)
Pages (from-to): 88-101
Number of pages: 14
Journal: Frontiers of Computer Science in China
Issue number: 1
State: Published - Feb 2012


Keywords

  • matrix factorization
  • multi-language knowledge
  • short texts
  • social media
  • text representation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

