Work-in-progress: Automated named entity extraction for tracking censorship of current events

Antonio M. Espinoza, Jedidiah R. Crandall

Research output: Contribution to conference › Paper › peer-review

7 Scopus citations

Abstract

Tracking Internet censorship is challenging because the content that censors target can change daily, even hourly, with current events. The process must be automated because of the large amount of data that needs to be processed. Our focus in this paper is on automated probing of keyword-based Internet censorship, where natural language processing techniques are used to generate the keywords used to probe for censorship. In this paper, we present a named entity extraction framework that can extract the names of people, places, and organizations from text such as a news story. Previous efforts to automate the study of keyword-based Internet censorship have been based on semantic analysis of existing bodies of text, such as Wikipedia, and so could not extract meaningful keywords from the news for probing. We use a maximum entropy approach for named entity extraction because of its flexibility. Our preliminary results suggest that this approach performs well with only a rudimentary understanding of the target language. This makes the approach very flexible: while our current implementation is for Chinese, we anticipate that extending the framework to other languages such as Arabic, Farsi, and Spanish will be straightforward. We present testing results as well as preliminary results from probing China’s GET request censorship and search engine filtering using this framework.

Original language: English (US)
State: Published - 2011
Externally published: Yes
Event: 1st USENIX Workshop on Free and Open Communications on the Internet, FOCI 2011, co-located with USENIX Security 2011 - San Francisco, United States
Duration: Aug 8 2011 → …

Conference

Conference: 1st USENIX Workshop on Free and Open Communications on the Internet, FOCI 2011, co-located with USENIX Security 2011
Country/Territory: United States
City: San Francisco
Period: 8/8/11 → …

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
