Adversarial Text Purification: A Large Language Model Approach for Defense

  • Raha Moraffah
  • , Shubh Khandelwal
  • , Amrita Bhattacharjee
  • , Huan Liu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

4 Scopus citations

Abstract

Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks without knowing the type of attacks or training of the classifier. These techniques characterize and eliminate adversarial perturbations from the attacked inputs, aiming to restore purified samples that retain similarity to the initially attacked ones and are correctly classified by the classifier. Due to the inherent challenges associated with characterizing noise perturbations for discrete inputs, adversarial text purification has been relatively unexplored. In this paper, we investigate the effectiveness of adversarial purification methods in defending text classifiers. We propose a novel adversarial text purification that harnesses the generative capabilities of Large Language Models (LLMs) to purify adversarial text without the need to explicitly characterize the discrete noise perturbations. We utilize prompt engineering to exploit LLMs for recovering the purified samples for given adversarial examples such that they are semantically similar and correctly classified. Our proposed method demonstrates remarkable performance over various classifiers, improving their accuracy under the attack by over 65% on average.

Original languageEnglish (US)
Title of host publicationAdvances in Knowledge Discovery and Data Mining - 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024, Proceedings
EditorsDe-Nian Yang, Xing Xie, Vincent S. Tseng, Jian Pei, Jen-Wei Huang, Jerry Chun-Wei Lin
PublisherSpringer Science and Business Media Deutschland GmbH
Pages65-77
Number of pages13
ISBN (Print)9789819722648
DOIs
StatePublished - 2024
Event28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024 - Taipei, Taiwan, Province of China
Duration: May 7 2024May 10 2024

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume14649 LNAI
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference28th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2024
Country/TerritoryTaiwan, Province of China
CityTaipei
Period5/7/245/10/24

Keywords

  • Adversarial Purification
  • Large Language Model
  • Textual Adversarial Defenses
  • Textual Adversarial Defenses

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

Fingerprint

Dive into the research topics of 'Adversarial Text Purification: A Large Language Model Approach for Defense'. Together they form a unique fingerprint.

Cite this