Computationally-efficient voice activity detection based on deep neural networks

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation


Voice activity detection (VAD) is among the first preprocessing steps in most speech processing applications. While several very low-power analog solutions exist, more recent deep neural network (DNN) based solutions achieve superior VAD performance even in complex noisy backgrounds, at the expense of increased computation. In this paper, we propose a computationally-efficient network architecture, ResCap+, for high-performance VAD. ResCap+ operates on small-sized sequences and is built with residual blocks in a convolutional neural network to encode the characteristics of the input spectrum, and a capsule network with LSTM cells to capture the temporal relationship between these sequences. We evaluate the model using the AMI meeting corpus and show that it outperforms a state-of-the-art DNN-based model on accuracy with ≈55× lower computation cost. We also present initial hardware performance results on a low-power programmable architecture, Transmuter, and show that it can process each 40 ms input audio sequence with a delay of 15.17 ms, resulting in real-time performance.
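The real-time claim follows from simple arithmetic on the two figures quoted in the abstract: a 40 ms audio sequence processed in 15.17 ms. The sketch below works that out as a real-time factor (processing time divided by audio duration, where a value below 1.0 means the processor keeps up with the incoming audio). The function and variable names are hypothetical and chosen for illustration only; they do not come from the paper.

```python
def real_time_factor(processing_ms: float, segment_ms: float) -> float:
    """Return processing time divided by audio duration.

    A value below 1.0 means each segment is finished before the
    next one has fully arrived, i.e. real-time operation.
    """
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return processing_ms / segment_ms


# Figures quoted in the abstract.
SEGMENT_MS = 40.0      # duration of one input audio sequence
PROCESSING_MS = 15.17  # reported delay to process one sequence

rtf = real_time_factor(PROCESSING_MS, SEGMENT_MS)
headroom_ms = SEGMENT_MS - PROCESSING_MS  # slack left per sequence

print(f"real-time factor: {rtf:.3f}")
print(f"headroom per sequence: {headroom_ms:.2f} ms")
```

With these numbers the factor is about 0.38, leaving roughly 24.8 ms of slack per sequence.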

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 6
ISBN (Electronic): 9781665401449
State: Published - 2021
Event: 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021 - Coimbra, Portugal
Duration: Oct 19 2021 → Oct 21 2021

Publication series

Name: IEEE Workshop on Signal Processing Systems, SiPS: Design and Implementation
ISSN (Print): 1520-6130


Conference: 2021 IEEE Workshop on Signal Processing Systems, SiPS 2021


  • Capsule network
  • Deep neural network
  • Low-power architecture
  • Voice activity detection

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Signal Processing
  • Applied Mathematics
  • Hardware and Architecture

