Restoring degraded speech via a modified diffusion model

Jianwei Zhang; Suren Jayasuriya; Visar Berisha

doi:10.21437/Interspeech.2021-1889

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

Many deterministic mathematical operations (e.g., compression, clipping, downsampling) degrade speech quality considerably. In this paper we introduce a neural network architecture, based on a modification of the DiffWave model, that aims to restore the original speech signal. DiffWave, a recently published diffusion-based vocoder, has shown state-of-the-art synthesized speech quality and relatively short waveform generation times with only a small number of parameters. We replace the mel-spectrum upsampler in DiffWave with a deep CNN upsampler, which is trained to alter the degraded speech mel-spectrum to match that of the original speech. The model is trained using the original speech waveform, but conditioned on the degraded speech mel-spectrum. After training, only the degraded mel-spectrum is used as input, and the model generates an estimate of the original speech. With the original DiffWave model as the baseline, our model improves speech quality in several experiments. These include improving the quality of speech degraded by LPC-10 compression, AMR-NB compression, and signal clipping. Compared to the original DiffWave architecture, our scheme achieves better performance on several objective perceptual metrics and in subjective comparisons. Improvements over the baseline are further amplified in an out-of-corpus evaluation setting.
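The training setup described in the abstract (clean waveform as the diffusion target, degraded signal as the conditioning input) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard denoising-diffusion forward process that DiffWave builds on, uses hard clipping as the example degradation, and lets the clipped waveform stand in for the degraded mel-spectrum to keep the example dependency-free. The function names (`hard_clip`, `noise_schedule`, `make_training_example`), the 50-step linear schedule, and the clipping threshold are all illustrative assumptions.

```python
import math
import random


def hard_clip(signal, threshold):
    """Deterministic degradation: limit every sample to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in signal]


def noise_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    """Linear beta schedule; returns the cumulative products alpha_bar_t.

    The step count and beta range are illustrative assumptions.
    """
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def make_training_example(clean, condition, alpha_bars, rng):
    """Build one denoising training example.

    A network would receive (noisy clean waveform, step t, degraded
    conditioning) and regress onto the Gaussian noise that was added.
    """
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    noisy = [a * x + b * e for x, e in zip(clean, eps)]
    return {"noisy": noisy, "t": t, "condition": condition, "target": eps}


# Toy "speech": 10 ms of a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]

# Stand-in for the degraded mel-spectrum (assumption: raw clipped samples).
degraded = hard_clip(clean, threshold=0.3)

alpha_bars = noise_schedule(50)
example = make_training_example(clean, degraded, alpha_bars, random.Random(0))
```

At inference time the clean waveform is unavailable, so a sampler would start from Gaussian noise and denoise step by step while conditioning only on the degraded input, which matches the abstract's description of post-training use.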

Original language	English (US)
Title of host publication	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Publisher	International Speech Communication Association
Pages	2753-2757
Number of pages	5
ISBN (Electronic)	9781713836902
DOIs	https://doi.org/10.21437/Interspeech.2021-1889
State	Published - 2021
Event	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic
Duration	Aug 30 2021 → Sep 3 2021

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	4
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/Territory	Czech Republic
City	Brno
Period	8/30/21 → 9/3/21

Keywords

Diffusion model
Lossy transformation
Restoring speech
Speech enhancement
Vocoder

ASJC Scopus subject areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modeling and Simulation

Access to Document

10.21437/Interspeech.2021-1889

Cite this

Zhang, J., Jayasuriya, S., & Berisha, V. (2021). Restoring degraded speech via a modified diffusion model. In 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 (pp. 2753-2757). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 4). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2021-1889

@inproceedings{78bdadde45e4459e9db1373efd7f119a,
title = "Restoring degraded speech via a modified diffusion model",
abstract = "Many deterministic mathematical operations (e.g., compression, clipping, downsampling) degrade speech quality considerably. In this paper we introduce a neural network architecture, based on a modification of the DiffWave model, that aims to restore the original speech signal. DiffWave, a recently published diffusion-based vocoder, has shown state-of-the-art synthesized speech quality and relatively short waveform generation times with only a small number of parameters. We replace the mel-spectrum upsampler in DiffWave with a deep CNN upsampler, which is trained to alter the degraded speech mel-spectrum to match that of the original speech. The model is trained using the original speech waveform, but conditioned on the degraded speech mel-spectrum. After training, only the degraded mel-spectrum is used as input, and the model generates an estimate of the original speech. With the original DiffWave model as the baseline, our model improves speech quality in several experiments. These include improving the quality of speech degraded by LPC-10 compression, AMR-NB compression, and signal clipping. Compared to the original DiffWave architecture, our scheme achieves better performance on several objective perceptual metrics and in subjective comparisons. Improvements over the baseline are further amplified in an out-of-corpus evaluation setting.",
keywords = "Diffusion model, Lossy transformation, Restoring speech, Speech enhancement, Vocoder",
author = "Jianwei Zhang and Suren Jayasuriya and Visar Berisha",
note = "Funding Information: This work was partially supported by ONR Contract N000142012330, and by NIH NIDCD R01 DC006859. Publisher Copyright: Copyright {\textcopyright} 2021 ISCA.; 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 ; Conference date: 30-08-2021 Through 03-09-2021",
year = "2021",
doi = "10.21437/Interspeech.2021-1889",
language = "English (US)",
series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
publisher = "International Speech Communication Association",
pages = "2753--2757",
booktitle = "22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021",
}

TY - GEN
T1 - Restoring degraded speech via a modified diffusion model
AU - Zhang, Jianwei
AU - Jayasuriya, Suren
AU - Berisha, Visar
PY - 2021
Y1 - 2021
AB - Many deterministic mathematical operations (e.g., compression, clipping, downsampling) degrade speech quality considerably. In this paper we introduce a neural network architecture, based on a modification of the DiffWave model, that aims to restore the original speech signal. DiffWave, a recently published diffusion-based vocoder, has shown state-of-the-art synthesized speech quality and relatively short waveform generation times with only a small number of parameters. We replace the mel-spectrum upsampler in DiffWave with a deep CNN upsampler, which is trained to alter the degraded speech mel-spectrum to match that of the original speech. The model is trained using the original speech waveform, but conditioned on the degraded speech mel-spectrum. After training, only the degraded mel-spectrum is used as input, and the model generates an estimate of the original speech. With the original DiffWave model as the baseline, our model improves speech quality in several experiments. These include improving the quality of speech degraded by LPC-10 compression, AMR-NB compression, and signal clipping. Compared to the original DiffWave architecture, our scheme achieves better performance on several objective perceptual metrics and in subjective comparisons. Improvements over the baseline are further amplified in an out-of-corpus evaluation setting.
KW - Diffusion model
KW - Lossy transformation
KW - Restoring speech
KW - Speech enhancement
KW - Vocoder
UR - http://www.scopus.com/inward/record.url?scp=85119196384&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119196384&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1889
DO - 10.21437/Interspeech.2021-1889
M3 - Conference contribution
AN - SCOPUS:85119196384
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2753
EP - 2757
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -
