M55: USING AUTOMATIC SPEECH RECOGNITION AND SPEECH SYNTHESIS TO
IMPROVE SPEECH INTELLIGIBILITY FOR COCHLEAR IMPLANT USERS IN
REVERBERANT AND NOISY LISTENING ENVIRONMENTS
Kevin M. Chu, Leslie M. Collins, Boyla O. Mainsah
Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
Cochlear implant (CI) recipients have difficulty understanding speech in listening environments
with reverberation and noise. One speech enhancement strategy, proposed by Hazrati et al.
(2015), used automatic speech recognition (ASR) to convert a reverberant speech signal into a
hypothesized word sequence. The decoded sequence was then used as the input to a speech
synthesizer to regenerate an anechoic speech signal. Hazrati et al. (2015) trained the Kaldi ASR
toolkit (Povey et al., 2011) to recognize speech in both anechoic and reverberant conditions
using various recorded room impulse responses (RIRs) and reverberation times. Compared to
the unmitigated reverberant condition, the approach improved speech intelligibility for CI users in
reverberant listening environments. However, this strategy cannot be implemented in real time
because it uses a non-causal ASR model, which requires knowledge of the future signal as well
as the identity of the speaker.
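As a minimal sketch of this two-stage pipeline, the Python fragment below chains an ASR decoder into a speech synthesizer; asr_decode and tts_synthesize are hypothetical stand-ins for a trained recognizer (e.g., a Kaldi model) and a synthesizer, not interfaces from the original study.

    import numpy as np

    def asr_decode(signal: np.ndarray, sample_rate: int) -> str:
        # Hypothetical stand-in: decode a reverberant, noisy speech signal
        # into a hypothesized word sequence (e.g., with a trained Kaldi model).
        raise NotImplementedError

    def tts_synthesize(text: str, sample_rate: int) -> np.ndarray:
        # Hypothetical stand-in: regenerate an anechoic speech signal
        # from a word sequence.
        raise NotImplementedError

    def recognition_synthesis(signal: np.ndarray, sample_rate: int) -> np.ndarray:
        # Stage 1: convert the corrupted signal into a hypothesized word sequence.
        hypothesis = asr_decode(signal, sample_rate)
        # Stage 2: resynthesize clean speech from the hypothesis; any recognition
        # errors propagate directly into the regenerated signal.
        return tts_synthesize(hypothesis, sample_rate)
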
If a causal ASR model is used instead, the strategy proposed by Hazrati et al. (2015) has the
potential for real-time implementation. In this study, we implemented the recognition-synthesis
approach using a causal ASR model that does not depend on future acoustic information.
Because the accuracy of the recognized word sequence greatly influences the listener’s
intelligibility, this research focused primarily on developing an ASR model that is robust to both
reverberation and noise. A speech recognition model was trained on multi-condition sentences
with different reverberation times and noise types. To evaluate the performance, the trained
model was used to decode sentences from a separate testing database, and the hypothesized
word sequences were compared with the known text transcriptions. Results will be presented for
the ASR experiments, as well as for the performance of normal-hearing subjects on listening
tasks using vocoded speech.
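
As an illustration of the multi-condition corruption and scoring described above, the sketch below convolves clean speech with a room impulse response, adds noise at a target signal-to-noise ratio, and scores a decoded hypothesis against a reference transcription by word error rate; the function names and SNR parameterization are illustrative assumptions, not details from the study.

    import numpy as np
    from scipy.signal import fftconvolve

    def corrupt(clean, rir, noise, snr_db):
        # Simulate a reverberant, noisy utterance: convolve with the RIR,
        # then add noise scaled to the target SNR (in dB) relative to the
        # reverberant speech. Assumes noise is at least as long as the speech.
        reverberant = fftconvolve(clean, rir)[:len(clean)]
        speech_power = np.mean(reverberant ** 2)
        noise_power = np.mean(noise[:len(reverberant)] ** 2)
        gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return reverberant + gain * noise[:len(reverberant)]

    def word_error_rate(reference, hypothesis):
        # Word-level Levenshtein distance: (substitutions + deletions +
        # insertions) / number of reference words.
        ref, hyp = reference.split(), hypothesis.split()
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
        d[:, 0] = np.arange(len(ref) + 1)
        d[0, :] = np.arange(len(hyp) + 1)
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i, j] = min(d[i - 1, j] + 1,          # deletion
                              d[i, j - 1] + 1,          # insertion
                              d[i - 1, j - 1] + cost)   # substitution
        return d[len(ref), len(hyp)] / max(len(ref), 1)
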
This research is supported by NIH grant R01DC014290-04.