M55: USING AUTOMATIC SPEECH RECOGNITION AND SPEECH SYNTHESIS TO
IMPROVE SPEECH INTELLIGIBILITY FOR COCHLEAR IMPLANT USERS IN
REVERBERANT AND NOISY LISTENING ENVIRONMENTS
Kevin M. Chu, Leslie M. Collins, Boyla O. Mainsah
Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
Cochlear implant (CI) recipients have difficulty understanding speech in listening environments
with reverberation and noise. One speech enhancement strategy, proposed by Hazrati et al.
(2015), used automatic speech recognition (ASR) to convert a reverberant speech signal into a
hypothesized word sequence. The decoded sequence was then used as the input to a speech
synthesizer to regenerate an anechoic speech signal. Hazrati et al. (2015) trained the Kaldi ASR
toolkit (Povey et al., 2011) to recognize speech in both anechoic and reverberant conditions
using various recorded room impulse responses (RIRs) and reverberation times. Compared to
the unmitigated reverberant condition, the approach improved speech intelligibility for CI users in
reverberant listening environments. However, this strategy cannot be implemented in real time
because it uses a non-causal ASR model, which requires knowledge of the future signal as well
as the identity of the speaker.
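As a minimal sketch of this two-stage pipeline, the Python fragment below chains an ASR decoder into a speech synthesizer; asr_decode and tts_synthesize are hypothetical stand-ins for a trained recognizer (e.g., a Kaldi model) and a synthesizer, not interfaces from the original study.

    import numpy as np

    def asr_decode(signal: np.ndarray, sample_rate: int) -> str:
        # Hypothetical stand-in: decode a reverberant, noisy speech signal
        # into a hypothesized word sequence (e.g., with a trained Kaldi model).
        raise NotImplementedError

    def tts_synthesize(text: str, sample_rate: int) -> np.ndarray:
        # Hypothetical stand-in: regenerate an anechoic speech signal
        # from a word sequence.
        raise NotImplementedError

    def recognition_synthesis(signal: np.ndarray, sample_rate: int) -> np.ndarray:
        # Stage 1: convert the corrupted signal into a hypothesized word sequence.
        hypothesis = asr_decode(signal, sample_rate)
        # Stage 2: resynthesize clean speech from the hypothesis; any recognition
        # errors propagate directly into the regenerated signal.
        return tts_synthesize(hypothesis, sample_rate)
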
If a causal ASR model is used instead, the strategy proposed by Hazrati et al. (2015) has the
potential for real-time implementation. In this study, we implemented the recognition-synthesis
approach using a causal ASR model that does not depend on future acoustic information.
Because the accuracy of the recognized word sequence greatly influences the listener’s
intelligibility, this research focused primarily on developing an ASR model that is robust to both
reverberation and noise. A speech recognition model was trained on multi-condition sentences
with different reverberation times and noise types. To evaluate the performance, the trained
model was used to decode sentences from a separate testing database, and the hypothesized
word sequences were compared with the known text transcriptions. Results will be presented for
the ASR experiments, as well as for the performance of normal-hearing subjects on listening
tasks using vocoded speech.
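
As an illustration of the multi-condition corruption and scoring described above, the sketch below convolves clean speech with a room impulse response, adds noise at a target signal-to-noise ratio, and scores a decoded hypothesis against a reference transcription by word error rate; the function names and SNR parameterization are illustrative assumptions, not details from the study.

    import numpy as np
    from scipy.signal import fftconvolve

    def corrupt(clean, rir, noise, snr_db):
        # Simulate a reverberant, noisy utterance: convolve with the RIR,
        # then add noise scaled to the target SNR (in dB) relative to the
        # reverberant speech. Assumes noise is at least as long as the speech.
        reverberant = fftconvolve(clean, rir)[:len(clean)]
        speech_power = np.mean(reverberant ** 2)
        noise_power = np.mean(noise[:len(reverberant)] ** 2)
        gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return reverberant + gain * noise[:len(reverberant)]

    def word_error_rate(reference, hypothesis):
        # Word-level Levenshtein distance: (substitutions + deletions +
        # insertions) / number of reference words.
        ref, hyp = reference.split(), hypothesis.split()
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
        d[:, 0] = np.arange(len(ref) + 1)
        d[0, :] = np.arange(len(hyp) + 1)
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i, j] = min(d[i - 1, j] + 1,          # deletion
                              d[i, j - 1] + 1,          # insertion
                              d[i - 1, j - 1] + cost)   # substitution
        return d[len(ref), len(hyp)] / max(len(ref), 1)
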
This research is supported by NIH grant R01DC014290-04.