Mikkel N. Schmidt and Rasmus K. Olsson

Abstract: We apply machine learning techniques to the problem of separating multiple speech sources from a single microphone recording. The method of choice is a sparse non-negative matrix factorization algorithm, which can learn sparse representations of the data in an unsupervised manner. This is applied to the learning of personalized dictionaries from a speech corpus, which in turn are used to separate the audio stream into its components. We show that computational savings can be achieved by segmenting the training data on a phoneme level. To split the data, a conventional speech recognizer is used. Both the unsupervised and supervised adaptation schemes result in significant improvements in terms of the target-to-masker ratio.
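
As a rough illustration of the pipeline the abstract outlines (learn one dictionary per speaker, concatenate the dictionaries, factorize the mixture, and reconstruct each speaker from its own part), here is a minimal sketch. It is not the authors' implementation. It assumes the inputs are non-negative magnitude spectrograms (frequency bins by time frames), uses a simplified Euclidean multiplicative-update NMF with an L1 penalty on the activations, and adds a soft-mask step at the end. The names sparse_nmf and separate and all parameter values are hypothetical.

```python
import numpy as np

EPS = 1e-9  # guards against division by zero


def sparse_nmf(V, n_atoms, sparsity=0.1, n_iter=200, W=None, seed=0):
    """Approximate V ~= W @ H with W, H >= 0 and an L1 penalty on H.

    If W is supplied it is kept fixed (n_atoms is then ignored) and only
    the activations H are estimated. Simplified variant for illustration.
    """
    rng = np.random.default_rng(seed)
    learn_W = W is None
    if learn_W:
        W = rng.random((V.shape[0], n_atoms)) + EPS
        W /= np.linalg.norm(W, axis=0, keepdims=True)
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        # The sparsity weight in the denominator shrinks the activations.
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + EPS)
        if learn_W:
            W *= (V @ H.T) / (W @ H @ H.T + EPS)
            # Heuristic: unit-norm atoms so the penalty cannot be evaded
            # by scaling W up and H down.
            W /= np.linalg.norm(W, axis=0, keepdims=True) + EPS
    return W, H


def separate(V_mix, W1, W2, sparsity=0.1, n_iter=200):
    """Split a mixture spectrogram using two pre-trained speaker dictionaries."""
    W = np.hstack([W1, W2])  # concatenated dictionary, held fixed
    _, H = sparse_nmf(V_mix, W.shape[1], sparsity, n_iter, W=W)
    k = W1.shape[1]
    V1 = W1 @ H[:k]  # part of the mixture explained by speaker 1
    V2 = W2 @ H[k:]  # part of the mixture explained by speaker 2
    # Assumption: a soft mask, so the two estimates sum to the mixture.
    mask = V1 / (V1 + V2 + EPS)
    return mask * V_mix, (1.0 - mask) * V_mix
```

In use, one would call sparse_nmf once per speaker on that speaker's training spectrogram to obtain W1 and W2, then pass the mixture spectrogram to separate. Converting the resulting magnitude estimates back to audio (for example, by reusing the mixture phase) is outside the scope of this sketch.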

Demonstration: Here are a few audio demonstrations of the method described in the paper. The mixtures are at 0 dB.

[Audio demonstration table. Columns: Mixture, Speaker 1, Speaker 2. Rows: Different Gender, Same Gender.]


Files:
 imm4511.pdf
 interspeech2006poster.pdf
Cite:
Mikkel N. Schmidt and Rasmus K. Olsson, Single-Channel Speech Separation using Sparse Non-Negative Matrix Factorization, International Conference on Spoken Language Processing (INTERSPEECH), 2006
BibTeX:
@inproceedings{schmidt06speechseparation,
   title = "Single-Channel Speech Separation using Sparse Non-Negative Matrix Factorization",
   author = "Mikkel N. Schmidt and Rasmus K. Olsson",
   booktitle = "International Conference on Spoken Language Processing (INTERSPEECH)",
   month = "Sep",
   year = "2006"
}
Mikkel N. Schmidt | Technical University of Denmark | Email: mns(a)imm.dtu.dk