Mikkel N. Schmidt and Rasmus K. Olsson

Abstract: We address the problem of separating multiple speakers from a single-microphone recording. We estimate a real-valued time-frequency representation of the speech sources linearly from features derived from the observed mixture. As features, we use sparse, non-negative encodings of the speech mixture in terms of pre-learned, speaker-dependent dictionaries. Compared with direct separation in the feature space and with linear estimation using the mixture itself as the features, the method achieves better separation in terms of the signal-to-error ratio.
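
Illustration: The pipeline summarized in the abstract can be sketched in a few lines of NumPy. This is a minimal sketch on synthetic data, not the implementation from the report. It assumes non-negative spectrogram-like columns, random matrices standing in for the pre-learned speaker dictionaries, a standard multiplicative update for the sparse non-negative coding step, and a ridge-regularized least-squares map for the linear estimator. All function names, parameter values, and the decibel form of the signal-to-error ratio are hypothetical choices made for illustration.

```python
import numpy as np

def sparse_nn_encode(V, D, sparsity=0.1, n_iter=200, eps=1e-9):
    """Non-negative sparse codes H with V ~ D @ H, keeping the dictionary D fixed.

    Uses multiplicative updates for squared error plus an L1 penalty on H
    (an assumed coding scheme; the report may use a different one).
    """
    rng = np.random.default_rng(0)
    H = rng.random((D.shape[1], V.shape[1])) + eps
    DtV = D.T @ V
    DtD = D.T @ D
    for _ in range(n_iter):
        H *= DtV / (DtD @ H + sparsity + eps)
    return H

def fit_linear_map(F, Y, ridge=1e-3):
    """Return W minimizing ||Y - W F||^2 + ridge * ||W||^2."""
    G = F @ F.T + ridge * np.eye(F.shape[0])
    return np.linalg.solve(G, F @ Y.T).T

def signal_to_error_ratio(Y, Y_hat):
    """Signal-to-error ratio in dB (assumed definition)."""
    return 10.0 * np.log10(np.sum(Y ** 2) / np.sum((Y - Y_hat) ** 2))

# --- Toy data: random dictionaries stand in for pre-learned speaker models ---
rng = np.random.default_rng(1)
n_freq, n_atoms, n_frames = 32, 8, 200
D1 = rng.random((n_freq, n_atoms))   # hypothetical dictionary, speaker 1
D2 = rng.random((n_freq, n_atoms))   # hypothetical dictionary, speaker 2
D = np.hstack([D1, D2])              # joint dictionary used to encode mixtures

def toy_source(Dk, n):
    """Synthetic source: sparse non-negative activations of dictionary atoms."""
    mask = rng.random((Dk.shape[1], n)) < 0.2
    return Dk @ (rng.random((Dk.shape[1], n)) * mask)

Y1_train, Y2_train = toy_source(D1, n_frames), toy_source(D2, n_frames)
Y1_test, Y2_test = toy_source(D1, n_frames), toy_source(D2, n_frames)

# Features: sparse non-negative encodings of the mixtures.
F_train = sparse_nn_encode(Y1_train + Y2_train, D)
F_test = sparse_nn_encode(Y1_test + Y2_test, D)

# Linear estimation of source 1 from the features.
W = fit_linear_map(F_train, Y1_train)
Y1_linear = W @ F_test

# Baseline: direct separation in the feature space (speaker 1's atoms only).
Y1_direct = D1 @ F_test[:n_atoms]

print("SER, linear from features (dB):", signal_to_error_ratio(Y1_test, Y1_linear))
print("SER, direct in feature space (dB):", signal_to_error_ratio(Y1_test, Y1_direct))
```

The third comparison named in the abstract, linear estimation from the mixture itself, would correspond to calling `fit_linear_map` with the mixture columns in place of `F_train`. The numbers printed on this toy data say nothing about the results reported in the paper.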

Demonstration: Here are a few audio samples demonstrating the algorithm described in the paper.

Opposite gender mixtures:

Speaker | Female (4) | Female (7) | Female (11) | Female (15)
Male (1) | Mix = Male + Female | Mix = Male + Female | Mix = Male + Female | Mix = Male + Female
Male (2) | Mix = Male + Female | Mix = Male + Female | Mix = Male + Female | Mix = Male + Female
Male (3) | Mix = Male + Female | Mix = Male + Female | Mix = Male + Female | Mix = Male + Female
Male (5) | Mix = Male + Female | Mix = Male + Female | Mix = Male + Female | Mix = Male + Female

Male-male mixtures:

Speaker | Male (2) | Male (3) | Male (5)
Male (1) | Mix = Male + Male | Mix = Male + Male | Mix = Male + Male
Male (2) | - | Mix = Male + Male | Mix = Male + Male
Male (3) | - | - | Mix = Male + Male

Female-female mixtures:

Speaker | Female (7) | Female (11) | Female (15)
Female (4) | Mix = Female + Female | Mix = Female + Female | Mix = Female + Female
Female (7) | - | Mix = Female + Female | Mix = Female + Female
Female (11) | - | - | Mix = Female + Female

Files:
 imm4996.pdf
Cite:
Mikkel N. Schmidt and Rasmus K. Olsson, Feature Space Reconstruction for Single-Channel Speech Separation, 2007
BibTeX:
@techreport{schmidt07a,
   title = "Feature Space Reconstruction for Single-Channel Speech Separation",
   author = "Mikkel N. Schmidt and Rasmus K. Olsson",
   year = "2007"
}
 
 
Mikkel N. Schmidt | Technical University of Denmark | Email: mns(a)imm.dtu.dk