Abstract
Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal that a humanoid robot perceives through its onboard microphones is a mixture of singing voice, music, and noise, degraded by distortion, attenuation, and reverberation. In this paper, we used a 3-directional Inception-ResNet structure inside a U-shaped encoder-decoder network to improve the utilization of the spatial and spectral information in the spectrograms.
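The abstract does not spell out the block design, but a minimal PyTorch sketch may clarify the idea. Everything here is an assumption rather than the paper's implementation: the module name `Inception3Dir`, the kernel sizes, and the reading that the three directions are frequency-wise, time-wise, and joint time-frequency convolutions are all hypothetical.

```python
import torch
import torch.nn as nn

class Inception3Dir(nn.Module):
    """Hypothetical sketch of a 3-directional Inception-ResNet block:
    parallel convolutions along the frequency axis, the time axis, and
    the joint time-frequency plane, merged and added back residually.
    Input/output shape: (batch, channels, freq_bins, time_frames)."""

    def __init__(self, channels: int):
        super().__init__()
        # Frequency-direction branch: tall 1-D kernel over frequency bins.
        self.freq = nn.Conv2d(channels, channels, kernel_size=(5, 1), padding=(2, 0))
        # Time-direction branch: wide 1-D kernel over time frames.
        self.time = nn.Conv2d(channels, channels, kernel_size=(1, 5), padding=(0, 2))
        # Joint time-frequency branch: ordinary square kernel.
        self.joint = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # 1x1 convolution merging the three branches back to `channels`.
        self.merge = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([self.freq(x), self.time(x), self.joint(x)], dim=1)
        return self.act(x + self.merge(y))  # ResNet-style residual connection
```

In a U-shaped encoder-decoder, blocks like this would stand in for the plain convolutions at each scale, with skip connections carrying encoder features to the matching decoder stage.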
The model was trained with multiple objectives: a magnitude consistency loss, a phase consistency loss, and a magnitude correlation consistency loss. We recorded the singing voice and accompaniment derived from the MIR-1K dataset with NAO robots and synthesized a 10-channel dataset for training the model. The experimental results show that the proposed model trained with the multiple objectives reaches an average NSDR of 11.55 dB on the test set, outperforming the comparison models.
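The exact formulations of the three objectives are not given in the abstract, so the following is only a sketch under explicit assumptions: magnitude consistency as an L1 distance, phase consistency as a cosine distance between phase angles, and magnitude correlation consistency as one minus the Pearson correlation of the magnitude spectrograms. The weights `w_mag`, `w_phase`, and `w_corr` are hypothetical.

```python
import torch

def multi_objective_loss(mag_est, mag_ref, phase_est, phase_ref,
                         w_mag=1.0, w_phase=1.0, w_corr=1.0):
    """Assumed sketch of a combined magnitude / phase / magnitude-correlation
    loss; all tensors are spectrograms of identical shape."""
    # Magnitude consistency: element-wise L1 distance between magnitudes.
    l_mag = torch.mean(torch.abs(mag_est - mag_ref))

    # Phase consistency: penalize the angular difference between phases.
    l_phase = torch.mean(1.0 - torch.cos(phase_est - phase_ref))

    # Magnitude correlation consistency: push the estimated and reference
    # magnitudes toward perfect linear correlation (Pearson r -> 1).
    est = mag_est - mag_est.mean()
    ref = mag_ref - mag_ref.mean()
    corr = (est * ref).sum() / (est.norm() * ref.norm() + 1e-8)
    l_corr = 1.0 - corr

    return w_mag * l_mag + w_phase * l_phase + w_corr * l_corr
```

For reference, NSDR (normalized SDR) is the standard improvement measure used with MIR-1K: the SDR of the separated estimate minus the SDR of the raw mixture, both computed against the clean target.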
Author summary
In real scenes, the mixture in singing voice separation is always contaminated with noise and distortion. In this paper, acoustic signals with distortion and noise, as perceived by the robot, are used to study the separation of singing voices in real scenes. This paper describes how the training datasets are synthesized, proposes a 3-directional Inception-ResUNet structure for multichannel singing voice separation, and adopts multiple objectives, including a magnitude correlation consistency loss, to train the model. The experimental results show that the magnitude correlation consistency loss reduces distortion and that the proposed model achieves better performance than the compared models.
Competing Interest Statement
The authors have declared no competing interest.