EmoMix Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Abstract

There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on the synthesis of a limited number of emotion types and have achieved unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embedding. Mixed emotion synthesis is achieved by combining the noises predicted by diffusion model conditioned on different emotions during only one sampling process at the run-time. To control the intensity, we apply the Neutral and specific primary emotion noise combined in varying degrees. Experimental results validate the effectiveness of EmoMix for synthesizing mixed emotion and intensity control.

Model Architecture

The overview architecture for EmoMix and the green part represents the extended sampling process of mixed emotion synthesis at run-time. The dotted arrows are only used during training. The SER mdoels are pre-trained and their parameters are frozen.

Experiment

Mixed Emotion Synthesis

We use the formula “Mixed Emotion = Base + Mixed-in” in following table to represent mixed emotions for easy understanding.

Disappointment = Sad + Surprise
Outrage = Angry + Surprise
Excitement = Happy + Surprise

Mixed Emotion	Base	Mixed-in	Mixed Emotion
Disappointment
Disappointment
Outrage
Excitement

Primary Emotion Intensity Control

We apply Neutral and primary mixture to control primary emotion intensity.

Emotion	Weak	Medium	Strong
Neutral-Surprise
Neutral-Happy
Neutral-Sad
Neutral-Angry