1. Speech-side prosody control

A speech-side prosody embedding can control the prosody of specific frames. In each prosody graph, the red line denotes the 1st dimension of the prosody embedding and the green line denotes the 2nd dimension.
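To make the idea of a frame-level adjustment concrete, here is a minimal sketch. It assumes the speech-side embedding can be treated as one small vector per frame and that an "adjustment" is an offset added to one dimension over a range of frames. The function name, the offset value, and the frame range are all hypothetical and are not taken from the model behind these samples.

```python
from typing import List

Embedding = List[List[float]]  # one row per frame, one column per dimension


def adjust_frames(embedding: Embedding, dim: int, start: int, end: int,
                  offset: float) -> Embedding:
    """Return a copy with `offset` added to column `dim` for frames start..end-1."""
    if not 0 <= start <= end <= len(embedding):
        raise ValueError("frame range out of bounds")
    adjusted = [row[:] for row in embedding]  # copy so the input stays untouched
    for row in adjusted[start:end]:
        row[dim] += offset
    return adjusted


# Hypothetical 6-frame, 2-dimensional embedding (all zeros for illustration).
frames = [[0.0, 0.0] for _ in range(6)]

# Raise the 1st dimension (index 0) on frames 2 and 3 only.
raised = adjust_frames(frames, dim=0, start=2, end=4, offset=0.5)
```

Only the selected frames change; the other frames and the other dimension are left as they were, which mirrors the "adjusted 1st dimension" and "adjusted 2nd dimension" samples in this section.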

Text: I had a dear friend, once a brown terrier, "skye" they called her.

No Adjustment

Spectrogram | Prosody Graph

Adjusted 1st dimension (pitch)

Spectrogram | Prosody Graph

Adjusted 2nd dimension (amplitude)

Spectrogram | Prosody Graph

Text: But when it came to breaking in, that was a bad time for me.

No Adjustment

Spectrogram | Prosody Graph

Adjusted 1st dimension (pitch)

Spectrogram | Prosody Graph

Adjusted 2nd dimension (amplitude)

Spectrogram | Prosody Graph

Text: I know nothing about it, but Fanny must teach me.

No Adjustment

Spectrogram | Prosody Graph

Adjusted 1st dimension (pitch)

Spectrogram | Prosody Graph

Adjusted 2nd dimension (amplitude)

Spectrogram | Prosody Graph

2. Text-side prosody control

A text-side prosody embedding can control the prosody of specific phonemes. In each prosody graph, the red line denotes the 1st dimension of the prosody embedding, the blue line denotes the 2nd dimension, and the yellow line denotes the 3rd dimension.
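As a rough illustration of phoneme-level control, the sketch below assumes the text-side embedding holds one 3-dimensional vector per phoneme and that an adjustment is an offset applied to one dimension of selected phonemes. The phoneme labels, the helper name, and the offset are hypothetical and are not taken from the model behind these samples.

```python
from typing import Iterable, List

Embedding = List[List[float]]  # one row per phoneme, one column per dimension


def adjust_phonemes(embedding: Embedding, indices: Iterable[int], dim: int,
                    offset: float) -> Embedding:
    """Return a copy with `offset` added to column `dim` of the chosen phonemes."""
    wanted = set(indices)
    return [
        [v + offset if (i in wanted and d == dim) else v
         for d, v in enumerate(row)]
        for i, row in enumerate(embedding)
    ]


# Illustrative phoneme sequence for "dear friend" (labels are an assumption).
phonemes = ["D", "IH", "R", "F", "R", "EH", "N", "D"]
embedding = [[0.0, 0.0, 0.0] for _ in phonemes]

# Lower the 2nd dimension (index 1) on the three phonemes of "dear".
edited = adjust_phonemes(embedding, indices=range(0, 3), dim=1, offset=-0.3)
```

Because the rows align with phonemes rather than audio frames, an edit like this targets a word or syllable directly, without needing to know where it falls in time.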

Text: I had a dear friend, once a brown terrier, "skye" they called her.

No Adjustment

Spectrogram | Prosody Graph

Adjusted 1st dimension (amplitude, length)

Spectrogram | Prosody Graph

Adjusted 2nd dimension (pitch)

Prosody Graph

Adjusted 3rd dimension (pitch, length)

Spectrogram | Prosody Graph

Text: But when it came to breaking in, that was a bad time for me.

No Adjustment

Spectrogram | Prosody Graph

Adjusted 1st dimension (amplitude, length)

Spectrogram | Prosody Graph

Adjusted 2nd dimension (pitch)

Spectrogram | Prosody Graph

Adjusted 3rd dimension (pitch, length)

Spectrogram | Prosody Graph

Text: I know nothing about it, but Fanny must teach me.

No Adjustment

Spectrogram | Prosody Graph

Adjusted 1st dimension (amplitude, length)

Spectrogram | Prosody Graph

Adjusted 2nd dimension (pitch)

Spectrogram | Prosody Graph

Adjusted 3rd dimension (pitch, length)

Spectrogram | Prosody Graph

3. Effect of the normalized prosody embedding

When a prosody embedding is used for prosody transfer, the result tends to carry over the reference speech's speaker identity. For example, when the reference speaker is female and the target speaker is male, the generated speech has a higher pitch than the male speaker's normal pitch. The samples below show that the normalized prosody embedding can prevent this problem.
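This page does not say how the embedding is normalized. One plausible reading, shown here purely as an assumption, is that the reference speaker's average embedding is subtracted so that only the deviation from that speaker's usual prosody is transferred. The helper names and the numbers below are made up for illustration.

```python
from typing import List

Embedding = List[List[float]]  # one row per step, one column per dimension


def column_means(rows: Embedding) -> List[float]:
    """Per-dimension mean over all rows (e.g. all embeddings of one speaker)."""
    if not rows:
        raise ValueError("need at least one row")
    return [sum(col) / len(rows) for col in zip(*rows)]


def normalize(embedding: Embedding, speaker_mean: List[float]) -> Embedding:
    """Subtract the speaker's mean so only the relative prosody remains."""
    return [[v - m for v, m in zip(row, speaker_mean)] for row in embedding]


# Made-up embeddings collected from the reference speaker.
reference_speaker_rows = [[1.0, 0.0], [2.0, 0.5], [3.0, 1.0]]
# Made-up embedding of the reference utterance to transfer.
utterance = [[2.5, 0.5], [1.5, 0.75]]

speaker_mean = column_means(reference_speaker_rows)
normalized = normalize(utterance, speaker_mean)
```

Under this assumption, a reference speaker with a high average pitch contributes only rises and falls relative to her own baseline, so the target speaker's baseline pitch is not pulled upward.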

Text: He stopped, and Philip nodded at the horrified question in his eyes.

Reference speech (American female)

Target speaker (American male)

Transferred speech (without normalization)

Transferred speech (with normalization)

Text: Well, said York, if they come here they must wear the bearing rein.

Reference speech (American female)

Target speaker (American male)

Transferred speech (without normalization)

Transferred speech (with normalization)

Text: He was taken last night in the yard, and could scarcely crawl home.

Reference speech (American female)

Target speaker (Korean male)

Transferred speech (without normalization)

Transferred speech (with normalization)

4. Singing voice transfer

Results of prosody transfer applied to a singing voice.

Text: Sweet dreams are made of these. Friendly Assistants who work hard to please

Original song

Target speaker (American female)

Global style token

Speech-side control

Text-side control

5. BTS "Fake Love" covered by Fake Trump

Typecast US Inc. 400 Concar Dr, San Mateo, CA 94402, USA
Copyright © 2025 Typecast US Inc. All Rights Reserved.