This page presents audio samples from the paper "PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts".
Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, current style voice conversion approaches rely on pre-defined labels or reference speech to control the conversion process, which limits style diversity or falls short in the intuitiveness and interpretability of the style representation. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and then the latent diffusion model is trained independently to sample the style vector from noise, with this process being conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete tokens and replace them with the K-Means center embeddings to serve as the linguistic content, which minimizes residual style information. Additionally, we deduplicate repeated discrete tokens and employ a differentiable duration predictor to re-predict the duration of each token, which adapts the duration of the same linguistic content to different styles. The subjective and objective evaluation results demonstrate the effectiveness of our proposed system.
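The content-processing steps described in the abstract (discrete tokens, K-Means center embeddings, deduplication, per-token durations) can be illustrated with a minimal sketch. This is not the authors' implementation: the token IDs and center vectors are made-up toy values, the function names are hypothetical, deduplication is assumed to mean collapsing consecutive repeats, and a simple doubling of durations stands in for the learned differentiable duration predictor.

```python
from itertools import groupby


def deduplicate(tokens):
    """Collapse runs of identical consecutive tokens.

    Returns the collapsed token list and the length of each run
    (the original per-token duration in frames).
    """
    unique, durations = [], []
    for token, run in groupby(tokens):
        unique.append(token)
        durations.append(sum(1 for _ in run))
    return unique, durations


def to_center_embeddings(tokens, centers):
    """Replace each token ID with its cluster-center vector."""
    return [centers[t] for t in tokens]


def expand(embeddings, durations):
    """Repeat each embedding for its duration to get frame-level content."""
    frames = []
    for emb, d in zip(embeddings, durations):
        frames.extend([emb] * d)
    return frames


# Toy frame-level token IDs (assumed values, not real HuBERT output).
frame_tokens = [7, 7, 7, 2, 2, 5, 7, 7]
# Toy 2-dimensional cluster centers (assumed values).
centers = {2: [0.1, 0.9], 5: [0.4, 0.4], 7: [0.8, 0.2]}

unique, durations = deduplicate(frame_tokens)
content = to_center_embeddings(unique, centers)

# Stand-in for a style-dependent duration prediction: a slower style
# is imitated here by doubling every duration.
slow_durations = [2 * d for d in durations]
slow_frames = expand(content, slow_durations)
```

Because the content sequence no longer carries the source durations, the same tokens can be re-expanded with different durations for different target styles, which is the role the abstract assigns to the duration predictor.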
Proposed Approach

1: Style Voice Conversion by Reference Speech
Source | Reference | MixEmo | StyleSpeech | Proposed | |
2: Style Voice Conversion by Natural Language Prompts
Source | Prompts | Converted | |
一个专业且具有权威性的风格,声音柔和有磁性,成熟男性 (A style that is both professional and authoritative, with a magnetic voice, exuding a sense of mature masculinity) |
声音动人且具有权威性的女生 (A captivating and professional voice emanating from a lady) |
妙龄少女流利地说,有质疑的语气,情感上则是有些惊讶 (A young girl spoke fluently, with a questioning tone, and was emotionally a little surprised) |
慢慢地说,语气低沉平静, 青年男性音色 (Speak slowly, in a low and calm tone, with a young male voice) |
女性音色讲话很有气无力,很不认可冷漠,显得特别厌恶 (A female voice speaks very weakly, disapproving and cold, and seems particularly disgusted) |
青年男性声音高昂,语气冷酷,显得有点愤怒的样子 (A young man's voice is high, his tone is cold, and he seems a little angry) |
青年男性带着颤抖的语气,语气忐忑且惊慌,表现出害怕的样子 (A young male speaks with a trembling tone, worried and panicked, showing fear) |
妙龄女性说话状态是气愤的,说话人的语气很强硬辱骂,有仇恨的情绪 (A young lady is angry when she speaks, and the speaker's tone is strong, abusive, and hateful) |
青年男性气愤的情绪,表达了说话人的失望和愤懑 (The angry emotion of a young man expresses the speaker's disappointment and resentment) |
成熟男性缓慢的说,语气很威严 (A mature man said slowly, with a very dignified tone) |
说话状态是笑的,说话人的语气很轻快烂漫,有欢乐的情绪,年轻女性音色 (The speaker is smiling, the speaker's tone is brisk and cheerful, the mood is joyful, and the voice is young and female) |
青年男性说话流利,语气果断且冷酷,表达出内心的惊讶 (The young male spoke fluently, with a decisive and cold tone, expressing his inner surprise) |
带有哭腔的说,有质疑慌张的语气,情感上则是惧怕的,年轻女性的音色 (Speaking tearfully, with a questioning and panicked tone, emotionally frightened, with the timbre of a young woman) |
语速缓慢,字字清晰,语气忐忑不安,表现出难过的样子,青年男性 (Slow speech, every word clear, an uneasy tone, appearing sad, young male) |
年轻女性高声地说,这一句的语气是否定激动的,表达了气愤的情感 (A young woman speaks loudly; the tone of this sentence is negative and agitated, expressing anger) |
声音低沉,一字一顿,且具有权威性的成熟男生 (A mature man with a low voice, pausing after each word, and an authoritative manner) |
成熟男性,声音柔和有磁性,以专业权威的风格说 (Mature male, soft and magnetic voice, speaking in a professional and authoritative style) |
3: Ablation Study
Source | Reference | Proposed | Without Prosody | Variants with PPG | |
4: Comparison of Artificial Prompts (Following ICASSP Reviewers' Suggestions)
Source | Prompts | Converted | |
成熟男子缓缓说道,语气十分凝重 (A mature man spoke slowly, with a very dignified tone) |
年轻女子语速很快,语气很随意 (A young woman spoke quickly, with a very informal tone) |
成熟男子缓缓说道,语气十分凝重 (A mature man spoke slowly, with a very dignified tone) |
成熟男子语速很快,语气十分凝重 (A mature man spoke quickly, with a very dignified tone) |