Festival unit selection voice
Festival offers a general framework for building speech synthesis systems and includes examples of various modules. Multisyn is an open-source toolkit for building unit selection voices from any speech corpus. This post gives detailed instructions on how to use Multisyn to build a unit selection model and Festival for final waveform synthesis.
To build a new voice with Festival Multisyn, follow the step-by-step procedure given below:
1. Install tools
You might be familiar with most of these tools, but there are some differences in the way we set them up.
- A version of speech tools with Python wrappers has to be installed in order to work with Multisyn.
- The latest version of Festival has to be installed in order to use hybrid unit selection.
Therefore, we recommend installing a fresh copy of these tools following the scripts provided in Merlin.
To install speech tools, Festival and Multisyn:
To install HTK:
Make sure you install all these tools without any errors and check environment variables before proceeding further.
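As a minimal sketch of the environment check, the helper below reports any variable that is unset or does not point to a directory. The function name `check_env` is hypothetical; `MULTISYN_BUILD` and `FESTDIR` are the two variables used by the commands later in this post, and your installation may rely on others as well.

```shell
# Hypothetical helper: verify that each named variable is set and points to a directory.
check_env() {
    status=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "missing: $name=$value is not a directory" >&2
            status=1
        fi
    done
    return $status
}

# Usage: check_env MULTISYN_BUILD FESTDIR || echo "fix the variables above first"
```

A non-zero return value means at least one variable needs attention before you continue.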
2. Prepare data
At this point, make sure you have data ready:
- a directory containing audio files with the file extension .wav
- a text file with transcriptions in the typical Festival format.
For demonstration purposes, we use the AWB corpus from the CMU Arctic database.
Let’s create a working directory and download the AWB corpus:
mkdir multisyn_voice
wget http://festvox.org/cmu_arctic/cmu_arctic/packed/cmu_us_awb_arctic-0.95-release.zip
unzip -q cmu_us_awb_arctic-0.95-release.zip
Let’s set up a directory for model building:
mkdir cstr_edi_awb_multisyn
cd cstr_edi_awb_multisyn
source $MULTISYN_BUILD/multisyn_build.sh
$MULTISYN_BUILD/bin/setup
Let’s copy the audio files and text file:
cp ../cmu_us_awb_arctic/wav/* wav/
cp ../cmu_us_awb_arctic/etc/txt.done.data utts.data
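Before going further, it can help to confirm that every utterance in the transcription file has a matching audio file. The sketch below assumes each line of utts.data has the form `( utterance_id "text" )` and that the audio is stored as `<utterance_id>.wav`; the function name `missing_wavs` is hypothetical.

```shell
# Hypothetical helper: print the IDs of utterances whose .wav file is missing.
missing_wavs() {
    utts=$1
    wavdir=$2
    # Extract the first token after the opening parenthesis on each line.
    sed -n 's/^[[:space:]]*([[:space:]]*\([^[:space:]"]*\).*/\1/p' "$utts" |
    while read -r id; do
        [ -n "$id" ] || continue
        [ -f "$wavdir/$id.wav" ] || echo "$id"
    done
}

# Usage: missing_wavs utts.data wav
```

An empty result means the transcriptions and audio files line up.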
3. Prepare initial labels
At this point, we have to choose a lexicon and a phoneset. The available options are:
- CMU lexicon: freely available
- Unisyn lexicon: freely available after signing a license
- Combilex: available under a commercial license
Based on the lexicon you have chosen, copy the matching phone_list and phone_substitutions files from the resources directory in MULTISYN_BUILD. The commands below use unilex-rpx:
cp $MULTISYN_BUILD/resources/phone_list.unilex-rpx alignment/phone_list
cp $MULTISYN_BUILD/resources/phone_substitutions.unilex-rpx alignment/phone_substitutions
Create the postlex_rules and my_lexicon.scm files before preparing the initial labels.
echo "postlex_apos_s_check postlex_the_vs_thee postlex_intervoc_r postlex_a" > postlex_rules
touch my_lexicon.scm
$MULTISYN_BUILD/bin/make_initial_phone_labs utts.data utts.mlf unilex-rpx postlex_rules my_lexicon.scm
4. Add noise
$MULTISYN_BUILD/bin/add_noise wav utts.data
5. Prepare MFCC
$MULTISYN_BUILD/bin/make_mfccs alignment wav utts.data
6. Do alignment
cd alignment
$MULTISYN_BUILD/bin/make_mfcc_list ../utts.data train.scp ../mfcc
ln -s ../utts.mlf aligned.0.mlf
$MULTISYN_BUILD/bin/do_alignment .
cd ..
$MULTISYN_BUILD/bin/break_mlf alignment/aligned.4.mlf lab
7. Extract pitchmarks
Based on the speaker gender (male/female), choose the option -m or -f while extracting pitchmarks. The AWB speaker is male, so we use -m here.
$MULTISYN_BUILD/bin/make_pm_wave -m pm wav utts.data
$MULTISYN_BUILD/bin/make_pm_fix pm utts.data
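If you build voices for several speakers, a small helper can keep the gender option in one place. This is only a sketch: the function name `pm_gender_flag` is hypothetical, and it assumes that -m selects male settings and -f selects female settings, as described above.

```shell
# Hypothetical helper: map a speaker gender to the pitchmark option.
# Assumes -m means male and -f means female.
pm_gender_flag() {
    case $1 in
        male|m)   printf '%s\n' "-m" ;;
        female|f) printf '%s\n' "-f" ;;
        *)        echo "unknown gender: $1" >&2; return 1 ;;
    esac
}

# Usage: $MULTISYN_BUILD/bin/make_pm_wave "$(pm_gender_flag male)" pm wav utts.data
```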
8. Find power factors
Ideally all of the above labelling steps should probably be done with normalised waveforms. However as correct labelling is needed to normalise them, that is not possible.
If you want to normalise your waveforms then do this:
$MULTISYN_BUILD/bin/find_powerfactors lab utts.data
$MULTISYN_BUILD/bin/make_wav_powernorm wav_fn wav utts.data
Then repeat steps 4-7 with the normalised audio files.
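One way to repeat those steps is to regenerate the same commands with the normalised directory substituted for wav. The sketch below only prints the commands so you can review them before running anything. The function name `print_steps_4_to_7` is hypothetical, the commands are those of steps 4-7 with the directory name swapped, and `ln -sf` is used instead of `ln -s` on the assumption that the link from the first pass already exists. Change -m to -f for a female speaker.

```shell
# Hypothetical helper: print the step 4-7 commands for a given audio directory.
# Nothing is executed; redirect the output to a file and review it first.
print_steps_4_to_7() {
    wavdir=${1:-wav_fn}
    cat <<EOF
\$MULTISYN_BUILD/bin/add_noise $wavdir utts.data
\$MULTISYN_BUILD/bin/make_mfccs alignment $wavdir utts.data
cd alignment
\$MULTISYN_BUILD/bin/make_mfcc_list ../utts.data train.scp ../mfcc
ln -sf ../utts.mlf aligned.0.mlf
\$MULTISYN_BUILD/bin/do_alignment .
cd ..
\$MULTISYN_BUILD/bin/break_mlf alignment/aligned.4.mlf lab
\$MULTISYN_BUILD/bin/make_pm_wave -m pm $wavdir utts.data
\$MULTISYN_BUILD/bin/make_pm_fix pm utts.data
EOF
}

# Usage: print_steps_4_to_7 wav_fn > redo_steps.sh
```

Whether the alignment directory needs to be cleaned before the second pass is not covered here, so check the generated script against your setup.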
9. Mark bad energy phones
$MULTISYN_BUILD/bin/make_frame_ene utts.data
$MULTISYN_BUILD/bin/Get_lr_ene utts.data
$MULTISYN_BUILD/bin/Flag_bad_energy utts.data
10. Calculate duration
$MULTISYN_BUILD/bin/phone_lengths dur lab utts.data
11. Build utts
$MULTISYN_BUILD/bin/build_utts utts.data unilex-rpx postlex_rules
12. Final alignment
cd alignment
$MULTISYN_BUILD/bin/do_final_alignment ../utts.data unilex-rpx ../postlex_rules n
cd ..
13. Compute F0
$MULTISYN_BUILD/bin/make_f0 -f wav_fn utts.data
14. Prepare coefs
$MULTISYN_BUILD/bin/make_norm_join_cost_coefs coef f0 mfcc utts.data
$MULTISYN_BUILD/bin/strip_join_cost_coefs coef coef_stripped utt utts.data
15. Prepare lpc
$MULTISYN_BUILD/bin/make_lpc wav utts.data   # uses the original wav directory, as the LPC extraction code does its own normalisation
Build unit-selection model
Setup voice directory in Festival:
mkdir $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn
mkdir $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb
mkdir $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/festvox
Copy required files into the voice directory:
cp -r wav_fn $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/wav
cp -r coef $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/coef
cp -r f0 $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/f0
cp -r pm $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/pm
cp -r utt $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/utt
cp -r utts.data $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/utts.data
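The same copy can be written as a small function, which makes it easier to reuse for another voice. This is a sketch under the directory layout used above; the function name `install_voice_files` is hypothetical. As with the plain cp commands, the destination should not already contain these directories, otherwise cp -r nests the copies inside them.

```shell
# Hypothetical helper: copy the built voice data into a Festival voice directory.
install_voice_files() {
    src=$1    # build directory, e.g. cstr_edi_awb_multisyn
    dest=$2   # e.g. $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb
    mkdir -p "$dest" || return 1
    # The normalised audio in wav_fn is installed under the name wav.
    cp -r "$src/wav_fn" "$dest/wav" || return 1
    # Everything else keeps its own name.
    for item in coef f0 pm utt utts.data; do
        cp -r "$src/$item" "$dest/$item" || return 1
    done
}

# Usage: install_voice_files . "$FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb"
```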
Copy pauses from multisyn_build/resources/pauses into the respective directories, e.g.,
cp -r awb_pauses.data $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/
Synthesis with Festival
Make Festival speak “Hello world!” with the new voice:
festival> (voice_cstr_edi_awb_multisyn)
festival> (SayText "Hello world!")
festival> (utt.save.wave (utt.synth (Utterance Text "Hello world!" )) "hello_world.wav")
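For non-interactive use, the same Scheme commands can be written to a script file and passed to Festival in batch mode. The sketch below only generates the script; the function name `make_synth_script` is hypothetical, and the voice name is the one built in this post.

```shell
# Hypothetical helper: write a Festival script that synthesises TEXT into a wave file.
make_synth_script() {
    text=$1
    out=$2
    # Escape backslashes and double quotes so the text is a valid Scheme string.
    esc=$(printf '%s' "$text" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '(voice_cstr_edi_awb_multisyn)\n'
    printf '(utt.save.wave (utt.synth (Utterance Text "%s")) "%s")\n' "$esc" "$out"
}

# Usage:
#   make_synth_script "Hello world!" hello_world.wav > synth.scm
#   festival -b synth.scm
```

Keeping the script in a file also makes it easy to inspect exactly what was sent to Festival.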