Festival unit selection voice
Festival offers a general framework for building speech synthesis systems and includes examples of various modules. Multisyn is an open-source toolkit for building unit selection voices from any speech corpus. This post gives detailed instructions on how to use Multisyn to build a unit selection model and Festival for final waveform synthesis.
Tools required
To build a new voice with Festival Multisyn, follow the step-by-step procedure given below:
Step-by-step procedure
1. Install tools
You might be familiar with most of these tools, but there are some differences in the way we set them up.
- A version of speech tools with Python wrappers has to be installed in order to work with Multisyn.
- The latest version of Festival has to be installed in order to use hybrid unit selection.
Therefore, we recommend installing a fresh copy of these tools following the scripts provided in Merlin.
To install speech tools, Festival and Multisyn:
bash compile_unit_selection_tools.sh
To install HTK:
bash compile_htk.sh
Make sure you install all these tools without any errors and check environment variables before proceeding further.
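A quick way to check the environment variables is sketched below. It only assumes the two variables that appear later in this post, MULTISYN_BUILD and FESTDIR; the function name check_env is ours, and your installation may rely on further variables.

```shell
# check_env NAME...
# Reports whether each named variable is set and points at a directory.
# Succeeds only if every variable passes.
check_env() (
    status=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "MISSING: $name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "BAD: $name=$value is not a directory"
            status=1
        else
            echo "OK: $name=$value"
        fi
    done
    [ "$status" -eq 0 ]
)

# Usage:
#   check_env MULTISYN_BUILD FESTDIR
```

The function runs in a subshell, so it does not leave helper variables behind in your session.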
2. Setup
At this point, make sure you have your data ready:
- a directory containing audio files with the file extension .wav
- a text file with transcriptions in the typical festival format.
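Before going further, it can help to confirm that the transcription file and the audio directory agree. The sketch below assumes the usual Festival layout of one `( utt_id "text" )` entry per line and one `utt_id.wav` file per entry; the function name check_utts is ours and is not part of Multisyn.

```shell
# check_utts WAV_DIR UTTS_FILE
# Prints one line per problem found; succeeds only if there are none.
check_utts() (
    wav_dir=$1
    utts=$2
    bad=0
    lineno=0
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        # Each entry is expected to look like: ( utt_id "text" )
        if ! printf '%s\n' "$line" | grep -Eq '^\( *[^ "()]+ +".*" *\)[[:space:]]*$'; then
            echo "line $lineno: not in ( utt_id \"text\" ) form"
            bad=1
            continue
        fi
        # Pull out the utterance id and look for the matching audio file.
        id=$(printf '%s\n' "$line" | sed -E 's/^\( *([^ "()]+) .*/\1/')
        if [ ! -f "$wav_dir/$id.wav" ]; then
            echo "line $lineno: missing $wav_dir/$id.wav"
            bad=1
        fi
    done < "$utts"
    [ "$bad" -eq 0 ]
)

# Usage (once the files are in place):
#   check_utts wav utts.data
```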
For demonstration purposes, we use the AWB corpus from the CMU Arctic database.
Let’s create a working directory and download the AWB corpus:
mkdir multisyn_voice
cd multisyn_voice
wget http://festvox.org/cmu_arctic/cmu_arctic/packed/cmu_us_awb_arctic-0.95-release.zip
unzip -q cmu_us_awb_arctic-0.95-release.zip
Let’s set up a directory for model building:
mkdir cstr_edi_awb_multisyn
cd cstr_edi_awb_multisyn
source $MULTISYN_BUILD/multisyn_build.sh
$MULTISYN_BUILD/bin/setup
Let’s copy the audio files and text file:
cp ../cmu_us_awb_arctic/wav/* wav/
cp ../cmu_us_awb_arctic/etc/txt.done.data utts.data
3. Prepare initial labels
$MULTISYN_BUILD/bin/setup_alignment
At this point, we have to choose a lexicon and a phoneset. The available options are:
- CMU lexicon: can be freely obtained from here
- Unisyn lexicon: can be freely obtained by signing a license from here
- Combilex: has a commercial license and can be obtained from here
Based on the lexicon you have chosen, copy the files phone_list and phone_substitutions from the resources directory in MULTISYN_BUILD.
cp $MULTISYN_BUILD/resources/phone_list.unilex-rpx alignment/phone_list
cp $MULTISYN_BUILD/resources/phone_substitutions.unilex-rpx alignment/phone_substitutions
Create postlex rules and my lexicon files before preparing initial labels.
echo "postlex_apos_s_check postlex_the_vs_thee postlex_intervoc_r postlex_a" > postlex_rules
touch my_lexicon.scm
$MULTISYN_BUILD/bin/make_initial_phone_labs utts.data utts.mlf unilex-rpx postlex_rules my_lexicon.scm
4. Add noise
$MULTISYN_BUILD/bin/add_noise wav utts.data
5. Prepare MFCC
$MULTISYN_BUILD/bin/make_mfccs alignment wav utts.data
6. Force-alignment
cd alignment
$MULTISYN_BUILD/bin/make_mfcc_list ../utts.data train.scp ../mfcc
ln -s ../utts.mlf aligned.0.mlf
$MULTISYN_BUILD/bin/do_alignment .
cd ..
$MULTISYN_BUILD/bin/break_mlf alignment/aligned.4.mlf lab
7. Extract pitchmarks
Based on the speaker's gender (male/female), choose the option -m or -f when extracting pitchmarks.
$MULTISYN_BUILD/bin/make_pm_wave -m pm wav utts.data
$MULTISYN_BUILD/bin/make_pm_fix pm utts.data
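If you script the build for several speakers, the flag choice can be kept in one place. This is a small sketch: gender_flag and the SPEAKER_GENDER variable are hypothetical names introduced here, not part of Multisyn.

```shell
# gender_flag GENDER
# Maps a speaker gender to the pitchmarking option used above.
gender_flag() {
    case "$1" in
        male|m)   printf '%s\n' "-m" ;;
        female|f) printf '%s\n' "-f" ;;
        *)        echo "unknown gender: $1" >&2; return 1 ;;
    esac
}

# Usage (SPEAKER_GENDER is a hypothetical variable you would set yourself):
#   flag=$(gender_flag "$SPEAKER_GENDER") &&
#       "$MULTISYN_BUILD/bin/make_pm_wave" "$flag" pm wav utts.data
```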
8. Find power factors
Ideally, all of the labelling steps above would be run on normalised waveforms. However, normalisation itself requires correct labels, so the first pass has to use the unnormalised audio.
If you want to normalise your waveforms then do this:
$MULTISYN_BUILD/bin/find_powerfactors lab utts.data
$MULTISYN_BUILD/bin/make_wav_powernorm wav_fn wav utts.data
Then repeat steps 4-7 using the normalised audio files in wav_fn.
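The repeated pass can be written out once rather than retyped. The sketch below prints the commands from steps 4-7 with the audio directory swapped in, so you can review them before running anything. It assumes each script accepts wav_fn wherever it accepted wav, and it uses ln -sf instead of ln -s because the aligned.0.mlf link already exists from the first pass. The function name is ours.

```shell
# print_steps_4_to_7 WAV_DIR GENDER_FLAG
# Prints the commands of steps 4-7 with WAV_DIR in place of wav.
# Assumes MULTISYN_BUILD is set and contains no spaces.
print_steps_4_to_7() (
    wav_dir=$1
    flag=$2
    bin="$MULTISYN_BUILD/bin"
    cat <<EOF
$bin/add_noise $wav_dir utts.data
$bin/make_mfccs alignment $wav_dir utts.data
cd alignment
$bin/make_mfcc_list ../utts.data train.scp ../mfcc
ln -sf ../utts.mlf aligned.0.mlf
$bin/do_alignment .
cd ..
$bin/break_mlf alignment/aligned.4.mlf lab
$bin/make_pm_wave $flag pm $wav_dir utts.data
$bin/make_pm_fix pm utts.data
EOF
)

# Usage:
#   print_steps_4_to_7 wav_fn -m            # review the commands
#   print_steps_4_to_7 wav_fn -m | sh -e    # run them, stopping at the first error
```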
9. Mark bad energy phones
$MULTISYN_BUILD/bin/make_frame_ene utts.data
$MULTISYN_BUILD/bin/Get_lr_ene utts.data
$MULTISYN_BUILD/bin/Flag_bad_energy utts.data
10. Calculate duration
$MULTISYN_BUILD/bin/phone_lengths dur lab utts.data
11. Build utts
$MULTISYN_BUILD/bin/build_utts utts.data unilex-rpx postlex_rules
12. Final alignment
cd alignment
$MULTISYN_BUILD/bin/do_final_alignment ../utts.data unilex-rpx ../postlex_rules n
cd ..
13. Compute F0
$MULTISYN_BUILD/bin/make_f0 -m wav_fn utts.data
14. Prepare coefs
$MULTISYN_BUILD/bin/make_norm_join_cost_coefs coef f0 mfcc utts.data
$MULTISYN_BUILD/bin/strip_join_cost_coefs coef coef_stripped utt utts.data
15. Prepare lpc
$MULTISYN_BUILD/bin/make_lpc wav utts.data ## as LPC extraction code does internal normalization
Build unit-selection model
Setup voice directory in Festival:
mkdir -p $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn
mkdir $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb
mkdir $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/festvox
Copy required files into the voice directory:
cp -r wav_fn $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/wav
cp -r coef $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/coef
cp -r f0 $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/f0
cp -r pm $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/pm
cp -r utt $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/utt
cp -r utts.data $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/utts.data
Copy the pauses file from multisyn_build/resources/pauses into the respective directory, e.g.,
cp -r awb_pauses.data $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/
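Once the files are copied, a short check can confirm that nothing was missed. This sketch only looks for the directories and the utts.data file named in the copy commands above; the function name check_voice_dir is ours, and Festival may need further files that this check does not cover.

```shell
# check_voice_dir VOICE_DIR SPEAKER
# Verifies the layout created by the mkdir and cp commands above.
check_voice_dir() (
    voice_dir=$1
    speaker=$2
    missing=0
    for d in wav coef f0 pm utt; do
        if [ ! -d "$voice_dir/$speaker/$d" ]; then
            echo "missing directory: $speaker/$d"
            missing=1
        fi
    done
    if [ ! -f "$voice_dir/$speaker/utts.data" ]; then
        echo "missing file: $speaker/utts.data"
        missing=1
    fi
    if [ ! -d "$voice_dir/festvox" ]; then
        echo "missing directory: festvox"
        missing=1
    fi
    [ "$missing" -eq 0 ]
)

# Usage:
#   check_voice_dir "$FESTDIR/lib/voices/english/cstr_edi_awb_multisyn" awb
```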
Synthesis with Festival
$FESTDIR/bin/festival
Make Festival speak “Hello world!” with the new voice:
festival> (voice_cstr_edi_awb_multisyn)
festival> (SayText "Hello world!")
festival> (utt.save.wave (utt.synth (Utterance Text "Hello world!" )) "hello_world.wav")
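To synthesise several sentences without typing them at the prompt, the same two commands can be written to a Scheme file and run in batch mode. The sketch below reads one sentence per line from standard input. The function name, the sentences.txt file, and the utt_NNN.wav naming are illustrative choices, not something Festival prescribes.

```shell
# make_synth_script VOICE_FUNCTION OUT_DIR < sentences
# Prints a Festival script that selects the voice and saves one wave file
# per non-empty input line.
make_synth_script() (
    voice=$1
    out_dir=$2
    echo "($voice)"
    n=0
    while IFS= read -r text || [ -n "$text" ]; do
        [ -n "$text" ] || continue
        n=$((n + 1))
        # Escape backslashes and double quotes so the text is a valid Scheme string.
        escaped=$(printf '%s' "$text" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
        printf '(utt.save.wave (utt.synth (Utterance Text "%s")) "%s/utt_%03d.wav")\n' \
            "$escaped" "$out_dir" "$n"
    done
)

# Usage (sentences.txt is a hypothetical file with one sentence per line):
#   mkdir -p out
#   make_synth_script voice_cstr_edi_awb_multisyn out < sentences.txt > synth.scm
#   $FESTDIR/bin/festival -b synth.scm
```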