Festival unit selection voice

Festival offers a general framework for building speech synthesis systems, and includes examples of various modules. Multisyn is an open-source toolkit for building unit selection voices from any speech corpus. This post gives detailed instructions on how to use Multisyn to build a unit selection model and Festival for final waveform synthesis.

Tools required

To build a new voice with Festival Multisyn, follow the step-by-step procedure given below:

Step-by-step procedure

1. Install tools

You might be familiar with most of these tools, but there are some differences in the way we set them up.

A version of speech tools with Python wrappers has to be installed in order to work with Multisyn.
The latest version of Festival has to be installed in order to use hybrid unit selection.

Therefore, we recommend installing a fresh copy of these tools following the scripts provided in Merlin.

To install speech tools, Festival and Multisyn:

bash compile_unit_selection_tools.sh

To install HTK:

bash compile_htk.sh

Make sure you install all these tools without any errors and check environment variables before proceeding further.
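
A quick way to check the environment variables is sketched below. It only looks at MULTISYN_BUILD and FESTDIR, the two variables used later in this post; check_env_vars is a helper name made up for this example, and your installation may rely on other variables as well.

```shell
# Report which of the named environment variables are unset or empty.
# Returns 0 when all of them are set, 1 otherwise.
check_env_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup via eval; the names are trusted identifiers here.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name"
            missing=1
        fi
    done
    return "$missing"
}

# The two variables this post uses in later steps.
check_env_vars MULTISYN_BUILD FESTDIR || echo "set the variables listed above before continuing"
```

If nothing is printed, both variables are set in the current shell.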

2. Setup

At this point, make sure you have data ready:

a directory containing audio files with file extension .wav
a text file with transcriptions in the typical Festival format.
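
In the typical Festival format, each line of the transcription file holds an utterance ID and the quoted text inside parentheses. The sketch below flags lines that do not follow that shape. check_utts_format is a hypothetical helper, and the pattern assumes IDs made of letters, digits, underscores and hyphens.

```shell
# Print the lines of a prompt file that do not look like: ( utt_id "text" )
# Returns 0 when every non-blank line matches, 1 otherwise.
check_utts_format() {
    awk '
        /^[ \t]*$/ { next }    # ignore blank lines
        !/^[ \t]*\([ \t]*[A-Za-z0-9_-]+[ \t]+".*"[ \t]*\)[ \t]*$/ {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Demo on a throwaway file with one well-formed and one malformed line.
demo=$(mktemp)
cat > "$demo" <<'EOF'
( arctic_a0001 "Author of the danger trail, Philip Steele, etc." )
arctic_a0002 missing parentheses and quotes
EOF
check_utts_format "$demo" || echo "fix the lines listed above"
rm -f "$demo"
```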

For demonstration purposes, we use the AWB corpus from the CMU Arctic database.

Let’s create a working directory and download the AWB corpus:

mkdir multisyn_voice
cd multisyn_voice
wget http://festvox.org/cmu_arctic/cmu_arctic/packed/cmu_us_awb_arctic-0.95-release.zip
unzip -q cmu_us_awb_arctic-0.95-release.zip

Let’s set up a directory for model building:

mkdir cstr_edi_awb_multisyn
cd cstr_edi_awb_multisyn
source $MULTISYN_BUILD/multisyn_build.sh
$MULTISYN_BUILD/bin/setup

Let’s copy the audio files and text file:

cp ../cmu_us_awb_arctic/wav/* wav/
cp ../cmu_us_awb_arctic/etc/txt.done.data utts.data
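
Before moving on, it may be worth confirming that every utterance listed in utts.data has a matching audio file. The sketch below assumes files are named after the utterance ID (for example arctic_a0001.wav); missing_files is a helper name invented for this example.

```shell
# List utterance IDs from a prompt file that have no matching file in a directory.
# usage: missing_files <utts.data> <dir> <extension>
missing_files() {
    # The utterance ID is the first token after the opening parenthesis.
    sed -n 's/^[[:space:]]*([[:space:]]*\([^[:space:]]*\).*/\1/p' "$1" |
    while read -r id; do
        [ -f "$2/$id.$3" ] || echo "$id"
    done
}

# Demo in a throwaway directory: two prompts, only one recording.
demo=$(mktemp -d)
mkdir "$demo/wav"
printf '%s\n' '( arctic_a0001 "first" )' '( arctic_a0002 "second" )' > "$demo/utts.data"
: > "$demo/wav/arctic_a0001.wav"
missing_files "$demo/utts.data" "$demo/wav" wav
rm -rf "$demo"
```

In the working directory, the call would be missing_files utts.data wav wav. Empty output means every prompt has a recording.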

3. Prepare initial labels

$MULTISYN_BUILD/bin/setup_alignment

At this point, we have to choose a lexicon and a phoneset. The available options are:

CMU lexicon: can be freely obtained from here
Unisyn lexicon: can be freely obtained by signing a license from here
Combilex: has a commercial license and can be obtained from here

Based on the lexicon you have chosen, copy the files phone_list and phone_substitutions from the resources directory in $MULTISYN_BUILD. For the Unisyn lexicon (unilex-rpx):

cp $MULTISYN_BUILD/resources/phone_list.unilex-rpx alignment/phone_list
cp $MULTISYN_BUILD/resources/phone_substitutions.unilex-rpx alignment/phone_substitutions

Create the postlex_rules and my_lexicon.scm files before preparing the initial labels.

echo "postlex_apos_s_check postlex_the_vs_thee postlex_intervoc_r postlex_a" > postlex_rules
touch my_lexicon.scm
$MULTISYN_BUILD/bin/make_initial_phone_labs utts.data utts.mlf unilex-rpx postlex_rules my_lexicon.scm

4. Add noise

$MULTISYN_BUILD/bin/add_noise wav utts.data

5. Prepare MFCC

$MULTISYN_BUILD/bin/make_mfccs alignment wav utts.data

6. Force-alignment

cd alignment
$MULTISYN_BUILD/bin/make_mfcc_list  ../utts.data train.scp ../mfcc
ln -s ../utts.mlf aligned.0.mlf
$MULTISYN_BUILD/bin/do_alignment .
cd ..
$MULTISYN_BUILD/bin/break_mlf alignment/aligned.4.mlf lab

7. Extract pitchmarks

Based on the speaker's gender (male/female), choose the option -m or -f when extracting pitchmarks.

$MULTISYN_BUILD/bin/make_pm_wave -m pm wav utts.data
$MULTISYN_BUILD/bin/make_pm_fix pm utts.data
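
If you build voices for several speakers, the flag can be derived from a variable rather than edited by hand. This is only a small sketch: gender_flag and GENDER are names made up here, and the command is printed rather than executed.

```shell
# Map a speaker gender to the -m / -f option used when extracting pitchmarks.
gender_flag() {
    case "$1" in
        male|m)   printf '%s\n' "-m" ;;
        female|f) printf '%s\n' "-f" ;;
        *) echo "unknown gender: $1" >&2; return 1 ;;
    esac
}

GENDER=male   # AWB is a male speaker
flag=$(gender_flag "$GENDER") || exit 1
# Print the resulting command instead of running it.
echo "\$MULTISYN_BUILD/bin/make_pm_wave $flag pm wav utts.data"
```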

8. Find power factors

Ideally, all of the labelling steps above would be done with normalised waveforms. However, correct labels are needed to normalise the waveforms, so that is not possible on the first pass.

If you want to normalise your waveforms then do this:

$MULTISYN_BUILD/bin/find_powerfactors lab utts.data
$MULTISYN_BUILD/bin/make_wav_powernorm wav_fn wav utts.data

Then repeat steps 4-7 with the normalised audio files in wav_fn.
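
One literal reading of this instruction is to rerun the commands from steps 4-7 with wav replaced by wav_fn. The sketch below prints those commands by default and only executes them when DRY_RUN=0. The helpers run and relabel_with are invented for this example, and ln -sf is used on the assumption that the aligned.0.mlf link already exists from the first pass.

```shell
# Print a command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 4-7 with the waveform directory and gender flag as parameters.
# usage: relabel_with <wavdir> <-m|-f>
relabel_with() {
    wavdir=$1
    gender=$2
    bin="${MULTISYN_BUILD:-\$MULTISYN_BUILD}/bin"
    run "$bin/add_noise" "$wavdir" utts.data &&                  # step 4
    run "$bin/make_mfccs" alignment "$wavdir" utts.data &&       # step 5
    run cd alignment &&                                          # step 6
    run "$bin/make_mfcc_list" ../utts.data train.scp ../mfcc &&
    run ln -sf ../utts.mlf aligned.0.mlf &&
    run "$bin/do_alignment" . &&
    run cd .. &&
    run "$bin/break_mlf" alignment/aligned.4.mlf lab &&
    run "$bin/make_pm_wave" "$gender" pm "$wavdir" utts.data &&  # step 7
    run "$bin/make_pm_fix" pm utts.data
}

# Show what a second pass over the normalised audio would run.
relabel_with wav_fn -m
```

Review the printed commands first, then rerun with DRY_RUN=0 from the voice working directory if they match your setup.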

9. Mark bad energy phones

$MULTISYN_BUILD/bin/make_frame_ene utts.data
$MULTISYN_BUILD/bin/Get_lr_ene utts.data
$MULTISYN_BUILD/bin/Flag_bad_energy utts.data

10. Calculate duration

$MULTISYN_BUILD/bin/phone_lengths dur lab utts.data

11. Build utts

$MULTISYN_BUILD/bin/build_utts utts.data unilex-rpx postlex_rules

12. Final alignment

cd alignment
$MULTISYN_BUILD/bin/do_final_alignment ../utts.data unilex-rpx ../postlex_rules n
cd ..

13. Compute F0

$MULTISYN_BUILD/bin/make_f0 -m wav_fn utts.data

14. Prepare coefs

$MULTISYN_BUILD/bin/make_norm_join_cost_coefs coef f0 mfcc utts.data
$MULTISYN_BUILD/bin/strip_join_cost_coefs coef coef_stripped utt utts.data

15. Prepare lpc

$MULTISYN_BUILD/bin/make_lpc wav utts.data  ## as LPC extraction code does internal normalization

Build unit-selection model

Setup voice directory in Festival:

mkdir $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn
mkdir $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb
mkdir $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/festvox

Copy required files into the voice directory:

cp -r wav_fn $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/wav
cp -r coef $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/coef 
cp -r f0 $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/f0
cp -r pm $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/pm
cp -r utt $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/utt
cp -r utts.data $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/utts.data
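
The same copy can be wrapped in a small function, which is handy when the build is repeated. This is a sketch rather than part of Multisyn: install_voice_files is a made-up name, and it refuses to overwrite an existing destination because cp -r into an existing directory would nest the copy inside it.

```shell
# Copy the build outputs listed above into <voice_root>/awb.
# usage: install_voice_files <build_dir> <voice_root>
install_voice_files() {
    build_dir=$1
    voice_root=$2
    mkdir -p "$voice_root/awb" "$voice_root/festvox" || return 1
    # Each pair is <directory in the build tree>:<name inside awb>.
    for pair in wav_fn:wav coef:coef f0:f0 pm:pm utt:utt; do
        src=$build_dir/${pair%%:*}
        dst=$voice_root/awb/${pair##*:}
        [ -d "$src" ] || { echo "missing directory: $src" >&2; return 1; }
        # Refuse to copy into an existing destination (cp -r would nest it).
        [ ! -e "$dst" ] || { echo "already exists: $dst" >&2; return 1; }
        cp -r "$src" "$dst" || return 1
    done
    cp "$build_dir/utts.data" "$voice_root/awb/utts.data"
}
```

From the working directory, the call would be install_voice_files . $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn.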

Copy the pause data from multisyn_build/resources/pauses into the corresponding voice directory, e.g.:

cp -r awb_pauses.data $FESTDIR/lib/voices/english/cstr_edi_awb_multisyn/awb/

Synthesis with Festival

$FESTDIR/bin/festival

Make Festival speak “Hello world!” with the new voice:

festival> (voice_cstr_edi_awb_multisyn)
festival> (SayText "Hello world!")
festival> (utt.save.wave (utt.synth (Utterance Text "Hello world!" )) "hello_world.wav")
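
To synthesise many sentences without typing at the prompt, the same Scheme calls can be written to a file and run in Festival's batch mode. The sketch below generates such a file from plain text, one sentence per line. make_batch_script and the utt_NNNN.wav naming are choices made for this example; lines containing double quotes or backslashes are skipped rather than escaped.

```shell
# Read sentences from stdin and print a Festival script that saves each one
# as <outdir>/utt_NNNN.wav using the voice built above.
# usage: make_batch_script <outdir> < sentences.txt > synth.scm
make_batch_script() {
    outdir=$1
    echo "(voice_cstr_edi_awb_multisyn)"
    n=0
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        case $line in
            *\"*|*\\*)
                # Skipped so the generated Scheme string stays well-formed.
                echo "skipping line with quote or backslash: $line" >&2
                continue ;;
        esac
        n=$((n + 1))
        printf '(utt.save.wave (utt.synth (Utterance Text "%s")) "%s/utt_%04d.wav")\n' \
            "$line" "$outdir" "$n"
    done
}

# Demo: print the script for a single sentence.
printf '%s\n' 'Hello world!' | make_batch_script wavs
```

Save the output to a file such as synth.scm, create the output directory, and run it with $FESTDIR/bin/festival -b synth.scm.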

Written by Srikanth Ronanki on August 21, 2016