Commit f52c66d: init v0.2.0
1 parent: 2d97e71

Some file contents are hidden by default for this large commit.

63 files changed, +118052 -114918 lines

CITATION.cff

+2-2
@@ -5,7 +5,7 @@ authors:
   given-names: "Keon"
   orcid: "https://orcid.org/0000-0001-9028-1018"
 title: "Comprehensive-Transformer-TTS"
-version: 0.1.1
+version: 0.2.0
 doi: 10.5281/zenodo.5526991
-date-released: 2021-09-25
+date-released: 2022-02-18
 url: "https://github.com/keonlee9420/Comprehensive-Transformer-TTS"

README.md

+52-8
@@ -9,6 +9,10 @@
 - [x] [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) (Kitaev et al., 2020)
 - [x] [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017)
 
+### Prosody Modelings (WIP)
+- [x] [DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021](https://arxiv.org/abs/2110.12612) (Liu et al., 2021)
+- [x] [Rich Prosody Diversity Modelling with Phone-level Mixture Density Network](https://arxiv.org/abs/2102.00851) (Du et al., 2021)
+
 ### Supervised Duration Modelings
 - [x] [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558) (Ren et al., 2020)

@@ -28,18 +32,26 @@
 |Conformer|18903MiB / 24220MiB|7m 4s
 |Reformer|10293MiB / 24220MiB|10m 16s
 |Transformer|7909MiB / 24220MiB|4m 51s
+|Transformer_fs2|11571MiB / 24220MiB|4m 53s
 
 Toggle the type of building blocks by
 ```yaml
 # In the model.yaml
-block_type: "transformer" # ["transformer", "fastformer", "lstransformer", "conformer", "reformer"]
+block_type: "transformer_fs2" # ["transformer_fs2", "transformer", "fastformer", "lstransformer", "conformer", "reformer"]
+```
+
+Toggle the type of prosody modelings by
+```yaml
+# In the model.yaml
+prosody_modeling:
+  model_type: "none" # ["none", "du2021", "liu2021"]
 ```
 
 Toggle the type of duration modelings by
 ```yaml
 # In the model.yaml
 duration_modeling:
-  learn_alignment: True # for unsupervised modeling, False for supervised modeling
+  learn_alignment: True # True for unsupervised modeling, and False for supervised modeling
 ```
 
 # Quickstart
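The three `model.yaml` toggles in the hunk above can also be read programmatically; a minimal sketch, assuming PyYAML and the `config/LJSpeech/model.yaml` updated later in this commit:

```python
# Minimal sketch (assumes PyYAML is installed): load the three toggles shown above
# from the LJSpeech model config.
import yaml

with open("config/LJSpeech/model.yaml") as f:
    cfg = yaml.safe_load(f)

block_type = cfg["block_type"]                                 # e.g. "transformer_fs2"
learn_alignment = cfg["duration_modeling"]["learn_alignment"]  # True -> unsupervised duration modeling
prosody_type = cfg["prosody_modeling"]["model_type"]           # "none", "du2021", or "liu2021"
print(block_type, learn_alignment, prosody_type)
```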
@@ -55,7 +67,7 @@ Also, `Dockerfile` is provided for `Docker` users.
 
 ## Inference
 
-You have to download the [pretrained models](https://drive.google.com/drive/folders/1xEOVbv3PLfGX8EgEkzg1014c9h8QMxQ-?usp=sharing) and put them in `output/ckpt/DATASET/`. The models are trained with unsupervised duration modeling under transformer building block.
+You have to download the [pretrained models](https://drive.google.com/drive/folders/1xEOVbv3PLfGX8EgEkzg1014c9h8QMxQ-?usp=sharing) and put them in `output/ckpt/DATASET/`. The models are trained with unsupervised duration modeling and the "transformer_fs2" building block.
 
 For a **single-speaker TTS**, run
 ```
@@ -109,7 +121,7 @@ Any of both **single-speaker TTS** dataset (e.g., [Blizzard Challenge 2013](http
 
 For the forced alignment, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
 Pre-extracted alignments for the datasets are provided [here](https://drive.google.com/drive/folders/1fizpyOiQ1lG2UDaMlXnT3Ll4_j6Xwg7K?usp=sharing).
-You have to unzip the files in `preprocessed_data/DATASET/TextGrid/`. Alternately, you can [run the aligner by yourself](https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html).
+You have to unzip the files in `preprocessed_data/DATASET/TextGrid/`. Alternatively, you can [run the aligner by yourself](https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/workflows/index.html).
 
 After that, run the preprocessing script by
 ```
@@ -136,15 +148,22 @@ tensorboard --logdir output/log
 to serve TensorBoard on your localhost.
 The loss curves, synthesized mel-spectrograms, and audios are shown.
 
-![](./img/tensorboard_loss.png)
-![](./img/tensorboard_spec.png)
-![](./img/tensorboard_audio.png)
+## LJSpeech
+
+![](./img/tensorboard_loss_ljs.png)
+![](./img/tensorboard_spec_ljs.png)
+![](./img/tensorboard_audio_ljs.png)
+
+## VCTK
+
+![](./img/tensorboard_loss_vctk.png)
+![](./img/tensorboard_spec_vctk.png)
+![](./img/tensorboard_audio_vctk.png)
 
 # Notes
 
 - Both phoneme-level and frame-level variance are supported in both supervised and unsupervised duration modeling.
 - Note that there are no pre-extracted phoneme-level variance features in unsupervised duration modeling.
-- Convolutional embedding is used as [StyleSpeech](https://github.com/keonlee9420/StyleSpeech) for phoneme-level variance in unsupervised duration modeling. Otherwise, bucket-based embedding is used as [FastSpeech2](https://github.com/ming024/FastSpeech2).
 - Unsupervised duration modeling in phoneme-level will take longer time than frame-level since the additional computation of phoneme-level variance is activated at runtime.
 - Two options for embedding for the **multi-speaker TTS** setting: training speaker embedder from scratch or using a pre-trained [philipperemy's DeepSpeaker](https://github.com/philipperemy/deep-speaker) model (as [STYLER](https://github.com/keonlee9420/STYLER) did). You can toggle it by setting the config (between `'none'` and `'DeepSpeaker'`).
 - DeepSpeaker on VCTK dataset shows clear identification among speakers. The following figure shows the T-SNE plot of extracted speaker embedding.
@@ -155,6 +174,27 @@ The loss curves, synthesized mel-spectrograms, and audios are shown.
 
 - For vocoder, **HiFi-GAN** and **MelGAN** are supported.
 
+### Updates Log
+- Feb.18, 2022 (v0.2.0): Update data preprocessor and variance adaptor & losses following [keonlee9420's DiffSinger](https://github.com/keonlee9420/DiffSinger) / Add various prosody modeling methods
+    1. Prepare two different types of data pipeline in the preprocessor to best support unsupervised/supervised duration modeling
+    2. Adopt wavelet for pitch modeling & loss
+    3. Add fine-grained duration loss
+    4. Apply `var_start_steps` for better model convergence, especially under unsupervised duration modeling
+    5. Remove dependency of energy modeling on pitch variance
+    6. Add "transformer_fs2" building block, which is closer to the original FastSpeech2 paper
+    7. Add two types of prosody modeling methods
+    8. Loss comparison on validation set:
+        - LJSpeech - blue: v0.1.1 / green: v0.2.0
+        <p align="center">
+            <img src="./img/loss_comparison_ljs.png" width="80%">
+        </p>
+
+        - VCTK - skyblue: v0.1.1 / orange: v0.2.0
+        <p align="center">
+            <img src="./img/loss_comparison_vctk.png" width="80%">
+        </p>
+- Sep.21, 2021 (v0.1.1): Initialize with [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2)
+
 # Citation
 
 Please cite this repository by the "[Cite this repository](https://github.blog/2021-08-19-enhanced-support-citations-github/)" of **About** section (top right of the main page).
@@ -166,4 +206,8 @@ Please cite this repository by the "[Cite this repository](https://github.blog/2
 - [lucidrains' long-short-transformer](https://github.com/lucidrains/long-short-transformer)
 - [sooftware's conformer](https://github.com/sooftware/conformer)
 - [lucidrains' reformer-pytorch](https://github.com/lucidrains/reformer-pytorch)
+- [sagelywizard's pytorch-mdn](https://github.com/sagelywizard/pytorch-mdn)
+- [keonlee9420's Robust_Fine_Grained_Prosody_Control](https://github.com/keonlee9420/Robust_Fine_Grained_Prosody_Control)
+- [keonlee9420's Cross-Speaker-Emotion-Transfer](https://github.com/keonlee9420/Cross-Speaker-Emotion-Transfer)
+- [keonlee9420's DiffSinger](https://github.com/keonlee9420/DiffSinger)
 - [NVIDIA's NeMo](https://github.com/NVIDIA/NeMo): Special thanks to [Onur Babacan](https://github.com/babua) and [Rafael Valle](https://github.com/rafaelvalle) for unsupervised duration modeling.

audio/stft.py

+81
@@ -2,14 +2,21 @@
 import torch.nn.functional as F
 import numpy as np
 from scipy.signal import get_window
+import librosa
 from librosa.util import pad_center, tiny
 from librosa.filters import mel as librosa_mel_fn
+import pyloudnorm as pyln
 
 from audio.audio_processing import (
     dynamic_range_compression,
     dynamic_range_decompression,
     window_sumsquare,
 )
+from audio.tools import (
+    librosa_pad_lr,
+    amp_to_db,
+    normalize,
+)
 
 
 class STFT(torch.nn.Module):
@@ -176,3 +183,77 @@ def mel_spectrogram(self, y):
         energy = torch.norm(magnitudes, dim=1)
 
         return mel_output, energy
+
+
+class FastSpeechSTFT(torch.nn.Module):
+    def __init__(
+        self,
+        fft_size,
+        hop_size,
+        win_length,
+        num_mels,
+        sample_rate,
+        fmin,
+        fmax,
+        window='hann',
+        eps=1e-10,
+        loud_norm=False,
+        min_level_db=-100,
+    ):
+        super(FastSpeechSTFT, self).__init__()
+        self.fft_size = fft_size
+        self.hop_size = hop_size
+        self.win_length = win_length
+        self.num_mels = num_mels
+        self.sample_rate = sample_rate
+        self.fmin = fmin
+        self.fmax = fmax
+        self.window = window
+        self.eps = eps
+        self.loud_norm = loud_norm
+        self.min_level_db = min_level_db
+
+    def mel_spectrogram(self, wav, return_linear=False):
+        """Computes a log-mel-spectrogram and energy from a single waveform
+        PARAMS
+        ------
+        wav: np.ndarray with shape (T,) in range [-1, 1]
+
+        RETURNS
+        -------
+        wav: np.ndarray, padded waveform aligned to the mel frames
+        mel: np.ndarray of shape (num_mels, T')
+        energy: np.ndarray, energy derived from the mel-spectrogram
+        """
+        if self.loud_norm:
+            meter = pyln.Meter(self.sample_rate)  # create BS.1770 meter
+            loudness = meter.integrated_loudness(wav)
+            wav = pyln.normalize.loudness(wav, loudness, -22.0)
+            if np.abs(wav).max() > 1:
+                wav = wav / np.abs(wav).max()
+
+        # get amplitude spectrogram
+        x_stft = librosa.stft(wav, n_fft=self.fft_size, hop_length=self.hop_size,
+                              win_length=self.win_length, window=self.window, pad_mode="constant")
+        spc = np.abs(x_stft)  # (n_bins, T)
+
+        # get mel basis
+        fmin = 0 if self.fmin == -1 else self.fmin
+        fmax = self.sample_rate / 2 if self.fmax == -1 else self.fmax
+        mel_basis = librosa.filters.mel(self.sample_rate, self.fft_size, self.num_mels, fmin, fmax)
+        mel = mel_basis @ spc
+
+        # get log scaled mel
+        mel = np.log10(np.maximum(self.eps, mel))
+
+        l_pad, r_pad = librosa_pad_lr(wav, self.fft_size, self.hop_size, 1)
+        wav = np.pad(wav, (l_pad, r_pad), mode='constant', constant_values=0.0)
+        wav = wav[:mel.shape[1] * self.hop_size]
+
+        # get energy
+        energy = np.sqrt(np.exp(mel) ** 2).sum(-1)
+
+        if not return_linear:
+            return wav, mel, energy
+        else:
+            spc = amp_to_db(spc)
+            spc = normalize(spc, self.min_level_db)
+            return wav, mel, energy, spc
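A short usage sketch for the new `FastSpeechSTFT` class; the STFT and mel parameter values below are illustrative assumptions, not values taken from this commit's configs:

```python
# Illustrative sketch: exercise FastSpeechSTFT on one second of dummy audio.
# The fft/hop/mel settings here are assumptions for the example only.
import numpy as np
from audio.stft import FastSpeechSTFT

stft = FastSpeechSTFT(
    fft_size=1024, hop_size=256, win_length=1024,
    num_mels=80, sample_rate=22050, fmin=0, fmax=8000,
)
wav = np.random.uniform(-0.5, 0.5, 22050).astype(np.float32)  # 1 s of noise
wav, mel, energy = stft.mel_spectrogram(wav)
print(mel.shape)  # (num_mels, T)
```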

audio/tools.py

+21
@@ -32,3 +32,24 @@ def inv_mel_spec(mel, out_filename, _stft, griffin_iters=60):
     audio = audio.cpu().numpy()
     audio_path = out_filename
     write(audio_path, _stft.sampling_rate, audio)
+
+
+def librosa_pad_lr(x, fsize, fshift, pad_sides=1):
+    '''compute right padding (final frame) or both sides padding (first and final frames)
+    '''
+    assert pad_sides in (1, 2)
+    # return int(fsize // 2)
+    pad = (x.shape[0] // fshift + 1) * fshift - x.shape[0]
+    if pad_sides == 1:
+        return 0, pad
+    else:
+        return pad // 2, pad // 2 + pad % 2
+
+
+# Conversions
+def amp_to_db(x):
+    return 20 * np.log10(np.maximum(1e-5, x))
+
+
+def normalize(S, min_level_db):
+    return (S - min_level_db) / -min_level_db
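A brief sketch of how the new helpers compose; the sizes and arrays below are placeholders, not values from the repository:

```python
# Illustrative sketch: pad a waveform to a whole number of hop-size frames, then
# map an amplitude spectrogram into a roughly [0, 1] range.
import numpy as np
from audio.tools import librosa_pad_lr, amp_to_db, normalize

wav = np.random.uniform(-1, 1, 10000)
l_pad, r_pad = librosa_pad_lr(wav, fsize=1024, fshift=256, pad_sides=1)
wav = np.pad(wav, (l_pad, r_pad))                 # len(wav) is now a multiple of 256

spc = np.abs(np.random.randn(513, 40))            # stand-in amplitude spectrogram
spc_db = amp_to_db(spc)                           # 20 * log10, floored at 1e-5
spc_norm = normalize(spc_db, min_level_db=-100)   # maps roughly [-100, 0] dB to [0, 1]
```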

config/LJSpeech/model.yaml

+52-6
@@ -1,9 +1,46 @@
-block_type: "transformer"
+block_type: "transformer_fs2" # ["transformer_fs2", "transformer", "fastformer", "lstransformer", "conformer", "reformer"]
 
 duration_modeling:
   learn_alignment: True
   aligner_temperature: 0.0005
 
+prosody_modeling:
+  model_type: "none" # ["none", "du2021", "liu2021"]
+
+  # Du et al., 2021
+  # This is only supported under supervised duration modeling (learn_alignment: False)
+  du2021:
+    extractor_kernel_size: 9
+    predictor_kernel_size: [9, 5]
+    predictor_num_gaussians: 20
+    predictor_dropout: 0.2
+
+  # Liu et al., 2021
+  # This is only tested under supervised duration modeling (learn_alignment: False)
+  liu2021:
+    bottleneck_size_u: 256
+    bottleneck_size_p: 4
+    ref_enc_filters: [32, 32, 64, 64, 128, 128]
+    ref_enc_size: [3, 3]
+    ref_enc_strides: [1, 2] # '1' is to keep the sequence length
+    ref_enc_pad: [1, 1]
+    ref_enc_gru_size: 32
+    ref_attention_dropout: 0.
+    token_num: 32
+    predictor_kernel_size: 3 # [9, 5] for non-parallel predictor / 3 for parallel predictor
+    predictor_dropout: 0.5
+
+transformer_fs2:
+  encoder_layer: 4
+  encoder_head: 2
+  encoder_hidden: 256
+  decoder_layer: 6
+  decoder_head: 2
+  decoder_hidden: 256
+  ffn_kernel_size: 9
+  encoder_dropout: 0.1
+  decoder_dropout: 0.1
+
 transformer:
   encoder_layer: 4
   encoder_head: 2
@@ -37,18 +74,27 @@ reformer:
 
 variance_predictor:
   filter_size: 256
-  kernel_size: 3
+  predictor_grad: 0.1
+  predictor_layers: 2
+  predictor_kernel: 5
+  cwt_hidden_size: 128
+  cwt_std_scale: 0.8
+  dur_predictor_layers: 2
+  dur_predictor_kernel: 3
   dropout: 0.5
+  ffn_padding: "SAME"
+  ffn_act: "gelu"
 
 variance_embedding:
-  kernel_size: 9
-  pitch_quantization: "linear" # support 'linear' or 'log', 'log' is allowed only if the pitch values are not normalized during preprocessing
+  use_pitch_embed: True
+  pitch_n_bins: 300
+  use_energy_embed: True
+  energy_n_bins: 256
   energy_quantization: "linear" # support 'linear' or 'log', 'log' is allowed only if the energy values are not normalized during preprocessing
-  n_bins: 256
 
 multi_speaker: False
 
-max_seq_len: 1000
+max_seq_len: 1000 # max sequence length of LJSpeech is 870
 
 vocoder:
   model: "HiFi-GAN" # support 'HiFi-GAN', 'MelGAN'

config/LJSpeech/preprocess.yaml

+9-2
@@ -12,6 +12,7 @@ preprocessing:
     text_cleaners: ["english_cleaners"]
     language: "en"
   audio:
+    trim_top_db: 23
     sampling_rate: 22050
     max_wav_value: 32768.0
   stft:
@@ -23,8 +24,14 @@ preprocessing:
     mel_fmin: 0
     mel_fmax: 8000 # please set to 8000 for HiFi-GAN vocoder, set to null for MelGAN vocoder
   pitch:
-    feature: "phoneme_level" # support 'phoneme_level' or 'frame_level'
-    normalization: True
+    pitch_type: "cwt" # support 'frame', 'ph', 'cwt'
+    pitch_norm: "log" # support 'standard', 'log'
+    pitch_norm_eps: 0.000000001
+    pitch_ar: False
+    with_f0: True
+    with_f0cwt: True
+    use_uv: True
+    cwt_scales: -1
   energy:
     feature: "phoneme_level" # support 'phoneme_level' or 'frame_level'
     normalization: True
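As context for `pitch_norm: "log"`, `pitch_norm_eps`, and `use_uv` above, a minimal sketch of log-domain pitch normalization with an unvoiced mask; this is an illustration only, not the repository's preprocessing code:

```python
# Illustrative sketch: log-normalize an f0 contour and keep an unvoiced (uv) mask,
# mirroring what pitch_norm: "log", pitch_norm_eps, and use_uv suggest.
import numpy as np

def normalize_pitch(f0, eps=1e-9):
    uv = f0 == 0.0                        # unvoiced frames have f0 == 0
    log_f0 = np.log(np.maximum(f0, eps))  # log-domain pitch; eps guards against log(0)
    log_f0[uv] = 0.0                      # zero out unvoiced frames after the transform
    return log_f0, uv

f0 = np.array([0.0, 110.0, 220.0, 0.0, 440.0])
log_f0, uv = normalize_pitch(f0)
```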

config/LJSpeech/train.yaml

+15
@@ -17,13 +17,28 @@ optimizer:
   warm_up_step: 4000
   anneal_steps: [300000, 400000, 500000]
   anneal_rate: 0.3
+loss:
+  noise_loss: "l1"
+  dur_loss: "mse"
+  pitch_loss: "l1"
+  cwt_loss: "l1"
+  # cwt_add_f0_loss: false
+  lambda_f0: 1.0
+  lambda_uv: 1.0
+  lambda_ph_dur: 1.0
+  lambda_word_dur: 1.0
+  lambda_sent_dur: 1.0
 step:
   total_step: 900000
   log_step: 100
   synth_step: 1000
   val_step: 1000
   save_step: 25000
+  var_start_steps: 50000
 duration:
   binarization_start_steps: 6000
   binarization_loss_enable_steps: 18000
   binarization_loss_warmup_steps: 10000
+prosody:
+  gmm_mdn_beta: 0.02
+  prosody_loss_enable_steps: 100000
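A hedged sketch of how `var_start_steps` and the `lambda_*` weights above are typically applied in a training step; this is a generic illustration, not this repository's trainer:

```python
# Hypothetical sketch: delay variance losses until var_start_steps, then weight them
# with the lambda_* values from this config. The loss values are placeholders.
def total_loss(step, losses, cfg):
    loss = losses["mel"]
    if step >= cfg["step"]["var_start_steps"]:
        loss = loss + cfg["loss"]["lambda_f0"] * losses["pitch"]
        loss = loss + cfg["loss"]["lambda_uv"] * losses["uv"]
        loss = loss + cfg["loss"]["lambda_ph_dur"] * losses["phone_duration"]
        loss = loss + cfg["loss"]["lambda_word_dur"] * losses["word_duration"]
        loss = loss + cfg["loss"]["lambda_sent_dur"] * losses["sentence_duration"]
    return loss
```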
