learn_alignment: True # True for unsupervised modeling, False for supervised modeling
```
# Quickstart
A `Dockerfile` is provided for `Docker` users.
## Inference
You have to download the [pretrained models](https://drive.google.com/drive/folders/1xEOVbv3PLfGX8EgEkzg1014c9h8QMxQ-?usp=sharing) and put them in `output/ckpt/DATASET/`. The models are trained under unsupervised duration modeling with the "transformer_fs2" building block.
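As a minimal sketch of that layout (the dataset name `LJSpeech` and the checkpoint filename below are placeholders for illustration, not taken from the release):

```
# Hypothetical example: dataset name and checkpoint filename are illustrative only.
mkdir -p output/ckpt/LJSpeech
mv ~/Downloads/CHECKPOINT.pth.tar output/ckpt/LJSpeech/
```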
For a **single-speaker TTS**, run
```
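# NOTE: the original command is elided in this excerpt. The line below is a
# hypothetical invocation assuming the FastSpeech2-style CLI this repo builds on;
# the flag names and dataset argument are assumptions.
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
```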
For the forced alignment, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided [here](https://drive.google.com/drive/folders/1fizpyOiQ1lG2UDaMlXnT3Ll4_j6Xwg7K?usp=sharing).
You have to unzip the files in `preprocessed_data/DATASET/TextGrid/`. Alternatively, you can [run the aligner by yourself](https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/workflows/index.html).
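As a rough sketch of both options (the zip filename, dataset name, and the MFA dictionary/acoustic-model arguments below are assumptions, not taken from this repo):

```
# Option 1: place the provided alignments (zip filename and target layout are placeholders).
unzip TextGrid.zip -d preprocessed_data/LJSpeech/TextGrid/

# Option 2: run Montreal Forced Aligner yourself (MFA 2.x-style invocation;
# corpus path, dictionary, and acoustic model are placeholders).
mfa align path/to/corpus path/to/lexicon.dict english_us_arpa preprocessed_data/LJSpeech/TextGrid/
```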
The loss curves, synthesized mel-spectrograms, and audios are shown.
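They are served through TensorBoard; a typical way to view them (the log directory is an assumption based on the `output/` layout used above) is:

```
# Assumed log location; adjust to your actual output directory.
tensorboard --logdir output/log
```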
## LJSpeech

## VCTK
# Notes
- Both phoneme-level and frame-level variance are supported in both supervised and unsupervised duration modeling.
- Note that there are no pre-extracted phoneme-level variance features in unsupervised duration modeling.
- Convolutional embedding is used as in [StyleSpeech](https://github.com/keonlee9420/StyleSpeech) for phoneme-level variance in unsupervised duration modeling. Otherwise, bucket-based embedding is used as in [FastSpeech2](https://github.com/ming024/FastSpeech2).
- Unsupervised duration modeling at the phoneme level takes longer than at the frame level, since the additional computation of phoneme-level variance is activated at runtime.
- There are two options for speaker embedding in the **multi-speaker TTS** setting: training a speaker embedder from scratch or using a pre-trained [philipperemy's DeepSpeaker](https://github.com/philipperemy/deep-speaker) model (as [STYLER](https://github.com/keonlee9420/STYLER) did). You can toggle between them in the config (`'none'` vs. `'DeepSpeaker'`).
- DeepSpeaker on the VCTK dataset shows clear identification among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings.
- For the vocoder, **HiFi-GAN** and **MelGAN** are supported.
### Updates Log
- Feb. 18, 2022 (v0.2.0): Update the data preprocessor, variance adaptor, and losses following [keonlee9420's DiffSinger](https://github.com/keonlee9420/DiffSinger), and add various prosody modeling methods:
    1. Prepare two different types of data pipeline in the preprocessor to support both unsupervised and supervised duration modeling
2. Adopt wavelet for pitch modeling & loss
    3. Add fine-grained duration loss
4. Apply `var_start_steps` for better model convergence, especially under unsupervised duration modeling
5. Remove dependency of energy modeling on pitch variance
    6. Add the "transformer_fs2" building block, which is closer to the original FastSpeech2 paper
- Sep. 21, 2021 (v0.1.1): Initialize with [ming024's FastSpeech2](https://github.com/ming024/FastSpeech2)
# Citation
Please cite this repository using the "[Cite this repository](https://github.blog/2021-08-19-enhanced-support-citations-github/)" feature in the **About** section (top right of the main page).
- [NVIDIA's NeMo](https://github.com/NVIDIA/NeMo): Special thanks to [Onur Babacan](https://github.com/babua) and [Rafael Valle](https://github.com/rafaelvalle) for unsupervised duration modeling.