TTS model comparisons with an in-house single-speaker dataset
Here are audio samples synthesized with different TTS architectures. The dataset contains 10 hours and 42 minutes of speech from a single female speaker, recorded in a studio environment for the best output quality.
Sentences
- Sample #01: Amaanyi poliisi g’etadde ku muyiggo
- Sample #02: Musono ki ogwo ogw’enviiri?
- Sample #03: Omuganda amanyi nti omuntu alina emmeeme bbiri.
- Sample #04: Tolya wadde okubaaga ekinyonyi ekifudde oba ekirwadde.
- Sample #05: Obulwadde buno businga kuba bwa mutawaana mu bifo ebinnyuukirivu ennyo, omutera okutonnya enkuba ennyingi ate ng’ebitooke bisimbiddwa kumuukumu.
Ground truth
Sample original recordings from the in-house dataset.
Sample | Text | Audio |
---|---|---|
01 | Okwogera mu ngeri ey’amakulu mu mbeera yonna. | |
02 | Omu ku booluganda b’abavubuka bano, Francis Obbo bwe yagambye. | |
03 | Amawanga agerinaanye getaaga okukwatagana. | |
04 | Kirungi nnyo omulunzi okufuuyira embuzi ze okusobola okuzigobako enkwa. | |
English HiFi-GAN v2 vocoder by Coqui
Model details (todo: link): Tacotron2 + DDC trained for 302k steps, paired with the Coqui HiFi-GAN v2 vocoder.
Sample | Text | Audio |
---|---|---|
01 | Amaanyi poliisi g’etadde ku muyiggo | |
02 | Musono ki ogwo ogw’enviiri? | |
03 | Omuganda amanyi nti omuntu alina emmeeme bbiri. | |
04 | Tolya wadde okubaaga ekinyonyi ekifudde oba ekirwadde. | |
05 | Obulwadde buno businga kuba bwa mutawaana mu bifo ebinnyuukirivu ennyo, omutera okutonnya enkuba ennyingi ate ng’ebitooke bisimbiddwa kumuukumu. | |
WaveRNN
todo
WaveGrad
todo
HiFi-GAN
todo
End-to-end TTS architecture: VITS
VITS: trained for 63k steps
Sample | Text | Audio |
---|---|---|
01 | Amaanyi poliisi g’etadde ku muyiggo | |
02 | Musono ki ogwo ogw’enviiri? | |
03 | Omuganda amanyi nti omuntu alina emmeeme bbiri. | |
04 | Tolya wadde okubaaga ekinyonyi ekifudde oba ekirwadde. | |
05 | Obulwadde buno businga kuba bwa mutawaana mu bifo ebinnyuukirivu ennyo, omutera okutonnya enkuba ennyingi ate ng’ebitooke bisimbiddwa kumuukumu. | |
Glow-TTS / WaveGlow
todo
Speedy Speech
For a better real-time factor (RTF): it is the fastest model on Coqui.
todo
FastSpeech
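RTF (real-time factor) is the wall-clock time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows one way it could be measured when comparing the models on this page. The `synthesize` callable, the `fake_synthesize` stand-in, and the 22050 Hz sample rate are assumptions for illustration, not the setup used for the samples here.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Return synthesis wall-clock time divided by generated audio duration."""
    start = time.perf_counter()
    waveform = synthesize(text)  # expected to return raw audio samples
    elapsed = time.perf_counter() - start

    audio_seconds = len(waveform) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio samples")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> list:
    # Hypothetical stand-in for a real model call:
    # sleeps briefly, then returns 0.1 s of silence per character at 22050 Hz.
    time.sleep(0.01)
    return [0.0] * (2205 * len(text))


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "Musono ki ogwo ogw’enviiri?", 22050)
    print(f"RTF: {rtf:.4f}")
```

To compare architectures fairly, the same sentences and hardware would be used for every model, and the RTF averaged over several runs after a warm-up call.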
todo