This comparison showcases outputs from a fine-tuned VITS model, pre-trained on female Luganda speech from the Common Voice dataset and further fine-tuned on 2.8 hours of professionally recorded studio data from a single female Luganda speaker. The samples below illustrate how different checkpoints perform on the same input sentences.
Sentence | Model 1 | Model 2 | Model 3 |
---|