How to Train & Install F5 TTS - New Language and Single Speaker Voice Clone

Jarods Journey

2 months ago

20,068 views


Comments:

@markbondurant6434 - 14.11.2024 21:39

Why it's useless for audio books: Some things need to be pitched and some things can have multiple interpretations. For instance, "Your majesty" comes out as a squeak. Certainly that could work in some situations, but not all. In conversation, some voices need their pitch changed to denote gender and emotional state. The thing needs meta tags to indicate this information. Pauses are inconsistent and there's no way to adjust this. Metatags would be an easy implementation here. And there will certainly be no pregnant silences.

@m-j107 - 14.11.2024 23:27

Can you make a guide how to turn a male voice into an attractive female voice that also does a lot of purring, breathing and meowing while reading the text and also ideally it does ASMR sounds like scratching and tapping and smacking / chewing.

@GrimGearheart - 15.11.2024 00:14

What's the latest and greatest for real time voice changers? Is it still w-okada?

@احمدصبيح-خ7و - 15.11.2024 03:12

I trained the program on 20 hours of Arabic audio, which took a whole day of training, and the result is very weak: it does not pronounce words well. I think it needs a larger number of epochs, because I ran only 20 epochs on an 8 GB graphics card. I do not know whether there is another way to make it work.

@MUSTAFK - 15.11.2024 23:46

Are there any language models I can train on AMD? I have an RX 6800, but all tutorials seem to require an Nvidia GPU... I have 16 GB of VRAM, bruh, come on :(

@mikhailv4686 - 16.11.2024 00:54

Is the quota of 30 hours per week on kaggle enough to train a 25-hour dataset on 2 t4 video cards?

@mikhailv4686 - 16.11.2024 00:57

Can you add to the F5 fork the ability to save training results to Google Drive or Yandex Drive? The daily quota for Kaggle is 12 hours, so I need a way to resume training. Will you add this to the fork?

@atIonxp - 17.11.2024 00:01

How did you solve the memory (not GPU VRAM) problem? I have faced the same issue.

@timstevens3361 - 21.11.2024 03:57

I want TTS and STT in one API: free, no keys, and offline.

@CoreyHarker - 21.11.2024 16:53

Hey Jarod! Is there a way to use F5 TTS in an audio-to-audio type inference?

@AndrasEliassen - 21.11.2024 17:48

Does your dataset need audio from only 1 speaker? Or can I use a 100 hours dataset with 500 speakers, male, female, young and old?

@SudipPaul-u3x - 22.11.2024 14:58

Will it work on the Intel arc a750?

@przemowu4168 - 22.11.2024 18:38

Hey. Nice tutorial, but could you please also make some videos on how to tune a pretrained model to make it more accurate, and how to add a new model to the interface for further use? Thanks in advance.

@martinbobis6764 - 23.11.2024 06:42

Has anyone tried to train with an 8 or 12 GB GPU?

@flubbyblobfish3253 - 23.11.2024 18:38

Hi. Would this be able to train a model from scratch instead of finetuning the model bases given? Thanks.

@SyamsQbattar - 23.11.2024 19:17

Unfortunately, it does not support Indonesian language.

@CozinhaEsperta - 24.11.2024 03:48

Does this mean I can translate what a person says into another language?

@yikifooler - 24.11.2024 20:17

Dude can you please give me that Japanese trained model for free, Please.

@GriZlee1234 - 24.11.2024 20:30

Why are you making a video in English with Japanese audio samples? It makes no sense, as we don't understand the full process. You could make it in Japanese and in English separately.

@YannMetalhead - 26.11.2024 13:25

Good video.

@empireones - 28.11.2024 21:58

How to fix this? OSError: [WinError 126] The specified module could not be found. Error loading "D:\F5 TTS\F5-TTS\venv\Lib\site-packages\torch\lib\fbgemm.dll

@mercuryin1 - 28.11.2024 23:29

My question: I want to fine-tune a cloned voice to use as TTS with Piper and Home Assistant. I don't need to fine-tune a full language. Is it the same process as the one explained here, or can't it be done because the models can't be exported for use within Piper? Thanks.

@dathuynh-l4k - 29.11.2024 05:43

Can you tell me: once I have finished training all the models, can I exit the F5-TTS program? If I turn off the computer and want to open F5-TTS again the next day, what should I do?

@dathuynh-l4k - 30.11.2024 08:20

Can you tell me: after using F5 TTS, I shut it down. The next day I want to open it again. What should I do? When I opened it, I ran everything again from the beginning and it lost my previous training data.

@dathuynh-l4k - 01.12.2024 07:47

Error: No audio files found in the specified path : D:\F5-TTS\src\f5_tts\..\..\data\vn_thairadio_pinyin\wavs. I am getting this error. Can you help me?
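
The `..\..` segments in the reported path step upward from `src\f5_tts`, so the folder being searched sits under the repository root. A minimal sketch, using only the standard library and the path copied from the error message, prints the resolved location. Whether the dataset must live exactly there is an assumption based on this error text alone.

```python
import ntpath  # Windows path rules, usable on any operating system

# Path copied verbatim from the error message above.
reported = r"D:\F5-TTS\src\f5_tts\..\..\data\vn_thairadio_pinyin\wavs"

# Collapse the "..\.." segments to see which folder is actually searched.
resolved = ntpath.normpath(reported)
print(resolved)
```

If the `.wav` files are not inside the printed folder, that would be consistent with the "No audio files found" message.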

@sopriojang - 04.12.2024 16:14

Can you make a video on how to add a new language? There are already many fine-tuned models for other languages on Hugging Face.

@acesonfire - 05.12.2024 16:36

Sorry if this is a noob question, I am new to all of this, but why does my gradio interface look different than yours? Mine is black and orange.

@vektor3dx539 - 08.12.2024 16:40

It won't transcribe for me. I get "transcribe complete samples : 0. error files : 148" with a 3-sec wav file in the wavs folder, and an empty metadata.csv. Reinstalled from scratch, including ffmpeg and pathing it. Same thing. If anyone knows what this is, please let me know.

@anifireff4855 - 09.12.2024 17:22

I would like your assistance, please. I am trying to build my own AI assistant, and for that I am using the Edge-TTS module to get real-time text-to-speech output. Then I came across F5-TTS. After successfully training the model with a custom voice, how can I deploy it and program it into my AI assistant so that it can speak basic prompts like "I am doing this task, sir" or "Consider this task done, sir" with the trained F5-TTS model in real time? I want to know how to deploy it.

@elplayeravefenix2280 - 10.12.2024 06:17

Is there any TTS that works with Latin American Spanish?

@Dreadscapes - 10.12.2024 16:55

What are the minimum requirements to use this? I'm on a laptop.

@tee-hee9553 - 10.12.2024 17:49

I want to ask: what are the requirements for this application to work?

@Smartlearningstation-r3p - 10.12.2024 19:06

Nice. Can a local installation be integrated with Python to synthesize text into audio? For instance, in Python, I want to provide a sample voice and raw text, and the text would be synthesized using the sample voice, then saved as an MP3 file.

@8561 - 11.12.2024 19:26

Do you have any suggestions for finetuning parameters/solutions when the finetune is producing "beep" noises? I am getting banding of frequency close to 0 when generating audio from the finetune and not sure what could be causing this.

@kurotesuta - 12.12.2024 11:32

What I like is that the audio can be generated from Python. I will give it a try, since my other alternative is not working great on a Mac, lol.

@tsensuke5259 - 14.12.2024 19:03

Can it change a recorded voice from a man to a woman?

@ZenphylEdits - 15.12.2024 06:28

Can this work on an rx 6600? If not what is the equivalent?

@darthvader4899 - 16.12.2024 14:11

Can you fine-tune with 2-3 minutes of audio data? I just want to train for one specific voice and will use reference audio from the fine-tuning dataset.

Ответить
@AL_Momen_F
@AL_Momen_F - 19.12.2024 06:24

I don't know what to say, but my device's specs are four or five times lower than yours. To train it on two hours of recordings with 300 epochs, it took me 40 hours. If I had trained it on 10 hours, it might have taken me a month.
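
The figures in this comment allow a rough extrapolation. The sketch below assumes that wall-clock training time grows linearly with dataset length when the epoch count (300) and the hardware stay the same. That linear scaling is an assumption made for illustration, not something the comment or the video confirms.

```python
# Figures reported in the comment above.
audio_hours = 2    # hours of recordings in the dataset
train_hours = 40   # wall-clock hours for 300 epochs

# Assumption: training time scales linearly with dataset length
# for a fixed epoch count and fixed hardware.
hours_per_audio_hour = train_hours / audio_hours


def estimate_training_hours(dataset_hours: float) -> float:
    """Linear estimate of training time for a dataset of the given length."""
    return dataset_hours * hours_per_audio_hour


estimate = estimate_training_hours(10)
print(f"{estimate:.0f} hours, about {estimate / 24:.1f} days")
```

Under that assumption, a 10-hour dataset comes to 200 hours, a little over 8 days of continuous training.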

@bigdaddy5303 - 22.12.2024 06:10

The most impressive thing is just how well it does accents, notably the Australian accent. Every other voice cloner I've tried just ends up sounding American. This nails every accent.

@DavidHager1 - 24.12.2024 18:27

So can I fine-tune an existing model to sound more like one person, e.g. fine-tune the default F5-TTS model by training it further with a specific speaker's data?
Or can I only create a new model from scratch, like you did, with that particular speaker's data?

@JosédaSilva-z5y - 29.12.2024 20:51

Any suggestion for this error ? Thanks in advance.

size mismatch for transformer.text_embed.text_embed.weight: copying a param with shape torch.Size([2546, 512]) from checkpoint, the shape in current model is torch.Size([2546, 100]).
size mismatch for transformer.input_embed.proj.weight: copying a param with shape torch.Size([1024, 712]) from checkpoint, the shape in current model is torch.Size([1024, 300]).
Keyboard interruption in main thread... closing server.
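
Both mismatches in this log differ by the same amount, which hints that the checkpoint and the currently built model disagree only on the width of the text embedding (512 versus 100). The sketch below uses only the numbers from the log. That the projection layer's input includes the text embedding is an assumption, not something the log states.

```python
# Shapes copied from the two "size mismatch" lines above: (rows, columns).
checkpoint = {
    "text_embed": (2546, 512),
    "input_proj": (1024, 712),
}
current = {
    "text_embed": (2546, 100),
    "input_proj": (1024, 300),
}

# Difference in text embedding width between checkpoint and current model.
text_dim_gap = checkpoint["text_embed"][1] - current["text_embed"][1]

# Difference in the projection layer's input width.
proj_in_gap = checkpoint["input_proj"][1] - current["input_proj"][1]

print(text_dim_gap, proj_in_gap)
# If the two gaps match, the text embedding width alone can explain both errors.
print(text_dim_gap == proj_in_gap)
```

If the gaps match, a first thing to check would be whether the model type or configuration selected at load time is the one the checkpoint was trained with.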

@JosédaSilva-z5y - 30.12.2024 01:39

Firstly, thank you very much, great tutorial, magistral teacher!!! Everything works as you teach. Now, is there a tutorial showing how to use it to generate long texts? Great new year to all!!!

@JafsonAdam - 30.12.2024 10:25

Hello, do you mean you used 10 hours and 180k without a pretrained model? At first you said 7000 hours.

@CharlesVamadeva - 02.01.2025 08:24

hey Jarod. I realize each reference voice clip cannot be more than 15 secs. However, how many 15 sec audio voice clips of the same voice can be uploaded? And does having multiple reference voice clips improve the quality of the final synthesized voice that's generated ?

@BigTiddyGothGrappler - 02.01.2025 11:54

please fucking god stop saying so at the start of every sentence

@MREDZ - 12.01.2025 10:25

Hey, after installing and running the inference using the given URL, I am wondering how do I launch the inference once again after closing the cmd and inference web browser tab. There is no .bat file that I can see :(
