SelVA: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
.
├── weights/
│   ├── video_enc_sup_5.pth              # text-conditioned video encoder
│   └── generator_small_16k_sup_5.pth    # v2a generator
└── ext_weights/
    ├── synchformer_state_dict.pth       # pretrained Synchformer (24-01-04T16-39-21)
    ├── best_netG.pt                     # BigVGAN vocoder
    ├── v1-16.pth                        # VAE, 16 kHz
    └── v1-44.pth                        # VAE, 44 kHz
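
Before running inference, it can help to confirm that every checkpoint in the layout above is in place. The following is a minimal sketch, not part of the SelVA codebase: the helper name `missing_weights` and the assumption that the tree is rooted at the current working directory are illustrative only. The file names are copied from the tree.

```python
from pathlib import Path

# Checkpoint paths copied from the directory tree, relative to the root (".").
EXPECTED_FILES = [
    "weights/video_enc_sup_5.pth",            # text-conditioned video encoder
    "weights/generator_small_16k_sup_5.pth",  # v2a generator
    "ext_weights/synchformer_state_dict.pth", # pretrained Synchformer
    "ext_weights/best_netG.pt",               # BigVGAN vocoder
    "ext_weights/v1-16.pth",                  # VAE, 16 kHz
    "ext_weights/v1-44.pth",                  # VAE, 44 kHz
]


def missing_weights(root="."):
    """Return the expected checkpoint paths that are not files under `root`.

    Hypothetical helper for illustration; `root` is assumed to be the
    directory that contains `weights/` and `ext_weights/`.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_weights()` from the root directory returns an empty list when all six files are present; otherwise it lists the relative paths that still need to be downloaded.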