SelVA: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation


.
├── weights/
│   ├── video_enc_sup_5.pth # text-conditioned video encoder
│   └── generator_small_16k_sup_5.pth # v2a generator
└── ext_weights/
    ├── synchformer_state_dict.pth # pretrained Synchformer (24-01-04T16-39-21)
    ├── best_netG.pt # BigVGAN vocoder
    ├── v1-16.pth # vae 16kHz
    └── v1-44.pth # vae 44kHz
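
Before running inference, it can help to confirm that every checkpoint in the layout above is in place. The following is a minimal sketch, not part of the SelVA code: the file names are copied from the tree, while the helper name `missing_checkpoints` and the assumption that both folders sit under the current working directory are hypothetical.

```python
from pathlib import Path

# Relative paths copied from the directory layout above.
EXPECTED_FILES = [
    "weights/video_enc_sup_5.pth",
    "weights/generator_small_16k_sup_5.pth",
    "ext_weights/synchformer_state_dict.pth",
    "ext_weights/best_netG.pt",
    "ext_weights/v1-16.pth",
    "ext_weights/v1-44.pth",
]


def missing_checkpoints(root="."):
    """Return the expected checkpoint paths that are not files under `root`.

    `root` is an assumed location (hypothetical); point it at wherever
    the `weights/` and `ext_weights/` folders were downloaded.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints(".")
    if missing:
        print("Missing checkpoints:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoints found.")
```

The check only tests that each file exists; it does not validate the contents or try to load the checkpoints.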