A placeholder. Generating speech, without a speaker.
Kyle Kastner's suggestions
VoCo seems to be a classic concatenative synthesis method for doing “voice cloning”, which generally will work on small datasets but won't really generalize beyond the subset of sound tokens you already have. I did a blog post on a really simple version of this (http://kastnerkyle.github.io/posts/bad-speech-synthesis-made-simple/).
There's another cool web demo of how this works (http://jungle.horse/#). Improving concatenative results to get to VoCo level is mostly a matter of better features and a lot of work on the DSP side to fix obvious structural errors, probably along with adding a language model to improve transitions and search.
You can see an example of concatenative audio for “music transfer” here (http://spectrum.mat.ucsb.edu/~b.sturm/sand/VLDCMCaR/VLDCMCaR.html).
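To make the concatenative idea above concrete, here is a minimal sketch, not VoCo's actual method and not the code from the linked blog post. It assumes toy per-frame features (RMS energy and zero-crossing rate, where a real system would use something like MFCCs), a hypothetical `join_weight` parameter, and a plain Viterbi search that trades off a target cost (how well a corpus unit matches the desired frame) against a join cost (how badly two units glue together). All function names are made up for illustration, and the DSP smoothing at the joins is left out.

```python
import math

def frames(signal, size):
    """Split a signal into non-overlapping frames of `size` samples (tail dropped)."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def features(frame):
    """Toy features (an assumption): RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / (len(frame) - 1))

def dist(u, v):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_units(target_feats, corpus_feats, corpus_frames, join_weight=0.1):
    """Viterbi search for the corpus unit sequence minimising target + join cost."""
    if not target_feats or not corpus_feats:
        return []
    m = len(corpus_feats)

    def join(i, j):
        # Units that were adjacent in the corpus join for free; otherwise
        # penalise the sample discontinuity at the boundary.
        if j == i + 1:
            return 0.0
        return abs(corpus_frames[i][-1] - corpus_frames[j][0])

    # cost[j] = cheapest total cost of any path ending in corpus unit j
    cost = [dist(target_feats[0], c) for c in corpus_feats]
    back = []
    for t in range(1, len(target_feats)):
        new_cost, ptr = [], []
        for j in range(m):
            best_i = min(range(m), key=lambda i: cost[i] + join_weight * join(i, j))
            new_cost.append(cost[best_i] + join_weight * join(best_i, j)
                            + dist(target_feats[t], corpus_feats[j]))
            ptr.append(best_i)
        cost = new_cost
        back.append(ptr)

    # Trace the best path backwards from the cheapest final unit.
    j = min(range(m), key=lambda k: cost[k])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]

def synthesize(target, corpus, size=64, join_weight=0.1):
    """Rebuild `target` out of frames of `corpus`; returns (unit indices, samples)."""
    corpus_frames = frames(corpus, size)
    corpus_feats = [features(f) for f in corpus_frames]
    target_feats = [features(f) for f in frames(target, size)]
    path = select_units(target_feats, corpus_feats, corpus_frames, join_weight)
    out = []
    for j in path:
        out.extend(corpus_frames[j])  # plain concatenation, no crossfade
    return path, out
```

Calling `synthesize(target, corpus)` with two lists of float samples returns the chosen corpus frame indices and the concatenated audio. The search is quadratic in the number of corpus frames, so this is only sensible for tiny examples. The "better features" and "language model" improvements mentioned above would correspond to swapping out `features` and the `join` cost respectively.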
I personally think Apple's hybrid approach has a lot more potential than plain VoCo for style transfer (https://machinelearning.apple.com/2017/08/06/siri-voices.html) - I like this paper a lot!
For learning about the prerequisites to Lyrebird, I recommend Alex Graves' monograph (https://arxiv.org/abs/1308.0850), then watching Alex Graves' lecture which shows the extension to speech (https://www.youtube.com/watch?v=-yX1SYeDHbg&t=37m00s), and maybe checking out our paper char2wav (https://mila.quebec/en/publication/char2wav-end-to-end-speech-synthesis/). There's a lot of background we couldn't fit in 4 pages for the workshop, but reading Graves' past work should cover most of that, along with WaveNet and SampleRNN (https://arxiv.org/abs/1612.07837). Lyrebird itself is proprietary, but going through these works should give you a lot of ideas about techniques to try.
I wouldn't recommend GANs for audio unless you are already quite familiar with GANs in general. It is very hard to get any generative model working on audio, let alone a GAN.
char2wav shows how to condition an acoustic model.
Lyrebird. How Siri does it.
WaveGAN does do GANs for audio, but the sequences are quite short. Kastner might be right about long sequences with GANs. But it does have an online demo.
CycleGAN is pure voice conversion.
A handy data set of speech on YouTube. It's not clear where to download it from. The dataset it is based on, AVA, doesn't have the speech part.
Apparently AudioSet also?