Generative stochastic models for audio. Analyse audio using machine listening methods to decompose it into features, perhaps over a sparse basis (as in learning gamelan), possibly of low dimension thanks to sparsification, and with some stochastic dependence structure, e.g. a random field or regression model of some kind. Then simulate new features from that stochastic model. Depending on your cost function, how well your model fit, and how you smoothed your data, this might produce something acoustically indistinguishable from the source, or amount to concatenative synthesis from a sparse basis dictionary, or yield a parametric synthesizer software package.
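The analyse/model/simulate loop above can be sketched under drastic simplifying assumptions: features are per-frame magnitude spectra, the stochastic model is a first-order Markov chain over crudely quantized frames, and "synthesis" concatenates frames sampled from the chain. Every function name here is illustrative, not any published system:

```python
import numpy as np

def analyse(signal, frame_len=256):
    """Decompose a signal into frames and magnitude-spectrum features."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    feats = np.abs(np.fft.rfft(frames, axis=1))  # crude machine-listening features
    return frames, feats

def fit_markov(feats, k=8, seed=0):
    """Quantize frames into k states and estimate transition probabilities."""
    rng = np.random.default_rng(seed)
    # naive quantization: nearest of k randomly chosen exemplar frames
    centres = feats[rng.choice(len(feats), k, replace=False)]
    states = np.argmin(((feats[:, None] - centres[None]) ** 2).sum(-1), axis=1)
    trans = np.full((k, k), 1e-6)  # smoothed transition counts
    for a, b in zip(states[:-1], states[1:]):
        trans[a, b] += 1
    return states, trans / trans.sum(axis=1, keepdims=True)

def simulate(frames, states, trans, n_out, seed=0):
    """Sample a state path from the chain, concatenating matching frames."""
    rng = np.random.default_rng(seed)
    k = trans.shape[0]
    path, s = [], rng.integers(k)
    for _ in range(n_out):
        s = rng.choice(k, p=trans[s])
        candidates = np.flatnonzero(states == s)
        path.append(frames[rng.choice(candidates)] if len(candidates) else frames[0])
    return np.concatenate(path)
```

A real system would use overlap-add, perceptually meaningful features, and a far richer dependence model; this only shows where each ingredient slots in.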
There is a lot of funny business in machine learning for polyphonic audio. For a start, a naive linear-algebra-style decomposition doesn’t perform well, because human acoustic perception is messy: all white noise sounds the same to us, but deterministic models need a large basis to minutely approximate any particular noise realisation in the \(L_2\) norm; our phase sensitivity is frequency-dependent; adjacent frequencies mask each other; and there are many other effects I don’t know about. One could use cost functions based on psychoacoustic cochlear models, but those are tricky to synthesize from (although possible, if perhaps unsatisfying, with a neural network). There are also classic psychoacoustic decompositions such as mel-frequency cepstral coefficients, but these are even harder to invert.
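The white-noise point can be checked numerically. Two independent noise realisations are far apart sample-by-sample (so any decomposition fit in the sample-domain \(L_2\) norm treats them as totally different signals), yet their averaged power spectra, a crude stand-in for a perceptual distance, nearly coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(65536)  # two independent white-noise "recordings"
b = rng.standard_normal(65536)

# Sample-domain relative L2 distance: large, although they sound identical.
l2 = np.linalg.norm(a - b) / np.linalg.norm(a)

def avg_power_spectrum(x, frame=1024):
    """Average the power spectrum over short frames."""
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    return np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)

# Averaged-power-spectrum distance (a crude perceptual proxy): much smaller.
pa, pb = avg_power_spectrum(a), avg_power_spectrum(b)
spectral_mismatch = np.linalg.norm(pa - pb) / np.linalg.norm(pa)

print(l2, spectral_mismatch)  # the first is large, the second much smaller
```

The averaged power spectrum is of course a caricature of a psychoacoustic model, but it already illustrates why the choice of cost function matters so much.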
A.k.a. concatenative synthesis. I’m publishing in this area.
Non-negative matrix factorisation approaches
Authors such as (Virtanen 2007; Bertin, Badeau, and Vincent 2010; Vincent, Bertin, and Badeau 2008; Févotte, Bertin, and Durrieu 2008; Smaragdis and Brown 2003) popularised the use of non-negative matrix factorisations to identify the “activations” of power spectrograms for music analysis. It didn’t take long for this to be used in resynthesis tasks, by e.g. Aarabi and Peeters (2018), Buch, Quinton, and Sturm (2017) (source, site), Driedger and Pratzlich (2015) (site), Hoffman, Blei, and Cook (2010).
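The activation-finding step can be sketched with the classic multiplicative updates for the squared Euclidean objective (a generic textbook formulation, not the specific algorithm of any of the papers cited, several of which use other divergences and priors), run on a toy "spectrogram":

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0):
    """Factor a non-negative matrix V ≈ W @ H with multiplicative updates,
    minimizing squared Euclidean error.
    W holds spectral templates (freq x k); H holds activations (k x time)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + 1e-3
    H = rng.random((k, T)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)  # update activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)  # update templates
    return W, H

# Toy "spectrogram": two spectral templates switching on and off over time.
t1 = np.array([1.0, 0.0, 0.5, 0.0])
t2 = np.array([0.0, 1.0, 0.0, 0.5])
act = np.array([[1, 1, 0, 0, 1], [0, 1, 1, 1, 0]], dtype=float)
V = np.outer(t1, act[0]) + np.outer(t2, act[1])

W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Mosaicing-style resynthesis then replaces or re-triggers the learned templates in `W` while keeping (or resampling) the activations in `H`.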
A first step might be to find some model which can approximately capture both the cyclic and the disordered components of the signal. Indeed Metamorph and smstools, based on a “sinusoids+noise” model, do this kind of decomposition, but they mostly use it for resynthesis in limited ways, rather than simulating realisations from a fitted model of some possibly stochastic process. There is an implementation in Csound called ATS which looks interesting.
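A drastically simplified sinusoids+noise decomposition, in the spirit of spectral modeling synthesis but nothing like the actual smstools/Metamorph implementations, picks deterministic sinusoids off FFT peaks and models the residual purely by its variance, so "resynthesis" redraws the stochastic part:

```python
import numpy as np

def sines_plus_noise(x, n_sines=1):
    """Estimate dominant sinusoids from FFT peaks; the residual is modelled
    only by its variance (a stochastic, not deterministic, component)."""
    n = len(x)
    spec = np.fft.rfft(x)
    mags = np.abs(spec).copy()
    mags[0] = 0.0  # ignore the DC bin
    sinusoid = np.zeros(n)
    for _ in range(n_sines):
        k = int(np.argmax(mags))
        amp = 2 * np.abs(spec[k]) / n
        phase = np.angle(spec[k])
        sinusoid += amp * np.cos(2 * np.pi * k * np.arange(n) / n + phase)
        mags[k] = 0.0
    residual = x - sinusoid
    return sinusoid, float(np.var(residual))

def resynthesize(sinusoid, noise_var, seed=0):
    """Replay the deterministic part; draw the stochastic part afresh."""
    rng = np.random.default_rng(seed)
    return sinusoid + rng.standard_normal(len(sinusoid)) * np.sqrt(noise_var)

# A bin-aligned sine buried in noise separates cleanly.
n = 4096
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 50 * np.arange(n) / n) + 0.2 * rng.standard_normal(n)
s, v = sines_plus_noise(x)
y = resynthesize(s, v)
```

Real systems track partials frame-by-frame and shape the noise spectrally; the point here is only that the residual gets a stochastic model one can simulate from, which is exactly the step the off-the-shelf tools stop short of.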
Some non-parametric conditional wavelet density sounds more fun to me, maybe as a Markov random field, although exactly what generative model I would fit here is still opaque to me. The sequence probably possesses structure at multiple scales, and there is evidence that music might have a recursive grammatical structure, which would be hard to learn even if we had a perfect decomposition.
What is Loris?
Existing generative models for audio have predominantly aimed to directly model time-domain waveforms. MelNet instead aims to model the frequency content of an audio signal. MelNet can be used to model audio unconditionally, making it capable of tasks such as music generation. It can also be conditioned on text and speaker, making it applicable to tasks such as text-to-speech and voice conversion.
Matt Vitelli on music generation from MP3s (source).
Soundtracking audio from video.
Alex Graves on RNN predictive synthesis.
Parag Mittal on RNN style transfer.
Andy Sarroff, Musical Audio Synthesis Using Autoencoding Neural Nets. (code)
Neural style transfer for audio is crying out to be done, but I’ve only seen more traditional techniques. (UPDATE: It’s happening these days, but google it for yourself as I’m busy.)
PixelRNN turns out to be good at music. Dadabots have successfully weaponised SampleRNN, and it’s cute.
[Jlin and Holly Herndon](http://cdm.link/2018/12/jlin-holly-herndon-and-spawn-find-beauty-in-ais-flaws/) show off some artistic use of messed-up neural nets.
Aarabi, Hadrien Foroughmand, and Geoffroy Peeters. 2018. “Music Retiler: Using NMF2D Source Separation for Audio Mosaicing.” In Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion, 27:1–27:7. AM’18. New York, NY, USA: ACM. https://doi.org/10.1145/3243274.3243299.
Anonymous. 2018a. “Autoencoder-Based Music Translation,” September. https://openreview.net/forum?id=HJGkisCcKm.
———. 2018b. “Modulated Variational Auto-Encoders for Many-to-Many Musical Timbre Transfer,” September. https://openreview.net/forum?id=HJgOl3AqY7.
———. 2018c. “Synthnet: Learning Synthesizers End-to-End,” September. https://openreview.net/forum?id=H1lUOsA9Fm.
Bertin, N., R. Badeau, and E. Vincent. 2010. “Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription.” IEEE Transactions on Audio, Speech, and Language Processing 18 (3): 538–49. https://doi.org/10.1109/TASL.2010.2041381.
Blaauw, Merlijn, and Jordi Bonada. 2017. “A Neural Parametric Singing Synthesizer,” April. http://arxiv.org/abs/1704.03809.
Boyes, Graham. 2011. “Dictionary-Based Analysis/Synthesis and Structured Representations of Musical Audio.” McGill University. http://mt.music.mcgill.ca/~boyesg/GBoyes_MAthesis-Final.pdf.
Buch, Michael, Elio Quinton, and Bob L Sturm. 2017. “NichtnegativeMatrixFaktorisierungnutzendesKlangsynthesenSystem (NiMFKS): Extensions of NMF-Based Concatenative Sound Synthesis.” In Proceedings of the 20th International Conference on Digital Audio Effects, 7. Edinburgh.
Caetano, Marcelo, and Xavier Rodet. 2013. “Musical Instrument Sound Morphing Guided by Perceptually Motivated Features.” IEEE Transactions on Audio, Speech, and Language Processing 21 (8): 1666–75. https://doi.org/10.1109/TASL.2013.2260154.
Chazan, Dan, and Ron Hoory. 2006. Feature-domain concatenative speech synthesis. 7035791B2. U.S. patent, issued April 25, 2006. https://patents.google.com/patent/US7035791B2/en.
Coleman, Graham, and Jordi Bonada. 2008. “Sound Transformation by Descriptor Using an Analytic Domain.” In Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008, 7.
Coleman, Graham, Jordi Bonada, and Esteban Maestre. 2011. “Adding Dynamic Smoothing to Mixture Mosaicing Synthesis.”
Coleman, Graham Keith. 2015. “Descriptor Control of Sound Transformations and Mosaicing Synthesis.” http://repositori.upf.edu/handle/10230/27367.
Coleman, Graham, Esteban Maestre, Jordi Bonada, Esteban Maestre, and Jordi Bonada. 2010. “Augmenting Sound Mosaicing with Descriptor-Driven Transformation.” In Proceedings of DAFx-10, 4.
Collins, Nick, and Bob L. Sturm. 2011. “Sound Cross-Synthesis and Morphing Using Dictionary-Based Methods.” In International Computer Music Conference. http://vbn.aau.dk/files/77310007/dbmcrossynth.pdf.
Cont, Arshia, Shlomo Dubnov, and Gerard Assayag. 2007. “GUIDAGE: A Fast Audio Query Guided Assemblage.” In. ICMA. https://hal.inria.fr/hal-00839071/document.
Di Liscia, Oscar Pablo. 2013. “A Pure Data Toolkit for Real-Time Synthesis of ATS Spectral Data.” http://lac.linuxaudio.org/2013/papers/26.pdf.
Driedger, Jonathan, and Thomas Pratzlich. 2015. “Let It Bee – Towards NMF-Inspired Audio Mosaicing.” In Proceedings of ISMIR, 7. Malaga. http://ismir2015.uma.es/articles/13_Paper.pdf.
Dudley, Homer. 1955. “Fundamentals of Speech Synthesis.” Journal of the Audio Engineering Society 3 (4): 170–85. http://www.aes.org/e-lib/inst/browse.cfm?elib=9.
———. 1964. “Thirty Years of Vocoder Research.” The Journal of the Acoustical Society of America 36 (5): 1021. https://doi.org/10.1121/1.2143221.
Elbaz, Dan, and Michael Zibulevsky. 2017. “Perceptual Audio Loss Function for Deep Learning.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China. http://arxiv.org/abs/1708.05987.
Engel, Jesse, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. 2017. “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.” In PMLR. http://arxiv.org/abs/1704.01279.
Févotte, Cédric, Nancy Bertin, and Jean-Louis Durrieu. 2008. “Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis.” Neural Computation 21 (3): 793–830. https://doi.org/10.1162/neco.2008.04-08-771.
Godsill, Simon J, and Ali Taylan Cemgil. 2005. “Probabilistic Phase Vocoder and Its Application to Interpolation of Missing Values in Audio Signals.” In 2005 13th European Signal Processing Conference, 4. https://www.eurasip.org/Proceedings/Eusipco/Eusipco2005/defevent/papers/cr1319.pdf.
Goodwin, M., and M. Vetterli. 1997. “Atomic Decompositions of Audio Signals.” In 1997 IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, 1997. https://doi.org/10.1109/ASPAA.1997.625601.
Hazel, Steven. 2001. “Soundmosaic.” Web Page.
Hoffman, Matt, and Perry R Cook. 2006. “Feature-Based Synthesis: A Tool for Evaluating, Designing, and Interacting with Music IR Systems.” In Proceedings of ISMIR, 2. http://soundlab.cs.princeton.edu/publications/2006_ismir_fbs.pdf.
———. n.d. “Feature-Based Synthesis: Mapping Acoustic and Perceptual Features onto Synthesis Parameters,” 4.
Hoffman, Matt, and Perry R. Cook. 2007. “Real-Time Feature-Based Synthesis for Live Musical Performance.” In Proceedings of the 7th International Conference on New Interfaces for Musical Expression, 309. New York, New York: ACM Press. https://doi.org/10.1145/1279740.1279807.
Hoffman, Matthew D, David M Blei, and Perry R Cook. 2010. “Bayesian Nonparametric Matrix Factorization for Recorded Music.” In International Conference on Machine Learning, 8. http://soundlab.cs.princeton.edu/publications/2010_icml_gapnmf.pdf.
Hohmann, V. 2002. “Frequency Analysis and Synthesis Using a Gammatone Filterbank.” Acta Acustica United with Acustica 88 (3): 433–42.
Kersten, S., and P. Purwins. 2012. “Fire Texture Sound Re-Synthesis Using Sparse Decomposition and Noise Modelling.” In International Conference on Digital Audio Effects (DAFx12). http://mtg.upf.edu/node/2553.
Lazier, Ari, and Perry Cook. 2003. “Mosievius: Feature Driven Interactive Audio Mosaicing,” 6.
Masri, Paul, Andrew Bateman, and Nishan Canagarajah. 1997a. “A Review of Time–Frequency Representations, with Application to Sound/Music Analysis–Resynthesis.” Organised Sound 2 (03): 193–205. https://doi.org/10.1017/S1355771898009042.
———. 1997b. “The Importance of the Time–Frequency Representation for Sound/Music Analysis–Resynthesis.” Organised Sound 2 (03): 207–14. https://doi.org/10.1017/S1355771898009054.
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In Proceedings of International Conference on Learning Representations (ICLR) 2017. http://arxiv.org/abs/1612.07837.
Mor, Noam, Lior Wolf, Adam Polyak, and Yaniv Taigman. 2018. “A Universal Music Translation Network,” May. http://arxiv.org/abs/1805.07848.
Morise, Masanori. 2016. “D4C, a Band-Aperiodicity Estimator for High-Quality Speech Synthesis.” Speech Commun. 84 (C): 57–65. https://doi.org/10.1016/j.specom.2016.09.001.
Morise, Masanori, Fumiya Yokomori, and Kenji Ozawa. 2016. “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications.” IEICE Transactions on Information and Systems E99.D (7): 1877–84. https://doi.org/10.1587/transinf.2015EDP7457.
Müller, M., D. P. W. Ellis, A. Klapuri, and G. Richard. 2011. “Signal Processing for Music Analysis.” IEEE Journal of Selected Topics in Signal Processing 5 (6): 1088–1110. https://doi.org/10.1109/JSTSP.2011.2112333.
O’Leary, Seán, and Axel Röbel. 2016. “A Montage Approach to Sound Texture Synthesis.” IEEE/ACM Trans. Audio, Speech and Lang. Proc. 24 (6): 1094–1105. http://dl.acm.org/citation.cfm?id=2992726.2992735.
Ó Nuanáin, Cárthach, Sergi Jordà Puig, and Perfecto Herrera Boyer. 2016. “An Interactive Software Instrument for Real-Time Rhythmic Concatenative Synthesis.” http://repositori.upf.edu/handle/10230/32951.
Pascual, Santiago, Joan Serrà, and Antonio Bonafonte. 2019. “Towards Generalized Speech Enhancement with Generative Adversarial Networks,” April. http://arxiv.org/abs/1904.03418.
Salamon, Justin, Joan Serrà, and Emilia Gómez. 2013. “Tonal Representations for Music Retrieval: From Version Identification to Query-by-Humming.” International Journal of Multimedia Information Retrieval 2 (1): 45–58. https://doi.org/10.1007/s13735-012-0026-0.
Sarroff, Andy M., and Michael Casey. 2014. “Musical Audio Synthesis Using Autoencoding Neural Nets.” In. Ann Arbor, MI: Michigan Publishing, University of Michigan Library. http://www.smc-conference.org/smc-icmc-2014/papers/images/VOL_2/1411.pdf.
Scholler, S., and H. Purwins. 2011. “Sparse Approximations for Drum Sound Classification.” IEEE Journal of Selected Topics in Signal Processing 5 (5): 933–40. https://doi.org/10.1109/JSTSP.2011.2161264.
Schwarz, Diemo. 2011. “State of the Art in Sound Texture Synthesis.” In Proceedings of DAFx-11, 221–31. http://recherche.ircam.fr/pub/dafx11/Papers/30_e.pdf.
———. 2005. “Current Research in Concatenative Sound Synthesis.” In International Computer Music Conference (ICMC), 1–1. Barcelona, Spain. https://hal.archives-ouvertes.fr/hal-01161337.
Serra, Xavier, and Julius Smith. 1990. “Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition.” Computer Music Journal 14 (4): 12–24. https://doi.org/10.2307/3680788.
Simon, Ian, Sumit Basu, David Salesin, and Maneesh Agrawala. 2005. “Audio Analogies: Creating New Music from an Existing Performance by Concatenative Synthesis.” In Proceedings of the 2005 International Computer Music Conference, 65–72. http://research.microsoft.com/en-us/um/redmond/groups/cue/compmusic/audioanalogies_icmc2005.pdf.
Smaragdis, P., and J. C. Brown. 2003. “Non-Negative Matrix Factorization for Polyphonic Music Transcription.” In Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on., 177–80. https://doi.org/10.1109/ASPAA.2003.1285860.
Sturm, Bob L., Laurent Daudet, and Curtis Roads. 2006. “Pitch-Shifting Audio Signals Using Sparse Atomic Approximations.” In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, 45–52. AMCMM ’06. New York, NY, USA: ACM. https://doi.org/10.1145/1178723.1178730.
Sturm, Bob L., Curtis Roads, Aaron McLeran, and John J. Shynk. 2009. “Analysis, Visualization, and Transformation of Audio Signals Using Dictionary-Based Methods.” Journal of New Music Research 38 (4): 325–41. https://doi.org/10.1080/09298210903171178.
Su, Shih-Yang, Cheng-Kai Chiu, Li Su, and Yi-Hsuan Yang. 2017. “Automatic Conversion of Pop Music into Chiptunes for 8-Bit Pixel Art.” In. https://lemonatsu.github.io/pdf/su17icassp.pdf.
Tenenbaum, J. B., and W. T. Freeman. 2000. “Separating Style and Content with Bilinear Models.” Neural Computation 12 (6): 1247–83. https://doi.org/10.1162/089976600300015349.
Turner, Richard E., and Maneesh Sahani. 2014. “Time-Frequency Analysis as Probabilistic Inference.” IEEE Transactions on Signal Processing 62 (23): 6171–83. https://doi.org/10.1109/TSP.2014.2362100.
Vasquez, Sean, and Mike Lewis. 2019. “MelNet: A Generative Model for Audio in the Frequency Domain,” June. http://arxiv.org/abs/1906.01083.
Verhelst, Werner, and Marc Roelands. 1993. “An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech.” In Proceedings of ICASSP, 554–57. ICASSP’93. Washington, DC, USA: IEEE Computer Society. https://doi.org/10.1109/ICASSP.1993.319366.
Verma, Prateek, and Julius O. Smith. 2018. “Neural Style Transfer for Audio Spectograms.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). http://arxiv.org/abs/1801.01589.
Verma, T. S., and T. H. Y. Meng. 1998. “An Analysis/Synthesis Tool for Transient Signals That Allows a Flexible Sines+transients+noise Model for Audio.” In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181), 6:3573–6. Seattle, WA, USA: IEEE. https://doi.org/10.1109/ICASSP.1998.679647.
———. 1999. “Sinusoidal Modeling Using Frame-Based Perceptually Weighted Matching Pursuits.” In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), 981–84 vol.2. Phoenix, AZ, USA: IEEE. https://doi.org/10.1109/ICASSP.1999.759861.
Vincent, E., N. Bertin, and R. Badeau. 2008. “Harmonic and Inharmonic Nonnegative Matrix Factorization for Polyphonic Pitch Transcription.” In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 109–12. https://doi.org/10.1109/ICASSP.2008.4517558.
Virtanen, T. 2007. “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria.” IEEE Transactions on Audio, Speech, and Language Processing 15 (3): 1066–74. https://doi.org/10.1109/TASL.2006.885253.
Wager, S., L. Chen, M. Kim, and C. Raphael. 2017. “Towards Expressive Instrument Synthesis Through Smooth Frame-by-Frame Reconstruction: From String to Woodwind.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 391–95. https://doi.org/10.1109/ICASSP.2017.7952184.
Wyse, L. 2017. “Audio Spectrogram Representations for Processing with Convolutional Neural Networks.” In Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [cs.NE]). http://arxiv.org/abs/1706.09559.
Zhou, Cong, Michael Horgan, Vivek Kumar, Cristina Vasco, and Dan Darcy. 2018. “Voice Conversion with Conditional SampleRNN,” August. http://arxiv.org/abs/1808.08311.
Zils, A, and F Pachet. 2001. “Musical Mosaicing.” In Proceedings of DAFx-01, 2:135. Limerick, Ireland. http://csl.sony.fr/downloads/papers/2001/zils-dafx2001.pdf.