Generative stochastic models for audio. Analyse audio using machine listening methods to decompose it into features, perhaps in a sparse basis (as in learning gamelan), possibly of low dimension thanks to that sparsification, and with some stochastic dependence between the features, e.g. a random field or a regression model of some kind. Then simulate new features from that stochastic model. Depending on what your cost function was, how well your model fit, and how you smoothed your data, this might produce something acoustically indistinguishable from the source, or amount to concatenative synthesis from a sparse dictionary, or yield a parametric software synthesizer.
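To make that pipeline concrete, a minimal sketch in Python. Every concrete choice here is an illustrative assumption, not a recommendation: STFT log-magnitudes stand in for the features, a frame-wise Gaussian AR(1) regression for the stochastic model, and Griffin-Lim for phaseless resynthesis.

```python
# Sketch of the decompose -> model -> simulate loop. All modelling
# choices are placeholders; a real system would use a sparser basis
# and a richer conditional model.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))  # any mono signal will do

# 1. Analyse: decompose into (dense, not yet sparse) spectral features.
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
X = np.log1p(S).T                            # frames x bins

# 2. Fit a crude stochastic model: x_{t+1} ~ A x_t + noise, by least
#    squares. Wildly underdetermined here; sketch only.
A, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
resid = X[1:] - X[:-1] @ A
sigma = resid.std(axis=0)

# 3. Simulate a new feature trajectory from the fitted model.
#    (The AR(1) may be unstable; again, sketch only.)
rng = np.random.default_rng(0)
frames = [X[0]]
for _ in range(X.shape[0] - 1):
    frames.append(frames[-1] @ A + rng.normal(0, sigma))
S_hat = np.expm1(np.clip(np.array(frames).T, 0, None))

# 4. Resynthesize: invert magnitudes back to audio, phase via Griffin-Lim.
y_hat = librosa.griffinlim(S_hat, hop_length=256)
```

With a model this crude the output will be a spectrally plausible but incoherent wash; the interesting work is all in how you do steps 1 and 2.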
There is a lot of funny business in machine learning for polyphonic audio. For a start, a naive linear-algebra-style decomposition doesn't perform well, because human acoustic perception is messy. For example, all white noise sounds the same to us, yet a deterministic model needs a large basis to approximate any given realization of it closely in norm. Our phase sensitivity is frequency-dependent. Adjacent frequencies mask each other. And there are many other effects I don't know about. One could use cost functions based on psychoacoustic cochlear models, but those are tricky to synthesize from (although possible, if perhaps unsatisfying, with a neural network). There are also classic psychoacoustic decompositions such as mel-frequency cepstral coefficients (MFCCs), but these are even harder to invert.
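The inversion problem is easy to demonstrate. librosa ships an approximate MFCC inverse, which undoes the DCT, undoes the mel filterbank, and then estimates phase with Griffin-Lim; each stage loses information, so the round trip is audibly smeared:

```python
# MFCCs are easy to compute, but going back to audio is only a
# pseudo-inverse: phase and fine spectral structure are gone.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# Approximate inverse: invert the DCT, then the mel filterbank,
# then reconstruct phase by Griffin-Lim. Compare y_hat with y by ear.
y_hat = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
```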
I'm publishing in this area.
The first step might be to find some model which can approximately capture both the cyclic and the disordered components of the signal. Indeed, Metamorph and smstools, based on a “sinusoids+noise” model, do this kind of decomposition, but they mostly use it for resynthesis in limited ways, not for re-simulating from a distribution over possible stochastic processes. Or am I missing something? There is also an implementation in Csound called ATS which looks worth playing with.
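Without wiring up those tools here, a cheap stand-in for the sinusoids+noise idea using librosa's harmonic/percussive split: treat the harmonic layer as the deterministic part, treat the rest as the stochastic part, and "re-simulate" the noise by drawing fresh random phases under a strong stationarity assumption. This is not what Metamorph, smstools or ATS actually do, just an approximation of the same decomposition:

```python
# Crude sinusoids+noise via harmonic/percussive source separation.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))

# Split into a quasi-sinusoidal (harmonic) layer and a transient/noisy rest.
y_sines, y_noise = librosa.effects.hpss(y)

# "Re-simulate" the noise layer: keep its magnitude spectrogram,
# draw fresh uniform random phases. Assumes the noise is stationary
# enough that phase carries no structure we care about.
rng = np.random.default_rng(0)
N = librosa.stft(y_noise)
N_fresh = np.abs(N) * np.exp(2j * np.pi * rng.random(N.shape))
y_out = y_sines + librosa.istft(N_fresh, length=len(y))
```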
Some non-parametric conditional wavelet density sounds more fun to me, maybe as a Markov random field, although exactly which generative model I would fit here is still opaque to me. The sequence probably possesses structure at multiple scales, and there is evidence that music might have a recursive grammatical structure, which would be hard to learn even if we had a perfect decomposition.
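A toy version of the wavelet idea, to fix intuitions: quantize each wavelet subband (via PyWavelets) into a small alphabet, fit a first-order Markov chain over the symbols, sample fresh coefficient sequences, and reconstruct. A real model would need the cross-scale dependence this ignores, which is exactly where the random-field machinery would come in:

```python
# Markov chains over quantized wavelet coefficients, per subband.
import numpy as np
import pywt

rng = np.random.default_rng(0)
y = rng.standard_normal(4096)          # stand-in for a real signal
coeffs = pywt.wavedec(y, "db4", level=5)

new_coeffs = []
for c in coeffs:
    # Quantize the subband into an 8-symbol alphabet by quantiles.
    edges = np.quantile(c, np.linspace(0, 1, 9)[1:-1])
    sym = np.digitize(c, edges)        # symbols 0..7
    centers = [c[sym == k].mean() if np.any(sym == k) else 0.0
               for k in range(8)]

    # Fit first-order transition probabilities with add-one smoothing.
    T = np.ones((8, 8))
    for a, b in zip(sym[:-1], sym[1:]):
        T[a, b] += 1
    T /= T.sum(axis=1, keepdims=True)

    # Sample a fresh symbol sequence, map symbols back to values.
    s = [rng.integers(8)]
    for _ in range(len(c) - 1):
        s.append(rng.choice(8, p=T[s[-1]]))
    new_coeffs.append(np.array([centers[k] for k in s]))

y_new = pywt.waverec(new_coeffs, "db4")
```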
What is Loris?
- OlRö16: Seán O’Leary, Axel Röbel (2016) A Montage Approach to Sound Texture Synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(6), 1094–1105.
- BlBo17: Merlijn Blaauw, Jordi Bonada (2017) A neural parametric singing synthesizer. ArXiv:1704.03809 [Cs].
- Dili13: Oscar Pablo Di Liscia (2013) A Pure Data toolkit for real-time synthesis of ATS spectral data.
- MaBC97a: Paul Masri, Andrew Bateman, Nishan Canagarajah (1997a) A review of time–frequency representations, with application to sound/music analysis–resynthesis. Organised Sound, 2(3), 193–205. DOI
- VeMe98: T.S. Verma, T.H.Y. Meng (1998) An analysis/synthesis tool for transient signals that allows a flexible sines+transients+noise model for audio. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181) (Vol. 6, pp. 3573–3576). Seattle, WA, USA: IEEE DOI
- GoVe97: M. Goodwin, M. Vetterli (1997) Atomic decompositions of audio signals. In 1997 IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, 1997. DOI
- SBSA05: Ian Simon, Sumit Basu, David Salesin, Maneesh Agrawala (2005) Audio analogies: Creating new music from an existing performance by concatenative synthesis. In Proceedings of the 2005 International Computer Music Conference (pp. 65–72).
- Wyse17: L. Wyse (2017) Audio Spectrogram Representations for Processing with Convolutional Neural Networks. In Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [cs.NE]).
- SCSY17: Shih-Yang Su, Cheng-Kai Chiu, Li Su, Yi-Hsuan Yang (2017) Automatic conversion of pop music into chiptunes for 8-bit pixel art.
- Boye11: Graham Boyes (2011) Dictionary-based analysis/synthesis and structured representations of musical audio. McGill University
- KePu12: S. Kersten, P. Purwins (2012) Fire Texture Sound Re-Synthesis Using Sparse Decomposition and Noise Modelling. In International Conference on Digital Audio Effects (DAFx12).
- Hohm02: V. Hohmann (2002) Frequency analysis and synthesis using a Gammatone filterbank. Acta Acustica United with Acustica, 88(3), 433–442.
- CoDA07: Arshia Cont, Shlomo Dubnov, Gerard Assayag (2007) GUIDAGE: A fast audio query guided assemblage. ICMA
- SaCa14: Andy M. Sarroff, Michael Casey (2014) Musical audio synthesis using autoencoding neural nets. Ann Arbor, MI: Michigan Publishing, University of Michigan Library
- ZiPa01: A Zils, F Pachet (2001) Musical mosaicing. In Proceedings of DAFx-01 (Vol. 2, p. 135). Limerick, Ireland
- ERRD17: Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi (2017) Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. In Proceedings of the 34th International Conference on Machine Learning (ICML), PMLR.
- ElZi17: Dan Elbaz, Michael Zibulevsky (2017) Perceptual audio loss function for deep learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
- MKGK17: Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, … Yoshua Bengio (2017) SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. In Proceedings of International Conference on Learning Representations (ICLR) 2017.
- ScPu11: S. Scholler, H. Purwins (2011) Sparse Approximations for Drum Sound Classification. IEEE Journal of Selected Topics in Signal Processing, 5(5), 933–940. DOI
- SeSm90: Xavier Serra, Julius Smith (1990) Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4), 12–24. DOI
- Schw11: Diemo Schwarz (2011) State of the art in sound texture synthesis. In Proceedings of DAFx-11 (pp. 221–231).
- MaBC97b: Paul Masri, Andrew Bateman, Nishan Canagarajah (1997b) The importance of the time–frequency representation for sound/music analysis–resynthesis. Organised Sound, 2(3), 207–214. DOI
- SaSG13: Justin Salamon, Joan Serrà, Emilia Gómez (2013) Tonal representations for music retrieval: from version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1), 45–58. DOI
- ZHKV18: Cong Zhou, Michael Horgan, Vivek Kumar, Cristina Vasco, Dan Darcy (2018) Voice Conversion with Conditional SampleRNN. ArXiv:1808.08311 [Cs, Eess].