Generative stochastic models for audio. Analyse audio using machine listening methods to decompose it into features, maybe over a sparse basis, as in learning gamelan, and possibly of low dimension thanks to some sparsification. Then fit a model of the stochastic dependence between those features, e.g. a random field or a regression model of some kind, and simulate features from that stochastic model. Depending on what your cost function was, how good your model fit was and how you smoothed your data, this might produce something acoustically indistinguishable from the source, or amount to concatenative synthesis from a sparse basis dictionary, or give you a parametric software synthesizer.
There is a lot of funny business with machine learning for polyphonic audio. For a start, a naive linear-algebra-style decomposition does not perform well, because human acoustic perception is messy. For example, all white noise sounds the same to us, but a deterministic model needs a large basis to approximate any particular realisation of it closely in the \(L_2\) norm. Our phase sensitivity is frequency dependent. Adjacent frequencies mask each other. There are many other things I don’t know about. One could use cost functions based on psychoacoustic cochlear models, but those are tricky to synthesize from (although it is possible, if perhaps unsatisfying, with a neural network). There are also classic alternative psychoacoustic decompositions such as the Mel Frequency Cepstral Transform, but these are even harder to invert.
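To make the white noise point concrete, here is a minimal NumPy sketch. The sample count, the 16-band averaging and the helper name band_power are my own illustrative assumptions, not a psychoacoustic model. It draws two independent noise realisations, which are about as far apart as possible in the waveform-domain \(L_2\) norm, and compares their coarsely averaged power spectra, which are nearly identical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1 << 14  # number of samples (arbitrary choice for this sketch)

# Two independent realisations of unit-variance Gaussian white noise.
x = rng.standard_normal(n)
y = rng.standard_normal(n)

# Relative L2 distance between the waveforms: close to sqrt(2),
# i.e. the two signals look unrelated sample by sample.
waveform_err = np.linalg.norm(x - y) / np.linalg.norm(x)

def band_power(signal, n_bands=16):
    """Mean periodogram power in n_bands equal-width frequency bands."""
    power = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    power = power[1:]  # drop the DC bin
    usable = (len(power) // n_bands) * n_bands
    return power[:usable].reshape(n_bands, -1).mean(axis=1)

# Relative L2 distance between the coarse power spectra: small,
# because both realisations share the same flat spectral envelope.
px, py = band_power(x), band_power(y)
spectral_err = np.linalg.norm(px - py) / np.linalg.norm(px)

print(f"waveform relative error: {waveform_err:.3f}")
print(f"band-power relative error: {spectral_err:.3f}")
```

Equal-width bands are a crude stand-in for an auditory filterbank, but even so a cost function on such summary statistics treats the two realisations as nearly the same sound, while a sample-wise cost does not.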
I’m publishing in this area.
Non-negative matrix factorisation approaches
Authors such as Smaragdis & Brown (2003), Virtanen (2007), Févotte, Bertin, & Durrieu (2008), Vincent, Bertin, & Badeau (2008) and Bertin, Badeau, & Vincent (2010) popularised using non-negative matrix factorisations to identify the “activations” of power spectrograms for music analysis. It didn’t take long for this to be applied to resynthesis tasks, by e.g. Aarabi & Peeters (2018), Buch, Quinton, & Sturm (2017), Driedger & Pratzlich (2015) and Hoffman, Blei, & Cook (2010).
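For orientation, a minimal sketch of the basic factorisation, assuming the textbook multiplicative updates for the Euclidean cost. The papers above use other divergences (e.g. Itakura-Saito) and extra priors, so this is not any one author's method. The function names nmf and relative_error and the synthetic "spectrogram" are hypothetical stand-ins for a real power spectrogram.

```python
import numpy as np

def nmf(V, rank, n_iter=300, eps=1e-9, seed=0):
    """Factorise non-negative V (freq x frames) as W @ H.

    W holds spectral templates (columns), H holds their activations
    over time (rows). Uses multiplicative updates for the Euclidean cost.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplying by non-negative ratios keeps W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(V, W, H):
    """Relative Frobenius-norm reconstruction error."""
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Toy stand-in for a power spectrogram: 20 frequency bins, 50 frames,
# built from 3 non-negative templates and activations.
rng = np.random.default_rng(42)
V = rng.random((20, 3)) @ rng.random((3, 50))

W1, H1 = nmf(V, rank=3, n_iter=1)
W, H = nmf(V, rank=3, n_iter=300)
err_start = relative_error(V, W1, H1)
err_final = relative_error(V, W, H)
print(f"relative error after 1 iteration: {err_start:.3f}")
print(f"relative error after 300 iterations: {err_final:.3f}")
```

For resynthesis one would then manipulate or simulate the activations H and map W @ H back to audio, which needs a phase estimate that this sketch does not attempt.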
A first step might be to find some model which can approximately capture the cyclic and disordered components of the signal. Indeed Metamorph and smstools, based on a “sinusoids+noise” model, do this kind of decomposition, but they mostly use it for resynthesis in limited ways, not for simulating realisations from the fitted model as a possibly stochastic process. There is an implementation in Csound called ATS which looks interesting.
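A toy version of the sinusoids+noise idea, to fix what "simulating realisations from the fitted model" could mean. The assumptions are loud ones: a single stationary partial whose frequency sits exactly on an FFT bin, and a white Gaussian residual. Real systems track many time-varying partials and fit a spectral envelope to the residual. All variable names and parameter values are illustrative, and none of this is the Metamorph, smstools or ATS API.

```python
import numpy as np

rng = np.random.default_rng(1)
sr = 8000           # sample rate in Hz (assumed)
n = 4096            # number of samples
t = np.arange(n) / sr

# Synthetic "recording": one partial on an exact FFT bin plus white noise.
f0 = 250 * sr / n   # 488.28125 Hz, chosen to avoid spectral leakage
signal = 0.8 * np.cos(2 * np.pi * f0 * t + 0.3) + 0.1 * rng.standard_normal(n)

# Analysis, deterministic part: read frequency, amplitude and phase
# off the strongest non-DC bin of the spectrum.
spectrum = np.fft.rfft(signal)
k = int(np.argmax(np.abs(spectrum[1:]))) + 1
freq = k * sr / n
amp = 2 * np.abs(spectrum[k]) / n
phase = np.angle(spectrum[k])
sinusoid = amp * np.cos(2 * np.pi * freq * t + phase)

# Analysis, stochastic part: model the residual as white Gaussian noise.
residual = signal - sinusoid
noise_std = residual.std()

# Simulation: keep the fitted sinusoid, draw a fresh noise realisation.
simulated = sinusoid + noise_std * rng.standard_normal(n)

print(f"freq={freq:.2f} Hz, amp={amp:.3f}, phase={phase:.3f}, noise_std={noise_std:.3f}")
```

The simulated signal differs from the original sample by sample, but shares its fitted parameters, which is the sense in which it is a new realisation of the same model.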
Some non-parametric conditional wavelet density sounds more fun to me, maybe as a Markov random field, although what exact generative model I would fit here is still opaque. The sequence probably possesses structure at multiple scales, and there is evidence that music might have a recursive grammatical structure which would be hard to learn even if we had a perfect decomposition.
What is Loris?
- AaPe18: Aarabi, H. F., & Peeters, G. (2018) Music Retiler: Using NMF2D Source Separation for Audio Mosaicing. In Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion (pp. 27:1–27:7). New York, NY, USA: ACM
- Anon18a: Anonymous. (2018a) Autoencoder-based Music Translation.
- Anon18b: Anonymous. (2018b) Modulated variational auto-encoders for many-to-many musical timbre transfer.
- Anon18c: Anonymous. (2018c) Synthnet: Learning synthesizers end-to-end.
- BeBV10: Bertin, N., Badeau, R., & Vincent, E. (2010) Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18(3), 538–549.
- BlBo17: Blaauw, M., & Bonada, J. (2017) A neural parametric singing synthesizer. ArXiv:1704.03809 [Cs].
- Boye11: Boyes, G. (2011) Dictionary-based analysis/synthesis and structured representations of musical audio. McGill University
- BuQS17: Buch, M., Quinton, E., & Sturm, B. L. (2017) Nicht-negativeMatrixFaktorisierungnutzendesKlangsynthesenSystem (NiMFKS): Extensions of NMF-based concatenative sound synthesis. In Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017 (p. 7). Edinburgh
- CaRo13: Caetano, M., & Rodet, X. (2013) Musical Instrument Sound Morphing Guided by Perceptually Motivated Features. IEEE Transactions on Audio, Speech, and Language Processing, 21(8), 1666–1675.
- ChHo06: Chazan, D., & Hoory, R. (2006, April 25) Feature-domain concatenative speech synthesis.
- CoBo08: Coleman, G., & Bonada, J. (2008) Sound transformation by descriptor using an analytic domain. In Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008 (p. 7).
- CoBM11: Coleman, G., Bonada, J., & Maestre, E. (2011) Adding dynamic smoothing to mixture mosaicing synthesis.
- CMBM10: Coleman, G., Maestre, E., & Bonada, J. (2010) Augmenting sound mosaicing with descriptor-driven transformation. In Proceedings of DAFx-10 (p. 4).
- CoSt11: Collins, N., & Sturm, B. L. (2011) Sound cross-synthesis and morphing using dictionary-based methods. In International Computer Music Conference.
- CoDA07: Cont, A., Dubnov, S., & Assayag, G. (2007) GUIDAGE: A fast audio query guided assemblage. Presented at the Proceedings of International Computer Music Conference (ICMC), ICMA
- Dili13: Di Liscia, O. P. (2013) A Pure Data toolkit for real-time synthesis of ATS spectral data.
- DrPr15: Driedger, J., & Pratzlich, T. (2015) Let It Bee – Towards NMF-Inspired Audio Mosaicing. In Proceedings of ISMIR (p. 7). Malaga
- Dudl55: Dudley, H. (1955) Fundamentals of Speech Synthesis. Journal of the Audio Engineering Society, 3(4), 170–185.
- Dudl64: Dudley, H. (1964) Thirty Years of Vocoder Research. The Journal of the Acoustical Society of America, 36(5), 1021–1021.
- ElZi17: Elbaz, D., & Zibulevsky, M. (2017) Perceptual audio loss function for deep learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
- ERRD17: Engel, J., Resnick, C., Roberts, A., Dieleman, S., Eck, D., Simonyan, K., & Norouzi, M. (2017) Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. In PMLR.
- FéBD08: Févotte, C., Bertin, N., & Durrieu, J.-L. (2008) Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis. Neural Computation, 21(3), 793–830.
- GoVe97: Goodwin, M., & Vetterli, M. (1997) Atomic decompositions of audio signals. In 1997 IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, 1997.
- Haze01: Hazel, S. (2001) Soundmosaic. Web Page.
- HoCo06: Hoffman, M., & Cook, P. R. (2006) Feature-Based Synthesis: A Tool for Evaluating, Designing, and Interacting with Music IR Systems. In Proceedings of the 7th International Symposium on Music Information Retrieval (p. 2).
- HoCo07: Hoffman, M., & Cook, P. R. (2007) Real-time feature-based synthesis for live musical performance. In Proceedings of the 7th international conference on New interfaces for musical expression - NIME ’07 (p. 309). New York, New York: ACM Press
- HoCo00: Hoffman, M., & Cook, P. R. (n.d.) Feature-Based Synthesis: Mapping Acoustic and Perceptual Features onto Synthesis Parameters.
- HoBC10: Hoffman, M. D., Blei, D. M., & Cook, P. R. (2010) Bayesian Nonparametric Matrix Factorization for Recorded Music. In International Conference on Machine Learning (p. 8).
- Hohm02: Hohmann, V. (2002) Frequency analysis and synthesis using a Gammatone filterbank. Acta Acustica United with Acustica, 88(3), 433–442.
- KePu12: Kersten, S., & Purwins, H. (2012) Fire Texture Sound Re-Synthesis Using Sparse Decomposition and Noise Modelling. In International Conference on Digital Audio Effects (DAFx12).
- LaCo03: Lazier, A., & Cook, P. (2003) Mosievius: feature driven interactive audio mosaicing.
- MaBC97a: Masri, P., Bateman, A., & Canagarajah, N. (1997a) A review of time–frequency representations, with application to sound/music analysis–resynthesis. Organised Sound, 2(03), 193–205.
- MaBC97b: Masri, P., Bateman, A., & Canagarajah, N. (1997b) The importance of the time–frequency representation for sound/music analysis–resynthesis. Organised Sound, 2(03), 207–214.
- MKGK17: Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., … Bengio, Y. (2017) SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. In Proceedings of International Conference on Learning Representations (ICLR) 2017.
- MWPT18: Mor, N., Wolf, L., Polyak, A., & Taigman, Y. (2018) A universal music translation network. ArXiv:1805.07848 [Cs, Stat].
- Mori16: Morise, M. (2016) D4C, a Band-aperiodicity Estimator for High-quality Speech Synthesis. Speech Commun., 84(C), 57–65.
- MoYO16: Morise, M., Yokomori, F., & Ozawa, K. (2016) WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7), 1877–1884.
- MEKR11: Müller, M., Ellis, D. P. W., Klapuri, A., & Richard, G. (2011) Signal Processing for Music Analysis. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1088–1110.
- ÓnJH16: Ó Nuanáin, C., Jordà Puig, S., & Herrera Boyer, P. (2016) An interactive software instrument for real-time rhythmic concatenative synthesis.
- OlRö16: O’Leary, S., & Röbel, A. (2016) A Montage Approach to Sound Texture Synthesis. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 24(6), 1094–1105.
- SaSG13: Salamon, J., Serrà, J., & Gómez, E. (2013) Tonal representations for music retrieval: from version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1), 45–58.
- SaCa14: Sarroff, A. M., & Casey, M. (2014) Musical audio synthesis using autoencoding neural nets. Ann Arbor, MI: Michigan Publishing, University of Michigan Library
- ScPu11: Scholler, S., & Purwins, H. (2011) Sparse Approximations for Drum Sound Classification. IEEE Journal of Selected Topics in Signal Processing, 5(5), 933–940.
- Schw05: Schwarz, D. (2005) Current Research in Concatenative Sound Synthesis. In International Computer Music Conference (ICMC) (pp. 1–1). Barcelona, Spain
- Schw11: Schwarz, D. (2011) State of the art in sound texture synthesis. In Proceedings of DAFx-11 (pp. 221–231).
- SeSm90: Serra, X., & Smith, J. (1990) Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4), 12–24.
- SBSA05: Simon, I., Basu, S., Salesin, D., & Agrawala, M. (2005) Audio analogies: Creating new music from an existing performance by concatenative synthesis. In Proceedings of the 2005 International Computer Music Conference (pp. 65–72).
- SmBr03: Smaragdis, P., & Brown, J. C. (2003) Non-negative matrix factorization for polyphonic music transcription. In Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on. (pp. 177–180).
- Stur09: Sturm, B. L. (2009) Sparse approximation and atomic decomposition: considering atom interactions in evaluating and building signal representations (PhD thesis). University of California, Santa Barbara, CA
- SRMS09: Sturm, B. L., Roads, C., McLeran, A., & Shynk, J. J. (2009) Analysis, visualization, and transformation of audio signals using dictionary-based methods. Journal of New Music Research, 38(4), 325–341.
- SCSY17: Su, S.-Y., Chiu, C.-K., Su, L., & Yang, Y.-H. (2017) Automatic conversion of pop music into chiptunes for 8-bit pixel art.
- VeSm18: Verma, P., & Smith, J. O. (2018) Neural style transfer for audio spectograms. In 31st Conference on Neural Information Processing Systems (NIPS 2017).
- VeMe98: Verma, T. S., & Meng, T. H. Y. (1998) An analysis/synthesis tool for transient signals that allows a flexible sines+transients+noise model for audio. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181) (Vol. 6, pp. 3573–3576). Seattle, WA, USA: IEEE
- VeMe99: Verma, T. S., & Meng, T. H. Y. (1999) Sinusoidal modeling using frame-based perceptually weighted matching pursuits. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258) (pp. 981–984 vol.2). Phoenix, AZ, USA: IEEE
- ViBB08: Vincent, E., Bertin, N., & Badeau, R. (2008) Harmonic and inharmonic Nonnegative Matrix Factorization for Polyphonic Pitch transcription. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 109–112).
- Virt07: Virtanen, T. (2007) Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 1066–1074.
- WCKR17: Wager, S., Chen, L., Kim, M., & Raphael, C. (2017) Towards expressive instrument synthesis through smooth frame-by-frame reconstruction: From string to woodwind. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 391–395).
- Wyse17: Wyse, L. (2017) Audio Spectrogram Representations for Processing with Convolutional Neural Networks. In Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [cs.NE]).
- ZHKV18: Zhou, C., Horgan, M., Kumar, V., Vasco, C., & Darcy, D. (2018) Voice Conversion with Conditional SampleRNN. ArXiv:1808.08311 [Cs, Eess].
- ZiPa01: Zils, A., & Pachet, F. (2001) Musical mosaicing. In Proceedings of DAFx-01 (Vol. 2, p. 135). Limerick, Ireland