
Analysis/resynthesis of audio

Generative stochastic models for audio. Analyse audio using machine listening methods to decompose it into features, perhaps over a sparse basis, as in learning gamelan, and possibly of low dimension thanks to some sparsification. Fit a model with some stochastic dependence between those features, e.g. a random field or a regression model of some kind. Then simulate features from that stochastic model. Depending on what your cost function was, how good your model fit was, and how you smoothed your data, the result might be something acoustically indistinguishable from the source, or a form of concatenative synthesis from a sparse basis dictionary, or a parametric synthesizer software package.

There is a lot of funny business with machine learning for polyphonic audio. For a start, a naive linear-algebra-style decomposition doesn’t perform well, because human acoustic perception is messy. For example, all white noise sounds the same to us, but deterministic models need a large basis to approximate any particular realisation closely in the \(L_2\) norm. Our phase sensitivity is frequency dependent. Adjacent frequencies mask each other. There are many other things I don’t know about. One could use cost functions based on psychoacoustic cochlear models, but those are tricky to synthesize from (although it is possible, if perhaps unsatisfying, with a neural network). There are also classic alternative psychoacoustic decompositions such as mel-frequency cepstral coefficients (MFCCs), but these are even harder to invert.
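The white-noise point can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the 48 kHz sample rate, the frame length and the helper names `relative_l2_error` and `mean_power_spectrum` are all choices made for this illustration. Two independent noise realisations, which would sound alike, are far apart as waveforms but much closer once summarised by an averaged power spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 48000  # one second at an assumed 48 kHz sample rate

# Two independent realisations of unit-variance Gaussian white noise.
a = rng.standard_normal(n)
b = rng.standard_normal(n)


def relative_l2_error(target, approx):
    """L2 distance between two arrays, relative to the norm of the target."""
    return np.linalg.norm(target - approx) / np.linalg.norm(target)


def mean_power_spectrum(x, frame=1024):
    """Average power spectrum over non-overlapping Hann-windowed frames."""
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    spectra = np.fft.rfft(frames * np.hanning(frame), axis=1)
    return np.mean(np.abs(spectra) ** 2, axis=0)


# Sample by sample, b is a terrible approximation of a.
waveform_error = relative_l2_error(a, b)

# Compared through a crude spectral summary, the two are much closer.
spectrum_error = relative_l2_error(mean_power_spectrum(a), mean_power_spectrum(b))

print(f"waveform relative L2 error: {waveform_error:.2f}")
print(f"averaged-spectrum relative L2 error: {spectrum_error:.2f}")
```

An averaged power spectrum is of course not a perceptual model; it only stands in here for any summary statistic that ignores the sample-level detail we cannot hear.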

Mosaicing synthesis

I’m publishing in this area.

More soon.

Non-negative matrix factorisation approaches

Authors such as Virtanen, 2007; Févotte, Bertin, & Durrieu, 2008; Vincent, Bertin, & Badeau, 2008; Smaragdis & Brown, 2003; and Bertin, Badeau, & Vincent, 2010 popularised using non-negative matrix factorisations to identify the “activations” of power spectrograms for music analysis. It didn’t take long for this to be applied to resynthesis tasks, e.g. by Aarabi & Peeters, 2018; Buch, Quinton, & Sturm, 2017 (source, site); Driedger & Pratzlich, 2015 (site); and Hoffman, Blei, & Cook, 2010.
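To make “activations” concrete, here is a minimal sketch of a non-negative matrix factorisation, assuming NumPy. It uses the Lee-Seung multiplicative updates for the generalised Kullback-Leibler divergence, which is only one of several cost functions in this literature and not necessarily the one any of the cited papers use. The function name `nmf_kl` and the toy matrix `V`, which stands in for a power spectrogram, are hypothetical.

```python
import numpy as np


def nmf_kl(V, rank, n_iter=300, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (freq x time) as W @ H.

    W holds spectral templates (freq x rank) and H their activations over
    time (rank x time). Lee-Seung multiplicative updates, generalised KL.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


# Toy "spectrogram": two spectral bumps switched on over overlapping spans.
freq_bins = np.arange(64)[:, None]
W_true = np.exp(-0.5 * ((freq_bins - np.array([12, 40])) / 2.0) ** 2)
time = np.arange(100)
H_true = np.stack([(time < 50).astype(float), (time >= 30).astype(float)])
V = W_true @ H_true

W, H = nmf_kl(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

For resynthesis, the rough idea is that the learned activations `H` can drive a different dictionary of templates than the one they were estimated with; inverting the resulting spectrogram back to audio needs a separate phase-reconstruction step that this sketch does not attempt.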


A first step might be to find some model which can approximately capture the cyclic and disordered components of the signal. Indeed, Metamorph and smstools, both based on a “sinusoids+noise” model, do this kind of decomposition, but they mostly use it for resynthesis in limited ways, not for simulating realisations of some possibly stochastic process from the fitted model. There is an implementation in Csound called ATS which looks interesting?
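As a toy illustration of the “sinusoids+noise” idea, and of simulating a new realisation from the fitted model, here is a minimal sketch assuming NumPy. It is not how Metamorph, smstools or ATS work internally: it handles a single stationary frame, picks frequencies at FFT bin centres, and assumes the residual is white Gaussian noise. All function names are hypothetical.

```python
import numpy as np


def sinusoid_basis(freqs, n, sr):
    """Cosine and sine columns for each frequency, shape (n, 2 * len(freqs))."""
    t = np.arange(n) / sr
    phase = 2 * np.pi * t[:, None] * np.asarray(freqs)[None, :]
    return np.hstack([np.cos(phase), np.sin(phase)])


def fit_sinusoids_plus_noise(x, sr, n_partials):
    """Estimate partial frequencies, their amplitudes, and the residual level."""
    n = len(x)
    mag = np.abs(np.fft.rfft(x * np.hanning(n)))
    # Local maxima of the magnitude spectrum are candidate partials.
    inner = mag[1:-1]
    is_peak = (inner > mag[:-2]) & (inner >= mag[2:])
    peaks = np.flatnonzero(is_peak) + 1
    strongest = peaks[np.argsort(mag[peaks])[-n_partials:]]
    freqs = np.sort(strongest) * sr / n
    # Least-squares fit of cosine/sine amplitudes at those frequencies.
    basis = sinusoid_basis(freqs, n, sr)
    coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
    residual = x - basis @ coeffs
    return freqs, coeffs, residual.std()


def simulate(freqs, coeffs, noise_std, n, sr, rng):
    """Draw a new realisation: fitted sinusoids plus fresh white noise."""
    deterministic = sinusoid_basis(freqs, n, sr) @ coeffs
    return deterministic + noise_std * rng.standard_normal(n)


rng = np.random.default_rng(1)
sr, n = 8000, 4000  # assumed sample rate and frame length
t = np.arange(n) / sr
x = (np.sin(2 * np.pi * 440 * t)
     + 0.5 * np.cos(2 * np.pi * 660 * t)
     + 0.1 * rng.standard_normal(n))

freqs, coeffs, noise_std = fit_sinusoids_plus_noise(x, sr, n_partials=2)
y = simulate(freqs, coeffs, noise_std, n, sr, rng)
print("estimated partials (Hz):", freqs)
```

The simulated `y` shares the fitted partials with `x` but has an independent noise component. A more realistic model would track time-varying partials and fit a coloured, rather than white, residual.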

Some non-parametric conditional wavelet density sounds more fun to me, maybe as a Markov random field, although exactly what generative model I would fit here is still opaque. The sequence probably possesses structure at multiple scales, and there is evidence that music might have a recursive grammatical structure, which would be hard to learn even if we had a perfect decomposition.


What is Loris?