Learning Gamelan

I feel a certain class of audio signal should be easy to decompose, and thence to learn, in a musically useful way: signals well approximated by LTI, nearly-linear, nearly-additive filterbanks with sparse activations. This is a very specialised class, but a musically useful one, and close enough to soluble that it might be worth the effort.

This is about online learning of sparse basis dictionaries for music: a specialised type of system identification, or some generalisation of “shift invariant sparse coding”.

It seems like this would boil down to something like sparse dictionary learning, with sparse activations and a dictionary that is itself sparse in LPC components.

There are two ways to do this: in the time domain, or in the frequency domain.

In the frequency domain, sparse time-domain activations are non-local in Fourier components, but possibly simple to recover.
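One reading of "simple to recover" can be sketched for the easiest case: a single, already-known component and circular convolution. The activations are then recovered by regularised spectral division. The names `deconvolve` and `reg`, and the toy signal, are illustrative assumptions; the blind, multi-component problem these notes are really about is harder than this.

```python
import numpy as np

T = 512
n = np.arange(T)

# Sparse activations: two strikes of different strength.
activation = np.zeros(T)
activation[[40, 300]] = [1.0, 0.6]

# One known component: a decaying sinusoid (toy stand-in for a struck bar).
component = np.exp(-n / 25.0) * np.sin(2 * np.pi * 0.07 * n)

# Convolution theorem: circular convolution is a product of spectra.
signal = np.fft.irfft(np.fft.rfft(activation) * np.fft.rfft(component), n=T)

def deconvolve(signal, component, reg=1e-8):
    """Recover activations by regularised division in the Fourier domain."""
    S = np.fft.rfft(signal)
    D = np.fft.rfft(component, n=len(signal))
    A = S * np.conj(D) / (np.abs(D) ** 2 + reg)
    return np.fft.irfft(A, n=len(signal))

recovered = deconvolve(signal, component)
```

Each strike spreads over every Fourier bin as a linear phase ramp, which is the non-locality; the division undoes it only because the component spectrum is known and has no near-zero bins here.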

In the time domain, one could solve the Yule-Walker equations (say, by Levinson-Durbin recursion), although we expect that to be unstable. We could go for sparse simultaneous kernel inference in the time domain, which might be better, or directly infer the Horner form. Then we have a lot of simultaneous filter components and tedious inference for them. Otherwise, we could do it directly in the FFT domain, although this makes MIMO harder and excludes the potential for non-linearities. The fact that I am expecting to identify many distinct systems in Fourier space as atoms complicates this slightly.
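The time-domain linear-prediction route can be sketched as follows, assuming the Yule-Walker normal equations are what is meant and that a single component is observed in isolation. The function name `lpc_yule_walker` and the toy two-pole resonator are illustrative, not part of any library.

```python
import numpy as np

def lpc_yule_walker(x, order):
    """Estimate LPC coefficients a so that x[n] ~ sum_i a[i-1] * x[n-i]."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation estimates r[0..order] (zero-padded, "autocorrelation method").
    r = np.array([np.dot(x[: len(x) - lag], x[lag:]) for lag in range(order + 1)])
    # Toeplitz system R a = r[1:], with R[i, j] = r[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], r[1:])

# Toy data: impulse response of one damped resonator (pole radius rho, angle omega).
rho, omega = 0.98, 0.3
a_true = np.array([2 * rho * np.cos(omega), -rho**2])
x = np.zeros(2000)
x[0], x[1] = 1.0, a_true[0]
for n in range(2, len(x)):
    x[n] = a_true[0] * x[n - 1] + a_true[1] * x[n - 2]

a_est = lpc_yule_walker(x, order=2)
```

This works on a clean, fully decayed single strike. With overlapping strikes, noise, or many components at once, the estimates would be expected to degrade, which is the instability worry above.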

Thought: can I use HPSS (harmonic-percussive source separation) to do this with the purely harmonic components? And use the percussive components as priors for the activations? How do you enforce causality for triggering in the FFT-transformed domain?
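For concreteness, here is a minimal sketch of one common HPSS variant, based on median filtering a magnitude spectrogram. The notes do not say which HPSS method is intended, so this choice, the helper names, and the toy spectrogram are all assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_1d(S, size, axis):
    """Running median of odd length `size` along `axis`, with edge padding."""
    pad = size // 2
    pad_width = [(0, 0)] * S.ndim
    pad_width[axis] = (pad, pad)
    padded = np.pad(S, pad_width, mode="edge")
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)

def hpss_masks(S, size=17, eps=1e-12):
    """Soft masks from a magnitude spectrogram S of shape (freq_bins, frames)."""
    H = median_filter_1d(S, size, axis=1)  # median over time keeps sustained partials
    P = median_filter_1d(S, size, axis=0)  # median over frequency keeps broadband onsets
    total = H**2 + P**2 + eps
    return H**2 / total, P**2 / total

# Toy spectrogram: one sustained partial (row 10) and one click (column 20).
S = np.zeros((64, 80))
S[10, :] = 1.0
S[:, 20] = 1.0
mask_h, mask_p = hpss_masks(S)
```

The frame-wise sum of `mask_p * S` would then be a candidate onset-strength curve to use as a prior on the activations. It does nothing to enforce causality, so that question stays open.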

We have activations and components: the activations form a K×T matrix, and the K components are the rows of a K×L matrix. We want the row-wise convolution of one with the other, summed over the K components, to approximately recover the original signal under a chosen loss function.
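The model just described can be written out as a minimal sketch. Writing A for the K×T activations and D for the K×L components, the reconstruction is x[t] ≈ sum over k and l of A[k, t-l] D[k, l]. The symbols A and D, the squared-error-plus-L1 loss, and the toy data are assumptions made for illustration; the notes leave the loss unspecified.

```python
import numpy as np

def synthesize(activations, components):
    """Sum over k of activations[k] convolved with components[k].

    activations: (K, T) array, mostly zeros (sparse triggers).
    components:  (K, L) array, one impulse response per row.
    Returns a length-T signal (causal convolution, truncated to T samples).
    """
    K, T = activations.shape
    out = np.zeros(T)
    for k in range(K):
        out += np.convolve(activations[k], components[k])[:T]
    return out

def loss(signal, activations, components, sparsity=0.1):
    """Squared reconstruction error plus an L1 penalty on the activations."""
    residual = signal - synthesize(activations, components)
    return 0.5 * np.sum(residual**2) + sparsity * np.sum(np.abs(activations))

# Toy example: K=2 decaying sinusoids, struck at a few sparse times.
K, T, L = 2, 400, 100
n = np.arange(L)
components = np.stack([
    np.exp(-n / 30.0) * np.sin(2 * np.pi * 0.05 * n),
    np.exp(-n / 20.0) * np.sin(2 * np.pi * 0.11 * n),
])
activations = np.zeros((K, T))
activations[0, [10, 200]] = [1.0, 0.5]
activations[1, [120]] = [0.8]
signal = synthesize(activations, components)
```

Learning then means minimising `loss` over both arrays given only `signal`. The problem is non-convex jointly but convex in either array with the other held fixed, which is why alternating schemes are the usual starting point.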

Why gamelan? It’s tuned percussion, with a non-trivial tuning system, and no pitch bending.

Theory: TBD

Other questions: Infer chained biquads? Even restrict them to be bandpass? Or sparse, high-order filters of some description?
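To make the chained-biquad option concrete, here is a minimal sketch of a series cascade of bandpass second-order sections. The coefficient formulas follow the widely used Audio EQ Cookbook bandpass with 0 dB peak gain. The function names and the particular centre frequency and Q values are illustrative assumptions.

```python
import numpy as np

def bandpass_biquad(w0, q):
    """Bandpass biquad (unit gain at w0 rad/sample); returns (b, a) with a[0] == 1."""
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([alpha, 0.0, -alpha])
    a = np.array([1.0 + alpha, -2.0 * np.cos(w0), 1.0 - alpha])
    return b / a[0], a / a[0]

def biquad_filter(x, b, a):
    """Direct-form I filtering of x through one second-order section."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(x):
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y[n] = yn
    return y

def cascade(x, sections):
    """Apply the sections in series: the output of one feeds the next."""
    for b, a in sections:
        x = biquad_filter(x, b, a)
    return x

# One "strike" (unit impulse) through two chained bandpass sections.
w0 = 2 * np.pi * 0.1
sections = [bandpass_biquad(w0, 10.0), bandpass_biquad(w0, 30.0)]
impulse = np.zeros(4000)
impulse[0] = 1.0
response = cascade(impulse, sections)
```

Restricting every section to this bandpass form leaves two free parameters per section (centre frequency and Q), which is the appeal of the restriction. Note that a series chain multiplies the responses, so an additive bank of modes would need the sections in parallel instead.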

Does the approach of Tufts and Kumaresan (TuKu82) using sparse matrix sketching get us any closer to this?

RNN notes


Abdallah, S. A., & Plumbley, M. D. (2004) Polyphonic Music Transcription by Non-Negative Sparse Coding of Power Spectra. Presented at the ISMIR.
Adiloglu, K., Annies, A., Wahlen, E., Purwins, H., & Obermayer, K. (2012) A Graphical Representation and Dissimilarity Measure for Basic Everyday Sound Events. IEEE Transactions on Audio, Speech, and Language Processing, 20(5), 1542–1552. DOI.
Bach, F. R., & Jordan, M. I.(2006) Learning spectral clustering, with application to speech separation. Journal of Machine Learning Research, 7(Oct), 1963–2001.
Bayro-Corrochano, E. (2005) The Theory and Use of the Quaternion Wavelet Transform. Journal of Mathematical Imaging and Vision, 24(1), 19–35. DOI.
Bertin, N., Badeau, R., & Vincent, E. (2010) Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18(3), 538–549. DOI.
Carabias-Orti, J. J., Virtanen, T., Vera-Candeas, P., Ruiz-Reyes, N., & Canadas-Quesada, F. J.(2011) Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1144–1158. DOI.
Chen, Y., & Hero, A. O.(2012) Recursive ℓ1,∞ Group Lasso. IEEE Transactions on Signal Processing, 60(8), 3978–3987. DOI.
Dieleman, S., & Schrauwen, B. (2014) End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6964–6968). IEEE DOI.
Eichler, M., Dahlhaus, R., & Dueck, J. (2016) Graphical Modeling for Multivariate Hawkes Processes with Nonparametric Link Functions. Journal of Time Series Analysis, n/a–n/a. DOI.
Ekanadham, C., Tranchina, D., & Simoncelli, E. P.(2011) Recovery of Sparse Translation-Invariant Signals With Continuous Basis Pursuit. IEEE Transactions on Signal Processing, 59(10), 4735–4744. DOI.
Févotte, C., Bertin, N., & Durrieu, J.-L. (2008) Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis. Neural Computation, 21(3), 793–830. DOI.
Finke, A., & Singh, S. S.(2016) Approximate Smoothing and Parameter Estimation in High-Dimensional State-Space Models. arXiv:1606.08650 [stat].
Green, D., & Bass, S. (1984) Representing periodic waveforms with nonorthogonal basis functions. IEEE Transactions on Circuits and Systems, 31(6), 518–534. DOI.
Gribonval, R. (2003) Piecewise linear source separation. In Proc. Soc. Photographic Instrumentation Eng. (Vol. 5207, pp. 297–310). San Diego, CA, USA
Gribonval, R., & Bacry, E. (2003) Harmonic decomposition of audio signals with matching pursuit. IEEE Transactions on Signal Processing, 51(1), 101–111. DOI.
Gribonval, R., Figueras i Ventura, R. M., & Vandergheynst, P. (2006) A simple test to check the optimality of a sparse signal approximation. Signal Processing, 86(3), 496–510. DOI.
Grosse, R., Raina, R., Kwong, H., & Ng, A. Y.(2007) Shift-Invariant Sparse Coding for Audio Classification. In The Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007) (Vol. 9, p. 8).
Hardt, M., Ma, T., & Recht, B. (2016) Gradient Descent Learns Linear Dynamical Systems. arXiv:1609.05191 [cs, Math, Stat].
Helén, M., & Virtanen, T. (2005) Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine. In Signal Processing Conference, 2005 13th European (pp. 1–4).
Hua, Y., & Sarkar, T. K.(1990) Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(5), 814–824. DOI.
Huggins, P. S., & Zucker, S. W.(2007) Greedy Basis Pursuit. IEEE Transactions on Signal Processing, 55(7), 3760–3772. DOI.
Hyvärinen, A., & Hoyer, P. (2000) Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces. Neural Computation, 12(7), 1705–1720. DOI.
Jost, P. (2007, June) Algorithmic aspects of sparse approximations (PhD thesis). Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Jost, P., Vandergheynst, P., & Frossard, P. (2006) Tree-Based Pursuit: Algorithm and Properties. IEEE Trans. on Signal Processing.
Jost, P., Vandergheynst, P., Lesage, S., & Gribonval, R. (2006) MoTIF: An Efficient Algorithm for Learning Translation Invariant Dictionaries. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings (Vol. 5, pp. V–V). Toulouse, France DOI.
Klapuri, A., Virtanen, T., & Heittola, T. (2010) Sound source separation in monaural music signals using excitation-filter model and em algorithm. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5510–5513). DOI.
Knudson, K. C., Yates, J., Huk, A., & Pillow, J. W.(2014) Inferring sparse representations of continuous signals with continuous orthogonal matching pursuit. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (Vol. 27, pp. 1215–1223). Curran Associates, Inc.
Kreutz-Delgado, K., Murray, J. F., Rao, B. D., Engan, K., Lee, T., & Sejnowski, T. J.(2003) Dictionary Learning Algorithms for Sparse Representation. Neural Computation, 15(2), 349–396.
Kronland-Martinet, R., Guillemain, P., & Ystad, S. (1997) Modelling of Natural Sounds by Time Frequency and Wavelet Representations. Organised Sound, 2(3), 179–191.
Kronland-Martinet, R., Guillemain, P., & Ystad, S. (2001) From sound modeling to analysis-synthesis of sounds. In Workshop on Proceedings of MOSART Current Research Directions in Computer Music Workshop.
Leveau, P., Vincent, E., Richard, G., & Daudet, L. (2008) Instrument-Specific Harmonic Atoms for Mid-Level Music Representation. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 116–128. DOI.
Lewicki, M. S.(2002) Efficient Coding of Natural Sounds. Nature Neuroscience, 5(4), 356–363.
Lewicki, M. S., & Sejnowski, T. J.(1999) Coding time-varying signals using sparse, shift-invariant representations. In Proc. Conf. Advances Neural Infor. Proc. Syst. (Vol. 11, pp. 730–736). Denver, CO: MIT Press
Lewicki, M. S., & Sejnowski, T. J.(2000) Learning Overcomplete Representations. Neural Computation, 12(2), 337–365. DOI.
Luengo, D., Santamaría, I., & Vielva, L. (2005) A general solution to blind inverse problems for sparse input signals. Neurocomputing, 69(1–3), 198–215. DOI.
Mallat, S. G., & Zhang, Z. (1993) Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12), 3397–3415.
Masri, P., Bateman, A., & Canagarajah, N. (1997a) A review of time–frequency representations, with application to sound/music analysis–resynthesis. Organised Sound, 2(03), 193–205.
Masri, P., Bateman, A., & Canagarajah, N. (1997b) The importance of the time–frequency representation for sound/music analysis–resynthesis. Organised Sound, 2(03), 207–214.
Mattingley, J., & Boyd, S. (2010) Real-Time Convex Optimization in Signal Processing. IEEE Signal Processing Magazine, 27(3), 50–61. DOI.
Meinshausen, N., & Yu, B. (2009) Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1), 246–270. DOI.
Müller, M., Kurth, F., & Clausen, M. (2005a) Audio matching via chroma-based statistical features. In Proc. Int. Conf. Music Info. Retrieval (pp. 288–295). London, U.K.
Müller, M., Kurth, F., & Clausen, M. (2005b) Chroma-based statistical audio features for audio matching. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 275–278). New Paltz, NY
Needell, D., & Tropp, J. A.(2008) CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. arXiv:0803.2392 [cs, Math].
Pardalos, P. (2010) Convex optimization theory. Optimization Methods and Software, 25(3), 487–487.
Peeters, G. (2004) A large set of audio features for sound description (similarity and classification) in the CUIDADO project.
Preis, D., & Georgopoulos, V. C.(1999) Wigner Distribution Representation and Analysis of Audio Signals: An Illustrated Tutorial Review. Journal of the Audio Engineering Society, 47(12), 1043–1053.
Ravelli, E., Richard, G., & Daudet, L. (2008) Fast MIR in a Sparse Transform Domain. In Int. Conf. Music Info. Retrieval. Philadelphia, PA
Rebollo-Neira, L. (2006) On non-orthogonal signal representations. In C. V. Benton (Ed.), New Topics in Mathematical Physics Research (pp. 205–239). Nova Science Publishers, Inc.
Rebollo-Neira, L. (2007) Oblique Matching Pursuit. IEEE Signal Processing Letters, 14(10), 703–706. DOI.
Rebollo-Neira, L., & Lowe, D. (2002) Optimized orthogonal matching pursuit approach. IEEE Signal Processing Letters, 9(4), 137–140. DOI.
Rioul, O., & Vetterli, M. (1991) Wavelets and signal processing. IEEE Signal Processing Mag., 8(4), 14–38.
Routtenberg, T., & Tabrikian, J. (2010) Blind MIMO-AR System Identification and Source Separation with Finite-alphabet. Trans. Sig. Proc., 58(3), 990–1000. DOI.
Rubinstein, R., Bruckstein, A. M., & Elad, M. (2010) Dictionaries for Sparse Representation Modeling. Proceedings of the IEEE, 98(6), 1045–1057. DOI.
Sefati, S., Cowan, N. J., & Vidal, R. (2015) Linear systems with sparse inputs: Observability and input recovery. In 2015 American Control Conference (ACC) (pp. 5251–5257). DOI.
Smaragdis, P. (2004) Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs. In C. G. Puntonet & A. Prieto (Eds.), Independent Component Analysis and Blind Signal Separation (pp. 494–499). Springer Berlin Heidelberg
Smaragdis, P., & Brown, J. C.(2003) Non-negative matrix factorization for polyphonic music transcription. In Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on. (pp. 177–180). DOI.
Smith, E. C., & Lewicki, M. S.(2006) Efficient auditory coding. Nature, 439(7079), 978–982. DOI.
Sturm, B. L.(2001a) Composing for an Ensemble of Atoms: The Metamorphosis of Scientific Experiment into Music. Organised Sound, 6, 131–145. DOI.
Sturm, B. L.(2001b) Synthesis and Algorithmic Composition Techniques Derived from Particle Physics. In Proc. 8th Biennial Arts Tech. Symp. New London, CT
Sturm, B. L.(2004) MATConcat: An Application for Exploring Concatenative Sound Synthesis Using MATLAB. In Proc. 7th Int. Conf. Digital Audio Effects. Naples, Italy
Sturm, B. L. (2009) Sparse Approximation and Atomic Decomposition: Considering Atom Interactions in Evaluating and Building Signal Representations (PhD thesis). University of California, Santa Barbara, CA
Tropp, J. (2004, August) Topics in Sparse Approximation (PhD thesis). The University of Texas at Austin
Tropp, J. A., Wakin, M. B., Duarte, M. F., Baron, D., & Baraniuk, R. G.(2006) Random Filters for Compressive Sampling and Reconstruction. In Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (Vol. 3, pp. 872–875). DOI.
Tufts, D. W., & Kumaresan, R. (1982) Estimation of frequencies of multiple sinusoids: Making linear prediction perform like maximum likelihood. Proceedings of the IEEE, 70(9), 975–989. DOI.
van den Oord, A. (n.d.) Wavenet: A Generative Model for Raw Audio.
Van Eeghem, F., & De Lathauwer, L. (2013) Blind system identification as a compressed sensing problem.
Vincent, E., Bertin, N., & Badeau, R. (2008) Harmonic and inharmonic Nonnegative Matrix Factorization for Polyphonic Pitch transcription. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 109–112). DOI.
Virtanen, T. (2007) Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 1066–1074. DOI.
Wright, M., Beauchamp, J., Fitz, K., Rodet, X., Röbel, A., Serra, X., & Wakefield, G. (2001) Analysis/synthesis comparison. Organised Sound, 5(03), 173–189.
Yi, B.-K., & Faloutsos, C. (2000) Fast Time Sequence Indexing for Arbitrary Lp Norms. In VLDB ’00: Proceedings of the 26th International Conference on Very Large Data Bases (pp. 385–394). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Yi, B.-K., Jagadish, H. V., & Faloutsos, C. (1998) Efficient retrieval of similar time sequences under time warping. In 14th International Conference on Data Engineering, 1998. Proceedings (pp. 201–208). DOI.
Yin, W., Osher, S., Goldfarb, D., & Darbon, J. (2008) Bregman Iterative Algorithms for ℓ1-Minimization with Applications to Compressed Sensing. SIAM J. Imaging Sciences, 1(1), 143–168. DOI.
Yu, D., & Deng, L. (2011) Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP]. IEEE Signal Processing Magazine, 28(1), 145–154. DOI.
Yu, G., & Slotine, J.-J. (2009) Audio classification from time-frequency texture. In Acoustics, Speech, and Signal Processing, IEEE International Conference on (Vol. 0, pp. 1677–1680). Los Alamitos, CA, USA: IEEE Computer Society DOI.
Zhang, X., & Zbigniew, W. R.(2007) Analysis of Sound Features for Music Timbre Recognition. In International Conference on Multimedia and Ubiquitous Engineering, 2007. MUE ’07 (pp. 3–8). Washington, DC DOI.
Zhang, Y., Liang, P., & Wainwright, M. J.(2016) Convexified Convolutional Neural Networks. arXiv:1609.01000 [cs].
Zils, A., & Pachet, F. (2001) Musical Mosaicing. In Proc. COST G-6 Conf. Digital Audio Effects. Limerick, Ireland