The Living Thing / Notebooks : Learning Gamelan

I feel that a certain class of audio signals should be easy to decompose, and thence to learn, in a musically useful way: those well approximated by nearly linear time-invariant (LTI), nearly additive filterbanks with sparse activations. This is a very specialised problem, but a musically useful one, and close enough to soluble that it might be worth attempting.

This is about online learning of sparse basis dictionaries for music: a specialised type of system identification, or some generalisation of “shift-invariant sparse coding”.

It seems this would boil down to something like sparse dictionary learning, with sparse activations and a dictionary whose atoms are themselves sparse in linear predictive coding (LPC) components.

There are two ways to do this: in the time domain or in the frequency domain.

For the latter, sparse time-domain activations are non-local in the Fourier components, but possibly simple to recover.
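One hedged reading of “simple to recover”: if a single atom were already known, its activations could be estimated by regularised division in the FFT domain. The sketch below assumes exactly that. The function `deconvolve_fft`, the toy damped-sinusoid atom and the `reg` constant are illustrative choices, and the real problem (many unknown atoms) is harder.

```python
import numpy as np

def deconvolve_fft(signal, atom, reg=1e-8):
    """Estimate activations a with signal ≈ a * atom (circular convolution)
    by regularised spectral division. `reg` is an assumed damping constant."""
    n = len(signal)
    S = np.fft.rfft(signal, n)
    D = np.fft.rfft(atom, n)  # atom is zero-padded to length n
    A = S * np.conj(D) / (np.abs(D) ** 2 + reg)
    return np.fft.irfft(A, n)

# Toy data: one known atom (a damped cosine) triggered twice.
n, atom_len = 2048, 256
t = np.arange(atom_len)
atom = np.exp(-t / 40.0) * np.cos(2 * np.pi * 0.05 * t)

activations = np.zeros(n)
activations[100] = 1.0
activations[700] = 0.5

# Both tails finish well before the end of the buffer, so the truncated
# linear convolution coincides with the circular one the FFT assumes.
signal = np.convolve(activations, atom)[:n]

estimate = deconvolve_fft(signal, atom)
print(np.argsort(estimate)[-2:])  # indices of the two largest activations
```

A single spike in time spreads over every Fourier bin, which is the non-locality; the division undoes it only because the atom's spectrum here has no near-zero bins.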

For the former, one could solve the Yule-Walker equations (for example by Levinson-Durbin recursion) in the time domain, although we expect that to be unstable. We could go for sparse simultaneous kernel inference in the time domain, which might be better, or directly infer the Horner form. Then we have a lot of simultaneous filter components and tedious inference for them. Otherwise, we could do it directly in the FFT domain, although this makes the multiple-input multiple-output (MIMO) case harder and excludes the potential for non-linearities. The fact that I am expecting to identify many distinct systems in Fourier space as atoms complicates this slightly.
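For the linear-prediction route, here is a minimal sketch of fitting LPC coefficients to a single signal by solving the Yule-Walker equations. It assumes one stationary autoregressive source, which is much simpler than the many-atom setting above. The function name `lpc_yule_walker` and the toy AR(2) test signal are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_yule_walker(x, order):
    """Estimate a[1..p] such that x[n] ≈ sum_k a[k] * x[n-k]."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)]) / n
    # Symmetric Toeplitz system R a = r[1:]; SciPy solves it by Levinson recursion.
    return solve_toeplitz(r[:order], r[1:])

# Toy check: synthesise a stable AR(2) process and re-estimate its coefficients.
rng = np.random.default_rng(0)
true_a = np.array([1.5, -0.9])  # poles at radius sqrt(0.9), inside the unit circle
noise = rng.standard_normal(20000)
x = lfilter([1.0], np.concatenate(([1.0], -true_a)), noise)

est_a = lpc_yule_walker(x, order=2)
print(est_a)
```

With a long, stationary excitation this is well behaved; the instability worry is more plausible for short, overlapping, sparsely triggered notes, where the autocorrelation estimate mixes several systems.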

Thought: can I use harmonic-percussive source separation (HPSS) to do this with the purely harmonic components, and use the percussive components as priors for the activations? How do you enforce causality for triggering in the FFT-transformed domain?
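To make the HPSS idea concrete, here is a rough sketch of one common variant, median-filtering HPSS with soft masks on an STFT. It is one possible implementation, not necessarily the one intended here. The kernel size, window length and the toy sine-plus-click signal are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import median_filter

def hpss(x, fs, nperseg=1024, kernel=17):
    """Split x into harmonic and percussive parts via median-filtered STFT masks."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    mag = np.abs(Z)  # shape: (frequency bins, time frames)
    # Harmonic energy is smooth along time; percussive energy along frequency.
    harm = median_filter(mag, size=(1, kernel))
    perc = median_filter(mag, size=(kernel, 1))
    mask_h = harm**2 / (harm**2 + perc**2 + 1e-12)
    mask_p = 1.0 - mask_h  # masks sum to one, so the two parts sum to x
    _, xh = istft(Z * mask_h, fs=fs, nperseg=nperseg)
    _, xp = istft(Z * mask_p, fs=fs, nperseg=nperseg)
    return xh[: len(x)], xp[: len(x)]

# Toy signal: a steady 440 Hz tone plus a single click.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
x[4000] += 5.0

harmonic, percussive = hpss(x, fs)
```

In this toy case the click ends up mostly in `percussive`, so its envelope could serve as a crude prior on where activations fire. The causality question is untouched by this sketch, since the STFT frames smear each onset symmetrically in time.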

We have activations and components: the activations form a K×T matrix, and the K components are the rows of a K×L matrix. We want the convolution of each activation row with its component, summed over the K rows, to approximately recover the original signal under some loss function.
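The model just described can be sketched as a forward synthesis plus a loss. The squared-error-plus-L1 loss below is only one candidate for the unspecified loss function, and the names `synthesize`, `loss` and `sparsity` are illustrative.

```python
import numpy as np

def synthesize(activations, atoms):
    """Sum over k of activations[k] convolved with atoms[k].

    activations: (K, T) array; atoms: (K, L) array.
    Returns a signal of length T + L - 1.
    """
    K, T = activations.shape
    _, L = atoms.shape
    out = np.zeros(T + L - 1)
    for k in range(K):
        out += np.convolve(activations[k], atoms[k])
    return out

def loss(signal, activations, atoms, sparsity=0.1):
    """Squared reconstruction error plus an L1 penalty on the activations."""
    resid = signal - synthesize(activations, atoms)
    return 0.5 * np.sum(resid**2) + sparsity * np.sum(np.abs(activations))

# Toy example: two damped-sinusoid atoms, each triggered once.
K, T, L = 2, 200, 50
n = np.arange(L)
atoms = np.stack(
    [np.exp(-n / 10.0) * np.sin(2 * np.pi * f * n) for f in (0.05, 0.12)]
)
activations = np.zeros((K, T))
activations[0, 3] = 2.0
activations[1, 120] = 1.0

signal = synthesize(activations, atoms)
print(loss(signal, activations, atoms))  # only the L1 term remains
```

Learning would then alternate between, or jointly optimise over, `activations` and `atoms` to reduce this loss on a recorded signal.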

Why gamelan? It’s tuned percussion, with a non-trivial tuning system, and no pitch bending.

Theory: TBD

Other questions: Infer chained biquads? Even restrict them to be bandpass? Or sparse, high-order filters of some description?
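As a rough illustration of the chained-biquad question, the sketch below builds a series cascade of band-pass (peaking) second-order sections and takes its impulse response as a candidate atom. The centre frequencies, the shared Q and the function name `bandpass_cascade` are made-up placeholders; nothing here is inferred from data.

```python
import numpy as np
from scipy.signal import iirpeak, tf2sos, sosfilt

def bandpass_cascade(freqs_hz, q, fs):
    """Stack one peaking band-pass biquad per centre frequency into an SOS array."""
    sections = []
    for f0 in freqs_hz:
        b, a = iirpeak(f0, q, fs=fs)  # second-order resonant band-pass
        sections.append(tf2sos(b, a))
    return np.vstack(sections)

fs = 44100
sos = bandpass_cascade([440.0, 880.0, 1320.0], q=30.0, fs=fs)

# Impulse response of the chain: a decaying, ringing candidate atom.
impulse = np.zeros(4096)
impulse[0] = 1.0
atom = sosfilt(sos, impulse)
```

In a series chain the frequency responses multiply, so sections at different centre frequencies attenuate one another. A parallel bank, with the section outputs summed, may be closer to a sum-of-modes picture of a struck bar or gong.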

Does the approach of Tufts and Kumaresan (TuKu82) using sparse matrix sketching get us any closer to this?

RNN notes

Refs

AbPl04
Abdallah, S. A., & Plumbley, M. D. (2004) Polyphonic Music Transcription by Non-Negative Sparse Coding of Power Spectra. Presented at ISMIR.
AAWP12
Adiloglu, K., Annies, A., Wahlen, E., Purwins, H., & Obermayer, K. (2012) A Graphical Representation and Dissimilarity Measure for Basic Everyday Sound Events. IEEE Transactions on Audio, Speech, and Language Processing, 20(5), 1542–1552. DOI.
BaJo06
Bach, F. R., & Jordan, M. I.(2006) Learning spectral clustering, with application to speech separation. Journal of Machine Learning Research, 7(Oct), 1963–2001.
Bayr05
Bayro-Corrochano, E. (2005) The Theory and Use of the Quaternion Wavelet Transform. Journal of Mathematical Imaging and Vision, 24(1), 19–35. DOI.
BeBV10
Bertin, N., Badeau, R., & Vincent, E. (2010) Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18(3), 538–549. DOI.
CVVR11
Carabias-Orti, J. J., Virtanen, T., Vera-Candeas, P., Ruiz-Reyes, N., & Canadas-Quesada, F. J.(2011) Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1144–1158. DOI.
ChHe12
Chen, Y., & Hero, A. O.(2012) Recursive ℓ1,∞ Group Lasso. IEEE Transactions on Signal Processing, 60(8), 3978–3987. DOI.
DiSc14
Dieleman, S., & Schrauwen, B. (2014) End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6964–6968). IEEE DOI.
EiDD16
Eichler, M., Dahlhaus, R., & Dueck, J. (2016) Graphical Modeling for Multivariate Hawkes Processes with Nonparametric Link Functions. Journal of Time Series Analysis, n/a–n/a. DOI.
EkTS11
Ekanadham, C., Tranchina, D., & Simoncelli, E. P.(2011) Recovery of Sparse Translation-Invariant Signals With Continuous Basis Pursuit. IEEE Transactions on Signal Processing, 59(10), 4735–4744. DOI.
FéBD08
Févotte, C., Bertin, N., & Durrieu, J.-L. (2008) Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis. Neural Computation, 21(3), 793–830. DOI.
FiSi16
Finke, A., & Singh, S. S.(2016) Approximate Smoothing and Parameter Estimation in High-Dimensional State-Space Models. arXiv:1606.08650 [stat].
GrBa84
Green, D., & Bass, S. (1984) Representing periodic waveforms with nonorthogonal basis functions. IEEE Transactions on Circuits and Systems, 31(6), 518–534. DOI.
Grib03
Gribonval, R. (2003) Piecewise linear source separation. In Proc. Soc. Photographic Instrumentation Eng. (Vol. 5207, pp. 297–310). San Diego, CA, USA
GrBa03
Gribonval, R., & Bacry, E. (2003) Harmonic decomposition of audio signals with matching pursuit. IEEE Transactions on Signal Processing, 51(1), 101–111. DOI.
GrFV06
Gribonval, R., Figueras i Ventura, R. M., & Vandergheynst, P. (2006) A simple test to check the optimality of a sparse signal approximation. Signal Processing, 86(3), 496–510. DOI.
GRKN07
Grosse, R., Raina, R., Kwong, H., & Ng, A. Y.(2007) Shift-Invariant Sparse Coding for Audio Classification. In The Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007) (Vol. 9, p. 8).
HaMR16
Hardt, M., Ma, T., & Recht, B. (2016) Gradient Descent Learns Linear Dynamical Systems. arXiv:1609.05191 [cs, Math, Stat].
HeVi05
Helén, M., & Virtanen, T. (2005) Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine. In Signal Processing Conference, 2005 13th European (pp. 1–4).
HuSa90
Hua, Y., & Sarkar, T. K.(1990) Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(5), 814–824. DOI.
HuZu07
Huggins, P. S., & Zucker, S. W.(2007) Greedy Basis Pursuit. IEEE Transactions on Signal Processing, 55(7), 3760–3772. DOI.
HyHo00
Hyvärinen, A., & Hoyer, P. (2000) Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces. Neural Computation, 12(7), 1705–1720. DOI.
Jost07
Jost, P. (2007, June) Algorithmic aspects of sparse approximations (PhD thesis). Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
JoVF06
Jost, P., Vandergheynst, P., & Frossard, P. (2006) Tree-Based Pursuit: Algorithm and Properties. IEEE Trans. on Signal Processing.
JVLG06
Jost, P., Vandergheynst, P., Lesage, S., & Gribonval, R. (2006) MoTIF: An Efficient Algorithm for Learning Translation Invariant Dictionaries. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings (Vol. 5, pp. V–V). Toulouse, France DOI.
KlVH10
Klapuri, A., Virtanen, T., & Heittola, T. (2010) Sound source separation in monaural music signals using excitation-filter model and em algorithm. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5510–5513). DOI.
KYHP14
Knudson, K. C., Yates, J., Huk, A., & Pillow, J. W.(2014) Inferring sparse representations of continuous signals with continuous orthogonal matching pursuit. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (Vol. 27, pp. 1215–1223). Curran Associates, Inc.
KMRE03
Kreutz-Delgado, K., Murray, J. F., Rao, B. D., Engan, K., Lee, T., & Sejnowski, T. J.(2003) Dictionary Learning Algorithms for Sparse Representation. Neural Computation, 15(2), 349–396.
KrGY97
Kronland-Martinet, R., Guillemain, P., & Ystad, S. (1997) Modelling of Natural Sounds by Time Frequency and Wavelet Representations. Organised Sound, 2(3), 179–191.
KrGY01
Kronland-Martinet, R., Guillemain, P., & Ystad, S. (2001) From sound modeling to analysis-synthesis of sounds. In Workshop on Proceedings of MOSART Current Research Directions in Computer Music Workshop.
LVRD08
Leveau, P., Vincent, E., Richard, G., & Daudet, L. (2008) Instrument-Specific Harmonic Atoms for Mid-Level Music Representation. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 116–128. DOI.
Lewi02
Lewicki, M. S.(2002) Efficient Coding of Natural Sounds. Nature Neuroscience, 5(4), 356–363.
LeSe99
Lewicki, M. S., & Sejnowski, T. J.(1999) Coding time-varying signals using sparse, shift-invariant representations. In Proc. Conf. Advances Neural Infor. Proc. Syst. (Vol. 11, pp. 730–736). Denver, CO: MIT Press
LeSe00
Lewicki, M. S., & Sejnowski, T. J.(2000) Learning Overcomplete Representations. Neural Computation, 12(2), 337–365. DOI.
LuSV05
Luengo, D., Santamaría, I., & Vielva, L. (2005) A general solution to blind inverse problems for sparse input signals. Neurocomputing, 69(1–3), 198–215. DOI.
MaZh93
Mallat, S. G., & Zhang, Z. (1993) Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12), 3397–3415.
MaBC97a
Masri, P., Bateman, A., & Canagarajah, N. (1997a) A review of time–frequency representations, with application to sound/music analysis–resynthesis. Organised Sound, 2(03), 193–205.
MaBC97b
Masri, P., Bateman, A., & Canagarajah, N. (1997b) The importance of the time–frequency representation for sound/music analysis–resynthesis. Organised Sound, 2(03), 207–214.
MaBo10
Mattingley, J., & Boyd, S. (2010) Real-Time Convex Optimization in Signal Processing. IEEE Signal Processing Magazine, 27(3), 50–61. DOI.
MeYu09
Meinshausen, N., & Yu, B. (2009) Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1), 246–270. DOI.
MüKC05a
Müller, M., Kurth, F., & Clausen, M. (2005a) Audio matching via chroma-based statistical features. In Proc. Int. Conf. Music Info. Retrieval (pp. 288–295). London, U.K.
MüKC05b
Müller, M., Kurth, F., & Clausen, M. (2005b) Chroma-based statistical audio features for audio matching. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 275–278). New Paltz, NY
NeTr08
Needell, D., & Tropp, J. A.(2008) CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. arXiv:0803.2392 [cs, Math].
Pard10
Pardalos, P. (2010) Convex optimization theory. Optimization Methods and Software, 25(3), 487–487.
Peet04
Peeters, G. (2004) A large set of audio features for sound description (similarity and classification) in the CUIDADO project.
PrGe99
Preis, D., & Georgopoulos, V. C.(1999) Wigner Distribution Representation and Analysis of Audio Signals: An Illustrated Tutorial Review. Journal of the Audio Engineering Society, 47(12), 1043–1053.
RaRD08
Ravelli, E., Richard, G., & Daudet, L. (2008) Fast MIR in a Sparse Transform Domain. In Int. Conf. Music Info. Retrieval. Philadelphia, PA
Rebo06
Rebollo-Neira, L. (2006) On non-orthogonal signal representations. In C. V. Benton (Ed.), New Topics in Mathematical Physics Research (pp. 205–239). Nova Science Publishers, Inc.
Rebo07
Rebollo-Neira, L. (2007) Oblique Matching Pursuit. IEEE Signal Processing Letters, 14(10), 703–706. DOI.
ReLo02
Rebollo-Neira, L., & Lowe, D. (2002) Optimized orthogonal matching pursuit approach. IEEE Signal Processing Letters, 9(4), 137–140. DOI.
RiVe91
Rioul, O., & Vetterli, M. (1991) Wavelets and signal processing. IEEE Signal Processing Mag., 8(4), 14–38.
RoTa10
Routtenberg, T., & Tabrikian, J. (2010) Blind MIMO-AR System Identification and Source Separation with Finite-alphabet. Trans. Sig. Proc., 58(3), 990–1000. DOI.
RuBE10
Rubinstein, R., Bruckstein, A. M., & Elad, M. (2010) Dictionaries for Sparse Representation Modeling. Proceedings of the IEEE, 98(6), 1045–1057. DOI.
SeCV15
Sefati, S., Cowan, N. J., & Vidal, R. (2015) Linear systems with sparse inputs: Observability and input recovery. In 2015 American Control Conference (ACC) (pp. 5251–5257). DOI.
Smar04
Smaragdis, P. (2004) Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs. In C. G. Puntonet & A. Prieto (Eds.), Independent Component Analysis and Blind Signal Separation (pp. 494–499). Springer Berlin Heidelberg
SmBr03
Smaragdis, P., & Brown, J. C.(2003) Non-negative matrix factorization for polyphonic music transcription. In Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on. (pp. 177–180). DOI.
SmLe06
Smith, E. C., & Lewicki, M. S.(2006) Efficient auditory coding. Nature, 439(7079), 978–982. DOI.
Stur01a
Sturm, B. L.(2001a) Composing for an Ensemble of Atoms: The Metamorphosis of Scientific Experiment into Music. Organised Sound, 6, 131–145. DOI.
Stur01b
Sturm, B. L.(2001b) Synthesis and Algorithmic Composition Techniques Derived from Particle Physics. In Proc. 8th Biennial Arts Tech. Symp. New London, CT
Stur04
Sturm, B. L.(2004) MATConcat: An Application for Exploring Concatenative Sound Synthesis Using MATLAB. In Proc. 7th Int. Conf. Digital Audio Effects. Naples, Italy
Stur09
Sturm, B. L. (2009) Sparse Approximation and Atomic Decomposition: Considering Atom Interactions in Evaluating and Building Signal Representations (PhD thesis). University of California, Santa Barbara, CA
Trop04
Tropp, J. (2004, August) Topics in Sparse Approximation (PhD thesis). The University of Texas at Austin
TWDB06
Tropp, J. A., Wakin, M. B., Duarte, M. F., Baron, D., & Baraniuk, R. G.(2006) Random Filters for Compressive Sampling and Reconstruction. In Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (Vol. 3, pp. 872–875). DOI.
TuKu82
Tufts, D. W., & Kumaresan, R. (1982) Estimation of frequencies of multiple sinusoids: Making linear prediction perform like maximum likelihood. Proceedings of the IEEE, 70(9), 975–989. DOI.
Oord00
van den Oord, A. (n.d.) Wavenet: A Generative Model for Raw Audio.
VaDe13
Van Eeghem, F., & De Lathauwer, L. (2013) Blind system identification as a compressed sensing problem.
ViBB08
Vincent, E., Bertin, N., & Badeau, R. (2008) Harmonic and inharmonic Nonnegative Matrix Factorization for Polyphonic Pitch transcription. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 109–112). DOI.
Virt07
Virtanen, T. (2007) Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 1066–1074. DOI.
WBFR01
Wright, M., Beauchamp, J., Fitz, K., Rodet, X., Röbel, A., Serra, X., & Wakefield, G. (2001) Analysis/synthesis comparison. Organised Sound, 5(03), 173–189.
YiFa00
Yi, B.-K., & Faloutsos, C. (2000) Fast Time Sequence Indexing for Arbitrary Lp Norms. In VLDB ’00: Proceedings of the 26th International Conference on Very Large Data Bases (pp. 385–394). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
YiJF98
Yi, B.-K., Jagadish, H. V., & Faloutsos, C. (1998) Efficient retrieval of similar time sequences under time warping. In 14th International Conference on Data Engineering, 1998. Proceedings (pp. 201–208). DOI.
YOGD08
Yin, W., Osher, S., Goldfarb, D., & Darbon, J. (2008) Bregman Iterative Algorithms for ℓ1-Minimization with Applications to Compressed Sensing. SIAM J. Imaging Sciences, 1(1), 143–168. DOI.
YuDe11
Yu, D., & Deng, L. (2011) Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP]. IEEE Signal Processing Magazine, 28(1), 145–154. DOI.
YuSl09
Yu, G., & Slotine, J.-J. (2009) Audio classification from time-frequency texture. In Acoustics, Speech, and Signal Processing, IEEE International Conference on (Vol. 0, pp. 1677–1680). Los Alamitos, CA, USA: IEEE Computer Society DOI.
ZhZb07
Zhang, X., & Zbigniew, W. R.(2007) Analysis of Sound Features for Music Timbre Recognition. In International Conference on Multimedia and Ubiquitous Engineering, 2007. MUE ’07 (pp. 3–8). Washington, DC DOI.
ZhLW16
Zhang, Y., Liang, P., & Wainwright, M. J.(2016) Convexified Convolutional Neural Networks. arXiv:1609.01000 [cs].
ZiPa01
Zils, A., & Pachet, F. (2001) Musical Mosaicing. In Proc. COST G-6 Conf. Digital Audio Effects. Limerick, Ireland