
Machine listening

See also musical corpora, musical metrics, synchronisation, sparse basis dictionaries, speech recognition, learning gamelan, analysis/resynthesis, and whatever other machine listening posts I forgot.

I’m not going to talk about speech recognition here; that boat is full.

Machine listening: machine learning from audio, especially music. Everything from that damn shazam app, to teaching computers to recognise speech, to doing artsy shit with sound. I’m mostly concerned with the third one. Statistics, features, descriptors, metrics, kernels and affinities, and the spaces and topologies they induce, for musical audio, e.g. your MP3 pop song library. This has considerable overlap with musical metrics, but there I start from scores and transcriptions.

Polyphony and its problems.

Approximate logarithmic perception and its problems.

Should I do a separate psychoacoustics notebook?

Interesting descriptors/features

Audio summaries that attempt to turn raw signals into useful feature vectors reflective of human perception of them. This is a huge industry, because it makes audio convenient for transmission (hello, mobile telephony and MP3), but it’s also useful for understanding speech, music etc. There are as many descriptors as there are IEEE conference slots.

See AlSS16 for an intimidatingly comprehensive summary.

I’m especially interested in

  1. invertible ones, for analysis/resynthesis. If not analytically invertible, convexity would do. (The lazy baseline is sketched just after this list.)
  2. differentiable ones, for leveraging artificial neural infrastructure for easy optimisation.
  3. ones that avoid the windowed DTFT, because it sounds awful in the resynthesis phase and is lazy.
  4. ones that capture both the harmonic and percussive parts.
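On point 1, the lazy baseline that point 3 complains about is the plain windowed-DFT round trip: analyse, fiddle with the frames, overlap-add back. A minimal sketch of that round trip, assuming scipy (my habitual tooling; the signal and frame sizes are arbitrary):

```python
import numpy as np
from scipy import signal

# Stand-in for real audio: a 440 Hz tone plus a little noise.
sr = 22050
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(t.size)

# Analysis: windowed DFT frames (the STFT), 75% overlap.
f, tau, Z = signal.stft(x, fs=sr, nperseg=1024, noverlap=768)

# ...a descriptor would summarise or modify Z here...

# Resynthesis: overlap-add inversion.
_, x_hat = signal.istft(Z, fs=sr, nperseg=1024, noverlap=768)
print(np.max(np.abs(x - x_hat[:x.size])))  # reconstruction error, near machine precision
```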

Question: can we construct new ones from compressive system identification, as in Carm13?

Also, ones that can encode noisiness in the signal as well as harmonicity…? I guess I should read AlSS16.

Deep neural networks

See, e.g. Jordi Pons’ Spectrogram CNN discussion for some introductions to the kind of features a neural network might “discover” in audio recognition tasks.

There is some interesting stuff here; for example, Dieleman and Schrauwen (DiSc14) show that convolutional neural networks trained on raw audio (i.e. not spectrograms) for music classification recover Mel-like frequency bands. Thickstun et al (ThHK16) do some similar work.

And Keunwoo Choi shows that you can listen to what they learn.
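For concreteness, the raw-audio front end in DiSc14-style models amounts to a strided 1-d convolution applied directly to the waveform, with the learned kernels playing the role of a filterbank. A minimal sketch in PyTorch (my choice of framework; the layer sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class RawAudioFrontEnd(nn.Module):
    """Strided 1-d convolutions over raw samples, DiSc14-style; sizes are made up."""
    def __init__(self, n_filters=32):
        super().__init__()
        self.frame = nn.Conv1d(1, n_filters, kernel_size=256, stride=256)  # "framing" layer
        self.summarise = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=8),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # pool over time
        )

    def forward(self, x):  # x: (batch, 1, n_samples)
        return self.summarise(self.frame(x)).squeeze(-1)

x = torch.randn(4, 1, 22050)        # a batch of one-second clips
print(RawAudioFrontEnd()(x).shape)  # torch.Size([4, 32])
```

It is the kernels of that first layer which, after training on enough music, end up looking like a Mel-ish filterbank.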

Sparse comb filters

Differentiable! Conditionally invertible! Handy for syncing.

Moorer (Moor74) proposed this for harmonic purposes, but Robertson et al (RoSP11) have shown it to be handy for rhythm.
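To see why: a feedback comb filter resonates when its delay matches the period of its input, so a bank of combs over candidate lags acts as a periodicity detector on an onset-strength envelope. A toy sketch (the envelope and lag grid are invented for illustration, not RoSP11's actual filter matrix):

```python
import numpy as np
from scipy import signal

# Toy onset envelope: a pulse every 43 frames (~120 BPM at ~86 frames/sec) plus noise.
# In practice this would come from a spectral-flux onset detector.
env = 0.1 * np.random.rand(2000)
env[::43] += 1.0

def comb_energy(env, lag, gain=0.9):
    """Output energy of the feedback comb y[n] = x[n] + gain * y[n - lag]."""
    a = np.zeros(lag + 1)
    a[0], a[lag] = 1.0, -gain
    return float(np.sum(signal.lfilter([1.0], a, env) ** 2))

lags = np.arange(20, 80)
energies = np.array([comb_energy(env, lag) for lag in lags])
print("winning lag:", lags[np.argmax(energies)])  # should land near 43
```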

Autocorrelation features

Measure the signal’s full or partial autocorrelation.
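A minimal sketch of the usual FFT-based computation (the tone and lag range are just for illustration):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Autocorrelation of a 1-d signal up to max_lag, computed via the FFT."""
    x = x - x.mean()
    n = 2 ** int(np.ceil(np.log2(2 * x.size)))       # zero-pad to avoid circular wrap-around
    spectrum = np.fft.rfft(x, n)
    acf = np.fft.irfft(spectrum * np.conj(spectrum))[: max_lag + 1]
    return acf / acf[0]                               # normalise so lag 0 equals 1

sr = 22050
x = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)      # one second of a 220 Hz tone
acf = autocorrelation(x, 400)
print("strongest non-trivial lag:", np.argmax(acf[50:]) + 50)  # ~100 samples, i.e. 220 Hz
```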

Linear predictive coefficients

How do these transform?

I think this always hinges on Skorohod embedding.
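For the record, the classic autocorrelation ("Yule-Walker") method of estimating them is just a Toeplitz solve over the autocorrelation sequence. A minimal sketch, assuming scipy (librosa also ships an LPC routine, via Burg's method):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """Prediction coefficients a[1..order] by the autocorrelation method."""
    frame = frame * np.hanning(frame.size)                         # window the frame
    r = np.correlate(frame, frame, mode="full")[frame.size - 1:]   # autocorrelation, lags 0..N-1
    return solve_toeplitz(r[:order], r[1:order + 1])               # solve R a = r

sr = 22050
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(t.size)
print(lpc(frame, order=8))
```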

Cepstra

Classic, but inconvenient to invert.
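For the record: the real cepstrum is the inverse DFT of the log magnitude spectrum, and harmonic signals show up as a peak at the pitch period ("quefrency"). A couple of lines of numpy, on a made-up harmonic-rich frame:

```python
import numpy as np

def real_cepstrum(frame, eps=1e-10):
    """Real cepstrum of one frame: inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame * np.hanning(frame.size))
    return np.fft.irfft(np.log(np.abs(spectrum) + eps))

sr, f0 = 22050, 220.0
t = np.arange(2048) / sr
frame = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 11))  # 10 harmonics of 220 Hz
cep = real_cepstrum(frame)
# Search a plausible pitch range (roughly 75-550 Hz) for the quefrency peak.
print("quefrency peak:", np.argmax(cep[40:300]) + 40, "samples; expected ~", round(sr / f0))
```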

MFCC

Mel-frequency Cepstral Coefficients, or the Mel cepstral transform. Take the perfectly respectable-if-fiddly cepstrum and make it really messy, with a vague psychoacoustic model, in the hope that distinctions in the resulting “MFCC” might correspond to human perceptual distinctions.

Folk wisdom holds that MFCC features are Eurocentric, in that they destroy, or at least obscure, tonal language features. Ubiquitous, but inconsistently implemented; MFCCs are generally not the same across implementations, probably because the Mel scale is itself not universally standardised.

Aside from being loosely psychoacoustically motivated features, what do the coefficients of an MFCC specifically tell me?

Hmm. If I have got this right, these are “generic features”; things we can use in machine learning because we hope they project the spectrum into a space which approximately preserves psychoacoustic dissimilarity, whilst having little redundancy.

This heuristic pro is weighed against the practical con that they are not practically differentiable, nor invertible except by heroic computational effort, nor humanly interpretable, and that they are riven with poorly-supported, somewhat arbitrary steps. (The cepstrum of the Mel-frequency spectrogram is a weird thing that no longer picks out harmonics in the way that God and Tukey intended.)
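In practice nobody computes them by hand; assuming librosa (my usual tool; the file path is a placeholder), the whole messy pipeline is one call:

```python
import librosa

y, sr = librosa.load("some_song.mp3")   # placeholder path; any mono audio will do

# 13 coefficients per frame is the folkloric default inherited from speech recognition.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)                        # (13, n_frames)
```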

Filterbanks

Including bandpasses, gammatones… random filterbanks?
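A bare-bones example, assuming scipy: a bank of log-spaced Butterworth bandpasses with per-band log energies as the feature vector (gammatone or random filterbanks amount to swapping out the filter design). Function name and parameters here are mine, purely illustrative:

```python
import numpy as np
from scipy import signal

def bandpass_energies(x, sr, n_bands=12, fmin=60.0, fmax=8000.0, order=4):
    """Log-spaced Butterworth bandpass bank; returns log energy per band."""
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    energies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = signal.butter(order, [lo, hi], btype="bandpass", fs=sr, output="sos")
        energies.append(np.log(np.sum(signal.sosfilt(sos, x) ** 2) + 1e-10))
    return np.array(energies)

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size)
print(bandpass_energies(x, sr))   # the band containing 440 Hz should dominate
```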

Dynamic dictionaries

See sparse basis dictionaries.

Cochlear activation models

Gah.

Units

ERBs, mels, sones, phons…
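For reference, one common variant of the mel scale (keeping in mind, per the MFCC gripe above, that the scale is not standardised and implementations disagree):

```python
import numpy as np

def hz_to_mel(f):
    """One common mel formula; other implementations use different constants."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

print(hz_to_mel([440.0, 880.0]))    # an octave in Hz is not a constant interval in mels
print(mel_to_hz(hz_to_mel(440.0)))  # round-trips to 440.0
```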

Implementations

There are many options for doing this.

For my part, I find it congenial to use Python and SuperCollider together: Python for the tricky offline computations, SuperCollider for the more “live”, realtime stuff. This feels like it gets me the best of each of those worlds, and especially of both development communities. YMMV.

Spectral peak tracking

Other specialist tools

Large-Scale Content-Based Matching of MIDI and Audio Files:

MIDI files, when paired with corresponding audio recordings, can be used as ground truth for many music information retrieval tasks. We present a system which can efficiently match and align MIDI files to entries in a large corpus of audio content based solely on content, i.e., without using any metadata. The core of our approach is a convolutional network-based cross-modality hashing scheme which transforms feature matrices into sequences of vectors in a common Hamming space. Once represented in this way, we can efficiently perform large-scale dynamic time warping searches to match MIDI data to audio recordings. We evaluate our approach on the task of matching a huge corpus of MIDI files to the Million Song Dataset.

See also Dannenberg’s bibliographies on score following.

mir_eval evaluates MIR metrics.
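For example, given reference and estimated onset times in seconds, onset detection can be scored along these lines (the numbers are made up):

```python
import numpy as np
import mir_eval

reference_onsets = np.array([0.50, 1.00, 1.52, 2.01])   # ground-truth onset times, seconds
estimated_onsets = np.array([0.51, 1.04, 1.50, 2.50])   # detector output

# F-measure with the default 50 ms tolerance window.
f, precision, recall = mir_eval.onset.f_measure(reference_onsets, estimated_onsets)
print(f, precision, recall)
```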

Refs

AbKI95
Abe, T., Kobayashi, T., & Imai, S. (1995) Harmonics tracking and pitch extraction based on instantaneous frequency. In International Conference on Acoustics, Speech, and Signal Processing, 1995. ICASSP-95 (Vol. 1, pp. 756–759 vol.1). DOI.
AlSS16
Alías, F., Socoró, J. C., & Sevillano, X. (2016) A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds. Applied Sciences, 6(5), 143. DOI.
ABMD10
Anglade, A., Benetos, E., Mauch, M., & Dixon, S. (2010) Improving Music Genre Classification Using Automatically Induced Harmony Rules. Journal of New Music Research, 39, 349–361. DOI.
BEWL11
Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
BiPL96
Bigand, E., Parncutt, R., & Lerdahl, F. (1996) Perception of musical tension in short chord sequences: The influence of harmonic function, sensory dissonance, horizontal motion, and musical training. Perception & Psychophysics, 58(1), 125–141. DOI.
BlCH10
Blei, D. M., Cook, P. R., & Hoffman, M. (2010) Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 439–446).
BoBV12
Boulanger-Lewandowski, N., Bengio, Y., & Vincent, P. (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
BJRL16
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M.(2016) Time series analysis: forecasting and control. (Fifth edition.). Hoboken, New Jersey: John Wiley & Sons, Inc
CVVR11
Carabias-Orti, J. J., Virtanen, T., Vera-Candeas, P., Ruiz-Reyes, N., & Canadas-Quesada, F. J.(2011) Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1144–1158. DOI.
Carm13
Carmi, A. Y.(2013) Compressive system identification: Sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770. DOI.
Cart87
Carter, G. C.(1987) Coherence and time delay estimation. Proceedings of the IEEE, 75(2), 236–255. DOI.
ChFS16
Choi, K., Fazekas, G., & Sandler, M. (2016) Explaining Deep Convolutional Neural Networks on Music Classification. arXiv:1607.02444 [Cs].
CCFH67
Cochran, W. T., Cooley, J. W., Favin, D. L., Helms, H. D., Kaenel, R. A., Lang, W. W., … Welch, P. D.(1967) What is the fast Fourier transform?. Proceedings of the IEEE, 55(10), 1664–1674. DOI.
CoLW70
Cooley, J. W., Lewis, P. A. W., & Welch, P. D.(1970) The application of the fast Fourier transform algorithm to the estimation of spectra and cross-spectra. Journal of Sound and Vibration, 12(3), 339–352. DOI.
DeKa07
Demopoulos, R. J., & Katchabaw, M. J.(2007) Music Information Retrieval: A Survey of Issues and Approaches. Technical Report.
DiSc14
Dieleman, S., & Schrauwen, B. (2014) End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6964–6968). IEEE DOI.
DVFK10
Dörfler, M., Velasco, G., Flexer, A., & Klien, V. (2010) Sparse Regression in Time-Frequency Representations of Complex Audio.
DuKL06
Du, P., Kibbe, W. A., & Lin, S. M.(2006) Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics, 22(17), 2059–2065. DOI.
Fitz10
Fitzgerald, D. (2010) Harmonic/percussive separation using median filtering.
FFCE16
Flamary, R., Févotte, C., Courty, N., & Emiya, V. (2016) Optimal spectral transportation with application to music transcription. In arXiv:1609.09799 [cs, stat] (pp. 703–711). Curran Associates, Inc.
FLTZ11
Fu, Z., Lu, G., Ting, K. M., & Zhang, D. (2011) A Survey of Audio-Based Music Classification and Annotation. IEEE Transactions on Multimedia, 13(2), 303–319. DOI.
GlLT09
Glover, J. C., Lazzarini, V., & Timoney, J. (2009) Simpl: A Python library for sinusoidal modelling. In DAFx 09 proceedings of the 12th International Conference on Digital Audio Effects, Politecnico di Milano, Como Campus, Sept. 1-4, Como, Italy (pp. 1–4). Dept. of Electronic Engineering, Queen Mary Univ. of London,
GoDa05
Godsill, S., & Davy, M. (2005) Bayesian computational models for inharmonicity in musical instruments. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005 (pp. 283–286). IEEE DOI.
GóHe04
Gómez, E., & Herrera, P. (2004) Estimating The Tonality Of Polyphonic Audio Files: Cognitive Versus Machine Learning Modelling Strategies. In ISMIR.
GrMK10
Grosche, P., Muller, M., & Kurth, F. (2010) Cyclic Tempogram - a Mid-level Tempo Representation For Music Signals. In 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (pp. 5522–5525). Piscataway, NJ.: IEEE DOI.
GrMS10
Grosche, P., Müller, M., & Sapp, C. S.(2010) What makes beat tracking difficult? a case study on Chopin mazurkas. In Proceedings of the International Conference on Music Information Retrieval (ISMIR 2010).
Harr78
Harris, F. J.(1978) On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1), 51–83.
Herm88
Hermes, D. J.(1988) Measurement of pitch by subharmonic summation. The Journal of the Acoustical Society of America, 83(1), 257–264. DOI.
HCEG16
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., … Wilson, K. (2016) CNN Architectures for Large-Scale Audio Classification. arXiv:1609.09430 [Cs, Stat].
HDYD12
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI.
HoBB10
Hoffman, M., Bach, F. R., & Blei, D. M.(2010) Online learning for latent dirichlet allocation. In advances in neural information processing systems (pp. 856–864).
Iriz01
Irizarry, R. A.(2001) Local Harmonic Estimation in Musical Sound Signals. Journal of the American Statistical Association, 96(454), 357–367. DOI.
JoDa13
Bensoam, J., & Roze, D. (2013) Solving interactions between nonlinear resonators. In Proceedings of the Sound and Music Computing Conference.
KaSH00
Kailath, T., Sayed, A. H., & Hassibi, B. (2000) Linear estimation. . Upper Saddle River, N.J: Prentice Hall
KMBT11
Kalouptsidis, N., Mileounis, G., Babadi, B., & Tarokh, V. (2011) Adaptive algorithms for sparse system identification. Signal Processing, 91(8), 1910–1919. DOI.
KeDP13
Kereliuk, C., Depalle, P., & Pasquier, P. (2013) Audio Interpolation and Morphing via Structured-Sparse Linear Regression.
LaNK87
Lahat, M., Niederjohn, R. J., & Krubsack, D. (1987) A spectral autocorrelation method for measurement of the fundamental frequency of noise-corrupted speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(6), 741–750. DOI.
Ljun99
Ljung, L. (1999) System identification: theory for the user. (2nd ed.). Upper Saddle River, NJ: Prentice Hall PTR
LCHR16
Luo, Y., Chen, Z., Hershey, J. R., Roux, J. L., & Mesgarani, N. (2016) Deep Clustering and Conventional Networks for Music Separation: Stronger Together. arXiv:1611.06265 [Cs, Stat].
MKSW99
Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. (1999) Performance Measures For Information Extraction. In In Proceedings of DARPA Broadcast News Workshop (pp. 249–252).
MaPe14
Marchand, U., & Peeters, G. (2014) The Modulation Scale Spectrum And Its Application To Rhythm-Content Analysis. In DAFX (Digital Audio Effects). Erlangen, Germany
MaPW09
Maxwell, J. B., Pasquier, P., & Whitman, B. (2009) Hierarchical Sequential Memory for Music: A Cognitive Model. In Proceedings of the tenth International Society for Music Information Retrieval Conference (ISMIR 2009) (pp. 429–434).
McEl11
McFee, B., & Ellis, D. P.(2011) Analyzing song structure with spectral clustering. In IEEE conference on Computer Vision and Pattern Recognition (CVPR).
MeHV16
Mesaros, A., Heittola, T., & Virtanen, T. (2016) Metrics for Polyphonic Sound Event Detection. Applied Sciences, 6(6), 162. DOI.
Moor74
Moorer, J. A.(1974) The optimum comb method of pitch period analysis of continuous digitized speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(5), 330–338. DOI.
MEKR11
Müller, M., Ellis, D. P. W., Klapuri, A., & Richard, G. (2011) Signal Processing for Music Analysis. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1088–1110. DOI.
MüDr12
Müller, M., & Driedger, J. (2012) Data-Driven Sound Track Generation. In Multimodal Music Processing (Vol. 3, pp. 175–194). Dagstuhl, Germany: Schloss Dagstuhl—Leibniz-Zentrum für Informatik
MuGa03
Murthy, H. A., & Gadde, V. (2003) The modified group delay function and its application to phoneme recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03) (Vol. 1, p. I-68-71 vol.1). DOI.
NCRG16
Nussbaum-Thom, M., Cui, J., Ramabhadran, B., & Goel, V. (2016) Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. (pp. 390–394). DOI.
Parn97
Parncutt, R. (1997) A model of the perceptual root(s) of a chord accounting for voicing and prevailing tonality. In M. Leman (Ed.), Music, Gestalt, and Computing (pp. 181–199). Springer Berlin Heidelberg
PaMK10
Paulus, J., Müller, M., & Klapuri, A. (2010) Audio-Based Music Structure Analysis. In ISMIR (pp. 625–636). ISMIR
PHMM16
Phan, H., Hertel, L., Maass, M., & Mertins, A. (2016) Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks. In Interspeech 2016.
Pick04
Pickens, J. (2004) Harmonic modeling for polyphonic music retrieval. Citeseer.
PiIl05
Pickens, J., & Iliopoulos, C. S.(2005) Markov Random Fields and Maximum Entropy Modeling for Music Information Retrieval. In ISMIR (pp. 207–214). Citeseer
PlLe65
Plomp, R., & Levelt, W. J.(1965) Tonal consonance and critical bandwidth. The Journal of the Acoustical Society of America, 38(4), 548–560. DOI.
PoLS16
Pons, J., Lidy, T., & Serra, X. (2016) Experimenting with musically motivated convolutional neural networks. In Content-Based Multimedia Indexing (CBMI), 2016 14th International Workshop on (pp. 1–6). IEEE DOI.
RoPl06
Robertson, A. N., & Plumbley, M. D.(2006) Real-time Interactive Musical Systems: An Overview. Proc. of the Digital Music Research Network, Goldsmiths University, London, 65–68.
RoPl07
Robertson, A., & Plumbley, M. (2007) B-Keeper: A Beat-tracker for Live Performance. In Proceedings of the 7th International Conference on New Interfaces for Musical Expression (pp. 234–237). New York, NY, USA: ACM DOI.
RoPl13
Robertson, A., & Plumbley, M. D.(2013) Synchronizing Sequencing Software to a Live Drummer. Computer Music Journal, 37(2), 46–60. DOI.
RoSD13
Robertson, A., Stark, A., & Davies, M. E.(2013) Percussive beat tracking using real-time median filtering. Presented at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
RoSP11
Robertson, A., Stark, A. M., & Plumbley, M. D.(2011) Real-Time Visual Beat Tracking Using a Comb Filter Matrix.
Robe11
Robertson, A. N. (2011) A Bayesian approach to drum tracking.
RoCh97
Rochebois, T., & Charbonneau, G. (1997) Cross-synthesis using interverted principal harmonic sub-spaces. In M. Leman (Ed.), Music, Gestalt, and Computing (pp. 375–385). Springer Berlin Heidelberg
SaLi16
Sainath, T. N., & Li, B. (2016) Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks. Submitted to Proc. Interspeech.
SGER14
Salamon, J., Gomez, E., Ellis, D. P., & Richard, G. (2014) Melody Extraction from Polyphonic Music Signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2), 118–134. DOI.
SaSG13
Salamon, J., Serrà, J., & Gómez, E. (2013) Tonal representations for music retrieval: from version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1), 45–58. DOI.
ScBö14
Schlüter, J., & Böck, S. (2014) Improved musical onset detection with Convolutional Neural Networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6979–6983). DOI.
ScKi11
Schmidt, E. M., & Kim, Y. E.(2011) Learning emotion-based acoustic features with deep belief networks. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 65–68). DOI.
ScPu11
Scholler, S., & Purwins, H. (2011) Sparse Approximations for Drum Sound Classification. IEEE Journal of Selected Topics in Signal Processing, 5(5), 933–940. DOI.
SBRM08
Sergé, A., Bertaux, N., Rigneault, H., & Marguet, D. (2008) Dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. Nature Methods, 5(8), 687–694. DOI.
SCBH12
Serrà, J., Corral, Á., Boguñá, M., Haro, M., & Arcos, J. L.(2012) Measuring the Evolution of Contemporary Western Popular Music. Scientific Reports, 2. DOI.
SmLe04
Smith, E. C., & Lewicki, M. S.(2004) Learning efficient auditory codes using spikes predicts cochlear filters. In Advances in Neural Information Processing Systems (pp. 1289–1296).
SmLe06
Smith, E. C., & Lewicki, M. S.(2006) Efficient auditory coding. Nature, 439(7079), 978–982. DOI.
SmLe05
Smith, E., & Lewicki, M. S.(2005) Efficient Coding of Time-Relative Structure Using Spikes. Neural Computation, 17(1), 19–45. DOI.
SmEl09
Smyth, T., & Elmore, A. R.(2009) Explorations in convolutional synthesis. In Proceedings of the 6th Sound and Music Computing Conference, Porto, Portugal (pp. 23–25).
TeSS82
Terhardt, E., Stoll, G., & Seewann, M. (1982) Algorithm for extraction of pitch and pitch salience from complex tonal signals. The Journal of the Acoustical Society of America, 71(3), 679–688. DOI.
ThHK16
Thickstun, J., Harchaoui, Z., & Kakade, S. (2016) Learning Features of Music from Scratch. arXiv:1611.09827 [Cs, Stat].
Welc67
Welch, P. D.(1967) The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2), 70–73.
YaHY09
Yang, C., He, Z., & Yu, W. (2009) Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics, 10, 4. DOI.
YoGo12
Yoshii, K., & Goto, M. (2012) Infinite Composite Autoregressive Models for Music Signal Analysis.