
Machine listening

See also musical corpora, musical metrics, synchronisation, sparse basis dictionaries, speech recognition, learning gamelan, analysis/resynthesis, and whatever other machine listening posts I have forgotten.

I’m not going to talk about speech recognition here; that boat is full.

Machine listening: machine learning from audio, especially music. Everything from that damn Shazam app, to teaching computers to recognise speech, to doing artsy shit with sound. I’m mostly concerned with the third one. Statistics, features, descriptors, metrics, kernels and affinities, and the spaces and topologies they induce, for musical audio, e.g. your MP3 pop song library. This has considerable overlap with musical metrics, but there I start from scores and transcriptions.

Polyphony and its problems.

Approximate logarithmic perception and its problems.

Should I create a separate psychoacoustics notebook?

Interesting descriptors/features

Audio summaries that attempt to turn raw signals into useful feature vectors reflective of human perception of them. This is a huge industry, because it makes audio convenient for transmission (hello, mobile telephony and MP3), but it’s also useful for understanding speech, music etc. There are as many descriptors as there are IEEE conference slots.

See AlSS16 for an intimidatingly comprehensive summary.

I’m especially interested in

  1. invertible ones, for analysis/resynthesis. If not analytically invertible, convexity would do.
  2. differentiable ones, for leveraging artificial neural infrastructure for easy optimisation.
  3. ones that avoid the windowed DTFT, because it sounds awful in the resynthesis phase and is lazy.
  4. ones that capture both the harmonic and percussive parts.

Also, ones that can encode noisiness in the signal as well as harmonicity…? I guess I should read AlSS16.

Deep neural networks

See, e.g., Jordi Pons’ spectrogram CNN discussion for an introduction to the kinds of features a neural network might “discover” in audio recognition tasks.

There is some interesting stuff here; for example, Dieleman and Schrauwen (DiSc14) show that convolutional neural networks trained on raw audio (i.e. not spectrograms) for music classification recover Mel-like frequency bands. Thickstun et al (ThHK16) do some similar work.

And Keunwoo Choi shows that you can listen to what they learn.

Sparse comb filters

Differentiable! Conditionally invertible! Handy for syncing.

Moorer (Moor74) proposed this for harmonic purposes, but Robertson et al (RoSP11) have shown it to be handy for rhythm.
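
As a sketch of why comb filters are attractive here (this is my own minimal illustration, not Moorer’s or Robertson et al.’s implementation): a feedback comb filter resonates when its delay matches a periodicity in the input, so scanning candidate delays over, say, an onset-strength envelope gives a crude pitch-period or tempo score. Names and parameters below are illustrative.

    import numpy as np

    def comb_filter_energy(x, period, alpha=0.9):
        """Feedback comb filter y[n] = x[n] + alpha * y[n - period].
        The output energy peaks when `period` matches a periodicity in x
        (x could be a raw signal, or an onset-strength envelope for rhythm)."""
        y = np.zeros(len(x))
        for n in range(len(x)):
            y[n] = x[n] + (alpha * y[n - period] if n >= period else 0.0)
        return float(np.sum(y ** 2))

    def comb_periodogram(x, periods):
        """Score each candidate period (in samples or frames)."""
        return np.array([comb_filter_energy(x, p) for p in periods])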

Autocorrelation features

Measure the signal’s full or partial autocorrelation.
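
For concreteness, a minimal numpy sketch of a framewise autocorrelation feature (the normalisation and lag cutoff are my own choices):

    import numpy as np

    def autocorrelation_features(frame, n_lags=64):
        """Autocorrelation of one (windowed) frame, normalised so lag 0 == 1."""
        frame = frame - frame.mean()
        acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        acf = acf / (acf[0] + 1e-12)     # guard against silent frames
        return acf[:n_lags]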

Linear predictive coefficients

How do these transform?

I think this always hinges on Skorohod embedding.
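
For reference, here is a hedged sketch of the textbook autocorrelation-method LPC via the Levinson-Durbin recursion; nothing here is specific to any particular toolkit, and the function name is mine.

    import numpy as np

    def lpc_coefficients(frame, order=12):
        """Levinson-Durbin recursion on the frame's autocorrelation.
        Returns a[0..order] with a[0] = 1, such that
        x[n] is approximated by -sum_{k>=1} a[k] * x[n - k]."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0] + 1e-12
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            a[i] = k
            err *= (1.0 - k * k)
        return a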

Cepstra

Classic, but inconvenient to invert.
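
A minimal sketch of the real cepstrum, which is just the inverse FFT of the log magnitude spectrum (the windowing choice is mine):

    import numpy as np

    def real_cepstrum(frame):
        """Real cepstrum of one frame: IFFT of the log magnitude spectrum.
        Low quefrencies describe the spectral envelope; a peak at a higher
        quefrency marks the pitch period of a harmonic sound."""
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
        log_mag = np.log(np.abs(spectrum) + 1e-12)
        return np.fft.irfft(log_mag, n=len(frame))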

MFCC

Mel-frequency Cepstral Coefficients, or the Mel cepstral transform. Take the perfectly respectable-if-fiddly cepstrum and make it really messy, with a vague psychoacoustic model, in the hope that the distinctions in the resulting “MFCC” might correspond to human perceptual distinctions.

Folk wisdom holds that MFCC features are Eurocentric, in that they destroy, or at least obscure, tonal language features. Ubiquitous, but inconsistently implemented; MFCCs are generally not the same across implementations, probably because the Mel scale is itself not universally standardised.

Aside from being loosely psychoacoustically motivated features, what do the coefficients of an MFCC specifically tell me?

Hmm. If I have got this right, these are “generic features”; things we can use in machine learning because we hope they project the spectrum into a space which approximately preserves psychoacoustic dissimilarity, whilst having little redundancy.

This heuristic pro is weighed against the practical cons that they are not practically differentiable, nor invertible except by heroic computational effort, nor humanly interpretable, and that they are riven with poorly supported, somewhat arbitrary steps. (The cepstrum of the Mel-frequency spectrogram is a weird thing that no longer picks out harmonics in the way that God and Tukey intended.)
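
In practice one rarely computes these by hand; a typical librosa call looks like the sketch below (the input path is hypothetical, and, per the standardisation caveat above, the numbers will not match other toolkits exactly).

    import librosa

    # hypothetical input file; any decodable audio will do
    y, sr = librosa.load("pop_song.mp3", duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    print(mfcc.shape)   # (13, n_frames)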

Filterbanks

Including bandpasses, gammatones… random filterbanks?
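
As a concrete instance of the filterbank-as-matrix view, here is a Mel filterbank applied to a power spectrogram with librosa (the synthetic test tone and parameter choices are mine):

    import numpy as np
    import librosa

    sr, n_fft = 22050, 2048
    y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)  # one second of A440

    # Triangular Mel filterbank: a (n_mels, 1 + n_fft//2) matrix of weights.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=40)

    # Pooling a power spectrogram into Mel bands is a single matrix multiply.
    S = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2
    mel_spec = mel_fb @ S                                    # (40, n_frames)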

Dynamic dictionaries

See sparse basis dictionaries.

Cochlear activation models

Gah.

Units

ERBs, Mels, Sones, Phons…
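
The Hz↔Mel conversion is simple enough to write down, which also makes the standardisation problem visible: the constants below are the HTK-style formula, and other implementations (e.g. Slaney’s, which is linear below 1 kHz) differ, which is one reason MFCCs disagree across toolkits.

    import numpy as np

    def hz_to_mel(f):
        """HTK-style Mel scale; other variants exist."""
        return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

    # An octave in Hz is not an even step in Mels:
    print(hz_to_mel([440.0, 880.0, 1760.0]))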

Implementations

Here are some options for doing it:

Musicbricks

musicbricks is an umbrella project to unify (sometimes post hoc) many of the efforts mentioned individually below, plus a few other new ones.

  • Fraunhofer ML software (C++) is part of this project, including such things as

    • Real-Time Pitch Detection
    • MusicBricksTranscriber
    • Goatify Pdf
    • Time Stretch Pitch Shift Library

Librosa

I have been using LibROSA a lot recently, and I highly recommend it, especially if your pipeline already includes Python. Sleek, minimal design, with a curated set of algorithms (compare and contrast with the chaos of the vamp plugin ecosystem). Python-based, but fast enough because it uses the numpy numerical libraries. The API design meshes well with scikit-learn, the de facto Python machine learning standard, and it’s flexible and hackable.
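
A minimal usage sketch (the input path is hypothetical; the calls are stock librosa):

    import librosa

    y, sr = librosa.load("pop_song.mp3", sr=22050, mono=True)   # hypothetical file

    # onset-strength envelope and beat tracking
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

    # one 12-dimensional chroma vector per frame
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    print(tempo, beat_frames.shape, chroma.shape)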

  • see also talkbox for a nice-looking but abandoned (?) alternative, which is nonetheless worth it for Alexander Schindler’s lovely MIR lecture based around it.
  • amen is a remix program built on librosa

Sonic Annotator

Sonic Annotator seems to be about cobbling together vamp plugins for batch analysis. That is more steps than I want in an already clunky workflow for my current projects. It’s also more about RDF ontologies, where I want matrices of floats.

Essentia

For C++ and Python there is Essentia, as seen in Freesound, which is a high recommendation IMO. (Watch out, the source download is enormous; just shy of half a gigabyte.) It features Python and vamp integration, and a great many algorithms. I haven’t given it a fair chance because LibROSA has been such a joy to use. However, the intriguing Dunya project is based on it.

Echonest

Echonest is a proprietary system that was used to generate the Million Song Dataset. It seems to be gradually decaying, and was bought up by Spotify. It has great demos, such as autocanonisation.

MARSYAS

TODO

RP extract

TODO

phonological corpus tools

Speech-focussed: Phonological Corpus Tools is another research library for largeish corpus analysis, similarity classification and so on.

Metamorph, smstools

John Glover, SoundCloud staffer, has several analysis libraries culminating in Metamorph:

a new open source library for performing high-level sound transformations based on a sinusoids plus noise plus transients model. It is written in C++, can be built as both a Python extension module and a Csound opcode, and currently runs on Mac OS X and Linux.

It is designed to work primarily on monophonic, quasi-harmonic sound sources and can be used in a non-real-time context to process pre-recorded sound files or can operate in a real-time (streaming) mode.

See also the related spectral modeling and synthesis package, smstools.

Sinusoidal modelling with simplsound

“Sinusoidal modelling”: Simplsound (GlLT09) is a Python implementation of that technique.

SCMIR

If you use a lot of Supercollider, you might like SCMIR, a native supercollider thingy. It has the virtues that

  • it can run in realtime, which is lovely.
  • comes with lots of neato bells and whistles, like the author’s quirky
    breakbeat cut library.

It has the vices that

  • It runs in Supercollider, which is a backwater language unserviced by modern development infrastructure, or decent machine learning libraries, and

  • a fraught development process; I can’t even link directly to it because the author doesn’t give it its own anchor tag, let alone a whole web page or source code repository. The release schedule is opaque and sporadic. Consequently, it is effectively a lone guy’s pet project, rather than an active community endeavour.

    That is to say this is the Etsy sweater of code knitting. If on balance this sounds like a good deal to you, you can download SCMIR from somewhere or other on Nick Collins’ homepage.

Other specialist tools

Large-Scale Content-Based Matching of Midi and Audio Files:

MIDI files, when paired with corresponding audio recordings, can be used as ground truth for many music information retrieval tasks. We present a system which can efficiently match and align MIDI files to entries in a large corpus of audio content based solely on content, i.e., without using any metadata. The core of our approach is a convolutional network-based cross-modality hashing scheme which transforms feature matrices into sequences of vectors in a common Hamming space. Once represented in this way, we can efficiently perform large-scale dynamic time warping searches to match MIDI data to audio recordings. We evaluate our approach on the task of matching a huge corpus of MIDI files to the Million Song Dataset.

See also Dannenberg’s bibliographies on score following.

mir_eval evaluates MIR metrics.

Refs

AbKI95
Abe, T., Kobayashi, T., & Imai, S. (1995) Harmonics tracking and pitch extraction based on instantaneous frequency. In International Conference on Acoustics, Speech, and Signal Processing, 1995. ICASSP-95 (Vol. 1, pp. 756–759 vol.1). DOI.
AlSS16
Alías, F., Socoró, J. C., & Sevillano, X. (2016) A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds. Applied Sciences, 6(5), 143. DOI.
ABMD10
Anglade, A., Benetos, E., Mauch, M., & Dixon, S. (2010) Improving Music Genre Classification Using Automatically Induced Harmony Rules. Journal of New Music Research, 39, 349–361. DOI.
BEWL11
Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
BiPL96
Bigand, E., Parncutt, R., & Lerdahl, F. (1996) Perception of musical tension in short chord sequences: The influence of harmonic function, sensory dissonance, horizontal motion, and musical training. Perception & Psychophysics, 58(1), 125–141. DOI.
BlCH10
Blei, D. M., Cook, P. R., & Hoffman, M. (2010) Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 439–446).
BoBV12
Boulanger-Lewandowski, N., Bengio, Y., & Vincent, P. (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
BJRL16
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M.(2016) Time series analysis: forecasting and control. (Fifth edition.). Hoboken, New Jersey: John Wiley & Sons, Inc
CVVR11
Carabias-Orti, J. J., Virtanen, T., Vera-Candeas, P., Ruiz-Reyes, N., & Canadas-Quesada, F. J.(2011) Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1144–1158. DOI.
Carm13
Carmi, A. Y.(2013) Compressive system identification: Sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770. DOI.
Cart87
Carter, G. C.(1987) Coherence and time delay estimation. Proceedings of the IEEE, 75(2), 236–255. DOI.
ChFS16
Choi, K., Fazekas, G., & Sandler, M. (2016) Explaining Deep Convolutional Neural Networks on Music Classification. arXiv:1607.02444 [Cs].
CCFH67
Cochran, W. T., Cooley, J. W., Favin, D. L., Helms, H. D., Kaenel, R. A., Lang, W. W., … Welch, P. D.(1967) What is the fast Fourier transform?. Proceedings of the IEEE, 55(10), 1664–1674. DOI.
CoLW70
Cooley, J. W., Lewis, P. A. W., & Welch, P. D.(1970) The application of the fast Fourier transform algorithm to the estimation of spectra and cross-spectra. Journal of Sound and Vibration, 12(3), 339–352. DOI.
DeKa07
Demopoulos, R. J., & Katchabaw, M. J.(2007) Music Information Retrieval: A Survey of Issues and Approaches. Technical Report
DiSc14
Dieleman, S., & Schrauwen, B. (2014) End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6964–6968). IEEE DOI.
DVFK10
Dörfler, M., Velasco, G., Flexer, A., & Klien, V. (2010) Sparse Regression in Time-Frequency Representations of Complex Audio.
DuKL06
Du, P., Kibbe, W. A., & Lin, S. M.(2006) Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics, 22(17), 2059–2065. DOI.
Fitz10
Fitzgerald, D. (2010) Harmonic/percussive separation using median filtering.
FFCE16
Flamary, R., Févotte, C., Courty, N., & Emiya, V. (2016) Optimal spectral transportation with application to music transcription. In arXiv:1609.09799 [cs, stat] (pp. 703–711). Curran Associates, Inc.
FLTZ11
Fu, Z., Lu, G., Ting, K. M., & Zhang, D. (2011) A Survey of Audio-Based Music Classification and Annotation. IEEE Transactions on Multimedia, 13(2), 303–319. DOI.
GlLT09
Glover, J. C., Lazzarini, V., & Timoney, J. (2009) Simpl: A Python library for sinusoidal modelling. In DAFx 09 proceedings of the 12th International Conference on Digital Audio Effects, Politecnico di Milano, Como Campus, Sept. 1-4, Como, Italy (pp. 1–4). Dept. of Electronic Engineering, Queen Mary Univ. of London,
GoDa05
Godsill, S., & Davy, M. (2005) Bayesian computational models for inharmonicity in musical instruments. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005 (pp. 283–286). IEEE DOI.
GóHe04
Gómez, E., & Herrera, P. (2004) Estimating The Tonality Of Polyphonic Audio Files: Cognitive Versus Machine Learning Modelling Strategies. In ISMIR.
GrMK10
Grosche, P., Muller, M., & Kurth, F. (2010) Cyclic Tempogram - a Mid-level Tempo Representation For Music Signals. In 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (pp. 5522–5525). Piscataway, NJ.: IEEE DOI.
GrMS10
Grosche, Peter, Müller, M., & Sapp, C. S.(2010) What makes beat tracking difficult? a case study on Chopin mazurkas. In Proceedings of the International Conference on Music Information Retrieval (ISMIR 2010).
Harr78
Harris, F. J.(1978) On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1), 51–83.
Herm88
Hermes, D. J.(1988) Measurement of pitch by subharmonic summation. The Journal of the Acoustical Society of America, 83(1), 257–264. DOI.
HCEG16
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., … Wilson, K. (2016) CNN Architectures for Large-Scale Audio Classification. arXiv:1609.09430 [Cs, Stat].
HDYD12
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI.
HoBB10
Hoffman, M., Bach, F. R., & Blei, D. M.(2010) Online learning for latent dirichlet allocation. In advances in neural information processing systems (pp. 856–864).
Iriz01
Irizarry, R. A.(2001) Local Harmonic Estimation in Musical Sound Signals. Journal of the American Statistical Association, 96(454), 357–367. DOI.
JoDa13
Joël Bensoam, & David Roze. (2013) Solving interactions between nonlinear resonators. In Proceedings of the Sound and Music Computing Conference.
KaSH00
Kailath, T., Sayed, A. H., & Hassibi, B. (2000) Linear estimation. Upper Saddle River, N.J: Prentice Hall
KMBT11
Kalouptsidis, N., Mileounis, G., Babadi, B., & Tarokh, V. (2011) Adaptive algorithms for sparse system identification. Signal Processing, 91(8), 1910–1919. DOI.
KeDP13
Kereliuk, C., Depalle, P., & Pasquier, P. (2013) Audio Interpolation and Morphing via Structured-Sparse Linear Regression.
LaNK87
Lahat, M., Niederjohn, R. J., & Krubsack, D. (1987) A spectral autocorrelation method for measurement of the fundamental frequency of noise-corrupted speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(6), 741–750. DOI.
Ljun99
Ljung, L. (1999) System identification: theory for the user. (2nd ed.). Upper Saddle River, NJ: Prentice Hall PTR
LCHR16
Luo, Y., Chen, Z., Hershey, J. R., Roux, J. L., & Mesgarani, N. (2016) Deep Clustering and Conventional Networks for Music Separation: Stronger Together. arXiv:1611.06265 [Cs, Stat].
MKSW99
Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. (1999) Performance Measures For Information Extraction. In In Proceedings of DARPA Broadcast News Workshop (pp. 249–252).
MaPe14
Marchand, U., & Peeters, G. (2014) The Modulation Scale Spectrum And Its Application To Rhythm-Content Analysis. In DAFX (Digital Audio Effects). Erlangen, Germany
MaPW09
Maxwell, J. B., Pasquier, P., & Whitman, B. (2009) Hierarchical Sequential Memory for Music: A Cognitive Model. In Proceedings of the tenth International Society for Music Information Retrieval Conference (ISMIR 2009) (pp. 429–434).
McEl11
McFee, B., & Ellis, D. P.(2011) Analyzing song structure with spectral clustering. In IEEE conference on Computer Vision and Pattern Recognition (CVPR).
MeHV16
Mesaros, A., Heittola, T., & Virtanen, T. (2016) Metrics for Polyphonic Sound Event Detection. Applied Sciences, 6(6), 162. DOI.
Moor74
Moorer, J. A. (1974) The optimum comb method of pitch period analysis of continuous digitized speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(5), 330–338. DOI.
MEKR11
Müller, M., Ellis, D. P. W., Klapuri, A., & Richard, G. (2011) Signal Processing for Music Analysis. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1088–1110. DOI.
MüDr12
Müller, Meinard, & Driedger, J. (2012) Data-Driven Sound Track Generation. In Multimodal Music Processing (Vol. 3, pp. 175–194). Dagstuhl, Germany: Schloss Dagstuhl—Leibniz-Zentrum für Informatik
MuGa03
Murthy, H. A., & Gadde, V. (2003) The modified group delay function and its application to phoneme recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03) (Vol. 1, p. I-68-71 vol.1). DOI.
NCRG16
Nussbaum-Thom, M., Cui, J., Ramabhadran, B., & Goel, V. (2016) Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. (pp. 390–394). DOI.
Parn97
Parncutt, R. (1997) A model of the perceptual root(s) of a chord accounting for voicing and prevailing tonality. In M. Leman (Ed.), Music, Gestalt, and Computing (pp. 181–199). Springer Berlin Heidelberg
PaMK10
Paulus, J., Müller, M., & Klapuri, A. (2010) Audio-Based Music Structure Analysis. In ISMIR (pp. 625–636). ISMIR
PHMM16
Phan, H., Hertel, L., Maass, M., & Mertins, A. (2016) Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks. In Interspeech 2016.
Pick04
Pickens, J. (2004) Harmonic modeling for polyphonic music retrieval. Citeseer
PiIl05
Pickens, J., & Iliopoulos, C. S.(2005) Markov Random Fields and Maximum Entropy Modeling for Music Information Retrieval. In ISMIR (pp. 207–214). Citeseer
PlLe65
Plomp, R., & Levelt, W. J.(1965) Tonal consonance and critical bandwidth. The Journal of the Acoustical Society of America, 38(4), 548–560. DOI.
PoLS16
Pons, J., Lidy, T., & Serra, X. (2016) Experimenting with musically motivated convolutional neural networks. In Content-Based Multimedia Indexing (CBMI), 2016 14th International Workshop on (pp. 1–6). IEEE DOI.
RoPl06
Robertson, A. N., & Plumbley, M. D.(2006) Real-time Interactive Musical Systems: An Overview. Proc. of the Digital Music Research Network, Goldsmiths University, London, 65–68.
RoPl07
Robertson, A., & Plumbley, M. (2007) B-Keeper: A Beat-tracker for Live Performance. In Proceedings of the 7th International Conference on New Interfaces for Musical Expression (pp. 234–237). New York, NY, USA: ACM DOI.
RoPl13
Robertson, A., & Plumbley, M. D.(2013) Synchronizing Sequencing Software to a Live Drummer. Computer Music Journal, 37(2), 46–60. DOI.
RoSD13
Robertson, A., Stark, A., & Davies, M. E.(2013) Percussive Beat tracking using real-time median filtering. Presented at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
RoSP11
Robertson, A., Stark, A. M., & Plumbley, M. D.(2011) Real-Time Visual Beat Tracking Using a Comb Filter Matrix.
Robe11
Robertson, Andrew N. (2011) A Bayesian approach to drum tracking.
RoCh97
Rochebois, T., & Charbonneau, G. (1997) Cross-synthesis using interverted principal harmonic sub-spaces. In M. Leman (Ed.), Music, Gestalt, and Computing (pp. 375–385). Springer Berlin Heidelberg
SaLi16
Sainath, T. N., & Li, B. (2016) Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks. Submitted to Proc. Interspeech.
SGER14
Salamon, J., Gomez, E., Ellis, D. P., & Richard, G. (2014) Melody Extraction from Polyphonic Music Signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2), 118–134. DOI.
SaSG13
Salamon, J., Serrà, J., & Gómez, E. (2013) Tonal representations for music retrieval: from version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1), 45–58. DOI.
ScBö14
Schlüter, J., & Böck, S. (2014) Improved musical onset detection with Convolutional Neural Networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6979–6983). DOI.
ScKi11
Schmidt, E. M., & Kim, Y. E.(2011) Learning emotion-based acoustic features with deep belief networks. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 65–68). DOI.
ScPu11
Scholler, S., & Purwins, H. (2011) Sparse Approximations for Drum Sound Classification. IEEE Journal of Selected Topics in Signal Processing, 5(5), 933–940. DOI.
SBRM08
Sergé, A., Bertaux, N., Rigneault, H., & Marguet, D. (2008) Dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. Nature Methods, 5(8), 687–694. DOI.
SCBH12
Serrà, J., Corral, Á., Boguñá, M., Haro, M., & Arcos, J. L.(2012) Measuring the Evolution of Contemporary Western Popular Music. Scientific Reports, 2. DOI.
SmLe04
Smith, E. C., & Lewicki, M. S.(2004) Learning efficient auditory codes using spikes predicts cochlear filters. In Advances in Neural Information Processing Systems (pp. 1289–1296).
SmLe06
Smith, E. C., & Lewicki, M. S.(2006) Efficient auditory coding. Nature, 439(7079), 978–982. DOI.
SmLe05
Smith, E., & Lewicki, M. S.(2005) Efficient Coding of Time-Relative Structure Using Spikes. Neural Computation, 17(1), 19–45. DOI.
SmEl09
Smyth, T., & Elmore, A. R.(2009) Explorations in convolutional synthesis. In Proceedings of the 6th Sound and Music Computing Conference, Porto, Portugal (pp. 23–25).
TeSS82
Terhardt, E., Stoll, G., & Seewann, M. (1982) Algorithm for extraction of pitch and pitch salience from complex tonal signals. The Journal of the Acoustical Society of America, 71(3), 679–688. DOI.
ThHK16
Thickstun, J., Harchaoui, Z., & Kakade, S. (2016) Learning Features of Music from Scratch. arXiv:1611.09827 [Cs, Stat].
Welc67
Welch, P. D.(1967) The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2), 70–73.
YaHY09
Yang, C., He, Z., & Yu, W. (2009) Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics, 10, 4. DOI.
YoGo12
Yoshii, K., & Goto, M. (2012) Infinite Composite Autoregressive Models for Music Signal Analysis.