See also musical corpora, musical metrics synchronisation, sparse basis dictionaries, speech recognition, learning gamelan, analysis/resynthesis, whatever other machine listening posts I forgot.
I’m not going to talk about speech recognition here; that boat is full.
Machine listening: machine learning, from audio. Everything from that Shazam app doohickey, to teaching computers to recognise speech, to doing artsy things with sound. I’m mostly concerned with the third one. Statistics, features, descriptors, metrics, kernels and affinities and the spaces and topologies they induce, for musical audio, e.g. your MP3 pop song library. This has considerable overlap with musical metrics, but there I start from scores and transcriptions.
Polyphony and its problems.
Approximate logarithmic perception and its problems.
Should I create a separate psychoacoustics notebook? Yes.
Audio summaries that attempt to turn raw signals into useful feature vectors reflective of human perception of them. This is a huge industry, because it makes audio convenient for transmission (hello, mobile telephony and MP3), but it’s also useful for understanding speech, music etc. There are as many descriptors as there are IEEE conference slots.
See AlSS16 for an intimidatingly comprehensive summary.
I’m especially interested in
invertible ones, for analysis/resynthesis. If not analytically invertible, convexity would do.
Ones that avoid windowed DTFT, because it sounds awful in the resynthesis phase and is lazy.
Ones that capture both the harmonic and percussive parts.
Also, ones that can encode noisiness in the signal as well as harmonicity…? I guess I should read AlSS16.
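On invertibility, a concrete baseline: the windowed transform I complain about below is at least exactly invertible so long as you keep the complex phase, which is the bar any fancier analysis/resynthesis feature has to clear. A throwaway numpy/scipy sketch (signal and parameters are arbitrary choices of mine):

```python
import numpy as np
from scipy.signal import stft, istft

# A test signal: one second of a 440 Hz sine at 22.05 kHz.
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)

# Analysis: the windowed STFT.
f, times, Z = stft(y, fs=sr, nperseg=1024)

# Resynthesis: the inverse STFT reconstructs the signal (up to
# numerical error) because we kept the phase; magnitude-only
# features forfeit this and need tricks like Griffin-Lim.
_, y_hat = istft(Z, fs=sr, nperseg=1024)

err = np.max(np.abs(y - y_hat[: len(y)]))
print(err)  # near machine precision
```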
Deep neural networks
See, e.g. Jordi Pons’ Spectrogram CNN discussion for some introductions to the kind of features a neural network might “discover” in audio recognition tasks.
There is some interesting stuff here; for example, Dieleman and Schrauwen (DiSc14) show that convolutional neural networks trained on raw audio (i.e. not spectrograms) for music classification recover Mel-like frequency bands. Thickstun et al. (ThHK17) do some similar work.
And Keunwoo Choi shows that you can listen to what they learn.
Sparse comb filters
Differentiable! Conditionally invertible! Handy for syncing.
Moorer (Moor74) proposed this for harmonic purposes, but Robertson et al (RoSP11) have shown it to be handy for rhythm.
Measure the signal’s full or partial autocorrelation.
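A sketch of that, on a synthetic two-harmonic tone (frequencies and sample rate are arbitrary choices of mine): compute the full autocorrelation via the FFT (Wiener–Khinchin), and the first big peak after lag zero lands on the fundamental period.

```python
import numpy as np

sr = 8000
f0 = 200.0  # true fundamental, Hz
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)

# Full autocorrelation via the FFT (Wiener-Khinchin theorem).
n = len(y)
spectrum = np.fft.rfft(y, 2 * n)
acf = np.fft.irfft(np.abs(spectrum) ** 2)[:n]
acf /= acf[0]

# The first major peak after lag 0 sits at the fundamental period,
# even though the second harmonic is also present.
min_lag = int(sr / 1000)  # ignore implausibly short lags (> 1 kHz)
peak = min_lag + np.argmax(acf[min_lag:])
print(sr / peak)  # ~ 200 Hz
```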
Linear Predictive Coefficients
How do these transform? Framing the signal as an all-pole or all-zeros model might be useful, but the objective has many maxima.
Classic, but inconvenient to invert.
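For concreteness, here is the classic autocorrelation (Yule–Walker) route on a synthetic all-pole signal; the AR coefficients and signal length are my own arbitrary choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

rng = np.random.default_rng(0)

# Synthesise an AR(2) ("all-pole") signal:
# y[t] = a1*y[t-1] + a2*y[t-2] + noise, with stable poles.
a_true = np.array([1.3, -0.6])
n = 20000
y = np.zeros(n)
e = rng.standard_normal(n)
for t in range(2, n):
    y[t] = a_true[0] * y[t - 1] + a_true[1] * y[t - 2] + e[t]

# LPC by the autocorrelation method: estimate the autocorrelation
# sequence, then solve the Toeplitz normal (Yule-Walker) equations.
order = 2
r = np.array([y[: n - k] @ y[k:] for k in range(order + 1)]) / n
a_hat = solve_toeplitz(r[:order], r[1 : order + 1])
print(a_hat)  # ~ [1.3, -0.6]
```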
Mel-frequency Cepstral Coefficients, or Mel Cepstral transform. Take the perfectly respectable-if-fiddly cepstrum and make it really messy, with a vague psychoacoustic model, in the hope that the distinctions in the resulting “MFCC” might correspond to human perceptual distinctions.
Folk wisdom holds that MFCC features are Eurocentric, in that they destroy, or at least obscure, tonal language features. Ubiquitous, but inconsistently implemented; MFCCs are generally not the same across implementations, probably because the Mel scale is itself not universally standardised.
Aside from being loosely psychoacoustically motivated features, what do the coefficients of an MFCC specifically tell me?
Hmm. If I have got this right, these are “generic features”; things we can use in machine learning because we hope they project the spectrum into a space which approximately preserves psychoacoustic dissimilarity, whilst having little redundancy.
This heuristic pro is weighed against the practical cons that they are not practically differentiable, nor invertible except by heroic computational effort, nor humanly interpretable, and they are riven with poorly-supported, somewhat arbitrary steps. (The cepstrum of the Mel-frequency spectrogram is a weird thing that no longer picks out harmonics in the way that God and Tukey intended.)
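To make that “weird thing” concrete, here is a stripped-down single-frame MFCC sketch — one of many possible variants, which is rather the point about inconsistent implementations. All parameter choices here (filter count, FFT size, the mel formula) are mine, not canonical:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_mels=26, n_ceps=13):
    """One frame of MFCCs: power spectrum -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = mel_pts[i : i + 3]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(up, down), 0, None)
    logmel = np.log(fb @ power + 1e-10)
    # The "cepstrum of the mel spectrogram": a DCT decorrelates
    # the log filterbank energies; keep the low-order coefficients.
    return dct(logmel, type=2, norm="ortho")[:n_ceps]

sr = 22050
t = np.arange(2048) / sr
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), sr)
print(coeffs.shape)  # (13,)
```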
Including bandpasses, Gammatones… random filterbanks?
See sparse basis dictionaries.
Cochlear activation models
ERBs, Mels, Sones, Phons…
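These scales are just warped frequency (or loudness) axes. A sketch of two of the conversions — note the mel formula here is the common O’Shaughnessy variant, and others exist, which is partly why MFCC implementations disagree:

```python
import numpy as np

def hz_to_erb_number(f):
    """Glasberg & Moore (1990) ERB-number ("Cam") scale."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def erb_number_to_hz(e):
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def hz_to_mel(f):
    """The common mel formula; calibrated so 1000 Hz ~ 1000 mel."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

freqs = np.array([100.0, 440.0, 1000.0, 4000.0])
print(hz_to_mel(freqs))
print(hz_to_erb_number(freqs))
# Round trip sanity check:
print(erb_number_to_hz(hz_to_erb_number(freqs)))  # ~ freqs
```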
Here are some options for doing machine listening.
musicbricks is an umbrella project to unify (post hoc) many of the efforts mentioned individually below, plus a few other new ones.
Fraunhofer ML software (C++) is part of this project, including such things as
- Real-Time Pitch Detection
- Goatify Pdf
- Time Stretch Pitch Shift Library
LibROSA I have been using a lot recently, and I highly recommend it, especially if your pipeline already includes python. Sleek minimal design, with a curated set of algorithms (compare and contrast with the chaos of the vamp plugins ecosystem). Python-based, but fast enough because it uses the numpy numerical libraries well. The API design meshes well with Scikit-learn, the de facto python machine learning standard, and it’s flexible and hackable.
amen is a remix program built on librosa
SonicAnnotator seems to be about cobbling together vamp plugins for batch analysis. That is more steps than I want in an already clunky workflow for my current projects. It’s also more about RDF ontologies, where I want matrices of floats.
For C++ and Python there is Essentia, as seen in Freesound, which is a strong recommendation IMO. (Watch out, the source download is enormous; just shy of half a gigabyte.) Features python and vamp integration, and a great many algorithms. I haven’t given it a fair chance because LibROSA has been such a joy to use. However, the intriguing Dunya project is based off it.
echonest is a proprietary system that was used to generate the Million Song Dataset. It seems to be gradually decaying, and was bought up by Spotify. It has great demos, such as autocanonisation.
phonological corpus tools
Speech-focussed, phonological corpus tools is another research library for largeish-corpus analysis, similarity classification etc.
John Glover, SoundCloud staffer, has several analysis libraries culminating in Metamorph:
a new open source library for performing high-level sound transformations based on a sinusoids plus noise plus transients model. It is written in C++, can be built as both a Python extension module and a Csound opcode, and currently runs on Mac OS X and Linux.
It is designed to work primarily on monophonic, quasi-harmonic sound sources and can be used in a non-real-time context to process pre-recorded sound files or can operate in a real-time (streaming) mode.
See also the related spectral modeling and synthesis package, smstools.
Sinusoidal modelling with simplsound
Simplsound (GlLT09) is a Python implementation of sinusoidal modelling.
If you use a lot of Supercollider, you might like SCMIR, a native supercollider thingy. It has the virtues that
- it can run in realtime, which is lovely.
It has the vices that
- it runs in SuperCollider, which is a backwater language unserviced by modern development infrastructure or decent machine learning libraries, and
- it has a fraught development process; I can’t even link directly to it because the author doesn’t give it its own anchor tag, let alone a whole web page or source code repository, and the release schedule is opaque and sporadic. Consequently, it is effectively a lone guy’s pet project rather than an active community endeavour.
That is to say this is the Etsy sweater of code knitting. If on balance this sounds like a good deal to you, you can download SCMIR from somewhere or other on Nick Collins’ homepage.
Other specialist tools
Large-Scale Content-Based Matching of Midi and Audio Files:
MIDI files, when paired with corresponding audio recordings, can be used as ground truth for many music information retrieval tasks. We present a system which can efficiently match and align MIDI files to entries in a large corpus of audio content based solely on content, i.e., without using any metadata. The core of our approach is a convolutional network-based cross-modality hashing scheme which transforms feature matrices into sequences of vectors in a common Hamming space. Once represented in this way, we can efficiently perform large-scale dynamic time warping searches to match MIDI data to audio recordings. We evaluate our approach on the task of matching a huge corpus of MIDI files to the Million Song Dataset.
See also Dannenberg’s bibliographies on score following.
mir_eval computes standard evaluation metrics for common MIR tasks.
CLEESE is an IRCAM project to do dataset augmentation by pitch and time transform of audio.
- Robe11: Andrew N. Robertson (2011) A Bayesian approach to drum tracking.
- Parn97: Richard Parncutt (1997) A model of the perceptual root(s) of a chord accounting for voicing and prevailing tonality. In Music, Gestalt, and Computing (pp. 181–199). Springer Berlin Heidelberg
- AlSS16: Francesc Alías, Joan Claudi Socoró, Xavier Sevillano (2016) A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds. Applied Sciences, 6(5), 143. DOI
- LaNK87: M. Lahat, Russell J. Niederjohn, D. Krubsack (1987) A spectral autocorrelation method for measurement of the fundamental frequency of noise-corrupted speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(6), 741–750. DOI
- FLTZ11: Zhouyu Fu, Guojun Lu, Kai Ming Ting, Dengsheng Zhang (2011) A Survey of Audio-Based Music Classification and Annotation. IEEE Transactions on Multimedia, 13(2), 303–319. DOI
- NCRG16: Markus Nussbaum-Thom, Jia Cui, Bhuvana Ramabhadran, Vaibhava Goel (2016) Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. (pp. 390–394). DOI
- KMBT11: Nicholas Kalouptsidis, Gerasimos Mileounis, Behtash Babadi, Vahid Tarokh (2011) Adaptive algorithms for sparse system identification. Signal Processing, 91(8), 1910–1919. DOI
- TeSS82: Ernst Terhardt, Gerhard Stoll, Manfred Seewann (1982) Algorithm for extraction of pitch and pitch salience from complex tonal signals. The Journal of the Acoustical Society of America, 71(3), 679–688. DOI
- McEl11: Brian McFee, Daniel PW Ellis (2011) Analyzing song structure with spectral clustering. In IEEE conference on Computer Vision and Pattern Recognition (CVPR).
- KeDP13: Corey Kereliuk, Philippe Depalle, Philippe Pasquier (2013) Audio Interpolation and Morphing via Structured-Sparse Linear Regression.
- PaMK10: Jouni Paulus, Meinard Müller, Anssi Klapuri (2010) Audio-Based Music Structure Analysis. In ISMIR (pp. 625–636). ISMIR
- WuLe17: Chih-Wei Wu, Alexander Lerch (2017) Automatic drum transcription using the student-teacher learning paradigm with unlabeled music data. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Suzhou: ISMIR
- GoDa05: S. Godsill, Manuel Davy (2005) Bayesian computational models for inharmonicity in musical instruments. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005 (pp. 283–286). IEEE DOI
- BlCH10: David M. Blei, Perry R. Cook, Matthew Hoffman (2010) Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 439–446).
- RoPl07: Andrew Robertson, Mark Plumbley (2007) B-Keeper: A Beat-tracker for Live Performance. In Proceedings of the 7th International Conference on New Interfaces for Musical Expression (pp. 234–237). New York, NY, USA: ACM DOI
- Noll67: A. Michael Noll (1967) Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2), 293–309. DOI
- HCEG17: Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, Channing Moore, … Kevin Wilson (2017) CNN Architectures for Large-Scale Audio Classification. In Proc. IEEE ICASSP 2017.
- Cart87: G.Clifford Carter (1987) Coherence and time delay estimation. Proceedings of the IEEE, 75(2), 236–255. DOI
- Carm13: Avishy Y. Carmi (2013) Compressive system identification: sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770. DOI
- RoCh97: Thierry Rochebois, Gérard Charbonneau (1997) Cross-synthesis using interverted principal harmonic sub-spaces. In Music, Gestalt, and Computing (pp. 375–385). Springer Berlin Heidelberg
- GrMK10: P. Grosche, M. Muller, F. Kurth (2010) Cyclic tempogram - a mid-level tempo representation for music signals. In 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (pp. 5522–5525). Piscataway, NJ.: IEEE DOI
- MüDr12: Meinard Müller, Jonathan Driedger (2012) Data-Driven Sound Track Generation. In Multimodal Music Processing (Vol. 3, pp. 175–194). Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum für Informatik
- LCHR16: Yi Luo, Zhuo Chen, John R. Hershey, Jonathan Le Roux, Nima Mesgarani (2016) Deep Clustering and Conventional Networks for Music Separation: Stronger Together. ArXiv:1611.06265 [Cs, Stat].
- HDYD12: G. Hinton, Li Deng, Dong Yu, G.E. Dahl, A. Mohamed, N. Jaitly, … B. Kingsbury (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI
- PoSe17: Jordi Pons, Xavier Serra (2017) Designing Efficient Architectures for Modeling Temporal Features with Convolutional Neural Networks.
- Helm63: Heinrich Helmholtz (1863) Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik. Braunschweig: J. Vieweg
- SBRM08: Arnauld Sergé, Nicolas Bertaux, Hervé Rigneault, Didier Marguet (2008) Dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. Nature Methods, 5(8), 687–694. DOI
- SmLe06: Evan C. Smith, Michael S. Lewicki (2006) Efficient auditory coding. Nature, 439(7079), 978–982. DOI
- SmLe05: Evan Smith, Michael S. Lewicki (2005) Efficient Coding of Time-Relative Structure Using Spikes. Neural Computation, 17(1), 19–45. DOI
- DiSc14: Sander Dieleman, Benjamin Schrauwen (2014) End to end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6964–6968). IEEE DOI
- VeSm17: Shrikant Venkataramani, Paris Smaragdis (2017) End to end Source Separation with Adaptive Front-Ends. ArXiv:1705.02514 [Cs].
- GóHe04: Emilia Gómez, Perfecto Herrera (2004) Estimating The Tonality Of Polyphonic Audio Files: Cognitive Versus Machine Learning Modelling Strategies. In ISMIR.
- PoLS16: Jordi Pons, Thomas Lidy, Xavier Serra (2016) Experimenting with musically motivated convolutional neural networks. In Content-Based Multimedia Indexing (CBMI), 2016 14th International Workshop on (pp. 1–6). IEEE DOI
- SmEl09: Tamara Smyth, Andrew R. Elmore (2009) Explorations in convolutional synthesis. In Proceedings of the 6th Sound and Music Computing Conference, Porto, Portugal (pp. 23–25).
- DBVB17: Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson (2017) FMA: A Dataset For Music Analysis. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
- FPFF17: Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, … Xavier Serra (2017) Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
- OpSc04: A. V. Oppenheim, R. W. Schafer (2004) From frequency to quefrency: a history of the cepstrum. IEEE Signal Processing Magazine, 21(5), 95–106. DOI
- Fitz10: Derry Fitzgerald (2010) Harmonic/percussive separation using median filtering.
- AbKI95: T. Abe, T. Kobayashi, S. Imai (1995) Harmonics tracking and pitch extraction based on instantaneous frequency. In International Conference on Acoustics, Speech, and Signal Processing, 1995. ICASSP-95 (Vol. 1, pp. 756–759 vol.1). DOI
- MaPW09: James B. Maxwell, Philippe Pasquier, Brian Whitman (2009) Hierarchical Sequential Memory for Music: A Cognitive Model. In Proceedings of the tenth International Society for Music Information Retrieval Conference (ISMIR 2009) (pp. 429–434).
- ChWa00: Ning Chen, Shijun Wang (n.d.) High-Level Music Descriptor Extraction Algorithm Based on Combination of Multi-Channel Cnns and Lstm. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
- ScBö14: J. Schlüter, S. Böck (2014) Improved musical onset detection with Convolutional Neural Networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6979–6983). DOI
- DuKL06: Pan Du, Warren A. Kibbe, Simon M. Lin (2006) Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics, 22(17), 2059–2065. DOI
- ABMD10: Amélie Anglade, Emmanouil Benetos, Matthias Mauch, Simon Dixon (2010) Improving Music Genre Classification Using Automatically Induced Harmony Rules. Journal of New Music Research, 39, 349–361. DOI
- YoGo12: Kazuyoshi Yoshii, Masataka Goto (2012) Infinite Composite Autoregressive Models for Music Signal Analysis.
- SmLe04: Evan C. Smith, Michael S. Lewicki (2004) Learning efficient auditory codes using spikes predicts cochlear filters. In Advances in Neural Information Processing Systems (pp. 1289–1296).
- ScKi11: E.M. Schmidt, Y.E. Kim (2011) Learning emotion-based acoustic features with deep belief networks. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 65–68). DOI
- ThHK17: John Thickstun, Zaid Harchaoui, Sham Kakade (2017) Learning Features of Music from Scratch. In Proceedings of International Conference on Learning Representations (ICLR) 2017.
- KaSH00: Thomas Kailath, Ali H. Sayed, Babak Hassibi (2000) Linear estimation. Upper Saddle River, N.J: Prentice Hall
- Iriz01: Rafael A Irizarry (2001) Local harmonic estimation in musical sound signals. Journal of the American Statistical Association, 96(454), 357–367. DOI
- PiIl05: Jeremy Pickens, Costas S. Iliopoulos (2005) Markov Random Fields and Maximum Entropy Modeling for Music Information Retrieval. In ISMIR (pp. 207–214). Citeseer
- SWLH17: Carl Southall, Chih-Wei Wu, Alexander Lerch, Jason A. Hockman (2017) MDB Drums — An Annotated Subset of MedleyDB for Automatic Drum Transcription. In Late Breaking Demo (Extended Abstract), Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Suzhou: International Society for Music Information Retrieval (ISMIR)
- Herm88: Dik J. Hermes (1988) Measurement of pitch by subharmonic summation. The Journal of the Acoustical Society of America, 83(1), 257–264. DOI
- SCBH12: Joan Serrà, Álvaro Corral, Marián Boguñá, Martín Haro, Josep Ll Arcos (2012) Measuring the Evolution of Contemporary Western Popular Music. Scientific Reports, 2. DOI
- SGER14: Justin Salamon, Emilia Gomez, Daniel PW Ellis, Gael Richard (2014) Melody Extraction from Polyphonic Music Signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2), 118–134. DOI
- MeHV16: Annamaria Mesaros, Toni Heittola, Tuomas Virtanen (2016) Metrics for Polyphonic Sound Event Detection. Applied Sciences, 6(6), 162. DOI
- YaCY17: Li-Chia Yang, Szu-Yu Chou, Yi-Hsuan Yang (2017) MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
- BoBV12: Nicolas Boulanger-Lewandowski, Yoshua Bengio, Pascal Vincent (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
- SaLi16: Tara N. Sainath, Bo Li (2016) Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks. Submitted to Proc. Interspeech.
- DeKa07: Ryan J. Demopoulos, Michael J. Katchabaw (2007) Music Information Retrieval: A Survey of Issues and Approaches. Technical Report
- CVVR11: J. J. Carabias-Orti, T. Virtanen, P. Vera-Candeas, N. Ruiz-Reyes, F. J. Canadas-Quesada (2011) Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1144–1158. DOI
- Harr78: Fredric J. Harris (1978) On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1), 51–83. DOI
- HoBB10: Matthew Hoffman, Francis R. Bach, David M. Blei (2010) Online learning for latent dirichlet allocation. In advances in neural information processing systems (pp. 856–864).
- FFCE16: Rémi Flamary, Cédric Févotte, Nicolas Courty, Valentin Emiya (2016) Optimal spectral transportation with application to music transcription. In arXiv:1609.09799 [cs, stat] (pp. 703–711). Curran Associates, Inc.
- BiPL96: Emmanuel Bigand, Richard Parncutt, Fred Lerdahl (1996) Perception of musical tension in short chord sequences: The influence of harmonic function, sensory dissonance, horizontal motion, and musical training. Perception & Psychophysics, 58(1), 125–141. DOI
- ElZi17: Dan Elbaz, Michael Zibulevsky (2017) Perceptual audio loss function for deep learning. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
- RoSD13: Andrew Robertson, Adam Stark, Matthew EP Davies (2013) Percussive beat tracking using real-time median filtering. In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
- MKSW99: John Makhoul, Francis Kubala, Richard Schwartz, Ralph Weischedel (1999) Performance Measures For Information Extraction. In In Proceedings of DARPA Broadcast News Workshop (pp. 249–252).
- RoPl06: A. N. Robertson, M. D. Plumbley (2006) Real-time Interactive Musical Systems: An Overview. Proc. of the Digital Music Research Network, Goldsmiths University, London, 65–68.
- RoSP11: Andrew Robertson, Adam M. Stark, Mark D. Plumbley (2011) Real-time visual beat tracking using a comb filter matrix. In Proceedings of the International Computer Music Conference 2011.
- PHMM16: Huy Phan, Lars Hertel, Marco Maass, Alfred Mertins (2016) Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks. In Interspeech 2016.
- MEKR11: M. Müller, D.P.W. Ellis, A. Klapuri, G. Richard (2011) Signal Processing for Music Analysis. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1088–1110. DOI
- GlLT09: John C. Glover, Victor Lazzarini, Joseph Timoney (2009) Simpl: A Python library for sinusoidal modelling. In DAFx 09 proceedings of the 12th International Conference on Digital Audio Effects, Politecnico di Milano, Como Campus, Sept. 1-4, Como, Italy (pp. 1–4). Dept. of Electronic Engineering, Queen Mary University of London.
- JoDa13: Joël Bensoam, David Roze (2013) Solving interactions between nonlinear resonators. In Proceedings of the Sound and Music Computing Conference.
- ScPu11: S. Scholler, H. Purwins (2011) Sparse Approximations for Drum Sound Classification. IEEE Journal of Selected Topics in Signal Processing, 5(5), 933–940. DOI
- DVFK10: Monika Dörfler, Gino Velasco, Arthur Flexer, Volkmar Klien (2010) Sparse Regression in Time-Frequency Representations of Complex Audio.
- RoPl13: Andrew Robertson, Mark D. Plumbley (2013) Synchronizing Sequencing Software to a Live Drummer. Computer Music Journal, 37(2), 46–60. DOI
- Ljun99: Lennart Ljung (1999) System identification: theory for the user. Upper Saddle River, NJ: Prentice Hall PTR
- CoLW70: J. W. Cooley, P. A. W. Lewis, P. D. Welch (1970) The application of the fast Fourier transform algorithm to the estimation of spectra and cross-spectra. Journal of Sound and Vibration, 12(3), 339–352. DOI
- ChSK77: D. G. Childers, D. P. Skinner, R. C. Kemerait (1977) The cepstrum: A guide to processing. Proceedings of the IEEE, 65(10), 1428–1443. DOI
- BlTu59: R. B. Blackman, J. W. Tukey (1959) The measurement of power spectra from the point of view of communications engineering. New York: Dover Publications
- BEWL11: Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, Paul Lamere (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
- MaPe14: Ugo Marchand, Geoffroy Peeters (2014) The Modulation Scale Spectrum And Its Application To Rhythm-Content Analysis. In DAFX (Digital Audio Effects). Erlangen, Germany
- Moor74: J.A Moorer (1974) The optimum comb method of pitch period analysis of continuous digitized speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(5), 330–338. DOI
- BoHT63: B P Bogert, M J R Healy, J W Tukey (1963) The quefrency alanysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. (pp. 209–243).
- Welc67: Peter D. Welch (1967) The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2), 70–73. DOI
- BJRL16: George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel, Greta M. Ljung (2016) Time series analysis: forecasting and control. Hoboken, New Jersey: John Wiley & Sons, Inc
- PlLe65: Reinier Plomp, Willem JM Levelt (1965) Tonal consonance and critical bandwidth. The Journal of the Acoustical Society of America, 38(4), 548–560. DOI
- SaSG13: Justin Salamon, Joan Serrà, Emilia Gómez (2013) Tonal representations for music retrieval: from version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1), 45–58. DOI
- CFSC17: Keunwoo Choi, György Fazekas, Mark Sandler, Kyunghyun Cho (2017) Transfer learning for music classification and regression tasks. In Proceeding of The 18th International Society of Music Information Retrieval (ISMIR) Conference 2017. suzhou, China
- CCFH67: W.T. Cochran, James W. Cooley, D.L. Favin, H.D. Helms, R.A. Kaenel, W.W. Lang, … Peter D. Welch (1967) What is the fast Fourier transform? Proceedings of the IEEE, 55(10), 1664–1674. DOI
- GrMS10: Peter Grosche, Meinard Müller, Craig Stuart Sapp (2010) What makes beat tracking difficult? a case study on Chopin mazurkas. In Proceedings of the International Conference on Music Information Retrieval (ISMIR 2010).