In machine listening and related tasks like audio analysis we often want compact representations of audio signals that are not “raw”; something a little more useful than a simple record of the vibrations of the microphone. We might want a representation, for example, that tells us something about psychoacoustics, i.e. the parts of the sound that are relevant to a human listener. We might want to use these features to construct a musical metric that tells us something about perceptual similarity.
These representations are called features or descriptors. This is a huge industry - compact summary features, for example, make audio convenient and compact for transmission (hello, mobile telephony and MP3). But the same trick is also useful for understanding speech, music etc. There are as many descriptors as there are IEEE conference slots.
See, say, (Alías, Socoró, and Sevillano 2016) for an intimidatingly comprehensive summary.
I’m especially interested in:
Invertible ones, for analysis/resynthesis. If not analytically invertible, convexity would do.
Ones that avoid the windowed DFT (i.e. the STFT), because it sounds awful in the resynthesis phase and is lazy.
Ones that can encode noisiness in the signal as well as harmonicity…?
Deep neural networks
See, e.g. Jordi Pons’ Spectrogram CNN discussion for some introductions to the kind of features a neural network might “discover” in audio recognition tasks.
There is some interesting stuff here; for example, Dieleman and Schrauwen (Dieleman and Schrauwen 2014) show that convolutional neural networks trained on raw audio (i.e. not spectrograms) for music classification recover Mel-like frequency bands. Thickstun et al (Thickstun, Harchaoui, and Kakade 2017) do some similar work.
And Keunwoo Choi shows that you can listen to what they learn.
Sparse comb filters
Differentiable! Conditionally invertible! Handy for syncing.
Moorer (Moorer 1974) proposed this for harmonic purposes, but Robertson et al (Robertson, Stark, and Plumbley 2011) have shown it to be handy for rhythm.
Measure the signal’s full or partial autocorrelation.
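As a concrete (and hedged) sketch of the idea: the simplest version estimates a pitch period as the lag maximising the signal’s autocorrelation, which is essentially what a bank of comb filters scores. This is a toy illustration, not the actual Moorer or Robertson et al. methods:

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=50.0, fmax=500.0):
    """Estimate a fundamental frequency from the signal's autocorrelation.

    The lag that maximises self-similarity is taken as the pitch period;
    a comb filter tuned to that lag would pass the signal with maximal gain.
    """
    x = x - x.mean()
    n = len(x)
    # Autocorrelation via FFT (Wiener-Khinchin), zero-padded to avoid wrap-around.
    spec = np.fft.rfft(x, 2 * n)
    ac = np.fft.irfft(spec * np.conj(spec))[:n]
    ac /= ac[0]  # normalise so lag 0 has unit correlation
    # Restrict the search to plausible pitch lags.
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag  # estimated fundamental frequency in Hz

# Demo: a 220 Hz harmonic test tone.
fs = 16000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 5))
f0 = autocorr_pitch(x, fs)  # close to 220 Hz
```

Note the biased (unnormalised-by-overlap) autocorrelation estimator conveniently favours the shortest candidate lag, which suppresses subharmonic errors here.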
Linear predictive coefficients
How do these transform? Casting this as an all-pole or all-zero model might be useful, but the fit has many local maxima.
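For reference, a minimal sketch of the classic all-pole autocorrelation method via the Levinson-Durbin recursion, using the convention that the prediction filter is A(z) = 1 + Σ a[k] z^(-k):

```python
import numpy as np

def lpc(x, order):
    """All-pole (LPC) coefficients via Levinson-Durbin.

    Solves the Toeplitz Yule-Walker equations built from the signal's
    autocorrelation; returns (a, err) with a[0] == 1.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        e *= 1.0 - k * k  # prediction error shrinks at each order
    return a, e

# Demo: recover the coefficients of a known AR(2) process.
rng = np.random.default_rng(0)
x = np.zeros(20000)
eps = rng.standard_normal(20000)
for t in range(2, 20000):
    x[t] = 1.5 * x[t - 1] - 0.7 * x[t - 2] + eps[t]
a, err = lpc(x, 2)  # a close to [1, -1.5, 0.7]
```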
Classic, but inconvenient to invert.
Mel-frequency Cepstral Coefficients, or Mel Cepstral transform. Take the perfectly respectable-if-fiddly cepstrum and make it really messy, with a vague psychoacoustic model, in the hope that distinctions in the resulting “MFCC” might correspond to human perceptual distinctions.
Folk wisdom holds that MFCC features are Eurocentric, in that they destroy, or at least obscure, tonal language features. Ubiquitous, but inconsistently implemented; MFCCs are generally not the same across implementations, probably because the Mel scale is itself not universally standardised.
Aside from being loosely psychoacoustically motivated features, what do the coefficients of an MFCC specifically tell me?
Hmm. If I have got this right, these are “generic features”; things we can use in machine learning because we hope they project the spectrum into a space which approximately preserves psychoacoustic dissimilarity, whilst having little redundancy.
This heuristic pro is weighed against the practical cons that they are not practically differentiable, not invertible except by heroic computational effort, not humanly interpretable, and riven with poorly-supported, somewhat arbitrary steps. (The cepstrum of the Mel-frequency spectrogram is a weird thing that no longer picks out harmonics in the way that God and Tukey intended.)
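To make those arbitrary steps concrete, here is the standard pipeline sketched from scratch (magnitude spectrum → triangular mel filterbank → log → DCT). Parameter choices (26 bands, 13 coefficients, the O’Shaughnessy mel formula) are conventional but, per the above, by no means universal:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula; the Mel scale itself is not standardised.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_mels=26, n_ceps=13):
    """MFCCs of a single frame: |FFT| -> mel filterbank -> log -> DCT-II."""
    n = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    # Triangular filters spaced evenly on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                   (hi - freqs) / (hi - mid)), 0.0, None)
    logmel = np.log(fb @ mag + 1e-10)
    # DCT-II decorrelates the log-mel energies (the "cepstrum" of the mel spectrum).
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * n_mels))
    return dct @ logmel

# Demo: MFCCs of a 440 Hz sine frame.
fs = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(512) / fs)
c = mfcc_frame(frame, fs)  # 13 coefficients
```

Each stage (window choice, band count, log floor, DCT normalisation) is a spot where implementations legitimately differ, which is exactly why MFCCs rarely match across libraries.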
Including bandpasses, gammatones… random filterbanks?
See sparse basis dictionaries.
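A hedged sketch of one such filterbank: the textbook gammatone filter, t^(n-1) e^(-2πbt) cos(2πf_c t), with bandwidth tied to Glasberg and Moore’s ERB formula. This is an illustrative direct-convolution implementation, not an efficient or canonical one:

```python
import numpy as np

def gammatone_ir(fc, fs, dur=0.05, order=4):
    """Impulse response of a gammatone filter centred at fc Hz."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 + 0.108 * fc          # ERB bandwidth in Hz (Glasberg & Moore)
    b = 1.019 * erb                  # conventional gammatone bandwidth factor
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))  # unit-energy normalisation

def filterbank(x, fs, centres):
    """Apply a bank of gammatone filters by convolution; one row per channel."""
    return np.stack([np.convolve(x, gammatone_ir(fc, fs), mode="same")
                     for fc in centres])

# Demo: a 500 Hz tone concentrates its energy in the 500 Hz channel.
fs = 16000
x = np.sin(2 * np.pi * 500 * np.arange(4000) / fs)
bank = filterbank(x, fs, [200, 500, 2000])
```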
Cochlear activation models
ERBs, Mels, Sones, Phones… There are various units designed to approximate the human perceptual process; I tried to document them under psychoacoustics.
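For quick reference, the two frequency-warping scales from that list as formulas (Sones and Phones are loudness units, so they need a level model rather than a one-liner). Note again that the Mel scale has several variants in the wild; this is the O’Shaughnessy/HTK one:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy (1987) variant; 1000 Hz maps to 1000 mel by construction.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_erb_rate(f):
    # Glasberg & Moore (1990): number of ERBs below frequency f.
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f, dtype=float))
```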
Abe, T., T. Kobayashi, and S. Imai. 1995. “Harmonics Tracking and Pitch Extraction Based on Instantaneous Frequency.” In International Conference on Acoustics, Speech, and Signal Processing, 1995. ICASSP-95, 1:756–59 vol.1. https://doi.org/10.1109/ICASSP.1995.479804.
Alías, Francesc, Joan Claudi Socoró, and Xavier Sevillano. 2016. “A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds.” Applied Sciences 6 (5): 143. https://doi.org/10.3390/app6050143.
Anglade, Amélie, Emmanouil Benetos, Matthias Mauch, and Simon Dixon. 2010. “Improving Music Genre Classification Using Automatically Induced Harmony Rules.” Journal of New Music Research 39: 349–61. https://doi.org/10.1080/09298215.2010.525654.
Bogert, B P, M J R Healy, and J W Tukey. 1963. “The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking.” In, 209–43.
Chen, Ning, and Shijun Wang. n.d. “High-Level Music Descriptor Extraction Algorithm Based on Combination of Multi-Channel CNNs and LSTM.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
Childers, D. G., D. P. Skinner, and R. C. Kemerait. 1977. “The Cepstrum: A Guide to Processing.” Proceedings of the IEEE 65 (10): 1428–43. https://doi.org/10.1109/PROC.1977.10747.
Choi, Keunwoo, György Fazekas, Mark Sandler, and Kyunghyun Cho. 2017. “Transfer Learning for Music Classification and Regression Tasks.” In Proceedings of the 18th International Society for Music Information Retrieval (ISMIR) Conference 2017. Suzhou, China. http://arxiv.org/abs/1703.09179.
Defferrard, Michaël, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. 2017. “FMA: A Dataset for Music Analysis.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China. http://arxiv.org/abs/1612.01840.
Dieleman, Sander, and Benjamin Schrauwen. 2014. “End to End Learning for Music Audio.” In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6964–8. IEEE. https://doi.org/10.1109/ICASSP.2014.6854950.
Grosche, P., M. Muller, and F. Kurth. 2010. “Cyclic Tempogram - a Mid-Level Tempo Representation for Music Signals.” In 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 5522–5. Piscataway, NJ.: IEEE. https://doi.org/10.1109/ICASSP.2010.5495219.
Lattner, Stefan, Monika Dorfler, and Andreas Arzt. 2019. “Learning Complex Basis Functions for Invariant Representations of Audio.” In Proceedings of the 20th Conference of the International Society for Music Information Retrieval, 8. http://archives.ismir.net/ismir2019/paper/000085.pdf.
Luo, Yi, Zhuo Chen, John R. Hershey, Jonathan Le Roux, and Nima Mesgarani. 2016. “Deep Clustering and Conventional Networks for Music Separation: Stronger Together,” November. http://arxiv.org/abs/1611.06265.
MacKinlay, Daniel. 2019. “Mosaic Style Transfer Using Sparse Autocorrelograms.” In Proceedings of the 20th Conference of the International Society for Music Information Retrieval, 5. Delft. http://archives.ismir.net/ismir2019/paper/000109.pdf.
Makhoul, John, Francis Kubala, Richard Schwartz, and Ralph Weischedel. 1999. “Performance Measures for Information Extraction.” In Proceedings of the DARPA Broadcast News Workshop, 249–52.
McDermott, Josh H., Michael Schemitsch, and Eero P. Simoncelli. 2013. “Summary Statistics in Auditory Perception.” Nature Neuroscience 16 (4): 493–98. https://doi.org/10.1038/nn.3347.
Moorer, J.A. 1974. “The Optimum Comb Method of Pitch Period Analysis of Continuous Digitized Speech.” IEEE Transactions on Acoustics, Speech and Signal Processing 22 (5): 330–38. https://doi.org/10.1109/TASSP.1974.1162596.
Noll, A. Michael. 1967. “Cepstrum Pitch Determination.” The Journal of the Acoustical Society of America 41 (2): 293–309. https://doi.org/10.1121/1.1910339.
Oppenheim, A. V., and R. W. Schafer. 2004. “From Frequency to Quefrency: A History of the Cepstrum.” IEEE Signal Processing Magazine 21 (5): 95–106. https://doi.org/10.1109/MSP.2004.1328092.
Phan, Huy, Lars Hertel, Marco Maass, and Alfred Mertins. 2016. “Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks.” In Interspeech 2016. http://arxiv.org/abs/1604.06338.
Pons, Jordi, and Xavier Serra. 2017. “Designing Efficient Architectures for Modeling Temporal Features with Convolutional Neural Networks.” In. http://jordipons.me/media/PonsSerraICASSP2017.pdf.
Robertson, Andrew, Adam M. Stark, and Mark D. Plumbley. 2011. “Real-Time Visual Beat Tracking Using a Comb Filter Matrix.” In Proceedings of the International Computer Music Conference 2011. https://www.eecs.qmul.ac.uk/~markp/2011/RobertsonStarkPlumbleyICMC2011_accepted.pdf.
Rochebois, Thierry, and Gérard Charbonneau. 1997. “Cross-Synthesis Using Interverted Principal Harmonic Sub-Spaces.” In Music, Gestalt, and Computing, edited by Marc Leman, 375–85. Lecture Notes in Computer Science 1317. Springer Berlin Heidelberg. http://link.springer.com/chapter/10.1007/BFb0034127.
Salamon, Justin, Joan Serrà, and Emilia Gómez. 2013. “Tonal Representations for Music Retrieval: From Version Identification to Query-by-Humming.” International Journal of Multimedia Information Retrieval 2 (1): 45–58. https://doi.org/10.1007/s13735-012-0026-0.
Schlüter, J., and S. Böck. 2014. “Improved Musical Onset Detection with Convolutional Neural Networks.” In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6979–83. https://doi.org/10.1109/ICASSP.2014.6854953.
Schmidt, E.M., and Y.E. Kim. 2011. “Learning Emotion-Based Acoustic Features with Deep Belief Networks.” In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 65–68. https://doi.org/10.1109/ASPAA.2011.6082328.
Scholler, S., and H. Purwins. 2011. “Sparse Approximations for Drum Sound Classification.” IEEE Journal of Selected Topics in Signal Processing 5 (5): 933–40. https://doi.org/10.1109/JSTSP.2011.2161264.
Smith, Evan C., and Michael S. Lewicki. 2006. “Efficient Auditory Coding.” Nature 439 (7079): 978–82. https://doi.org/10.1038/nature04485.
Smith, Evan, and Michael S. Lewicki. 2005. “Efficient Coding of Time-Relative Structure Using Spikes.” Neural Computation 17 (1): 19–45. https://doi.org/10.1162/0899766052530839.
Southall, Carl, Chih-Wei Wu, Alexander Lerch, and Jason A. Hockman. 2017. “MDB Drums — an Annotated Subset of MedleyDB for Automatic Drum Transcription.” In Late Breaking Demo (Extended Abstract), Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Suzhou: International Society for Music Information Retrieval (ISMIR). http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2017/10/Wu-et-al_2017_MDB-Drums-An-Annotated-Subset-of-MedleyDB-for-Automatic-Drum-Transcription.pdf.
Thickstun, John, Zaid Harchaoui, and Sham Kakade. 2017. “Learning Features of Music from Scratch.” In Proceedings of International Conference on Learning Representations (ICLR) 2017. http://arxiv.org/abs/1611.09827.
Wu, Chih-Wei, and Alexander Lerch. 2017. “Automatic Drum Transcription Using the Student-Teacher Learning Paradigm with Unlabeled Music Data.” In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Suzhou: ISMIR. http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2017/07/Wu_Lerch_2017_Automatic-drum-transcription-using-the-student-teacher-learning-paradigm-with.pdf.