Turning audio into text.
Not much here; I just needed to audio some speech recognition for a friend. I care more about a different machine-listening niche, musical machine listening.
You might try CMU sphinx.
The interesting warping sub-problem sounds interesting, though.
Connectionist Temporal Classification is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition, which is how we have been using it at Baidu’s Silicon Valley AI Lab.