Bags of words, edit distance (as seen in bioinformatics), Hamming distances, cunning kernels and vector spaces over documents. Vector spaces induced by document structures. Metrics based on generation by finite state machines. Maybe co-occurrence metrics would also be useful as musical metrics? Inference complexity.
If I were to actually write this entry, it would be a big research project.
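In the meantime, a minimal sketch of the three simplest items on that list, under some loud assumptions: the bag of words is compared by cosine similarity over whitespace-split, lowercased tokens; "edit distance" is taken to mean plain unit-cost Levenshtein distance (bioinformatics usually prefers weighted alignment scores instead); and all function names are made up for illustration.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Map a document to word counts, ignoring case and word order."""
    # Assumption: naive whitespace tokenisation, no stemming or stop words.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two bags of words, as sparse vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(s, t))


def edit_distance(s, t):
    """Unit-cost Levenshtein distance via row-by-row dynamic programming."""
    # previous[j] = distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, x in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string costs i deletions
        for j, y in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(hamming_distance("karolin", "kathrin"))
    print(cosine_similarity(bag_of_words("the cat sat"),
                            bag_of_words("sat the cat")))
```

The contrast is the point: the bag of words throws away order entirely, so reordered documents look identical, while Hamming and edit distance are all about order. Hamming distance only applies to sequences of equal length; edit distance also allows insertions and deletions.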
“Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.
Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways…”
“Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.”