# Text processing

Information retrieval via string metrics. Part-of-speech tagging. Vector space representations induced by document structure, compared by measures such as cosine similarity, and word2vec-style embeddings.
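As a rough illustration of the vector-space idea, here is a minimal sketch of cosine similarity between two documents treated as bags of words. The whitespace tokenizer and the function names `bag_of_words` and `cosine_similarity` are my own assumptions for the example, not taken from any of the libraries listed below.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to a sparse term-count vector (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen here: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = bag_of_words("the cat sat on the mat")
    doc2 = bag_of_words("the dog sat on the log")
    print(cosine_similarity(doc1, doc2))
```

Raw counts are the simplest possible weighting; a real retrieval system would more likely use something like tf-idf weights and a proper tokenizer.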

Metrics based on generation by finite state machines. Maybe co-occurrence metrics would also be useful as musical metrics? Inference complexity.
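To make the co-occurrence idea concrete, here is a small sketch that counts how often two distinct tokens appear within a fixed window of each other. The tokens could be words or, speculatively, note names. The function name `cooccurrence_counts`, the `window` parameter, and the choice of unordered pairs are all assumptions made for this example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count unordered pairs of distinct tokens at most `window` positions apart."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead only, so each pair of positions is visited exactly once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


if __name__ == "__main__":
    melody = ["C", "E", "G", "C"]
    for pair, count in sorted(cooccurrence_counts(melody, window=2).items()):
        print(pair, count)
```

The resulting counts could be normalized into something like pointwise mutual information before being used as a similarity measure; that step is left out here.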

If I were to actually write this entry, it would be a big research project.

## Software

• Luke

“Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways…”

• whoosh

“Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.”

• xapian

• sphinx

• lemur

Uh…