The Living Thing / Notebooks :

Natural language processing

Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.

I don’t really know anything about this. See instead, perhaps, design grammars, semantics, iterated function systems and my obsolete research proposal in this area, grammatical inference, information retrieval.

What is NLP



Bling Fire Tokenizer is a tokenizer designed for fast-speed and quality tokenization of Natural Language text. It mostly follows the tokenization logic of NLTK, except hyphenated words are split and a few errors are fixed.

This looks like it is also good for non-NLP tokenization tasks.
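To make the idea concrete, here is a toy pure-Python tokenizer in a similar spirit: it splits off punctuation as separate tokens and breaks hyphenated words apart, as Bling Fire does. This is a stdlib illustration only, not Bling Fire’s or NLTK’s actual algorithm.

```python
import re

def tokenize(text):
    """Toy word tokenizer: runs of word characters become tokens,
    and every other non-space character (punctuation, hyphens,
    apostrophes) becomes its own single-character token."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("State-of-the-art tokenizers are fast, aren't they?")
# Hyphenated words split apart; "," and "?" come out as separate tokens.
```

A real tokenizer adds many special cases on top of this (abbreviations, numbers, URLs), which is where libraries like Bling Fire earn their keep.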


spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. […]

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
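A minimal sketch of getting started with spaCy, assuming the package is installed. `spacy.blank` builds a bare pipeline (tokenizer only), so no trained model needs to be downloaded for this example:

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

doc = nlp("spaCy interoperates with the rest of Python's AI ecosystem.")
tokens = [token.text for token in doc]
# Punctuation such as the final "." is split off as its own token.
```

For tagging, parsing, and named entities you would instead load a trained pipeline (e.g. `spacy.load("en_core_web_sm")` after downloading it).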


Like other deep learning frameworks, pytorch has some basic NLP support; see torchtext (the pytorch/text package).


NLTK is the classic Python teaching library for rolling your own language processing.
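For example, NLTK ships a Penn-Treebank-style word tokenizer as a plain Python class (this one needs no extra data download, unlike some other NLTK tokenizers):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Treebank conventions: punctuation is split off, contractions
# are split into clitics ("aren't" -> "are", "n't"), etc.
tokens = tokenizer.tokenize("Tokenizers aren't magic, just rules.")
```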


Formerly ClearNLP.

The Natural Language Processing for JVM languages (NLP4J) project provides:

- NLP tools readily available for research in various disciplines.
- Frameworks for fast development of efficient and robust NLP components.
- API for manipulating computational structures in NLP (e.g., dependency graph).

The project is initiated and currently led by the Emory NLP research group, with many helps [sic] from the community.

Misc other

There are many more, but I will stop with the links here, having found the bits and pieces I need for my purposes.