I don’t really know anything about this. See instead, perhaps, design grammars, semantics, iterated function systems, my obsolete research proposal in this area, grammatical inference and information retrieval.
What is NLP
Sebastian Ruder, Recent history of NLP, a.k.a. how natural language processing turned into a deep learning thing too.
Peter Norvig on Chomsky and statistical versus explanatory models of natural language syntax. Full of sick burns.
BlingFire Fire Tokenizer is a tokenizer designed for fast-speed and quality tokenization of Natural Language text. It mostly follows the tokenization logic of NLTK, except hyphenated words are split and a few errors are fixed.
This looks like it is also good for non-NLP tokenization tasks.
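To make the "hyphenated words are split" behaviour concrete, here is a minimal sketch of a rule-based tokenizer using only the Python standard library. It is not BlingFire's algorithm or API; the regular expression and the `tokenize` name are my own assumptions, chosen only to illustrate the idea of emitting words, numbers and punctuation marks as separate tokens.

```python
import re
from typing import List

# Hypothetical token pattern (an assumption, not BlingFire's rules):
#   1. a run of letters, optionally followed by an apostrophe and more letters
#   2. a number, optionally with "." or "," separated digit groups
#   3. any single character that is neither a word character nor whitespace
#   4. a lone underscore
_TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?|\d+(?:[.,]\d+)*|[^\w\s]|_")


def tokenize(text: str) -> List[str]:
    """Split text into word, number and punctuation tokens.

    Hyphens are punctuation here, so a hyphenated word such as
    "state-of-the-art" comes out as alternating words and "-" tokens.
    """
    return _TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenize("State-of-the-art tokenizers aren't slow."))
```

Because the pattern knows nothing about natural language beyond letters, digits and punctuation, the same approach carries over to non-NLP inputs such as log lines or identifiers, which is the appeal noted above.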
spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. […]
spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
Like other deep learning frameworks, PyTorch has some basic NLP support; see torchtext (the pytorch/text project).
NLTK is a classic Python teaching library for rolling your own language processing.
The Natural Language Processing for JVM languages (NLP4J) project provides:
- NLP tools readily available for research in various disciplines.
- Frameworks for fast development of efficient and robust NLP components.
- API for manipulating computational structures in NLP (e.g., dependency graph).
The project is initiated and currently led by the Emory NLP research group with many helps [sic] from the community.
MALLET is another big Java NLP workbench-y thing.
IMS Open Corpus Workbench (CWB)…
is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations.
I’m uncertain how actively maintained this is.
The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.
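For readers who have not met hidden Markov models, the following is a toy sketch of the forward algorithm, which computes the probability of an observation sequence under an HMM. It has nothing to do with HTK's own code or file formats; the two-state model, its probabilities and all names below are invented for illustration.

```python
# Toy two-state HMM; every number here is a made-up assumption.
STATES = ("A", "B")
START_P = {"A": 0.6, "B": 0.4}
TRANS_P = {
    "A": {"A": 0.7, "B": 0.3},
    "B": {"A": 0.4, "B": 0.6},
}
EMIT_P = {
    "A": {"x": 0.5, "y": 0.5},
    "B": {"x": 0.1, "y": 0.9},
}


def forward_likelihood(obs, states=STATES, start_p=START_P,
                       trans_p=TRANS_P, emit_p=EMIT_P):
    """Return P(obs) under the HMM, summing over all hidden state paths."""
    if not obs:
        return 1.0  # the empty sequence has probability one
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


if __name__ == "__main__":
    print(forward_likelihood(["x", "y", "y"]))
```

A quick sanity check on any such implementation is that the likelihoods of all observation sequences of a fixed length sum to one.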
There are many more, but I am stopping with the links having found the bits and pieces I need for my purposes.