See also design grammars, iterated function systems, grammatical inference, and my research proposal in that area.
Software
spaCy
spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. […]
spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
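As a concrete taste, here is a minimal sketch of typical spaCy usage (assuming spaCy is installed along with the small English model, via `python -m spacy download en_core_web_sm`; the example sentence is my own):

```python
import spacy

# Load a pretrained English pipeline (tokeniser, tagger, parser, NER).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Named entities recognised in the text.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Dependency parse: each token, its syntactic role, and its head.
for token in doc:
    print(token.text, token.dep_, token.head.text)
```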
pytorch.text
Like other deep learning frameworks, PyTorch has some basic NLP support; see pytorch.text (a.k.a. torchtext).
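torchtext’s API has changed considerably between releases; the following minimal sketch assumes a reasonably recent version providing `get_tokenizer` and `build_vocab_from_iterator` (the toy corpus is invented for illustration):

```python
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "colourless green ideas sleep furiously",
]

# Build a vocabulary over the tokenised corpus, reserving an <unk>
# slot for out-of-vocabulary words.
vocab = build_vocab_from_iterator(
    (tokenizer(line) for line in corpus), specials=["<unk>"]
)
vocab.set_default_index(vocab["<unk>"])

# Map a sentence to integer ids, ready to feed an embedding layer.
print(vocab(tokenizer("the quick red fox")))  # "red" maps to <unk>
```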
NLP4J
Formerly ClearNLP.
The Natural Language Processing for JVM languages (NLP4J) project provides:
- NLP tools readily available for research in various disciplines.
- Frameworks for fast development of efficient and robust NLP components.
- API for manipulating computational structures in NLP (e.g., dependency graph).

The project is initiated and currently led by the Emory NLP research group with many helps [sic] from the community.
Misc other
Apache OpenNLP
MALLET is another big Java NLP workbenchey thing
IMS Open Corpus Workbench (CWB)…
is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations.
I’m uncertain how actively maintained this is.

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.
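HTK itself is a C toolkit driven by command-line tools, but to make concrete what building and manipulating hidden Markov models involves, here is a toy likelihood computation via the forward algorithm in Python/NumPy (all parameter values invented for illustration):

```python
import numpy as np

# Toy HMM: 2 hidden states, 3 observation symbols.
A = np.array([[0.7, 0.3],        # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # per-state emission probabilities
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

def forward(obs):
    """Likelihood of an observation sequence under the HMM,
    computed with the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward([0, 1, 2, 1]))
```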
There are many more, but I am stopping here, having found the bits and pieces I need for my purposes.