Natural language processing

Dave, although you took very thorough precautions in the pod against my hearing you, I could see your lips move.

See also design grammars, iterated function systems, and my research proposal in this area, grammatical inference.

Software

SpaCy

http://spacy.io/:

spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. […]

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
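
To give a flavour of the API, here is a minimal sketch of the usual spaCy workflow; it assumes the small English pipeline has already been fetched with python -m spacy download en_core_web_sm.

```python
import spacy

# Load a pretrained pipeline (tokeniser, tagger, parser, NER);
# requires `python -m spacy download en_core_web_sm` beforehand.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Token-level annotations: surface form, part of speech, dependency relation.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities found by the statistical NER component.
for ent in doc.ents:
    print(ent.text, ent.label_)
```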

pytorch.text

As with other deep learning frameworks, PyTorch includes some basic NLP support; see pytorch.text.
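
Here is a minimal sketch of the kind of preprocessing it handles: tokenising and building a vocabulary ready to feed an embedding layer. The torchtext API has shifted between releases, so treat the exact calls (get_tokenizer, build_vocab_from_iterator) as assumptions about a reasonably recent version.

```python
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Simple rule-based tokeniser shipped with torchtext.
tokenizer = get_tokenizer("basic_english")

corpus = [
    "The pod bay doors are closed.",
    "I could see your lips move.",
]

# Build a token -> integer-id vocabulary over the corpus.
vocab = build_vocab_from_iterator(
    (tokenizer(line) for line in corpus), specials=["<unk>"]
)
vocab.set_default_index(vocab["<unk>"])

# Numericalise a sentence, ready for an nn.Embedding layer.
ids = vocab(tokenizer("the pod bay doors"))
print(ids)
```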

NLTK

NLTK is a classic python teaching library for rolling your own language processing.
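
For instance, the classic tokenise-then-tag pipeline looks like this; the model downloads (punkt, averaged_perceptron_tagger) are a one-off step, and those resource names are the ones current NLTK releases use.

```python
import nltk

# One-off downloads of the word tokeniser and POS tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "Dave, I could see your lips move."

tokens = nltk.word_tokenize(sentence)   # ['Dave', ',', 'I', 'could', ...]
tagged = nltk.pos_tag(tokens)           # [('Dave', 'NNP'), (',', ','), ...]
print(tagged)
```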

NLP4J

Formerly ClearNLP.

The Natural Language Processing for JVM languages (NLP4J) project provides:

  • NLP tools readily available for research in various disciplines.

  • Frameworks for fast development of efficient and robust NLP components.

  • API for manipulating computational structures in NLP (e.g., dependency graph).

The project is initiated and currently led by the Emory NLP research group with many helps [sic] from the community.

Misc other

  • mate

  • corenlp

  • apache opennlp

  • IMS Open Corpus Workbench (CWB)…

    is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations.

    I’m uncertain how actively maintained this is.

  • HTK

    The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.

There are many more, but I am stopping with the links here, having found the bits and pieces I need for my purposes.
