# Semantics

“[…] archetypes don’t exist; the body exists. The belly inside is beautiful, because the baby grows there, because your sweet cock, all bright and jolly, thrusts there, and good, tasty food descends there, and for this reason the cavern, the grotto, the tunnel are beautiful and important, and the labyrinth, too, which is made in the image of our wonderful intestines. When somebody wants to invent something beautiful and important, it has to come from there, because you also came from there the day you were born, because fertility always comes from inside a cavity, where first something rots and then, lo and behold, there’s a little man, a date, a baobab.

And high is better than low, because if you have your head down, the blood goes to your brain, because feet stink and hair doesn’t stink as much, because it’s better to climb a tree and pick fruit than end up underground, food for worms, and because you rarely hurt yourself hitting something above — you really have to be in an attic — while you often hurt yourself falling. That’s why up is angelic and down devilish.”

Umberto Eco. “Foucault’s Pendulum.”

On the mapping between linguistic tokens and what they denote.

If I had time I would learn about: Wierzbicka’s semantic primes, Valiant’s PAC-learning, Wittgenstein, probably Mark Johnson if the over-writing doesn’t kill me. Logic-and-language philosophers, toy axiomatic worlds. Classic AI symbolic reasoning approaches. Drop in via game theory and neurolinguistics? Ignore most of it, mention plausible models based on statistical learnability.

## Learnability of terms

When do we need to use words? BGPL10 have a toy model for color words, which is a clever choice of domain.

StTe05: a link to count model stochastics.

Also what embodiment means for this stuff.

## Neural models

What does the MRI tell us about denotation in the brain?

SNVV14 is worth it for the tag alone: “experimental semiotics”

How can we understand each other during communicative interactions? An influential suggestion holds that communicators are primed by each other’s behaviors, with associative mechanisms automatically coordinating the production of communicative signals and the comprehension of their meanings. An alternative suggestion posits that mutual understanding requires shared conceptualizations of a signal’s use, i.e., “conceptual pacts” that are abstracted away from specific experiences. Both accounts predict coherent neural dynamics across communicators, aligned either to the occurrence of a signal or to the dynamics of conceptual pacts. Using coherence spectral-density analysis of cerebral activity simultaneously measured in pairs of communicators, this study shows that establishing mutual understanding of novel signals synchronizes cerebral dynamics across communicators’ right temporal lobes. This interpersonal cerebral coherence occurred only within pairs with a shared communicative history, and at temporal scales independent from signals’ occurrences. These findings favor the notion that meaning emerges from shared conceptualizations of a signal’s use.

## Word vector models

Nearly-reversible, distributed representations of semantics.

As invented by BDVJ03 and popularised and refined by Mikolov and Dean at Google, skip-gram semantic vector spaces are definitely the hippest way of defining string distances for natural language this season.

• Christopher Olah discusses it from a neural network perspective with diagrams and commends Bengio’s BDVJ03 for a rationale.

• Sanjeev Arora’s semantic word embeddings has an explanation of skipgrammish methods:

In all methods, the word vector is a succinct representation of the distribution of other words around this word. That this suffices to capture meaning is asserted by Firth’s hypothesis from 1957, “You shall know a word by the company it keeps.” To give an example, if I ask you to think of a word that tends to co-occur with cow, drink, babies, calcium, you would immediately answer: milk.

[…]Firth’s hypothesis does imply a very simple word embedding, albeit a very high-dimensional one.

Embedding 1: Suppose the dictionary has $N$ distinct words (in practice, $N=100,000$). Take a very large text corpus (e.g., Wikipedia) and let $\text{Count}_5(w_1,w_2)$ be the number of times $w_1$ and $w_2$ occur within a distance 5 of each other in the corpus. Then the word embedding for a word $w$ is a vector of dimension $N$, with one coordinate for each dictionary word. The coordinate corresponding to word $w_2$ is $\text{Count}_5(w,w_2)$. (Variants of this method involve considering co-occurrence of $w$ with various phrases or n-tuples.)

The obvious problem with Embedding 1 is that it uses extremely high-dimensional vectors. How can we compress them?

Embedding 2: Do dimension reduction by taking the rank-300 singular value decomposition (SVD) of the above vectors.

Using SVD to do dimension reduction seems an obvious idea these days but it actually is not. After all, it is unclear a priori why the above N×N matrix of co-occurrence counts should be close to a rank-300 matrix. That this is the case was empirically discovered in the paper on Latent Semantic Indexing or LSI.
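Embedding 1 and Embedding 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Arora's code: the one-sentence corpus is made up, the function names `cooccurrence_matrix` and `svd_embedding` are hypothetical, and the rank is 3 rather than 300 because the toy vocabulary has only nine words.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=5):
    """Embedding 1: symmetric counts of word pairs at most `window` positions apart."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        # Look forward only and update both cells, so each co-occurrence
        # is recorded once in each direction.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            counts[index[w], index[tokens[j]]] += 1
            counts[index[tokens[j]], index[w]] += 1
    return vocab, counts

def svd_embedding(counts, rank):
    """Embedding 2: keep the top `rank` singular directions of the count matrix."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :rank] * s[:rank]

# Toy corpus, invented for illustration only.
tokens = "the cow gives milk and the baby drinks milk with calcium".split()
vocab, counts = cooccurrence_matrix(tokens, window=5)
embedding = svd_embedding(counts, rank=3)  # one 3-dimensional row per vocabulary word
```

Row `vocab.index(w)` of `counts` is the Embedding 1 vector for `w`, and the same row of `embedding` is its compressed Embedding 2 counterpart. On a real corpus the count matrix would be stored sparsely and the SVD truncated with a dedicated solver.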

• For both descriptions below, we assume that the current word in a sentence is $w_i.$

CBOW: The input to the model could be $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2},$ the preceding and following words of the current word we are at. The output of the neural network will be $w_i$. Hence you can think of the task as “predicting the word given its context”. Note that the number of words we use depends on your setting for the window size.

Skip-gram: The input to the model is $w_i$, and the output could be $w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}$. So the task here is “predicting the context given a word”. Also, the context is not limited to its immediate context, training instances can be created by skipping a constant number of words in its context, so for example, $w_{i-3}, w_{i-4}, w_{i+3}, w_{i+4}$, hence the name skip-gram. Note that the window size determines how far forward and backward to look for context words to predict.

According to Mikolov:

Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.

CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words. This can get even a bit more complicated if you consider that there are two different ways how to train the models: the normalized hierarchical softmax, and the un-normalized negative sampling. Both work quite differently.

This makes sense: with skip-gram you can create many more training instances from a limited amount of data, whereas CBOW needs more data because it conditions on the context, and the number of possible contexts can get exponentially huge.
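The difference between the two setups is easiest to see in the training examples they generate. The sketch below is an assumption-laden illustration, not the word2vec implementation: the helper names are hypothetical, the window is a plain symmetric window of 2 (no skipping of intermediate words), and no subsampling or negative sampling is modelled.

```python
def context(tokens, i, window):
    """Words within `window` positions of position i, excluding position i itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: one (context words, target word) example per position."""
    return [(context(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: one (input word, context word) example per word-context pair."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context(tokens, i, window)]

tokens = "you shall know a word by the company it keeps".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```

For this ten-word sentence, CBOW yields one example per position while skip-gram yields one per (word, context word) pair, which is the sense in which skip-gram extracts more training instances from the same text.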

• Jeff Dean’s CIKM Keynote. (That’s “Conference on Information and Knowledge Management” to you and me.)

“Embedding vectors trained for the language modeling task have very interesting properties (especially the skip-gram model)”

\begin{align*} E(\text{hotter}) - E(\text{hot}) &\approx E(\text{bigger}) - E(\text{big}) \\ E(\text{Rome}) - E(\text{Italy}) &\approx E(\text{Berlin}) - E(\text{Germany}) \end{align*}

“Skip-gram model w/ 640 dimensions trained on 6B words of news text achieves 57% accuracy for analogy-solving test set.”
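An analogy test of this kind is usually scored by forming $E(b) - E(a) + E(c)$ and picking the nearest remaining word by cosine similarity. The following is a minimal sketch of that procedure, assuming hand-made two-dimensional vectors chosen only so that the arithmetic is easy to follow; they are not trained embeddings, and `analogy` is a hypothetical helper name.

```python
import numpy as np

# Hand-made vectors for illustration only; real embeddings have hundreds of dimensions.
E = {
    "Italy":   np.array([1.0, 0.0]),
    "Rome":    np.array([1.0, 1.0]),
    "Germany": np.array([3.0, 0.0]),
    "Berlin":  np.array([3.0, 1.0]),
    "France":  np.array([5.0, 0.2]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' by maximising cosine similarity
    to E(b) - E(a) + E(c), excluding the three query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("Italy", "Rome", "Germany", E)  # "Berlin" for these toy vectors
```

Excluding the query words from the candidates is a common convention in such evaluations; without it the nearest vector is often one of the inputs.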

Sanjeev Arora explains that, more than that, the skip-gram vectors for polysemic words are a weighted sum of their constituent meanings.

• KZSZ15: We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. […] The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice.

## Software

• word2vec

This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.

• fastText

fastText is a library for efficient learning of word representations and sentence classification.

## Refs

Arbi02
Arbib, M. (2002) The Mirror System, Imitation, and the Evolution of Language. In C. Nehaniv & K. Dautenhahn (Eds.), Imitation in animals and artifacts. MIT Press
BGPL10
Baronchelli, A., Gong, T., Puglisi, A., & Loreto, V. (2010) Modeling the emergence of universality in color naming patterns. Proceedings of the National Academy of Sciences, 107(6), 2403–2407. DOI.
BDVJ03
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003) A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137–1155.
CaSo03
Cancho, R. F. i, & Solé, R. V. (2003) Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, 100(3), 788–791. DOI.
CaHM07
Cao, H., Hripcsak, G., & Markatou, M. (2007) A statistical methodology for analyzing co-occurrence data from a large sample. Journal of Biomedical Informatics, 40(3), 343–352. DOI.
ChCh08
Christiansen, M. H., & Chater, N. (2008) Language as shaped by the brain. Behavioral and Brain Sciences, 31, 489–509. DOI.
CoSo10
Corominas-Murtra, B., & Solé, R. V. (2010) Universality of Zipf’s law. Physical Review E, 82(1), 11102. DOI.
Deac10
Deacon, T. W. (2010) A role for relaxed selection in the evolution of the language capacity. Proceedings of the National Academy of Sciences, 107, 9000–9006. DOI.
DDFL90
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990) Indexing by Latent Semantic Analysis.
Elma90
Elman, J. L. (1990) Finding structure in time. Cognitive Science, 14, 179–211. DOI.
Elma93
Elman, J. L. (1993) Learning and development in neural networks: the importance of starting small. Cognition, 48, 71–99. DOI.
Elma95
Elman, J. L. (1995) Language as a dynamical system, 195.
Elma03
Elman, J. L. (2003) Generalization from Sparse Input. In Proceedings of the 38th Annual Meeting of the Chicago Linguistic Society. Citeseer
Gärd14
Gärdenfors, P. (2014) Geometry of meaning: semantics based on conceptual spaces. Cambridge, Massachusetts: The MIT Press
GALG06
Guthrie, D., Allison, B., Liu, W., Guthrie, L., & Wilks, Y. (2006) A Closer Look at Skip-gram Modelling.
KZSZ15
Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., & Fidler, S. (2015) Skip-Thought Vectors. arXiv:1506.06726 [Cs].
LNBB15
Lazaridou, A., Nguyen, D. T., Bernardi, R., & Baroni, M. (2015) Unveiling the Dreams of Word Embeddings: Towards Language-Driven Image Generation. arXiv:1506.03500 [Cs].
LeMi14
Le, Q. V., & Mikolov, T. (2014) Distributed Representations of Sentences and Documents. In Proceedings of The 31st International Conference on Machine Learning (pp. 1188–1196).
MCCD13
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013) Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [Cs].
MiLS13
Mikolov, T., Le, Q. V., & Sutskever, I. (2013) Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168 [Cs].
MSCC13
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013) Distributed Representations of Words and Phrases and their Compositionality. In arXiv:1310.4546 [cs, stat] (pp. 3111–3119). Curran Associates, Inc.
MiYZ13
Mikolov, T., Yih, W., & Zweig, G. (2013) Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL (pp. 746–751). Citeseer
PeSM14
Pennington, J., Socher, R., & Manning, C. D. (2014) GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12.
PeFH12
Petersson, K.-M., Folia, V., & Hagoort, P. (2012) What artificial grammar learning reveals about the neurobiology of syntax. Brain and Language, 120(2), 83–95. DOI.
RiCr04
Rizzolatti, G., & Craighero, L. (2004) The Mirror-Neuron System. Annual Review of Neuroscience, 27, 169–192. DOI.
Smit03
Smith, K. (2003) The Transmission of Language: models of biological and cultural evolution.
SmKi08
Smith, K., & Kirby, S. (2008) Cultural evolution: implications for understanding the human language faculty and its evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 363, 3591–3603. DOI.
SoCF10
Solé, R. V., Corominas-Murtra, B., & Fortuny, J. (2010) Diversity, competition, extinction: the ecophysics of language change. Journal of The Royal Society Interface, rsif20100110. DOI.
StTe05
Steyvers, M., & Tenenbaum, J. B. (2005) The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth. Cognitive Science, 29(1), 41–78. DOI.
SNVV14
Stolk, A., Noordzij, M. L., Verhagen, L., Volman, I., Schoffelen, J.-M., Oostenveld, R., … Toni, I. (2014) Cerebral coherence between communicators marks the emergence of meaning. Proceedings of the National Academy of Sciences, 111(51), 18183–18188. DOI.
Zane06
Zanette, D. H. (2006) Zipf’s law and the creation of musical context. Musicae Scientiae, 10(1), 3–18. DOI.