Feature construction for inconvenient data, made famous by word embeddings such as `word2vec`, which turn out to be surprisingly [semantic](semantics.md). Note that word2vec has a complex relationship to its documentation.
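The celebrated "semantic" regularity is that vector arithmetic on word vectors tracks analogies (MiYZ13). A toy illustration, using hand-made 2-d vectors as purely hypothetical stand-ins for real learned embeddings:

```python
import numpy as np

# Hand-made toy vectors (axes roughly "gender" and "royalty") --
# hypothetical stand-ins for vectors a word2vec model would learn.
vectors = {
    "king":  np.array([ 1.0,  1.0]),
    "queen": np.array([-1.0,  1.0]),
    "man":   np.array([ 1.0, -1.0]),
    "woman": np.array([-1.0, -1.0]),
    "apple": np.array([ 0.1, -1.0]),  # an unrelated filler word
}

def analogy(a, b, c, vectors):
    """Return the word whose vector is nearest (cosine) to a - b + c."""
    target = vectors[a] - vectors[b] + vectors[c]

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Exclude the query words themselves, as is conventional.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("king", "man", "woman", vectors))  # → queen
```

With real embeddings the same `a - b + c` nearest-neighbour query is what produces the famous king − man + woman ≈ queen result.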

Guo and Berkhahn, *Entity Embeddings of Categorical Variables* (code), whose abstract summarises the idea:

> We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but, more importantly, by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition and were able to reach the third position with relatively simple features. We further demonstrate in this paper that entity embedding helps the neural network to generalize better when the data is sparse and statistics are unknown. Thus it is especially useful for datasets with many high-cardinality features, where other methods tend to overfit. We also demonstrate that the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as input features instead. As entity embedding defines a distance measure for categorical variables, it can be used for visualizing categorical data and for data clustering.
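A minimal NumPy sketch of the mechanism, not the paper's network: one hypothetical synthetic categorical feature (7 levels, say day of week) predicting a numeric target, with a 2-d embedding table trained jointly with a linear readout by plain gradient descent. The lookup `E[x]` plays the role of `one_hot(x) @ E` without materialising the one-hot matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cat, emb_dim, n = 7, 2, 4000
# Hidden per-category effect with a "weekend" bump (synthetic ground truth).
true_effect = np.array([0.0, 0.1, 0.2, 0.3, 0.2, 0.9, 1.0])
x = rng.integers(0, n_cat, size=n)               # the categorical feature
y = true_effect[x] + 0.05 * rng.standard_normal(n)

E = rng.standard_normal((n_cat, emb_dim))        # the embedding table
w = 0.1 * rng.standard_normal(emb_dim)           # linear readout weights
b = 0.0
lr = 0.2

def mse():
    return np.mean((E[x] @ w + b - y) ** 2)

mse_start = mse()
for _ in range(5000):
    e = E[x]                                     # lookup, not one_hot(x) @ E
    err = e @ w + b - y
    # Gradients of mean squared error w.r.t. each parameter.
    gw = 2 * e.T @ err / n
    gb = 2 * err.mean()
    gE = np.zeros_like(E)
    np.add.at(gE, x, 2 * np.outer(err, w) / n)   # scatter-add per category
    w -= lr * gw
    b -= lr * gb
    E -= lr * gE
```

After training, the per-category predictions `E @ w + b` track the hidden effect, and categories with similar effects end up near each other in the embedding space, which is the "intrinsic properties" point in the abstract. In practice one would let a framework such as Keras or PyTorch handle the embedding layer and backpropagation.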

Christopher Olah discusses embeddings from a neural network perspective, with diagrams, and commends Bengio et al.'s BDVJ03 for a rationale.

Rutger Ruizendaal has a tutorial on learning embedding layers.

## Refs

- GALG06: David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, Yorick Wilks (2006) A Closer Look at Skip-gram Modelling.
- BDVJ03: Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin (2003) A Neural Probabilistic Language Model. *Journal of Machine Learning Research*, 3, 1137–1155.
- CaHM07: Hui Cao, George Hripcsak, Marianthi Markatou (2007) A statistical methodology for analyzing co-occurrence data from a large sample. *Journal of Biomedical Informatics*, 40(3), 343–352.
- LeMi14: Quoc V. Le, Tomas Mikolov (2014) Distributed Representations of Sentences and Documents. In *Proceedings of the 31st International Conference on Machine Learning* (pp. 1188–1196).
- MSCC13: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013) Distributed Representations of Words and Phrases and their Compositionality. In *Advances in Neural Information Processing Systems 26* (pp. 3111–3119). Curran Associates, Inc. arXiv:1310.4546 [cs, stat].
- MCCD13: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013) Efficient Estimation of Word Representations in Vector Space. *arXiv:1301.3781 [cs]*.
- MiLS13: Tomas Mikolov, Quoc V. Le, Ilya Sutskever (2013) Exploiting Similarities among Languages for Machine Translation. *arXiv:1309.4168 [cs]*.
- PeSM14: Jeffrey Pennington, Richard Socher, Christopher D. Manning (2014) GloVe: Global Vectors for Word Representation. In *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)*.
- NCVC17: Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, Shantanu Jaiswal (2017) graph2vec: Learning Distributed Representations of Graphs. *arXiv:1707.05005 [cs]*.
- MiYZ13: Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig (2013) Linguistic Regularities in Continuous Space Word Representations. In *HLT-NAACL* (pp. 746–751).
- KZSZ15: Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler (2015) Skip-Thought Vectors. *arXiv:1506.06726 [cs]*.
- LNBB15: Angeliki Lazaridou, Dat Tien Nguyen, Raffaella Bernardi, Marco Baroni (2015) Unveiling the Dreams of Word Embeddings: Towards Language-Driven Image Generation. *arXiv:1506.03500 [cs]*.