# Garbled highlights from NIPS 2016

## Favoured Demos

Memo Akten, Tom White, Google Magenta team.

## Variational Inference: foundations and Modern Methods

David Blei, Shakir Mohamed, Rajesh Ranganath did a really good tutorial on variational inference; my favourite, and a model of clarity.

Keywords of interest:

• SGD VI

• Introducing particular dependence stuctures in latent variables, e.g. inverse autoregressive flow

• Deep variable transforms by choosing transforms with convenient Jacobian determinant.

## Theory and Algorithms for forecasting Non-stationary Time series

Vitaly Kuznetsov, Mehryar Mohri

At last! Learning theory for time series!

Ian Goodfellow

Isola et al 2016 Image->Image translation looks good for new-wave super-resolution. But how would you find the estimation error of such a method for a given statistic?

## MetaGrad: Multiple Learning rates in Online Learning

Tim van Erven, Wouter M Koolen

Learn correct learning rate by simultaneously trying many.

Question: Why is this online-specific?

## Structured Orthogonal Random Features

I forget who presented YSCH16.

We present an intriguing discovery related to Random Fourier Features: replacing multiplication by a random Gaussian matrix with multiplication by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for its effectiveness. Motivated by the discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from $$\mathcal{O}(d^2)$$ to $$\mathcal{O}(d log d)$$, where d is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF.

Leads naturally to question: How to manage other types of correlation. How about time series?

## Universal Correspondence Network

I forgot who presented CGSC16, which integrates geometric transforms into CNNs in a reasonably natural way:

We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning across rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity.

Cries out for a musical implementation

## Weight Normalization: A simple reparameterisation to Accelerate Training of Deep Neural Networks

Tim Salimans presents the simplest paper at NIPS, SaKi16:

We present weight normalization: a reparameterisation of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterisation is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.

An elaborate motivation for a conceptually and practically simple way (couple of lines of code) of fixing up batch normalisation.

## Relevant sparse codes with variational information bottleneck

Matthew Chalk presents ChMT16.

In many applications, it is desirable to extract only the relevant aspects of data. A principled way to do this is the information bottleneck (IB) method, where one seeks a code that maximises information about a relevance variable, Y, while constraining the information encoded about the original data, X. Unfortunately however, the IB method is computationally demanding when data are high-dimensional and/or non-gaussian. Here we propose an approximate variational scheme for maximising a lower bound on the IB objective, analogous to variational EM. Using this method, we derive an IB algorithm to recover features that are both relevant and sparse. Finally, we demonstrate how kernelised versions of the algorithm can be used to address a broad range of problems with non-linear relation between X and Y.

This one is a cool demo machine.

## Dense Associative Memory for Pattern recognition

Dmitry Krotov presents KrHo16:

We propose a model of associative memory having an unusual mathematical structure. Contrary to the standard case, which works well only in the limit when the number of stored memories is much smaller than the number of neurons, our model stores and reliably retrieves many more patterns than the number of neurons in the network. We propose a simple duality between this dense associative memory and neural networks commonly used in models of deep learning. On the associative memory side of this duality, a family of models that smoothly interpolates between two limiting cases can be constructed. One limit is referred to as the feature-matching mode of pattern recognition, and the other one as the prototype regime. On the deep learning side of the duality, this family corresponds to neural networks with one hidden layer and various activation functions, which transmit the activities of the visible neurons to the hidden layer. This family of activation functions includes logistics, rectified linear units, and rectified polynomials of higher degrees. The proposed duality makes it possible to apply energy-based intuition from associative memory to analyze computational properties of neural networks with unusual activation functions – the higher rectified polynomials which until now have not been used for training neural networks. The utility of the dense memories is illustrated for two test cases: the logical gate XOR and the recognition of handwritten digits from the MNIST data set.

## Density estimation using Real NVP

Laurent Dinh explains DiSB16:

Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.

## InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Xi Chen presents CCDH16:

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.

Usable parameterizations of GAN by structuring the latent space.

## Parameter Learning for Log-supermodular Distributions

Tatiana Shpakova presents ShBa16.

Hack of note:

In order to minimize the expectation[…], we propose to use the projected stochastic gradient method, not on the data as usually done, but on our own internal randomization.

## Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates

Non-negative matrix factorization is a popular tool for decomposing data into feature and weight matrices under non-negativity constraints. It enjoys practical success but is poorly understood theoretically. This paper proposes an algorithm that alternates between decoding the weights and updating the features, and shows that assuming a generative model of the data, it provably recovers the ground- truth under fairly mild conditions. In particular, its only essential requirement on features is linear independence. Furthermore, the algorithm uses ReLU to exploit the non-negativity for decoding the weights, and thus can tolerate adversarial noise that can potentially be as large as the signal, and can tolerate unbiased noise much larger than the signal. The analysis relies on a carefully designed coupling between two potential functions, which we believe is of independent interest.

## Time series workshop

Luminaries:

• Mehryar Mohri

• Yan Liu

• Andrew Nobel

• Inderjit Dhillon

• Stephen Roberts

Mehryar Mohri presented his online-learning time series analysis using mixtures of experts through empirical discrepancy. He had me up until the model selection phase, when I got lost in a recursive argument. Will come back to this.

Yan Liu - FDA approaches, Hawkes models, clustering of time series. Large section on subspace clustering, which I guess I need to comprehend at some point. Time is special because it reflects the arrow of entropy. Also it can give us a notion of real causality.

Andrew B. Nobel- important of mis-specification in time series models, wrt compounding of the problem over time, increased difficulty of validating assumptions. Time is special because it compounds error. P.s. why not more focus on algorithm failure cases? NIPS conference dynamic doesn’t encourage falsification.

Mohri: time is special because i.i.d is a special case thereof. “Prediction” really is about the future states with these. (How do you do inference of “true models” in his formalism?)

I missed the name of one Bayesian presenter, who asked:

Why not use DNN to construct features? How can the feature construction of DNNs be plugged in to Bayesian models? BTW, Bayesian nonparametrics still state of the art for general time series.

## High dimensional learning with structure

Luminaries:

• Richard Samworth

• Po-Ling Loh

• Sahand Negahban

• Mark Schmidt

• Kai-Wei Chang

• Allen Yang

• Chinmay Hegde

• Rene Vidal

• Guillaume Obozinski

• Lorenzo Rosasco

Several applications necessitate learning a very large number of parameters from small amounts of data, which can lead to overfitting, statistically unreliable answers, and large training/prediction costs. A common and effective method to avoid the above mentioned issues is to restrict the parameter-space using specific structural constraints such as sparsity or low rank. However, such simple constraints do not fully exploit the richer structure which is available in several applications and is present in the form of correlations, side information or higher order structure. Designing new structural constraints requires close collaboration between domain experts and machine learning practitioners. Similarly, developing efficient and principled algorithms to learn with such constraints requires further collaborations between experts in diverse areas such as statistics, optimization, approximation algorithms etc. This interplay has given rise to a vibrant research area.

The main objective of this workshop is to consolidate current ideas from diverse areas such as machine learning, signal processing, theoretical computer science, optimization and statistics, clarify the frontiers in this area, discuss important applications and open problems, and foster new collaborations.

Chinmay Hegde:

We consider the demixing problem of two (or more) high-dimensional vectors from nonlinear observations when the number of such observations is far less than the ambient dimension of the underlying vectors. Specifically, we demonstrate an algorithm that stably estimate the underlying components under general structured sparsity assumptions on these components. Specifically, we show that for certain types of structured superposition models, our method provably recovers the components given merely n = O(s) samples where s denotes the number of nonzero entries in the underlying components. Moreover, our method achieves a fast (linear) convergence rate, and also exhibits fast (near-linear) per-iteration complexity for certain types of structured models. We also provide a range of simulations to illustrate the performance of the proposed algorithm.

This ends up being a sparse recovery for given bases (e.g. Dirac deltas plus Fourier basis). The interesting problem is recovering the correct decomposition with insufficient incoherence (they have a formalism for this)

Rene Vidal: “Deep learning is nonlinear tensor factorization”. Various results on tensor factorization, regularized with various norms. They have proofs for a generalized class of matrix factorisations that “Sufficiently wide” factorization matrices do not have local minima. Conclusion: increase size of factorization, in optimisation procedure.

Guillaume Obozinski: hierarchical sparsity penalties for DAG inference.

Makoto Yamada ,Koh Takeuchi, Tomoharu Iwata, John Shawe-Taylor, Samuel Kaski. Localized Lasso for High-Dimensional Regression

## Doug Eck

Presents magenta.

## NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop

NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop

## Adaptive and Scalable Nonparametric Methods in Machine Learning

Looked solidly amazing, but I was caught up elsewhere:

Adaptive and Scalable Nonparametric Methods in Machine Learning

## Brains and Bits: Neuroscience Meets Machine Learning

Max Welling: Making Deep Learning Efficient Through Sparsification.

## Constructive machine learning

Rus Salakhutdinov

On Multiplicative Integration with Recurrent Neural Networks Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Ruslan R. Salakhutdinov

Constructive machine learning