Favoured Demos
Memo Akten, Tom White, Google Magenta team.
Variational Inference: Foundations and Modern Methods
David Blei, Shakir Mohamed and Rajesh Ranganath gave a really good tutorial on variational inference; my favourite of the conference, and a model of clarity.
Keywords of interest:

SGD-based variational inference

Introducing particular dependence structures in latent variables, e.g. inverse autoregressive flows

Deep variable transforms, built by choosing transforms with convenient Jacobian determinants.
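These keywords all lean on the change-of-variables formula: if x = f(z) for invertible f, then log p_x(x) = log p_z(f⁻¹(x)) + log|det J_{f⁻¹}(x)|. A minimal numerical sanity check (names and setup are mine), recovering the log-normal density from a standard normal via the transform x = exp(z):

```python
import numpy as np
from scipy import stats

def lognormal_logpdf_via_change_of_vars(x):
    # x = f(z) = exp(z) with z ~ N(0, 1); the inverse is z = log(x).
    # log p_x(x) = log p_z(log x) - log|df/dz| evaluated at z = log x,
    # and log|df/dz| = log exp(z) = z.
    z = np.log(x)
    return stats.norm.logpdf(z) - z

x = np.array([0.5, 1.0, 2.0])
ours = lognormal_logpdf_via_change_of_vars(x)
ref = stats.lognorm.logpdf(x, s=1.0)  # scipy's log-normal, shape s = 1
print(np.allclose(ours, ref))  # True
```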
Theory and Algorithms for Forecasting Nonstationary Time Series
Vitaly Kuznetsov, Mehryar Mohri
At last! Learning theory for time series!
Generative Adversarial Models
Ian Goodfellow
Isola et al's 2016 Image-to-Image translation looks good for new-wave super-resolution. But how would you find the estimation error of such a method for a given statistic?
MetaGrad: Multiple Learning Rates in Online Learning
Tim van Erven, Wouter M Koolen
Learn the correct learning rate by simultaneously trying many.
Question: why is this online-specific?
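For flavour only — this is not van Erven and Koolen's actual algorithm, just the generic "run a grid of learning rates and aggregate them with exponential weights" idea, on a 1-d quadratic tracking problem with constants of my own choosing:

```python
import numpy as np

etas = np.array([0.01, 0.05, 0.2, 0.8])  # candidate learning rates (grid is mine)
xs = np.zeros(len(etas))                 # one iterate per candidate rate
logw = np.zeros(len(etas))               # log-weights over the candidates
rng = np.random.default_rng(0)
total_loss, T = 0.0, 500
for t in range(T):
    c = 1.0 + 0.1 * rng.normal()         # noisy target around 1.0
    w = np.exp(logw - logw.max())
    w /= w.sum()
    x = w @ xs                           # master prediction: weighted average
    total_loss += (x - c) ** 2
    logw -= 0.5 * (xs - c) ** 2          # downweight candidates that predicted badly
    xs -= etas * 2.0 * (xs - c)          # each candidate takes its own gradient step
avg_loss = total_loss / T
print(avg_loss)
```

After a short transient, the master's average loss sits near the noise floor, without having committed to a single learning rate in advance.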
Structured Orthogonal Random Features
I forget who presented YSCH16.
We present an intriguing discovery related to Random Fourier Features: replacing multiplication by a random Gaussian matrix with multiplication by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for its effectiveness. Motivated by the discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from O(d²) to O(d log d), where d is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF.
Leads naturally to a question: how to manage other types of correlation? How about time series?
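A rough numpy sketch of the contrast the abstract describes, comparing plain Gaussian random Fourier features against orthogonal ones (an orthogonalised Gaussian matrix with rows rescaled to chi-distributed norms, so each row matches a Gaussian row's marginal); dimensions and constants are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200                       # data dimension, sample count
X = rng.normal(size=(n, d)) / np.sqrt(d)

def kernel_error(W):
    # Random Fourier features for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2)
    Z = np.concatenate([np.cos(X @ W.T), np.sin(X @ W.T)], axis=1) / np.sqrt(W.shape[0])
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2)
    return np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K)

# Plain RFF: rows of W are i.i.d. standard Gaussian.
W_gauss = rng.normal(size=(d, d))
# ORF-style: orthogonalise a Gaussian matrix, then rescale each row to a
# chi-distributed norm (the norm of an independent Gaussian row).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
norms = np.linalg.norm(rng.normal(size=(d, d)), axis=1)
W_orth = norms[:, None] * Q

err_gauss, err_orth = kernel_error(W_gauss), kernel_error(W_orth)
print(err_gauss, err_orth)  # the orthogonal variant typically wins
```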
Universal Correspondence Network
I forgot who presented CGSC16, which integrates geometric transforms into CNNs in a reasonably natural way:
We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning across rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity.
Cries out for a musical implementation.
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
Tim Salimans presents the simplest paper at NIPS, SaKi16:
We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a mini-batch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.
An elaborate motivation for a conceptually and practically simple (a couple of lines of code) way of fixing up batch normalisation.
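The couple of lines in question amount to w = g · v/‖v‖: the direction v and the scale g are learned as separate parameters. A minimal numpy sketch (names are mine):

```python
import numpy as np

def weightnorm_forward(v, g, x):
    # Weight normalization: w = g * v / ||v||, so the effective weight's
    # norm is always exactly g, regardless of the direction parameter v.
    w = g * v / np.linalg.norm(v)
    return x @ w

v = np.array([3.0, 4.0])   # direction parameter (norm 5)
g = 2.0                    # scale parameter
x = np.eye(2)              # identity inputs just read off the effective weights
w_eff = weightnorm_forward(v, g, x)
print(np.linalg.norm(w_eff))  # 2.0 — the norm is decoupled from v
```

Gradients with respect to v then only move the direction, while g alone controls the scale, which is the claimed conditioning benefit.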
Relevant sparse codes with variational information bottleneck
Matthew Chalk presents ChMT16.
In many applications, it is desirable to extract only the relevant aspects of data. A principled way to do this is the information bottleneck (IB) method, where one seeks a code that maximises information about a relevance variable, Y, while constraining the information encoded about the original data, X. Unfortunately however, the IB method is computationally demanding when data are high-dimensional and/or non-Gaussian. Here we propose an approximate variational scheme for maximising a lower bound on the IB objective, analogous to variational EM. Using this method, we derive an IB algorithm to recover features that are both relevant and sparse. Finally, we demonstrate how kernelised versions of the algorithm can be used to address a broad range of problems with nonlinear relation between X and Y.
This one is a cool demo machine.
Dense Associative Memory for Pattern Recognition
Dmitry Krotov presents KrHo16:
We propose a model of associative memory having an unusual mathematical structure. Contrary to the standard case, which works well only in the limit when the number of stored memories is much smaller than the number of neurons, our model stores and reliably retrieves many more patterns than the number of neurons in the network. We propose a simple duality between this dense associative memory and neural networks commonly used in models of deep learning. On the associative memory side of this duality, a family of models that smoothly interpolates between two limiting cases can be constructed. One limit is referred to as the feature-matching mode of pattern recognition, and the other one as the prototype regime. On the deep learning side of the duality, this family corresponds to neural networks with one hidden layer and various activation functions, which transmit the activities of the visible neurons to the hidden layer. This family of activation functions includes logistics, rectified linear units, and rectified polynomials of higher degrees. The proposed duality makes it possible to apply energy-based intuition from associative memory to analyze computational properties of neural networks with unusual activation functions – the higher rectified polynomials which until now have not been used for training neural networks. The utility of the dense memories is illustrated for two test cases: the logical gate XOR and the recognition of handwritten digits from the MNIST data set.
Density estimation using Real NVP
Laurent Dinh explains DiSB16:
Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.
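The piece that makes all of this exact is the affine coupling layer: half the dimensions pass through unchanged and parameterise a scale and shift of the other half, so the Jacobian is triangular and its log-determinant is just a sum of the scales. A toy sketch (the s and t "networks" here are hypothetical stand-ins for learned nets):

```python
import numpy as np

def coupling_forward(x, s_fn, t_fn, d):
    # Affine coupling: y1 = x1; y2 = x2 * exp(s(x1)) + t(x1).
    # The Jacobian is triangular, so log|det J| = sum(s(x1)) exactly.
    x1, x2 = x[:d], x[d:]
    s, t = s_fn(x1), t_fn(x1)
    return np.concatenate([x1, x2 * np.exp(s) + t]), np.sum(s)

def coupling_inverse(y, s_fn, t_fn, d):
    # Inversion never requires inverting s_fn or t_fn themselves.
    y1, y2 = y[:d], y[d:]
    s, t = s_fn(y1), t_fn(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

s_fn = lambda x1: np.tanh(x1)   # stand-in for a learned scale network
t_fn = lambda x1: 0.5 * x1      # stand-in for a learned shift network

x = np.array([0.3, -1.2, 0.7, 2.0])
y, logdet = coupling_forward(x, s_fn, t_fn, d=2)
x_back = coupling_inverse(y, s_fn, t_fn, d=2)
print(np.allclose(x, x_back))  # True: inversion is exact
```

Stacking such layers with alternating partitions gives an expressive map whose log-likelihood is still computable in closed form.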
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
Xi Chen presents CCDH16:
This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.
Usable parameterizations of GANs, obtained by structuring the latent space.
Parameter Learning for Log-Supermodular Distributions
Tatiana Shpakova presents ShBa16.
Hack of note:
In order to minimize the expectation […], we propose to use the projected stochastic gradient method, not on the data as usually done, but on our own internal randomization.
Recovery Guarantee of Nonnegative Matrix Factorization via Alternating Updates
Nonnegative matrix factorization is a popular tool for decomposing data into feature and weight matrices under nonnegativity constraints. It enjoys practical success but is poorly understood theoretically. This paper proposes an algorithm that alternates between decoding the weights and updating the features, and shows that assuming a generative model of the data, it provably recovers the ground truth under fairly mild conditions. In particular, its only essential requirement on features is linear independence. Furthermore, the algorithm uses ReLU to exploit the nonnegativity for decoding the weights, and thus can tolerate adversarial noise that can potentially be as large as the signal, and can tolerate unbiased noise much larger than the signal. The analysis relies on a carefully designed coupling between two potential functions, which we believe is of independent interest.
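A toy sketch of the two alternating steps the abstract describes — ReLU decoding of the weights, then a least-squares feature update — on noiseless synthetic data, where the decoding is exact. This is an illustration of the two steps, not the paper's full algorithm or its noise analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 8, 3, 50
A = rng.uniform(size=(d, k))   # nonnegative features, linearly independent w.h.p.
W = rng.uniform(size=(k, n))   # nonnegative weights
Y = A @ W                      # observed data under the generative model

# Decoding step: ReLU of the least-squares solve. On noiseless data with
# linearly independent features, pinv(A) @ Y = W exactly, and since W >= 0
# the ReLU leaves it unchanged (with noise, it clips spurious negatives).
W_hat = np.maximum(np.linalg.pinv(A) @ Y, 0.0)
print(np.allclose(W_hat, W))   # True

# Feature update step: least squares given the decoded weights, projected
# back to the nonnegative orthant.
A_hat = np.maximum(Y @ np.linalg.pinv(W_hat), 0.0)
print(np.allclose(A_hat, A))   # True
```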
Time series workshop
Time series workshop home page, and the nonstationary time series tutorial with video.
Luminaries:

Mehryar Mohri

Yan Liu

Andrew Nobel

Inderjit Dhillon

Stephen Roberts
Mehryar Mohri presented his online-learning time-series analysis, using mixtures of experts weighted via empirical discrepancy. He had me up until the model-selection phase, when I got lost in a recursive argument. Will come back to this.
Yan Liu: FDA approaches, Hawkes models, clustering of time series. A large section on subspace clustering, which I guess I need to comprehend at some point. Time is special because it reflects the arrow of entropy. It can also give us a notion of real causality.
Andrew B. Nobel: the importance of misspecification in time-series models, with respect to the compounding of the problem over time and the increased difficulty of validating assumptions. Time is special because it compounds error. P.S. Why not more focus on algorithms' failure cases? The NIPS conference dynamic doesn't encourage falsification.
Mohri: time is special because i.i.d. data are a special case thereof. “Prediction” really is about future states in these models. (How do you do inference of “true models” in his formalism?)
I missed the name of one Bayesian presenter, who asked:
Why not use DNN to construct features? How can the feature construction of DNNs be plugged in to Bayesian models? BTW, Bayesian nonparametrics still state of the art for general time series.
High dimensional learning with structure
High dimensional learning with structure page.
Luminaries:

Richard Samworth

Po-Ling Loh

Sahand Negahban

Mark Schmidt

Kai-Wei Chang

Allen Yang

Chinmay Hegde

Rene Vidal

Guillaume Obozinski

Lorenzo Rosasco
Several applications necessitate learning a very large number of parameters from small amounts of data, which can lead to overfitting, statistically unreliable answers, and large training/prediction costs. A common and effective method to avoid the above-mentioned issues is to restrict the parameter space using specific structural constraints such as sparsity or low rank. However, such simple constraints do not fully exploit the richer structure which is available in several applications and is present in the form of correlations, side information or higher-order structure. Designing new structural constraints requires close collaboration between domain experts and machine learning practitioners. Similarly, developing efficient and principled algorithms to learn with such constraints requires further collaborations between experts in diverse areas such as statistics, optimization, approximation algorithms etc. This interplay has given rise to a vibrant research area.
The main objective of this workshop is to consolidate current ideas from diverse areas such as machine learning, signal processing, theoretical computer science, optimization and statistics, clarify the frontiers in this area, discuss important applications and open problems, and foster new collaborations.
Chinmay Hegde:
We consider the demixing problem of two (or more) high-dimensional vectors from nonlinear observations when the number of such observations is far less than the ambient dimension of the underlying vectors. Specifically, we demonstrate an algorithm that stably estimates the underlying components under general structured sparsity assumptions on these components. Specifically, we show that for certain types of structured superposition models, our method provably recovers the components given merely n = O(s) samples where s denotes the number of nonzero entries in the underlying components. Moreover, our method achieves a fast (linear) convergence rate, and also exhibits fast (near-linear) per-iteration complexity for certain types of structured models. We also provide a range of simulations to illustrate the performance of the proposed algorithm.
This ends up being sparse recovery over given bases (e.g. Dirac deltas plus a Fourier basis). The interesting problem is recovering the correct decomposition with insufficient incoherence (they have a formalism for this).
Rene Vidal: “Deep learning is nonlinear tensor factorization”. Various results on tensor factorization, regularized with various norms. They have proofs, for a generalized class of matrix factorisations, that “sufficiently wide” factorization matrices have no local minima. Conclusion: increase the size of the factorization during the optimisation procedure.
Guillaume Obozinski: hierarchical sparsity penalties for DAG inference.
Makoto Yamada, Koh Takeuchi, Tomoharu Iwata, John Shawe-Taylor, Samuel Kaski: Localized Lasso for High-Dimensional Regression
Doug Eck
Presents magenta.
Computing with spikes workshop
computing with spikes home page.
Bayesian Deep Learning workshop
Bayesian Deep Learning workshop homepage.
NIPS 2016 Endtoend Learning for Speech and Audio Processing Workshop
NIPS 2016 Endtoend Learning for Speech and Audio Processing Workshop
Adaptive and Scalable Nonparametric Methods in Machine Learning
Looked solidly amazing, but I was caught up elsewhere:
Adaptive and Scalable Nonparametric Methods in Machine Learning
Brains and Bits: Neuroscience Meets Machine Learning
Especially curious about
Max Welling: Making Deep Learning Efficient Through Sparsification.
Spatiotemporal forecasting
homepage of NIPS workshop on ML for Spatiotemporal Forecasting.
Constructive machine learning
Rus Salakhutdinov
On Multiplicative Integration with Recurrent Neural Networks Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Ruslan R. Salakhutdinov