# Virtual Trip Report ICML 2020

Last week, I attended my first machine learning conference: ICML 2020! The virtual format required by the corona measures made joining cheap and easy, though certainly at a cost: Poster sessions, especially, seemed to have a relatively high bar to entry, resulting in a lot of quiet or empty Zoom rooms. And while the virtual format allows extremely quick hopping between different livestreams, it also seemed to decrease interaction, and, at least for me, it was often hard to focus on the talks. Still, I have to thank the organizing committee for the stellar infrastructure created within a short period of time that offered many avenues for networking, discussion and learning. I had some great discussions during the Q&A sessions!

Before going into this trip report, I should note that I have my own interests, in particular generative modeling and neuro-symbolic AI. It is easy to think these topics are very popular considering how many papers ICML had on these topics, but such trends can be discovered for practically any topic given the large volume of papers!

My highlights of ICML 2020 were certainly the great tutorials and workshops. The tutorial on Bayesian Deep Learning by Andrew Wilson was very insightful, and the workshops were varied and relevant, with two whole days focused on Graph Neural Network (GNN) approaches. If there is one real trend to spot, then it is the increased focus on adding prior knowledge through for example inductive bias instead of blindly applying NNs!

## Monday: Tutorials

### Representation Learning without Labels

This tutorial by Danilo Rezende, Ali Eslami and Irina Higgins introduced the currently very popular problem of unsupervised representation learning, or the learning of features without using labelled data. They introduced this problem through Plato’s famous allegory of the cave: Humans in the bottom of a cave can only observe the shadows of objects, which is all they can use to figure out how to behave.

Similarly, our model has to deduce many interesting properties of an object from eg just a 2D pixel image. How would it figure out what properties are of interest while only observing part of the object? This is challenging without labels. Luckily, these days unsupervised pre-training allows transferring useful representations to then fine-tune them on a specific task.

A popular approach is to use reconstruction: Learn to compress input data to a lower-dimensional space, and then reconstruct the input data from this embedding. Unfortunately, the learned embedding is often not that useful! They show two methods to make it more useful: Build in structure or inductive bias, or train for a proxy task instead. An example of the latter is the recently very popular self-supervised contrastive learning: Transform an input into similar variants using data augmentation, and ensure the resulting embeddings are similar. This was already applied to images, but this conference also has a paper that applies it to representing computer programs.
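To make the contrastive idea a bit more concrete, here is a minimal NumPy sketch of one common objective of this kind (an InfoNCE-style loss). It is an illustration only and is not taken from the tutorial: the function name `info_nce`, the `temperature` parameter and the use of the other items in the batch as negatives are all assumptions.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE-style loss; z1[i] and z2[i] embed two augmentations of input i.

    Hypothetical helper for illustration: every other row in the batch is
    treated as a negative example.
    """
    # Normalize so that the dot product is a cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair (i, i) should receive the highest probability.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
# Slightly perturbed copies stand in for "augmented" views of the same input.
aligned = info_nce(z, z + 0.01 * rng.standard_normal(z.shape))
# Shifting the rows pairs every embedding with the wrong partner.
shuffled = info_nce(z, np.roll(z, 1, axis=0))
```

The loss is low when the two views of the same input are embedded close together and high when they are mismatched, which is the behavior the contrastive objective rewards.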

In the landscape section of the tutorial, Ali and Irina discussed the possible modelling choices to make for unsupervised representation learning. The main modelling approaches are VAEs (explicit likelihood), GANs (implicit likelihood), Energy-based models (undirected) and exact likelihood models (flows, autoregressive models).

For the level of modelling, one has to choose 1) the depth to which the manifold of the data is modeled, and 2) the degree to which causality is modeled.

### Causal Reinforcement Learning

Elias Bareinboim presented his ideas about causal reinforcement learning. This frames RL through the lens of Causal Inference to create a new Causal RL loop, where the environment and agent are tied to a Structural Causal Model and a Causal Graph.

Causal Inference allows answering new questions within RL:


What I found surprising is that RL already involves interventional learning: In online RL, an agent performs all experiments herself, and learns $p(y|do(x))$. Furthermore, the distinction between off-policy learning and do-calculus learning is that off-policy learning assumes the underlying context (causal structure) of the collected data is the same as the one we will evaluate in. This is often not the case, for example when learning from medical data.

### Parameter-free Online Optimization

I only watched a bit of this tutorial by Francesco Orabona and Ashok Cutkosky, which argued for designing optimization algorithms that don’t have any hyperparameters at all (not even a learning rate!). As an example, they showed that using coin betting math, you can find very good online convex optimization algorithms! This talk looked very promising and I might finish it sometime.

### Bayesian Deep Learning and a Probabilistic Perspective of Model Construction

Andrew Wilson presented possibly my favorite tutorial on Bayesian Deep Learning. A model is able to solve a task well if the solution is in its support (the set of possible solutions) and if its inductive biases a priori assign high likelihood to the solution. I love the picture below, which explains this well. Linear models can represent only linear solutions, and do not prefer image datasets, ie, their inductive bias is poor for the task. MLPs can support any type of data, but also do not strongly prefer certain solutions. Finally, CNNs also support many datasets, but because of their strong inductive biases they are a good fit for image datasets!

This tutorial argues for the “Bayesian Model Average” (BMA, aka the Bayesian predictive distribution) $p(y|x, Y, X)=\int p(y|x, \theta)p(\theta|Y, X)d\theta$. It averages over all parameter settings of a model! Each setting gives very different predictions, especially in NNs, because they can represent many complementary explanations for the data. Furthermore, it gives a principled approach to uncertainty estimation.

Wilson used a nice minimal example that shows the benefit of BMA. Assume you have a biased coin with probability of tails $\lambda$, which we flip once, landing on tails. Using MLE, we would (over)estimate $\lambda$ to be 1 from just this flip! The Bayesian approach would be to add a prior over $\lambda$. However, the MAP (maximum of the posterior over $\lambda$) is not an option either: With a uniform prior, the MAP of this problem is also $\lambda=1$. The BMA gives $\mathbb{E}[\lambda]=\frac{2}{3}$ instead, which seems much more reasonable!
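The three estimates in the coin example can be checked with a few lines of Python. This is a small sketch that assumes a Beta prior on $\lambda$ (the uniform prior is Beta(1, 1)), so the posterior is again a Beta distribution. The function name `coin_estimates` and its arguments are made up for this illustration.

```python
from fractions import Fraction

def coin_estimates(tails, heads, prior_a=1, prior_b=1):
    """Return (MLE, MAP, posterior mean) for the probability of tails.

    Assumes a Beta(prior_a, prior_b) prior and at least one observed flip.
    """
    mle = Fraction(tails, tails + heads)
    # Conjugacy: the posterior is Beta(a, b) with the counts added to the prior.
    a = prior_a + tails
    b = prior_b + heads
    # Mode of Beta(a, b), valid for a, b >= 1 and a + b > 2.
    map_estimate = Fraction(a - 1, a + b - 2)
    # The posterior mean is the BMA prediction that the next flip lands on tails.
    posterior_mean = Fraction(a, a + b)
    return mle, map_estimate, posterior_mean

mle, map_estimate, posterior_mean = coin_estimates(tails=1, heads=0)
print(mle, map_estimate, posterior_mean)  # prints: 1 1 2/3
```

The posterior mean is exactly the Bayesian model average here: integrating $p(\text{tails}|\lambda)=\lambda$ against the posterior over $\lambda$ gives $\mathbb{E}[\lambda]$.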

In the second part of the tutorial, Wilson discusses how Gaussian Processes (GPs) arise from the Bayesian view. GPs will prefer simple solutions, even though they theoretically use infinite-dimensional parameters! Furthermore, he shows how to use neural network kernels that allow us to integrate NNs and GPs.

He also discussed their recent library GPyTorch, which is able to scale GPs to run on millions of datapoints using GPUs. Exact GPs are notorious for scaling poorly (cubically) in the number of datapoints, so this is quite impressive!

Next he discussed practical methods for Bayesian DL. Research showed that for many minima in the loss surface, you can walk from one minimum to another through very low regions of loss called ‘basins’.

Approximating the BMA can be done using variational methods or MCMC. The problem with these approaches is that they find unimodal posteriors (ie, in one basin). He argues that simply using ensembles of NNs that also output variance is a better approach: Each independently trained NN is, if I understand correctly, a sample from the posterior over weights. In a new paper, Wilson and Pavel Izmailov introduce MultiSWAG, which improves the BMA for such ensembles by marginalizing over multiple modes, and not only within one basin.

Finally, the BMA overcomes the Double Descent property of Neural Networks: Increasing complexity monotonically decreases test error.

## Tuesday-Thursday: Posters

I saw (parts of) quite a few talks. I don’t have the time (or space) to list everything, so here are my favorites in chronological order:

• Generating Programmatic Referring Expressions via Program Synthesis, Huang et al: Great paper that attempts to generate expressions that uniquely refer to a single object on an image. It uses a very multi-modal approach: Translate images into Scene Graphs, then iteratively create a computer program using program synthesis to create a query that only selects a single object in the generated Scene Graph.
• SoftSort: A Differentiable Continuous Relaxation of the argsort Operator, Prillo et al: Uses the exact output of a sorting operator, then uses the softmax operator on this output to create a continuous relaxation of sorting. Simple, but fast and works well!
• Incremental Sampling Without Replacement for Sequence Models, Shi et al: Introduces a method to iteratively sample from complex sequence models without replacement. For example, if one is generating computer programs that should have some specified behavior, this allows generating one (or multiple, using Stochastic Beam Search!) sample without replacement. If it isn’t correct, simply generate another one without wasting computation on duplicates!
• Learning to Simulate Complex Physics with Graph Networks, Sanchez et al: Translate physical processes into a huge set of particles, create a graph based on their distances, then run a GNN (deep ones, 10 layers!) over this graph to predict the future state of the particles! Pretty simple, but the results are gorgeous, check out the videos.
• Learning Reasoning Strategies in End-to-End Differentiable Proving, Minervini et al: More work on efficiently running “Neural Theorem Provers”, which are models that use a continuous, Prolog-like inference mechanism to reason over relational data. It allows learning interpretable rules by unifying symbols based on their respective embeddings.
• Proving the Lottery Ticket Hypothesis: Pruning is All You Need, Malach et al: Gives a proof of the “Strong Lottery Ticket Hypothesis”, which is (informally): Given a randomly instantiated and overparameterized neural network, you can, simply by removing many connections between the neurons, get to great performance. This is without training any weights! I did some experiments myself by pruning using the Gumbel-Softmax trick, and indeed, it works very well :)
• Explicit Gradient Learning for Black-Box Optimization, Sarafian et al: Studies a method of black-box optimization that uses an NN that learns the gradient of the black box function. Unlike common approaches called Implicit Gradient Learning that model the function and take the derivative of the NN (eg DDPG), this learns the approximate gradient through minimization of the error of the Taylor expansion. I was impressed by the smooth gradients in very discontinuous functions.
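Coming back to the SoftSort item in the list above: the idea fits in a few lines of NumPy. The sketch below follows my reading of the description (sort exactly, then apply a row-wise softmax to the negative distances between sorted and original values). The descending order, the absolute-value distance and the temperature name `tau` are assumptions, and this is not the authors’ code.

```python
import numpy as np

def softsort(s, tau=1.0):
    """Relaxed permutation matrix for sorting the vector s in descending order."""
    s = np.asarray(s, dtype=float)
    s_sorted = np.sort(s)[::-1]  # exact sort, largest value first
    # Entry (i, j) is high when s[j] is close to the i-th largest value.
    logits = -np.abs(s_sorted[:, None] - s[None, :]) / tau
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

scores = np.array([2.0, 5.0, 3.0])
P = softsort(scores, tau=0.1)
hard_order = P.argmax(axis=1)  # recovers the argsort for distinct values
```

Each row of `P` is a probability distribution over input positions. As `tau` goes to zero the rows approach one-hot vectors, so `P` approaches the hard permutation matrix while staying differentiable for any positive `tau`.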

## Friday-Saturday: Workshops

### Graph Representation Learning and Beyond (GRL+)

This workshop focused on using Graph Neural Networks (GNNs) to learn to represent objects and graphs. As GNNs are a hot topic right now, there were many posters, applications, advancements and lots of discussion! Still, I found this workshop somewhat lacking in very novel ideas (for this, I refer you to the GNN workshop a day later, further down).

Xavier Bresson started with a general talk on the history of GNNs: In order, graph embedding techniques, the spectral generalization of convolutions for graphs, message passing. More recent and advanced topics are the intersection with probabilistic graphical models and works focusing on graph isomorphisms that make better use of the theory to ensure that different graphs truly are embedded differently (or: Equivariant GNNs!).

He then continued to discuss evaluation methods of GNNs, arguing that the field is lacking its ImageNet: A dataset complex and large enough to require big advances and very complex models to solve. Currently, GNNs are mainly evaluated on small datasets like Cora and Citeseer. He proposed some datasets to benchmark on. Here, normal message passing GNNs outperformed more complex, equivariant models, possibly because their speed/memory complexities do not scale well.

Next, Thomas Kipf discussed “Relational Structure Discovery”: What if we don’t know enough about the graph, or what if we don’t even have any nodes to begin with? I believe this is a very interesting and important topic, which unfortunately was missing somewhat in the rest of the workshop. Neural Relational Inference creates a graph that connects a set of objects in latent space, and then is able to use this graph to reconstruct the physical dynamics of the objects. This is possible by sampling using the Gumbel softmax trick.
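For readers who have not seen the Gumbel softmax trick before, here is a minimal NumPy sketch of the sampling step. It only illustrates the general trick, not the Neural Relational Inference model itself. The function name, the temperature name `tau` and the toy edge-type probabilities are assumptions.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw relaxed one-hot samples from the categorical given by logits."""
    # Gumbel(0, 1) noise via the inverse CDF; the lower bound avoids log(0).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    # Numerically stable softmax over the last axis.
    y = y - y.max(axis=-1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
edge_probs = np.array([0.2, 0.3, 0.5])  # toy distribution over 3 edge types
logits = np.tile(np.log(edge_probs), (20000, 1))
samples = gumbel_softmax(logits, tau=0.5, rng=rng)
# The argmax of each relaxed sample is an exact categorical sample.
freq = np.bincount(samples.argmax(axis=1), minlength=3) / len(samples)
```

Because the sample is a smooth function of the logits, gradients can flow through the discrete choice (here: which edge type connects two objects). Lowering `tau` makes the samples closer to one-hot.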

To discover entities, one approach is the recently proposed Slot Attention, which uses symmetric slots to which parts of an image are mapped using a competitive attention mechanism, with impressive results! It allows GNNs to be used with unstructured data as well.

I liked the updates on popular software for GNNs, in particular PyTorch Geometric, which looks like a fantastic open source library containing many different models and training utilities to allow training on large graphs. Another interesting resource is the Open Graph Benchmark, which is a collection of diverse datasets for machine learning on graphs that is fully compatible with PyTorch Geometric.

Kyle Cranmer had an interesting keynote on how graphs arise in physics, and what we can learn from physics about inductive bias. He argues this can be divided into compositionality, relationships, symmetry and causality. He also described applications of NNs to graph-structured data in physics, in particular in particle physics. Another model is Set2Graph, which is a permutation invariant method to transform a set of objects into a (hyper)graph. Finally, a very interesting work discovers symbolic models using deep learning where a GNN learns the particle dynamics regularized with L1 to get sparse representations. Then symbolic regression is applied to find new physical equations (which had better generalization than both the GNN itself and a heuristic from the literature!)

### Object-Oriented Learning: Perception, Representation, and Reasoning

This (surprisingly multidisciplinary!) workshop focused on studying how to represent objects (usually on an image) and the interactions between them, ideally in an unsupervised way. This could provide better compositionality and generalization. I only saw a few talks here, but was impressed by the diversity of speakers, with insights from psychology, anthropology and neurocognition.

The workshop started with a fantastic talk by Klaus Greff which discusses from a philosophical and cognitive perspective what objects actually are. Objects are not:

• categories (ie, labeling an image with a single classification is insufficient)
• a supervised notion: We recognize things as objects even if we’ve never seen them before
• units that move as one (eg trees and houses)
• necessarily physical things (eg waterfalls, holes or rainbows)
• just how they appear
• just visual (eg, a campfire has a smell and radiates heat)

That is, objects are other than the sum of their parts. He argues we should segregate unstructured inputs into individual objects which should be represented individually, then composed together into relations. Representations should be self-contained and reusable independent of context (and thus unsupervised!). To do this, he argues for universal slots (like IODINE) that are general and permutation invariant, unlike slots that are time-dependent, spatially dependent or category dependent.

A cool talk was on representation learning for planning in Minecraft.

Judy Culham tackled object recognition from an experimental neurocognitive view: She argues that the image modality is treacherous as it is missing information that humans actively use to recognize and understand objects. By comparing how humans perceive objects and images of these objects, her experiments showed that the distance to the object is of great importance, as it can be used to deduce its size. Other kinds of missing information are 3D depth, other senses such as touch and smell, and the environmental context. Not being able to actively learn by moving around the object also limits the image modality. She argues we can improve object recognition by taking this information into account or modeling it somehow.

Moira Dillon discussed the phenomenon that humans, especially earlier civilizations and young children, tend to prefer drawing just objects, and not quite the layout of these objects. This could be because drawings prioritize elements that elicit our explicit attention, and because geometry is easier to bring across on a 2D surface for objects. I think she argued that we should develop some AI with the goal to learn to draw pictures, so that we can better understand this AI and this AI can better understand us (I admit, I don’t really get this argument).

### INNF+: Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models

This workshop presented work on explicit likelihood models such as VAEs, Normalizing Flows that make use of invertible neural networks, and complicated things I don’t understand like Neural ODEs. This workshop was quite technical, but very interesting with a great lineup of speakers.

Max Welling had the first keynote, discussing a recent framework for likelihood-based models called SurVAE. He argues that by using generative models with strong inductive biases, we can reduce the data hunger of deep learning! Generative models are forward models: They simulate the data generation process. To be able to train generative models, we often use VAEs, which add an inverse model (aka encoder) that reverses the data generation process to find a latent quality of interest. Normalizing flows (or bijections) are related to this forward/inverse paradigm, where both the forward and inverse model are deterministic! This allows putting both in the SurVAE framework, which also adds models that are differentiable and surjective (such as abs, slicing, max, sort), ie deterministic in one direction and stochastic in the other. I don’t fully understand this framework, but it looks powerful and general.

I liked the next keynote by Eric Nalisnick, which aims to detect distribution shift: Given a trained model and some set of (test) data, detect if the test data came from a different distribution than the training distribution (ie, is out-of-distribution (OOD)). They experimented with some common models and datasets, with surprising results: In many common models, the test sets of datasets other than the one trained on are assigned higher likelihoods!

The rest of the talk focused on how to detect OOD data. One method is model selection: Compared to some model for OOD data (eg uniform), is the data more probable under our model or the OOD model? A better way is to detect any departure from our model: Do not just consider this OOD model! He introduced a typicality test which excludes the mode.

The intuition behind excluding the mode is that in high dimensions, the volume concentrates away from the mode. This is well known as the “soap bubble” effect in Gaussian distributions. The typicality test is much better at distinguishing OOD datasets than competitive approaches for image data.
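The “soap bubble” effect is easy to see numerically. The sketch below only illustrates the concentration of a standard Gaussian and is not the typicality test from the talk. The function name `gaussian_norms` and the chosen dimensions are assumptions.

```python
import numpy as np

def gaussian_norms(dim, n_samples, seed=0):
    """Euclidean norms of samples from a standard Gaussian in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))
    return np.linalg.norm(x, axis=1)

low_dim = gaussian_norms(dim=1, n_samples=2000)
high_dim = gaussian_norms(dim=1000, n_samples=2000)

# In 1D many samples sit close to the mode at 0. In 1000D the norms cluster
# tightly around sqrt(1000), so no sample comes anywhere near the mode.
print(low_dim.min(), high_dim.min(), high_dim.mean(), np.sqrt(1000))
```

The mode still has the highest density, but the thin shell around radius $\sqrt{d}$ has so much more volume that practically all samples land there. A point at the mode is therefore atypical, even though its likelihood is maximal.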

Emilien Dupont had another excellent and intuitive talk about the representational limits of neural ordinary differential equations (NODEs). The ‘flow’ of a NODE is a homeomorphism, which is an equivalence between topological spaces: It preserves the topology of the input space. One result is that the trajectories of ODEs cannot intersect, and thus a NODE cannot represent simple functions like $f(x)=-x$.

He claims this also holds for normalizing flows (“all models that are continuous and invertible”), but I’m missing why: $f(x)=-x$ is a continuous invertible function, so why would such a model not be able to represent it? For dimensions larger than 1, NODEs learn to approximate such mappings.

However, such NODEs become numerically ill-posed. Another solution he introduced, called Augmented NODEs, is to increase the dimensionality of the input space so the function has more room to ‘wiggle’.

With double the number of dimensions, it was recently proved that this can represent any mapping from $\mathbb{R}^d$ to $\mathbb{R}^d$, at the cost of making exact inference intractable.

A cool talk presented by Hyunjik Kim shows that dot-product self-attention is not Lipschitz. By replacing the dot product by negative squared L2 distance and by tying the query and key weights, we can have a variant of self-attention with an upper bound on the Lipschitz constant, which also allows for invertible self-attention! The performance of this form of attention is lower, but we can compensate for this by stacking more layers.

Cheng Zhang presented work on different divergence measures for variational inference. Usually, the KL-divergence between the variational posterior and the true posterior is used. However, this underestimates uncertainty. Other measures are more flexible, but can be harder to infer. She uses meta learning to learn what family of divergence to choose.

Adji Bousso Dieng tackles mode collapse in GANs. Her model PresGAN trains a generative model using GAN training so that the likelihood is defined and entropy-regularized. This forces the generator to cover the modes of the likelihood distribution, which results in diverse image generation.

Another talk by Kyle Cranmer focuses on the importance of likelihood-based models in physical sciences. In such sciences, simulators are complex causal and generative models for data, for which exact inference or likelihood computation is typically intractable because of the large number of latent variables. Still, using samples and automatic differentiation, we can optimize summary statistics through the simulator to get the best input parameters. Another way to find these parameters is to introduce a surrogate model, which is a (neural network) simulator for a simulator! Using amortized variational inference, we can both find the parameters and get approximate likelihoods using density estimation.

### Bridge Between Perception and Reasoning: Graph Neural Networks & Beyond

Last but certainly not least is this workshop that seemed to be about GNNs but was mostly about neural-symbolic AI in general, with a focus on using reasoning in the perceptual domain. It was a rather exploratory workshop with many interesting and novel ideas!

The workshop started with a keynote by Yoshua Bengio, who provided a somewhat cognition-focused talk. He argues that only some part of knowledge can be verbalized and used in System 2 reasoning. Bengio tries to provide inductive biases inspired by system 2 for machine learning. One example (the “consciousness prior”) is a sparse factor graph that consists of high-level semantic variables, that are causal and consist of agents, intentions and controllable objects. Handling processing on this sparse factor graph requires attention, ie, focus on only a few elements at a time. He presented several works based on Recurrent Independent Mechanisms, which are a type of RNN that incorporate these ideas.

A very interesting paper presented in this workshop by Hongyu Ren looked at how to embed logical queries. Previous query embedding methods could not handle the logical negation operator. By embedding entities and queries using beta distributions, we find a natural expression for negation, conjunction, disjunction and existential quantification. For each operator, there is a probabilistic projection operator that maps from one fuzzy set to another.

Next, Zico Kolter presented some interesting work on structured layers, that embed differentiable modules within neural networks. These are phrased as finding a value $y$ such that $f(x, y; \theta)=0$. Using implicit differentiation, we can backpropagate through the exact solution of the equation using $\frac{\partial y}{\partial x}$. An example of such a layer is convex optimization. He argues that this idea is possible for a wide variety of tasks.
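To see how implicit differentiation works in the simplest possible case, here is a scalar sketch. By the implicit function theorem, $\frac{\partial y}{\partial x} = -\left(\frac{\partial f}{\partial y}\right)^{-1}\frac{\partial f}{\partial x}$ at a solution of $f(x, y)=0$, so the gradient does not depend on how the solver found $y$. The toy equation $y^3 + y - x = 0$ and all function names are assumptions chosen for this illustration; they do not come from the talk.

```python
def solve_y(x, iters=50):
    """Solve f(x, y) = y**3 + y - x = 0 for y with Newton's method."""
    y = 0.0
    for _ in range(iters):
        f = y**3 + y - x
        df_dy = 3 * y**2 + 1  # always >= 1, so the update is well defined
        y -= f / df_dy
    return y

def implicit_grad(x):
    """dy/dx at the solution, using only partial derivatives of f."""
    y = solve_y(x)
    df_dy = 3 * y**2 + 1
    df_dx = -1.0
    # Implicit function theorem: dy/dx = -(df/dy)^(-1) * df/dx.
    return -df_dx / df_dy

x = 2.0
y = solve_y(x)              # the solution is y = 1, since 1 + 1 - 2 = 0
grad = implicit_grad(x)     # 1 / (3 * 1**2 + 1) = 0.25
# Finite-difference check that never looks inside the solver.
eps = 1e-6
numeric = (solve_y(x + eps) - solve_y(x - eps)) / (2 * eps)
```

The backward pass never differentiates through the Newton iterations: it only needs the partial derivatives of $f$ at the solution, which is what makes layers defined by an equation practical to train.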

He applies structured layers to MAXSAT (a form of SAT solving that maximizes the number of satisfied clauses). SATNet is able to do this in a differentiable way using a semidefinite relaxation of the problem and shows a factorization method from the literature for which an implicit function solution exists.

Using this layer, it can learn to solve Sudokus of MNIST digits, which is relatively challenging for NNs.

Luc De Raedt’s keynote discussed the relation between statistical relational AI (StarAI) and Neural-Symbolic AI. He argues that Neural-Symbolic AI requires combining logic, neural networks and probability. Within StarAI, there are two lines of thought: Directed approaches have a data generation process, and correspond more to logic programs, while undirected approaches act more as a logical regularizer to modeling data, that penalizes violations of logical constraints. The directed approach is integrated with neural networks through for example NTPs, and the undirected approach through Semantic Loss, LTNs and SBR (and other <self plug> differentiable fuzzy logics </self plug>).

He argues for the use of DeepProbLog, which extends the probabilistic logic ProbLog with predicates that are interpreted using neural networks. It is very expressive, having both logic and neural networks as a special case, but has the downside of being hard to scale.