Ph.D. candidate at Mila.

I am a graduate student at the Montreal Institute for Learning Algorithms (Mila), advised by Aaron Courville. I work on deep latent variable models and efficient algorithms for approximate inference. I focus primarily on improving the expressiveness of deep probabilistic models, the optimization of approximate inference, and understanding the training dynamics of generative models more generally.

- (20211214) We are organizing a new NeurIPS '21 workshop on programming languages, machine learning, probabilistic programming, and program synthesis.
- (20211213) I am giving an invited keynote talk at the NeurIPS '21 Optimal Transport and Machine Learning workshop on OT & Probability Flows (including normalizing flows and diffusion models).
- (20210608) New theoretical development on the marginal likelihood of diffusion-based models. (update 09/28) To be presented as a spotlight talk at NeurIPS 2021!
- (20210422) The 3rd iteration of INNF+, the workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, is coming back this year at ICML 2021.
- (20210115) Our paper Convex Potential Flows (CP-Flow) is accepted to ICLR 2021!
- (20201017) I am named a Google PhD fellow in Machine Learning.
- (20200613) New paper AR-DAE, a black-box entropy gradient estimator, accepted to ICML 2020.

## A Variational Perspective on Diffusion-Based Generative Models and Score Matching

[arXiv] [GitHub]

Discrete-time diffusion-based generative models and score matching methods have shown promising results in modeling high-dimensional image data. Recently, Song et al. (2021) showed that diffusion processes that transform data into noise can be reversed by learning the score function, i.e. the gradient of the log-density of the perturbed data. They propose to plug the learned score function into an inverse formula to define a generative diffusion process. Despite the empirical success, a theoretical underpinning of this procedure is still lacking. In this work, we approach the (continuous-time) generative diffusion directly and derive a variational framework for likelihood estimation, which includes continuous-time normalizing flows as a special case, and can be seen as an infinitely deep variational autoencoder. Under this framework, we show that minimizing the score-matching loss is equivalent to maximizing a lower bound on the likelihood of the plug-in reverse SDE proposed by Song et al. (2021), bridging the theoretical gap.

Chin-Wei Huang, Jae Hyun Lim, Aaron Courville. Neural Information Processing Systems, 2021 (spotlight).
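The score-matching loss in question is the denoising form. As a minimal NumPy sketch (a toy 1-D Gaussian of my own choosing, not the paper's setup), the denoising score-matching objective is minimized by the score of the perturbed marginal, not of the clean data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(size=200_000)        # clean data ~ N(0, 1)
eps = rng.normal(size=x.shape)
x_tilde = x + sigma * eps           # perturbed data ~ N(0, 1 + sigma^2)

def dsm_loss(score_fn):
    # denoising score matching: E || s(x_tilde) - grad log q(x_tilde | x) ||^2,
    # where grad log q(x_tilde | x) = -(x_tilde - x) / sigma^2 = -eps / sigma
    return np.mean((score_fn(x_tilde) + eps / sigma) ** 2)

true_score = lambda z: -z / (1 + sigma**2)   # score of the perturbed marginal
wrong_score = lambda z: -z                   # score of the clean N(0, 1) instead

loss_true, loss_wrong = dsm_loss(true_score), dsm_loss(wrong_score)
```

With enough samples, `loss_true` comes out strictly below `loss_wrong`, which is the sense in which the learned score targets the perturbed data's log-density gradient.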

## Bijective-Contrastive Estimation

[paper] [video]

In this work, we propose Bijective-Contrastive Estimation (BCE), a classification-based learning criterion for energy-based models. We generate a collection of contrasting distributions using bijections, and solve all the classification problems between the original data distribution and the distributions induced by the bijections using a classifier parameterized by an energy model. We show that if the classification objective is minimized, the energy function will uniquely recover the data density up to a normalizing constant. This has the benefit of not having to explicitly specify a contrasting distribution, unlike noise contrastive estimation. Experimentally, we demonstrate that the proposed method works well on 2D synthetic datasets. We discuss the difficulty in high-dimensional cases, and propose potential directions to explore in future work.

Jae Hyun Lim*, Chin-Wei Huang*, Aaron Courville, Chris Pal. Symposium on Advances in Approximate Bayesian Inference, 2020 (contributed talk).
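The classification-based criterion rests on the density-ratio trick: the Bayes-optimal logit for discriminating two distributions equals their log-density ratio. A toy NumPy sketch (my own example with two Gaussians, not the paper's bijection construction), where the optimal logit is log p(x) − log q(x) = 0.5 − x:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
xp = rng.normal(0.0, 1.0, n)        # samples from p = N(0, 1), label 1
xq = rng.normal(1.0, 1.0, n)        # samples from q = N(1, 1), label 0
x = np.concatenate([xp, xq])
y = np.concatenate([np.ones(n), np.zeros(n)])

w, b = 0.0, 0.0                     # logistic classifier: logit = w * x + b
for _ in range(2000):               # full-batch gradient descent on logistic loss
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    g = p - y                       # gradient of the cross-entropy wrt the logit
    w -= 0.5 * np.mean(g * x)
    b -= 0.5 * np.mean(g)

# Bayes-optimal logit: log N(x; 0, 1) - log N(x; 1, 1) = 0.5 - x,
# so the fitted (w, b) should approach (-1, 0.5)
```

The fitted classifier recovers the log-density ratio; in BCE the classifier is parameterized by an energy model, so minimizing the classification loss pins down the energy up to a normalizing constant.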

## Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization

[arXiv] [OpenReview]

Flow-based models are powerful tools for designing probabilistic models with tractable density. This paper introduces Convex Potential Flows (CP-Flow), a natural and efficient parameterization of invertible models inspired by the optimal transport (OT) theory. CP-Flows are the gradient map of a strongly convex neural potential function. The convexity implies invertibility and allows us to resort to convex optimization to solve the convex conjugate for efficient inversion. To enable maximum likelihood training, we derive a new gradient estimator of the log-determinant of the Jacobian, which involves solving an inverse-Hessian vector product using the conjugate gradient method. The gradient estimator has constant-memory cost, and can be made effectively unbiased by reducing the error tolerance level of the convex optimization routine. Theoretically, we prove that CP-Flows are universal density approximators and are optimal in the OT sense. Our empirical results show that CP-Flow performs competitively on standard benchmarks of density estimation and variational inference.

Chin-Wei Huang, Ricky TQ Chen, Christos Tsirigotis, Aaron Courville. International Conference on Learning Representations, 2021.
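For intuition on why convexity buys invertibility, here is a minimal NumPy sketch (my own toy, not the paper's ICNN parameterization): take the quadratic potential φ(x) = ½xᵀAx with A symmetric positive definite, so the flow is the gradient map T(x) = Ax, and inversion reduces to the SPD linear solve A z = y, handled by conjugate gradient:

```python
import numpy as np

def cg(A_mv, b, tol=1e-10, max_iter=100):
    # plain conjugate gradient for an SPD system A z = b, given only a matvec
    z = np.zeros_like(b)
    r = b - A_mv(z)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A_mv(p)
        alpha = rs / (p @ Ap)
        z += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)        # SPD, i.e. the potential is strongly convex
x = rng.normal(size=5)
y = A @ x                          # forward pass through the gradient-map flow
x_rec = cg(lambda v: A @ v, y)     # invert via conjugate gradient
```

In CP-Flow the potential is a convex neural network rather than a fixed quadratic, but the same matrix-free CG machinery drives both the inversion and the inverse-Hessian vector products in the log-determinant gradient estimator.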

## RealCause: Realistic Causal Inference Benchmarking

[arXiv]

There are many different causal effect estimators in causal inference. However, it is unclear how to choose between these estimators because there is no ground-truth for causal effects. A commonly used option is to simulate synthetic data, where the ground-truth is known. However, the best causal estimators on synthetic data are unlikely to be the best causal estimators on realistic data. An ideal benchmark for causal estimators would both (a) yield ground-truth values of the causal effects and (b) be representative of real data. Using flexible generative models, we provide a benchmark that both yields ground-truth and is realistic. Using this benchmark, we evaluate 66 different causal estimators.

Brady Neal, Chin-Wei Huang, Sunand Raghupathi. Preprint, 2020.

## A benchmark of medical out of distribution detection

[arXiv]

There is a rise in the use of deep learning for automated medical diagnosis, most notably in medical imaging. Such an automated system uses a set of images from a patient to diagnose whether they have a disease. However, systems trained for one particular domain of images cannot be expected to perform accurately on images of a different domain. These images should be filtered out by an Out-of-Distribution Detection (OoDD) method prior to diagnosis. This paper benchmarks popular OoDD methods in three domains of medical imaging: chest x-rays, fundus images, and histology slides. Our experiments show that despite methods yielding good results on some types of out-of-distribution samples, they fail to recognize images close to the training distribution.

Tianshi Cao, Chin-Wei Huang, David Yu-Tung Hui, Joseph Paul Cohen. Preprint, 2020.

## AR-DAE: Towards Unbiased Neural Entropy Gradient Estimation

[arXiv]

Entropy is ubiquitous in machine learning, but it is in general intractable to compute the entropy of the distribution of an arbitrary continuous random variable. In this paper, we propose the amortized residual denoising autoencoder (AR-DAE) to approximate the gradient of the log density function, which can be used to estimate the gradient of entropy. Amortization allows us to significantly reduce the error of the gradient approximator by approaching asymptotic optimality of a regular DAE, in which case the estimation is in theory unbiased.

Jae Hyun Lim, Aaron Courville, Chris Pal, Chin-Wei Huang. International Conference on Machine Learning, 2020.
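The underlying connection (Alain & Bengio, 2014) is that a denoising autoencoder's scaled residual approximates the score. A NumPy sketch for standard-Gaussian data, where the optimal DAE is available in closed form as the posterior mean (my own toy, not the amortized estimator itself):

```python
import numpy as np

def dae_score(x_tilde, sigma):
    # For data ~ N(0, 1) corrupted with N(0, sigma^2) noise, the optimal DAE is
    # the posterior mean r*(x~) = x~ / (1 + sigma^2); its scaled residual
    # (r*(x~) - x~) / sigma^2 approximates the score of the data density
    r = x_tilde / (1 + sigma**2)
    return (r - x_tilde) / sigma**2       # = -x_tilde / (1 + sigma^2)

x = np.linspace(-3, 3, 101)
true_score = -x                           # score of N(0, 1)
err_big = np.max(np.abs(dae_score(x, 1.0) - true_score))
err_small = np.max(np.abs(dae_score(x, 0.1) - true_score))
```

The approximation error vanishes as the noise level shrinks; AR-DAE approaches this asymptotically optimal regime by amortizing over the noise scale, so the residual can serve as a (near-)unbiased plug-in for the entropy gradient.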

## Augmented Normalizing Flows: Bridging the Gap Between Generative Flows and Latent Variable Models

[arXiv]

In this work, we propose a new family of generative flows on an augmented data space, with an aim to improve expressivity without drastically increasing the computational cost of sampling and evaluation of a lower bound on the likelihood. Theoretically, we prove the proposed flow can approximate a Hamiltonian ODE as a universal transport map. Empirically, we demonstrate state-of-the-art performance on standard benchmarks of flow-based generative modeling.

Chin-Wei Huang, Laurent Dinh, Aaron Courville. Presented at the ICLR DeepDiffEq workshop (contributed talk). See the workshop paper (under the title Solving ODE with universal flows: Approximation theory for flow-based models) for additional theoretical development.
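The core construction couples the data x with auxiliary variables v through alternating additive updates, each trivially invertible by subtraction (and volume-preserving in this simplified form). A self-contained NumPy sketch, with random tanh maps standing in for the learned shift networks (all names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
f = lambda x: np.tanh(x @ W1)      # stand-in for a learned shift network
g = lambda v: np.tanh(v @ W2)

def forward(x, v):
    v = v + f(x)                   # update the augmented variables given x
    x = x + g(v)                   # update x given the new augmented variables
    return x, v                    # each additive step has unit Jacobian determinant

def inverse(x, v):
    x = x - g(v)                   # undo the steps in reverse order
    v = v - f(x)
    return x, v

x0, v0 = rng.normal(size=d), rng.normal(size=d)
x1, v1 = forward(x0, v0)
xr, vr = inverse(x1, v1)           # exact round trip
```

Stacking such blocks and marginalizing out v yields the augmented-space lower bound on the likelihood; the Hamiltonian-ODE approximation result concerns the limit of many such alternating updates.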

## Stochastic Neural Network with Kronecker Flow

[arXiv]

Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and better than the baselines.

Chin-Wei Huang, Ahmed Touati, Pascal Vincent, Gintare Karolina Dziugaite, Alexandre Lacoste, Aaron Courville. International Conference on Artificial Intelligence and Statistics, 2020.
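Why Kronecker structure scales: for a linear flow with weight W = A ⊗ B, where A is n×n and B is m×m, the identity det(A⊗B) = det(A)^m det(B)^n makes the log-determinant cost depend on the factors rather than the full nm×nm matrix. A quick NumPy check of the identity (my own illustration, not the paper's invertible-map generalization):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(m, m))

W = np.kron(A, B)                            # (n*m) x (n*m) structured weight
logdet_full = np.linalg.slogdet(W)[1]        # O((n*m)^3) the naive way
# identity: log|det(A (x) B)| = m * log|det A| + n * log|det B|
logdet_kron = m * np.linalg.slogdet(A)[1] + n * np.linalg.slogdet(B)[1]
```

The two quantities agree to numerical precision, which is the kind of structure the Kronecker Flow exploits when parameterizing the noise over weight matrices of stochastic neural networks.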

## vGraph: A Generative Model for Joint Community Detection and Node Representation Learning

[paper]

This paper focuses on two fundamental tasks of graph analysis: community detection and node representation learning, which capture the global and local structures of graphs, respectively. In the current literature, these two tasks are usually independently studied while they are actually highly correlated. We propose a probabilistic generative model called vGraph to learn community membership and node representation collaboratively. Specifically, we assume that each node can be represented as a mixture of communities, and each community is defined as a multinomial distribution over nodes. Both the mixing coefficients and the community distribution are parameterized by the low-dimensional representations of the nodes and communities. We designed an effective variational inference algorithm which regularizes the community membership of neighboring nodes to be similar in the latent space. ...

Fan-Yun Sun, Meng Qu, Jordan Hoffmann, Chin-Wei Huang, Jian Tang. Neural Information Processing Systems, 2019.

## Probability Distillation: A Caveat and Alternatives

[paper]

Following van den Oord et al. (2018), probability distillation has recently been of interest to deep learning practitioners, where, as a practical workaround for deploying autoregressive models in real-time applications, a student network is used to obtain quality samples in parallel. We identify a pathological optimization issue with the adopted stochastic minimization of the reverse-KL divergence: the curse of dimensionality results in a skewed gradient distribution that renders training inefficient. This means that KL-based “evaluative” training can be susceptible to poor exploration if the target distribution is highly structured. We then explore alternative principles for distillation, including one with an “instructive” signal, and show that it is possible to achieve qualitatively better results than with KL minimization.

Chin-Wei Huang*, Faruk Ahmed*, Kundan Kumar, Alexandre Lacoste, Aaron Courville. Conference on Uncertainty in Artificial Intelligence, 2019.

## Hierarchical Importance Weighted Autoencoders

[paper]

Importance weighted variational inference (Burda et al., 2015) uses multiple i.i.d. samples to obtain a tighter variational lower bound. We believe a joint proposal has the potential of reducing the number of redundant samples, and introduce a hierarchical structure to induce correlation. The hope is that the proposals would coordinate to make up for the error made by one another to reduce the variance of the importance estimator. Theoretically, we analyze the condition under which convergence of the estimator variance can be connected to convergence of the lower bound. Empirically, we confirm that maximization of the lower bound does implicitly minimize variance. Further analysis shows that this is a result of negative correlation induced by the proposed hierarchical meta sampling scheme, and performance of inference also improves when the number of samples increases.

Chin-Wei Huang, Kris Sankaran, Eeshan Dhekane, Alexandre Lacoste, Aaron Courville. International Conference on Machine Learning, 2019.
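The starting point, the importance weighted bound, is easy to verify on a toy conjugate model where log p(x) is known in closed form. A NumPy sketch (my own example: z ~ N(0,1), x|z ~ N(z,1), with the prior as the proposal) showing the bound tightening with the number of i.i.d. samples K:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.5                                  # observed; marginal p(x) is N(0, 2)
log_px = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4

def iwae_bound(K, reps=5000):
    # IWAE bound: E log (1/K sum_k w_k); with q = prior, w_k = p(x | z_k)
    z = rng.normal(size=(reps, K))
    logw = -0.5 * np.log(2 * np.pi) - (x - z) ** 2 / 2
    return np.mean(np.log(np.mean(np.exp(logw), axis=1)))

elbo = iwae_bound(1)        # standard ELBO (K = 1)
iwae = iwae_bound(50)       # tighter for larger K (Burda et al., 2015)
```

The hierarchical scheme in the paper replaces the i.i.d. proposals with correlated ones drawn through a shared latent, aiming to get the same variance reduction with fewer samples.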

## Improving Explorability in Variational Inference with Annealed Variational Objectives

[paper]

Despite the advances in the representational capacity of approximate distributions for variational inference, the optimization process can still limit the density that is ultimately learned. We demonstrate the drawbacks of biasing the true posterior to be unimodal, and introduce Annealed Variational Objectives (AVO) into the training of hierarchical variational methods. Inspired by Annealed Importance Sampling, the proposed method facilitates learning by incorporating energy tempering into the optimization objective. In our experiments, we demonstrate our method's robustness to deterministic warm up, and the benefits of encouraging exploration in the latent space.

Chin-Wei Huang, Shawn Tan, Alexandre Lacoste, Aaron Courville. Neural Information Processing Systems, 2018.
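"Energy tempering" refers to the geometric bridge used by Annealed Importance Sampling: a sequence of unnormalized densities f_β ∝ p0^(1−β) · p1^β interpolating from an easy initial distribution to the target. A small NumPy illustration (my own 1-D example with a bimodal target, not the paper's hierarchical setup):

```python
import numpy as np

def log_gauss(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu) ** 2 / (2 * s**2)

# broad unimodal initial density, and a bimodal target the annealing path reaches
log_p0 = lambda x: log_gauss(x, 0.0, 3.0)
log_p1 = lambda x: np.logaddexp(log_gauss(x, -2.0, 0.5),
                                log_gauss(x, 2.0, 0.5)) - np.log(2)

def log_annealed(x, beta):
    # geometric bridge: f_beta is proportional to p0^(1 - beta) * p1^beta
    return (1 - beta) * log_p0(x) + beta * log_p1(x)

x = np.linspace(-4, 4, 201)
# beta = 0 recovers the initial density, beta = 1 the target; intermediate
# betas give gradually sharpening bimodal energies for the sampler to track
```

AVO folds this tempering path into the training objective of each layer of a hierarchical variational approximation, rather than using it only at sampling time.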

## Neural Autoregressive Flows

[paper]

Normalizing flows and autoregressive models have been successfully combined to produce state-of-the-art results in density estimation, via Masked Autoregressive Flows (MAF), and to accelerate state-of-the-art WaveNet-based speech synthesis to 20x faster than real-time, via Inverse Autoregressive Flows (IAF). We unify and generalize these approaches, replacing the (conditionally) affine univariate transformations of MAF/IAF with a more general class of invertible univariate transformations expressed as monotonic neural networks. We demonstrate that the proposed neural autoregressive flows (NAF) are universal approximators for continuous probability distributions, and their greater expressivity allows them to better capture multimodal target distributions. Experimentally, NAF yields state-of-the-art performance on a suite of density estimation tasks and outperforms IAF in variational autoencoders trained on binarized MNIST.

Chin-Wei Huang*, David Krueger*, Alexandre Lacoste, Aaron Courville. International Conference on Machine Learning, 2018 (long talk).
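The key primitive is a strictly monotonic univariate network: positive pre-activation slopes and positive output weights make the map increasing, hence invertible even without a closed-form inverse. A NumPy sketch of the idea (a simplified stand-in, not the paper's exact DSF/DDSF parameterization), with inversion by bisection:

```python
import numpy as np

rng = np.random.default_rng(0)
# positive slopes and output weights guarantee strict monotonicity
a = np.abs(rng.normal(size=8)) + 0.1
b = rng.normal(size=8)
w = np.abs(rng.normal(size=8)) + 0.1

def t(x):
    # monotonic univariate "network"; the small linear term keeps it
    # strictly increasing and unbounded, so the inverse exists everywhere
    return np.sum(w * np.tanh(a * x + b)) + 0.05 * x

def t_inv(y, lo=-100.0, hi=100.0, iters=200):
    # bisection inverts any continuous strictly increasing map
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if t(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

x0 = 1.234
x_rec = t_inv(t(x0))       # recovers x0 to machine precision
```

In NAF the parameters of each such univariate transformer are produced by an autoregressive conditioner network, one dimension at a time, which is what yields the universal-approximation property.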

## Neural Language Modeling by Jointly Learning Syntax and Lexicon

[arXiv] [OpenReview]

We propose a neural language model capable of unsupervised syntactic structure induction. The model leverages the structure information to form better semantic representations and better language modeling. Standard recurrent neural networks are limited by their structure and fail to efficiently use syntactic information. On the other hand, tree-structured recursive networks usually require additional structural supervision at the cost of human expert annotation. In this paper, we propose a novel neural language model, called the Parsing-Reading-Predict Networks (PRPN), that can simultaneously induce the syntactic structure from unannotated sentences and leverage the inferred structure to learn a better language model. In our model, the gradient can be directly back-propagated from the language model loss into the neural parsing network. Experiments show that the proposed model can discover the underlying syntactic structure and achieve state-of-the-art performance on word/character-level language model tasks.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, Aaron Courville. International Conference on Learning Representations, 2018.