A full publication list can be found on Google Scholar.
^{*} indicates equal contribution.

- Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning [Paper]

Hao-Lun Hsu, Weixin Wang, Miroslav Pajic, **Pan Xu**, arXiv: 2404.10728.

- Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement Learning [Paper]

Zhishuai Liu, **Pan Xu**, arXiv: 2403.09621.

- Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization [Paper]

Zhen Qin*, Zhishuai Liu*, **Pan Xu**, arXiv: 2310.15976.

- Optimal Batched Best Arm Identification [Paper]

Tianyuan Jin, Yu Yang, Jing Tang, Xiaokui Xiao, **Pan Xu**, arXiv: 2310.14129.

- PhyGCN: Pre-trained Hypergraph Convolutional Neural Networks with Self-supervised Learning [Paper]

Yihe Deng, Ruochi Zhang, **Pan Xu**, Jian Ma, Quanquan Gu, bioRxiv: 2023.10.01.560404.

- Optimal Batched Linear Bandits

Xuanfei Ren, Tianyuan Jin, **Pan Xu**

In Proc. of the 41st International Conference on Machine Learning (**ICML**), Vienna, Austria, 2024.

[Summary] [Paper] [Code] We introduce the E$^4$ algorithm for the batched linear bandit problem, incorporating an Explore-Estimate-Eliminate-Exploit framework. With a proper choice of exploration rate, we prove E$^4$ achieves the finite-time minimax optimal regret with only $O(\log\log T)$ batches, and the asymptotically optimal regret with only $3$ batches as $T\rightarrow\infty$, where $T$ is the time horizon. We further prove a lower bound on the batch complexity of linear contextual bandits showing that any asymptotically optimal algorithm must require at least $3$ batches in expectation as $T\rightarrow\infty$, which indicates E$^4$ achieves the asymptotic optimality in regret and batch complexity simultaneously. To the best of our knowledge, E$^4$ is the first algorithm for linear bandits that simultaneously achieves the minimax and asymptotic optimality in regret with the corresponding optimal batch complexities.

- Distributionally Robust Off-Dynamics Reinforcement Learning: Provable Efficiency with Linear Function Approximation

Zhishuai Liu, **Pan Xu**

In Proc. of the 27th International Conference on Artificial Intelligence and Statistics (**AISTATS**), Valencia, Spain, 2024.

[Summary] [Paper] [Code] We study off-dynamics Reinforcement Learning (RL), where the policy is trained on a source domain and deployed to a distinct target domain. We cast this problem as an online distributionally robust Markov decision process (DRMDP), where the policy is learned by actively interacting with the source domain while seeking the optimal performance under the worst possible dynamics within an uncertainty set of the source domain's transition kernel. We introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation, and establish a polynomial suboptimality bound that is independent of the state and action space sizes. Our work takes a first step towards a deeper understanding of the provable efficiency of online DRMDPs with linear function approximation.

- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Haque Ishfaq, Qingfeng Lan, **Pan Xu**, A. Rupam Mahmood, Doina Precup, Anima Anandkumar, Kamyar Azizzadenesheli

In Proc. of the 12th International Conference on Learning Representations (**ICLR**), Vienna, Austria, 2024.

[Summary] [Paper] [Code] We present a scalable and effective exploration strategy using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it achieves the state-of-the-art regret bound $\tilde{O}(d^{3/2}H^{3/2}\sqrt{T})$ for randomized methods, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.

- Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs

Tianyuan Jin, Hao-Lun Hsu, William Chang, **Pan Xu**

In Proc. of the 38th Annual AAAI Conference on Artificial Intelligence (**AAAI**), Vancouver, Canada, 2024. [Oral Presentation, 2.3%]

[Summary] [Paper] [Code] We study the multi-agent multi-armed bandit (MAMAB) problem. We propose the $\epsilon$-exploring Multi-Agent Thompson Sampling ($\epsilon$-MATS) algorithm, which performs TS exploration with probability $\epsilon$ and adopts a greedy policy otherwise. We prove that $\epsilon$-MATS achieves a worst-case frequentist regret bound that is sublinear in both the time horizon and the local arm size. We also derive a lower bound for this setting, which implies our frequentist regret upper bound is optimal up to constant and logarithmic factors. We conduct thorough experiments on standard MAMAB problems including 0101-Chains, Gem Mining, and Wind Farm Control, which demonstrate the superior performance and the improved computational efficiency of $\epsilon$-MATS compared with existing algorithms in the same setting.

- Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

Yi Shen, **Pan Xu**, Michael M. Zavlanos

Transactions on Machine Learning Research (**TMLR**), 2024. [Featured Certification]

[Summary] [Paper] We propose a novel DRO approach that employs the Wasserstein distance to encompass distributions with varying support and takes into consideration the geometry of the distribution support. We present a regularized (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite sample complexity and iteration complexity of our proposed method. We validate the performance of our approach on the International Stroke Trial (IST) dataset.

- Thompson Sampling with Less Exploration is Fast and Optimal

Tianyuan Jin, Xianglin Yang, Xiaokui Xiao, **Pan Xu**

In Proc. of the 40th International Conference on Machine Learning (**ICML**), Honolulu, Hawaii, USA, 2023.

[Summary] [Paper] We propose $\epsilon$-Exploring Thompson Sampling ($\epsilon$-TS), a modified version of the Thompson Sampling (TS) algorithm for multi-armed bandits. In $\epsilon$-TS, arms are selected greedily based on empirical mean rewards with probability $1-\epsilon$, and based on posterior samples obtained from TS with probability $\epsilon$. Here, $\epsilon\in(0,1)$ is a user-defined constant. By reducing exploration, $\epsilon$-TS improves computational efficiency compared to TS while achieving better regret bounds. We establish that $\epsilon$-TS is both minimax optimal and asymptotically optimal for various popular reward distributions, including Gaussian, Bernoulli, Poisson, and Gamma. Our algorithm is as easy to implement as TS, but operates significantly faster due to reduced exploration. Empirical evaluations confirm the efficiency and optimality of $\epsilon$-TS.
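
The selection rule is simple enough to sketch in a few lines. The toy implementation below, for unit-variance Gaussian rewards, illustrates the mechanism only; the posterior form and constants are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def epsilon_ts(means, T=5000, eps=0.1, seed=0):
    """Toy epsilon-TS for a K-armed bandit with N(mean, 1) rewards
    (illustrative sketch, not the paper's exact algorithm).
    With probability 1 - eps, play the arm with the best empirical mean;
    with probability eps, draw one posterior sample per arm and play
    the argmax (the usual Thompson sampling step)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.ones(K)  # pull each arm once to initialize
    sums = np.array([rng.normal(m, 1.0) for m in means])
    for _ in range(T - K):
        mu_hat = sums / counts
        if rng.random() < eps:
            # TS step: sample from an N(mu_hat, 1/counts) posterior per arm
            theta = rng.normal(mu_hat, 1.0 / np.sqrt(counts))
            arm = int(np.argmax(theta))
        else:
            # greedy step: exploit the empirical means
            arm = int(np.argmax(mu_hat))
        sums[arm] += rng.normal(means[arm], 1.0)
        counts[arm] += 1
    return counts

counts = epsilon_ts([0.0, 0.3, 1.0])
```

On this instance the best arm should collect the bulk of the pulls, while the TS steps, taken only an $\epsilon$ fraction of the time, keep the algorithm from locking onto a suboptimal arm after an unlucky start.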

- Distributionally Robust Policy Gradient for Offline Contextual Bandits

Zhouhao Yang*, Yihong Guo*, **Pan Xu**, Anqi Liu, Anima Anandkumar

In Proc. of the 26th International Conference on Artificial Intelligence and Statistics (**AISTATS**), Valencia, Spain, 2023.

[Summary] [Paper] Learning an optimal policy from offline data is notoriously challenging, as it requires evaluating the learning policy using data pre-collected from a static logging policy. We study the policy optimization problem in offline contextual bandits using policy gradient methods. We employ a distributionally robust policy gradient method, DROPO, to account for the distributional shift between the static logging policy and the learning policy in policy gradient. Our approach conservatively estimates the conditional reward distribution and updates the policy accordingly.

- Queer In AI: A Case Study in Community-Led Participatory AI

Organizers of QueerInAI, ..., **Pan Xu**, ..., Luke Stark. [Show all authors]

In Proc. of the ACM Conference on Fairness, Accountability, and Transparency (**ACM FAccT**), Chicago, IL, USA, 2023. [Best Paper Award]

[Summary] [Paper] We present Queer in AI as a case study for community-led participatory design in AI. We examine how participatory design and intersectional tenets started and shaped this community's programs over the years. We discuss different challenges that emerged in the process, look at ways this organization has fallen short of operationalizing participatory and intersectional principles, and then assess the organization's impact. Queer in AI provides important lessons and insights for practitioners and theorists of participatory methods broadly through its rejection of hierarchy in favor of decentralization, success at building aid and programs by and for the queer community, and effort to change actors and institutions outside of the queer community. Finally, we theorize how communities like Queer in AI contribute to the participatory design in AI more broadly by fostering cultures of participation in AI, welcoming and empowering marginalized participants, critiquing poor or exploitative participatory practices, and bringing participation to institutions outside of individual research projects. Queer in AI's work serves as a case study of grassroots activism and participatory methods within AI, demonstrating the potential of community-led participatory methods and intersectional praxis, while also providing challenges, case studies, and nuanced insights to researchers developing and using participatory methods.

- Multiple Models for Outbreak Decision Support in the Face of Uncertainty

Katriona Shea, ..., **Pan Xu**, ..., Michael C. Runge. [Show all authors]

Proceedings of the National Academy of Sciences (**PNAS**), Volume 120, No. 18, 2023.

[Summary] [Paper] During infectious disease outbreaks, uncertainty hinders our ability to forecast dynamics and to make critical decisions about management. Motivated by the COVID-19 pandemic, leading agencies have initiated efforts to prepare for future outbreaks, for example, the US Centers for Disease Control and Prevention's National Center for Forecasting and Outbreak Analytics and the WHO's Hub for Pandemic and Epidemic Intelligence were recently inaugurated. Critically, such efforts need to inform policy as well as provide insight into expected disease dynamics. We present a validated case study from early in the pandemic, drawing on recommendations to minimize cognitive biases and incorporate decision theory, to illustrate how a policy-focused process could work for urgent, important, time-sensitive outbreak decision making in the face of uncertainty.

- Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning

Yizhou Zhang*, Guannan Qu*, **Pan Xu***, Yiheng Lin, Zaiwei Chen, Adam Wierman

In Proc. of the ACM on Measurement and Analysis of Computing Systems (**POMACS**). Presented at the **ACM SIGMETRICS** conference, Orlando, Florida, USA, 2023.

[Summary] [Paper] We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network. The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards. To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information. In particular, we show that, despite restricting each agent's attention to only its $\kappa$-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in $\kappa$. In addition, we show the finite-sample convergence of LPI to the global optimal policy, which explicitly captures the trade-off between optimality and computational complexity in choosing $\kappa$. Numerical simulations demonstrate the effectiveness of LPI.

- Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits

Tianyuan Jin, **Pan Xu**, Xiaokui Xiao, Anima Anandkumar

In Proc. of the 36th Conference on Advances in Neural Information Processing Systems (**NeurIPS**), New Orleans, Louisiana, USA, 2022.

[Summary] [Paper] We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the finite-time regret bound as well as the asymptotic regret bound.

- Active Ranking without Strong Stochastic Transitivity

Hao Lou, Tao Jin, Yue Wu, **Pan Xu**, Farzad Farnoud, Quanquan Gu

In Proc. of the 36th Conference on Advances in Neural Information Processing Systems (**NeurIPS**), New Orleans, Louisiana, USA, 2022.

[Summary] [Paper] We consider the problem of recovering the exact full ranking for a list of items under ranking models that do NOT assume the Strong Stochastic Transitivity property. We propose a $\delta$-correct algorithm, Probe-Rank, that actively learns the ranking of the items from noisy pairwise comparisons. We prove a sample complexity upper bound for Probe-Rank, which only depends on the preference probabilities between items that are adjacent in the true ranking. This improves upon existing sample complexity results that depend on the preference probabilities for all pairs of items.

- Langevin Monte Carlo for Contextual Bandits

**Pan Xu**, Hongkai Zheng, Eric Mazumdar, Kamyar Azizzadenesheli, Anima Anandkumar

In Proc. of the 39th International Conference on Machine Learning (**ICML**), Baltimore, Maryland, USA, 2022.

[Summary] [Paper] [Code] We propose an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov Chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits. Our method is computationally efficient since it only needs to perform noisy gradient descent updates without constructing the Laplace approximation of the posterior distribution. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits, viz., linear contextual bandits. We conduct experiments on both synthetic data and real-world datasets on different contextual bandit models, which demonstrate that directly sampling from the posterior is both computationally efficient and competitive in performance.
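
As an illustration of the mechanic, here is a minimal sketch of LMC-TS-style sampling on a linear contextual bandit. The loss, step size, temperature, and round budget below are illustrative assumptions, not the tuned choices from the paper.

```python
import numpy as np

def lmc_ts_linear(T=400, d=3, K=5, n_lmc=15, step=0.05, beta=50.0, seed=0):
    """Sketch of the LMC-TS idea on a linear contextual bandit: instead of
    forming a Laplace approximation, the parameter is refreshed each round
    by a few noisy gradient (Langevin) steps on an averaged ridge loss, and
    the arm maximizing the sampled linear reward model is played."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)
    feats, rewards = [], []
    theta = np.zeros(d)
    hits = 0
    for t in range(T):
        ctx = rng.normal(size=(K, d))  # one feature vector per arm
        for _ in range(n_lmc):  # Langevin steps: loss gradient + injected noise
            if feats:
                A = np.asarray(feats)
                b = np.asarray(rewards)
                grad = 2.0 * A.T @ (A @ theta - b) / len(b) + 0.01 * theta
            else:
                grad = 0.01 * theta
            theta = theta - step * grad + np.sqrt(2.0 * step / beta) * rng.normal(size=d)
        arm = int(np.argmax(ctx @ theta))          # act on the posterior sample
        hits += arm == int(np.argmax(ctx @ theta_star))
        feats.append(ctx[arm])
        rewards.append(ctx[arm] @ theta_star + 0.1 * rng.normal())
    return hits / T

acc = lmc_ts_linear()
```

The only per-round work is a handful of noisy gradient steps, which is what makes this style of posterior sampling easy to scale to deep models.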

- Neural Contextual Bandits with Deep Representation and Shallow Exploration

**Pan Xu**, Zheng Wen, Handong Zhao, Quanquan Gu

In Proc. of the 10th International Conference on Learning Representations (**ICLR**), Online, 2022.

[Summary] [Paper] [Code] We study neural contextual bandits, a general class of contextual bandits, where each context-action pair is associated with a raw feature vector, but the specific reward generating function is unknown. We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network (deep representation learning), and uses an upper confidence bound (UCB) approach to explore in the last linear layer (shallow exploration). We prove that under standard assumptions, our proposed algorithm achieves $\widetilde{O}(\sqrt{T})$ finite-time regret, where $T$ is the learning time horizon. Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.

- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons

Yue Wu*, Tao Jin*, Hao Lou, **Pan Xu**, Farzad Farnoud, Quanquan Gu

In Proc. of the 25th International Conference on Artificial Intelligence and Statistics (**AISTATS**), Online, 2022.

[Summary] [Paper] In heterogeneous rank aggregation problems, users often exhibit various accuracy levels when comparing pairs of items. Thus, a uniform querying strategy over users may not be optimal. To address this issue, we propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons from multiple users and improves the users' average accuracy by maintaining an active set of users. We prove that our algorithm can return the true ranking of items with high probability. We also provide a sample complexity bound for the proposed algorithm, which outperforms the non-active strategies in the literature and is close to the oracle under mild conditions. Experiments are provided to show the empirical advantage of the proposed methods over the state-of-the-art baselines.

- Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US

Estee Y Cramer, ..., **Pan Xu**, ..., Nicholas G. Reich. [Show all authors]

Proceedings of the National Academy of Sciences (**PNAS**), Volume 119, No. 15, 2022.

[Summary] [Paper] This paper compares the probabilistic accuracy of short-term forecasts of reported deaths due to COVID-19 during the first year and a half of the pandemic in the United States. Results show high variation in accuracy between and within stand-alone models and more consistent accuracy from an ensemble model that combined forecasts from all eligible models. This demonstrates that an ensemble model provided a reliable and comparatively accurate means of forecasting deaths during the COVID-19 pandemic that exceeded the performance of all of the models that contributed to it. This work strengthens the evidence base for synthesizing multiple models to support public-health action.

- Double Explore-then-Commit: Asymptotic Optimality and Beyond

Tianyuan Jin, **Pan Xu**, Xiaokui Xiao, Quanquan Gu

In Proc. of the 34th Annual Conference on Learning Theory (**COLT**), Online, 2021.

This work has been presented at the NeurIPS 2020 Offline Reinforcement Learning Workshop.

[Summary] [Paper] We propose a double explore-then-commit (DETC) algorithm that has two exploration and exploitation phases and show that it can achieve the asymptotically optimal regret bound. To our knowledge, DETC is the first non-fully-sequential algorithm that achieves asymptotic optimality. In addition, we extend DETC to batched bandit problems, where (i) the exploration process is split into a small number of batches and (ii) the round complexity is of central interest. We prove that a batched version of DETC can achieve the asymptotic optimality with only a constant round complexity. This is the first batched bandit algorithm that can attain the optimal asymptotic regret bound and optimal round complexity simultaneously.

- Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling

Difan Zou, **Pan Xu**, Quanquan Gu

In Proc. of the 37th International Conference on Uncertainty in Artificial Intelligence (**UAI**), Online, 2021.

[Summary] [Paper] We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave. At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain. Under certain conditions on the target distribution, we prove that $\widetilde O(d^4\epsilon^{-2})$ stochastic gradient evaluations suffice to guarantee $\epsilon$-sampling error in terms of the total variation distance, where $d$ is the problem dimension. This improves existing results on the convergence rate of SGLD \citep{raginsky2017non,xu2018global}. We further show that provided an additional Hessian Lipschitz condition on the log-density function, SGLD is guaranteed to achieve $\epsilon$-sampling error within $\widetilde O(d^{15/4}\epsilon^{-3/2})$ stochastic gradient evaluations. Our proof technique provides a new way to study the convergence of Langevin based algorithms, and sheds some light on the design of fast stochastic gradient based sampling algorithms.
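
For intuition about the sampler being analyzed, here is a toy SGLD run on a simple (log-concave) posterior; the model, prior, and constants are illustrative assumptions and much simpler than the non-log-concave setting studied in the paper.

```python
import numpy as np

def sgld_posterior_mean(data, n_iter=20000, batch=16, step=1e-3, seed=0):
    """Toy SGLD: sample the posterior of mu for data ~ N(mu, 1) with a wide
    N(0, 10^2) prior. Each step uses a minibatch estimate of the gradient of
    the negative log posterior plus sqrt(2 * step) injected Gaussian noise."""
    rng = np.random.default_rng(seed)
    n = len(data)
    mu, samples = 0.0, []
    for t in range(n_iter):
        idx = rng.integers(0, n, size=batch)
        # stochastic gradient of -log posterior: likelihood term + prior term
        grad = n * np.mean(mu - data[idx]) + mu / 100.0
        mu = mu - step * grad + np.sqrt(2.0 * step) * rng.normal()
        if t >= n_iter // 2:  # discard burn-in
            samples.append(mu)
    return np.asarray(samples)

data = np.random.default_rng(1).normal(2.0, 1.0, size=200)
samples = sgld_posterior_mean(data)
```

The retained samples should center on the posterior mean (essentially the sample mean of the data), with spread on the order of the posterior standard deviation $1/\sqrt{n}$, slightly inflated by the minibatch gradient noise and the finite step size.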

- MOTS: Minimax Optimal Thompson Sampling

Tianyuan Jin, **Pan Xu**, Jieming Shi, Xiaokui Xiao, Quanquan Gu

In Proc. of the 38th International Conference on Machine Learning (**ICML**), Online, 2021.

[Summary] [Paper] Thompson sampling is one of the most widely used algorithms for many online decision problems, due to its simplicity in implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems, where $T$ is the total time horizon. In this paper, we solve this long-standing open problem by proposing a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the asymptotically optimal regret bound for Gaussian rewards as $T$ approaches infinity. To our knowledge, MOTS is the first Thompson sampling type algorithm that achieves minimax optimality for multi-armed bandit problems.
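
The clipping mechanic is easy to illustrate. Below is a hedged sketch for unit-variance Gaussian rewards; the posterior form, the variance inflation factor `rho`, and the clipping-threshold constants are illustrative assumptions, not the exact choices analyzed in the paper.

```python
import numpy as np

def mots_style_ts(means, T=5000, rho=0.5, seed=0):
    """Sketch of the MOTS-style clipped sampling idea: draw a Thompson
    sample theta_i ~ N(mu_i, 1 / (rho * n_i)) per arm, but clip it at a
    UCB-like threshold so no arm's sample can run far above its empirical
    mean once it has been pulled many times."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.ones(K)  # pull each arm once to initialize
    sums = np.array([rng.normal(m, 1.0) for m in means])
    for _ in range(T - K):
        mu = sums / counts
        theta = rng.normal(mu, np.sqrt(1.0 / (rho * counts)))
        # UCB-style cap: mu_i + sqrt((4 / n_i) * log+(T / (K * n_i)))
        cap = mu + np.sqrt(4.0 / counts * np.maximum(np.log(T / (K * counts)), 0.0))
        arm = int(np.argmax(np.minimum(theta, cap)))
        sums[arm] += rng.normal(means[arm], 1.0)
        counts[arm] += 1
    return counts

counts = mots_style_ts([0.0, 0.5, 1.0])
```

The cap curbs the occasional very large posterior samples that make vanilla TS over-explore suboptimal arms, which is the source of the extra $\sqrt{\log K}$ factor the paper removes.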

- Almost Optimal Anytime Algorithm for Batched Multi-Armed Bandits

Tianyuan Jin, Jing Tang, **Pan Xu**, Keke Huang, Xiaokui Xiao, Quanquan Gu

In Proc. of the 38th International Conference on Machine Learning (**ICML**), Online, 2021.

[Summary] [Paper] In batched multi-armed bandit problems, the learner can adaptively pull arms and adjust strategy in batches. In many real applications, not only the regret but also the batch complexity needs to be optimized. Existing batched bandit algorithms usually assume that the time horizon $T$ is known in advance. However, many applications involve an unpredictable stopping time. In this paper, we study the anytime batched multi-armed bandit problem. We propose an anytime algorithm that achieves the asymptotically optimal regret for exponential families of reward distributions with $\mathcal{O}(\log \log T\cdot \mathrm{ilog}^{\alpha}(T))$ batches, where $\alpha\in \mathcal{O}_{T}(1)$. Moreover, we prove that for any constant $c>0$, no algorithm can achieve the asymptotically optimal regret within $c \log \log T$ batches.

- A Finite-Time Analysis of Two Time-Scale Actor-Critic Methods

Yue Wu, Weitong Zhang, **Pan Xu**, Quanquan Gu

In Proc. of the 34th Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Online, 2020.

This work has been presented at the ICML 2020 Theoretical Foundations of Reinforcement Learning Workshop.

[Summary] [Paper] We provide a non-asymptotic analysis for two time-scale actor-critic methods under non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\theta)\|_2^2 \le \epsilon$) of the non-concave performance function $J({\theta})$, with $\mathcal{\widetilde{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods.

- A Finite-Time Analysis of Q-Learning with Neural Network Function Approximation

**Pan Xu**, Quanquan Gu

In Proc. of the 37th International Conference on Machine Learning (**ICML**), Online, 2020.

[Summary] [Paper] We present a finite-time analysis of a neural Q-learning algorithm, where the data are generated from a Markov decision process, and the action-value function is approximated by a deep ReLU neural network. We prove that neural Q-learning finds the optimal policy with an $O(1/\sqrt{T})$ convergence rate if the neural function approximator is sufficiently overparameterized, where $T$ is the number of iterations. To the best of our knowledge, our result is the first finite-time analysis of neural Q-learning under the non-i.i.d. data assumption.

- Stochastic Nested Variance Reduction for Nonconvex Optimization

Dongruo Zhou, **Pan Xu**, Quanquan Gu

Journal of Machine Learning Research (**JMLR**), Volume 21, No. 103, 2020.

The short version of this paper has been published in NeurIPS 2018. The journal version extends the original SNVRG algorithm for finding first order stationary points to local minimum finding algorithms by incorporating this manuscript: Finding Local Minima via Stochastic Nested Variance Reduction.

[Summary] [Paper] In this journal version of SNVRG, we further propose two algorithms that can find local minima faster than state-of-the-art algorithms in both finite-sum and general stochastic (online) nonconvex optimization. In particular, for finite-sum optimization problems, the proposed $\text{SNVRG}+\text{Neon2}^{\text{finite}}$ algorithm achieves $\tilde{O}(n^{1/2}\epsilon^{-2}+n\epsilon_H^{-3}+n^{3/4}\epsilon_H^{-7/2})$ gradient complexity to converge to an $(\epsilon, \epsilon_H)$-second-order stationary point, which outperforms $\text{SVRG}+\text{Neon2}^{\text{finite}}$ \citep{allen2018neon2}, the best existing algorithm, in a wide regime. For general stochastic optimization problems, the proposed $\text{SNVRG}+\text{Neon2}^{\text{online}}$ achieves $\tilde{O}(\epsilon^{-3}+\epsilon_H^{-5}+\epsilon^{-2}\epsilon_H^{-3})$ gradient complexity, which is better than both $\text{SVRG}+\text{Neon2}^{\text{online}}$ \citep{allen2018neon2} and Natasha2 \citep{allen2017natasha2} in certain regimes. Thorough experimental results on different nonconvex optimization problems back up our theory.

- Sample Efficient Policy Gradient Methods with Recursive Variance Reduction

**Pan Xu**, Felicia Gao, Quanquan Gu

In Proc. of the 8th International Conference on Learning Representations (**ICLR**), Addis Ababa, Ethiopia, 2020.

Partial results of this work have been presented at the NeurIPS 2019 Optimization Foundations of Reinforcement Learning Workshop, Vancouver, Canada, 2019 and the 2020 Virtual Conference on Reinforcement Learning for Real Life.

[Summary] [Paper] [Poster] [Code] [Video] We propose a novel policy gradient algorithm called SRVR-PG, which only requires $O(1/\epsilon^{3/2})$ episodes to find an $\epsilon$-approximate stationary point of the nonconcave performance function $J(\boldsymbol{\theta})$ (i.e., $\boldsymbol{\theta}$ such that $\|\nabla J(\boldsymbol{\theta})\|_2^2\leq\epsilon$). This sample complexity improves the existing result $O(1/\epsilon^{5/3})$ for stochastic variance reduced policy gradient algorithms by a factor of $O(1/\epsilon^{1/6})$. In addition, we also propose a variant of SRVR-PG with parameter exploration, which explores the initial policy parameter from a prior probability distribution. We conduct numerical experiments on classic control problems in reinforcement learning to validate the performance of our proposed algorithms.
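
The sample savings in SRVR-PG come from a recursive, variance-reduced gradient estimator. As a self-contained illustration, here is that recursive (SARAH-style) estimator applied to a least-squares finite sum instead of policy gradients, so the importance-sampling weights the RL setting needs are omitted; the problem, step size, and batch sizes are illustrative assumptions.

```python
import numpy as np

def recursive_vr_demo(n=200, d=5, epochs=5, m=20, eta=0.05, batch=5, seed=0):
    """Recursive variance-reduced gradient estimator, the device underlying
    SRVR-PG, on least squares: take a full gradient at each epoch start,
    then update v_t = grad_B(x_t) - grad_B(x_{t-1}) + v_{t-1} using small
    minibatches B, and step along v_t."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d))
    x_star = rng.normal(size=d)
    y = A @ x_star  # noiseless targets, so the minimizer is exactly x_star

    def grad(x, idx):  # minibatch gradient of 0.5 * mean((A x - y)^2)
        return A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)

    x = np.zeros(d)
    for _ in range(epochs):
        v = grad(x, np.arange(n))       # anchor: full gradient
        x_prev, x = x, x - eta * v
        for _ in range(m):
            idx = rng.integers(0, n, size=batch)
            v = grad(x, idx) - grad(x_prev, idx) + v  # recursive correction
            x_prev, x = x, x - eta * v
    return float(np.linalg.norm(x - x_star))

err = recursive_vr_demo()
```

Because both minibatch gradients in the correction are evaluated on the same indices, the added variance scales with $\|x_t - x_{t-1}\|^2$ and shrinks as the iterates converge, which is what buys the improved complexity.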

- Rank Aggregation via Heterogeneous Thurstone Preference Models

Tao Jin*, **Pan Xu***, Quanquan Gu, Farzad Farnoud

In Proc. of the 34th AAAI Conference on Artificial Intelligence (**AAAI**), New York, New York, USA, 2020. [Oral presentation, 4.5%]

[Summary] [Paper] [Code] We propose the Heterogeneous Thurstone Model (HTM) for aggregating ranked data, which can take the accuracy levels of different users into account. By allowing different noise distributions, the proposed HTM model maintains the generality of Thurstone's original framework, and as such, also extends the Bradley-Terry-Luce (BTL) model for pairwise comparisons to heterogeneous populations of users. Under this framework, we also propose a rank aggregation algorithm based on alternating gradient descent to estimate the underlying item scores and accuracy levels of different users simultaneously from noisy pairwise comparisons. We theoretically prove that the proposed algorithm converges linearly up to a statistical error which matches that of the state-of-the-art method for the single-user BTL model. We evaluate the proposed HTM model and algorithm on both synthetic and real data, demonstrating that it outperforms existing methods.

- Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction

Difan Zou, **Pan Xu**, Quanquan Gu

In Proc. of the 33rd Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Vancouver, Canada, 2019.

[Summary] [Paper] [Poster] Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithms have received increasing attention in both theory and practice. In this paper, we propose a Stochastic Recursive Variance-Reduced gradient HMC (SRVR-HMC) algorithm. It makes use of a semi-stochastic gradient estimator that recursively accumulates the gradient information to reduce the variance of the stochastic gradient. We provide a convergence analysis of SRVR-HMC for sampling from a class of non-log-concave distributions and show that SRVR-HMC converges faster than all existing HMC-type algorithms based on underdamped Langevin dynamics. Thorough experiments on synthetic and real-world datasets validate our theory and demonstrate the superiority of SRVR-HMC.

- Stochastic Variance-Reduced Cubic Regularization Methods

Dongruo Zhou, **Pan Xu**, Quanquan Gu

Journal of Machine Learning Research (**JMLR**), Volume 20, No. 134, 2019.

The short version of this paper has been published in ICML 2018. The journal version extends the SVRC algorithm to a sample-efficient one proposed in this manuscript: Sample Efficient Stochastic Variance-Reduced Cubic Regularization Method.

[Summary] [Paper] To reduce the sample complexity of Hessian matrix computation in SVRC, we also propose a sample efficient stochastic variance-reduced cubic regularization (Lite-SVRC) algorithm for finding the local minimum more efficiently. Lite-SVRC converges to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\tilde{O}(n+n^{2/3}/\epsilon^{3/2})$ Hessian sample complexity, which is faster than all existing cubic regularization based methods. Numerical experiments with different nonconvex optimization problems conducted on real datasets validate our theoretical results for both SVRC and Lite-SVRC.

- An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient

**Pan Xu**, Felicia Gao, Quanquan Gu

In Proc. of the 35th International Conference on Uncertainty in Artificial Intelligence (**UAI**), Tel Aviv, Israel, 2019. [Oral presentation, 7.8%]

[Summary] [Paper] We revisit the stochastic variance-reduced policy gradient (SVRPG) method proposed by \citet{papini2018stochastic} for reinforcement learning. We provide an improved convergence analysis of SVRPG and show that it can find an $\epsilon$-approximate stationary point of the performance function within $O(1/\epsilon^{5/3})$ trajectories. This sample complexity improves upon the best known result $O(1/\epsilon^2)$ by a factor of $O(1/\epsilon^{1/3})$. At the core of our analysis is (i) a tighter upper bound for the variance of importance sampling weights, where we prove that the variance can be controlled by the parameter distance between different policies; and (ii) a fine-grained analysis of the epoch length and batch size parameters such that we can significantly reduce the number of trajectories required in each iteration of SVRPG. We also empirically validate our theoretical claims about batch sizes on reinforcement learning benchmark tasks.

- Sampling from Non-Log-Concave Distributions via Variance-Reduced Gradient Langevin Dynamics

Difan Zou,**Pan Xu**, Quanquan Gu

In Proc. of the 22nd International Conference on Artificial Intelligence and Statistics (**AISTATS**), Naha, Okinawa, Japan, 2019.

[Summary] [Paper]We study stochastic variance reduction-based Langevin dynamic algorithms, SVRG-LD and SAGA-LD \citep{dubey2016variance}, for sampling from non-log-concave distributions. Under certain assumptions on the log density function, we establish the convergence guarantees of SVRG-LD and SAGA-LD in $2$-Wasserstein distance. More specifically, we show that both SVRG-LD and SAGA-LD require $ \tilde O\big(n+n^{3/4}/\epsilon^2 + n^{1/2}/\epsilon^4\big)\cdot \exp\big(\tilde O(d+\gamma)\big)$ stochastic gradient evaluations to achieve $\epsilon$-accuracy in $2$-Wasserstein distance, which outperforms the $ \tilde O\big(n/\epsilon^4\big)\cdot \exp\big(\tilde O(d+\gamma)\big)$ gradient complexity achieved by Langevin Monte Carlo Method \citep{raginsky2017non}. Experiments on synthetic data and real data back up our theory.

- Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization

**Pan Xu***, Jinghui Chen*, Difan Zou, Quanquan Gu

In Proc. of the 31st Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Montréal, Canada, 2018. [Spotlight presentation, 3.5%]

[Summary] [Paper] [Video]We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the $\textit{almost minimizer}$ within $\widetilde O\big(nd/(\lambda\epsilon) \big)$ and $\widetilde O\big(d^7/(\lambda^5\epsilon^5) \big)$ stochastic gradient evaluations respectively, where $d$ is the problem dimension, and $\lambda$ is the spectral gap of the Markov chain generated by GLD. Both results improve upon the best known gradient complexity results \citep{raginsky2017non}. Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (SVRG-LD) to the almost minimizer within $\widetilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime. Our theoretical analyses shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.
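
As a toy illustration of the iteration analyzed here (not the paper's analysis), the SGLD update adds Gaussian noise, scaled by the step size and inverse temperature, to a stochastic gradient step; the quadratic objective and all parameter values below are made up for the example:

```python
import numpy as np

def sgld_step(x, stoch_grad, step_size, inv_temp, rng):
    """One SGLD step: a stochastic-gradient descent step plus injected
    Gaussian noise whose scale is sqrt(2 * step_size / inv_temp)."""
    noise = rng.normal(size=x.shape)
    return x - step_size * stoch_grad(x) + np.sqrt(2 * step_size / inv_temp) * noise

# Toy run: sample around the minimizer of f(x) = ||x||^2 / 2 (grad = x).
rng = np.random.default_rng(0)
x = np.ones(2)
for _ in range(2000):
    x = sgld_step(x, lambda z: z, step_size=0.01, inv_temp=10.0, rng=rng)
```

With the noise term removed this reduces to plain gradient descent; the injected noise is what lets the iterates escape poor local minima and concentrate near the almost minimizer.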

- Stochastic Nested Variance Reduction for Nonconvex Optimization

Dongruo Zhou,**Pan Xu**, Quanquan Gu

In Proc. of the 31st Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Montréal, Canada, 2018. [Spotlight presentation, 3.5%]

[Summary] [Paper] [Video]We study finite-sum nonconvex optimization problems, where the objective function is an average of $n$ nonconvex functions. We propose a new stochastic gradient descent algorithm based on nested variance reduction. Compared with conventional stochastic variance reduced gradient (SVRG) algorithm that uses two reference points to construct a semi-stochastic gradient with diminishing variance in each iteration, our algorithm uses $K+1$ nested reference points to build a semi-stochastic gradient to further reduce its variance in each iteration. For smooth nonconvex functions, the proposed algorithm converges to an $\epsilon$-approximate first-order stationary point (i.e., $\|\nabla F(\mathbf{x})\|_2\leq \epsilon$) within $\widetilde O(n\land \epsilon^{-2}+\epsilon^{-3}\land n^{1/2}\epsilon^{-2})$ number of stochastic gradient evaluations. This improves the best known gradient complexity of SVRG $O(n+n^{2/3}\epsilon^{-2})$ and that of SCSG $O(n\land \epsilon^{-2}+\epsilon^{-10/3}\land n^{2/3}\epsilon^{-2})$. For gradient dominated functions, our algorithm also achieves better gradient complexity than the state-of-the-art algorithms. Thorough experimental results on different nonconvex optimization problems back up our theory.
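
For context, the conventional one-reference-point SVRG estimator that the nested scheme generalizes can be sketched as follows (the toy least-squares problem and helper names are our own illustration, not from the paper):

```python
import numpy as np

def svrg_gradient(grad_i, x, x_ref, full_grad_ref, idx):
    """Semi-stochastic gradient with one reference point (plain SVRG):
    v = grad_B(x) - grad_B(x_ref) + full_grad(x_ref) over minibatch B.
    Unbiased for the full gradient, with variance shrinking as x -> x_ref."""
    g = np.mean([grad_i(x, i) for i in idx], axis=0)
    g_ref = np.mean([grad_i(x_ref, i) for i in idx], axis=0)
    return g - g_ref + full_grad_ref

# Toy check on f(x) = (1/n) sum_i (x - a_i)^2 / 2, so grad_i(x) = x - a_i.
a = np.array([0.0, 1.0, 2.0, 3.0])
grad_i = lambda x, i: x - a[i]
x_ref = np.array(1.0)
full_grad_ref = np.mean(x_ref - a)
# For this quadratic the estimator is exact: v equals the full gradient at x = 2.
v = svrg_gradient(grad_i, np.array(2.0), x_ref, full_grad_ref, idx=[0, 2])
```

The nested scheme of the paper maintains $K+1$ such reference points at different time scales rather than the single one shown here.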

- Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

Yaodong Yu*,**Pan Xu***, Quanquan Gu

In Proc. of the 31st Conference on Advances in Neural Information Processing Systems (**NeurIPS**), Montréal, Canada, 2018.

[Summary] [Paper] [Video]We propose stochastic optimization algorithms that can find local minima faster than existing algorithms for nonconvex optimization problems, by exploiting the third-order smoothness to escape non-degenerate saddle points more efficiently. More specifically, the proposed algorithm only needs $\tilde{O}(\epsilon^{-10/3})$ stochastic gradient evaluations to converge to an approximate local minimum $\mathbf{x}$, which satisfies $\|\nabla f(\mathbf{x})\|_2\leq\epsilon$ and $\lambda_{\min}(\nabla^2 f(\mathbf{x}))\geq -\sqrt{\epsilon}$ in unconstrained stochastic optimization, where $\tilde{O}(\cdot)$ hides logarithm polynomial terms and constants. This improves upon the $\tilde{O}(\epsilon^{-7/2})$ gradient complexity achieved by the state-of-the-art stochastic local minima finding algorithms by a factor of $\tilde{O}(\epsilon^{-1/6})$. Experiments on two nonconvex optimization problems demonstrate the effectiveness of our algorithm and corroborate our theory.

- Subsampled Stochastic Variance-Reduced Gradient Langevin Dynamics

Difan Zou*,**Pan Xu***, Quanquan Gu

In Proc. of the 34th International Conference on Uncertainty in Artificial Intelligence (**UAI**), Monterey, California, USA, 2018.

[Summary] [Paper]Stochastic variance-reduced gradient Langevin dynamics (SVRG-LD) was recently proposed to improve the performance of stochastic gradient Langevin dynamics (SGLD) by reducing the variance of the stochastic gradient. In this paper, we propose a variant of SVRG-LD, namely SVRG-LD$^+$, which replaces the full gradient in each epoch with a subsampled one. We provide a nonasymptotic analysis of the convergence of SVRG-LD$^+$ in $2$-Wasserstein distance, and show that SVRG-LD$^+$ enjoys a lower gradient complexity than SVRG-LD, when the sample size is large or the target accuracy requirement is moderate. Our analysis directly implies a sharper convergence rate for SVRG-LD, which improves the existing convergence rate by a factor of $\kappa^{1/6}n^{1/6}$, where $\kappa$ is the condition number of the log-density function and $n$ is the sample size. Experiments on both synthetic and real-world datasets validate our theoretical results.

- Continuous and Discrete-Time Accelerated Stochastic Mirror Descent for Strongly Convex Functions

**Pan Xu***, Tianhao Wang*, Quanquan Gu

In Proc. of the 35th International Conference on Machine Learning (**ICML**), Stockholm, Sweden, 2018.

[Summary] [Paper]We provide a second-order stochastic differential equation (SDE), which characterizes the continuous-time dynamics of accelerated stochastic mirror descent (ASMD) for strongly convex functions. This SDE plays a central role in designing new discrete-time ASMD algorithms via numerical discretization and providing neat analyses of their convergence rates based on Lyapunov functions. Our results suggest that the only existing ASMD algorithm, namely, AC-SA proposed in \citet{ghadimi2012optimal} is one instance of its kind, and we can derive new instances of ASMD with fewer tuning parameters. This sheds light on revisiting accelerated stochastic optimization through the lens of SDEs, which can lead to a better understanding as well as new simpler algorithms of acceleration in stochastic optimization. Numerical experiments on both synthetic and real data support our theory.

- Stochastic Variance-Reduced Cubic Regularized Newton Method

Dongruo Zhou,**Pan Xu**, Quanquan Gu

In Proc. of the 35th International Conference on Machine Learning (**ICML**), Stockholm, Sweden, 2018.

[Summary] [Paper]We propose a stochastic variance-reduced cubic regularized Newton method (SVRC) for non-convex optimization. At the core of our algorithm is a novel semi-stochastic gradient along with a semi-stochastic Hessian, which are specifically designed for the cubic regularization method. We show that our algorithm is guaranteed to converge to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\widetilde{O}(n^{4/5}/\epsilon^{3/2})$ second-order oracle calls, which outperforms the state-of-the-art cubic regularization algorithms including subsampled cubic regularization. Our work also sheds light on the application of variance reduction technique to high-order non-convex optimization methods. Thorough experiments on various non-convex optimization problems support our theory.
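
The cubic-regularized subproblem solved at each Newton-type iteration can be illustrated in one dimension; the grid-search solver below is purely a didactic sketch (not the subproblem solver used in the paper), minimizing $g\,s + \tfrac{1}{2} h\, s^2 + \tfrac{M}{6}|s|^3$:

```python
import numpy as np

def cubic_step_1d(g, h, m_pen, grid=np.linspace(-5, 5, 100001)):
    """Solve the 1-D cubic-regularized model
    min_s g*s + 0.5*h*s^2 + (m_pen/6)*|s|^3 by grid search (sketch only)."""
    vals = g * grid + 0.5 * h * grid**2 + (m_pen / 6) * np.abs(grid)**3
    return grid[np.argmin(vals)]

# With gradient g=1 and negative curvature h=-1 (a saddle direction),
# the cubic model still has a bounded minimizer at s = -1 - sqrt(3).
s = cubic_step_1d(1.0, -1.0, 1.0)
```

Because the cubic term keeps the model bounded below even when the curvature $h$ is negative, such steps can move through saddle points, which is why cubic regularization finds approximate local minima rather than mere stationary points.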

- Stochastic Variance-Reduced Hamilton Monte Carlo Methods

Difan Zou*,**Pan Xu***, Quanquan Gu

In Proc. of the 35th International Conference on Machine Learning (**ICML**), Stockholm, Sweden, 2018.

[Summary] [Paper]We propose a fast stochastic Hamilton Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by the recent advance in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm achieves $\tilde O\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}\big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
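
The deterministic leapfrog integrator at the heart of any HMC method (shown here without the paper's variance-reduced gradient estimates) can be sketched as:

```python
import numpy as np

def leapfrog(x, p, grad_u, step, n_steps):
    """Leapfrog integrator for the Hamiltonian H(x, p) = U(x) + ||p||^2 / 2:
    a half momentum step, alternating full position/momentum steps, and a
    final half momentum correction. Approximately energy-preserving."""
    p = p - 0.5 * step * grad_u(x)
    for _ in range(n_steps):
        x = x + step * p
        p = p - step * grad_u(x)
    p = p + 0.5 * step * grad_u(x)  # undo the extra half step from the loop
    return x, p

# Toy check on the harmonic oscillator U(x) = x^2 / 2 (grad_u = identity):
x, p = leapfrog(np.array([1.0]), np.array([0.0]), lambda z: z, step=0.1, n_steps=10)
```

In a sampling loop one would resample the momentum `p` from a Gaussian before each trajectory; the stochastic variants replace `grad_u` with (variance-reduced) minibatch estimates.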

- Covariate Adjusted Precision Matrix Estimation via Nonconvex Optimization

Jinghui Chen,**Pan Xu**, Lingxiao Wang, Jian Ma, Quanquan Gu

In Proc. of the 35th International Conference on Machine Learning (**ICML**), Stockholm, Sweden, 2018. [Long talk, 8.6%]

[Summary] [Paper]We propose a nonconvex estimator for the covariate adjusted precision matrix estimation problem in the high dimensional regime, under sparsity constraints. To solve this estimator, we propose an alternating gradient descent algorithm with hard thresholding. Compared with existing methods along this line of research, which lack theoretical guarantees in optimization error and/or statistical error, the proposed algorithm not only is computationally much more efficient with a linear rate of convergence, but also attains the optimal statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real data support our theory.

- Accelerated Stochastic Mirror Descent: From Continuous-time Dynamics to Discrete-time Algorithms

**Pan Xu***, Tianhao Wang*, Quanquan Gu

In Proc. of the 21st International Conference on Artificial Intelligence and Statistics (**AISTATS**), Playa Blanca, Lanzarote, Canary Islands, 2018.

[Summary] [Paper]We present a new framework to analyze accelerated stochastic mirror descent through the lens of continuous-time stochastic dynamic systems. It enables us to design new algorithms, and perform a unified and simple analysis of the convergence rates of these algorithms. More specifically, under this framework, we provide a Lyapunov function based analysis for the continuous-time stochastic dynamics, as well as several new discrete-time algorithms derived from the continuous-time dynamics. We show that for general convex objective functions, the derived discrete-time algorithms attain the optimal convergence rate. Empirical experiments corroborate our theory.

- Speeding Up Latent Variable Gaussian Graphical Model Estimation via Nonconvex Optimization

**Pan Xu**, Jian Ma, Quanquan Gu

In Proc. of the 30th Conference on Advances in Neural Information Processing Systems (**NIPS**), Long Beach, California, USA, 2017.

[Summary] [Paper] [Poster] [Code]We study the estimation of the latent variable Gaussian graphical model (LVGGM), where the precision matrix is the superposition of a sparse matrix and a low-rank matrix. In order to speed up the estimation of the sparse plus low-rank components, we propose a sparsity constrained maximum likelihood estimator based on matrix factorization, and an efficient alternating gradient descent algorithm with hard thresholding to solve it. Our algorithm is orders of magnitude faster than the convex relaxation based methods for LVGGM. In addition, we prove that our algorithm is guaranteed to linearly converge to the unknown sparse and low-rank components up to the optimal statistical precision. Experiments on both synthetic and genomic data demonstrate the superiority of our algorithm over the state-of-the-art algorithms and corroborate our theory.
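
The hard-thresholding projection applied after each gradient step can be sketched as follows (a generic illustration of the operator, not the paper's full alternating algorithm):

```python
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest:
    the Euclidean projection onto the set of s-sparse vectors."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]  # indices of the s largest |x_i|
    out[idx] = x[idx]
    return out

v = hard_threshold(np.array([0.1, -3.0, 0.5, 2.0]), s=2)  # keeps -3.0 and 2.0
```

In the sparse-plus-low-rank setting, a step like this enforces the sparsity constraint on one component while the low-rank factor is updated by plain gradient descent.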

- Uncertainty Assessment and False Discovery Rate Control in High-Dimensional Granger Causal Inference

Aditya Chaudhry,**Pan Xu**, Quanquan Gu

In Proc. of the 34th International Conference on Machine Learning (**ICML**), Sydney, Australia, 2017.

[Summary] [Paper]Causal inference among high-dimensional time series is an important research problem in many fields. In the classical regime, one often establishes causality among time series via a concept known as “Granger causality.” However, existing approaches for Granger causal inference in high-dimensional data lack the means to characterize the uncertainty associated with Granger causality estimates (e.g., p-values and confidence intervals). We make two contributions in this work. First, we introduce a novel unbiased estimator for assessing Granger causality in the high-dimensional regime. We propose test statistics and confidence intervals for our estimator to allow, for the first time, uncertainty characterization in high-dimensional Granger causal inference. Second, we introduce a novel method for false discovery rate control that achieves higher power in multiple testing than existing techniques and that can cope with dependent test statistics and dependent observations. We corroborate our theoretical results with experiments on both synthetic data and real-world climatological data.

- Efficient Algorithm for Sparse Tensor-variate Gaussian Graphical Models via Gradient Descent

**Pan Xu**, Tingting Zhang, Quanquan Gu

In Proc. of the 20th International Conference on Artificial Intelligence and Statistics (**AISTATS**), Fort Lauderdale, Florida, USA, 2017.

[Summary] [Paper]We study the sparse tensor-variate Gaussian graphical model (STGGM), where each way of the tensor follows a multivariate normal distribution whose precision matrix has sparse structures. To estimate the precision matrices, we propose a sparsity constrained maximum likelihood estimator. However, due to the complex structure of the tensor-variate GGMs, the likelihood based estimator is non-convex, which poses great challenges for both computation and theoretical analysis. In order to address these challenges, we propose an efficient alternating gradient descent algorithm to solve this estimator and prove that, under certain conditions on the initial estimator, our algorithm is guaranteed to linearly converge to the unknown precision matrices up to the optimal statistical error. Experiments on both synthetic data and real world brain imaging data corroborate our theory.

- Semiparametric Differential Graph Models

**Pan Xu**, Quanquan Gu

In Proc. of the 29th Conference on Advances in Neural Information Processing Systems (**NIPS**), Barcelona, Spain, 2016.

[Summary] [Paper] [Video]In network analysis, it is often more informative to study how a network varies across conditions than to examine a single static network. We propose a novel graphical model, namely the Latent Differential Graph Model, where the networks under two different conditions are represented by two semiparametric elliptical distributions respectively, and the variation of these two networks (i.e., differential graph) is characterized by the difference between their latent precision matrices. We propose an estimator for the differential graph based on quasi-likelihood maximization with nonconvex regularization. We show that our estimator attains a faster statistical rate in parameter estimation than the state-of-the-art methods, and enjoys the oracle property under mild conditions. Thorough experiments on both synthetic and real-world data support our theory.

- Forward Backward Greedy Algorithms for Multi-Task Learning with Faster Rates

Lu Tian,**Pan Xu**, Quanquan Gu

In Proc. of the 32nd International Conference on Uncertainty in Artificial Intelligence (**UAI**), New York / New Jersey, USA, 2016.

[Summary] [Paper]In this paper, we develop a multi-task learning algorithm with faster convergence rates. In particular, we propose a general estimator for multitask learning with row sparsity constraints on the parameter matrix. The proposed estimator is a nonconvex optimization problem. To solve it, we develop a forward backward greedy algorithm with provable guarantees. More specifically, we prove that the output of the greedy algorithm attains a sharper estimation error bound than state-of-the-art multi-task learning methods. Moreover, our estimator enjoys model selection consistency under mild conditions. Thorough experiments on both synthetic and real-world data demonstrate the effectiveness of our method and back up our theory.
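
A toy rendition of a forward-backward greedy loop for least squares (the loss, tolerance, and helper names below are our own illustration, not the paper's row-sparse multi-task estimator):

```python
import numpy as np

def fb_greedy(X, y, max_size, backward_tol=0.5):
    """Sketch of forward-backward greedy feature selection:
    forward, add the feature that most reduces the residual loss;
    backward, drop any feature whose removal barely increases it."""
    n, d = X.shape
    S = []

    def loss(S):
        if not S:
            return float(y @ y) / n
        beta, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        r = y - X[:, S] @ beta
        return float(r @ r) / n

    while len(S) < max_size:
        # Forward step: best single feature to add.
        gains = {j: loss(S) - loss(S + [j]) for j in range(d) if j not in S}
        j_best = max(gains, key=gains.get)
        if gains[j_best] <= 1e-12:
            break
        S.append(j_best)
        # Backward step: prune features whose removal costs little
        # relative to the gain just made.
        for j in list(S):
            if j == j_best:
                continue
            if loss([k for k in S if k != j]) - loss(S) < backward_tol * gains[j_best]:
                S.remove(j)
    return sorted(S)

# Noiseless toy data where only features 0 and 1 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
y = 2 * X[:, 0] - X[:, 1]
support = fb_greedy(X, y, max_size=2)
```

The backward step is what distinguishes this family from pure forward greedy: a feature added early by chance can be removed once better substitutes enter the support.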