# Q-learning with Logarithmic Regret

    @inproceedings{Yang2021QlearningWL,
      title     = {Q-learning with Logarithmic Regret},
      author    = {Kunhe Yang and Lin F. Yang and Simon Shaolei Du},
      booktitle = {AISTATS},
      year      = {2021}
    }

This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning algorithm studied in [Jin et al. 2018] enjoys a ${\mathcal{O}}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\mathrm{gap}_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\mathrm{gap}_{\min}$ is the minimum sub-optimality gap.
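Here $\mathrm{gap}_{\min}$ is commonly defined as the smallest strictly positive gap $V_h^*(s) - Q_h^*(s,a)$ over all steps, states, and actions. As a rough illustration of the analyzed algorithm, the following is a minimal sketch of tabular optimistic $Q$-learning with Hoeffding-style UCB bonuses in the spirit of [Jin et al. 2018]; the environment interface, the constant `c`, and the exact bonus form are simplifying assumptions for illustration, not the paper's precise construction.

```python
import numpy as np

def optimistic_q_learning(env, S, A, H, K, c=1.0, delta=0.01):
    """Tabular optimistic Q-learning with Hoeffding-style UCB bonuses
    (a sketch in the spirit of Jin et al. 2018, not their exact algorithm)."""
    iota = np.log(S * A * H * K / delta)       # log factor inside the bonus
    Q = np.full((H, S, A), float(H))           # optimistic initialization
    N = np.zeros((H, S, A), dtype=int)         # visit counts
    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))        # act greedily w.r.t. optimistic Q
            s_next, r = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)          # learning rate alpha_t = (H+1)/(H+t)
            bonus = c * np.sqrt(H ** 3 * iota / t)  # UCB exploration bonus
            v_next = min(H, Q[h + 1, s_next].max()) if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            s = s_next
    return Q.argmax(axis=2)                    # greedy policy, shape (H, S)

# Toy single-state, two-action environment (hypothetical, for illustration):
class ToyEnv:
    def reset(self):
        return 0
    def step(self, a):
        return 0, (1.0 if a == 0 else 0.0)     # action 0 is optimal

policy = optimistic_q_learning(ToyEnv(), S=1, A=2, H=1, K=200)
```

On the toy environment the learned greedy policy settles on the optimal action once the decaying bonus no longer favors the unexplored alternative; the gap-dependent analysis quantifies how quickly such suboptimal actions stop being selected.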

#### 16 Citations

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

- Computer Science, Mathematics
- COLT
- 2021

Proposes a new Bernstein-type concentration inequality for self-normalized martingales in linear bandit problems with bounded noise, and a new, computationally efficient algorithm with linear function approximation, named UCRL-VTR, for linear mixture MDPs in the episodic undiscounted setting.

An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

- Computer Science, Mathematics
- ArXiv
- 2021

This work's hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal $Q$-function, which implies an exponential separation between the online RL setting and the generative model setting.

Minimax Sample Complexity for Turn-based Stochastic Game

- Computer Science, Mathematics
- ArXiv
- 2020

This work proves that the plug-in solver approach, arguably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSG) by utilizing a "simulator" that allows sampling from arbitrary state-action pairs.

Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

- Computer Science, Mathematics
- 2021

A novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\cdot\mathrm{poly}(H)$, and improves — by at least a factor of $SA$ — upon any prior memory-efficient algorithm that is asymptotically regret-optimal.

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

- Computer Science
- ArXiv
- 2021

An efficient algorithm is provided that takes only $\widetilde{O}\left(\frac{1}{\epsilon}\cdot\left(\frac{H^3SA}{\rho}+H^4SA\right)\right)$ episodes of exploration, and is able to obtain an $\epsilon$-optimal policy for a post-revealed reward with sub-optimality gap at least $\rho$, obtaining a nearly quadratic saving in terms of $\epsilon$.

Logarithmic Regret for Reinforcement Learning with Linear Function Approximation

- Computer Science, Mathematics
- ICML
- 2021

It is shown that logarithmic regret is attainable under two recently proposed linear MDP assumptions provided that there exists a positive sub-optimality gap for the optimal action-value function.

Provably Efficient Multi-Task Reinforcement Learning with Model Transfer

- Computer Science
- ArXiv
- 2021

A heterogeneous multi-player RL problem, in which a group of players concurrently face similar but not necessarily identical MDPs, is formulated with a goal of improving their collective performance through inter-player information sharing.

A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs

- Computer Science
- ArXiv
- 2021

A novel asymptotic problem-dependent lower bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs) is derived, which shows that, in certain "simple" MDPs, the lower bound is considerably smaller than in the general case and does not scale with the minimum action gap at all.

Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

- Computer Science
- ArXiv
- 2021

It is shown that optimistic algorithms cannot achieve the information-theoretic lower bounds even in deterministic MDPs unless there is a unique optimal policy, and tighter upper regret bounds for optimistic algorithms are proved.

Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap

- Computer Science
- COLT
- 2021

A new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound; its upper bound is complemented with a lower bound showing that the dependency on $|Z_{\mathrm{mul}}|/\Delta_{\min}$ is unavoidable for any consistent algorithm.

#### References

Showing 1–10 of 50 references.

Minimax Regret Bounds for Reinforcement Learning

- Mathematics, Computer Science
- ICML
- 2017

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of…

Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity

- Computer Science, Mathematics
- ArXiv
- 2020

The open problem on agnostic $Q$-learning proposed in [Wen and Van Roy, NIPS 2013] is settled, and the upper bound suggests that the sample complexity of $\widetilde{\Theta}\left(\rho/\sqrt{\mathrm{dim}_E}\right)$ is tight even in the agnostic setting.

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP

- Computer Science, Mathematics
- ICLR
- 2020

It is shown that the sample complexity of exploration of the proposed Q-learning algorithm is bounded by $\tilde{O}\left(\frac{SA}{\epsilon^2(1-\gamma)^7}\right)$, which improves the previously best known result.

Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

- Mathematics, Computer Science
- NeurIPS
- 2020

To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically impossible).

Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

- Computer Science, Mathematics
- NeurIPS
- 2020

A model-free algorithm UCB-Advantage is proposed and it is proved that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play.

Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

- Computer Science, Mathematics
- NIPS
- 2015

The upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least $H^3$.

Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model

- Computer Science, Mathematics
- NeurIPS
- 2018

The method is extended to computing $\epsilon$-optimal policies for finite-horizon MDPs with a generative model, and matches the sample complexity lower bounds proved in \cite{azar2013minimax} up to logarithmic factors.

Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes

- Mathematics, Computer Science
- SODA
- 2018

This paper is one of the few instances of using sampling to obtain a linearly convergent linear programming algorithm, and it is hoped that the analysis may be useful more broadly.

Is Q-learning Provably Efficient?

- Computer Science, Mathematics
- NeurIPS
- 2018

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically…

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

- Computer Science, Mathematics
- ICML
- 2019

An algorithm for finite-horizon discrete MDPs and an associated analysis that both yields state-of-the-art worst-case regret bounds in the dominant terms and yields substantially tighter bounds if the RL environment has a small environmental norm, which is a function of the variance of the next-state value functions.