Online Decision Transformer

Combines the Decision Transformer with a SAC-style maximum-entropy objective to enable online fine-tuning after offline pretraining.

📖 arXiv: 2202.05607

Motivation

💡 For an online algorithm, access to offline data can have zero or even a negative effect on online performance.

Contribution

  1. Exploration:

    We shift from deterministic to stochastic policies to define the exploration objective during the online phase (see the objective sketch after this list).

  2. A New Replay Buffer:

    We develop a novel replay buffer consistent with the architecture and training protocol of ODT.
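
A minimal sketch of the entropy-regularized objective this refers to, assuming a stochastic action distribution π_θ, context length K, return-to-go tokens g, a target entropy β, and a dual variable λ playing the role of SAC's temperature (notation is mine, not lifted from the paper):

```latex
% Hedged sketch: the policy maximizes action log-likelihood on sampled contexts,
% subject to a lower bound on its average entropy (a max-entropy constraint a la SAC).
\min_{\theta} \; J(\theta)
  \;=\; \frac{1}{K}\,\mathbb{E}\Big[-\log \pi_\theta\big(a_t \,\big|\, s_{t-K+1:t},\, g_{t-K+1:t}\big)\Big]
\quad \text{s.t.} \quad
  \frac{1}{K}\,\mathbb{E}\Big[\mathcal{H}\big(\pi_\theta(\cdot \mid s, g)\big)\Big] \;\ge\; \beta .

% The constraint is handled with a Lagrangian and a learned dual variable \lambda \ge 0,
% analogous to SAC's automatic temperature tuning:
\mathcal{L}(\theta, \lambda)
  \;=\; J(\theta) \;+\; \lambda \Big(\beta \;-\; \tfrac{1}{K}\,\mathbb{E}\big[\mathcal{H}(\pi_\theta)\big]\Big).
```

Dropping the entropy constraint and using a fixed-variance Gaussian action head recovers the usual deterministic DT regression loss.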

Methodology

  1. This paper combines DT with SAC-style maximum-entropy RL [1], adopting an entropy-regularized objective to encourage exploration during fine-tuning.
  2. Minor changes:
    1. The replay buffer stores whole trajectories instead of individual transitions.
    2. Hindsight return relabeling, in the spirit of HER [2], to improve sample efficiency in sparse-reward settings.
    3. Sampling strategy: trajectories are drawn from the buffer and truncated into fixed-length contexts for training (see the sketch after this list).
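
A minimal sketch of the trajectory buffer, hindsight return relabeling, and sampling described above, assuming NumPy and illustrative names (`TrajectoryReplayBuffer`, `context_len`); padding/masking of short contexts and batching details are omitted:

```python
# Hedged sketch of a trajectory-level replay buffer in the spirit of ODT.
# Names and eviction/relabeling details are simplified assumptions, not the paper's code.
import numpy as np


class TrajectoryReplayBuffer:
    """Stores whole trajectories (not transitions) and samples fixed-length contexts."""

    def __init__(self, capacity: int, context_len: int):
        self.capacity = capacity          # max number of trajectories kept (oldest evicted first)
        self.context_len = context_len    # K: length of the sub-sequence fed to the transformer
        self.trajectories = []

    def add(self, states, actions, rewards):
        """Add one rollout; relabel returns-to-go in hindsight from the achieved rewards."""
        rewards = np.asarray(rewards, dtype=np.float32)
        # Hindsight return relabeling: returns-to-go are recomputed from what actually happened,
        # not the target return used to condition the policy during collection.
        rtg = np.cumsum(rewards[::-1])[::-1]
        self.trajectories.append({
            "states": np.asarray(states, dtype=np.float32),
            "actions": np.asarray(actions, dtype=np.float32),
            "returns_to_go": rtg,
        })
        if len(self.trajectories) > self.capacity:
            self.trajectories.pop(0)      # drop the oldest trajectory

    def sample(self, batch_size: int, rng=np.random):
        """Sample contexts; trajectories are drawn in proportion to their length,
        so every timestep is (roughly) equally likely to be trained on."""
        lengths = np.array([len(t["returns_to_go"]) for t in self.trajectories], dtype=np.float64)
        probs = lengths / lengths.sum()
        batch = {"states": [], "actions": [], "returns_to_go": []}
        for _ in range(batch_size):
            traj = self.trajectories[rng.choice(len(self.trajectories), p=probs)]
            T = len(traj["returns_to_go"])
            start = rng.randint(0, max(T - self.context_len, 0) + 1)
            end = start + self.context_len
            for key in batch:
                batch[key].append(traj[key][start:end])
        return batch
```

In the paper's setup the buffer is initialized with high-return offline trajectories and then refilled with online rollouts; length-proportional sampling keeps per-timestep coverage roughly uniform, which is the behavior this sketch assumes.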

References

  1. Levine, Sergey. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.” arXiv:1805.00909, 2018.
  2. Andrychowicz, Marcin, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. “Hindsight Experience Replay.” Advances in Neural Information Processing Systems, 2017.