💡 For an online algorithm, access to offline data can have zero or even a negative effect on online performance.
Contribution
Exploration:
We shift from deterministic to stochastic policies for defining exploration objectives during the online phase.
A New Replay Buffer:
We develop a novel replay buffer consistent with the architecture and training protocol of ODT (Online Decision Transformer).
Methodology
This paper combines DT with SAC, which adopts a maximum-entropy objective to encourage exploration during fine-tuning.
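As a reminder, SAC's maximum-entropy objective augments the expected return with a policy-entropy bonus weighted by a temperature $\alpha$:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]$$

Larger $\alpha$ pushes the policy toward higher-entropy (more exploratory) behavior; $\alpha \to 0$ recovers the standard RL objective.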
Minor changes:
Change the replay buffer to store trajectories instead of individual transitions.
Utilize HER (Hindsight Experience Replay) to improve sample efficiency in sparse-reward settings.
Sampling strategy.
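A minimal sketch of the first two changes, under stated assumptions: the buffer below stores whole trajectories and supports HER-style "final" relabeling. All names and the 0/1 sparse-reward convention are illustrative, not ODT's actual implementation.

```python
import random
from collections import deque


class TrajectoryReplayBuffer:
    """Hypothetical sketch: stores whole trajectories (not transitions)
    and supports HER 'final'-strategy goal relabeling."""

    def __init__(self, capacity=1000):
        # Oldest trajectories are evicted once capacity is reached.
        self.trajectories = deque(maxlen=capacity)

    def add(self, trajectory):
        # trajectory: list of (state, action, reward, goal) tuples.
        self.trajectories.append(trajectory)

    def sample(self, batch_size):
        # Uniform sampling over stored trajectories; a return-weighted
        # strategy could be substituted here.
        return random.sample(self.trajectories,
                             min(batch_size, len(self.trajectories)))

    def her_relabel(self, trajectory):
        # HER 'final' strategy: treat the last achieved state as the goal,
        # so the trajectory becomes a success in hindsight. Sparse 0/1
        # reward convention assumed for illustration.
        achieved = trajectory[-1][0]
        relabeled = []
        for state, action, _reward, _goal in trajectory:
            new_reward = 1.0 if state == achieved else 0.0
            relabeled.append((state, action, new_reward, achieved))
        return relabeled
```

In a sparse-reward task every stored trajectory would otherwise carry all-zero rewards; relabeling converts failed episodes into informative successes toward the goal they actually reached.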
References
Levine, Sergey. "Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review." arXiv:1805.00909 (2018).
Andrychowicz, Marcin, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. "Hindsight Experience Replay." Advances in Neural Information Processing Systems (2017).