Decision Transformer-based decision-making agents have shown the ability to generalize across multiple tasks. However, their performance relies on massive data and computation. We argue that this inefficiency stems from the forgetting phenomenon, in which a model memorizes its behaviors in parameters throughout training. As a result, training on a new task may deteriorate the model's performance on previous tasks. In contrast to LLMs' implicit memory mechanism, the human brain utilizes distributed memory storage, which helps manage and organize multiple skills efficiently, mitigating the forgetting phenomenon. Inspired by this, we propose a working memory module to store, blend, and retrieve information for different downstream tasks. Evaluation results show that the proposed method improves training efficiency and generalization in Atari games and Meta-World object manipulation tasks. Moreover, we demonstrate that memory fine-tuning further enhances the adaptability of the proposed architecture.
Our motivation comes from how humans think before they act: they can reason over past experiences to generate appropriate behavior in new situations. We want to equip our robots with similar abilities. Imagine training a robot to play four different Atari games: Asteroids, Asteroids Deluxe, Space Invaders, and Space Invaders II (shown in the figure below). Asteroids Deluxe is a sequel to Asteroids that introduces new boss fights and enemies; similarly, Space Invaders II is a sequel to Space Invaders. For a robot to play these four games, it must actively store what it has learned in memory and choose the appropriate strategy for each game. Throughout training, the robot's memory module continuously processes and updates relevant game information, allowing it to make informed decisions and adapt its strategies.
To store incoming information and blend it with existing memory, we calculate an erasing vector, \(\epsilon^e\), and an adding vector, \(\epsilon^a\). The erasing vector clears content from the current memory, while the adding vector controls the flow of new information into the memory. We compute both with an attention mechanism. First, we map the memory and the input information to query, key, and value vectors: \(\hat{Q}=M\hat{W}^q\), \(\hat{K}=E\hat{W}^k\), and \(\hat{V}=E\hat{W}^v\), where \(\hat{W}^q\), \(\hat{W}^k\), and \(\hat{W}^v\) are learnable projection matrices. Next, we calculate the writing strength, \(\beta = \text{softmax}\Big(\frac{\hat{Q}\hat{K}^T}{\sqrt{d}}\Big)\).
The erasing vector \(\epsilon^e = w \odot (1 - \beta)\), where \(w\) is the content-based address over memory slots (the same address used for retrieval below) and \(\odot\) indicates element-wise multiplication, selectively erases information from the memory matrix. The adding vector \(\epsilon^a = (w \odot \beta) \hat{W}^v x\) selectively adds information to the memory matrix. Finally, the memory is updated as \(M_t = M_{t-1} \odot (1 - \epsilon^e) + \epsilon^a\). New information is stored if the selected memory slot is empty or has been erased; otherwise, it blends with the existing memory contents.
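To make the write step concrete, here is a minimal PyTorch sketch (not the authors' implementation): the slot count `n_slots`, the width `d`, the treatment of the input as a single vector `x`, and the softmax form of the content address `w` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

n_slots, d = 8, 64                             # assumed memory size (illustrative)
W_q = torch.nn.Linear(d, d, bias=False)        # \hat{W}^q
W_k = torch.nn.Linear(d, d, bias=False)        # \hat{W}^k
W_v = torch.nn.Linear(d, d, bias=False)        # \hat{W}^v

M = torch.randn(n_slots, d)                    # memory matrix M_{t-1}
x = torch.randn(d)                             # incoming information (assumed single vector)

# Content-based addressing: queries come from the memory slots, keys/values from the input.
Q = W_q(M)                                     # (n_slots, d)
k, v = W_k(x), W_v(x)                          # (d,), (d,)
beta = F.softmax(Q @ k / d ** 0.5, dim=0)      # writing strength beta, one value per slot
w = F.softmax(M @ k / d ** 0.5, dim=0)         # content address w (assumed form), per slot

# Erase, add, then blend: M_t = M_{t-1} ⊙ (1 - eps_e) + eps_a.
eps_e = (w * (1.0 - beta)).unsqueeze(-1)       # erasing vector, broadcast over width d
eps_a = (w * beta).unsqueeze(-1) * v           # adding vector (w ⊙ beta) \hat{W}^v x
M_t = M * (1.0 - eps_e) + eps_a                # updated memory, (n_slots, d)
```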
To use the memory for decision-making, we retrieve information from the updated memory slots. Reading from the memory matrix is done by computing a read position vector with the content-based addressing mechanism described above, which compares the query vector with the contents of the memory matrix. Note that in other retrieval-based methods, nearest-neighbor search is the common way to retrieve related information; in our case, however, the working memory is considerably smaller than a typical external memory, which makes attention-based retrieval feasible. Since the query information is the same as the input information, we use the same content address to retrieve the memory: \({E}_{\text{out}} = {w}\odot{M}_t\).
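A matching sketch of the read step, under the same assumed shapes as above; the helper `read_memory` is hypothetical and simply applies \(E_{\text{out}} = w \odot M_t\), broadcasting the per-slot address over the slot width.

```python
import torch

def read_memory(w: torch.Tensor, M_t: torch.Tensor) -> torch.Tensor:
    """E_out = w ⊙ M_t: weight each row (slot) of the updated memory by its address."""
    return w.unsqueeze(-1) * M_t               # (n_slots, d)

# Toy usage with the assumed shapes from the write sketch.
n_slots, d = 8, 64
w = torch.softmax(torch.randn(n_slots), dim=0)  # content address over slots
M_t = torch.randn(n_slots, d)                   # updated memory
E_out = read_memory(w, M_t)
```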
@inproceedings{kang2023think,
title={Think Before You Act: Decision Transformers with Working Memory},
author={Kang, Jikun and Laroche, Romain and Yuan, Xingdi and Trischler, Adam and Liu, Xue and Fu, Jie},
booktitle={Forty-first International Conference on Machine Learning},
year={2024}
}