Introduction to the Thesis Project
Advisor / Co-advisor: External
Areas: Artificial intelligence
Sub-areas: Machine learning, Search methodology
Status: Available
Description

The principal advisor of this thesis will be Rex Liu, researcher at CENIA. Valentin will act as co-advisor.

Motivated by recent successes in large-scale language and vision models, recent work has sought to train transformer models in a similar manner for reinforcement learning (RL) tasks, using either next-token [1] or masked-token [2] prediction. However, in contrast to vision and language, RL data comprise trajectories composed of three distinct modalities: observations, actions, and rewards. Yet current RL transformer architectures process all three modalities in exactly the same way, with a single uni-modal model. Natural intelligence offers a different blueprint: one hallmark of biological brains is their functional specialisation, with one region focused on visual processing, another on reward processing, and a third on motor control.

There are several advantages to this approach. First, the nature of the information contained in each modality is different and should arguably be pre-processed in a distinct manner. Second, pre-processing each modality separately allows the model to first extract more temporally extended latent representations for each modality, such as continuous action trajectories or object tracks across a visual scene, before considering how these latent representations interact across modalities. Third, different modalities can have different computational demands: visual inputs are far more complex and higher-dimensional than motor commands, and accordingly the visual cortex has evolved to occupy a substantial fraction of the human brain.
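
To make this concrete, below is a minimal PyTorch sketch of the multi-modal alternative: each modality passes through its own small encoder before a shared transformer trunk models cross-modal interactions. All module names, dimensions, and the per-timestep token interleaving are illustrative assumptions, not the exact architectures of [1] or [2].

    import torch
    import torch.nn as nn

    class MultiModalTrajectoryEncoder(nn.Module):
        """Illustrative multi-modal trunk: one encoder per modality, shared transformer."""
        def __init__(self, obs_dim=64, act_dim=8, d_model=128, n_layers=2, n_heads=4):
            super().__init__()
            # Observations are higher-dimensional, so they get a deeper encoder
            # than low-dimensional actions or scalar rewards.
            self.obs_enc = nn.Sequential(nn.Linear(obs_dim, d_model), nn.GELU(),
                                         nn.Linear(d_model, d_model))
            self.act_enc = nn.Linear(act_dim, d_model)
            self.rew_enc = nn.Linear(1, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, n_layers)  # shared across modalities

        def forward(self, obs, act, rew):
            # obs: (B, T, obs_dim), act: (B, T, act_dim), rew: (B, T, 1)
            tokens = torch.stack(
                [self.obs_enc(obs), self.act_enc(act), self.rew_enc(rew)], dim=2
            )  # (B, T, 3, d_model): one token per modality per timestep
            B, T, M, D = tokens.shape
            return self.trunk(tokens.reshape(B, T * M, D))  # interleaved (o, a, r)

    model = MultiModalTrajectoryEncoder()
    out = model(torch.randn(2, 10, 64), torch.randn(2, 10, 8), torch.randn(2, 10, 1))
    print(out.shape)  # torch.Size([2, 30, 128])

A uni-modal baseline would instead embed all three inputs with a single shared projection; keeping the trunk identical in both variants isolates the effect of the modality-specific encoders.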

The goal of this project will be to explore the benefits of multi-modal architectures in RL, in contrast to the currently prevailing uni-modal approach. To that end, the project will consist of four main parts:

  1. Working within pre-existing transformer RL frameworks, namely the Decision Transformer [1] and/or MaskDP [2], the student will modify the existing uni-modal architectures into multi-modal ones, along the lines of the sketch above.

  2. Benchmark the multi-modal architectures against standard offline RL benchmarks from the literature. Specifically, we will compare RL performance against two metrics: (a) architectural complexity, as measured by the number of model parameters, and (b) computational efficiency, as measured by the number of inference-time operations. The goal is to establish whether multi-modal architectures are more efficient than uni-modal ones on both metrics (see the measurement sketch after this list).

  3. Potentially explore the benefits of different computer vision data augmentation techniques on the robustness of the visual representations acquired within the context of RL. If working within the MaskDP framework, also investigate different possible masking strategies for each modality (see the masking sketch after this list).

  4. To better understand the nature of the acquired representations and how modalities may interact, we will visualise the representations acquired for vision and action through the transformer attention maps (see the attention sketch after this list).
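
For part 2, a hedged sketch of the two efficiency metrics follows. Parameter counting is exact; the operation count is a rough matmul-FLOP estimate obtained with forward hooks on nn.Linear layers only, which deliberately ignores attention score products, normalisation, and activations, so it is a simplification rather than a full inference-cost measurement.

    import torch
    import torch.nn as nn

    def count_parameters(model: nn.Module) -> int:
        # Metric (a): architectural complexity as trainable parameter count.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    def estimate_linear_flops(model: nn.Module, *example_inputs) -> int:
        # Metric (b), approximated: multiply-add operations in nn.Linear layers.
        total = 0
        hooks = []
        def hook(module, inputs, output):
            nonlocal total
            # y = x W^T + b costs ~2 * output.numel() * in_features FLOPs
            total += 2 * output.numel() * module.in_features
        for m in model.modules():
            if isinstance(m, nn.Linear):
                hooks.append(m.register_forward_hook(hook))
        with torch.no_grad():
            model(*example_inputs)
        for h in hooks:
            h.remove()
        return total

    # Stand-in model; in practice both the uni-modal and multi-modal candidates
    # would be measured on identical example trajectories.
    model = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 8))
    print(count_parameters(model), estimate_linear_flops(model, torch.randn(1, 10, 64)))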
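
For part 3, one possible modality-aware masking scheme for a MaskDP-style objective is sketched below; the per-modality ratios are hypothetical tuning knobs, and MaskDP's actual masking strategy may differ.

    import torch

    def modality_masks(T, ratios=None, generator=None):
        """Boolean mask per modality over T timesteps (True = token is masked)."""
        if ratios is None:
            # Hypothetical choice: mask observations aggressively, actions and
            # rewards lightly, reflecting their different information content.
            ratios = {"obs": 0.5, "act": 0.15, "rew": 0.15}
        return {m: torch.rand(T, generator=generator) < r for m, r in ratios.items()}

    masks = modality_masks(10)
    # Masked tokens would be replaced by a learned [MASK] embedding before the trunk.
    print({m: mask.int().tolist() for m, mask in masks.items()})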
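
For part 4, a sketch of pulling out an attention map. A raw nn.MultiheadAttention is used here because PyTorch's built-in nn.TransformerEncoderLayer does not return attention weights, so in practice one would hook or re-implement the attention block of the chosen model; the token layout is the interleaved (o, a, r) scheme assumed in the first sketch.

    import torch
    import torch.nn as nn
    import matplotlib.pyplot as plt

    d_model, n_heads, T = 128, 4, 30  # e.g. 10 timesteps x 3 modality tokens
    attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    tokens = torch.randn(1, T, d_model)  # stand-in for trunk activations

    # Self-attention over the token sequence, averaged over heads: (1, T, T).
    _, weights = attn(tokens, tokens, tokens, need_weights=True,
                      average_attn_weights=True)

    plt.imshow(weights[0].detach().numpy(), cmap="viridis")
    plt.xlabel("key token (o/a/r interleaved)")
    plt.ylabel("query token (o/a/r interleaved)")
    plt.title("attention map (random weights, illustration only)")
    plt.savefig("attention_map.png")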

This project can potentially be divided into two, with one student focusing on benchmarking efficiency and the other on studying model representations.

References

[1] Lili Chen et al. “Decision Transformer: Reinforcement Learning via Sequence Modeling”. In: Advances in Neural Information Processing Systems. Ed. by A. Beygelzimer et al. 2021. URL: https://openreview.net/forum?id=a7APmM4B9d.

[2] Fangchen Liu et al. “Masked Autoencoding for Scalable and Generalizable Decision Making”. In: Advances in Neural Information Processing Systems. Ed. by S. Koyejo et al. Vol. 35. Curran Associates, Inc., 2022, pp. 12608–12618. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/51fda94414996902ddaaa35561b97294-Paper-Conference.pdf.