The Indoor-Training Effect: Unexpected Gains from Distribution Shifts in the Transition Function [video]
Date Posted:
January 29, 2025
Date Recorded:
January 29, 2025
CBMM Speaker(s):
Spandan Madan
Speaker(s):
Serena Bono
Description:
Authors Serena Bono (MIT Media Lab) and Spandan Madan (Harvard University) describe their latest paper and findings on training reinforcement learning models and how well those models generalize when tested in new environments.
[AUDIO LOGO] [MUSIC PLAYING]
SERENA BONO: Hi, my name is Serena, and I am currently a PhD student at the Media Lab. I've always been very interested in reinforcement learning, and in the summer of 2022 I was attending the CBMM summer school, which is where I met Spandan.
SPANDAN MADAN: I was one of the TAs at the CBMM summer school. At the time, we just started talking a lot about the intersection of AI and neuroscience. My PhD focused a lot on this intersection, so we talked about how we could bridge generalization and reinforcement learning, which were the things we were both interested in. And we came up with this project.
SERENA BONO: During the project, we wanted to investigate generalization.
SPANDAN MADAN: One of the best ways to understand generalization is through one of the hallmarks of human intelligence: we adapt very effortlessly. If you look behind me, that's a building. You've probably never seen that particular building before, but you've seen many buildings, and you can generalize, or adapt, based on that knowledge and tell that it's a building. This is something that comes very easily to humans. We can very easily adapt from a few examples to a general concept. But it's a big struggle for modern AI.
SERENA BONO: The idea is that when we train an agent, we usually train it in a simulation environment, and then we deploy it in the real world. And what usually happens is that we have a generalization gap.
SPANDAN MADAN: So the generalization gap refers to the drop in the performance of an agent when it is trained and tested in different environments.
SERENA BONO: It's even more interesting that this generalization gap is much smaller for humans, and this is what we wanted to investigate in this project. In order to investigate generalization, we created different variations of environments, using a very common benchmark, the Atari games. For instance, we took Pac-Man and created different variations of it by adding noise to what is called the transition function.
SPANDAN MADAN: So you can think about two different versions of the Pac-Man game. In one version, the ghost mostly goes left; in the other, the ghost moves left or right with equal probability. What differs between these two games is their transition functions.
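[EDITOR'S NOTE: To make the idea concrete, here is a minimal sketch, not the authors' code, of how a transition function can be perturbed in a Gymnasium-style Atari environment. The wrapper name, the noise level, and the choice to inject noise by occasionally substituting a random action are illustrative assumptions; in the paper, the noise is applied to the ghosts' movement distribution itself.]

import random
import gymnasium as gym  # assumes gymnasium plus the Atari extras (ale-py) are installed

class NoisyTransitionWrapper(gym.Wrapper):
    # Perturbs the transition function P(s' | s, a): with probability `noise`,
    # the chosen action is replaced by a random one, so the same (state, action)
    # pair can lead to different next states.
    def __init__(self, env, noise=0.1):
        super().__init__(env)
        self.noise = noise

    def step(self, action):
        if random.random() < self.noise:
            action = self.action_space.sample()
        return self.env.step(action)

# Two versions of the "same" game that differ only in their transition function.
clean_env = gym.make("ALE/MsPacman-v5")
noisy_env = NoisyTransitionWrapper(gym.make("ALE/MsPacman-v5"), noise=0.25)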
SERENA BONO: What people thought for the longest time was that training and testing on the same environment would yield the best outcome. But picture it this way: if you were a tennis player preparing for a competition, would you rather practice your fundamentals in an indoor, non-windy environment, or in an outdoor, windy one?
The tennis pro would probably prefer to train in the non-windy environment, and this is exactly what we found in our study. Training in an environment that is not noisy actually helps the agent perform better in the noisy environment. And we found this across many different variations of Pac-Man. One possible variation is for the ghosts to move mostly west; another is for the ghosts to move randomly, et cetera.
And we found this across many different Atari games-- for instance, Pong and Breakout. We also found a case where it doesn't happen: when the agent does not explore the same state spaces in the training and the testing environments. And there is a pretty intuitive explanation for why that is. When our pro tennis player is training for the competition, he should practice both the forehand and the backhand in order to win.
So what happens if our pro tennis player only practices the forehand in training? Well, he probably would have been better off practicing his backhand in the outdoor, windy environment. And this is exactly what we found. When the state spaces explored are the same, it's actually better to train in the non-noisy environment and test in the noisier one. But when the state spaces are different, we're better off training and testing in the noisy environment.
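[EDITOR'S NOTE: As a rough sketch of the comparison being described, and continuing from the wrapper sketch above, the protocol boils down to training one agent in the clean environment and one in the noisy environment, then scoring both in the noisy environment. The evaluate helper below is hypothetical, and the random policy is only a stand-in for an agent trained with a standard algorithm such as DQN; this is not the paper's code.]

import numpy as np

def evaluate(policy, env, episodes=20):
    # Average episodic return of `policy` in `env`.
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))

# Stand-in policy; in the study, each agent would first be trained (e.g. with DQN)
# in either clean_env ("indoor") or noisy_env ("outdoor") before this comparison.
random_policy = lambda obs: noisy_env.action_space.sample()

# The "indoor-training effect": an agent trained in clean_env can outscore an
# agent trained in noisy_env when both are evaluated in noisy_env.
print("score when tested in the noisy environment:", evaluate(random_policy, noisy_env))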
SPANDAN MADAN: So zooming out to the big picture, generalization is one of the Achilles' heels of machine learning, and also of reinforcement learning. There's obviously a lot of work left to do on understanding why models don't generalize and when they do. From here, I think the biggest impact for the field would be to see whether this is something that can be harnessed, and whether it's possible to design such environments which are easier, which are, so to say, non-windy, where agents can learn the fundamentals and learn much better.
So over the past couple of decades, a lot of research in reinforcement learning has focused on these very standardized Atari benchmarks. And a lot of progress has come through advancing algorithms and producing new ways of training agents. One of the big findings, or impacts, of this work is that we did not change the algorithm. Instead, we changed the data itself.
So that raises a very important question for AI, which is: what's the right data to train on? We showed that you can train in an environment that is different from the test environment and still perform much better on the test environment. But we still don't have a way of engineering these environments. So we hope that, going forward, researchers can find a way to engineer such environments that are optimal for training.
[MUSIC PLAYING]