This is about exploration vs. exploitation, and we focus on multi-armed bandits (MABs). They differ from “full RL” in the following sense:
- There’s only one state in MABs. The action taken does not change the state.
- The goal is online learning to minimize regret (note this is the same thing as maximizing reward). Learning and evaluation are the same thing. You pay a “price” for your exploration.
- In contrast, in full RL, we expect the agent to be terrible and make tons of mistakes (i.e., have massive regret) for millions of steps. This is just the “cost of training.” The final metric is the total reward of the finished, frozen policy when run on new, unseen test scenarios.
Some analogies:
You’ve moved to a new city and you’re trying to find the best restaurant for your daily lunch.
- Every day you don’t eat at the actual best restaurant, you’ve incurred “regret” (a sub-optimal lunch).
- You can’t “train” for a year by eating at 1,000 restaurants and then “deploy” your lunch policy. The act of “training” (trying a new place) is your life. Your goal is to have the best possible set of lunches starting from today. Other classic examples: showing different ads to a user (boring), clinical trials, dynamic pricing.
Why don’t I see it in general RL?
This seems like a good way to balance exploration vs. exploitation when there is no state information, so why are we still using $\epsilon$-greedy? The answer is that methods like UCB don’t scale well: they can’t handle massive or continuous state / action spaces. For more, see the appendix: why UCB doesn’t scale.
Back to definitions
The action value for action $a$ is the expected reward: $Q(a) = \mathbb{E}[r \mid a]$.
The optimal value is $V^* = Q(a^*) = \max_a Q(a)$.
The regret of an action is the gap to the optimum: $\Delta_a = V^* - Q(a)$.
Note there is only one state, so there exists an action $a^*$ that is ALWAYS the best.
We want to minimize the total regret: $L_T = \mathbb{E}\left[ \sum_{t=1}^{T} \left( V^* - Q(a_t) \right) \right]$.
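To make the bookkeeping concrete, here is a minimal Python sketch (the `BernoulliBandit` class, the three arm means, and all names are my own illustration, not from any particular library):

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit: arm a pays 1 with (hidden) probability means[a]."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)  # true Q(a) for each arm
        self.rng = np.random.default_rng(seed)
        self.optimal_value = self.means.max()        # V* = max_a Q(a)
        self.total_regret = 0.0

    def pull(self, a):
        # The regret of this pull is the gap Delta_a = V* - Q(a). We can only
        # track it here because this toy environment knows its own means.
        self.total_regret += self.optimal_value - self.means[a]
        return float(self.rng.random() < self.means[a])  # reward in {0, 1}

# Example: pulling a suboptimal arm accumulates regret even when the reward is 1.
bandit = BernoulliBandit([0.3, 0.5, 0.7])
for _ in range(10):
    bandit.pull(1)                          # always pull arm 1 (Q = 0.5, gap = 0.2)
print(round(bandit.total_regret, 3))        # 10 pulls * gap 0.2 = 2.0
```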
Greedy and epsilon greedy
Greedy is just greedy: always pick $a_t = \arg\max_a \hat{Q}_t(a)$. The $\epsilon$-greedy algorithm:
- With probability $1 - \epsilon$, select the greedy action $a_t = \arg\max_a \hat{Q}_t(a)$
- With probability $\epsilon$, select a random action
- Equivalently: $\pi(a) = \dfrac{\epsilon}{|\mathcal{A}|} + (1 - \epsilon)\, \mathbb{1}\!\left[ a = \arg\max_{a'} \hat{Q}_t(a') \right]$
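A minimal $\epsilon$-greedy loop on top of the bandit sketch above, assuming it exposes `pull(a)` and using the incremental-mean update for $\hat{Q}$ (function and parameter names are mine):

```python
import numpy as np

def epsilon_greedy(bandit, n_arms, steps=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a bandit exposing pull(a) -> reward."""
    rng = np.random.default_rng(seed)
    q_hat = np.zeros(n_arms)   # estimated action values Q_hat(a)
    counts = np.zeros(n_arms)  # N(a): how many times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.integers(n_arms)            # explore: uniformly random arm
        else:
            a = int(np.argmax(q_hat))           # exploit: greedy arm
        r = bandit.pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]  # incremental mean update
    return q_hat, counts
```

For example: `q_hat, counts = epsilon_greedy(BernoulliBandit([0.3, 0.5, 0.7]), n_arms=3)`.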
Policy Gradient
We want to maximize total reward. We can do it by gradient ascent: at each step we make the policy a bit better. Think about it this way: the expected reward $J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[ Q(a) \right]$ is a function of the policy parameters $\theta$; now picture gradient ascent in your head, $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$ (ascent, since it’s reward, not loss).
This expected reward is not the “reward this time,” since the policy can be stochastic. How can we compute the gradient $\nabla_\theta J(\theta)$? We can use sample-based methods to approximate an expectation, but here we have the gradient of an expectation whose distribution $\pi_\theta$ is exactly the thing that depends on $\theta$, so it is not yet in a form we can sample.
Here comes the log-likelihood trick (also known as the REINFORCE trick):
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_a \pi_\theta(a)\, Q(a) = \sum_a \nabla_\theta \pi_\theta(a)\, Q(a) = \sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, Q(a) = \mathbb{E}_{a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a)\, Q(a) \right]$$
- In the first step we expand the expectation into the sum $\sum_a \pi_\theta(a)\, Q(a)$, where $\pi_\theta(a)$ is just the probability of action $a$. Note this is where the $\theta$-dependence enters, through $\pi_\theta(a)$.
- The value $Q(a)$ doesn’t depend on $\theta$, which is great: the gradient only has to go through $\pi_\theta(a)$.
- A small trick makes the whole thing an expectation again: reintroduce $\pi_\theta(a)$ by multiplying and dividing by it, and use $\nabla_\theta \pi_\theta(a) = \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)$.
Hey hey, what changed? It went from the gradient of an expectation to the expectation of a gradient (of $\log \pi_\theta$), and that we can estimate by sampling actions from $\pi_\theta$.
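In the bandit setting this becomes a softmax “gradient bandit”: parametrize $\pi_\theta(a) \propto e^{\theta_a}$, sample an action, and step along $\nabla_\theta \log \pi_\theta(a_t)\, r_t$. A rough sketch, reusing the bandit interface from above (the softmax parametrization and the step size $\alpha$ are my choices, not prescribed by the derivation):

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())   # subtract max for numerical stability
    return z / z.sum()

def gradient_bandit(bandit, n_arms, steps=1000, alpha=0.1, seed=0):
    """REINFORCE on a softmax policy pi_theta(a) = exp(theta_a) / sum_b exp(theta_b)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_arms)          # one preference per arm
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(n_arms, p=pi)  # sample a ~ pi_theta
        r = bandit.pull(a)
        # For a softmax policy, grad_theta log pi_theta(a) = one_hot(a) - pi.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta += alpha * r * grad_log_pi   # ascend: theta += alpha * r * grad log pi(a)
    return theta
```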
Baseline
For any $b$ that does not depend on the action, $\mathbb{E}_{a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a)\, b \right] = b \sum_a \pi_\theta(a)\, \dfrac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)} = b\, \nabla_\theta \sum_a \pi_\theta(a) = b\, \nabla_\theta 1 = 0$.
This means we can subtract a baseline and instead use $\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a)\, \bigl( Q(a) - b \bigr) \right]$, for example with $b$ the average reward seen so far.
Baselines do not change the expected update, but they do change the variance (a well-chosen baseline reduces it).
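A quick Monte Carlo sanity check of that claim (the uniform policy, the Bernoulli arm means, and the sample count below are arbitrary choices of mine): the average of the single-sample gradient estimates should come out the same with and without the baseline, while the variance should be smaller with it.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])      # assumed true Q(a) of a Bernoulli bandit
pi = np.ones(3) / 3                    # a fixed softmax policy (theta = 0, i.e. uniform)

def grad_sample(baseline):
    """One single-sample estimate of grad_theta J, with an optional baseline b."""
    a = rng.choice(3, p=pi)
    r = float(rng.random() < means[a])  # Bernoulli reward
    g = -pi                             # grad log pi_theta(a) = one_hot(a) - pi
    g[a] += 1.0
    return (r - baseline) * g

for b in (0.0, float(means @ pi)):      # no baseline vs. expected-reward baseline
    samples = np.array([grad_sample(b) for _ in range(50_000)])
    print(f"baseline={b:.2f}  mean={np.round(samples.mean(axis=0), 3)}  "
          f"total variance={samples.var(axis=0).sum():.4f}")
```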
UCB
A classic lower bound (due to Lai and Robbins) shows that the total regret of any reasonable algorithm grows at least logarithmically in the number of steps, so logarithmic regret is the best we can aim for.
UCB’s core idea is “optimism in the face of uncertainty”: try actions whose value is still uncertain, even if their current estimate doesn’t look promising.
Recall that in greedy we just pick $a_t = \arg\max_a \hat{Q}_t(a)$. We can add an exploration bonus $U_t(a)$ there that depends on $N_t(a)$, the number of times action $a$ has been selected: the less we have tried an action, the larger the bonus.
If we pick $U_t(a) = \sqrt{\dfrac{2 \log t}{N_t(a)}}$, we can show by Hoeffding’s inequality that this is quite good: with high probability the true $Q(a)$ does not exceed the optimistic estimate $\hat{Q}_t(a) + U_t(a)$.
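To see where that bonus comes from (this is the standard UCB1 argument; the confidence level $t^{-4}$ is a conventional choice): for rewards in $[0, 1]$, Hoeffding’s inequality gives
$$\Pr\left[ Q(a) > \hat{Q}_t(a) + U_t(a) \right] \le e^{-2 N_t(a)\, U_t(a)^2},$$
and setting this failure probability to $t^{-4}$ and solving for the bonus gives
$$e^{-2 N_t(a)\, U_t(a)^2} = t^{-4} \;\Longrightarrow\; U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}.$$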
So the formula (UCB1) is: $a_t = \arg\max_{a} \left[ \hat{Q}_t(a) + \sqrt{\dfrac{2 \log t}{N_t(a)}} \right]$
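A minimal UCB1 sketch in the same style as the earlier snippets (again assuming the `pull(a)` interface; pulling every arm once up front avoids dividing by a zero count):

```python
import numpy as np

def ucb1(bandit, n_arms, steps=1000):
    """UCB1: pick argmax_a [ Q_hat(a) + sqrt(2 * log(t) / N(a)) ]."""
    q_hat = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for t in range(1, steps + 1):
        if t <= n_arms:
            a = t - 1                          # pull each arm once so every N(a) > 0
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(q_hat + bonus))  # optimism in the face of uncertainty
        r = bandit.pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]
    return q_hat, counts
```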
Thompson sampling
Really, our belief about each action’s value should be a distribution, not a point estimate. UCB ignores that and just uses a count-based bonus. Let’s instead model a posterior distribution over the q value of each action. Thompson sampling (Thompson, 1933):
- Sample one value per action from its posterior, $\hat{Q}(a) \sim p\bigl(Q(a) \mid \text{history}\bigr)$
- Select the action maximising the sample, $a_t = \arg\max_a \hat{Q}(a)$
- Thompson sampling is sample-based probability matching: each action gets selected with the (posterior) probability that it is the optimal one
For Bernoulli bandits, we can model each arm with a Beta distribution (the conjugate prior), so the posterior update is just counting successes and failures.
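A Beta-Bernoulli Thompson sampling sketch, assuming the same `pull(a)` interface and starting from a uniform Beta(1, 1) prior on each arm:

```python
import numpy as np

def thompson_bernoulli(bandit, n_arms, steps=1000, seed=0):
    """Thompson sampling for Bernoulli bandits with a Beta(1, 1) prior on each arm."""
    rng = np.random.default_rng(seed)
    successes = np.ones(n_arms)   # Beta alpha parameters (prior: 1 pseudo-success)
    failures = np.ones(n_arms)    # Beta beta parameters  (prior: 1 pseudo-failure)
    for _ in range(steps):
        samples = rng.beta(successes, failures)  # one posterior sample per arm
        a = int(np.argmax(samples))              # act greedily w.r.t. the samples
        r = bandit.pull(a)
        successes[a] += r                        # reward 1 counts as a success
        failures[a] += 1 - r                     # reward 0 counts as a failure
    return successes, failures
```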
Appendix: why UCB doesn’t scale
The short answer: UCB doesn’t work “out of the box” in complex environments.
The UCB algorithm’s action selection is: $a_t = \arg\max_{a} \left[ \hat{Q}_t(a) + \sqrt{\dfrac{2 \log t}{N_t(a)}} \right]$
The most important part of that formula is $N_t(a)$, the “visit count” for that arm. In the MAB problem this is easy: you only have one state, so you just count how many times you’ve pulled each arm.
Now, apply this to a full RL problem like an Atari game:
- The State Space is Massive: The “state” is the pixel data from the screen. The number of possible screen images is astronomical (tens of thousands of pixels, each with many possible colors).
- You Never Visit the Same State Twice: Because the state space is so large (effectively continuous), your agent will never encounter the exact same pixel-perfect state more than once.
- The Count is Always 1 (or 0): If you try to maintain a visit count for every state-action pair, the count for any state you are currently in will be 1. This makes the UCB formula’s exploration bonus explode, and the agent just explores randomly.
So, while $\epsilon$-greedy is “dumb” exploration (it just picks a random action with probability $\epsilon$), it is simple and it scales. It doesn’t care how big the state space is.