Background
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and to optimize its decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize its total value over the period of time considered. [Wikipedia]
Denote the action selected at time step $t$ as $A_t$, and the corresponding reward as $R_t$. The true value of an arbitrary action $a$ is the mean reward when that action is selected, denoted $q_*(a)$. Denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.

Sample-average method
We could use the average reward actually received at time step $t$ to estimate $q_*(a)$:
$$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{I}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{I}_{A_i=a}}$$
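As a minimal sketch of the sample-average estimate above (the function name `sample_average` and the tiny reward history are my own illustration, not from the text):

```python
def sample_average(rewards, actions, a, t):
    """Estimate Q_t(a): the average of the rewards received on the
    time steps prior to t at which action a was selected."""
    total, count = 0.0, 0
    for i in range(t):          # steps 0 .. t-1, i.e. prior to t
        if actions[i] == a:
            total += rewards[i]
            count += 1
    return total / count if count > 0 else 0.0

# Hypothetical history: action 0 earned rewards 1.0 and 0.0, action 1 earned 2.0.
actions = [0, 1, 0]
rewards = [1.0, 2.0, 0.0]
print(sample_average(rewards, actions, 0, 3))  # 0.5
print(sample_average(rewards, actions, 1, 3))  # 2.0
```

Returning 0.0 when an action has never been tried is one common default; the initial estimate is a design choice.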
$\mathbb{I}_{\text{predicate}}$ denotes the random variable that is 1 if the predicate is true and 0 if it is not.
As the time step goes to $\infty$, by the law of large numbers, $Q_t(a) \rightarrow q_*(a)$.
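This convergence can be checked with a quick simulation (my own illustration: a single action with assumed true value 1.5 and Gaussian reward noise):

```python
import random

random.seed(0)
q_star = 1.5                 # assumed true value of the action
total = 0.0
estimates = []
for t in range(1, 100_001):
    reward = random.gauss(q_star, 1.0)   # noisy reward with mean q_star
    total += reward
    estimates.append(total / t)          # running sample average Q_t

# After many pulls the sample average is close to the true value.
print(estimates[-1])  # converges toward q_star
```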
Greedy method and ε-greedy method
At any time step there must be at least one action whose estimated value is greatest; we call such an action a "greedy action". If we choose one of these greedy actions at a time step, we are "exploiting"; otherwise, if we choose a non-greedy action, we are "exploring".
Greedy method
If we always choose an action with the highest estimated value, following the rule $A_t \doteq \underset{a}{\operatorname{argmax}}\, Q_t(a)$, then we call this the greedy method.
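The argmax rule can be sketched as follows (the helper name `greedy_action` is mine; breaking ties at random is one common convention, since several actions may share the highest estimate):

```python
import random

def greedy_action(Q):
    """Select A_t = argmax_a Q_t(a), breaking ties uniformly at random."""
    best = max(Q)
    candidates = [a for a, q in enumerate(Q) if q == best]
    return random.choice(candidates)

Q = [0.2, 1.3, 0.7]
print(greedy_action(Q))  # 1, the index of the highest estimate
```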
ε-greedy method
If we behave greedily most of the time, but every once in a while, say with small probability $\epsilon$, select an action at random from among all actions with equal probability, we call this the ε-greedy method.
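A minimal sketch of ε-greedy selection (the function name and the example estimates are my own illustration):

```python
import random

def epsilon_greedy_action(Q, epsilon):
    """With probability epsilon, explore: pick an action uniformly at
    random. Otherwise, exploit: pick an action with the highest estimate."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                              # explore
    best = max(Q)
    return random.choice([a for a, q in enumerate(Q) if q == best])  # exploit

random.seed(0)
Q = [0.2, 1.3, 0.7]
picks = [epsilon_greedy_action(Q, 0.1) for _ in range(10_000)]
# The greedy action (index 1) is chosen roughly 90% of the time plus its
# share of the random picks; the other actions each appear about epsilon/3.
print(picks.count(1) / len(picks))
```

Note that the exploratory pick is uniform over all actions, so the greedy action can also be chosen during exploration; its overall probability is slightly above $1-\epsilon$.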