Background
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and to optimize its decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize its total value over the period of time considered. [Wikipedia]
Denote the action selected at time step $t$ as $A_t$, and the corresponding reward as $R_t$. The true value of an arbitrary action $a$ is the mean reward when that action is selected, denoted $q_*(a)$. Denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.

Sample-average method
We could use the average reward actually received at time step $t$ to estimate $q_*(a)$:
$$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{I}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{I}_{A_i=a}}$$
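As a minimal sketch of the sample-average estimate above (the function name `sample_average` and the tiny reward history are my own illustration, not from the text):

```python
def sample_average(rewards, actions, a, t):
    """Estimate Q_t(a): the average of the rewards received on the
    time steps prior to t at which action a was selected."""
    total, count = 0.0, 0
    for i in range(t):          # steps 0 .. t-1, i.e. prior to t
        if actions[i] == a:
            total += rewards[i]
            count += 1
    return total / count if count > 0 else 0.0

# Hypothetical history: action 0 earned rewards 1.0 and 0.0, action 1 earned 2.0.
actions = [0, 1, 0]
rewards = [1.0, 2.0, 0.0]
print(sample_average(rewards, actions, 0, 3))  # 0.5
print(sample_average(rewards, actions, 1, 3))  # 2.0
```

Returning 0.0 when an action has never been tried is one common default; the initial estimate is a design choice.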
$\mathbb{I}_{\text{predicate}}$ denotes the random variable that is 1 if the predicate is true and 0 if it is not.
As the time step goes to $\infty$, by the law of large numbers, $Q_t(a) \rightarrow q_*(a)$.
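This convergence can be checked with a quick simulation (my own illustration: a single action with assumed true value 1.5 and Gaussian reward noise):

```python
import random

random.seed(0)
q_star = 1.5                 # assumed true value of the action
total = 0.0
estimates = []
for t in range(1, 100_001):
    reward = random.gauss(q_star, 1.0)   # noisy reward with mean q_star
    total += reward
    estimates.append(total / t)          # running sample average Q_t

# After many pulls the sample average is close to the true value.
print(estimates[-1])  # converges toward q_star
```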
Greedy method and ε-greedy method
At any time step there must be at least one action whose estimated value is greatest; we call such an action a "greedy action". If we choose one of these greedy actions at a time step, we are "exploiting"; otherwise, if we choose a non-greedy action, we are "exploring".
Greedy method
If we always choose an action with the highest estimated value, following the rule $A_t \doteq \underset{a}{\operatorname{argmax}}\, Q_t(a)$, then we call this the greedy method.
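The argmax rule can be sketched as follows (the helper name `greedy_action` is mine; breaking ties at random is one common convention, since several actions may share the highest estimate):

```python
import random

def greedy_action(Q):
    """Select A_t = argmax_a Q_t(a), breaking ties uniformly at random."""
    best = max(Q)
    candidates = [a for a, q in enumerate(Q) if q == best]
    return random.choice(candidates)

Q = [0.2, 1.3, 0.7]
print(greedy_action(Q))  # 1, the index of the highest estimate
```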
ε-greedy method
If we behave greedily most of the time, but every once in a while, say with small probability $\epsilon$, select an action at random from among all actions with equal probability, we call this the ε-greedy method.
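A minimal sketch of ε-greedy selection (the function name and the example estimates are my own illustration):

```python
import random

def epsilon_greedy_action(Q, epsilon):
    """With probability epsilon, explore: pick an action uniformly at
    random. Otherwise, exploit: pick an action with the highest estimate."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                              # explore
    best = max(Q)
    return random.choice([a for a, q in enumerate(Q) if q == best])  # exploit

random.seed(0)
Q = [0.2, 1.3, 0.7]
picks = [epsilon_greedy_action(Q, 0.1) for _ in range(10_000)]
# The greedy action (index 1) is chosen roughly 90% of the time plus its
# share of the random picks; the other actions each appear about epsilon/3.
print(picks.count(1) / len(picks))
```

Note that the exploratory pick is uniform over all actions, so the greedy action can also be chosen during exploration; its overall probability is slightly above $1-\epsilon$.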