Analysis of a Reinforcement Learning Algorithm Using Self-Organizing Maps


ESANN 2011 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 27-29 April 2011, i6doc.com publ., ISBN 978-2-87419-044-5. Available from http://www.i6doc.com/en/livre/?GCOI=28001100817300.

where t stands for the iteration number, $\alpha(t)$ is the learning rate and $h(t)$ is the neighborhood kernel whose center is located at the BMU. The neighborhood kernel determines which neurons around the BMU are updated and how this update affects each neuron.

One of the SOM applications is the so-called Visual Data Mining, which basically consists of extracting knowledge from those models. A straightforward way to do this is to cluster the map using any clustering algorithm (e.g., K-Means [1]) on the distances between the synaptic weights of the neurons. Afterwards, the different components of the input vectors to the SOM can be analyzed simultaneously, thus revealing relationships among the different features that are used as inputs to the model.

1.2 Reinforcement Learning

RL algorithms are based on the interaction between an agent and its environment, as shown in Fig. 1. The agent is the learner, which interacts with the environment, making decisions according to the observations it makes from it. The environment is every external condition that cannot be modified by the agent [3].

Fig. 1: Characterization of the Reinforcement Learning model: $a_t$ is the action taken by the agent at time t, $s_t$ is the state of the environment at time t ($s_{t+1}$ after action $a_t$), and $r_{t+1}$ denotes the reward at time t+1.

The long-term reward $R_t$ is the sum of all the immediate rewards throughout a complete decision process. It is the objective function that the agent is interested in maximizing by taking the right actions [3]. Its mathematical expression is

$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$    (2)

where $\gamma$ is called the discount rate parameter, which ranges between 0 and 1. This factor determines the present value of future rewards.
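As a concrete illustration of expression (2), the following minimal Python sketch (not part of the original paper) computes the discounted long-term reward for a finite, made-up reward sequence; truncating the sum to finitely many rewards is an assumption made only for the example.

# Minimal sketch of expression (2), truncated to a finite reward sequence.
def discounted_return(rewards, gamma=0.9):
    # Sum gamma^k * r_{t+k+1} over the rewards observed after time t.
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: rewards collected after time t in one episode (made-up values).
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81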
The goal is to maximize $R_t$ by means of an optimal policy, which tells the agent the best action to take for an optimal performance. Therefore, an estimation of the expected $R_t$ resulting from an action $a_t$ taken in a state $s_t$ is required. This estimation is usually called the action-value function:

$Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\}$    (3)

where $Q^{\pi}(s, a)$ is the expected $R_t$ when starting at state s and taking action a, following policy $\pi(s, a)$. Therefore, the optimal policy is given by the following expression [3]:

$\pi^{*}(s, a) = \arg\max_{a \in A} Q^{*}(s, a)$    (4)

where A stands for the set of possible actions. Optimal policies thus share the same optimal action-value function, $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. An iterative and very popular procedure to determine $Q^{*}(s, a)$ is Q-Learning [3]:

$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right]$    (5)

As shown in expression (5), Q-Learning has two parameters to take into account: $\alpha$ is the learning rate, which measures how fast the Q values change, and $\gamma$ measures the relevance of future rewards. Moreover, two other parameters play a relevant role: $\epsilon$ balances the trade-off between a conservative policy (taking the best known action) and the exploration of new actions (taking random actions in order to eventually find an even better one), and $\lambda$ is related to eligibility traces, which also affect the relevance of future rewards.
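Expression (5) translates almost directly into code. The sketch below is not the authors' implementation: it is a generic tabular Q-Learning step with epsilon-greedy action selection, where the state encoding, grid size and parameter values are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    # With probability epsilon explore a random action, otherwise exploit the best known one.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_learning_step(Q, state, action, reward, next_state, alpha, gamma):
    # Expression (5): Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q

# Assumed encoding: 100 states (a 10x10 grid flattened row by row) and 4 actions.
Q = np.zeros((100, 4))
a = epsilon_greedy(Q, state=0, epsilon=0.1)
Q = q_learning_step(Q, state=0, action=a, reward=-1.0, next_state=1, alpha=0.5, gamma=0.9)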
2 Experimental Setup

2.1 Description of the problem

The problems analyzed in this paper were variants of a classical RL problem, the Gridworld, in which there are a start position and a goal position. The agent tries to find the goal position as soon as possible.

In particular, Gridworlds of size 10x10 were considered, and four different variants (Fig. 2) were taken into account. Gridworld 1 was the simplest one, with only a start cell and a goal cell. Gridworld 2 also had some walls (cells which could not be accessed) and mousetraps (cells with negative reward). Gridworld 3 had cells in which the agent could only go one way, and Gridworld 4 had cells in which some actions made the agent go to non-adjacent cells.

Fig. 2: Gridworld environments (panels Map 1 to Map 4): the canonical example and three variants in which an agent has to learn the best behavior.

The agent was guided towards the goal by means of Watkins' Q($\lambda$) TD algorithm [3], which makes use of the parameters $\alpha$, $\gamma$, $\epsilon$ and $\lambda$. After initial tests to set the ranges of the different parameters, five values linearly spaced within each range were tested per parameter, thus leading to 625 different combinations. Every parameter combination was run over 30 experiments of 300 episodes each. The criteria used to evaluate learning performance were based on Kaelbling's criteria [4]:

• Convergence episode: the number of the episode from which the standard deviation of the number of steps per episode falls to a value ten times lower than the maximum standard deviation. A low convergence episode is desirable; however, the results will show that it is difficult to obtain a good convergence together with good values in the other performance measures.

• Consumption: the number of steps (actions) taken within an episode, starting from the convergence episode. Once the algorithm is stable, a low number of steps to reach the goal is desirable.

• Standard deviation (STD): the standard deviation of the number of steps an episode takes, starting from the convergence episode. A stability measure.

• Percentage of Deficient episodes: the percentage of episodes that do not reach convergence before a certain threshold, which is set in order to allow the calculation of the other criteria.

• Percentage of episodes in which the optimal path (Optimum) is reached.

Therefore, the aim was to relate four parameters with five performance measures. The SOM was used to show the relationships among them, as explained in detail in Section 2.2.
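The following sketch is not from the paper; it only illustrates how the five measures listed above could be computed from the per-episode step counts of one run. The rolling window used to estimate the standard deviation and the deficiency threshold are assumptions, since the paper does not specify them.

import numpy as np

def performance_measures(steps_per_episode, optimal_steps, deficient_threshold=1000, window=10):
    # steps_per_episode: number of steps the agent needed in each episode of one run.
    steps = np.asarray(steps_per_episode, dtype=float)

    # Convergence episode: first episode whose rolling standard deviation falls to a value
    # ten times lower than the maximum rolling standard deviation (window size is assumed).
    rolling_std = np.array([steps[i:i + window].std()
                            for i in range(len(steps) - window + 1)])
    below = np.where(rolling_std <= rolling_std.max() / 10.0)[0]
    convergence_episode = int(below[0]) if below.size else len(steps)

    stable = steps[convergence_episode:]
    return {
        "convergence_episode": convergence_episode,
        "consumption": float(stable.mean()) if stable.size else float("nan"),
        "std": float(stable.std()) if stable.size else float("nan"),
        "deficient_pct": 100.0 * float(np.mean(steps > deficient_threshold)),
        "optimum_pct": 100.0 * float(np.mean(steps == optimal_steps)),
    }

# Demo with a synthetic run that settles around an (assumed) optimal path of 18 steps.
rng = np.random.default_rng(1)
demo = np.concatenate([rng.integers(18, 400, size=50), 18 + rng.integers(0, 3, size=250)])
print(performance_measures(demo, optimal_steps=18))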
2.2 SOM visualization of results

Each Gridworld's test data set (all input parameter value combinations and their resulting performance measures) was separately used to train SOMs with different initializations and neighborhood functions. The maps showing the lowest Kaski and Lagus index [2] were finally chosen, since the objective was to find maps that provided an appropriate representation of the data and were easily understandable.

The resulting SOMs for Gridworlds 1, 2 and 3 are presented in Figs. 3, 4 and 5, respectively. The SOM for Gridworld 4 is not presented, since its results were very similar to those of Gridworld 1. A K-Means clustering was carried out over each SOM in order to emphasize the relevant areas of the map and hence make the interpretation of the results easier.
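The paper does not publish its SOM code; the sketch below uses an assumed toolchain (MiniSom and scikit-learn, on random stand-in data) to show the general idea of this section: train several SOMs with different initializations, keep the best one, and cluster its codebook with K-Means. The quantization error is used here as a simple stand-in for the Kaski and Lagus index [2].

import numpy as np
from minisom import MiniSom
from sklearn.cluster import KMeans

# One row per parameter combination: 4 parameters + 5 performance measures.
# Random stand-in data; the paper's measurements are not available here.
data = np.random.rand(625, 9)

# Train several SOMs with different random initializations and keep the one with
# the lowest quantization error (stand-in for the Kaski and Lagus index).
best_som, best_err = None, np.inf
for seed in range(5):
    som = MiniSom(10, 10, data.shape[1], sigma=1.5, learning_rate=0.5, random_seed=seed)
    som.random_weights_init(data)
    som.train_random(data, 5000)
    err = som.quantization_error(data)
    if err < best_err:
        best_som, best_err = som, err

# Cluster the codebook (synaptic weights) with K-Means to segment the map into regions.
codebook = best_som.get_weights().reshape(-1, data.shape[1])
clusters = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(codebook)
print(clusters.reshape(10, 10))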
Fig. 3: SOM for Gridworld 1. The component planes for the different features (the model parameters Alfa, Gamma, Epsilon and Lambda, and the performance measures Optimum, Deficient, Convergence episode (Ep. Convergencia), Consumption and STD), as well as a K-Means clustering of the SOM, are shown.

Fig. 4: SOM for Gridworld 2. The component planes for the different features (model parameters and performance measures), as well as a K-Means clustering of the SOM, are shown.
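The component planes mentioned in the captions above can be rendered directly from a trained SOM codebook. The following matplotlib sketch is not from the paper; the codebook here is random stand-in data and the feature subset is arbitrary.

import numpy as np
import matplotlib.pyplot as plt

def plot_component_planes(weights, feature_names):
    # weights: SOM codebook of shape (rows, cols, n_features); one heatmap per feature.
    fig, axes = plt.subplots(1, len(feature_names),
                             figsize=(3 * len(feature_names), 3), squeeze=False)
    for k, (ax, name) in enumerate(zip(axes.ravel(), feature_names)):
        im = ax.imshow(weights[:, :, k], cmap="viridis")
        ax.set_title(name)
        fig.colorbar(im, ax=ax, shrink=0.8)
    plt.tight_layout()
    plt.show()

# Example with a made-up 10x10 codebook over four of the features.
plot_component_planes(np.random.rand(10, 10, 4), ["Alfa", "Gamma", "Epsilon", "Lambda"])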
Fig. 5: SOM for Gridworld 3. The component planes for the different features (model parameters and performance measures), as well as a K-Means clustering of the SOM, are shown.

Figs. 3, 4 and 5 reveal relevant relationships among the different features. Since Gridworlds 1 and 4 presented a similar behavior, only Gridworld 1 is shown; its differences with respect to Gridworld 2 are especially remarkable, and there are also some differences with respect to Gridworld 3. The similarity among the component planes of Consumption, STD and Deficient episodes in all the cases should be emphasized: only one of these indices would be necessary to explain the others. The component planes also show an inverse relationship between $\alpha$ and Convergence, as well as an inverse relationship between $\gamma$ and Deficient (and hence also Consumption and STD). This kind of analysis can also be done at the cluster level, thus extracting a set of rules about the relationships among the different features. For example, Fig. 3 shows how cluster 7 is related to a good parameter selection, as it corresponds to low Consumption, medium-low STD, a low percentage of Deficient episodes and a high percentage of Optimum episodes.
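The cluster-level reading described above (e.g., "cluster 7 means low Consumption and high Optimum") can be automated by summarizing each cluster of the codebook. The sketch below is illustrative only; the codebook, cluster labels and feature names are made up.

import numpy as np

def summarize_clusters(codebook, clusters, feature_names):
    # Print the mean of each feature per cluster: a rough "rule" describing that map region.
    for c in np.unique(clusters):
        means = codebook[clusters == c].mean(axis=0)
        print(f"cluster {c}: " + ", ".join(
            f"{name}={value:.2f}" for name, value in zip(feature_names, means)))

# Demo with random stand-in data for a 10x10 codebook and its K-Means labels.
rng = np.random.default_rng(0)
summarize_clusters(rng.random((100, 9)), rng.integers(0, 7, size=100),
                   ["Alfa", "Gamma", "Epsilon", "Lambda", "Optimum",
                    "Deficient", "Convergence", "Consumption", "STD"])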
3 Conclusions and further work

This work has presented the SOM as a tool for visual data mining, used to determine the rules that relate the different parameters of a Reinforcement Learning model to its corresponding performance measures. The methodology can be extended to other Machine Learning models, thus helping non-experts to optimize the parameters of a given model.

4 Acknowledgments

This work has been partially supported by the Spanish Ministry of Education and Science under the projects with reference numbers TIN2007-61006 and CSD2007-00018.

References

[1] E. Alpaydin. Introduction to Machine Learning. The MIT Press, 2004.
[2] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Heidelberg, Germany, 2001.
[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[4] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996.
