
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning


Part of the book series: The Springer International Series in Engineering and Computer Science (SECS, volume 173)

Abstract

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
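
As an illustration of the kind of weight adjustment the abstract describes, here is a minimal sketch for the simplest case: a single Bernoulli-logistic unit in an immediate-reinforcement task. It uses the general REINFORCE form Δw = α (r − b) e, where the characteristic eligibility e is the derivative of the log-probability of the emitted output with respect to the weight. The task (a two-action bandit), the function name `train_bernoulli_unit`, the reward probabilities, the step size, and the running-average baseline rate are all assumptions chosen for illustration and are not taken from the chapter.

```python
import math
import random


def sigmoid(s):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-s))


def train_bernoulli_unit(reward_prob=(0.2, 0.8), alpha=0.1, steps=5000, seed=0):
    """Train one stochastic unit with a REINFORCE-style update.

    The unit has a single bias weight w (its input is fixed at 1) and
    emits y = 1 with probability p = sigmoid(w), else y = 0.
    reward_prob[y] is the (hypothetical) chance of reward 1 for output y.
    Returns the final probability of emitting y = 1.
    """
    rng = random.Random(seed)
    w = 0.0          # start with p = 0.5
    baseline = 0.0   # running average of past reinforcement

    for _ in range(steps):
        p = sigmoid(w)
        y = 1 if rng.random() < p else 0

        # Immediate reinforcement: depends only on the output just emitted.
        r = 1.0 if rng.random() < reward_prob[y] else 0.0

        # Characteristic eligibility d ln Pr(y) / dw for a
        # Bernoulli-logistic unit with input 1 is (y - p).
        eligibility = y - p

        # The update needs no explicit gradient estimate, only r, b, and e.
        w += alpha * (r - baseline) * eligibility

        # The baseline is updated after the weight change, so it depends
        # only on earlier trials and not on the current output.
        baseline += 0.05 * (r - baseline)

    return sigmoid(w)


if __name__ == "__main__":
    print("final Pr(y = 1):", train_bernoulli_unit())
```

Under these assumptions the expected update is proportional to the gradient of expected reinforcement with respect to w, so the unit should drift toward the output with the higher reward probability. The baseline does not change the expected direction of the update; it is included only because it typically reduces variance.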




Copyright information

© 1992 Springer Science+Business Media New York

About this chapter

Cite this chapter

Williams, R.J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. In: Sutton, R.S. (eds) Reinforcement Learning. The Springer International Series in Engineering and Computer Science, vol 173. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-3618-5_2


  • DOI: https://doi.org/10.1007/978-1-4615-3618-5_2

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-6608-9

  • Online ISBN: 978-1-4615-3618-5

  • eBook Packages: Springer Book Archive
