2023 journal article

Blind Post-Decision State-Based Reinforcement Learning for Intelligent IoT

IEEE INTERNET OF THINGS JOURNAL, 10(12), 10605–10620.

By: J. Zhang*, X. He* & H. Dai

author keywords: Learning speed; post-decision state (PDS); reinforcement learning (RL); two-timescale stochastic approximation
TL;DR: A novel blind PDS (b-PDS) learning algorithm is proposed in this work by leveraging the generic two-timescale stochastic approximation framework; it achieves a similar improvement in learning speed as conventional PDS learning without requiring prior information. (via Semantic Scholar)
UN Sustainable Development Goal Categories
16. Peace, Justice and Strong Institutions (OpenAlex)
Source: Web Of Science
Added: July 3, 2023

Recent years have witnessed renewed interest in reinforcement learning (RL) due to the rapid growth of the Internet of Things (IoT) and its associated demands for intelligent information processing and decision making. Since slow learning speed is one of the major stumbling blocks of classic RL algorithms, substantial effort has been devoted to developing faster RL algorithms. Among them, post-decision state (PDS) learning is a prominent one: it can often improve the learning speed by orders of magnitude by exploiting the structural properties of the underlying Markov decision processes (MDPs). However, conventional PDS learning requires prior knowledge of the PDS transition probabilities, which may not always be available in practice. To lift this limitation, a novel blind PDS (b-PDS) learning algorithm is proposed in this work by leveraging the generic two-timescale stochastic approximation framework. By introducing an additional procedure that estimates the PDS transition probabilities, b-PDS learning achieves a similar improvement in learning speed as conventional PDS learning while eliminating the need for prior information. In addition, by analyzing the globally asymptotically stable equilibrium of the corresponding ordinary differential equation (o.d.e.), the convergence and optimality of b-PDS learning are established. Moreover, extensive simulation results are provided to validate the effectiveness of the proposed algorithm. On the random MDPs considered, reaching 90% of the best achievable time-average reward with b-PDS learning takes 70% less learning time than Q-learning and 30% less than Dyna.
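
The abstract does not give the update equations, but the general idea it describes can be sketched. The Python snippet below is a minimal illustrative sketch, not the authors' algorithm: it assumes the usual PDS factorization in which a state-action pair (s, a) reaches a post-decision state through an action-dependent kernel and then reaches the next state through an action-independent kernel, and it assumes the transition-probability estimate sits on the fast timescale while the PDS value update sits on the slow one. The toy MDP, variable names, and step-size schedules are all assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy MDP with PDS structure (hypothetical, for illustration only):
# (s, a) --controlled part--> post-decision state --environment noise--> next state
S, A, P = 6, 3, 6                                   # states, actions, post-decision states
p_known = rng.dirichlet(np.ones(P), size=(S, A))    # PDS kernel; the "blind" agent does NOT know it
p_unknown = rng.dirichlet(np.ones(S), size=P)       # action-independent noise kernel
reward = rng.uniform(0.0, 1.0, size=(S, A))         # immediate reward r(s, a)

gamma, eps, T = 0.9, 0.1, 20000

V = np.zeros(P)                        # PDS value function (slow timescale)
P_hat = np.full((S, A, P), 1.0 / P)    # online estimate of the PDS transition probabilities (fast timescale)
visits = np.zeros((S, A))

s = int(rng.integers(S))
for t in range(T):
    # Epsilon-greedy action w.r.t. the estimated PDS kernel and current PDS values
    q = reward[s] + gamma * P_hat[s] @ V                        # shape (A,)
    a = int(rng.integers(A)) if rng.random() < eps else int(np.argmax(q))

    # Environment step: sample the post-decision state, then the next state
    pds = rng.choice(P, p=p_known[s, a])
    s_next = rng.choice(S, p=p_unknown[pds])

    # Fast timescale: track the empirical PDS transition distribution for (s, a)
    visits[s, a] += 1.0
    beta = visits[s, a] ** -0.6                                 # slower-decaying (larger) step size
    one_hot = np.zeros(P)
    one_hot[pds] = 1.0
    P_hat[s, a] += beta * (one_hot - P_hat[s, a])

    # Slow timescale: TD-style update of the PDS value function
    alpha = (t + 1) ** -0.9                                     # faster-decaying (smaller) step size
    target = np.max(reward[s_next] + gamma * P_hat[s_next] @ V)
    V[pds] += alpha * (target - V[pds])

    s = s_next

# Greedy policy recovered from the learned PDS values and the estimated kernel
policy = np.argmax(reward + gamma * P_hat @ V, axis=1)
print("greedy actions per state:", policy)

In this sketch, the estimate P_hat stands in for the prior knowledge that conventional PDS learning assumes, and the step-size separation lets the value update see a nearly converged kernel estimate, which is the intuition behind two-timescale stochastic approximation.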