Systems Engineering and Electronics ›› 2023, Vol. 45 ›› Issue (6): 1755-1761. doi: 10.12305/j.issn.1001-506X.2023.06.19

• Systems Engineering •

Reinforcement learning technology based on asymmetric unobservable state

Xinzhi LI1,*, Shengbo DONG1, Xiangyang CUI2   

  1. Beijing Institute of Remote Sensing Equipment, Beijing 100854, China
    2. State Key Laboratory of Communication Content Cognition, Beijing 100733, China
  • Received: 2021-09-03 Online: 2023-05-25 Published: 2023-06-01
  • Contact: Xinzhi LI
  • About the authors: Xinzhi LI (1990—), female, engineer, Ph.D.; her main research interest is control science and engineering.
    Shengbo DONG (1960—), male, research fellow, Ph.D.; his main research interest is control science and engineering.
    Xiangyang CUI (1987—), male, engineer, M.S.; his main research interest is intelligent information processing.

Abstract:

In real dynamic game scenarios, the two adversaries are characterized by information asymmetry and by different working mechanisms and rules, whereas existing reinforcement learning algorithms fit approximate models by assuming that the state is fully or partially observable. When the adversary's state information is difficult or impossible to obtain accurately, these assumptions no longer hold, so existing reinforcement learning models cannot be applied directly. To solve this problem, a new reinforcement learning framework based on asymmetric unobservable states is proposed, under which an agent can learn online using only value feedback. To verify the feasibility and generality of the proposed framework, three typical reinforcement learning algorithms are transplanted into it, and a game confrontation model is built for comparative verification. The results show that all three algorithms can be successfully applied to dynamic game environments with unobservable states and that their convergence speed is greatly improved, which demonstrates the feasibility and generality of the proposed framework.

Key words: reinforcement learning, dynamic game, asymmetric unobservable state
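
The abstract does not spell out the framework itself, so the following is only a minimal, hypothetical sketch of the underlying idea it describes: an agent that improves its policy online from scalar value feedback alone, without ever observing the adversary's state. The class name, interface, environment assumptions, and hyper-parameters below are invented for illustration and are not taken from the paper.

# Hypothetical sketch only: a tabular learner that updates from scalar value
# feedback and never reads the adversary's (unobservable) state. All names,
# the interface, and the hyper-parameters are illustrative assumptions,
# not the framework or code from the paper.
import random
from collections import defaultdict

class ValueFeedbackAgent:
    """Epsilon-greedy Q-learning over the agent's own observable state only."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate
        self.q = defaultdict(float)  # q[(own_state, action)] -> estimated value

    def act(self, own_state):
        # Explore occasionally; otherwise pick the action with the best estimate.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(own_state, a)])

    def learn(self, own_state, action, value_feedback, next_own_state):
        # The update uses only the scalar value feedback returned by the
        # environment; the opponent's state never appears anywhere.
        best_next = max(self.q[(next_own_state, a)] for a in self.actions)
        target = value_feedback + self.gamma * best_next
        self.q[(own_state, action)] += self.alpha * (target - self.q[(own_state, action)])

This sketch is closest in spirit to Q-learning restricted to the agent's own observable quantities; the specific forms of the three algorithms the paper transplants into its framework are not given on this page.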
