Journal of Guangdong University of Technology ›› 2023, Vol. 40 ›› Issue (04): 9-17, 23. doi: 10.12052/gdutxb.220122

A Task-oriented Dialogue Policy Learning Method of Improved Discriminative Deep Dyna-Q

Dai Bin1, Zeng Bi1, Wei Peng-fei1, Huang Yong-jian2

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China;
  2. Guangzhou Xuanyuan Research Institute Co., Ltd., Guangzhou 510000, China
  • Received: 2022-07-20 Online: 2023-07-25 Published: 2023-08-02
  • Corresponding author: Wei Peng-fei (1991–), male, assistant experimentalist; main research interests: task-oriented dialogue systems and reinforcement learning
  • About the author: Dai Bin (1997–), male, master's student; main research interests: natural language processing, task-oriented dialogue systems, and reinforcement learning
  • Funding:
    Key Program of the Joint Funds of the National Natural Science Foundation of China (U21A20478); Natural Science Foundation of Guangdong Province (2019A1515011056); Shunde District Core Technology Research Project (2130218003002)

Abstract: As a key component of a task-oriented dialogue system, the dialogue policy can be trained with the discriminative deep Dyna-Q framework. However, this framework uses the vanilla deep Q-network (DQN) method in the direct reinforcement learning (RL) phase and adopts multilayer perceptrons (MLPs) as the basic structure of the world model, which reduces the training efficiency, performance, and stability of the dialogue policy. In this paper, we propose an improved discriminative deep Dyna-Q method for task-oriented dialogue policy learning. In the improved direct RL phase, we employ NoisyNet to improve the agent's exploration, and we combine the dual-stream architecture of the dueling network, the double Q-network, and $ n $-step bootstrapping to optimize the calculation of the $ Q $ values. For the world model, we design a soft-attention-based model to replace the MLP structure. Experimental results show that the proposed method outperforms the best existing results on three metrics: task success rate, average number of dialogue turns, and average reward. Ablation and robustness analyses further validate the effectiveness of the method.

Key words: task-oriented dialogue system, dialogue policy learning, reinforcement learning, user simulator
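
The abstract names three ingredients for the $ Q $-value calculation (dueling dual-stream architecture, double Q-network, and $ n $-step bootstrapping) without giving formulas. The following is a minimal sketch of how these pieces are commonly combined in the RL literature; it is not taken from the paper. The function names `dueling_q_values` and `n_step_double_q_target`, the mean-subtracted aggregation, and all numeric values are illustrative assumptions. NoisyNet exploration and the soft-attention world model are not covered here.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Assumed dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def n_step_double_q_target(
    rewards: Sequence[float],        # r_t, ..., r_{t+n-1}
    online_q_next: Sequence[float],  # online-network Q values at s_{t+n}
    target_q_next: Sequence[float],  # target-network Q values at s_{t+n}
    gamma: float,
    terminal: bool,
) -> float:
    """Assumed n-step double-Q target (hypothetical helper, not the paper's code)."""
    # Discounted sum of the n observed rewards.
    n_step_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if terminal:
        # No bootstrapping once the dialogue has ended.
        return n_step_return
    # Double Q: the online network selects the action,
    # the target network evaluates it.
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return n_step_return + (gamma ** len(rewards)) * target_q_next[best_action]


# Illustrative numbers only (not from the paper's experiments).
online_q = dueling_q_values(1.0, [0.0, 2.0, 4.0])
target = n_step_double_q_target(
    rewards=[1.0, 2.0, 4.0],
    online_q_next=online_q,
    target_q_next=[10.0, 20.0, 8.0],
    gamma=0.5,
    terminal=False,
)
print(online_q, target)
```

In this sketch the online network picks the third action, so the target bootstraps from the target network's value for that action rather than from its own maximum, which is the usual motivation for double Q-learning.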


  • CLC Number: TP391
