Journal of Guangdong University of Technology ›› 2023, Vol. 40 ›› Issue (04): 9-17, 23. DOI: 10.12052/gdutxb.220122

• Computer Science and Technology •

A Task-oriented Dialogue Policy Learning Method of Improved Discriminative Deep Dyna-Q

Dai Bin1, Zeng Bi1, Wei Peng-fei1, Huang Yong-jian2

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China;
    2. Guangzhou Xuanyuan Research Institute Co., Ltd., Guangzhou 510000, China
  • Received: 2022-07-20  Online: 2023-07-25  Published: 2023-08-02
  • Corresponding author: Wei Peng-fei (born 1991), male, assistant experimentalist; research interests: task-oriented dialogue systems and reinforcement learning. E-mail: wpf@mail2.gdut.edu.cn
  • About the author: Dai Bin (born 1997), male, master's student; research interests: natural language processing, task-oriented dialogue systems and reinforcement learning
  • Funding:
    Key Program of the Joint Funds of the National Natural Science Foundation of China (U21A20478); Natural Science Foundation of Guangdong Province (2019A1515011056); Shunde District Core Technology Research Project (2130218003002)

Abstract: As a pivotal component of a task-oriented dialogue system, the dialogue policy can be trained with the discriminative deep Dyna-Q framework. However, that framework uses the vanilla deep Q-network in the direct reinforcement learning phase and a multilayer perceptron as the basic structure of the world model, which limits the training efficiency, performance and stability of the dialogue policy. In this paper, we propose an improved discriminative deep Dyna-Q method for task-oriented dialogue policy learning. In the improved direct reinforcement learning phase, we employ NoisyNet to improve the agent's exploration, and combine the dual-stream dueling architecture, double Q-learning and n-step bootstrapping to optimize the calculation of the Q values. For the world model, we design a soft-attention-based model to replace the multilayer perceptron. Experimental results show that the proposed method outperforms the previous best results in terms of task success rate, average dialogue turns and average reward, and ablation and robustness analyses further validate its effectiveness.
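To make the combination described above concrete, the following is a minimal PyTorch-style sketch of a dueling Q-network together with an n-step double-Q TD target. It is an illustration under our own assumptions (class names, layer sizes and hyperparameters are placeholders), not the authors' implementation; in the paper, NoisyNet layers would additionally replace the ordinary linear layers to drive exploration.

```python
# Illustrative sketch only (not the authors' code): a dueling Q-network and an
# n-step double-Q TD target, the ingredients combined in the improved direct
# reinforcement learning phase. State/action sizes are placeholders; NoisyNet
# layers would replace nn.Linear in the paper's setting.
import torch
import torch.nn as nn


class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                # state-value stream V(s)
        self.advantage = nn.Linear(hidden, num_actions)  # advantage stream A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.feature(state)
        v, a = self.value(h), self.advantage(h)
        # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return v + a - a.mean(dim=-1, keepdim=True)


def n_step_double_q_target(online: DuelingQNet, target: DuelingQNet,
                           rewards: torch.Tensor,     # [B, n]  r_t ... r_{t+n-1}
                           next_state: torch.Tensor,  # [B, state_dim]  s_{t+n}
                           done: torch.Tensor,        # [B]  1.0 if the dialogue ended
                           gamma: float = 0.99) -> torch.Tensor:
    """n-step return plus a double-Q bootstrap: the online network selects the
    action, the target network evaluates it."""
    n = rewards.shape[1]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype, device=rewards.device)
    n_step_return = (rewards * discounts).sum(dim=1)
    with torch.no_grad():
        best_action = online(next_state).argmax(dim=1, keepdim=True)
        bootstrap = target(next_state).gather(1, best_action).squeeze(1)
    return n_step_return + (gamma ** n) * (1.0 - done) * bootstrap
```

The temporal-difference error between this target and the online network's Q-value for the executed action would then be minimized as usual; during planning, the soft-attention world model described in the abstract would supply the simulated experience.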

Key words: task-oriented dialogue system, dialogue policy learning, reinforcement learning, user simulator

CLC number: TP391
[1] ZHANG Z, TAKANOBU R, ZHU Q, et al. Recent advances and challenges in task-oriented dialog systems[J]. Science China Technological Sciences, 2020, 63(10): 2011-2027.
[2] LEVIN E, PIERACCINI R, ECKERT W. Learning dialogue strategies within the Markov decision process framework[C]//1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings. Santa Barbara: IEEE, 1997: 72-79.
[3] YOUNG S, GAŠIĆ M, THOMSON B, et al. POMDP-based statistical spoken dialog systems: a review[J]. Proceedings of the IEEE, 2013, 101(5): 1160-1179.
[4] 万里鹏, 兰旭光, 张翰博, 等. 深度强化学习理论及其应用综述[J]. 模式识别与人工智能, 2019(1): 67-81. WAN L P, LAN X G, ZHANG H B, et al. A review of deep reinforcement learning theory and application[J]. Pattern Recognition and Artificial Intelligence, 2019(1): 67-81.
[5] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[6] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]// Proceedings of the 35th International Conference on Machine Learning. Stockholm: PMLR, 2018: 1861-1870.
[7] SCHATZMANN J, THOMSON B, WEILHAMMER K, et al. Agenda-based user simulation for bootstrapping a POMDP dialogue system[C]// Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics. Rochester: Association for Computational Linguistics, 2007: 149-152.
[8] LI X, LIPTON Z C, DHINGRA B, et al. A user simulator for task-completion dialogues[EB/OL]. arXiv: 1612.05688 (2017-11-13) [2022-03-28]. https://arxiv.org/abs/1612.05688.
[9] PENG B, LI X, GAO J, et al. Deep Dyna-Q: integrating planning for task-completion dialogue policy learning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: Association for Computational Linguistics, 2018: 2182-2192.
[10] SU S Y, LI X, GAO J, et al. Discriminative deep Dyna-Q: robust planning for dialogue policy learning[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels: Association for Computational Linguistics, 2018: 3813-3823.
[11] DHINGRA B, LI L, LI X, et al. Towards end-to-end reinforcement learning of dialogue agents for information access[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver: Association for Computational Linguistics, 2017: 484-495.
[12] SUTTON R S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming[M]//Machine Learning Proceedings 1990. Burlington: Morgan Kaufmann, 1990: 216-224.
[13] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
[14] HAKKANI-TÜR D, TÜR G, CELIKYILMAZ A, et al. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM[C]//Interspeech 2016. San Francisco: ISCA, 2016: 715-719.
[15] MRKŠIĆ N, SÉAGHDHA D Ó, WEN T H, et al. Neural belief tracker: data-driven dialogue state tracking[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver: Association for Computational Linguistics, 2017: 1777-1788.
[16] WEN T H, GASIC M, MRKŠIĆ N, et al. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon: Association for Computational Linguistics, 2015: 1711-1721.
[17] FORTUNATO M, AZAR M G, PIOT B, et al. Noisy networks for exploration[C]//International Conference on Learning Representations. Vancouver: ICLR, 2018: 1-21.
[18] WANG Z, SCHAUL T, HESSEL M, et al. Dueling network architectures for deep reinforcement learning[C]// Proceedings of The 33rd International Conference on Machine Learning. New York: PMLR, 2016: 1995-2003.
[19] VAN HASSELT H, GUEZ A, SILVER D. Deep reinforcement learning with double Q-learning[C]//Proceedings of the 30th AAAI Conference on Artificial Intelligence. New York: AAAI, 2016: 2094-2100.
[20] SUTTON R S. Learning to predict by the methods of temporal differences[J]. Machine Learning, 1988, 3(1): 9-44.
[21] ZHANG Y, YU X, CUI Z, et al. Every document owns its structure: inductive text classification via graph neural networks[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020: 334-339.
[22] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[23] LI X, CHEN Y N, LI L, et al. End-to-end task-completion neural dialogue systems[C]//Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Taipei: IJCNLP, 2017: 733-743.