Journal of Guangdong University of Technology ›› 2023, Vol. 40 ›› Issue (04): 9-17, 23. doi: 10.12052/gdutxb.220122

• Computer Science and Technology •

A Task-oriented Dialogue Policy Learning Method of Improved Discriminative Deep Dyna-Q

Dai Bin1, Zeng Bi1, Wei Peng-fei1, Huang Yong-jian2   

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China;
    2. Guangzhou Xuanyuan Research Institute Co., Ltd., Guangzhou 510000, China
  • Received: 2022-07-20  Online: 2023-07-25  Published: 2023-08-02

Abstract: As a pivotal component of the task-oriented dialogue system, the dialogue policy can be trained with the discriminative deep Dyna-Q framework. However, the framework uses the vanilla deep Q-network in the direct reinforcement learning (RL) phase and adopts MLPs as the basic network of the world model, which limits the efficiency and stability of dialogue policy learning. In this paper, we propose an improved discriminative deep Dyna-Q method for task-oriented dialogue policy learning. In the improved direct RL phase, we first employ NoisyNet to improve exploration, and then combine the dual-stream architecture of the Dueling Network with Double Q-learning and n-step bootstrapping to optimize the calculation of the Q values. Moreover, we design a soft-attention-based model to replace the MLPs in the world model. Experimental results show that the proposed method outperforms the baseline models in task success rate, average dialogue turns and average reward. We further validate its effectiveness through ablation and robustness analyses.
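A minimal PyTorch sketch of the direct-RL ingredients named in the abstract may help make them concrete. Everything below is illustrative: the class and function names (NoisyLinear, DuelingQNet, n_step_double_q_target, AttentiveWorldModel), the layer sizes, and the tensor layouts are our own assumptions, not the authors' code.

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyLinear(nn.Module):
        # Factorised-Gaussian noisy layer (NoisyNet): exploration comes from
        # learned parameter noise instead of epsilon-greedy action dithering.
        def __init__(self, in_f, out_f, sigma0=0.5):
            super().__init__()
            self.in_f, self.out_f = in_f, out_f
            self.mu_w = nn.Parameter(torch.empty(out_f, in_f))
            self.sigma_w = nn.Parameter(torch.empty(out_f, in_f))
            self.mu_b = nn.Parameter(torch.empty(out_f))
            self.sigma_b = nn.Parameter(torch.empty(out_f))
            bound = 1.0 / math.sqrt(in_f)
            for p in (self.mu_w, self.mu_b):
                nn.init.uniform_(p, -bound, bound)
            for p in (self.sigma_w, self.sigma_b):
                nn.init.constant_(p, sigma0 * bound)

        @staticmethod
        def _f(x):  # noise-shaping function f(x) = sign(x) * sqrt(|x|)
            return x.sign() * x.abs().sqrt()

        def forward(self, x):
            eps_in = self._f(torch.randn(self.in_f, device=x.device))
            eps_out = self._f(torch.randn(self.out_f, device=x.device))
            w = self.mu_w + self.sigma_w * torch.outer(eps_out, eps_in)
            b = self.mu_b + self.sigma_b * eps_out
            return F.linear(x, w, b)

    class DuelingQNet(nn.Module):
        # Dual-stream head: Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a').
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.body = nn.Sequential(NoisyLinear(state_dim, hidden), nn.ReLU())
            self.value = NoisyLinear(hidden, 1)
            self.adv = NoisyLinear(hidden, n_actions)

        def forward(self, s):
            h = self.body(s)
            a = self.adv(h)
            return self.value(h) + a - a.mean(dim=1, keepdim=True)

Double Q-learning decouples action selection (online network) from action evaluation (target network), which curbs the overestimation bias of the vanilla deep Q-network, and stretching the target over n steps propagates reward information faster. A sketch of that target, again under an assumed data layout:

    def n_step_double_q_target(online, target, rewards, s_n, done, gamma=0.99):
        # y = sum_k gamma^k r_k + gamma^n * Q_target(s_n, argmax_a Q_online(s_n, a))
        # n = len(rewards); `s_n` is the state reached after the n-step rollout.
        g = sum((gamma ** k) * r for k, r in enumerate(rewards))
        if done:
            return torch.full((s_n.size(0),), g)
        with torch.no_grad():
            a_star = online(s_n).argmax(dim=1, keepdim=True)   # select with online net
            q_next = target(s_n).gather(1, a_star).squeeze(1)  # evaluate with target net
        return g + (gamma ** len(rewards)) * q_next

Finally, a sketch of how a soft-attention network could stand in for the MLPs of the world model. Following the Dyna-Q convention, the world model consumes the current dialogue state and the agent action and predicts the simulated user action, the reward, and a termination flag; the slot-level chunking of the state is an assumption for illustration.

    class AttentiveWorldModel(nn.Module):
        # Soft attention over slot-level chunks of the dialogue state,
        # followed by the three heads a Dyna-Q-style world model needs.
        def __init__(self, chunk_dim, act_dim, n_user_actions):
            super().__init__()
            self.score = nn.Linear(chunk_dim, 1)   # additive attention scores
            in_dim = chunk_dim + act_dim           # attention context + agent action
            self.action_head = nn.Linear(in_dim, n_user_actions)
            self.reward_head = nn.Linear(in_dim, 1)
            self.term_head = nn.Linear(in_dim, 1)

        def forward(self, chunks, agent_action):
            # chunks: (B, n_chunks, chunk_dim); agent_action: (B, act_dim) one-hot
            alpha = torch.softmax(self.score(chunks), dim=1)  # (B, n_chunks, 1)
            ctx = (alpha * chunks).sum(dim=1)                 # attention-weighted state
            z = torch.cat([ctx, agent_action], dim=1)
            return (self.action_head(z),                       # user action logits
                    self.reward_head(z).squeeze(1),            # predicted reward
                    torch.sigmoid(self.term_head(z)).squeeze(1))  # P(dialogue ends)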

Key words: task-oriented dialogue system, dialogue policy learning, reinforcement learning, user simulator

CLC Number: TP391

References
[1] ZHANG Z, TAKANOBU R, ZHU Q, et al. Recent advances and challenges in task-oriented dialog systems[J]. Science China Technological Sciences, 2020, 63(10): 2011-2027.
[2] LEVIN E, PIERACCINI R, ECKERT W. Learning dialogue strategies within the Markov decision process framework[C]//1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings. Santa Barbara: IEEE, 1997: 72-79.
[3] YOUNG S, GAŠIĆ M, THOMSON B, et al. POMDP-based statistical spoken dialog systems: a review[J]. Proceedings of the IEEE, 2013, 101(5): 1160-1179.
[4] WAN L P, LAN X G, ZHANG H B, et al. A review of deep reinforcement learning theory and application[J]. Pattern Recognition and Artificial Intelligence, 2019(1): 67-81. (in Chinese)
[5] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[6] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]// Proceedings of the 35th International Conference on Machine Learning. Stockholm: PMLR, 2018: 1861-1870.
[7] SCHATZMANN J, THOMSON B, WEILHAMMER K, et al. Agenda-based user simulation for bootstrapping a POMDP dialogue system[C]// Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics. Rochester: Association for Computational Linguistics, 2007: 149-152.
[8] LI X, LIPTON Z C, DHINGRA B, et al. A user simulator for task-completion dialogues[EB/OL]. arXiv: 1612.05688 (2017-11-13) [2022-03-28]. https://arxiv.org/abs/1612.05688.
[9] PENG B, LI X, GAO J, et al. Deep Dyna-Q: integrating planning for task-completion dialogue policy learning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: Association for Computational Linguistics, 2018: 2182-2192.
[10] SU S Y, LI X, GAO J, et al. Discriminative deep Dyna-Q: robust planning for dialogue policy learning[C]//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels: Association for Computational Linguistics, 2018: 3813-3823.
[11] DHINGRA B, LI L, LI X, et al. Towards end-to-end reinforcement learning of dialogue agents for information access[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver: Association for Computational Linguistics, 2017: 484-495.
[12] SUTTON R S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming[M]//Machine Learning Proceedings 1990. Burlington: Morgan Kaufmann, 1990: 216-224.
[13] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
[14] HAKKANI-TÜR D, TÜR G, CELIKYILMAZ A, et al. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM[C]//Interspeech 2016. San Francisco: ISCA, 2016: 715-719.
[15] MRKŠIĆ N, SÉAGHDHA D Ó, WEN T H, et al. Neural belief tracker: data-driven dialogue state tracking[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Vancouver: Association for Computational Linguistics, 2017: 1777-1788.
[16] WEN T H, GASIC M, MRKŠIĆ N, et al. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon: Association for Computational Linguistics, 2015: 1711-1721.
[17] FORTUNATO M, AZAR M G, PIOT B, et al. Noisy networks for exploration[C]//International Conference on Learning Representations. Vancouver: ICLR, 2018: 1-21.
[18] WANG Z, SCHAUL T, HESSEL M, et al. Dueling network architectures for deep reinforcement learning[C]// Proceedings of The 33rd International Conference on Machine Learning. New York: PMLR, 2016: 1995-2003.
[19] VAN HASSELT H, GUEZ A, SILVER D. Deep reinforcement learning with double Q-learning[C]//Proceedings of the 30th AAAI Conference on Artificial Intelligence. Phoenix: AAAI, 2016: 2094-2100.
[20] SUTTON R S. Learning to predict by the methods of temporal differences[J]. Machine Learning, 1988, 3(1): 9-44.
[21] ZHANG Y, YU X, CUI Z, et al. Every document owns its structure: inductive text classification via graph neural networks[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020: 334-339.
[22] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[23] LI X, CHEN Y N, LI L, et al. End-to-end task-completion neural dialogue systems[C]//Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Taipei: IJCNLP, 2017: 733-743.
Related Articles
[1] Su Tian-ci, He Zi-nan, Cui Miao, Zhang Guang-chi. Intelligent Path Planning Algorithm for Multi-UAV-assisted Data Collection Systems [J]. Journal of Guangdong University of Technology, 2023, 40(04): 77-84.
[2] He Yi-shan, Wang Yong-hua, Wan Pin, Wang Lei, Wu Wen-tao. An Improved Double Deep Q Network for Multi-user Dynamic Spectrum Access [J]. Journal of Guangdong University of Technology, 2023, 40(04): 85-93.
[3] Chen Ci, Xie Li-hua. A Data-Driven Prescribed Convergence Rate Design for Robust Tracking of Discrete-Time Systems [J]. Journal of Guangdong University of Technology, 2021, 38(06): 29-34.
[4] Li Ming-lei, Zhang Yang, Kang Jia-wen, Xu Min-rui, Dusit Niyato. Multi-Agent Reinforcement Learning for Secure Data Sharing in Blockchain-Empowered Vehicular Networks [J]. Journal of Guangdong University of Technology, 2021, 38(06): 62-69.
[5] Guo Xin-de, Chris Hong-qiang Ding. An AGV Path Planning Method for Discrete Manufacturing Smart Factory [J]. Journal of Guangdong University of Technology, 2021, 38(06): 70-76.
[6] Zheng Si-yuan, Cui Miao, Zhang Guang-chi. Reinforcement Learning-Based Online Trajectory Optimization for Secure UAV Communications [J]. Journal of Guangdong University of Technology, 2021, 38(04): 59-64.
[7] Ye Wei-jie, Gao Jun-li, Jiang Feng, Guo Jing. A Research on a Training Model to Improve the Development Efficiency of Robot Reinforcement Learning [J]. Journal of Guangdong University of Technology, 2020, 37(05): 46-50.
[8] Wu Yun-xiong, Zeng Bi. Trajectory Tracking and Dynamic Obstacle Avoidance of Mobile Robot Based on Deep Reinforcement Learning [J]. Journal of Guangdong University of Technology, 2019, 36(01): 42-50.