Journal of Guangdong University of Technology ›› 2020, Vol. 37 ›› Issue (04): 42-50. doi: 10.12052/gdutxb.190129


Multi-granularity Microblog User Interest Portrait Construction Based on NWD Integrated Algorithm

Zhang Shu, Mo Zan, Liu Jian-hua, Yang Pei-chen, Liu Hong-wei

  1. School of Management, Guangdong University of Technology, Guangzhou 510520, China
  • Received: 2019-10-21 Online: 2020-07-11 Published: 2020-07-11
  • Corresponding author: Mo Zan (b. 1962), male, professor and doctoral supervisor; main research interests include e-commerce and data mining. E-mail: mozan@126.com
  • About the author: Zhang Shu (b. 1993), male, master's student; main research interests include machine learning and data mining.
  • Funding:
    National Natural Science Foundation of China (71671048)

Abstract: The distinctive characteristics of microblog text make it difficult to build effective interest portraits of microblog users. To address this, an ensemble algorithm is proposed that combines new word discovery, a bidirectional long short-term memory network, and gradient boosting (NWD-Bi-LSTM-XGBoost). First, to handle the informality of microblog text, a new word discovery (NWD) algorithm based on the notion of support is proposed; it mines the internet slang that is widespread in such text so that word segmentation and semantic understanding become more accurate. Second, the Simhash algorithm is introduced to mitigate the "information overload" in microblog text. Third, to alleviate the feature sparsity caused by the brevity of microblog posts, a bidirectional long short-term memory (Bi-LSTM) model is used to extract semantic features from the posts. Finally, an extreme gradient boosting (XGBoost) model is trained on these semantic features fused with the static features of microblog users, yielding a multi-granularity interest portrait of microblog users. Experimental results show that the coarse-grained (first-level) interest tag model, NWD-Bi-LSTM, reaches a macro-average F1 score (mF1) of 83.6% and an area under the ROC curve (AUC) of 79.7%, while the fine-grained (second-level) interest tag model, NWD-Bi-LSTM-XGBoost, reaches 70.4% and 63.6%, respectively. Compared with the baseline models, integrating the NWD algorithm raises both mF1 and AUC by 3% to 5%, a larger gain than existing new word discovery methods provide.

Key words: new word discovery, bidirectional long short-term memory, extreme gradient boosting (XGBoost), multi-granularity, microblog user interest portrait
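The abstract names Simhash as the step that reduces redundant posts but gives no implementation details. The following is a minimal sketch of generic Simhash near-duplicate filtering, not the authors' code. The function names, the whitespace tokenizer, the unweighted tokens, the 64-bit fingerprint, and the Hamming-distance threshold of 3 are all assumptions made for illustration; real microblog text in Chinese would first need a proper word segmenter.

```python
import hashlib


def simhash(tokens, bits=64):
    """Return a Simhash fingerprint (an int of up to `bits` bits, max 64)."""
    weights = [0] * bits
    for tok in tokens:
        # Stable 64-bit hash per token; MD5 is used only for determinism.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def dedupe(posts, max_distance=3):
    """Keep the first post of each near-duplicate group (assumed threshold)."""
    kept, seen = [], []
    for post in posts:
        fp = simhash(post.split())  # assumption: whitespace tokenization
        if all(hamming(fp, other) > max_distance for other in seen):
            kept.append(post)
            seen.append(fp)
    return kept
```

Posts whose fingerprints differ in only a few bits are treated as near-duplicates, so reposts and lightly edited copies collapse into a single representative. The pairwise comparison here is quadratic in the number of posts and is meant only to show the idea.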

CLC number:

  • TP391
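The abstract reports results as macro-average F1 (mF1). As a reminder of what that metric measures, the sketch below computes the per-label F1 score and averages it with equal weight per label. It assumes a single-label setting where each user has exactly one true and one predicted tag; the function name and the example tags are hypothetical and do not come from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical interest tags for four users.
truth = ["sports", "sports", "music", "music"]
preds = ["sports", "music", "music", "music"]
score = macro_f1(truth, preds)
```

Because every label contributes equally, rare interest tags influence mF1 as much as common ones, which is the usual reason to prefer it over accuracy when tag frequencies are imbalanced.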