Journal of Guangdong University of Technology, 2020, Vol. 37, Issue (04): 42-50. doi: 10.12052/gdutxb.190129
张舒, 莫赞, 柳建华, 杨培琛, 刘洪伟
Zhang Shu, Mo Zan, Liu Jian-hua, Yang Pei-chen, Liu Hong-wei
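Abstract: The particular characteristics of microblog text make it difficult to build interest profiles of microblog users effectively. To address this, an ensemble algorithm is proposed that combines New Word Discovery, a Bidirectional Long Short-term Memory network, and extreme Gradient Boosting (NWD-Bi-LSTM-XGBoost). First, to handle the informality of microblog text, a New Word Discovery (NWD) algorithm based on a support perspective is proposed; it mines the large number of Internet slang terms in the text so that word segmentation and semantic understanding become more accurate. Second, the Simhash algorithm is introduced to mitigate the "information overload" phenomenon in microblog text. Third, to alleviate the feature sparsity caused by the brevity of microblog text, a Bidirectional Long Short-term Memory (Bi-LSTM) model is used to extract the semantic features of posts. Finally, an extreme Gradient Boosting (XGBoost) model is trained on these features fused with the static features of microblog users, so that a multi-granularity interest profile of microblog users is built effectively. Experimental results show that the macro-average F1 score (mF1) and the area under the ROC curve (AUC) of the coarse-grained (first-level) interest label model NWD-Bi-LSTM and the fine-grained (second-level) interest label model NWD-Bi-LSTM-XGBoost reach 83.6%, 79.7% and 70.4%, 63.6%, respectively. Compared with the baseline models, integrating the NWD algorithm raises both mF1 and AUC by 3% to 5%, a larger gain than existing new word discovery methods provide.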
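The Simhash step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the MD5-based token hash, the 64-bit fingerprint width, and the Hamming-distance threshold of 3 are all assumptions chosen for illustration, and the names `simhash`, `hamming`, and `dedupe` are hypothetical. The idea is that near-duplicate posts (for example, reposts) map to fingerprints that differ in only a few bits, so they can be filtered before feature extraction.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Return a Simhash fingerprint for a list of tokens (illustrative sketch)."""
    weights = Counter(tokens)  # token frequency is used as the token weight
    acc = [0] * bits
    for tok, w in weights.items():
        # Assumption: MD5 truncated to `bits` bits serves as the per-token hash.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i, v in enumerate(acc):
        if v > 0:  # a bit is set when the weighted vote for it is positive
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dedupe(posts, tokenize, max_distance=3):
    """Keep a post only if it is farther than max_distance from every kept post."""
    kept, fingerprints = [], []
    for post in posts:
        fp = simhash(tokenize(post))
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(post)
            fingerprints.append(fp)
    return kept


if __name__ == "__main__":
    # Hypothetical whitespace-tokenized posts; real microblog text would first
    # be segmented, e.g. with a dictionary extended by the discovered new words.
    posts = ["new phone launch today", "new phone launch today", "weekend hiking trip photos"]
    print(dedupe(posts, str.split))
```

The linear scan in `dedupe` is quadratic in the number of kept posts; for a large corpus, an index over fingerprint bit blocks would be the usual replacement, but that is outside the scope of this sketch.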