Journal of Guangdong University of Technology, 2020, Vol. 37, Issue (04): 42-50. doi: 10.12052/gdutxb.190129

Multi-granularity Microblog User Interest Portrait Construction Based on NWD Integrated Algorithm

Zhang Shu, Mo Zan, Liu Jian-hua, Yang Pei-chen, Liu Hong-wei   

  1. School of Management, Guangdong University of Technology, Guangzhou 510520, China
  • Received: 2019-10-21 Online: 2020-07-11 Published: 2020-07-11

Abstract: The characteristics of microblog text make it difficult to build interest portraits of microblog users. To address this problem, an ensemble algorithm, NWD-Bi-LSTM-XGBoost, is proposed. First, a new word discovery (NWD) algorithm based on support is proposed to handle the informality of microblog text; it identifies the Internet phrases that pervade such text, enabling more accurate word segmentation and semantic understanding. Next, the Simhash algorithm is introduced to mitigate the information overload of microblog text. To alleviate the feature sparsity caused by the brevity of microblog posts, bidirectional long short-term memory (Bi-LSTM) networks are used to extract semantic features. Finally, an XGBoost model is trained on the static features of microblog users combined with the semantic features of their posts, so that the multi-granularity user interest portrait can be constructed efficiently. Experimental results show that the macro-average F1 score and AUC of the coarse-granularity (primary) interest tag model reach 83.6% and 79.7%, and those of the fine-granularity (secondary) interest tag model reach 70.4% and 63.6%, respectively. Compared with the benchmark models, integrating the NWD algorithm raises the macro-average F1 score and AUC by 3% to 5%, indicating that it outperforms existing new word discovery methods.

Key words: new word discovery, bidirectional long short-term memory, extreme gradient boosting, multi-granularity, microblog user interest portrait

CLC Number: TP391
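
The abstract names Simhash as the step that reduces information overload but gives no details of how it is applied. The following is a minimal sketch of the generic Simhash fingerprinting technique for spotting near-duplicate posts, not the authors' implementation. The `tokenize` helper, the use of MD5 as the token hash, the 64-bit fingerprint width, and the `max_distance` threshold of 3 are all assumptions made for illustration; real microblog text in Chinese would need a proper word segmenter in place of the regular expression.

```python
import hashlib
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer: lowercase runs of word characters.
    # Chinese text would need a real word segmenter here.
    return re.findall(r"\w+", text.lower())


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide Simhash fingerprint of `text`.

    `bits` must be a multiple of 8 and at most 128, because the
    token hash is taken from the leading bytes of an MD5 digest.
    """
    weights = Counter(tokenize(text))  # token frequency as weight
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the token hash has a 1 bit,
            # subtract it where the token hash has a 0 bit.
            if (token_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    # Assumed threshold: posts whose fingerprints differ in at most
    # `max_distance` bits are treated as near-duplicates.
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

Under these assumptions, a post would be dropped when `is_near_duplicate` returns `True` against a post already kept, so that reposts and lightly edited copies do not dominate a user's text before feature extraction.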