Journal of Guangdong University of Technology, 2020, Vol. 37, Issue (05): 51-61. doi: 10.12052/gdutxb.200019

• Comprehensive Studies •

Combined Weighting Method for Short Text Features

Tan You-xin, Teng Shao-hua

  1. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2020-02-02 Online: 2020-09-17 Published: 2020-09-17
  • Corresponding author: Teng Shao-hua (1962-), male, professor, Ph.D.; main research interests: big data, role assignment, network security, collaborative computing, and machine learning. E-mail: Shteng@gdut.edu.cn
  • About the author: Tan You-xin (1995-), male, master's student; main research interests: machine learning and natural language processing
  • Supported by:
    National Natural Science Foundation of China (61972102); Guangdong Provincial Science and Technology Program (2016B010108007, 2019B110210002, 2019B020208001); Department of Education of Guangdong Province project (Yue Jiao Gao Han [2018] No. 179); Guangzhou Science and Technology Program (201802010042, 201802030011, 201802010026, 201903010107)

Abstract: Text sentiment analysis is a typical natural language processing task, but the accuracy of existing methods is limited, and the way words are represented as features is an important cause. This paper proposes a combined weighting method for short text features (CWSTF) to improve sentiment analysis accuracy. CWSTF uses a random forest to evaluate each feature's contribution to sentiment, ranks the features by that contribution, and selects features according to the ranking. It then computes each feature's importance in the document with TF-IDF (Term Frequency-Inverse Document Frequency) and determines the feature's final weight from both its importance in the document and its contribution to sentiment. Finally, comparison experiments are run with four classifiers: SVM (Support Vector Machine), NB (Naive Bayes), ME (Maximum Entropy), and KNN (K-Nearest Neighbor). The experimental results show that features processed by the proposed method yield higher sentiment classification accuracy than features processed by the other methods compared.

Key words: sentiment analysis, feature selection, combined weighting

CLC Number:

  • TP391
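
The weighting pipeline outlined in the abstract (rank features by sentiment contribution, select by rank, then combine with TF-IDF) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two quantities are combined, so the product rule here is an assumption; the contribution scores are placeholder values standing in for random forest output, which this sketch does not train; and the names `tf_idf`, `select_top_k`, and `combined_weights` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def select_top_k(contribution, k):
    """Keep the k features with the highest sentiment contribution."""
    ranked = sorted(contribution, key=contribution.get, reverse=True)
    return set(ranked[:k])


def combined_weights(docs, contribution, k):
    """Weight each selected term by tf-idf * contribution (assumed rule)."""
    selected = select_top_k(contribution, k)
    return [
        {t: w * contribution[t] for t, w in doc_weights.items() if t in selected}
        for doc_weights in tf_idf(docs)
    ]


# Toy tokenized short texts (illustrative only).
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "plot"],
]
# Placeholder contribution scores (assumption): in the paper's setting these
# would come from a random forest, which this sketch does not train.
contribution = {"great": 0.5, "boring": 0.3, "plot": 0.1, "cast": 0.06, "movie": 0.04}

weights = combined_weights(docs, contribution, k=3)
```

Each entry of `weights` is a sparse feature vector for one document, which could then be passed to any of the classifiers named in the abstract. Terms outside the top `k` by contribution are dropped before weighting, so the selection step and the weighting step stay separate.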