Journal of Guangdong University of Technology, 2018, Vol. 35, Issue (03): 37-42. doi: 10.12052/gdutxb.180016

• Comprehensive Studies •

Model Optimization of Financial Corpus Sentiment Analysis Based on THUCTC

Rao Dong-ning, Huang Si-hong

  1. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2018-01-01 Online: 2018-05-09 Published: 2018-04-26
  • Corresponding author: Huang Si-hong (b. 1993), male, master's student; main research interest: financial intelligence. E-mail: 584654848@qq.com
  • About the author: Rao Dong-ning (b. 1977), male, associate professor, Ph.D.; main research interests: financial intelligence and intelligent planning.
  • Funding:
    Natural Science Foundation of Guangdong Province (2016A030313084, 2016A030313700)

Abstract: Sentiment analysis has attracted growing interest in recent years. In financial applications, it can serve as a reference for investors before they invest. However, existing approaches tend to be too application-specific, suffer from data bias, or produce results that are too general and insufficiently precise. This paper therefore optimizes a general-purpose Chinese text classifier for sentiment analysis of online reviews and stock news. A corpus of 20,000 items was collected, and each item was labeled independently by three annotators to serve as ground truth. THUCTC was then optimized in three respects. First, for word segmentation, a stemming-dictionary method was combined with different segmentation schemes, and experimental comparison showed that the 2-gram scheme gave the best results. Second, the best kernel was selected for the classifier: the Liblinear kernel was found to suit investors who need timely results, whereas the Libsvm kernel has the advantage in accuracy. Third, a finance-oriented sentiment dictionary was built using the Chi-square and TF-IDF methods; it can be used with off-the-shelf general text classifiers. In this way, the results can be generalized without loss of accuracy.

Key words: sentiment analysis, text categorization, stock price trend prediction, Chinese word segmentation
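
To illustrate two of the ideas named in the abstract, 2-gram tokenization and Chi-square term scoring, the following is a minimal Python sketch. It is not THUCTC's implementation or the authors' code: the function names, the toy sentences, and the labels are hypothetical, and the TF-IDF weighting, the stemming dictionary, and classifier training are all omitted.

```python
from collections import Counter


def char_bigrams(text):
    """Return the overlapping character 2-grams of a string, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def chi_square_scores(docs, labels, target):
    """Score each bigram by its chi-square association with the `target` label.

    Counts are document-level presence/absence in a 2x2 contingency table.
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df_all = Counter()     # number of documents containing the term
    df_target = Counter()  # number of target-class documents containing the term
    for text, y in zip(docs, labels):
        terms = set(char_bigrams(text))
        df_all.update(terms)
        if y == target:
            df_target.update(terms)

    scores = {}
    for term, df in df_all.items():
        a = df_target[term]       # target class, term present
        b = df - a                # other classes, term present
        c = n_target - a          # target class, term absent
        d = n - n_target - b      # other classes, term absent
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Hypothetical toy corpus (not taken from the paper's data).
docs = ["股价大涨", "业绩大涨", "股价下跌", "业绩下跌"]
labels = ["pos", "pos", "neg", "neg"]
scores = chi_square_scores(docs, labels, target="pos")

# Print the bigrams from most to least class-dependent.
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

In this toy corpus, bigrams that occur in only one class (such as 大涨 and 下跌) receive the highest scores, while bigrams spread evenly across both classes (such as 股价) score zero. The chi-square statistic alone does not indicate polarity, so a dictionary built this way would also need the sign of `a * d - b * c` or a comparable check to tell positive terms from negative ones.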

CLC Number:

  • TP181