Journal of Guangdong University of Technology ›› 2016, Vol. 33 ›› Issue (05): 49-53. doi: 10.3969/j.issn.1007-7162.2016.05.009

• Comprehensive Studies •

A Research on Text Classification Method Based on Improved TF-IDF Algorithm

He Ke-da, Zhu Zheng-tao, Cheng Yu

  1. School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2015-09-22 Online: 2016-09-10 Published: 2016-09-10
  • Corresponding author: Zhu Zheng-tao (b. 1967), male, associate professor, Ph.D.; main research interest: computer vision inspection technology. E-mail: 511972136@qq.com
  • About the author: He Ke-da (b. 1989), male, master's student; main research interest: data and text mining.
  • Supported by:

    National Natural Science Foundation of China (11204043)

Abstract:

Establishing category keywords is the first key problem to solve in text classification. Building on a study of text classification with category keywords and the TF-IDF algorithm, an improved TF-IDF algorithm is proposed. First, a category keyword library is established, then expanded and deduplicated, which overcomes the vector space model's inability to adjust weights well. Next, a document-length weight is added to correct the weights of keywords within each document, which effectively addresses the insufficient category-discriminating ability of the original feature terms. Using a Bayesian classification method, experiments verify the effectiveness of the algorithm and show that it improves the accuracy of text classification.

Key words: keyword extraction; feature selection; text classification; preprocessing
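The abstract does not state the exact form of the document-length weight, so the following is only a minimal sketch of the general idea: standard TF-IDF restricted to a deduplicated category keyword set, multiplied by a length factor. The function name `length_weighted_tfidf`, the factor `avg_len / len(doc)`, and the toy documents are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def length_weighted_tfidf(docs, keywords):
    """Weight category keywords in each tokenized document.

    docs     : list of token lists (already segmented and preprocessed)
    keywords : set of category keywords (a set is deduplicated by construction)

    The length factor below is a hypothetical choice, not the paper's formula.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each keyword.
    df = Counter()
    for d in docs:
        for term in set(d) & keywords:
            df[term] += 1

    weights = []
    for d in docs:
        counts = Counter(t for t in d if t in keywords)
        # Assumed length factor: shorter-than-average documents are boosted,
        # longer-than-average documents are damped.
        length_factor = avg_len / len(d) if d else 0.0
        weights.append({
            term: count * math.log(n_docs / df[term]) * length_factor
            for term, count in counts.items()
        })
    return weights


# Toy example with made-up tokens.
docs = [
    ["sport", "ball", "ball"],
    ["sport", "bank", "loan", "loan", "loan"],
]
keywords = {"sport", "ball", "loan"}
weights = length_weighted_tfidf(docs, keywords)
```

In this sketch a keyword that occurs in every document (here `sport`) receives weight zero, because its IDF is `log(1)`. The resulting per-document weight dictionaries could then be passed as features to a Bayesian classifier, which is the classification step the abstract names.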
