Journal of Guangdong University of Technology ›› 2017, Vol. 34 ›› Issue (03): 83-88. doi: 10.12052/gdutxb.170042

• Special Topic on Big Data Fundamental Theory and Applications •

An Online Product Review Information Collection Method Based on Storm

Luo Kui-yong, Hao Zhi-feng, Cai Rui-chu, Wen Wen, Yuan Qin

  1. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2017-02-28 Online: 2017-05-09 Published: 2017-05-09
  • About the author: Luo Kui-yong (1991-), male, Master's student; main research interests: data mining and real-time processing. E-mail: ykui_luo@126.com
  • Supported by:

    National Natural Science Foundation of China (61202269, 61472089, 61572143, 61502108, 61502109)

Abstract:

To obtain product reviews from e-commerce websites as early as possible, and thereby track public opinion on products in real time, an online product review information collection method based on Storm is proposed. The method applies the concept of stream computing to web crawling and uses the SHHD (Simhash Hamming Distance) algorithm to adjust the collection period dynamically. Experimental results show that information collection on the Storm platform offers high throughput and strong scalability; that the SHHD algorithm effectively reduces the collection system's consumption of network bandwidth and system resources, achieving adaptive, incremental collection of online product review information; and that SHHD has a clear advantage over methods such as Poisson and SART in the lag time of acquiring product review information.

Key words: product review information, Storm, adaptability, incremental acquisition
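
The abstract names the two ingredients of SHHD (a Simhash fingerprint and a Hamming distance) but does not give the rule that maps the distance to a new collection period. The Python sketch below is therefore only an illustration of the general idea, not the authors' algorithm: it fingerprints two successive snapshots of a review page, measures how many fingerprint bits differ, and shortens the revisit interval when the page changed noticeably or lengthens it otherwise. The function names, the whitespace tokenizer, the MD5-based token hash, the threshold of 3 bits, the factor of 2, and the interval bounds are all assumptions made for this example.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fingerprint width


def simhash(text: str) -> int:
    """Return a 64-bit Simhash of whitespace-separated tokens (unit weights)."""
    counts = [0] * FINGERPRINT_BITS
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of MD5
        for i in range(FINGERPRINT_BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(counts):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def next_interval(current_s: float, old_page: str, new_page: str,
                  threshold: int = 3, factor: float = 2.0,
                  min_s: float = 60.0, max_s: float = 86400.0) -> float:
    """Hypothetical update rule: revisit sooner if the page changed a lot.

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    distance = hamming_distance(simhash(old_page), simhash(new_page))
    if distance > threshold:
        proposed = current_s / factor   # page is changing: crawl more often
    else:
        proposed = current_s * factor   # page is stable: back off
    return max(min_s, min(max_s, proposed))
```

With two identical snapshots the Hamming distance is 0, so under these assumed defaults `next_interval(600, page, page)` doubles the interval from 600 to 1200 seconds. In a Storm-based crawler, a rule of this kind would plausibly sit in the component that reschedules page fetches, but that placement is also an assumption.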

CLC number: TP391
