Abstract:
To collect product review information from e-commerce websites as quickly as possible and track public opinion on products in real time, a Storm-based method for collecting online product review information is presented. The concept of stream computing is applied to the web crawler, and the SHHD (Simhash Hamming Distance) algorithm is used to dynamically adjust the acquisition period. Experimental results show that Storm-based information collection offers high throughput and is easy to update. The SHHD algorithm effectively reduces the collection system's consumption of network bandwidth and system resources, enabling adaptive, incremental collection of online product reviews. Compared with Poisson and SART, SHHD has some advantage in terms of the lag in acquiring product review information.
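
As a rough illustration of the idea behind adjusting the acquisition period with a Simhash Hamming distance, the sketch below fingerprints two successive snapshots of a review page, measures how many fingerprint bits differ, and shortens or lengthens the crawl period accordingly. The abstract does not give the actual update rule, so the halving/doubling policy, the threshold, the period bounds, and all function names (`simhash`, `hamming_distance`, `next_period`) are assumptions made for illustration only.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8 hash bytes used below


def simhash(text: str) -> int:
    """Compute a 64-bit Simhash fingerprint over whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def next_period(period: float, distance: int, threshold: int = 3,
                factor: float = 2.0, min_period: float = 60.0,
                max_period: float = 86400.0) -> float:
    """Assumed policy: crawl sooner if the page changed a lot, later otherwise."""
    if distance > threshold:
        period = period / factor  # content changed noticeably: shorten the period
    else:
        period = period * factor  # content is stable: lengthen the period
    return max(min_period, min(max_period, period))


if __name__ == "__main__":
    old = simhash("great phone battery lasts two days")
    new = simhash("great phone battery lasts two days but screen scratches easily")
    d = hamming_distance(old, new)
    print("distance:", d, "next period (s):", next_period(600.0, d))
```

Under this assumed policy, a page whose reviews rarely change is visited less and less often, which is one way the bandwidth and resource savings described above could arise, while a rapidly changing page is revisited sooner to limit lag.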