Journal of Guangdong University of Technology, 2018, Vol. 35, Issue (02): 51-56. doi: 10.12052/gdutxb.170152

Text Extraction Based on Text Block Density with Tag Path and Other Features

Yang Xian1, Tang Chao-lan1, Li Hang2   

  1. School of Art and Design, Guangdong University of Technology, Guangzhou 510090, China;
  2. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2017-10-16 Online: 2018-03-09 Published: 2018-03-13

Abstract: Most web pages contain a large amount of noisy information alongside their main content. To address this problem and improve the accuracy of web page content extraction, a method based on text block density combined with tag path and other features is proposed. The method combines the advantages of text-block extraction and tag-path extraction. First, the main text block is located according to the density feature of text blocks; then the tag path method is used to remove noisy nodes within that block; finally, the content is extracted from the remaining text nodes. This approach addresses two weaknesses of the individual methods: noisy information inside a text block is difficult to filter, and the tag path method tends to extract long text from noisy blocks. Experiments show that the proposed method outperforms CETR and CETD in most cases.
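
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the density measure (text characters per tag), the set of candidate block tags, the `min_share` threshold, and the names `TreeBuilder`, `densest_block`, and `extract` are all assumptions chosen for illustration, using only Python's standard `html.parser`.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag
BLOCK_TAGS = {"div", "article", "section", "td"}     # assumed block candidates


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.texts = [], []

    def path(self):
        """Tag path from the document root, e.g. 'html/body/div/p'."""
        parts, node = [], self
        while node.parent is not None:
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))


class TreeBuilder(HTMLParser):
    """Builds a simple element tree and records non-empty text per element."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            node = Node(tag, self.cur)
            self.cur.children.append(node)
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # unmatched end tags are ignored
            self.cur = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.cur.tag not in ("script", "style"):
            self.cur.texts.append(text)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def densest_block(root):
    """Step 1 (assumed measure): block with the most text characters per tag."""
    best, best_score = None, -1.0
    for node in walk(root):
        if node.tag in BLOCK_TAGS:
            sub = list(walk(node))
            score = sum(len(t) for n in sub for t in n.texts) / len(sub)
            if score > best_score:
                best, best_score = node, score
    return best


def extract(html, min_share=0.2):
    """Step 2 (assumed rule): keep tag paths holding >= min_share of block text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    block = densest_block(builder.root)
    if block is None:
        return ""
    path_chars = defaultdict(int)
    for node in walk(block):
        path_chars[node.path()] += sum(len(t) for t in node.texts)
    total = sum(path_chars.values()) or 1  # avoid division by zero
    kept = [t for node in walk(block)
            if path_chars[node.path()] / total >= min_share
            for t in node.texts]
    return "\n".join(kept)


SAMPLE = """<html><body>
<div><a href="/">Home</a><a href="/a">About</a></div>
<div>
<p>First paragraph of the article body, long enough to count as content.</p>
<p>Second paragraph, also long enough to dominate the block text.</p>
<span><a href="/share">Share</a></span>
</div>
<div><a href="/t">Terms</a></div>
</body></html>"""

if __name__ == "__main__":
    print(extract(SAMPLE))
```

On this sample, the navigation and footer blocks lose to the main block on density, and the "Share" link is dropped because its tag path carries well under 20% of the block's text. A single small block with one long text node would score highly under this simplified measure, so a real system would need the additional features the paper refers to.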

Key words: content extraction, text block, tag path, text density

CLC Number: TP391