Journal of Guangdong University of Technology ›› 2017, Vol. 34 ›› Issue (03): 49-53. doi: 10.12052/gdutxb.170011

• Special Topic on Fundamental Theory and Applications of Big Data •

An Improved mpts-HDBSCAN Algorithm

Wang Rong-rong, Fu Xiu-fen

  1. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2017-01-15 Online: 2017-05-09 Published: 2017-05-09
  • About the author: Wang Rong-rong (b. 1990), male, master's student; main research interests are data mining and cloud computing.
  • Supported by:

    Science and Technology Planning Project of Guangdong Province (2013B010401034)

Abstract:

Cluster analysis is an important branch of unsupervised pattern classification. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most common density-based clustering algorithms; it has been widely studied and applied because it can find clusters of arbitrary shape and is insensitive to noise points. This paper first examines several problems with DBSCAN and the shortcomings of existing DBSCAN-based improvements. It then proposes a new partitioning algorithm to address the degraded clustering quality of mpts-HDBSCAN on datasets whose density is unevenly distributed. The partitioning algorithm determines the data groups from the histogram of the data distribution and uses a partition threshold to decide whether the data should be partitioned. mpts-HDBSCAN is then applied to each resulting sub-dataset, and the resulting sub-clusters are merged. Experimental results show that the improved algorithm is more effective than mpts-HDBSCAN on data with non-uniform density.

Key words: clustering, data partitioning, mpts-HDBSCAN, merging sub-clusters

CLC Number:

  • TP391
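The abstract gives only the outline of the method: group the data from a histogram, test a partition threshold, cluster each sub-dataset, and merge the results. The following is a rough sketch of that pipeline under stated assumptions, not the authors' implementation. The function names (`histogram`, `partition`, `cluster_partitions`), the parameters `bins` and `threshold`, the choice to split along the x coordinate, the max/min bin-count ratio used as the partition criterion, and the dense/sparse grouping rule are all hypothetical. `cluster_fn` is a placeholder for mpts-HDBSCAN, and the merge step here only relabels sub-clusters so their ids are unique; it does not join clusters that meet at a partition boundary.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def histogram(values: Sequence[float], bins: int) -> Tuple[List[int], float, float]:
    """Equal-width histogram; returns (counts, minimum value, bin width)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts, lo, width


def partition(points: Sequence[Point], bins: int = 10,
              threshold: float = 3.0) -> List[List[Point]]:
    """Split points along x into runs of dense and sparse histogram bins.

    Assumed criterion: partition only if the ratio of the largest to the
    smallest non-empty bin count exceeds `threshold`.
    """
    counts, lo, width = histogram([p[0] for p in points], bins)
    nonzero = [c for c in counts if c]
    if max(nonzero) / min(nonzero) <= threshold:
        return [list(points)]  # density is even enough: keep the data whole
    mean = sum(counts) / bins
    dense = [c >= mean for c in counts]
    # A new partition id starts whenever the dense/sparse label flips.
    part_of_bin = [0] * bins
    for i in range(1, bins):
        part_of_bin[i] = part_of_bin[i - 1] + (dense[i] != dense[i - 1])
    parts: List[List[Point]] = [[] for _ in range(part_of_bin[-1] + 1)]
    for p in points:
        b = min(int((p[0] - lo) / width), bins - 1)
        parts[part_of_bin[b]].append(p)
    return [part for part in parts if part]


def cluster_partitions(points: Sequence[Point],
                       cluster_fn: Callable[[List[Point]], List[int]],
                       bins: int = 10,
                       threshold: float = 3.0) -> List[Tuple[Point, int]]:
    """Cluster each partition separately, then merge by offsetting label ids.

    Noise points (label -1) stay -1 in the merged result.
    """
    merged: List[Tuple[Point, int]] = []
    next_label = 0
    for part in partition(points, bins, threshold):
        labels = cluster_fn(part)
        for p, lab in zip(part, labels):
            merged.append((p, -1 if lab < 0 else lab + next_label))
        next_label += max(labels, default=-1) + 1
    return merged


# Toy data: a dense strip near x = 0 and a sparse strip from x = 5 to 9.5.
points = [(i / 100, 0.0) for i in range(90)] + [(5 + i * 0.5, 0.0) for i in range(10)]
# Stand-in clusterer: one cluster per partition (replace with a real algorithm).
result = cluster_partitions(points, lambda part: [0] * len(part))
print(Counter(label for _, label in result))
```

In practice `cluster_fn` would wrap a real density-based clusterer run with parameters suited to each partition's density, which is the point of partitioning in the first place.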

