Journal of Guangdong University of Technology ›› 2023, Vol. 40 ›› Issue (04): 109-116. doi: 10.12052/gdutxb.220051

• Comprehensive Research •

  • Corresponding author: Deng Xiu-qin (b. 1966), female, professor, Bachelor's degree; research interests: machine learning and data mining. E-mail: dxq706@gdut.edu.cn
  • About the first author: Zheng Li-ping (b. 1998), female, Master's candidate; research interests: machine learning and data mining
  • Funding:
    Guangdong Postgraduate Education Innovation Program (2021SFKC030); General Program of the Natural Science Foundation of Guangdong Province (2022A1515011592)

Distance Metric of Categorical Data Based on Graph Structure

Zheng Li-ping1, Deng Xiu-qin1, Zhang Yi-qun2   

  1. School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou 510520, China;
    2. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
  • Received:2022-03-17 Online:2023-07-25 Published:2023-08-02


Abstract: This paper proposes a new distance metric (NewDM) for categorical data, based on the graph structures of ordinal and nominal attributes, to address the poor performance of most existing distance measures on such data. NewDM first summarizes the basic framework formula for defining distances on categorical data and analyzes the challenges of measuring this data type. The graph structures of ordinal and nominal attributes are then used to define two distances between probability distributions, and these are combined with attribute weights to obtain the new distance metric. Experimental results on six public datasets show that the proposed NewDM outperforms state-of-the-art approaches.

Key words: categorical data, distance metric, graph structure, ordinal attribute
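The framework the abstract describes — a weighted sum of per-attribute distances, where ordinal attributes are measured along their ordered chain of categories and nominal attributes by category mismatch — can be sketched as follows. This is a minimal illustration of the generic formulation d(x, y) = Σ_r w_r · d_r(x_r, y_r) only; the concrete per-attribute distances and all names below are illustrative assumptions, not the paper's exact NewDM definitions, which are built from probability distributions over each attribute's graph structure.

```python
# Hypothetical sketch of the generic categorical-distance framework:
#   d(x, y) = sum_r w_r * d_r(x_r, y_r)
# The per-attribute distances here are simple stand-ins, NOT the paper's NewDM.

def attribute_distance(xr, yr, kind, n_categories):
    """Per-attribute distance d_r.

    ordinal: values are integer ranks 0..n-1; the distance is the normalized
             path length along the ordered chain graph of categories.
    nominal: plain 0/1 mismatch (the Hamming-style baseline the paper improves on).
    """
    if kind == "ordinal":
        return abs(xr - yr) / (n_categories - 1)
    return 0.0 if xr == yr else 1.0

def categorical_distance(x, y, kinds, n_cats, weights=None):
    """Weighted sum of per-attribute distances; uniform weights by default."""
    if weights is None:
        weights = [1.0 / len(x)] * len(x)
    return sum(w * attribute_distance(a, b, k, n)
               for a, b, k, n, w in zip(x, y, kinds, n_cats, weights))

# Two objects with one ordinal attribute (ranks from a 5-level scale) and one
# nominal attribute (3 color categories encoded as integers).
x, y = (0, 2), (4, 2)
print(categorical_distance(x, y, ("ordinal", "nominal"), (5, 3)))  # 0.5*1.0 + 0.5*0.0 = 0.5
```

Under this framework, improving the metric amounts to replacing the stand-in `attribute_distance` with distances between the conditional probability distributions defined on each attribute's graph, which is where NewDM differs from the Hamming-style baseline.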

CLC number: TP311.13
[1] SONG S, SUN Y, ZHANG A, et al. Enriching data imputation under similarity rule constraints[J]. IEEE Transactions on Knowledge and Data Engineering, 2020, 32(2): 275-287.
[2] ALABDULMOHSIN I, CISSE M, GAO X, et al. Large margin classification with indefinite similarities[J]. Machine Learning, 2016, 103(2): 215-237.
[3] ZHANG W, ZHANG Z B. Joint graph embedding and feature weighting for unsupervised feature selection[J]. Journal of Guangdong University of Technology, 2021, 38(5): 16-23. (in Chinese)
[4] ZENG S, WANG X, DUAN X, et al. Kernelized mahalanobis distance for fuzzy clustering[J]. IEEE Transactions on Fuzzy Systems, 2021, 29(10): 3103-3117.
[5] AGRESTI A. An introduction to categorical data analysis[M]. New York: John Wiley & Sons, 2018.
[6] HAMMING R W. Error detecting and error correcting codes[J]. The Bell System Technical Journal, 1950, 29(2): 147-160.
[7] AHMAD A, DEY L. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set[J]. Pattern Recognition Letters, 2007, 28(1): 110-118.
[8] IENCO D, PENSA R G, MEO R. Context-based distance learning for categorical data clustering[C]//Advances in Intelligent Data Analysis VIII: 8th International Symposium on Intelligent Data Analysis. Lyon: Springer, 2009: 83-94.
[9] IENCO D, PENSA R G, MEO R. From context to distance: learning dissimilarity for categorical data clustering[J]. ACM Transactions on Knowledge Discovery from Data, 2012, 6(1): 1-25.
[10] JIA H, CHEUNG Y M, LIU J. A new distance metric for unsupervised learning of categorical data[J]. IEEE Transactions on Neural Networks and Learning Systems, 2016, 27(5): 1065-1079.
[11] CHEN L, WANG S, WANG K, et al. Soft subspace clustering of categorical data with probabilistic distance[J]. Pattern Recognition, 2016, 51(3): 322-332.
[12] JIAN S, CAO L, LU K, et al. Unsupervised coupled metric similarity for non-IID categorical data[J]. IEEE Transactions on Knowledge and Data Engineering, 2018, 30(9): 1810-1823.
[13] AGRESTI A. Analysis of ordinal categorical data[M]. New York: John Wiley & Sons, 2010.
[14] ZHANG Y, CHEUNG Y. An ordinal data clustering algorithm with automated distance learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI Press, 2020: 6869-6876.
[15] LIN Q, TANG J S. Clustering algorithm for mixed categorical data[J]. Computer Engineering and Applications, 2019, 55(1): 168-173. (in Chinese)
[16] ZHANG Y, CHEUNG Y M, TAN K C. A unified entropy-based distance metric for ordinal-and-nominal-attribute data clustering[J]. IEEE Transactions on Neural Networks and Learning Systems, 2019, 31(1): 39-52.
[17] ZHANG Y, CHEUNG Y M. A new distance metric exploiting heterogeneous interattribute relationship for ordinal-and-nominal-attribute data clustering[J]. IEEE Transactions on Cybernetics, 2022, 52(2): 758-771.
[18] ZHANG Y, CHEUNG Y M, ZHANG Y, et al. Learnable weighting of intra-attribute distances for categorical data clustering with nominal and ordinal attributes[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 44(7): 3560-3576.
[19] WEST D B. Introduction to graph theory[M]. Upper Saddle River: Prentice Hall, 2001.
[20] CHAKRABORTY M, CHOWDHURY S, CHAKRABORTY J, et al. Algorithms for generating all possible spanning trees of a simple undirected connected graph: an extensive review[J]. Complex & Intelligent Systems, 2019, 5(3): 265-281.
[21] DEVRIENDT K, VAN MIEGHEM P. The simplex geometry of graphs[J]. Journal of Complex Networks, 2019, 7(4): 469-490.
[22] CAI D, ZHANG C, HE X. Unsupervised feature selection for multi-cluster data[C]//Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery, 2010: 333-342.
[23] SANTOS J M, EMBRECHTS M. On the use of the adjusted rand index as a metric for evaluating supervised classification[C]//Artificial Neural Networks–ICANN 2009: 19th International Conference. Cyprus: Springer, 2009: 175-184.
[24] ESTÉVEZ P A, TESMER M, PEREZ C A, et al. Normalized mutual information feature selection[J]. IEEE Transactions on Neural Networks, 2009, 20(2): 189-201.
[25] HUANG Z. Extensions to the k-means algorithm for clustering large data sets with categorical values[J]. Data Mining and Knowledge Discovery, 1998, 2(3): 283-304.
[26] JING L, NG M K, HUANG J Z. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(8): 1026-1041.
[27] JIA H, CHEUNG Y M. Subspace clustering of categorical and numerical data with an unknown number of clusters[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 29(8): 3308-3325.