Journal of Guangdong University of Technology ›› 2023, Vol. 40 ›› Issue (04): 109-116. doi: 10.12052/gdutxb.220051

• Comprehensive Studies •

Distance Metric of Categorical Data Based on Graph Structure

Zheng Li-ping1, Deng Xiu-qin1, Zhang Yi-qun2   

    1. School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou 510520, China;
    2. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2022-03-17 Online: 2023-07-25 Published: 2023-08-02

Abstract: Most existing distance measures for categorical data perform poorly. To address this, this paper proposes a new distance metric (NewDM) for categorical data based on the graph structures of ordinal and nominal attributes. The paper first summarizes the general framework formula used to define distances for categorical data and analyzes the challenges of measuring such data. The graph structures of ordinal and nominal attributes are then used to define the distance between two probability distributions. Finally, the new distance metric for categorical data is obtained through joint weighting. Experimental results on six public datasets show that NewDM outperforms state-of-the-art approaches.
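The abstract does not give the NewDM formulas, so the following is only an illustrative sketch of the general idea of letting an attribute's graph structure define a distance between two probability distributions. It assumes (this is not stated in the source) that an ordinal attribute is modeled as a path graph and a nominal attribute as a complete graph, both with unit-length edges, and it uses the standard transport (earth mover's) cost on each graph. The function names are hypothetical.

```python
from typing import Sequence


def path_graph_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Transport cost between two distributions over ORDERED categories.

    Assumption: categories are nodes of a path graph with unit edges,
    so the cost equals the sum of absolute differences of the two CDFs.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total, cum = 0.0, 0.0
    # The last CDF difference is zero when both inputs sum to 1, so skip it.
    for pi, qi in zip(p[:-1], q[:-1]):
        cum += pi - qi          # running difference of the CDFs
        total += abs(cum)       # mass that must cross this edge
    return total


def complete_graph_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Transport cost between two distributions over UNORDERED categories.

    Assumption: categories are nodes of a complete graph with unit edges,
    so every move costs 1 and the cost is the total variation distance.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    low = [1.0, 0.0, 0.0]   # all mass on the first category
    high = [0.0, 0.0, 1.0]  # all mass on the last category
    print(path_graph_distance(low, high))      # ordinal view
    print(complete_graph_distance(low, high))  # nominal view
```

Under these assumptions, moving all mass from the first to the last of three ordered categories costs two edge lengths, while the same move between unordered categories costs one. This shows how the order information of an ordinal attribute can enter a distance that a nominal attribute would ignore.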

Key words: categorical data, distance metric, graph structure, ordinal attribute

CLC Number: TP311.13