Sunday, June 10, 2007

An overview and discussion of several clustering algorithms for mixed-type data

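Many databases contain both numeric and categorical attributes. In my recent reading I came across three good algorithms for this kind of data.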

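Algorithm 1, proposed by researchers from Taiwan in January 2007: it measures the distance between categorical values with a distance hierarchy. This preserves the semantics of the values, which is better than k-modes/k-prototypes, where all that matters is whether two categories are identical.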
Mining of mixed data with application to catalog marketing
To express the similarity between categorical values, distance hierarchy has been proposed. Accordingly, the similarity of the categorical part is measured based on entropy weighted by the distances in the hierarchies. A new validity index for evaluating the clustering results has also been proposed.
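To make the contrast with plain matching concrete, here is a minimal sketch of a tree-based distance between categorical values. The hierarchy, its values ("Coke", "Pepsi", "Bread") and the link weights are all made up for illustration, and the function names are my own; this is only my reading of the idea, not the paper's exact definition (which also involves entropy weighting).

```python
# Hypothetical concept hierarchy: child -> (parent, weight of the link to the parent).
# The values and weights are illustrative assumptions, not taken from the paper.
HIERARCHY = {
    "Coke": ("Drink", 1.0),
    "Pepsi": ("Drink", 1.0),
    "Bread": ("Food", 1.0),
    "Drink": ("Any", 1.0),
    "Food": ("Any", 1.0),
}


def path_to_root(value):
    """Map each ancestor of `value` (and `value` itself) to its distance from `value`."""
    dist = 0.0
    path = {value: 0.0}
    while value in HIERARCHY:
        parent, weight = HIERARCHY[value]
        dist += weight
        path[parent] = dist
        value = parent
    return path


def hierarchy_distance(a, b):
    """Length of the path from a to b through their lowest common ancestor."""
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    # Among all shared ancestors, the lowest one gives the shortest total path.
    return min(up_a[node] + up_b[node] for node in up_a if node in up_b)


def matching_distance(a, b):
    """The k-modes style distance: only checks whether the categories are identical."""
    return 0.0 if a == b else 1.0


print(hierarchy_distance("Coke", "Pepsi"), matching_distance("Coke", "Pepsi"))
print(hierarchy_distance("Coke", "Bread"), matching_distance("Coke", "Bread"))
```

With simple matching, "Coke" is equally far from "Pepsi" and from "Bread"; with the hierarchy, the two drinks end up closer to each other than a drink and a food item.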
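Algorithm 2, proposed by researchers from mainland China, works as follows:
1. Split the database into two parts: the numeric attributes and the categorical attributes.
2. Cluster each part separately with a simple, efficient algorithm for that data type, which naturally avoids the problems that compromise (modified) algorithms introduce.
3. Merge the clusters produced on the two sides. Since clustering is essentially a process of assigning labels (categories), the merge is itself solved with a categorical clustering method.
This divide-and-conquer approach is really powerful!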
Mining of mixed data with application to catalog marketing. Expert Systems with Applications, Volume 32, Issue 1, January 2007, Pages 12-23
Clustering Categorical Data: A Cluster Ensemble Approach. High Technology Letters (English edition), 2003, Issue 04. Deng Shengchun, Xu Xiaofei, He Zengyou (何增友)
Clustering categorical data, an integral part of data mining, has attracted much attention recently. In this paper, the authors formally define the categorical data clustering problem as an optimization problem from the viewpoint of cluster ensemble, and apply cluster ensemble approach for clustering categorical data. Experimental results on real datasets show that better clustering accuracy can be obtained by comparing with existing categorical data clustering algorithms.
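The merge step of Algorithm 2 can be sketched roughly as follows. This is one possible reading, not the authors' implementation: I assume the numeric part and the categorical part have already been clustered separately (for example with k-means and k-modes), treat the two resulting label vectors as a two-attribute categorical dataset, and cluster that with a tiny k-modes. The names `k_modes` and `merge_partitions`, the deterministic initialization and the toy labels are all my own choices.

```python
from collections import Counter


def mismatches(record, mode):
    """Simple matching distance: number of attributes that differ."""
    return sum(1 for x, m in zip(record, mode) if x != m)


def k_modes(records, k, max_iter=20):
    """A tiny k-modes for tuples of categorical values (illustrative only)."""
    # Deterministic initialization: the first k distinct records become the modes.
    modes = []
    for r in records:
        if r not in modes:
            modes.append(r)
        if len(modes) == k:
            break
    labels = []
    for _ in range(max_iter):
        # Assign every record to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatches(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each mode as the most frequent value per attribute.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = tuple(
                    Counter(column).most_common(1)[0][0] for column in zip(*members)
                )
    return labels


def merge_partitions(numeric_labels, categorical_labels, k):
    """Treat the two label vectors as categorical attributes and cluster them."""
    records = list(zip(numeric_labels, categorical_labels))
    return k_modes(records, k)


# Toy input: labels assumed to come from two separate clusterings of 6 records.
numeric_labels = [0, 0, 0, 1, 1, 1]
categorical_labels = ["x", "x", "y", "y", "y", "y"]
final_labels = merge_partitions(numeric_labels, categorical_labels, k=2)
print(final_labels)
```

The third record is the interesting one: the two partial clusterings disagree about it, and the final clustering has to decide which side it joins.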
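Algorithm 3, proposed by Greek researchers (it is reference 19 in the second paper): run the same algorithm several times, merge the results of the successive runs, and record each point's degree of membership in each cluster; then use some criterion (root square or the like) to optimize and determine the number of clusters. Compared with the ordinary greedy EM algorithm, the number of clusters and the cluster shapes it finds are very accurate.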
A Multi-Clustering Fusion Algorithm, Proc. of the Second Hellenic Conference on AI
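As a rough illustration of fusing several runs, the sketch below counts how often each pair of points lands in the same cluster across runs (a co-association matrix) and then groups points that agree in more than a chosen fraction of runs. This is a generic evidence-accumulation stand-in, not the paper's actual fusion procedure or its criterion for the number of clusters; the `threshold` parameter, the function names and the toy runs are my own assumptions.

```python
def co_association(runs):
    """Fraction of runs in which each pair of points shares a cluster."""
    n = len(runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]


def fuse(runs, threshold=0.5):
    """Group points linked by co-association above `threshold`.

    Returns the fused labels and the number of clusters that emerges.
    """
    assoc = co_association(runs)
    n = len(assoc)
    fused = [-1] * n
    next_label = 0
    for start in range(n):
        if fused[start] != -1:
            continue
        # Flood-fill one connected component of the thresholded graph.
        fused[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if fused[j] == -1 and assoc[i][j] > threshold:
                    fused[j] = next_label
                    stack.append(j)
        next_label += 1
    return fused, next_label


# Three hypothetical runs of the same algorithm on 5 points.
# Label ids are arbitrary per run; only co-membership matters.
runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
fused_labels, n_clusters = fuse(runs)
print(fused_labels, n_clusters)
```

Note that the number of clusters is not passed in; it falls out of the fusion, which is the appealing part of this family of methods.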


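Discussion:
Algorithm 2 is vague about how to merge the clusters obtained from the two different data types, which makes it far weaker than the Greek paper. I have no respect for that! The third paper is well worth reading carefully; it is a really satisfying read!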
