PAM (Partitioning Around Medoids) for the k-medoids algorithm: a PAM implementation, plus a k-means example using the scikit-learn toolbox
Published: 2019-06-13


I've been interested in clustering lately, so I wrote my own k-medoids implementation. This algorithm is said to be more robust than k-means; I think the key difference is that each cluster center is an actual data point rather than a computed mean.
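A toy illustration of that robustness claim (not from the original post, the numbers are made up): a single outlier drags the mean far away, while the medoid, the actual point minimizing the summed distance to all other points, stays in the bulk of the data.

```python
import numpy as np

# five 1-D points, one of which is an extreme outlier
points = np.array([1.0, 1.1, 0.9, 1.0, 100.0])

mean = points.mean()  # pulled far toward the outlier

# medoid: the real data point with the smallest total distance to all points
dists = np.abs(points[:, None] - points[None, :]).sum(axis=1)
medoid = points[dists.argmin()]

print(mean)    # 20.8 (approximately), nowhere near the bulk of the data
print(medoid)  # 1.0
```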

It turned out to be simple to write. In the end I vectorized it with the cdist() function, which is very handy.
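As a hypothetical mini-example of that vectorization (the points and medoids below are made up), cdist() returns the full matrix of pairwise Euclidean distances between two point sets, so cluster assignment and total cost reduce to an argmin/min along one axis:

```python
import numpy as np
from scipy.spatial.distance import cdist

data = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
medoid_pts = np.array([[0.0, 0.0], [5.0, 0.0]])

dis = cdist(data, medoid_pts, 'euclidean')  # shape (3, 2): point-to-medoid distances
assignment = dis.argmin(axis=1)             # nearest medoid for each point
cost = dis.min(axis=1).sum()                # total cost of this clustering

print(assignment)  # [0 0 1]
print(cost)        # 1.0
```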

First the result: 3 clusters, 3000 points in total.

 

Now the code.

The imports:

import collections
import copy
import pdb
import random

import numpy as np
import matplotlib.pyplot as plt
from pylab import *
from scipy.spatial.distance import cdist

A note on the main data structure:

medoids is a dict:

key -2: the total cost (stored as a one-element list)

key -1: a numpy array holding the indices of the medoids

keys 0 through k-1: the member indices of each cluster

 

data is an array with one data point per row.
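To make that layout concrete, here is a hypothetical snapshot of the dict for k = 2 and six data points (all the index values are made up for illustration):

```python
import numpy as np

medoids = {
    -2: [12.7],              # total cost of the current clustering
    -1: np.array([0, 3]),    # indices of the k medoid points
    0: np.array([0, 1, 2]),  # member indices of cluster 0
    1: np.array([3, 4, 5]),  # member indices of cluster 1
}

k = len(medoids[-1])
print(k)  # 2
```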

The following function computes the total cost under the current configuration:

def total_cost(data, medoids):
    '''Compute the total cost based on the current setting.'''
    med_idx = medoids[-1]
    med = data[med_idx]
    dis = cdist(data, med, 'euclidean')
    # vectorized with cdist(), which is way faster than the loop below:
    # for i in range(k):
    #     med = data[med_idx[i]]
    #     for j in medoids[i]:
    #         cost = cost + np.linalg.norm(med - data[j])
    medoids[-2] = [dis.min(axis=1).sum()]

The clustering() function assigns each point to a cluster under the current medoids:

def clustering(data, medoids):
    '''Assign each data point to its nearest medoid center, by Euclidean distance.'''
    med_idx = medoids[-1]
    med = data[med_idx]
    k = len(med_idx)
    dis = cdist(data, med)
    best_med_it_belongs_to = dis.argmin(axis=1)
    for i in range(k):
        medoids[i] = np.where(best_med_it_belongs_to == i)[0]

 

The most important function, kmedoids(), iterates to find the best clustering.

old... and cur... are compared; the loop exits once they no longer change.

In each iteration, swapping one (medoid, non-medoid) pair in cur... produces a tmp...; the best among all tmp... configurations becomes best....

def kmedoids(data, k):
    '''Given the data and the number of clusters, compute the best clustering
    using the PAM algorithm (see the Wikipedia article on PAM).'''
    # cur_medoids is compared with old_medoids; convergence is reached when
    # the set of medoids does not change between consecutive iterations.
    # tmp_medoids is cur_medoids with a single (medoid, non-medoid) pair swapped.
    # best_medoids is the best tmp_medoids over all possible swaps.
    N = len(data)
    cur_medoids = {}
    cur_medoids[-1] = list(range(k))
    clustering(data, cur_medoids)
    total_cost(data, cur_medoids)
    old_medoids = {}
    old_medoids[-1] = []

    iter_counter = 1
    # stop if no improvement
    while not set(old_medoids[-1]) == set(cur_medoids[-1]):
        print('iteration counter:', iter_counter)
        iter_counter = iter_counter + 1
        best_medoids = copy.deepcopy(cur_medoids)
        old_medoids = copy.deepcopy(cur_medoids)
        # iterate over all (medoid, non-medoid) pairs
        for i in range(N):
            if i in cur_medoids[-1]:
                continue  # i is already a medoid, nothing to swap
            for j in range(k):
                # swap only one pair
                tmp_medoids = copy.deepcopy(cur_medoids)
                tmp_medoids[-1][j] = i
                clustering(data, tmp_medoids)
                total_cost(data, tmp_medoids)
                # keep the best configuration
                if best_medoids[-2] > tmp_medoids[-2]:
                    best_medoids = copy.deepcopy(tmp_medoids)
        cur_medoids = copy.deepcopy(best_medoids)
        print('current total cost is', cur_medoids[-2])
    return cur_medoids

Finally, a test() function: 3 Gaussian clusters, 1000 points each (3000 in total).

def test():
    dim = 2
    N = 1000
    # create clusters from different normal distributions
    d1 = np.random.normal(1, .2, (N, dim))
    d2 = np.random.normal(2, .5, (N, dim))
    d3 = np.random.normal(3, .3, (N, dim))
    data = np.vstack((d1, d2, d3))

    # change if more clusters are needed
    k = 3
    medoids = kmedoids(data, k)
    # plot the clusters in different colors
    scatter(data[medoids[0], 0], data[medoids[0], 1], c='r')
    scatter(data[medoids[1], 0], data[medoids[1], 1], c='g')
    scatter(data[medoids[2], 0], data[medoids[2], 1], c='y')
    scatter(data[medoids[-1], 0], data[medoids[-1], 1], marker='x', s=500)
    show()

Then just run it; it takes about 2 minutes.

if __name__ == '__main__':
    test()

I also found scikit-learn, a very handy Python toolbox for machine learning, and tried it out by writing a k-means example.

The plot first: 3 Gaussian clusters, 3000 points in total.

The code. Imports first:

import time

import numpy as np
import pylab as pl
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_blobs

Then generate the data and train k-means, which is very simple:

np.random.seed(0)
centers = [[1, 1], [-1, -1], [1, -1]]
k = len(centers)
x, labels = make_blobs(n_samples=3000, centers=centers, cluster_std=.7)

kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10)
t0 = time.time()
kmeans.fit(x)
t_end = time.time() - t0
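Not part of the original snippet, but a quick sanity check on a fitted model can look like this (a smaller n_samples here to keep it fast): labels_ holds each sample's cluster and cluster_centers_ the learned centers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

np.random.seed(0)
x, labels = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                       cluster_std=.7)
kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10).fit(x)

print(kmeans.labels_.shape)           # (300,)
print(kmeans.cluster_centers_.shape)  # (3, 2)
# predict() assigns new points to the nearest learned center
print(kmeans.predict([[1.0, 1.0]])[0] in {0, 1, 2})  # True
```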

Finally, the plotting:

colors = ['r', 'b', 'g']
for i, col in zip(range(k), colors):
    members = (kmeans.labels_ == i)
    pl.plot(x[members, 0], x[members, 1], 'w', markerfacecolor=col, marker='.')
    pl.plot(kmeans.cluster_centers_[i, 0], kmeans.cluster_centers_[i, 1], 'o',
            markerfacecolor=col, markeredgecolor='k', markersize=10)
pl.show()

I often find Python much nicer to use than MATLAB. Most importantly, MATLAB doesn't integrate well with Vim, which makes it awkward to work with.


Reposted from: https://www.cnblogs.com/wenhoujx/p/3147563.html
