
Label Propagation and Its Python Implementation


As is well known, machine learning falls roughly into three categories: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning we have a large amount of labeled data with which to train a model, hoping the model will learn the distribution of the data and thus predict samples it has never seen. That makes the source of this power, the training data, critically important: you must have enough training data to cover the sample distribution of the real-world data, or the learned model is meaningless. Unsupervised learning has no labeled data at all; this is ordinary clustering, where the data are assigned to classes using nothing but their own distribution. Semi-supervised learning, as the name suggests, sits in between: with only a small amount of labeled data, we try to learn useful information from those few labeled samples together with a large amount of unlabeled data.

1. Semi-Supervised Learning

Semi-supervised learning comes into play when some of your data carry labels and some do not, and typically the vast majority carry none, with only a handful labeled. A semi-supervised algorithm makes full use of the unlabeled data to capture the underlying distribution of the dataset as a whole. It rests on three assumptions:

1) Smoothness assumption: similar data points share the same label.

2) Cluster assumption: data points in the same cluster share the same label.

3) Manifold assumption: data points on the same manifold structure share the same label.

For example, in the figure below there are only two labeled points. If we train a classifier on them directly, say logistic regression or an SVM, we get the decision boundary in the left panel. If in reality the data are distributed as in the right panel, it is plain to see that this classifier is disastrously bad: the labeled training data are far too few to cover the situations we will meet later. But if, as in the right panel, we bring the many unlabeled points (the black ones) into the picture and take a global view, a good algorithm will discover that the data actually lie on two rings (two circular manifolds), classify the outer ring as the red class, and classify the inner ring as the blue class. In practice, labeled data are expensive and hard to obtain, while unlabeled data are not; a web-scraping script is enough. So the ability to exploit large amounts of unlabeled data to assist model learning is extremely valuable.

There are many semi-supervised learning algorithms. Below we introduce the simplest of them, label propagation; simplest is best.

2. The Label Propagation Algorithm

The core idea of label propagation (LP) is very simple: similar data points should have the same label. The LP algorithm has two main steps: 1) construct the similarity matrix; 2) propagate away.

2.1 Building the Similarity Matrix

LP is graph-based, so we first need to build a graph over all of the data. Each node of the graph is one data point, labeled or unlabeled, and the edge between nodes i and j represents their similarity. There are many ways to build this graph; here we assume it is fully connected, with the edge weight between nodes i and j given by

w_ij = exp( -||x_i - x_j||^2 / α^2 )

where α is a hyperparameter. (The code below uses the equivalent parameterization α^2 = 2σ^2.)

Another very common construction is the kNN graph: keep only the weights to each node's k nearest neighbors and set all other weights to 0, i.e., no edge, giving a sparse similarity matrix.
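To make the two constructions concrete, here is a minimal numpy sketch (hypothetical names X, alpha and k; it drops self-loops and leaves row normalization to the next step, whereas the full implementation in part three uses explicit loops and normalizes as it goes):

import numpy as np

def rbf_graph(X, alpha):
    # pairwise squared Euclidean distances via broadcasting
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dist / alpha ** 2)      # fully connected edge weights w_ij
    np.fill_diagonal(W, 0.0)               # drop self-loops
    return W

def knn_graph(W, k):
    # keep only each node's k strongest edges; the rest become 0 (no edge)
    W_knn = np.zeros_like(W)
    idx = np.argsort(W, axis=1)[:, -k:]    # k nearest neighbors per row
    rows = np.arange(W.shape[0])[:, None]
    W_knn[rows, idx] = W[rows, idx]
    return W_knn

W = rbf_graph(np.random.rand(10, 2), alpha=1.0)
W_sparse = knn_graph(W, k=3)

Note that the author's buildGraph below instead gives each of the k neighbors the uniform weight 1/k, which normalizes the rows in the same pass.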

2.2 The LP Algorithm

The label propagation algorithm itself is very simple: labels are propagated through the edges between nodes. The larger an edge's weight, the more similar the two nodes it joins, and the more easily the label crosses it. We define an N×N probability transition matrix P by

P_ij = w_ij / Σ_k w_ik

i.e., each row of the similarity matrix is normalized to sum to one.

P_ij is the probability of moving from node i to node j. Suppose we have C classes and L labeled samples. We define an L×C label matrix Y_L whose i-th row is the label indicator vector of sample i: if sample i belongs to class j, the j-th element of that row is 1 and all others are 0. Likewise we give the U unlabeled samples a U×C label matrix Y_U. Stacking them yields an N×C soft label matrix F = [Y_L; Y_U]. "Soft label" means that for sample i we keep a probability of belonging to each class, rather than insisting the sample belong to exactly one class with probability 1; of course, when we finally fix the class of sample i, we take the max, the class with the largest probability. F contains Y_U, which is unknown at the start. What should its initial value be? It does not matter; any value will do.
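These definitions are only a few lines of numpy. A minimal, self-contained sketch with toy sizes (as in the implementation below, the labeled samples are assumed to come first):

import numpy as np

np.random.seed(0)
L, U, C = 2, 6, 2                       # labeled samples, unlabeled samples, classes
N = L + U
labels = np.array([0, 1])               # classes of the L labeled samples
W = np.random.rand(N, N)                # stand-in for the similarity matrix above

# row-normalize the similarities: P_ij = w_ij / sum_k w_ik
P = W / W.sum(axis=1, keepdims=True)

# soft label matrix F = [Y_L; Y_U], shape N x C
F = np.zeros((N, C))
F[np.arange(L), labels] = 1.0           # Y_L: one-hot rows for labeled samples
# rows L..N-1 hold Y_U; any initial value works, so zeros are fine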

And now, at long last, the simple LP algorithm:

1) Propagate: F = PF

2) Reset the labels of the labeled samples in F: F_L = Y_L

3) Repeat steps 1) and 2) until F converges.

Step 1) multiplies matrix P by matrix F; in this step each node propagates its label to the other nodes with the probabilities given by P. The more similar two nodes are (the closer they are in Euclidean space), the more easily each node's label is overwritten by the other's; similar nodes readily pull each other into the same camp. Step 2) is crucial: the labels of the labeled data were fixed in advance and must not be dragged away, so after every propagation they are reset to their original values. As the labeled data keep broadcasting their labels, the final class boundary sweeps quickly across high-density regions and comes to rest in the low-density gaps between them; in effect, labeled samples of different classes stake out their own territories.
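Continuing the sketch above, the whole propagate-and-clamp loop fits in a few lines:

Y_L = F[:L].copy()                       # the clamped labels
for it in range(1000):
    F_prev = F.copy()
    F = P.dot(F)                         # step 1: propagate, F = PF
    F[:L] = Y_L                          # step 2: clamp the labeled rows
    if np.abs(F - F_prev).sum() < 1e-6:  # step 3: stop once F has converged
        break

pred = F[L:].argmax(axis=1)              # hard labels for the unlabeled samples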

2.3 A Variant of the LP Algorithm

Note that each iteration computes the full soft label matrix F = [Y_L; Y_U], yet Y_L is known in advance: computing it is pointless, since step 2) just puts it back. What we care about is only Y_U. Can we compute Y_U alone? Yes. Partition the matrix P as follows:

P = | P_LL  P_LU |
    | P_UL  P_UU |

The algorithm then reduces to a single operation:

f_U ← P_UU f_U + P_UL Y_L

Iterate this update until convergence and we are done; pretty cool. Notice that f_U depends not only on the labels of the labeled data and their transition probabilities (the term P_UL Y_L), but also on the current labels and transition probabilities of the unlabeled data (the term P_UU f_U). This is how the LP algorithm makes extra use of the distribution of the unlabeled data.

The convergence of this iteration is also very easy to prove; see reference [1] for details. In fact, it converges to the closed-form solution

f_U = (I - P_UU)^(-1) P_UL Y_L

So we could also solve for Y_U directly in one shot. In practice, though, matrix inversion costs O(n^3); when there are very many unlabeled samples, inverting I - P_UU becomes very slow, so an iterative algorithm is usually chosen instead.
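The iterative variant and the closed-form solution are easy to check against each other in numpy. A minimal, self-contained sketch with toy sizes; note that np.linalg.solve is used rather than forming the inverse explicitly, which is cheaper and numerically safer:

import numpy as np

np.random.seed(0)
L, U, C = 2, 6, 2
W = np.random.rand(L + U, L + U)
P = W / W.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
P_UL, P_UU = P[L:, :L], P[L:, L:]
Y_L = np.eye(C)                           # one-hot labels: sample i has class i

# iterative variant: f_U <- P_UU f_U + P_UL Y_L
f_U = np.zeros((U, C))
P_UL_Y_L = P_UL.dot(Y_L)                  # constant term, computed only once
for it in range(1000):
    f_prev = f_U.copy()
    f_U = P_UU.dot(f_U) + P_UL_Y_L
    if np.abs(f_U - f_prev).sum() < 1e-9:
        break

# closed form: f_U = (I - P_UU)^(-1) P_UL Y_L
f_U_direct = np.linalg.solve(np.eye(U) - P_UU, P_UL_Y_L)
assert np.allclose(f_U, f_U_direct, atol=1e-6)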

3. A Python Implementation of the LP Algorithm

I will not repeat the setup of the Python environment; see my earlier posts. The only extra dependencies are the classic numpy and matplotlib. The code includes both graph construction methods, RBF and kNN, and generates two toy datasets: two long bands and two rings. In part four we will experiment with a larger dataset; here we first verify the code's correctness with a simple visualization before moving on.

Algorithm code:

#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import time
import numpy as np

# return k neighbors index
def navie_knn(dataSet, query, k):
    numSamples = dataSet.shape[0]

    ## step 1: calculate Euclidean distance
    diff = np.tile(query, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = np.sum(squaredDiff, axis = 1)  # sum is performed by row

    ## step 2: sort the distance
    sortedDistIndices = np.argsort(squaredDist)
    if k > len(sortedDistIndices):
        k = len(sortedDistIndices)
    return sortedDistIndices[0:k]

# build a big graph (normalized weight matrix)
def buildGraph(MatX, kernel_type, rbf_sigma = None, knn_num_neighbors = None):
    num_samples = MatX.shape[0]
    affinity_matrix = np.zeros((num_samples, num_samples), np.float32)
    if kernel_type == 'rbf':
        if rbf_sigma == None:
            raise ValueError('You should input a sigma of rbf kernel!')
        for i in xrange(num_samples):
            row_sum = 0.0
            for j in xrange(num_samples):
                diff = MatX[i, :] - MatX[j, :]
                affinity_matrix[i][j] = np.exp(sum(diff**2) / (-2.0 * rbf_sigma**2))
                row_sum += affinity_matrix[i][j]
            affinity_matrix[i][:] /= row_sum
    elif kernel_type == 'knn':
        if knn_num_neighbors == None:
            raise ValueError('You should input a k of knn kernel!')
        for i in xrange(num_samples):
            k_neighbors = navie_knn(MatX, MatX[i, :], knn_num_neighbors)
            affinity_matrix[i][k_neighbors] = 1.0 / knn_num_neighbors
    else:
        raise NameError('Not support kernel type! You can use knn or rbf!')
    return affinity_matrix

# label propagation
def labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type = 'rbf', rbf_sigma = 1.5,
                     knn_num_neighbors = 10, max_iter = 500, tol = 1e-3):
    # initialize
    num_label_samples = Mat_Label.shape[0]
    num_unlabel_samples = Mat_Unlabel.shape[0]
    num_samples = num_label_samples + num_unlabel_samples
    labels_list = np.unique(labels)
    num_classes = len(labels_list)

    MatX = np.vstack((Mat_Label, Mat_Unlabel))
    clamp_data_label = np.zeros((num_label_samples, num_classes), np.float32)
    for i in xrange(num_label_samples):
        clamp_data_label[i][labels[i]] = 1.0

    label_function = np.zeros((num_samples, num_classes), np.float32)
    label_function[0 : num_label_samples] = clamp_data_label
    label_function[num_label_samples : num_samples] = -1

    # graph construction
    affinity_matrix = buildGraph(MatX, kernel_type, rbf_sigma, knn_num_neighbors)

    # start to propagation
    iter = 0; pre_label_function = np.zeros((num_samples, num_classes), np.float32)
    changed = np.abs(pre_label_function - label_function).sum()
    while iter < max_iter and changed > tol:
        if iter % 1 == 0:
            print '---> Iteration %d/%d, changed: %f' % (iter, max_iter, changed)
        pre_label_function = label_function
        iter += 1

        # propagation
        label_function = np.dot(affinity_matrix, label_function)

        # clamp
        label_function[0 : num_label_samples] = clamp_data_label

        # check converge
        changed = np.abs(pre_label_function - label_function).sum()

    # get terminate label of unlabeled data
    unlabel_data_labels = np.zeros(num_unlabel_samples)
    for i in xrange(num_unlabel_samples):
        unlabel_data_labels[i] = np.argmax(label_function[i+num_label_samples])

    return unlabel_data_labels

Test code:

#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import time
import math
import numpy as np
from label_propagation import labelPropagation

# show
def show(Mat_Label, labels, Mat_Unlabel, unlabel_data_labels):
    import matplotlib.pyplot as plt

    for i in range(Mat_Label.shape[0]):
        if int(labels[i]) == 0:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Dr')
        elif int(labels[i]) == 1:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Db')
        else:
            plt.plot(Mat_Label[i, 0], Mat_Label[i, 1], 'Dy')

    for i in range(Mat_Unlabel.shape[0]):
        if int(unlabel_data_labels[i]) == 0:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'or')
        elif int(unlabel_data_labels[i]) == 1:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'ob')
        else:
            plt.plot(Mat_Unlabel[i, 0], Mat_Unlabel[i, 1], 'oy')

    plt.xlabel('X1'); plt.ylabel('X2')
    plt.xlim(0.0, 12.)
    plt.ylim(0.0, 12.)
    plt.show()

def loadCircleData(num_data):
    center = np.array([5.0, 5.0])
    radiu_inner = 2
    radiu_outer = 4
    num_inner = num_data / 3
    num_outer = num_data - num_inner

    data = []
    theta = 0.0
    for i in range(num_inner):
        pho = (theta % 360) * math.pi / 180
        tmp = np.zeros(2, np.float32)
        tmp[0] = radiu_inner * math.cos(pho) + np.random.rand(1) + center[0]
        tmp[1] = radiu_inner * math.sin(pho) + np.random.rand(1) + center[1]
        data.append(tmp)
        theta += 2

    theta = 0.0
    for i in range(num_outer):
        pho = (theta % 360) * math.pi / 180
        tmp = np.zeros(2, np.float32)
        tmp[0] = radiu_outer * math.cos(pho) + np.random.rand(1) + center[0]
        tmp[1] = radiu_outer * math.sin(pho) + np.random.rand(1) + center[1]
        data.append(tmp)
        theta += 1

    Mat_Label = np.zeros((2, 2), np.float32)
    Mat_Label[0] = center + np.array([-radiu_inner + 0.5, 0])
    Mat_Label[1] = center + np.array([-radiu_outer + 0.5, 0])
    labels = [0, 1]
    Mat_Unlabel = np.vstack(data)
    return Mat_Label, labels, Mat_Unlabel

def loadBandData(num_unlabel_samples):
    #Mat_Label = np.array([[5.0, 2.], [5.0, 8.0]])
    #labels = [0, 1]
    #Mat_Unlabel = np.array([[5.1, 2.], [5.0, 8.1]])

    Mat_Label = np.array([[5.0, 2.], [5.0, 8.0]])
    labels = [0, 1]
    num_dim = Mat_Label.shape[1]
    Mat_Unlabel = np.zeros((num_unlabel_samples, num_dim), np.float32)
    Mat_Unlabel[:num_unlabel_samples/2, :] = (np.random.rand(num_unlabel_samples/2, num_dim) - 0.5) * np.array([3, 1]) + Mat_Label[0]
    Mat_Unlabel[num_unlabel_samples/2 : num_unlabel_samples, :] = (np.random.rand(num_unlabel_samples/2, num_dim) - 0.5) * np.array([3, 1]) + Mat_Label[1]
    return Mat_Label, labels, Mat_Unlabel

# main function
if __name__ == '__main__':
    num_unlabel_samples = 800
    #Mat_Label, labels, Mat_Unlabel = loadBandData(num_unlabel_samples)
    Mat_Label, labels, Mat_Unlabel = loadCircleData(num_unlabel_samples)

    ## Notice: when using 'rbf' as the kernel, the choice of the hyperparameter 'sigma' is very important!
    ## It should be chosen according to your dataset, specifically the distances between data points. It
    ## should ensure that each point has about 10 near neighbors with w_ij large enough. It also influences
    ## the speed of convergence. So maybe the 'knn' kernel is the better choice!
    #unlabel_data_labels = labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type = 'rbf', rbf_sigma = 0.2)
    unlabel_data_labels = labelPropagation(Mat_Label, Mat_Unlabel, labels, kernel_type = 'knn', knn_num_neighbors = 10, max_iter = 400)
    show(Mat_Label, labels, Mat_Unlabel, unlabel_data_labels)

Everything in the code that needs a comment has one; if anything is unclear, questions are welcome. The results at different iteration counts look like this:

Isn't that a beautiful propagation process? Numerically, too, you can watch it converge gradually as the iterations proceed; the numeric trace of the iterations is as follows:

---> Iteration 0/400, changed: 1602.000000
---> Iteration 1/400, changed: 6.300182
---> Iteration 2/400, changed: 5.129996
---> Iteration 3/400, changed: 4.301994
---> Iteration 4/400, changed: 3.819295
---> Iteration 5/400, changed: 3.501743
---> Iteration 6/400, changed: 3.277122
---> Iteration 7/400, changed: 3.105952
---> Iteration 8/400, changed: 2.967030
---> Iteration 9/400, changed: 2.848606
---> Iteration 10/400, changed: 2.743997
---> Iteration 11/400, changed: 2.649270
---> Iteration 12/400, changed: 2.562057
---> Iteration 13/400, changed: 2.480885
---> Iteration 14/400, changed: 2.404774
---> Iteration 15/400, changed: 2.333075
---> Iteration 16/400, changed: 2.265301
---> Iteration 17/400, changed: 2.201107
---> Iteration 18/400, changed: 2.140209
---> Iteration 19/400, changed: 2.082354
---> Iteration 20/400, changed: 2.027376
---> Iteration 21/400, changed: 1.975071
---> Iteration 22/400, changed: 1.925286
---> Iteration 23/400, changed: 1.877894
---> Iteration 24/400, changed: 1.832743
---> Iteration 25/400, changed: 1.789721
---> Iteration 26/400, changed: 1.748706
---> Iteration 27/400, changed: 1.709593
---> Iteration 28/400, changed: 1.672284
---> Iteration 29/400, changed: 1.636668
---> Iteration 30/400, changed: 1.602668
---> Iteration 31/400, changed: 1.570200
---> Iteration 32/400, changed: 1.539179
---> Iteration 33/400, changed: 1.509530
---> Iteration 34/400, changed: 1.481182
---> Iteration 35/400, changed: 1.454066
---> Iteration 36/400, changed: 1.428120
---> Iteration 37/400, changed: 1.403283
---> Iteration 38/400, changed: 1.379502
---> Iteration 39/400, changed: 1.356734
---> Iteration 40/400, changed: 1.334906
---> Iteration 41/400, changed: 1.313983
---> Iteration 42/400, changed: 1.293921
---> Iteration 43/400, changed: 1.274681
---> Iteration 44/400, changed: 1.256214
---> Iteration 45/400, changed: 1.238491
---> Iteration 46/400, changed: 1.221474
---> Iteration 47/400, changed: 1.205126
---> Iteration 48/400, changed: 1.189417
---> Iteration 49/400, changed: 1.174316
---> Iteration 50/400, changed: 1.159804
---> Iteration 51/400, changed: 1.145844
---> Iteration 52/400, changed: 1.132414
---> Iteration 53/400, changed: 1.119490
---> Iteration 54/400, changed: 1.107032
---> Iteration 55/400, changed: 1.095054
---> Iteration 56/400, changed: 1.083513
---> Iteration 57/400, changed: 1.072397
---> Iteration 58/400, changed: 1.061671
---> Iteration 59/400, changed: 1.051324
---> Iteration 60/400, changed: 1.041363
---> Iteration 61/400, changed: 1.031742
---> Iteration 62/400, changed: 1.022459
---> Iteration 63/400, changed: 1.013494
---> Iteration 64/400, changed: 1.004836
---> Iteration 65/400, changed: 0.996484
---> Iteration 66/400, changed: 0.988407
---> Iteration 67/400, changed: 0.980592
---> Iteration 68/400, changed: 0.973045
---> Iteration 69/400, changed: 0.965744
---> Iteration 70/400, changed: 0.958682
---> Iteration 71/400, changed: 0.951848
---> Iteration 72/400, changed: 0.945227
---> Iteration 73/400, changed: 0.938820
---> Iteration 74/400, changed: 0.932608
---> Iteration 75/400, changed: 0.926590
---> Iteration 76/400, changed: 0.920765
---> Iteration 77/400, changed: 0.915107
---> Iteration 78/400, changed: 0.909628
---> Iteration 79/400, changed: 0.904309
---> Iteration 80/400, changed: 0.899143
---> Iteration 81/400, changed: 0.894122
---> Iteration 82/400, changed: 0.889259
---> Iteration 83/400, changed: 0.884530
---> Iteration 84/400, changed: 0.879933
---> Iteration 85/400, changed: 0.875464
---> Iteration 86/400, changed: 0.871121
---> Iteration 87/400, changed: 0.866888
---> Iteration 88/400, changed: 0.862773
---> Iteration 89/400, changed: 0.858783
---> Iteration 90/400, changed: 0.854879
---> Iteration 91/400, changed: 0.851084
---> Iteration 92/400, changed: 0.847382
---> Iteration 93/400, changed: 0.843779
---> Iteration 94/400, changed: 0.840274
---> Iteration 95/400, changed: 0.836842
---> Iteration 96/400, changed: 0.833501
---> Iteration 97/400, changed: 0.830240
---> Iteration 98/400, changed: 0.827051
---> Iteration 99/400, changed: 0.823950
---> Iteration 100/400, changed: 0.820906
---> Iteration 101/400, changed: 0.817946
---> Iteration 102/400, changed: 0.815053
---> Iteration 103/400, changed: 0.812217
---> Iteration 104/400, changed: 0.809437
---> Iteration 105/400, changed: 0.806724
---> Iteration 106/400, changed: 0.804076
---> Iteration 107/400, changed: 0.801480
---> Iteration 108/400, changed: 0.798937
---> Iteration 109/400, changed: 0.796448
---> Iteration 110/400, changed: 0.794008
---> Iteration 111/400, changed: 0.791612
---> Iteration 112/400, changed: 0.789282
---> Iteration 113/400, changed: 0.786984
---> Iteration 114/400, changed: 0.784728
---> Iteration 115/400, changed: 0.782516
---> Iteration 116/400, changed: 0.780355
---> Iteration 117/400, changed: 0.778216
---> Iteration 118/400, changed: 0.776139
---> Iteration 119/400, changed: 0.774087
---> Iteration 120/400, changed: 0.772072
---> Iteration 121/400, changed: 0.770085
---> Iteration 122/400, changed: 0.768146
---> Iteration 123/400, changed: 0.766232
---> Iteration 124/400, changed: 0.764356
---> Iteration 125/400, changed: 0.762504
---> Iteration 126/400, changed: 0.760685
---> Iteration 127/400, changed: 0.758889
---> Iteration 128/400, changed: 0.757135
---> Iteration 129/400, changed: 0.755406

4. An MPI-Parallel Implementation of the LP Algorithm

Here we experiment with the variant version of LP. From the formula we can see that the second term, P_UL Y_L, never changes across iterations, so it can be computed once before iterating, avoiding repeated work. Even so, every LP iteration must multiply the U×U matrix P_UU by the U×C matrix f_U; when the unlabeled data are very numerous and there are many classes, this is slow and uses a great deal of memory. On top of that, constructing the graph requires pairwise similarities, an O(n^2) computation that also becomes considerable when the feature dimension is large. So we have to consider parallelization, preferably on a cluster. How do we parallelize?

Parallelizing an algorithm generally takes one of two forms: data parallelism and model parallelism.

Data parallelism is easy to understand: partition the data so that each node processes only its share. For instance, when building the graph we compute every sample's k nearest neighbors; with 1000 samples and 20 CPU nodes we distribute them evenly, let each node compute the k nearest neighbors of 50 samples, and merge the results at the end. The speedup is clearly considerable; a sketch of the pattern follows.
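With mpi4py, the even split is one call to np.linspace. Here is a hedged sketch of the data-parallel pattern (do_work is a stand-in for one unit of work, such as a single kNN query; the real version of this pattern appears in buildSubGraph_MPI below):

import numpy as np
import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank, comm_size = comm.Get_rank(), comm.Get_size()

def do_work(i):
    return i * i          # stand-in for one unit of work, e.g. a kNN query

num_samples = 1000
# boundaries of each node's share, e.g. 20 nodes -> 50 samples apiece
offset = np.linspace(0, num_samples, comm_size + 1).astype('int')

# every node processes only its own slice of the samples...
local_result = [do_work(i) for i in range(offset[comm_rank], offset[comm_rank + 1])]

# ...and the partial results are merged at the end
all_results = np.hstack(comm.allgather(local_result))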

Model parallelism usually arises when the model is too large to fit in a single machine's memory. For example, training an enormous deep neural network requires slicing up the network, computing the gradients separately, and having a leader node collect everyone's gradients and send updates back; of course there are more refined and efficient engineering schemes for this. Our LP algorithm also admits model parallelism: if the number of classes C is large, splitting the classes across CPU nodes is in effect model parallelism.

Why split the smaller matrix f_U instead of the big matrix P_UU? Because P_UU cannot be cut into independent blocks, and one principle of parallelism is that the pieces of work must be independent. Every row of f_U depends on all of the unlabeled samples, so if P_UU were split across nodes, every update of f_U would require communication with the other nodes, and that communication is expensive. (In practice, communication is the bottleneck that keeps many parallel systems from reaching linear speedup, where adding n machines makes things n times faster.) Splitting over the classes C, that is, over the columns of f_U, has no such problem, because the per-class computations are independent; only when deciding each sample's final class do we gather all the pieces of f_U and take the max, as sketched below.
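Continuing the MPI sketch above, the model-parallel side assigns each node a contiguous block of classes; its columns of f_U are updated with no communication at all, and only the final decision gathers each node's best class and score per sample (stand-in data; it assumes, as the code below does, that the number of processors does not exceed the number of classes):

num_classes, U = 10, 8
class_offset = np.linspace(0, num_classes, comm_size + 1).astype('int')
local_C = class_offset[comm_rank + 1] - class_offset[comm_rank]

# stand-in for this node's columns of f_U after the iterations have finished;
# the updates themselves (local_f_U = P_UU.dot(local_f_U) + local_label_info)
# touch only these columns and need no communication at all
local_f_U = np.random.rand(U, local_C)

# every node reports its best local class and score for each sample...
local_best = local_f_U.argmax(axis=1) + class_offset[comm_rank]
local_score = local_f_U.max(axis=1)

# ...and the global argmax over all nodes gives the final labels
all_best = np.vstack(comm.allgather(local_best))      # comm_size x U
all_score = np.vstack(comm.allgather(local_score))    # comm_size x U
final_label = all_best[all_score.argmax(axis=0), np.arange(U)]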

So the code below contains prototypes of both data parallelism and model parallelism. One more thing worth mentioning: this is an iterative algorithm, so when do we stop it? Besides testing for convergence, we can also evaluate on test labels every few iterations to see how the overall training performance is doing; this is especially useful for judging whether training has overfit. The code includes this as well.

At last, the code. You can throw a larger dataset at it, all the better if you have an MPI cluster available.

The code below depends on numpy, scipy (its sparse matrices speed up the computation) and mpi4py. mpi4py in turn requires OpenMPI and Cython; see my earlier post for installation instructions.

#***************************************************************************
#*
#* Description: label propagation
#* Author: Zou Xiaoyi (zouxy09@qq.com)
#* Date: 2015-10-15
#* HomePage: http://blog.csdn.net/zouxy09
#*
#**************************************************************************

import os, sys, time
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix, eye
import operator
import cPickle as pickle
import mpi4py.MPI as MPI

#
# Global variables for MPI
#
# instance for invoking MPI related functions
comm = MPI.COMM_WORLD
# the node rank in the whole community
comm_rank = comm.Get_rank()
# the size of the whole community, i.e., the total number of working nodes in the MPI cluster
comm_size = comm.Get_size()

# load mnist dataset
def load_MNIST():
    import gzip
    f = gzip.open('mnist.pkl.gz', 'rb')
    train, val, test = pickle.load(f)
    f.close()
    Mat_Label = train[0]
    labels = train[1]
    Mat_Unlabel = test[0]
    groundtruth = test[1]
    labels_id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    return Mat_Label, labels, labels_id, Mat_Unlabel, groundtruth

# return k neighbors index
def navie_knn(dataSet, query, k):
    numSamples = dataSet.shape[0]

    ## step 1: calculate Euclidean distance
    diff = np.tile(query, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = np.sum(squaredDiff, axis = 1)  # sum is performed by row

    ## step 2: sort the distance
    sortedDistIndices = np.argsort(squaredDist)
    if k > len(sortedDistIndices):
        k = len(sortedDistIndices)
    return sortedDistIndices[0:k]

# build a big graph (normalized weight matrix)
# sparse U x (U + L) matrix
def buildSubGraph(Mat_Label, Mat_Unlabel, knn_num_neighbors):
    num_unlabel_samples = Mat_Unlabel.shape[0]
    data = []; indices = []; indptr = [0]
    Mat_all = np.vstack((Mat_Label, Mat_Unlabel))
    values = np.ones(knn_num_neighbors, np.float32) / knn_num_neighbors
    for i in xrange(num_unlabel_samples):
        k_neighbors = navie_knn(Mat_all, Mat_Unlabel[i, :], knn_num_neighbors)
        indptr.append(np.int32(indptr[-1]) + knn_num_neighbors)
        indices.extend(k_neighbors)
        data.append(values)
    return csr_matrix((np.hstack(data), indices, indptr))

# build a big graph (normalized weight matrix)
# sparse U x (U + L) matrix
def buildSubGraph_MPI(Mat_Label, Mat_Unlabel, knn_num_neighbors):
    num_unlabel_samples = Mat_Unlabel.shape[0]
    local_data = []; local_indices = []; local_indptr = [0]
    Mat_all = np.vstack((Mat_Label, Mat_Unlabel))
    values = np.ones(knn_num_neighbors, np.float32) / knn_num_neighbors
    sample_offset = np.linspace(0, num_unlabel_samples, comm_size + 1).astype('int')
    for i in range(sample_offset[comm_rank], sample_offset[comm_rank+1]):
        k_neighbors = navie_knn(Mat_all, Mat_Unlabel[i, :], knn_num_neighbors)
        local_indptr.append(np.int32(local_indptr[-1]) + knn_num_neighbors)
        local_indices.extend(k_neighbors)
        local_data.append(values)
    data = np.hstack(comm.allgather(local_data))
    indices = np.hstack(comm.allgather(local_indices))
    indptr_tmp = comm.allgather(local_indptr)
    indptr = []
    for i in range(len(indptr_tmp)):
        if i == 0:
            indptr.extend(indptr_tmp[i])
        else:
            last_indptr = indptr[-1]
            del(indptr[-1])
            indptr.extend(indptr_tmp[i] + last_indptr)
    return csr_matrix((np.hstack(data), indices, indptr), dtype = np.float32)

# label propagation
def run_label_propagation_sparse(knn_num_neighbors = 20, max_iter = 100, tol = 1e-4, test_per_iter = 1):
    # load data and graph
    print 'Processor %d/%d loading graph file...' % (comm_rank, comm_size)
    #Mat_Label, labels, Mat_Unlabel, groundtruth = loadFourBandData()
    Mat_Label, labels, labels_id, Mat_Unlabel, unlabel_data_id = load_MNIST()
    if comm_size > len(labels_id):
        raise ValueError('Sorry, the processors must be less than the number of classes')
    #affinity_matrix = buildSubGraph(Mat_Label, Mat_Unlabel, knn_num_neighbors)
    affinity_matrix = buildSubGraph_MPI(Mat_Label, Mat_Unlabel, knn_num_neighbors)

    # get some parameters
    num_classes = len(labels_id)
    num_label_samples = len(labels)
    num_unlabel_samples = Mat_Unlabel.shape[0]
    affinity_matrix_UL = affinity_matrix[:, 0:num_label_samples]
    affinity_matrix_UU = affinity_matrix[:, num_label_samples:num_label_samples+num_unlabel_samples]
    if comm_rank == 0:
        print 'Have %d labeled images, %d unlabeled images and %d classes' % (num_label_samples, num_unlabel_samples, num_classes)

    # divide label_function_U and label_function_L to all processors
    class_offset = np.linspace(0, num_classes, comm_size + 1).astype('int')

    # initialize local label_function_U
    local_start_class = class_offset[comm_rank]
    local_num_classes = class_offset[comm_rank+1] - local_start_class
    local_label_function_U = eye(num_unlabel_samples, local_num_classes, 0, np.float32, format='csr')

    # initialize local label_function_L
    local_label_function_L = lil_matrix((num_label_samples, local_num_classes), dtype = np.float32)
    for i in xrange(num_label_samples):
        class_off = int(labels[i]) - local_start_class
        if class_off >= 0 and class_off < local_num_classes:
            local_label_function_L[i, class_off] = 1.0
    local_label_function_L = local_label_function_L.tocsr()
    local_label_info = affinity_matrix_UL.dot(local_label_function_L)
    print 'Processor %d/%d has to process %d classes...' % (comm_rank, comm_size, local_label_function_L.shape[1])

    # start to propagation
    iter = 1; changed = 100.0;
    evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)
    while True:
        pre_label_function = local_label_function_U.copy()

        # propagation
        local_label_function_U = affinity_matrix_UU.dot(local_label_function_U) + local_label_info

        # check converge
        local_changed = abs(pre_label_function - local_label_function_U).sum()
        changed = comm.reduce(local_changed, root = 0, op = MPI.SUM)
        status = 'RUN'
        test = False
        if comm_rank == 0:
            if iter % 1 == 0:
                norm_changed = changed / (num_unlabel_samples * num_classes)
                print '---> Iteration %d/%d, changed: %f' % (iter, max_iter, norm_changed)
            if iter >= max_iter or changed < tol:
                status = 'STOP'
                print '************** Iteration over! ****************'
            if iter % test_per_iter == 0:
                test = True
            iter += 1
        test = comm.bcast(test if comm_rank == 0 else None, root = 0)
        status = comm.bcast(status if comm_rank == 0 else None, root = 0)
        if status == 'STOP':
            break
        if test == True:
            evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)
    evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id)

def evaluation(num_unlabel_samples, local_start_class, local_label_function_U, unlabel_data_id, labels_id):
    # get local label with max score
    if comm_rank == 0:
        print 'Start to combine local result...'
    local_max_score = np.zeros((num_unlabel_samples, 1), np.float32)
    local_max_label = np.zeros((num_unlabel_samples, 1), np.int32)
    for i in xrange(num_unlabel_samples):
        local_max_label[i, 0] = np.argmax(local_label_function_U.getrow(i).todense())
        local_max_score[i, 0] = local_label_function_U[i, local_max_label[i, 0]]
        local_max_label[i, 0] += local_start_class

    # gather the results from all the processors
    if comm_rank == 0:
        print 'Start to gather results from all processors'
    all_max_label = np.hstack(comm.allgather(local_max_label))
    all_max_score = np.hstack(comm.allgather(local_max_score))

    # get terminate label of unlabeled data
    if comm_rank == 0:
        print 'Start to analysis the results...'
        right_predict_count = 0
        for i in xrange(num_unlabel_samples):
            if i % 1000 == 0:
                print '***', all_max_score[i]
            max_idx = np.argmax(all_max_score[i])
            max_label = all_max_label[i, max_idx]
            if int(unlabel_data_id[i]) == int(labels_id[max_label]):
                right_predict_count += 1
        accuracy = float(right_predict_count) * 100.0 / num_unlabel_samples
        print 'Have %d samples, accuracy: %.3f%%!' % (num_unlabel_samples, accuracy)

if __name__ == '__main__':
    run_label_propagation_sparse(knn_num_neighbors = 20, max_iter = 30)
