Implementing the KNN Algorithm from Machine Learning in Action (《机器学习实战》)
Published: 2019-05-27



This series follows the book Machine Learning in Action (《机器学习实战》); it is only a record of my own learning process, not a detailed write-up.

Note: after watching Andrew Ng's machine learning videos for a while, I felt that just watching without practicing was not enough, so now I practice while working through the theory.

I read about and took notes on KNN quite a while ago, so I won't explain it in much detail here. These notes predate my k-means notes, and the ideas are very similar:

1. A simple classifier

Code:

import numpy as np
import operator
import KNN

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of samples
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # difference between every sample and the test point
    sqDiffMat = diffMat ** 2  # squared differences
    sqDistances = sqDiffMat.sum(axis=1)  # sum along axis 1, i.e. across the columns of each row
    distances = sqDistances ** 0.5  # square root -> Euclidean distances
    sortedDistIndicies = distances.argsort()  # indices sorted by distance
    classCount = {}

    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count the votes per label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # sort by count; operator.itemgetter(1) selects the second field
    # after sorting this is no longer a dict but a list of tuples, e.g. [('a', 3), ('c', 2)]
    return sortedClassCount[0][0]

if __name__ == '__main__':
    group, labels = KNN.createDataSet()
    result = classify0([0, 0.5], group, labels, 1)
    print(result)
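As a quick sanity check (my own sketch, not from the book), the distances from the test point [0, 0.5] to the four points in createDataSet can also be computed directly with NumPy broadcasting; the nearest point is [0, 0.1], so with k=1 the script above prints 'D'.

import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'B', 'C', 'D']
inX = np.array([0, 0.5])

# Euclidean distance from inX to every row of group
distances = np.sqrt(((group - inX) ** 2).sum(axis=1))
print(distances)                   # approximately [1.166, 1.118, 0.5, 0.4]
print(labels[distances.argmin()])  # 'D', the single nearest neighbour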

The KNN.py file:

import numpy as np
import operator


def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'B', 'C', 'D']
    return group, labels

2. Predicting dating-site matches

The code and comments for each part are given below:

A. Converting the text file into usable data

The text file contains whitespace and newlines, and the samples and labels are stored together; they have to be separated and turned into matrices before the next steps can proceed.

def file2matrix(filename):  # convert the file into usable arrays
    fr = open(filename)  # open the file
    arrayOLines = fr.readlines()  # read all lines
    numberOfLines = len(arrayOLines)  # number of lines
    returnMat = np.zeros([numberOfLines, 3])  # array that holds the features
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()  # strip the newline
        listFromLine = line.split('\t')  # split on tab characters
        returnMat[index, :] = listFromLine[0:3]  # the three features
        classLabelVector.append(int(listFromLine[-1]))  # the label
        index += 1
    return returnMat, classLabelVector  # return features and labels
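A minimal usage sketch, assuming each line of datingTestSet2.txt (the sample file that ships with the book's code) holds three tab-separated numeric features followed by an integer label:

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
print(datingDataMat.shape)  # (number_of_lines, 3) feature matrix
print(datingLabels[:5])     # the first few integer labels, values in {1, 2, 3}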

B. Normalization

The feature scales differ far too much. Say there are three features a=[1,2,3], b=[1000,2000,3000], c=[0.1,0.2,0.3]: a and c contribute almost nothing to the distance because b's values are so large, i.e. b effectively carries far too much weight. In Ng's course this kind of data can also be handled with penalty coefficients or regularization, but that is an aside here.

def autoNorm(dataSet):  # min-max normalization
    # column-wise minimum and maximum
    minValue = dataSet.min(0)
    maxValue = dataSet.max(0)
    ranges = maxValue - minValue
    # matrix of column minima
    midData = np.tile(minValue, [dataSet.shape[0], 1])
    dataSet = dataSet - midData
    # matrix of column ranges
    ranges = np.tile(ranges, [dataSet.shape[0], 1])
    dataSet = dataSet / ranges  # element-wise division, not matrix division
    return dataSet, minValue, maxValue
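A small sketch of autoNorm applied to the toy features mentioned above (assuming the function defined here is in scope); after scaling, every column lies in [0, 1], so no single feature dominates the distance:

import numpy as np

data = np.array([[1.0, 1000.0, 0.1],
                 [2.0, 2000.0, 0.2],
                 [3.0, 3000.0, 0.3]])
normData, minValue, maxValue = autoNorm(data)
# each column becomes (x - min) / (max - min):
# [[0.   0.   0. ]
#  [0.5  0.5  0.5]
#  [1.   1.   1. ]]
print(normData)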

C. Prediction

KNN boils down to distances: compute the distance to every training sample, take the k nearest ones, and choose the class that holds the largest share among them.

def classify0(inX, dataSet, labels, k):  # core classifier
    dataSetSize = dataSet.shape[0]  # number of samples
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # difference between every sample and the test point
    sqDiffMat = diffMat ** 2  # squared differences
    sqDistances = sqDiffMat.sum(axis=1)  # sum along axis 1, i.e. across the columns of each row
    distances = sqDistances ** 0.5  # square root -> Euclidean distances
    sortedDistIndicies = distances.argsort()  # indices sorted by distance
    classCount = {}

    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count the votes per label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
                              reverse=True)  # sort by count; operator.itemgetter(1) selects the second field
    # after sorting this is no longer a dict but a list of tuples, e.g. [('a', 3), ('c', 2)]
    return sortedClassCount[0][0]
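As a side note, the np.tile step can be replaced by NumPy broadcasting; the sketch below is an equivalent alternative I wrote for comparison, not the book's code (ties among the k votes may be broken differently):

import numpy as np

def classify0_broadcast(inX, dataSet, labels, k):
    # broadcasting subtracts inX from every row without building a tiled copy
    distances = np.sqrt(((dataSet - np.asarray(inX)) ** 2).sum(axis=1))
    kNearest = [labels[i] for i in distances.argsort()[:k]]
    # majority vote among the k nearest labels
    return max(set(kNearest), key=kNearest.count)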

D. Performance test

For example, with 1000 records, use 900 as training samples and 100 for testing, and see how accurate the classifier is.

def datingClassTest():
    hoRatio = 0.2
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, minValue, maxValue = autoNorm(datingDataMat)  # autoNorm returns three values
    n = normMat.shape[0]
    numTestVecs = int(n * hoRatio)  # split point between test and training data
    errorCount = 0.0
    # rows [numTestVecs:n] are training samples, rows [0:numTestVecs) are test samples
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:n, :],
                                     datingLabels[numTestVecs:n], 3)
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
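To make the split concrete (a sketch of the indexing above, assuming a file with 1000 rows and hoRatio = 0.2):

n = 1000
hoRatio = 0.2
numTestVecs = int(n * hoRatio)     # 200
testRows = range(0, numTestVecs)   # rows 0..199 are classified one by one
trainRows = range(numTestVecs, n)  # against rows 200..999 used as training samples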

E. Classifying a real input

Note that the input data must be normalized as well.

def classfiPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    personTats = float(input('please input video game \n'))
    ffMiles = float(input('please input flier miles \n'))
    iceCream = float(input('please input ice cream \n'))
    datingData, datingLabels = file2matrix('datingTestSet2.txt')
    normData, minData, maxData = autoNorm(datingData)
    inputData = np.array([ffMiles, personTats, iceCream])  # match the column order of datingTestSet2.txt: flier miles, video-game time, ice cream
    inputData = (inputData - minData) / (maxData - minData)  # normalize the input the same way as the training data
    result = classify0(inputData, normData, datingLabels, 3)
    print('predicted level:', resultList[result - 1])

F. Visualization

import matplotlib.pyplot as plt
import KNN

datingDatas, datingLabels = KNN.file2matrix('datingTestSet2.txt')
# visualize the samples
fig = plt.figure('data_show')
ax = fig.add_subplot(111)
for i in range(datingDatas.shape[0]):
    if datingLabels[i] == 1:
        ax.scatter(datingDatas[i, 0], datingDatas[i, 1], marker="*", c='r')  # plot using the first two features
    if datingLabels[i] == 2:
        ax.scatter(datingDatas[i, 0], datingDatas[i, 1], marker="s", c='g')
    if datingLabels[i] == 3:
        ax.scatter(datingDatas[i, 0], datingDatas[i, 1], marker="^", c='b')
plt.show()

G. Complete code

import numpy as np
import operator
#from numpy import *

def createDataSet():  # a few points for a quick test
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'B', 'C', 'D']
    return group, labels

def autoNorm(dataSet):  # min-max normalization
    # column-wise minimum and maximum
    minValue = dataSet.min(0)
    maxValue = dataSet.max(0)
    ranges = maxValue - minValue
    # matrix of column minima
    midData = np.tile(minValue, [dataSet.shape[0], 1])
    dataSet = dataSet - midData
    # matrix of column ranges
    ranges = np.tile(ranges, [dataSet.shape[0], 1])
    dataSet = dataSet / ranges  # element-wise division, not matrix division
    return dataSet, minValue, maxValue

def file2matrix(filename):  # convert the file into usable arrays
    fr = open(filename)  # open the file
    arrayOLines = fr.readlines()  # read all lines
    numberOfLines = len(arrayOLines)  # number of lines
    returnMat = np.zeros([numberOfLines, 3])  # array that holds the features
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()  # strip the newline
        listFromLine = line.split('\t')  # split on tab characters
        returnMat[index, :] = listFromLine[0:3]  # the three features
        classLabelVector.append(int(listFromLine[-1]))  # the label
        index += 1
    return returnMat, classLabelVector  # return features and labels

def classify0(inX, dataSet, labels, k):  # core classifier
    dataSetSize = dataSet.shape[0]  # number of samples
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # difference between every sample and the test point
    sqDiffMat = diffMat ** 2  # squared differences
    sqDistances = sqDiffMat.sum(axis=1)  # sum along axis 1, i.e. across the columns of each row
    distances = sqDistances ** 0.5  # square root -> Euclidean distances
    sortedDistIndicies = distances.argsort()  # indices sorted by distance
    classCount = {}

    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count the votes per label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
                              reverse=True)  # sort by count; operator.itemgetter(1) selects the second field
    # after sorting this is no longer a dict but a list of tuples, e.g. [('a', 3), ('c', 2)]
    return sortedClassCount[0][0]

def datingClassTest():
    hoRatio = 0.2
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, minValue, maxValue = autoNorm(datingDataMat)  # autoNorm returns three values
    n = normMat.shape[0]
    numTestVecs = int(n * hoRatio)  # split point between test and training data
    errorCount = 0.0
    # rows [numTestVecs:n] are training samples, rows [0:numTestVecs) are test samples
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:n, :],
                                     datingLabels[numTestVecs:n], 3)
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

def classfiPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    personTats = float(input('please input video game \n'))
    ffMiles = float(input('please input flier miles \n'))
    iceCream = float(input('please input ice cream \n'))
    datingData, datingLabels = file2matrix('datingTestSet2.txt')
    normData, minData, maxData = autoNorm(datingData)
    inputData = np.array([ffMiles, personTats, iceCream])  # match the column order of datingTestSet2.txt: flier miles, video-game time, ice cream
    inputData = (inputData - minData) / (maxData - minData)  # normalize the input the same way as the training data
    result = classify0(inputData, normData, datingLabels, 3)
    print('predicted level:', resultList[result - 1])

3. Handwritten digit recognition

A. Converting the files

def img2vector(filename):
    returnVector = np.zeros([32, 32])
    fr = open(filename)
    lineData = fr.readlines()
    count = 0
    for line in lineData:
        line = line.strip()  # strip the newline
        for j in range(len(line)):
            returnVector[count, j] = line[j]
        count += 1
    returnVector = returnVector.reshape(1, 1024).astype(int)  # flatten to 1x1024
    return returnVector
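The digit files are assumed to be plain text with 32 rows of 32 characters, each '0' or '1' (as in the book's trainingDigits data); a minimal usage sketch, assuming a file such as trainingDigits/0_13.txt exists:

# the first lines of a digit file look roughly like:
# 00000000000001111000000000000000
# 00000000000011111110000000000000
# ...
vec = img2vector('trainingDigits/0_13.txt')
print(vec.shape)    # (1, 1024)
print(vec[0, :32])  # the first image row, flattened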

B. Classification

def handWriteringClassTest():
    # -------------------------- load the training data --------------------------
    hwLabels = []
    trainingFileList = os.listdir('trainingDigits')  # list the files in the training directory
    m = len(trainingFileList)  # number of training files
    trainingMat = np.zeros([m, 1024])  # all training samples
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # file name without the extension
        classNumStr = int(fileStr.split('_')[0])  # the digit class 0-9 at the start of the name
        hwLabels.append(classNumStr)  # store the label
        dirList = 'trainingDigits/' + fileNameStr  # path to the file
        vectorUnderTest = img2vector(dirList)  # read the i-th sample
        trainingMat[i, :] = vectorUnderTest  # store the sample
    # -------------------------- run the test data --------------------------
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    m = len(testFileList)
    for i in range(m):
        fileNameStr = testFileList[i]
        fileInt = fileNameStr.split('.')[0].split('_')[0]
        dirList = 'testDigits/' + fileNameStr  # path to the file
        vectorUnderTest = img2vector(dirList)  # read the i-th test sample
        if int(fileInt) != int(classify0(vectorUnderTest, trainingMat, hwLabels, 3)):
            errorCount += 1
    print('error count is : ', errorCount)
    print('error Rate is : ', (errorCount / m))
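handWriteringClassTest assumes two directories next to the script, each full of digit text files whose names start with the true class, e.g. 0_13.txt is a '0' (this matches the book's trainingDigits/testDigits layout; the file names below are placeholders):

# trainingDigits/   0_0.txt, 0_1.txt, ..., 9_xx.txt   # training files
# testDigits/       0_0.txt, ..., 9_xx.txt            # test files
handWriteringClassTest()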

C. Complete code

import numpy as np
import operator
import os
#from numpy import *

def createDataSet():  # a few points for a quick test
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'B', 'C', 'D']
    return group, labels

def autoNorm(dataSet):  # min-max normalization
    # column-wise minimum and maximum
    minValue = dataSet.min(0)
    maxValue = dataSet.max(0)
    ranges = maxValue - minValue
    # matrix of column minima
    midData = np.tile(minValue, [dataSet.shape[0], 1])
    dataSet = dataSet - midData
    # matrix of column ranges
    ranges = np.tile(ranges, [dataSet.shape[0], 1])
    dataSet = dataSet / ranges  # element-wise division, not matrix division
    return dataSet, minValue, maxValue

def file2matrix(filename):  # convert the file into usable arrays
    fr = open(filename)  # open the file
    arrayOLines = fr.readlines()  # read all lines
    numberOfLines = len(arrayOLines)  # number of lines
    returnMat = np.zeros([numberOfLines, 3])  # array that holds the features
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()  # strip the newline
        listFromLine = line.split('\t')  # split on tab characters
        returnMat[index, :] = listFromLine[0:3]  # the three features
        classLabelVector.append(int(listFromLine[-1]))  # the label
        index += 1
    return returnMat, classLabelVector  # return features and labels

def classify0(inX, dataSet, labels, k):  # core classifier
    dataSetSize = dataSet.shape[0]  # number of samples
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # difference between every sample and the test point
    sqDiffMat = diffMat ** 2  # squared differences
    sqDistances = sqDiffMat.sum(axis=1)  # sum along axis 1, i.e. across the columns of each row
    distances = sqDistances ** 0.5  # square root -> Euclidean distances
    sortedDistIndicies = distances.argsort()  # indices sorted by distance
    classCount = {}

    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count the votes per label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),
                              reverse=True)  # sort by count; operator.itemgetter(1) selects the second field
    # after sorting this is no longer a dict but a list of tuples, e.g. [('a', 3), ('c', 2)]
    return sortedClassCount[0][0]

def datingClassTest():
    hoRatio = 0.2
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, minValue, maxValue = autoNorm(datingDataMat)  # autoNorm returns three values
    n = normMat.shape[0]
    numTestVecs = int(n * hoRatio)  # split point between test and training data
    errorCount = 0.0
    # rows [numTestVecs:n] are training samples, rows [0:numTestVecs) are test samples
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:n, :],
                                     datingLabels[numTestVecs:n], 3)
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

def classfiPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    personTats = float(input('please input video game \n'))
    ffMiles = float(input('please input flier miles \n'))
    iceCream = float(input('please input ice cream \n'))
    datingData, datingLabels = file2matrix('datingTestSet2.txt')
    normData, minData, maxData = autoNorm(datingData)
    inputData = np.array([ffMiles, personTats, iceCream])  # match the column order of datingTestSet2.txt: flier miles, video-game time, ice cream
    inputData = (inputData - minData) / (maxData - minData)  # normalize the input the same way as the training data
    result = classify0(inputData, normData, datingLabels, 3)
    print('predicted level:', resultList[result - 1])

def img2vector(filename):
    returnVector = np.zeros([32, 32])
    fr = open(filename)
    lineData = fr.readlines()
    count = 0
    for line in lineData:
        line = line.strip()  # strip the newline
        for j in range(len(line)):
            returnVector[count, j] = line[j]
        count += 1
    returnVector = returnVector.reshape(1, 1024).astype(int)  # flatten to 1x1024
    return returnVector

def img2vector2(filename):  # alternative: read the 32x32 file directly into a 1x1024 vector
    returnVect = np.zeros([1, 1024])
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

def handWriteringClassTest():
    # -------------------------- load the training data --------------------------
    hwLabels = []
    trainingFileList = os.listdir('trainingDigits')  # list the files in the training directory
    m = len(trainingFileList)  # number of training files
    trainingMat = np.zeros([m, 1024])  # all training samples
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # file name without the extension
        classNumStr = int(fileStr.split('_')[0])  # the digit class 0-9 at the start of the name
        hwLabels.append(classNumStr)  # store the label
        dirList = 'trainingDigits/' + fileNameStr  # path to the file
        vectorUnderTest = img2vector(dirList)  # read the i-th sample
        trainingMat[i, :] = vectorUnderTest  # store the sample
    # -------------------------- run the test data --------------------------
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    m = len(testFileList)
    for i in range(m):
        fileNameStr = testFileList[i]
        fileInt = fileNameStr.split('.')[0].split('_')[0]
        dirList = 'testDigits/' + fileNameStr  # path to the file
        vectorUnderTest = img2vector(dirList)  # read the i-th test sample
        if int(fileInt) != int(classify0(vectorUnderTest, trainingMat, hwLabels, 3)):
            errorCount += 1
    print('error count is : ', errorCount)
    print('error Rate is : ', (errorCount / m))

