How can I avoid NaN problems?
I keep getting "Mean of empty slice" runtime warnings. When I print my variables (numpy arrays), several of them contain nan values. The runtime warning points at line 58 as the problem. What can I change to make it work?
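For reference, the warning can be reproduced in isolation. The sketch below assumes the mean is taken over a boolean-mask selection that matches no rows; the `points` and `labels` arrays are made-up illustration data, not values from my program.

```python
import warnings
import numpy as np

# Hypothetical data: three points, all labelled as belonging to cluster 1.
points = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]])
labels = np.array([1, 1, 1])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The mask selects zero rows, so np.mean averages an empty slice.
    empty_mean = np.mean(points[labels == 0], axis=0)

print(empty_mean)                          # every entry is nan
print([str(w.message) for w in caught])    # includes "Mean of empty slice."
```

If this is what happens in the program, the nan rows would correspond to centroids that ended up with no points assigned to them.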
Sometimes the program runs without problems. Most of the time it does not.
This is a K-Means algorithm written from scratch that clusters the iris data set. It first asks the user how many centroids (clusters) they want. It then randomly generates that many centroids, drawing each coordinate from the range of the numbers in the loaded text file.
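The initialization described above could be sketched roughly as follows. This is a vectorized approximation rather than the code from the post; the names `rng`, `data`, `k` and the three sample rows are assumptions made for illustration. The `per_column` variant is an alternative I have not tried in the program; it only shows what drawing from each column's own range would look like.

```python
import numpy as np

rng = np.random.RandomState(0)        # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the rows loaded from the text file.
data = np.array([[5.1, 3.5, 1.4, 0.2],
                 [7.9, 3.8, 6.4, 2.0],
                 [4.3, 3.0, 1.1, 0.1]])
k = 3                                 # number of centroids requested

# As described: one global range (min and max over the whole file)
# is used for every coordinate of every centroid.
low, high = data.min(), data.max()
centroids = np.round(rng.uniform(low, high, size=(k, data.shape[1])), 1)

# Alternative (assumption): draw each coordinate from its own column's range.
per_column = rng.uniform(data.min(axis=0), data.max(axis=0),
                         size=(k, data.shape[1]))

print(centroids)
print(per_column)
```

With a single global range, a coordinate in a column whose values are all small can still be drawn near the global maximum, so a starting centroid may land far from every data point.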
I put the break in the else statement to avoid infinite loops.
Is it because I get numbers below zero when the centroids are subtracted from the data points in the file?
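To check the negative-number idea, here is a small sketch of the same subtract, norm, argmin sequence on made-up values (the two `points` and two `centroids` are hypothetical). It suggests the negative differences are harmless, because the norm squares them away, while the assignment step can still leave a centroid with no points.

```python
import numpy as np

# Hypothetical values: two data points and two centroids.
points = np.array([[5.1, 3.5, 1.4, 0.2],
                   [6.3, 3.3, 6.0, 2.5]])
centroids = np.array([[1.4, 7.9, 0.2, 3.4],
                      [5.8, 3.1, 3.8, 1.2]])

# Pairwise differences, shape (n_points, n_centroids, n_features).
diffs = centroids[None, :, :] - points[:, None, :]
# Euclidean distances are never negative, whatever the sign of diffs.
dists = np.linalg.norm(diffs, axis=-1)
# Index of the nearest centroid for each point.
labels = np.argmin(dists, axis=-1)
# How many points each centroid received.
counts = np.bincount(labels, minlength=len(centroids))

print((diffs < 0).any())   # True: negative differences do occur
print(counts)              # [0 2]: centroid 0 has no points at all
```

In this toy case both points are closer to the second centroid, so the first cluster is empty and its mean would be nan.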
Output I get when I run it:
How Many Centrouds? 3
Dimensionality of Data: (150, 4)
Starting Centroiuds:
[[ 1.4 7.9 0.2 3.4]
[ 7.8 0.2 4.3 1.4]
[ 5.7 6.9 3. 6.6]]
t0 :
[[[-3.7 4.4 -1.2 3.2]
[ 2.7 -3.3 2.9 1.2]
[ 0.6 3.4 1.6 6.4]]
[[-3.5 4.9 -1.2 3.2]
[ 2.9 -2.8 2.9 1.2]
[ 0.8 3.9 1.6 6.4]]
[[-3.3 4.7 -1.1 3.2]
[ 3.1 -3. 3. 1.2]
[ 1. 3.7 1.7 6.4]]
...,
[[-5.1 4.9 -5. 1.4]
[ 1.3 -2.8 -0.9 -0.6]
[-0.8 3.9 -2.2 4.6]]
[[-4.8 4.5 -5.2 1.1]
[ 1.6 -3.2 -1.1 -0.9]
[-0.5 3.5 -2.4 4.3]]
[[-4.5 4.9 -4.9 1.6]
[ 1.9 -2.8 -0.8 -0.4]
[-0.2 3.9 -2.1 4.8]]]
Warning (from warnings module):
File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 59
warnings.warn("Mean of empty slice.", RuntimeWarning)
RuntimeWarning: Mean of empty slice.
Warning (from warnings module):
File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 68
ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
---------------
Starting Centroids:
[[ 1.4 7.9 0.2 3.4]
[ 7.8 0.2 4.3 1.4]
[ 5.7 6.9 3. 6.6]]
Starting NewMeans:
[[ nan nan nan nan]
[ 5.84333333 3.054 3.75866667 1.19866667]
[ nan nan nan nan]]
Starting Centroids Now:
[[ nan nan nan nan]
[ 5.84333333 3.054 3.75866667 1.19866667]
[ nan nan nan nan]]
NewMeans now:
[[ nan nan nan nan]
[ 5.84333333 3.054 3.75866667 1.19866667]
[ nan nan nan nan]]
Python code:
import numpy as np
from pprint import pprint
import random
import sys
import warnings
arglist = sys.argv
#UNCOMMENT BELOW IN FINAL PROGRAM
'''
NoOfCentroids = int(arglist[2])
dataPointsFromFile = np.array(np.loadtxt(sys.argv[1], delimiter = ','))
'''
dataPointsFromFile = np.array(np.loadtxt('iris.txt', delimiter = ','))
NoOfCentroids = input('How Many Centrouds? ')
dataRange = ([])
#UNCOMMENT BELOW IN FINAL PROGRAM
'''
with open(arglist[1]) as f:
    print 'Points in data set: ',sum(1 for _ in f)
'''
dataRange.append(round(np.amin(dataPointsFromFile),1))
dataRange.append(round(np.amax(dataPointsFromFile),1))
dataRange = np.asarray(dataRange)
dataPoints = np.array(dataPointsFromFile)
print 'Dimensionality of Data: ', dataPoints.shape
randomCentroids = []
data = ([])
templist = []
i = 0
while i<NoOfCentroids:
    for j in range(len(dataPointsFromFile[1,:])):
        cat = round(random.uniform(np.amin(dataPointsFromFile),np.amax(dataPointsFromFile)),1)
        templist.append(cat)
    randomCentroids.append(templist)
    templist = []
    i = i+1
centroids = np.asarray(randomCentroids)
def kMeans(array1, array2):
    ConvergenceCounter = 1
    keepGoing = True
    StartingCentroids = np.copy(centroids)
    print 'Starting Centroiuds:\n {}'.format(StartingCentroids)
    while keepGoing:
        #--------------Find The new means---------#
        t0 = StartingCentroids[None, :, :] - dataPoints[:, None, :]
        print 't0 :\n {}'.format(t0)
        t1 = np.linalg.norm(t0, axis=-1)
        t2 = np.argmin(t1, axis=-1)
        #------Push the new means to a new array for comparison---------#
        CentroidMeans = []
        for x in range(len(StartingCentroids)):
            CentroidMeans.append(np.mean(dataPoints[t2 == [x]], axis=0))
        #--------Convert to a numpy array--------#
        NewMeans = np.asarray(CentroidMeans)
        #------Compare the New Means with the Starting Means------#
        if np.array_equal(NewMeans,StartingCentroids):
            print ('Convergence has been reached after {} moves'.format(ConvergenceCounter))
            print ('Starting Centroids:\n{}'.format(centroids))
            print ('Final Means:\n{}'.format(NewMeans))
            print ('Final Cluster assignments: {}'.format(t2))
            for x in xrange(len(StartingCentroids)):
                print ('Cluster {}:\n'.format(x)), dataPoints[t2 == [x]]
            for x in xrange(len(StartingCentroids)):
                print ('Size of Cluster {}:'.format(x)), len(dataPoints[t2 == [x]])
            keepGoing = False
        else:
            print 15*'-'
            ConvergenceCounter = ConvergenceCounter +1
            print 'Starting Centroids:\n'
            print StartingCentroids
            print '\n'
            print 'Starting NewMeans:\n'
            print NewMeans
            StartingCentroids =np.copy(NewMeans)
            print 'Starting Centroids Now:\n'
            print StartingCentroids
            print '\n'
            print 'NewMeans now:'
            print NewMeans
            break
kMeans(centroids, dataPoints)
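One change I am considering for the mean-update step is sketched below, on the assumption that the nan rows come from centroids with no assigned points. The helper `update_centroids` and the sample arrays are hypothetical and not part of the program above. Keeping the old centroid for an empty cluster is only one possible policy; re-drawing that centroid would be another.

```python
import numpy as np

def update_centroids(points, labels, centroids):
    """Return new centroids; an empty cluster keeps its previous centroid."""
    new_centroids = centroids.astype(float).copy()
    for k in range(len(centroids)):
        members = points[labels == k]
        if len(members) > 0:
            # Only average clusters that actually contain points,
            # so np.mean is never called on an empty slice.
            new_centroids[k] = members.mean(axis=0)
    return new_centroids

# Hypothetical example: both points are assigned to cluster 1.
points = np.array([[1.0, 1.0], [3.0, 3.0]])
centroids = np.array([[9.0, 9.0], [2.0, 2.5]])
labels = np.array([1, 1])

result = update_centroids(points, labels, centroids)
print(result)   # cluster 0 stays at [9, 9]; cluster 1 moves to [2, 2]
```

In the loop above, a helper like this would stand in for the list built from np.mean, with t2 passed as labels.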