Scikit-learn: How to run KMeans on a one-dimensional array?


I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it seems KMeans works with a multidimensional array and not with one-dimensional ones. I guess there is a trick to make it work, but I don't know how. I saw that KMeans.fit() accepts "X : array-like or sparse matrix, shape=(n_samples, n_features)", but it wants n_samples to be bigger than one.
I tried putting my array into a np.zeros() matrix and running KMeans, but then it puts all the non-null values in class 1 and the rest in class 0.
Can anyone help with running this algorithm on a one-dimensional array?

You have many samples of 1 feature, so you can reshape the array to (13876, 1) using numpy's reshape:
from sklearn.cluster import KMeans
import numpy as np
x = np.random.random(13876)
km = KMeans()
km.fit(x.reshape(-1,1)) # -1 will be calculated to be 13876 here
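As a rough sketch of what the reshape buys you, the snippet below clusters synthetic one-dimensional data and reads back the labels and cluster centers. The data, the choice of three clusters, and the `n_init` and `random_state` values are assumptions made for illustration; they are not part of the original question.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the real values: three clumps inside [0, 1] (assumption)
x = np.concatenate([
    rng.normal(0.1, 0.02, 500),
    rng.normal(0.5, 0.02, 500),
    rng.normal(0.9, 0.02, 500),
])

X = x.reshape(-1, 1)  # shape (1500, 1): 1500 samples, 1 feature
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = km.labels_                             # cluster index for every value
centers = np.sort(km.cluster_centers_.ravel())  # one center per cluster, sorted
print(centers)
```

`labels` has one entry per input value, so `x[labels == k]` selects the values assigned to cluster `k`.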

Read about Jenks Natural Breaks. Here is a Python function, found via a link from the article:
def get_jenks_breaks(data_list, number_class):
    data_list.sort()
    mat1 = []
    for i in range(len(data_list) + 1):
        temp = []
        for j in range(number_class + 1):
            temp.append(0)
        mat1.append(temp)
    mat2 = []
    for i in range(len(data_list) + 1):
        temp = []
        for j in range(number_class + 1):
            temp.append(0)
        mat2.append(temp)
    for i in range(1, number_class + 1):
        mat1[1][i] = 1
        mat2[1][i] = 0
        for j in range(2, len(data_list) + 1):
            mat2[j][i] = float('inf')
    v = 0.0
    for l in range(2, len(data_list) + 1):
        s1 = 0.0
        s2 = 0.0
        w = 0.0
        for m in range(1, l + 1):
            i3 = l - m + 1
            val = float(data_list[i3 - 1])
            s2 += val * val
            s1 += val
            w += 1
            v = s2 - (s1 * s1) / w
            i4 = i3 - 1
            if i4 != 0:
                for j in range(2, number_class + 1):
                    if mat2[l][j] >= (v + mat2[i4][j - 1]):
                        mat1[l][j] = i3
                        mat2[l][j] = v + mat2[i4][j - 1]
        mat1[l][1] = 1
        mat2[l][1] = v
    k = len(data_list)
    kclass = []
    for i in range(number_class + 1):
        kclass.append(min(data_list))
    kclass[number_class] = float(data_list[len(data_list) - 1])
    count_num = number_class
    while count_num >= 2:  # print "rank = " + str(mat1[k][count_num])
        idx = int((mat1[k][count_num]) - 2)
        # print "val = " + str(data_list[idx])
        kclass[count_num - 1] = data_list[idx]
        k = int((mat1[k][count_num] - 1))
        count_num -= 1
    return kclass
Use and visualization:
import numpy as np
import matplotlib.pyplot as plt
def get_jenks_breaks(...):...
x = np.random.random(30)
breaks = get_jenks_breaks(x, 5)
for line in breaks:
    plt.plot([line for _ in range(len(x))], 'k--')
plt.plot(x)
plt.grid(True)
plt.show()

Related

Heat equation divide by zero issue

I'm writing code that solves a heat equation using an implicit method. The problem is that the values between the first and last layers of the matrix are NaNs. What could be the problem?
From my point of view, the main issue might be with the 105th line, which represents the conversion of the original function to the one that includes the boundary function.
Boundary functions code:
def func(x, t):
    return x*(1 - x)*np.exp(-2*t)
# boundary functions for x = 0 and x = 1
def q0(t):
    return t*np.exp(-t/0.1)*np.cos(t) # boundary condition at x = 0
def q1(t):
    return t*np.exp(-t/0.5)*np.cos(t) # boundary condition at x = 1
def derivative(f, x0, step):
    return (f(x0+step) - f(x0))/step
# boundary function for t = 0
def u_x0(x):
    return (-x + 1)*x
Function that solves the tridiagonal matrix equation:
def solution(a, b):
    n = len(a)
    x = [0 for k in range(0, n)]
    # forward
    v = [0 for k in range(0, n)]
    u = [0 for k in range(0, n)]
    # first row (t = 0)
    v[0] = a[0][1] / (-a[0][0])
    u[0] = ( - b[0]) / (-a[0][0])
    for i in range(1, n - 1):
        v[i] = a[i][i+1] / ( -a[i][i] - a[i][i-1]*v[i-1] )
        u[i] = ( a[i][i-1]*u[i-1] - b[i] ) / ( -a[i][i] - a[i][i-1]*v[i-1] )
    # last row (t = 1)
    v[n-1] = 0
    u[n-1] = (a[n-1][n-2]*u[n-2] - b[n-1]) / (-a[n-1][n-1] - a[n-1][n-2]*v[n-2])
    x[n-1] = u[n-1]
    for i in range(n-1, 0, -1):
        x[i-1] = v[i-1] * x[i] + u[i-1]
    return x
Coefficient matrix values:
A = -t/h**2
B = 1 + 2*t/h**2
C = -t/h**2
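One way to narrow down where NaNs come from is to test the tridiagonal solve in isolation against the coefficients above. The sketch below is a generic Thomas-algorithm solver, not the asker's `solution` function; the grid size, the step sizes `tau` and `h`, the Dirichlet boundary rows, and the name `thomas_solve` are all assumptions made for illustration.

```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = np.zeros(n)  # modified upper diagonal
    d = np.zeros(n)  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the lower diagonal
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution
    x = np.zeros(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Hypothetical grid: 100 nodes on [0, 1], time step tau = 0.01 (assumptions)
n, h, tau = 100, 1.0 / 99, 0.01
r = tau / h**2
lower = np.full(n, -r)         # A = -t/h**2
diag = np.full(n, 1 + 2 * r)   # B = 1 + 2*t/h**2
upper = np.full(n, -r)         # C = -t/h**2
# Dirichlet boundary rows: u[0] and u[-1] are taken directly from rhs
diag[0] = diag[-1] = 1.0
upper[0] = 0.0
lower[-1] = 0.0

x_grid = np.linspace(0, 1, n)
rhs = x_grid * (1 - x_grid)    # initial profile u(x, 0) = x(1 - x)
u_next = thomas_solve(lower, diag, upper, rhs)
print(np.all(np.isfinite(u_next)))
```

If a check like `np.isfinite` passes for the solver alone, the NaNs are more likely introduced by the surrounding update loop or by the inputs fed to it.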
Code that actually solves the matrix:
i = 1
X = []
while i < 99:
    X = solution(cool_array, f)
    k = 0
    while k < len(x_i):
        #line-105
        X[k] += 0.01*(func(x_i[k], x_i[i]) - (1 - x_i[i])*derivative(q0, x_i[i], 0.01) - (x_i[i])*derivative(q1, x_i[i], 0.01))
        k+=1
    a = 1
    while a < 98:
        w_h_t[i][a] = X[a]
        a+=1
    f = X
    f[0] = w_h_t[i][0]
    f[99] = w_h_t[i][99]
    i+=1
print(w_h_t)
As far as I understand, the algorithm solution(a, b) is written properly, so I guess the problem might be with the boundary functions or with the 105th line. The output I expect is at least an array of numbers, not NaNs.

Dynamic Time Warping returns small value for far away curves

I have Python code that implements Dynamic Time Warping, which I use to compare the predicted curve to my actual curve. I care about the shape of the curve but also about the distance between the two curves. I z-normalized the two curves before calling the function that returns the cost. However, I got weird results. For example:
I got a cost of 0.28 for this example:
While I got 0.38 for the example below:
In the first plot, the prediction is very far away compared to the second plot. I even got the same value of 0.28 with a prediction that is much further away, such as 5000 points further. What is wrong here?
Below is my code from this source:
#Dynamic Time Warping Algorithm
def dp(dist_mat):
    N, M = dist_mat.shape
    # Initialize the cost matrix
    cost_mat = numpy.zeros((N + 1, M + 1))
    for i in range(1, N + 1):
        cost_mat[i, 0] = numpy.inf
    for i in range(1, M + 1):
        cost_mat[0, i] = numpy.inf
    # Fill the cost matrix while keeping traceback information
    traceback_mat = numpy.zeros((N, M))
    for i in range(N):
        for j in range(M):
            penalty = [
                cost_mat[i, j], # match (0)
                cost_mat[i, j + 1], # insertion (1)
                cost_mat[i + 1, j]] # deletion (2)
            i_penalty = numpy.argmin(penalty)
            cost_mat[i + 1, j + 1] = dist_mat[i, j] + penalty[i_penalty]
            traceback_mat[i, j] = i_penalty
    # Traceback from bottom right
    i = N - 1
    j = M - 1
    path = [(i, j)] #Path is commented because I am not interested in the path
    # while i > 0 or j > 0:
    #     tb_type = traceback_mat[i, j]
    #     if tb_type == 0:
    #         # Match
    #         i = i - 1
    #         j = j - 1
    #     elif tb_type == 1:
    #         # Insertion
    #         i = i - 1
    #     elif tb_type == 2:
    #         # Deletion
    #         j = j - 1
    #     path.append((i, j))
    # Strip infinity edges from cost_mat before returning
    cost_mat = cost_mat[1:, 1:]
    return (path[::-1], cost_mat)
I use the above code as below:
z_actual=stats.zscore(actual)
z_pred=stats.zscore(mean_predictions)
N = actual.shape[0]
M = mean_predictions.shape[0]
dist_mat = numpy.zeros((N, M))
for i in range(N):
    for j in range(M):
        dist_mat[i, j] = abs(z_actual[i] - z_pred[j])
path,cost_mat=dp(dist_mat)
mape=cost_mat[N - 1, M - 1]/(N + M)
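One thing that may be worth checking: z-normalization subtracts each curve's mean and divides by its standard deviation, so a constant vertical offset between the two curves disappears before the cost is computed. The sketch below illustrates that effect on synthetic sine curves; the data and the helper names `zscore` and `dtw_cost` are assumptions, and the DTW here is a compact stand-in rather than the code above.

```python
import numpy as np

def zscore(a):
    """Subtract the mean and divide by the standard deviation."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean()) / a.std()

def dtw_cost(a, b):
    """Accumulated DTW cost with absolute difference as the local distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

t = np.linspace(0, 2 * np.pi, 50)
actual = np.sin(t)
near = np.sin(t) + 0.1     # prediction close to the actual curve
far = np.sin(t) + 5000.0   # same shape, shifted far away

cost_near = dtw_cost(zscore(actual), zscore(near))
cost_far = dtw_cost(zscore(actual), zscore(far))
raw_far = dtw_cost(actual, far)  # no normalization: the offset dominates
print(cost_near, cost_far, raw_far)
```

If the distance between the curves matters as well as their shape, one option is to compute the cost on the raw (or jointly scaled) values, or to report the offset separately from the DTW cost.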

python - why is the value of y in the input different from the output?

Why is the value of y in the input different from the output?
In the input I put 14.6210, but the output shows 1.46210, which makes my Newton interpolation calculation incorrect.
import numpy as np
x = [0,8,16,24]
y = [14.6210,11.8430,9.8700,8.4180]
xinput = 12
n = len(x)-1
ST = np.zeros((n+1,n+1))
ST[:,0] = y
for k in range (1,n+1):
    for i in range (0,n-k+1):
        ST[i,k] = (ST[i+1,k-1] - ST[i,k-1]) / (x[i+k] - x[i])
print(ST)
p = ST[0,0]
for i in range(1, n+1):
    a = ST[0,i]
    for k in range (0,i):
        a = a * (xinput - x[k])
    p = p + a
print(p)
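A likely explanation, offered as a sketch rather than a confirmed diagnosis: when an array mixes values such as 14.621 with very small ones, NumPy prints it in scientific notation, so 14.621 is displayed as 1.46210000e+01. The stored value is unchanged. The snippet below repeats the computation and switches the display to fixed-point with `np.set_printoptions`; the `precision=6` setting is an arbitrary choice.

```python
import numpy as np

x = [0, 8, 16, 24]
y = [14.6210, 11.8430, 9.8700, 8.4180]
xinput = 12
n = len(x) - 1

# Divided-difference table: column 0 holds y, column k the k-th differences
ST = np.zeros((n + 1, n + 1))
ST[:, 0] = y
for k in range(1, n + 1):
    for i in range(0, n - k + 1):
        ST[i, k] = (ST[i + 1, k - 1] - ST[i, k - 1]) / (x[i + k] - x[i])

# Display only: suppress scientific notation when printing
np.set_printoptions(suppress=True, precision=6)
print(ST)

# Evaluate the Newton polynomial at xinput
p = ST[0, 0]
for i in range(1, n + 1):
    term = ST[0, i]
    for k in range(0, i):
        term = term * (xinput - x[k])
    p = p + term
print(p)
```

Checking a single entry with `print(ST[0, 0])` is a quick way to confirm that the stored value is still 14.621.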

Python: got an output image with unexpected grid lines

I am writing a function that scales the input image to s times its input size. The function Resize(Mat I, float s) first fills in the X and Y Mats that contain the query point coordinates. Then I calculate the query value by using bilinear interpolation.
The output image looks fine, except that it has an unexpected #-shaped grid on it. Can you give me a hint on how to fix this?
Output image:
Code:
import numpy as np
import cv2 as cv
import math
import matplotlib.pyplot as plt
#Mat I, float s
def Resize(I, s):
    orig_x = I.shape[0]
    orig_y = I.shape[1]
    tar_x = int(orig_x * s) #int tar_x and tar_y
    tar_y = int(orig_y * s)
    #print(tar_x)
    # Query points
    X = np.empty((tar_y, tar_x), np.float32)
    Y = np.empty((tar_y, tar_x), np.float32)
    # calc interval between output points
    interval = (orig_x-1) / (tar_x-1)
    # Setting the query points
    for i in range(0, tar_y):
        for j in range(0, tar_x):
            #set X[i, j] and Y[i,j]
            X[i][j] = j * interval
            Y[i][j] = i * interval
    # Output image
    output = np.empty((tar_y, tar_x), np.uint8)
    # Performing the interpolation
    for i in range(0, tar_y):
        for j in range(0, tar_x):
            #set output[i,j] using X[i, j] and Y[i,j]
            x = X[i][j]
            y = Y[i][j]
            x1 = math.floor(x)
            x2 = math.ceil(x)
            y1 = math.floor(y)
            y2 = math.ceil(y)
            vq1= (x-x1)*I[y1,x2] + (x2-x)*I[y1,x1]
            vq2= (x-x1)*I[y2,x2] + (x2-x)*I[y2,x1]
            output[i,j] = (y-y1)*vq2 + (y2-y)*vq1
    return output
s= 640 / 256
I = cv.imread("aerial_256.png", cv.IMREAD_GRAYSCALE)
output = Resize(I,s)
output = cv.cvtColor(output, cv.COLOR_BGR2RGB)
plt.imshow(output)
plt.savefig("aerial_640.png",bbox_inches='tight',transparent=True, pad_inches=0)
plt.show()

You are getting a black pixel where x is an integer and where y is an integer.
Take a look at the following code:
x1 = math.floor(x)
x2 = math.ceil(x)
vq1= (x-x1)*I[y1,x2] + (x2-x)*I[y1,x1]
vq2= (x-x1)*I[y2,x2] + (x2-x)*I[y2,x1]
Assume: x = 85.0
x1 = floor(x) = 85
x2 = ceil(x) = 85
(x-x1) = (85-85) = 0
(x2-x) = (85-85) = 0
vq1 = (x-x1)*I[y1,x2] + (x2-x)*I[y1,x1] = 0*I[y1,x2] + 0*I[y1,x1] = 0
vq2 = (x-x1)*I[y2,x2] + (x2-x)*I[y2,x1] = 0*I[y2,x2] + 0*I[y2,x1] = 0
output[i,j] = (y-y1)*vq2 + (y2-y)*vq1 = (y-y1)*0 + (y2-y)*0 = 0
Result:
In the entire column where x = 85.0, the value of output[i,j] is zero (we get a black column).
The same applies to y = 85.0: we get a black row.
When is the x value an integer?
Take a look at the following code:
# calc interval between output points
interval = (orig_x-1) / (tar_x-1)
# Setting the query points
for i in range(0, tar_y):
    for j in range(0, tar_x):
        #set X[i, j] and Y[i,j]
        X[i][j] = j * interval
interval = (orig_x-1) / (tar_x-1) = 255/639 = (3*5*17)/(3*3*71) = 85/213
j * interval = j * 85/213
Each time j is a multiple of 213, j * interval is an integer (we get a black column).
This happens when j=0, j=213, j=426, j=639, so there are two black columns (besides the margins).
There are also two visible black rows (besides the margins).
Suggested solution:
Replace x2 = math.ceil(x) with x2 = min(x1 + 1, orig_x-1).
Replace y2 = math.ceil(y) with y2 = min(y1 + 1, orig_y-1).
Corrected loop:
for i in range(0, tar_y):
    for j in range(0, tar_x):
        #set output[i,j] using X[i, j] and Y[i,j]
        x = X[i][j]
        y = Y[i][j]
        x1 = math.floor(x)
        x2 = min(x1 + 1, orig_x-1)
        y1 = math.floor(y)
        y2 = min(y1 + 1, orig_y-1)
        vq1= (x-x1)*I[y1,x2] + (x2-x)*I[y1,x1]
        vq2= (x-x1)*I[y2,x2] + (x2-x)*I[y2,x1]
        output[i,j] = (y-y1)*vq2 + (y2-y)*vq1
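An alternative formulation, sketched here under the assumption of a 2-D grayscale uint8 input at least 2 pixels wide and tall, expresses the weights as a fraction w and 1 - w so they always sum to one, including at integer coordinates and at the last row and column. The function name `resize_bilinear` and the constant test image are hypothetical and not part of the original answer.

```python
import numpy as np

def resize_bilinear(img, s):
    """Scale a 2-D grayscale image by factor s with bilinear interpolation."""
    rows, cols = img.shape
    out_rows, out_cols = int(rows * s), int(cols * s)
    # Query coordinates in the source image, one per output row / column
    ys = np.linspace(0, rows - 1, out_rows)
    xs = np.linspace(0, cols - 1, out_cols)
    # Lower neighbour, clamped so that the upper neighbour stays in bounds
    y1 = np.minimum(np.floor(ys).astype(int), rows - 2)
    x1 = np.minimum(np.floor(xs).astype(int), cols - 2)
    wy = (ys - y1)[:, None]  # fractional weights in [0, 1]
    wx = (xs - x1)[None, :]
    img = img.astype(np.float64)
    # Interpolate along x on the two neighbouring rows, then along y
    top = (1 - wx) * img[y1][:, x1] + wx * img[y1][:, x1 + 1]
    bottom = (1 - wx) * img[y1 + 1][:, x1] + wx * img[y1 + 1][:, x1 + 1]
    out = (1 - wy) * top + wy * bottom
    return np.rint(out).astype(np.uint8)

# A constant image should stay constant: no black rows or columns
demo = resize_bilinear(np.full((256, 256), 200, np.uint8), 640 / 256)
print(demo.shape, demo.min(), demo.max())
```

Because the weights are computed from the fractional part alone, an integer query coordinate gives weights 0 and 1 instead of 0 and 0.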

Efficient way to create a dense matrix from diagonal vectors in Python?

I am trying to create this matrix in Python using numpy vectors:
where the values come from a function. I have implemented it by repeatedly calling numpy.diag, but for large dimensions it becomes very slow. Here is the code:
def makeS(N):
    vec = np.full(N, 2*v(x_range[1]))
    vec[0]*=0.5
    S = np.diag(vec)
    vec = np.full(N-1, v(x_range[0]))
    S+= np.diag(vec, 1)
    for m in xrange(1, N):
        vec = np.full(N-m, 2*v(x_range[m+1]))
        vec[0]*= 0.5
        S += np.diag(vec, -m)
    return S
where v() is the said function and x_range is a vector of x-values. Is there a way to make this more efficient?
Edit:
Here is a full example:
import numpy as np
import math
N = 5
x_range = np.linspace(0, 1, N+1)
def v(x):
    return math.exp(x)
def makeS(N):
    vec = np.full(N, 2*v(x_range[1]))
    vec[0]*=0.5
    S = np.diag(vec)
    vec = np.full(N-1, v(x_range[0]))
    S+= np.diag(vec, 1)
    for m in xrange(1, N):
        vec = np.full(N-m, 2*v(x_range[m+1]))
        vec[0]*= 0.5
        S += np.diag(vec, -m)
    return S
print makeS(N)
which outputs
[[ 1.22140276 1. 0. 0. 0. ]
[ 1.4918247 2.44280552 1. 0. 0. ]
[ 1.8221188 2.9836494 2.44280552 1. 0. ]
[ 2.22554093 3.6442376 2.9836494 2.44280552 1. ]
[ 2.71828183 4.45108186 3.6442376 2.9836494 2.44280552]]

This is the fastest approach I could find:
def makeS(N):
    values = np.array([v(x) for x in x_range])
    values_doubled = 2 * values
    result = np.eye(N, k=1) * values[0]
    result[:, 0] = values[1:]
    for i in xrange(N - 1):
        result[i + 1, 1:i + 2] = values_doubled[1:i + 2][::-1]
    return result
With N=2000 the original takes 26.97 seconds on my machine while the new version takes 0.02339 seconds.
Here is the complete script for evaluating timings with some additional approaches.
import numpy as np
import math
import timeit
def v(x):
    return math.exp(x)
def makeS1(N, x_range):
    vec = np.full(N, 2 * v(x_range[1]))
    vec[0] *= 0.5
    S = np.diag(vec)
    vec = np.full(N - 1, v(x_range[0]))
    S += np.diag(vec, 1)
    for m in xrange(1, N):
        vec = np.full(N - m, 2 * v(x_range[m + 1]))
        vec[0] *= 0.5
        S += np.diag(vec, -m)
    return S
def makeS2(N, x_range):
    values = np.array([v(x) for x in x_range])
    values_doubled = 2 * values
    def value_at_position(ai, aj):
        result = np.zeros((N, N))
        for i, j in zip(ai.flatten(), aj.flatten()):
            if j > i + 1:
                continue
            elif j == i + 1:
                result[i, j] = values[0]
            elif j == 0:
                result[i, j] = values[i + 1]
            else:
                result[i, j] = values_doubled[i - j + 1]
        return result
    return np.fromfunction(value_at_position, (N, N))
def makeS3(N, x_range):
    values = np.array([v(x) for x in x_range])
    values_doubled = 2 * values
    result = np.zeros((N, N))
    for i in xrange(N):
        for j in xrange(min(i + 2, N)):
            if j == i + 1:
                result[i, j] = values[0]
            elif j == 0:
                result[i, j] = values[i + 1]
            else:
                result[i, j] = values_doubled[i - j + 1]
    return result
def makeS4(N, x_range):
    values = np.array([v(x) for x in x_range])
    values_doubled = 2 * values
    result = np.eye(N, k=1) * values[0]
    result[:, 0] = values[1:]
    for i in xrange(N - 1):
        result[i + 1, 1:i + 2] = values_doubled[1:i + 2][::-1]
    return result
def main():
    N = 2000
    x_range = np.random.randn(N + 1)
    start = timeit.default_timer()
    s1 = makeS1(N, x_range)
    print 'makeS1', timeit.default_timer() - start
    start = timeit.default_timer()
    s2 = makeS2(N, x_range)
    print 'makeS2', timeit.default_timer() - start
    start = timeit.default_timer()
    s3 = makeS3(N, x_range)
    print 'makeS3', timeit.default_timer() - start
    start = timeit.default_timer()
    s4 = makeS4(N, x_range)
    print 'makeS4', timeit.default_timer() - start
    if N < 10:
        print s1
        print s2
        print s3
        print s4
    assert np.allclose(s1, s2)
    assert np.allclose(s2, s3)
    assert np.allclose(s3, s4)
main()
On my machine, this produces the output:
makeS1 26.9707232448
makeS2 11.7728229076
makeS3 0.643742975052
makeS4 0.0233912765665
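For readers on Python 3, the same matrix can also be built without an explicit Python loop by indexing the value vector with i - j + 1. This is a sketch that was not benchmarked against the versions above; the name `make_s_vectorized` is hypothetical, and the approach allocates a few N x N index arrays, so it trades memory for simplicity.

```python
import numpy as np

def make_s_vectorized(values):
    """Build the N x N matrix from the N + 1 function values v(x_0..x_N)."""
    values = np.asarray(values, dtype=float)
    n = len(values) - 1
    i, j = np.indices((n, n))
    d = i - j + 1  # index into values; negative above the superdiagonal
    # Doubled values on and below the main diagonal, zeros where d < 0
    s = np.where(d >= 0, 2.0 * values[np.clip(d, 0, n)], 0.0)
    s[d == 0] = values[0]   # superdiagonal holds v(x_0), not doubled
    s[:, 0] = values[1:]    # first column is not doubled
    return s

N = 5
x_range = np.linspace(0, 1, N + 1)
values = np.exp(x_range)
S = make_s_vectorized(values)
print(S)
```

For N = 5 and v(x) = exp(x), this should reproduce the matrix printed in the question.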
