Kmeans with initial centroids gives different outputs in Matlab and Python environments

The input to the Kmeans in both the Matlab and Python environments is the following list:
input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 0.64]
Matlab:
[idx, C] = kmeans(input',3,'Start',[0.3;0.9;1.5]);
Output
C = [0.596, 0.825, 1.035]
(idx==1) = 15, (idx==2) = 6, (idx==3) = 6
Python:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init=1, init=np.array([0.3,0.9,1.5]).reshape(-1,1)).fit(np.array(input).reshape(-1, 1))
idx = kmeans.labels_
C = kmeans.cluster_centers_
Output
C = [0.430, 0.969, 0.637]
(idx==0) = 2, (idx==1) = 10, (idx==2) = 15
Clearly, the output centroids and the number of input points classified in the 3 clusters are different for these environments. What is the reason behind this even when the initial centroids are the same?

I've written a minimal kmeans algorithm to test your dataset in Matlab:
input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59, ...
         0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, ...
         0.64];
c = [0.3; 0.9; 1.5];
for ii = 1:10
    [~, idx] = min(abs(c - input));          % pairwise Euclidean distance (absolute difference in 1-D)
    c = accumarray(idx.', input, [], @mean)  % compute the new centroids
end
After the first iteration, the index idx, which indicates the closest centroid for each value, looks like this:
2 2 2 2 2 2 2 1 1 2...
The last centroid (1.5 here) is NEVER the closest one to any value! So in order to keep 3 groups, the kmeans algorithm has to compute, somehow, a new value for this centroid (because it's hard to compute the mean of an empty set). And it looks like Python and Matlab have different implementations for this empty-cluster case; Matlab's kmeans, for instance, exposes an 'EmptyAction' option that controls exactly this behavior.
If you want to avoid this problem, make sure that every initial centroid is the closest centroid to at least one element of your dataset.
You can, for example, take the first three distinct values of your dataset, as in the sketch below.
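For instance, here is a minimal sketch in Python, reusing the question's data, of seeding KMeans with the first three distinct values; since each initial centroid then coincides with an actual data point, no cluster can start out empty:
import numpy as np
from sklearn.cluster import KMeans
values = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59,
          0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90,
          0.64]
data = np.array(values).reshape(-1, 1)
# first three distinct values in order of appearance: 1.11, 0.81, 0.61
seen = []
for v in values:
    if v not in seen:
        seen.append(v)
    if len(seen) == 3:
        break
init = np.array(seen).reshape(-1, 1)
# every initial centroid equals a data point, so no cluster starts empty
kmeans = KMeans(n_clusters=3, n_init=1, init=init).fit(data)
print(kmeans.cluster_centers_.ravel())
print(np.bincount(kmeans.labels_))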


How to plot a matrix with description for each column and row

I have a data set I need to augment. Therefore, I have implemented an augmentation method called magnitude warping that has two hyperparameters to tune, namely sigma and knots. To assess the quality, I have two models that I train using the augmented data and test on part of the real data. To compare the accuracy, I also trained the models on only the real data. Let's assume the following Python code:
# test accuracy trained on real data only
ref_dt_accuracy = 0.86
ref_lstm_accuracy = 0.85
# test accuracy for each pair of hyperparameters
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4,5,6,7]
dt_accuracy_mw = [
    [0.82, 0.85, 0.83, 0.84],
    [0.8, 0.79, 0.81, 0.79],
    [0.78, 0.77, 0.74, 0.76],
    [0.74, 0.72, 0.78, 0.70]
]
lstm_accuracy_mw = [
    [0.80, 0.83, 0.81, 0.82],
    [0.78, 0.77, 0.79, 0.77],
    [0.76, 0.75, 0.72, 0.74],
    [0.72, 0.7, 0.76, 0.68]
]
Now, I want to plot two (three, if the last option is possible) matrices:
Plot dt_accuracy_mw and lstm_accuracy_mw such that the sigma and knots values are visualized:
sigma/knots    4    5    6    7
0.2
0.35      (actual matrix of the aforementioned accuracies)
0.5
0.65
A combined version of the above, in which each entry consists of (ref_dt_accuracy - dt_accuracy) / (ref_lstm_accuracy - lstm_accuracy); that is, each cell shows the dt_accuracy difference to its reference, then the lstm_accuracy difference, with the two models' scores separated by a /. For example, the cell for sigma=0.2, knots=4 would read 0.04 / 0.05 (from 0.86 - 0.82 and 0.85 - 0.80).
How would one achieve this using any open source libraries such as matplotlib, seaborn, etc.?
You can create a Seaborn heatmap as follows:
from matplotlib import pyplot as plt
import seaborn as sns
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4, 5, 6, 7]
dt_accuracy_mw = [[0.82, 0.85, 0.83, 0.84],
                  [0.8, 0.79, 0.81, 0.79],
                  [0.78, 0.77, 0.74, 0.76],
                  [0.74, 0.72, 0.78, 0.70]]
ax = sns.heatmap(data=dt_accuracy_mw, xticklabels=knots, yticklabels=sigma,
                 linewidths=1, linecolor='blue', clip_on=False, annot=True, cbar=False,
                 cmap=sns.color_palette(['white'], as_cmap=True))
ax.set_xlabel('knots')
ax.set_ylabel('sigma')
plt.tight_layout()
plt.show()
If I understand the second question correctly, a matrix of annotations would do the job (the data can be anything with the correct width and height):
from matplotlib import pyplot as plt
import seaborn as sns
ref_dt_accuracy = 0.86
ref_lstm_accuracy = 0.85
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4, 5, 6, 7]
dt_accuracy_mw = [[0.82, 0.85, 0.83, 0.84],
                  [0.8, 0.79, 0.81, 0.79],
                  [0.78, 0.77, 0.74, 0.76],
                  [0.74, 0.72, 0.78, 0.70]]
lstm_accuracy_mw = [[0.80, 0.83, 0.81, 0.82],
                    [0.78, 0.77, 0.79, 0.77],
                    [0.76, 0.75, 0.72, 0.74],
                    [0.72, 0.7, 0.76, 0.68]]
# build one string annotation per cell: "dt difference / lstm difference"
annot_matrix = [[f'{ref_dt_accuracy - dt:.2f} / {ref_lstm_accuracy - lstm:.2f}'
                 for dt, lstm in zip(dt_row, lstm_row)]
                for dt_row, lstm_row in zip(dt_accuracy_mw, lstm_accuracy_mw)]
ax = sns.heatmap(data=dt_accuracy_mw, xticklabels=knots, yticklabels=sigma,
                 annot=annot_matrix, fmt='',
                 linewidths=2, linecolor='crimson', clip_on=False, cbar=False,
                 cmap=sns.color_palette(['aliceblue'], as_cmap=True))
ax.set_xlabel('knots')
ax.set_ylabel('sigma')
plt.tight_layout()
plt.show()
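One design note: with a single-color palette the heatmap's coloring carries no information, so the data= argument effectively only fixes the grid shape and the annotations do all the work; any matrix of the same dimensions would render the same table.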

Plotly python regression in ternary space

I'm trying to draw a regression line in plotly python in ternary space, but there doesn't seem to be an option like trendline='loess' for scatter ternaries. Is there another way to achieve the same result for ternaries? Below is code from a previous post that makes a spline line, but not a regression.
import numpy as np
import plotly.graph_objects as go
a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7 , 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4 , 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])
fig = go.Figure()
curve_portion = np.where((b < 0.15) & (c > 0.6))
curve_other_portion = np.where(~((b < 0.15) & (c > 0.6)))
def add_plot_spline_portions(fig, indices_groupings):
    for indices in indices_groupings:
        fig.add_trace(go.Scatterternary({
            'mode': 'lines',
            'connectgaps': True,
            'a': a[indices],
            'b': b[indices],
            'c': c[indices],
            'line': {'color': 'black', 'shape': 'spline', 'smoothing': 1},
            'marker': {'size': 2, 'line': {'width': 0.1}}
        }))
add_plot_spline_portions(fig, [curve_portion, curve_other_portion])
fig.show(renderer='png')
I can outline what I think is a general sort of solution - it doesn't have as much mathematical rigor as I would like, and involves some guess and check type work - but hopefully it's helpful.
The first consideration is that for this regression on a ternary plot, there are only two degrees of freedom because A+B+C=1 (you might find this explanation helpful). This means it only makes sense to consider the relationship between two of the variables at a time. What we really want to do is create a regression between two of the variables, then determine the value of the third variable using the equation A+B+C=1.
The second consideration is a bit harder to define, but since you are after a regression that captures the "reversing" nature of the variable A, we want a regression where A can take on repeated values. I think the most straightforward way to achieve this is for A to be the variable you are predicting.
For simplicity's sake, let's say we use a degree-2 polynomial regression that predicts A from either B or C. We can make a scatter plot and choose whichever polynomial has a better fit for our purposes.
Here is a quick EDA:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7 , 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4 , 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])
## EDA to determine the polynomial of best fit to predict A
fig_eda = make_subplots(rows=1, cols=2)
# left panel: predict A from B
fig_eda.add_trace(go.Scatter(x=b, y=a, mode='markers'), row=1, col=1)
coefficients = np.polyfit(b, a, 2)
p = np.poly1d(coefficients)
b_vals = np.linspace(min(b), max(b))
a_pred = np.array([p(x) for x in b_vals])
fig_eda.add_trace(go.Scatter(x=b_vals, y=a_pred, mode='lines'), row=1, col=1)
# right panel: predict A from C
fig_eda.add_trace(go.Scatter(x=c, y=a, mode='markers'), row=1, col=2)
coefficients = np.polyfit(c, a, 2)
p = np.poly1d(coefficients)
c_vals = np.linspace(min(c), max(c))
a_pred = np.array([p(x) for x in c_vals])
fig_eda.add_trace(go.Scatter(x=c_vals, y=a_pred, mode='lines'), row=1, col=2)
fig_eda.show()  # render the two-panel EDA figure
Notice how predicting A from B looks like it captures the reversing nature of A better than predicting A from C. If we try a degree-2 polynomial regression of A on C, we can see A is not going to repeat within the domain C in [0, 1], because that polynomial slopes so gently.
So let's proceed with this regression with B as the predictor variable and A as the predicted variable (C then also being a predicted variable, via C = 1 - (A + B)).
fig = go.Figure()
fig.add_trace(go.Scatterternary({
    'mode': 'markers',
    'connectgaps': True,
    'a': a,
    'b': b,
    'c': c
}))
## since A+B+C = 1, we only need to fit a polynomial between two of the variables
## fit an n-degree polynomial to 2 of your variables
## source: https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
coefficients = np.polyfit(b, a, 2)
p = np.poly1d(coefficients)
## we use the entire domain of the input variable B
b_vals = np.linspace(0, 1)
a_pred = np.array([p(x) for x in b_vals])
c_pred = 1 - (b_vals + a_pred)
fig.add_trace(go.Scatterternary({
    'mode': 'lines',
    'connectgaps': True,
    'a': a_pred,
    'b': b_vals,
    'c': c_pred,
    'marker': {'size': 2, 'color': 'red', 'line': {'width': 0.1}}
}))
fig.show()
This is the lowest-degree polynomial regression that allows for repeated values of A (a linear regression to predict A wouldn't allow A to take on repeated values). However, you can definitely experiment with increasing the degree of the polynomial, and with predicting A from either B or C, as sketched below.
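As a minimal sketch of that experiment, continuing from the block above (the degrees chosen here are arbitrary values to try, not recommendations):
## overlay a regression curve for a few polynomial degrees of A ~ B
for degree in (2, 3, 4):
    p = np.poly1d(np.polyfit(b, a, degree))
    b_vals = np.linspace(0, 1)
    a_pred = p(b_vals)  # poly1d evaluates an array elementwise
    c_pred = 1 - (b_vals + a_pred)
    fig.add_trace(go.Scatterternary({
        'mode': 'lines',
        'a': a_pred,
        'b': b_vals,
        'c': c_pred,
        'name': f'degree {degree}'
    }))
fig.show()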

How to find max value and its index in a row of 2d list?

I have a 2D list. I want to find the max value and its index of each row.
Here is the list
q_table = [[0.16, 0.40, 0.61, 0.48, 0.20],
           [0.42, 0.79, 0.64, 0.54, 0.52],
           [0.64, 0.64, 0.24, 0.93, 0.43],
           [0.33, 0.54, 0.61, 0.43, 0.29],
           [0.25, 0.56, 0.42, 0.69, 0.62]]
Output:
0.61 2
0.79 1
0.93 3
0.61 2
0.69 3
# I'm using python 3.8
Thank You in Advance
As suggested in the comments, you can use np.max to get the max value of each row and np.argmax to get its position:
np.argmax(q_table, axis=1)  # returns the position of the max value in each row
np.max(q_table, axis=1)     # returns the max value of each row
You can then use the zip function to iterate over both results together and store the output as a list of lists:
import numpy as np
max_list_with_position = [[x, y] for x, y in zip(np.argmax(q_table, axis=1), np.max(q_table, axis=1))]
print(max_list_with_position)
output:
[[2, 0.61], [1, 0.79], [3, 0.93], [2, 0.61], [3, 0.69]]
q_table = [[0.16, 0.40, 0.61, 0.48, 0.20],
           [0.42, 0.79, 0.64, 0.54, 0.52],
           [0.64, 0.64, 0.24, 0.93, 0.43],
           [0.33, 0.54, 0.61, 0.43, 0.29],
           [0.25, 0.56, 0.42, 0.69, 0.62]]
rows_count = len(q_table)  # number of rows
for i in range(rows_count):
    a_row = q_table[i]              # take each row in turn
    max_value = max(a_row)          # find the max value in a single row
    index = a_row.index(max_value)  # find that max value's index within the row
    print("The max value ", max_value, " and Index in ", index)
If there is a better way, do suggest.
and here is the output,
The max value 0.61 and Index in 2
The max value 0.79 and Index in 1
The max value 0.93 and Index in 3
The max value 0.61 and Index in 2
The max value 0.69 and Index in 3
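Since the answer invites a better way: a slightly more compact pure-Python sketch finds the index directly by maximizing over the row's indices, which avoids the second pass that .index() makes:
for row in q_table:
    index = max(range(len(row)), key=row.__getitem__)  # index of the row's max
    print(row[index], index)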

How would I calculate the L2 or Euclidean distance between rows of different 2D arrays?

I've been given 2 different 2D arrays and I'm asked to calculate the L2 distance between the rows of array x and the rows in array y. The shape of array x is (M, D) and the shape of array y is (N, D). The final answer array should have the shape (M, N).
I'm not very good at python. I'm really just doing random things and seeing what happens.
import numpy as np
def compute_distances(x, y):
    """ Write a function that computes the L2 distance between each row
    in `x` and `y`.
    Parameters
    ----------
    x : numpy.ndarray
        x.shape must be (M, D)
        Each row of `x` is a flattened vector representing the pixel
        values of a single image. Thus `x` represents
        M images, each one described by a length-D vector.
    y : numpy.ndarray
        y.shape must be (N, D)
        Each row of `y` is a flattened vector representing the pixel
        values of a single image. Thus `y` represents
        N images, each one described by a length-D vector.
    Returns
    -------
    distances : numpy.ndarray
        distances.shape = (M, N)
        distances[i, j] = the L2 distance between x[i] and y[j]
    """
    # student code goes here
    M = x.shape[0]
    N = y.shape[0]
    dists = np.array(M, N)
    dists[i, j] = np.sqrt(np.sum(np.square(x.shape[0] - y.shape[0]), axis=0))
    pass
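For reference, a minimal vectorized sketch of what the body could look like, using NumPy broadcasting (insert a length-1 axis so the (M, D) and (N, D) arrays expand to (M, N, D), then reduce over D):
import numpy as np
def compute_distances(x, y):
    # diffs has shape (M, N, D): every row of x minus every row of y
    diffs = x[:, np.newaxis, :] - y[np.newaxis, :, :]
    # sum the squared differences over D and take the root -> shape (M, N)
    return np.sqrt(np.sum(diffs ** 2, axis=-1))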
I like this one since it works for 1, 2 and 3D arrays
def e_dist(a, b, metric='euclidean'):
    """Distance calculation for 1D, 2D and 3D points using einsum
    preprocessing :
        use `_view_`, `_new_view_` or `_reshape_` with structured/recarrays
    Parameters
    ----------
    a, b : array like
        Inputs, list, tuple, array in 1, 2 or 3D form
    metric : string
        euclidean ('e', 'eu'...), sqeuclidean ('s', 'sq'...)
    Notes
    -----
    mini e_dist for 2d points array and a single point
    >>> def e_2d(a, p):
            diff = a - p[np.newaxis, :]  # a and p are ndarrays
            return np.sqrt(np.einsum('ij,ij->i', diff, diff))
    See Also
    --------
    cartesian_dist : function
        Produces pairs of x,y coordinates and the distance, without duplicates.
    """
    a = np.asarray(a)
    b = np.atleast_2d(b)
    a_dim = a.ndim
    b_dim = b.ndim
    if a_dim == 1:
        a = a.reshape(1, 1, a.shape[0])
    if a_dim >= 2:
        a = a.reshape(np.prod(a.shape[:-1]), 1, a.shape[-1])
    if b_dim > 2:
        b = b.reshape(np.prod(b.shape[:-1]), b.shape[-1])
    diff = a - b
    dist_arr = np.einsum('ijk,ijk->ij', diff, diff)
    if metric[:1] == 'e':
        dist_arr = np.sqrt(dist_arr)
    dist_arr = np.squeeze(dist_arr)
    return dist_arr
Yielding
a = np.random.rand(3, 2)
b = np.random.rand(5, 2)
e_dist(a, b)
array([[0.62, 0.45, 0.88, 0.7 , 0.33],
[0.46, 0.57, 0.47, 0.25, 0.4 ],
[0.94, 0.68, 0.16, 0.35, 0.62]])
and for...
a = np.random.rand(2, 3, 2)
b = np.random.rand(2, 5, 2)
e_dist(a, b)
array([[0.72, 0.39, 0.89, 0.25, 0.29, 0.41, 0.31, 0.6 , 0.8 , 0.39],
[0.25, 0.26, 0.69, 0.45, 0.45, 0.63, 0.57, 0.39, 0.25, 0.78],
[0.28, 0.27, 0.74, 0.47, 0.46, 0.69, 0.62, 0.45, 0.25, 0.84],
[0.43, 0.48, 0.47, 0.52, 0.54, 0.2 , 0.17, 0.21, 0.55, 0.38],
[0.15, 0.36, 0.54, 0.51, 0.52, 0.52, 0.48, 0.23, 0.23, 0.69],
[0.87, 0.76, 0.78, 0.67, 0.71, 0.26, 0.28, 0.64, 0.99, 0.06]])
And to compare numpy and scipy variants for 2d
from scipy.spatial import distance_matrix
distance_matrix(a, b)
array([[0.87, 0.89, 0.91, 0.07, 0.68, 0.95, 0.89],
[0.78, 0.52, 0.28, 0.85, 0.22, 0.65, 0.51],
[0.75, 0.46, 0.51, 1.08, 0.5 , 0.57, 0.45]])
e_dist(a, b)
array([[0.87, 0.89, 0.91, 0.07, 0.68, 0.95, 0.89],
[0.78, 0.52, 0.28, 0.85, 0.22, 0.65, 0.51],
[0.75, 0.46, 0.51, 1.08, 0.5 , 0.57, 0.45]])
So there are lots of options, depending on the size of the arrays you are working with.
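And if SciPy is available anyway, scipy.spatial.distance.cdist returns the same (M, N) matrix in one call:
from scipy.spatial.distance import cdist
dists = cdist(a, b)  # Euclidean by default, shape (M, N)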

Generate permutations of values in fixed intervals

Suppose I have a set of possible values like the following:
vals = [0.85, 0.9, 0.95, 0.98]
Now, I want to generate lists of length 5, based on every possible combination:
perm_values = [0.85, 0.85, 0.85, 0.85, 0.85],
[0.85, 0.90, 0.85, 0.85, 0.85],
[0.85, 0.85, 0.90, 0.85, 0.85],
...
[0.98, 0.98, 0.98, 0.98, 0.98]]
I am pretty sure there is a Pythonic way to do this. Currently I have something like 5 nested loops indexing into the vals array.
It's simply:
from itertools import product
vals = [0.85, 0.9, 0.95, 0.98]
perm_vals = list(product(vals, repeat=5))
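Note that product yields tuples rather than lists; wrap each in list() if you need lists. As a quick sanity check on the size, 4 values in tuples of length 5 give 4**5 = 1024 results:
assert len(perm_vals) == 4 ** 5  # 1024 tuples
print(perm_vals[0])   # (0.85, 0.85, 0.85, 0.85, 0.85)
print(perm_vals[-1])  # (0.98, 0.98, 0.98, 0.98, 0.98)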
