I have a trained scikit-learn KMeans model.
When I use the model's predict function, the model assigns a given data point to the nearest cluster, as expected.
What is the easiest way to instead have the model assign the data point to the SECOND nearest, or THIRD nearest, cluster?
I cannot seem to find this anywhere. (I might be missing something essential.)
The KMeans estimator has a transform(X) method that returns the distance of each record to the centroid of each cluster, as an array of shape [n_observations, n_clusters].
With that, you can pick which cluster to assign the records to.
Example:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
np.random.seed(42)
digits = load_digits()
data = scale(digits.data)
n_digits = len(np.unique(digits.target))
km = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
km.fit(data)
predicted = km.predict(data)
dist_centers = km.transform(data)  # distances to each centroid, shape [n_observations, n_clusters]
To validate the transform output, we can compare the result of predict to taking the minimum value of the centroid distances:
>>> np.allclose(km.predict(data), np.argmin(dist_centers, axis=1))
True
Finally, we can use np.argsort to sort the elements of each row in the distances array by index, so that the first column of the result holds the labels of the nearest clusters, the second column holds the labels of the second nearest clusters, and so on.
>>> print(predicted)
[0 3 3 ... 3 7 7]
>>> print(np.argsort(dist_centers, axis=1))
[[0 7 4 ... 8 6 5]
 [3 9 4 ... 6 0 5]
 [3 9 4 ... 8 6 5]
 ...
 [3 1 9 ... 8 6 5]
 [7 0 9 ... 8 6 5]
 [7 3 1 ... 9 6 5]]
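To answer the question directly, then: the second and third nearest clusters are just the second and third columns of that argsort result. A minimal sketch, building on dist_centers from the example above:
order = np.argsort(dist_centers, axis=1)
second_nearest = order[:, 1]  # label of the second nearest cluster for each record
third_nearest = order[:, 2]   # label of the third nearest cluster for each record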
As we know, the Bernoulli Naive Bayes classifier uses binary predictors (features). What I am not getting is how BernoulliNB in scikit-learn gives results even if the predictors are not binary. The following example is taken verbatim from the documentation:
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X, Y)
print(clf.predict(X[2:3]))
Output:
[3]
Here are the first few features of X; they are obviously not binary:
3 4 0 1 3 0 0 1 4 4 1
1 0 2 4 4 0 4 1 4 1 0
2 4 4 0 3 3 0 3 1 0 2
2 2 3 1 4 0 0 3 2 4 1
0 4 0 3 2 4 3 2 4 2 4
3 3 3 3 0 2 3 1 3 2 3
How does BernoulliNB work here even though the predictors are not binary?
This is due to the binarize argument; from the docs:
binarize : float or None, default=0.0
Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
When BernoulliNB is called with the default binarize=0.0, as is the case in your code (since you do not specify the argument explicitly), every element of X greater than 0 is converted to 1; the transformed X that is actually used as input to the classifier therefore does consist of binary values.
The binarize argument works exactly the same way as the stand-alone preprocessing function of the same name; here is a simplified example, adapting your own:
from sklearn.preprocessing import binarize
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 1))
X
# result
array([[3],
       [4],
       [0],
       [1],
       [3],
       [0]])
binarize(X) # here as well, default threshold=0.0
# result (binary values):
array([[1],
       [1],
       [0],
       [1],
       [1],
       [0]])
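To see the equivalence end to end, here is a sketch (not part of the original docs example) that binarizes X by hand and passes binarize=None so the classifier skips its own thresholding; both models should agree:
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import binarize
import numpy as np

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])

clf_default = BernoulliNB().fit(X, Y)                        # thresholds internally at 0.0
clf_manual = BernoulliNB(binarize=None).fit(binarize(X), Y)  # input binarized by hand

print(np.array_equal(clf_default.predict(X),
                     clf_manual.predict(binarize(X))))       # expected: True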
I have the following dataframe (the real one is actually several hundred MB):
    X   Y  Size
0  10  20     5
1  11  21     2
2   9  35     1
3   8   7     7
4   9  19     2
I want to discard any X, Y point that has a Euclidean distance of less than delta=3 from any other X, Y point in the dataframe. In those cases I want to keep only the row with the larger Size.
In this example the intended result would be:
    X   Y  Size
0  10  20     5
2   9  35     1
3   8   7     7
As the question is stated, it is not clear how the desired algorithm should handle chaining of distances (e.g. A is within delta of B, and B is within delta of C, but A is not within delta of C).
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighborhood radius eps to delta and the min_samples parameter to 1 so that isolated points form their own clusters. Then, you can find in each group the point that has the maximum Size.
from sklearn.cluster import DBSCAN

X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_

# keep, for each cluster, the row whose Size is largest
df_new = df.loc[df.groupby('grp')['Size'].idxmax()]
print(df_new)
    X   Y  Size  grp
0  10  20     5    0
2   9  35     1    1
3   8   7     7    2
You can use the script below, and also try improving it.
# get all pairwise Euclidean distances using sklearn;
# this creates an array of distances, from which we take
# the index pairs whose distance is less than 3
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]

# collect all indices of df that appear in a close pair and find the row with max Size;
# then combine the rows of df NOT involved in any close pair with that max-Size row
from itertools import chain
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df.loc[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
    X   Y  Size
2   9  35     1
3   8   7     7
0  10  20     5
I want to reorder the coordinates based on their Euclidean distance from the first coordinate.
For example I have coordinates:
1 2
2 1
1 3
1 9
6 9
3 5
6 8
4 5
7 9
I have computed the Euclidean distance of the first coordinate to every other coordinate with the following code:
import math

with open("../data comparision project/testfile.txt") as f:
    # split the text file into a list of [x, y] string pairs
    my_list = [[x for x in line.strip().split(' ')] for line in f]

# empty list to store distances
euclidean_distance_list = []
plot1 = my_list[0]
for plot2 in my_list:
    euclidean_distance = math.sqrt((float(plot1[0]) - float(plot2[0]))**2
                                   + (float(plot1[1]) - float(plot2[1]))**2)
    euclidean_distance_list.append(euclidean_distance)

# out of the for loop
sorted_list = sorted(euclidean_distance_list)
print(sorted_list)
This generates the following output:
[0.0, 1.0, 1.4142135623730951, 3.605551275463989, 4.242640687119285, 7.0, 7.810249675906654, 8.602325267042627, 9.219544457292887]
Now I want to reorder the original coordinates based on these distances, so that the result is:
1 2
1 3
1 9
2 1
3 5
4 5
6 8
6 9
7 9
Can anyone help me with Python code? I have calculated the distances but I am unable to get the list of coordinates sorted by them.
You want to sort the list based on a custom sort key.
Check out the key optional argument to the sort function. You can supply a function that maps each element to the value it should be sorted by.
https://docs.python.org/3/howto/sorting.html
To fill in a bit more detail - supposing that you already wrote the function:
def euclidean_distance(a, b):
    # does the math and gives the distance between coordinates a and b.
    # If you got the values some other way - better reorganize the code
    # first so that you have a function like this :)
    return math.sqrt((float(a[0]) - float(b[0]))**2 + (float(a[1]) - float(b[1]))**2)
We can use functools.partial to make a function for distances from a given point:
import functools
distance_from_a = functools.partial(euclidean_distance, points[0])
and then the rest of the logic is built into Python's native sorting functionality:
sorted(points, key=distance_from_a)
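Putting the pieces together into a runnable sketch (here points is hard-coded from the question; in practice it would come from your file parsing):
import functools
import math

def euclidean_distance(a, b):
    # straight-line distance between coordinate pairs a and b
    return math.sqrt((float(a[0]) - float(b[0]))**2 + (float(a[1]) - float(b[1]))**2)

points = [[1, 2], [2, 1], [1, 3], [1, 9], [6, 9], [3, 5], [6, 8], [4, 5], [7, 9]]
distance_from_a = functools.partial(euclidean_distance, points[0])
for point in sorted(points, key=distance_from_a):
    print(point)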
You can perform a custom sort by doing something like this, assuming you are using numpy:
import numpy as np
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

coords = np.array([[1, 2],
                   [2, 1],
                   [1, 3],
                   [1, 9],
                   [6, 9],
                   [3, 5],
                   [6, 8],
                   [4, 5],
                   [7, 9]])

coords = sorted(coords, key=lambda point: euclidean_distance(point, coords[0]))
print(np.matrix(coords)) # matrix is only for formatting for readability purposes
Output:
[[1 2]
 [1 3]
 [2 1]
 [3 5]
 [4 5]
 [1 9]
 [6 8]
 [6 9]
 [7 9]]
To explain why the above output is different from the OP's: the OP's example output is not actually ordered by distance as they described wanting.
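A quick check makes this concrete: the distances from (1, 2) to the first few points of the OP's expected ordering are not monotonically increasing.
import numpy as np

a = np.array([1, 2])
for p in [[1, 3], [1, 9], [2, 1]]:
    print(p, np.linalg.norm(np.array(p) - a))
# [1, 3] 1.0
# [1, 9] 7.0
# [2, 1] 1.4142135623730951  <- closer than [1, 9], so the order is not by distance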
I'm using the Linear Regression model from Scikit Learn for an explanatory fit on a time series:
from sklearn import linear_model
import numpy as np

# 100 samples with 2 features; note the transpose, since fit expects shape [n_samples, n_features]
X = np.array([np.random.random(100), np.random.random(100)]).T
y = np.array(np.random.random(100))

regressor = linear_model.LinearRegression()
regressor.fit(X, y)
y_hat = regressor.predict(X)
I want to cross-validate the prediction. As far as I know, I can't use the cross-validation utilities from sklearn (like KFold) because they break the data up randomly, and I need the folds to be sequential. For example,
data_set = [1 2 3 4 5 6 7 8 9 10]
# first train set
train = [1]
# first test set
test = [2 3 4 5 6 7 8 9 10]
#fit, predict, evaluate
# train set
train = [1 2]
# test set
test = [3 4 5 6 7 8 9 10]
#fit, predict, evaluate
...
# train set
train = [1 2 3 4 5 6 7 8]
# test set
test = [9 10]
#fit, predict, evaluate
Is it possible to do this using sklearn?
You do not need scikit for this kind of folding. Slicing is sufficient, something like:
step = 1
for i in range(1, len(data_set), step):  # start at 1 so the train slice is never empty
    train = data_set[:i]
    test = data_set[i:]
    # fit, predict, evaluate...
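That said, if your scikit-learn version is recent enough (0.18+), sklearn.model_selection.TimeSeriesSplit does provide sequential, expanding-window folds; the difference from the example above is that each test fold is the next block of samples rather than the entire remainder:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data_set = np.arange(1, 11)
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(data_set):
    print("train:", data_set[train_idx], "test:", data_set[test_idx])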
I'm interested in the performance of NumPy when it comes to algorithms that check whether a condition is True for an element and its associated elements (e.g. its neighbors) and assign a value according to the condition.
An example might be (I am making this up now):
I generate a 2d array of 1's and 0's, randomly.
Then I check whether the first element of the array is the same as its neighbors.
If the similar ones are the majority, I switch (0 -> 1 or 1 -> 0) that particular element.
And I proceed to the next element.
I guess that this kind of element-wise condition checking and element-wise operation is pretty slow with NumPy; is there a way I can make the performance better?
For example, would creating the array with dtype=bool and adjusting the code help?
Thanks in advance.
Maybe http://www.scipy.org/Cookbook/GameOfLifeStrides helps you.
It looks like you are doing some kind of image processing; you can try scipy.ndimage.
from scipy.ndimage import convolve
import numpy as np

np.random.seed(0)
x = np.random.randint(0, 2, (5, 5))
print(x)

# 3x3 kernel of ones with a zero center: convolving sums the 8 neighbors of each cell
w = np.ones((3, 3), dtype=np.int8)
w[1, 1] = 0
y = convolve(x, w, mode="constant")
print(y)
the outputs are:
[[0 1 1 0 1]
 [1 1 1 1 1]
 [1 0 0 1 0]
 [0 0 0 0 1]
 [0 1 1 0 0]]
[[3 4 4 5 2]
 [3 5 5 5 3]
 [2 4 4 4 4]
 [2 3 3 3 1]
 [1 1 1 2 1]]
y is the sum of the neighbors of every element. Doing the same convolve with all ones gives the number of neighbors of every element:
>>> n = convolve(np.ones((5, 5), np.int8), w, mode="constant")
>>> print(n)
[[3 5 5 5 3]
 [5 8 8 8 5]
 [5 8 8 8 5]
 [5 8 8 8 5]
 [3 5 5 5 3]]
then you can do element-wise operations with x, y, and n to get your result.
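For instance, here is a sketch of the made-up rule from the question (applied to all cells at once rather than one at a time), using the x, y, and n computed above:
# number of neighbors equal to each cell's value:
# y of them for a 1-cell, n - y of them for a 0-cell
same = np.where(x == 1, y, n - y)

# "the similar ones are the majority": strictly more than half of the neighbors
flip = same > n / 2.0

# switch 0 -> 1 or 1 -> 0 where the condition holds
result = np.where(flip, 1 - x, x)
print(result)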