I am using the KMeans class from sklearn.cluster. Once the clustering is finished, how can I find out which values were grouped together?
Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that?
Is there a function that takes a cluster id and lists all the data points in that cluster?
I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the labels as columns.
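import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('filename')
km = KMeans(n_clusters=5).fit(data)
# one row per sample: its index in the dataset and the cluster it was assigned to
cluster_map = pd.DataFrame()
cluster_map['data_index'] = data.index.values
cluster_map['cluster'] = km.labels_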
Once the DataFrame is available, it is quite easy to filter.
For example, to select all data points in cluster 3:
cluster_map[cluster_map.cluster == 3]
If you have a large dataset and you need to extract clusters on-demand you'll see some speed-up using numpy.where. Here is an example on the iris dataset:
from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target
km = KMeans(n_clusters=3)
km.fit(X)
Define a function to extract the indices of the cluster_id you provide. (Here are two functions for benchmarking; both return the same values):
def ClusterIndicesNumpy(clustNum, labels_array): #numpy
    return np.where(labels_array == clustNum)[0]
def ClusterIndicesComp(clustNum, labels_array): #list comprehension
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])
Let's say you want all samples that are in cluster 2:
ClusterIndicesNumpy(2, km.labels_)
array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])
Numpy wins the benchmark:
%timeit ClusterIndicesNumpy(2,km.labels_)
100000 loops, best of 3: 4 µs per loop
%timeit ClusterIndicesComp(2,km.labels_)
1000 loops, best of 3: 479 µs per loop
Now you can extract all of your cluster 2 data points like so:
X[ClusterIndicesNumpy(2,km.labels_)]
array([[ 6.9, 3.1, 4.9, 1.5],
[ 6.7, 3. , 5. , 1.7],
[ 6.3, 3.3, 6. , 2.5],
... #truncated
Double-check the first three indices from the truncated array above:
print(X[52], km.labels_[52])
print(X[77], km.labels_[77])
print(X[100], km.labels_[100])
[ 6.9 3.1 4.9 1.5] 2
[ 6.7 3. 5. 1.7] 2
[ 6.3 3.3 6. 2.5] 2
Actually, a very simple way to do this is:
clusters = KMeans(n_clusters=5).fit(df)
df[clusters.labels_ == 0]
The second line returns all the rows of df that belong to cluster 0. You can find the elements of the other clusters the same way.
To get the IDs of the points/samples/observations that are inside each cluster, do this:
Python 2
Example using Iris data and a nice pythonic way:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X,y)
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.iteritems():
    temp = [key,value]
    dictlist.append(temp)
RESULTS
#dict format
{0: array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
1: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
2: array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}
# list format
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
Python 3
Just change
for key, value in mydict.iteritems():
to
for key, value in mydict.items():
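If you would rather have one snippet that runs unchanged on both Python 2 and Python 3, a possible sketch is below. It uses plain lists instead of NumPy arrays, and `labels` is a made-up stand-in for `clf.labels_`, so treat it as an illustration of the idea rather than a drop-in replacement.

```python
from collections import defaultdict

# Hypothetical stand-in for clf.labels_ (one cluster id per sample).
labels = [1, 1, 0, 2, 0, 2, 2]

# Group the sample indices by cluster id in a single pass over the labels.
indices_by_cluster = defaultdict(list)
for index, cluster_id in enumerate(labels):
    indices_by_cluster[cluster_id].append(index)

# dict.items() exists in both Python 2 and Python 3, so no change is needed here.
dictlist = [[key, value] for key, value in sorted(indices_by_cluster.items())]
print(dictlist)
# [[0, [2, 4]], [1, [0, 1]], [2, [3, 5, 6]]]
```

The single pass avoids scanning the labels once per cluster, which is what the `np.where` dictionary comprehension does.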
You can look at the labels_ attribute.
For example:
from sklearn.cluster import KMeans
km = KMeans(2)
km.fit([[1,2,3],[2,3,4],[5,6,7]])
km.labels_
output: array([1, 1, 0], dtype=int32)
As you can see, the first and second points are in cluster 1 and the last point is in cluster 0.
You can simply store the labels in an array, convert the array to a DataFrame, and then merge the data you used to fit KMeans with that new DataFrame of cluster labels.
Display the DataFrame and you should see each row with its corresponding cluster. To list all the data in a specific cluster, use something like data.loc[data['cluster_label_name'] == 2], assuming 2 is the cluster you are interested in.
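To make the idea concrete without depending on pandas, here is a plain-Python sketch of the same two steps: attach the label to each row, then filter on it. The `rows` and `labels` values are invented for illustration (they stand in for your data and for `km.labels_`), and the column name `cluster_label_name` is just the placeholder used above.

```python
# Hypothetical data rows and labels (stand-ins for the real data and km.labels_).
rows = [
    {"x": 1.0, "y": 2.0},
    {"x": 1.2, "y": 1.9},
    {"x": 8.0, "y": 9.0},
    {"x": 8.1, "y": 9.2},
]
labels = [0, 0, 1, 1]

# Attach the label to each row. This relies on the labels being in the same
# order as the rows that were passed to fit().
labelled = [dict(row, cluster_label_name=label) for row, label in zip(rows, labels)]

# Keep only the rows that were assigned to cluster 1.
cluster_1 = [row for row in labelled if row["cluster_label_name"] == 1]
print(cluster_1)
```

The important assumption is the ordering: the i-th label belongs to the i-th row, which is also what makes the DataFrame merge work.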
So I have two lists:
num = [
3, 22, 23, 25, 28, 29, 56, 57, 67, 68, 73, 78, 79, 82, 83, 89, 90, 91,
92, 98, 99, 108, 126, 127, 128, 131, 132, 133
]
details = [
'num=10,time=088', 'num=10,time=084', 'num=10,time=080', 'num=10,time=076',
'num=10,time=072', 'num=10,time=068', 'num=10,time=064', 'num=10,time=060',
'num=10,time=056', 'num=10,time=052', 'num=10,time=048', 'num=10,time=044',
.
.
.
.
'num=07,time=280', 'num=07,time=276', 'num=05,time=508', 'num=05,time=504',
'num=05,time=500', 'num=05,time=496'
]
num has 28 elements and details has 134 elements. I want to remove elements in details by index based on values from num. For example elements with index 3, 22, 23, 25, 28... (these are numbers from num list) should be removed from details.
When I use .pop() as it is described here it gives me an error saying:
AttributeError: 'str' object has no attribute 'pop'
Similarly, when I use del details[] it gives me an error saying:
IndexError: list assignment index out of range
Here is my code:
for an in details:
    an.pop(num)
This should do what you want (delete from details every element indexed by the values in num):
for i in reversed(num):
    del details[i]
It iterates over the list backwards so that the indexing of future things to delete doesn't change (otherwise you'd delete 3 and then the element formerly indexed as 22 would be 21)--this is probably the source of your IndexError.
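If you do not need to modify details in place, an alternative that sidesteps the shifting-index problem altogether is to build a new list. This is a different technique from the reversed loop above, shown here as a sketch on small stand-in lists (the real details has 134 strings).

```python
# Small stand-in lists for illustration only.
details = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
num = [1, 4, 5]

# A set gives constant-time membership tests and ignores duplicate indices.
to_remove = set(num)

# Build a new list that skips the unwanted positions. Nothing is deleted from
# the list being iterated, so no index ever shifts.
kept = [item for index, item in enumerate(details) if index not in to_remove]
print(kept)  # ['a', 'c', 'd', 'g', 'h']
```

This version also does not care whether num is sorted.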
Hmm. Two things. First, your loop is not quite right. Instead of
for an in details:
    an.pop(num)
you want
for an in num: # step through every item in num list
    details.pop(an) # remove the item with index an from details list
Second, you need to make sure to pop() items from details in reverse order so that your indexes stay valid. For example, if you pop() index 3 from details, every later element shifts down by one position, so when you go to remove index 22 you will remove the wrong element.
I've simplified details to be a list containing numbers 0 to 133, but this code should work just fine on your real list
num = [3, 22, 23, 25, 28, 29, 56, 57, 67, 68, 73, 78, 79, 82, 83, 89, 90, 91, 92, 98,
99, 108, 126, 127, 128, 131, 132, 133]
details = list(range(134))
# sort indexes in descending order (sort in place)
num.sort(reverse = True)
for an in num:
    details.pop(an)
print(details)
output
[0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 24,
26, 27, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 58, 59, 60, 61, 62, 63, 64, 65, 66, 69, 70, 71,
72, 74, 75, 76, 77, 80, 81, 84, 85, 86, 87, 88, 93, 94, 95, 96, 97, 100, 101,
102, 103, 104, 105, 106, 107, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
119, 120, 121, 122, 123, 124, 125, 129, 130]
I have two types of data lists, historical data and simulated data, that I want to compare with each other to see if they have the same distribution. My code is as follows:
import scipy.stats as stats
data_hist = [164, 157, 145, 113, 127, 192, 214, 193, 107, 95, 60, 55, 30, 19, 22, 22, 19, 20]
date_sim1 = [160, 174, 142, 121, 122, 192, 198, 179, 119, 107, 63, 50, 26, 17, 16, 22, 23, 23]
date_sim2 = [181, 130, 152, 114, 122, 198, 183, 192, 105, 100, 85, 42, 37, 26, 25, 30, 17, 15]
print(stats.chisquare(date_sim1, f_exp=data_hist))
print(stats.chisquare(date_sim2, f_exp=data_hist))
The code gives the following output:
Power_divergenceResult(statistic=12.11387994054504, pvalue=0.79319278886052769)
Power_divergenceResult(statistic=34.413397609752003, pvalue=0.0074220617004927226)
I compared the same data lists with each other using the F-test in Excel and got P-values of 0.939 and 0.849 respectively.
Now my question is: am I using the correct chi-square function to calculate the P-value, and how do I interpret it to know whether I should reject the null hypothesis or not? Why is there such a big difference in the P-values between the two methods?
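As a sanity check on what stats.chisquare reports, the statistic is the Pearson sum of (observed - expected)^2 / expected over the bins, which is easy to recompute by hand. The sketch below does that in plain Python and applies the usual decision rule of rejecting the null hypothesis when the p-value falls below a chosen significance level. The 0.05 level is an assumed convention, not something given in the question, and the p-values are copied from the output above rather than recomputed. As far as I know, Excel's F-test compares the variances of the two lists rather than their bin-by-bin frequencies, which would explain why its p-values are not comparable.

```python
data_hist = [164, 157, 145, 113, 127, 192, 214, 193, 107, 95, 60, 55, 30, 19, 22, 22, 19, 20]
date_sim1 = [160, 174, 142, 121, 122, 192, 198, 179, 119, 107, 63, 50, 26, 17, 16, 22, 23, 23]
date_sim2 = [181, 130, 152, 114, 122, 198, 183, 192, 105, 100, 85, 42, 37, 26, 25, 30, 17, 15]

def pearson_chi2(observed, expected):
    """Sum of (O - E)^2 / E over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def reject_null(pvalue, alpha=0.05):
    """Assumed rule: reject when the p-value is below the significance level."""
    return pvalue < alpha

stat1 = pearson_chi2(date_sim1, data_hist)
stat2 = pearson_chi2(date_sim2, data_hist)
print(stat1, stat2)  # roughly 12.11 and 34.41, matching the output above

# p-values copied from the stats.chisquare output above
print(reject_null(0.79319278886052769))    # False: no evidence against the null
print(reject_null(0.0074220617004927226))  # True: reject at the 5% level
```

Under that assumed 5% level, the first simulation is compatible with the historical frequencies and the second is not.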
I posted this before, but it was confusing, so I'll try to explain it a bit better.
Executing my code creates a list with the values [2, 2, 3, 4, 6, 9, 14, 22, 35, 56, 90], which is just a sequence that I appended to a list. I want to be able to count by 1 with these values. So, starting from 2, I need to be able to count up by 1 forever. For instance, to count to 2 (the starting position), I choose the value 2 in the list and add it to a different list. To count to 3, I use 3. To count to 4, I use 4. However, to count to 5, since 5 is not actually in the list, I need to add 2 and 3 together, then add that result to the list. I want to be able to do this for all values past 1 (starting from 2). The most important thing to understand here is that I am only trying to prove I can do this. It's not meant as a functional counter; it's meant to be a proof of concept.
The list is created by adding the previous two numbers to get a third number, then subtracting 1 from that number.
Here is the code:
def main():
    x = 2
    y = 2
    z = (x + y) - 1
    print(x)
    print(y)
    times = 10
    count = 0
    while times > 0:
        if count == 0:
            seq_list = []
            seq_list.extend([x, y])
            print(seq_list)
            count = 1
        else:
            seq_list.append(y)
            print(seq_list)
        z = (x + y) - 1
        x = y
        y = z
        print(z)
        times -= 1
main()
This code prints each newly created value, as well as a list containing all the values created so far. It produces the list [2, 2, 3, 4, 6, 9, 14, 22, 35, 56, 90, ...], where (2 + 2) - 1 = 3, (2 + 3) - 1 = 4, etc.
If I understand correctly, then a simple solution would be to collect all the possible combinations in a set and then check if it's a complete sequence of numbers:
>>> import itertools
>>> l = [2, 2, 3, 4, 6, 9, 14, 22, 35, 56, 90]
>>> s = set()
>>> for i in range(len(l)):
...     for comb in itertools.combinations(l, i+1):
...         s.add(sum(list(comb)))
>>> sorted(s)
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102,
103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150,
151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166,
167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198,
199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214,
215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230,
231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 243]
As you can see, 242 is the only number that can't be reached via summation of your list's members:
>>> [i for i in range(2,244) if i not in s]
[242]
Of course, this approach is entirely brute-force - not a problem with small lists, but it won't scale nicely with larger ones.
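If scaling does become a concern, one option is to track only the set of reachable sums instead of enumerating every combination. This is a different technique from the itertools version above (a standard subset-sum style pass), sketched here on the same list. The number of stored sums can never exceed sum(l) + 1, whereas the number of combinations doubles with every extra element.

```python
l = [2, 2, 3, 4, 6, 9, 14, 22, 35, 56, 90]

# Start from 0, which stands for "no element chosen yet".
reachable = {0}
for value in l:
    # Each element is either skipped (keep the old sums) or used once (old sum + value).
    reachable |= {partial + value for partial in reachable}
reachable.discard(0)  # the empty selection is not a real sum

missing = [i for i in range(2, sum(l) + 1) if i not in reachable]
print(missing)  # [242]
```

It gives the same answer as the brute-force check: 242 is the only value between 2 and 243 that cannot be formed.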
Here you can see which combinations lead to which sum:
from itertools import combinations
from collections import defaultdict

def main():
    # build the sequence 2, 2, 3, 4, 6, 9, ...
    x = y = 2
    seq_list = [x]
    for _ in range(10):
        x, y = y, (x + y) - 1
        seq_list.append(x)
    print(seq_list)
    # map every reachable sum to the combinations that produce it
    result = defaultdict(list)
    for i in range(1, 1 + len(seq_list)):
        for comb in combinations(seq_list, i):
            result[sum(comb)].append(comb)
    for k in sorted(result):
        print(k, result[k])

if __name__ == '__main__':
    main()
If you're not interested in the combinations then you can use the following to print all the numbers that can be reached using your sequence.
def check_sum(a, b, stop, total):
    # either a completes the total, or we recurse with and without using a
    return a == total or (a < total and b <= stop and
        (check_sum(b, a + b - 1, stop, total - a) or check_sum(b, a + b - 1, stop, total)))
print([i for i in range(244) if check_sum(2, 2, 90, i)])
If you're interested in the more general question -- can all numbers be formed by a sum of numbers from the sequence a, b, a + b - 1... -- then you can use the following. It displays the first combination of numbers within that sequence which will add to the given total, or None if no such combination is found.
def combine_to_sum(a, b, total):
    if a == total:
        return [a]
    if a < total:
        # first try to use a, then fall back to skipping it
        r = combine_to_sum(b, a + b - 1, total - a)
        if r:
            return [a] + r
        return combine_to_sum(b, a + b - 1, total)
    # a > total: implicitly returns None
for i in range(300): # or any large number
    print(i, combine_to_sum(2, 2, i))
I have an algorithm. I want the last solution it produces to become the new starting solution whenever it satisfies certain conditions. In my case I have this:
First Part
Split the multidimensional array q into 2 parts:
split_at = q[:,3].searchsorted([1,random.randrange(LB,UB-I)])
B = numpy.split(q, split_at)
Rename and modify the split matrices:
S=B[1]
SF=B[2]
S2=copy(SF)
S2[:,3]=S2[:,3]+I
Compute the value f:
f=sum(S[:,1]*S[:,3])+sum(S2[:,1]*S2[:,3])
This first part is a required step.
Second Passage
Then I split the array into 2 parts again:
split_at = q[:,3].searchsorted([1,random.randrange(LB,UB-I)])
D = numpy.split(q, split_at)
I rename and modify parts of the matrix (as in the first part):
T=D[1]
TF=D[2]
T2=copy(TF)
T2[:,3]=T2[:,3]+I
u=random.sample(T[:],1) # randomly select one row from T
v=random.sample(T2[:],1) # randomly select one row from T2
u=array(u)
v=array(v)
Here is my first problem: I want to continue the algorithm only if v[0,0]-u[0,0]+T[-1,3]<=UB; otherwise I want to repeat the Second Passage until the condition holds.
Now I swap one random row of T with another from T2:
x=numpy.where(v==T2)[0][0]
y=numpy.where(u==T)[0][0]
l=np.copy(T[y])
T[y],T2[x]=T2[x],T[y]
T2[x],l=l,T2[x]
I modify and recalculate some columns of the matrices:
E=np.copy(T)
E2=np.copy(T2)
E[:,3]=np.cumsum(E[:,0])
E2[:,3]=np.cumsum(E2[:,0])+I
Define f2:
f2=sum(E[:,1]*E[:,3])+sum(E2[:,1]*E2[:,3])
Here is my second and last problem. I need to iterate this algorithm. If f-f2<0, my new starting solution has to be E and E2 and my new f has to be f2; then I iterate again, excluding the last choice, and recalculate a new f and f2.
Thank you for your patience. I'm a noob :D
EDIT:
I have an example here (this part goes before the code I have written above):
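import random
import numpy
import numpy as np
from numpy import array, transpose, copy  # the code below uses these names unqualified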
p=[ 29, 85, 147, 98, 89, 83, 49, 7, 48, 88, 106, 97, 2,
107, 33, 144, 123, 84, 25, 42, 17, 82, 125, 103, 31, 110,
34, 100, 36, 46, 63, 18, 132, 10, 26, 119, 133, 15, 138,
113, 108, 81, 118, 116, 114, 130, 134, 86, 143, 126, 104, 52,
102, 8, 90, 11, 87, 37, 68, 75, 69, 56, 40, 70, 35,
71, 109, 5, 131, 121, 73, 38, 149, 20, 142, 91, 24, 53,
57, 39, 80, 79, 94, 136, 111, 78, 43, 92, 135, 65, 140,
148, 115, 61, 137, 50, 77, 30, 3, 93]
w=[106, 71, 141, 134, 14, 53, 57, 128, 119, 6, 4, 2, 140,
63, 51, 126, 35, 21, 125, 7, 109, 82, 95, 129, 67, 115,
112, 31, 114, 42, 91, 46, 108, 60, 97, 142, 85, 149, 28,
58, 52, 41, 22, 83, 86, 9, 120, 30, 136, 49, 84, 38,
70, 127, 1, 99, 55, 77, 144, 105, 145, 132, 45, 61, 81,
10, 36, 80, 90, 62, 32, 68, 117, 64, 24, 104, 131, 15,
47, 102, 100, 16, 89, 3, 147, 48, 148, 59, 143, 98, 88,
118, 121, 18, 19, 11, 69, 65, 123, 93]
p=array(p,'double')
w=array(w,'double')
r=p/w
LB=12
UB=155
I=9
j=p,w,r
j=transpose(j)
k=j[j[:,2].argsort()]
c=np.cumsum(k[:,0])
q=k[:,0],k[:,1],k[:,2],c
q=transpose(q)
o=sum(q[:,1]*q[:,3])
split_at = q[:,3].searchsorted([1,UB-I])
B = numpy.split(q, split_at)
S=B[1]
SF=B[2]
S2=copy(SF)
S2[:,3]=S2[:,3]+I
f=sum(S[:,1]*S[:,3])+sum(S2[:,1]*S2[:,3])
split_at = q[:,3].searchsorted([1,random.randrange(LB,UB-I)])
D = numpy.split(q, split_at)
T=D[1]
TF=D[2]
T2=copy(TF)
T2[:,3]=T2[:,3]+I
u=random.sample(T[:],1)
v=random.sample(T2[:],1)
u=array(u)
v=array(v)
x=numpy.where(v==T2)[0][0]
y=numpy.where(u==T)[0][0]
l=np.copy(T[y])
T[y],T2[x]=T2[x],T[y]
T2[x],l=l,T2[x]
E=np.copy(T)
E2=np.copy(T2)
E[:,3]=np.cumsum(E[:,0])
E2[:,3]=np.cumsum(E2[:,0])+I
f2=sum(E[:,1]*E[:,3])+sum(E2[:,1]*E2[:,3])
I tried:
def DivideRandom(T,T2):
    split_at = q[:,3].searchsorted([1,random.randrange(LB,UB-I)])
    D = numpy.split(q, split_at)
    T=D[1]
    TF=D[2]
    T2=copy(TF)
    T2[:,3]=T2[:,3]+I
DivideRandom(T,T2)
def SelectJob(u,v):
    u=random.sample(T[:],1)
    v=random.sample(T2[:],1)
    u=array(u)
    v=array(v)
SelectJob(u,v)
d=v[0,0]-u[0,0]+T[-1,3]
def Swap(u,v):
    x=numpy.where(v==T2)[0][0]
    y=numpy.where(u==T)[0][0]
    l=np.copy(T[y])
    T[y],T2[x]=T2[x],T[y]
    T2[x],l=l,T2[x]
    E=np.copy(T)
    E2=np.copy(T2)
    E[:,3]=np.cumsum(E[:,0])
    E2[:,3]=np.cumsum(E2[:,0])+I
    f2=sum(E[:,1]*E[:,3])+sum(E2[:,1]*E2[:,3])
while True:
    if d<=UB:
        Swap(u,v)
    if d>UB:
        DivideRandom(T,T2)
        SelectJob(u,v)
    if d<UB:
        break
You can iterate indefinitely using while True, then stop whenever your conditions are met using break:
count = 0
while True:
    count += 1
    if count == 10:
        break
So for your second example you can try:
while True:
    ...
    if f - f2 < 0:
        # use new variables
        f, E = f2, E2
    else:
        break
Your first problem is similar; loop, test, reset the appropriate variables.
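To show both loops working together, here is a small self-contained sketch on a toy problem. Everything in it is hypothetical: `draw_until_valid`, `iterate`, `swap_two` and `weighted_sum` are names invented for this illustration, the validity test is arbitrary, and the toy score only loosely mimics your f. Unlike the `else: break` above, this version runs for a fixed number of rounds and simply keeps the current solution when a candidate is not accepted.

```python
import random

def draw_until_valid(propose, is_valid, max_tries=10_000):
    """First problem: keep drawing candidates until one passes the condition."""
    for _ in range(max_tries):
        candidate = propose()
        if is_valid(candidate):
            return candidate
    raise RuntimeError("no valid candidate after max_tries draws")

def iterate(start, score, propose_from, is_valid, rounds=200):
    """Second problem: a candidate becomes the new start when f - f2 < 0."""
    current, f = start, score(start)
    for _ in range(rounds):
        candidate = draw_until_valid(lambda: propose_from(current), is_valid)
        f2 = score(candidate)
        if f - f2 < 0:
            current, f = candidate, f2  # reset the starting solution and f
    return current, f

def swap_two(seq):
    """Toy move: return a copy of seq with two random positions swapped."""
    i, j = random.sample(range(len(seq)), 2)
    out = list(seq)
    out[i], out[j] = out[j], out[i]
    return out

def weighted_sum(seq):
    """Toy score: position times value, summed over the sequence."""
    return sum(pos * value for pos, value in enumerate(seq))

random.seed(0)
start = [5, 3, 8, 1, 9, 2]
best, best_f = iterate(start, weighted_sum, swap_two, is_valid=lambda s: s[0] != 9)
print(best, best_f)
```

The point of returning values from the helpers, instead of assigning inside them, is that assignments made inside a function do not change the variables of the same name outside it.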