reordering cluster numbers for correct correspondence - python

I have a dataset that I clustered using two different clustering algorithms. The results are about the same, but the cluster numbers are permuted.
Now, for displaying the color-coded labels, I want the label ids to be the same for the same clusters.
How can I get the correct permutation between the two sets of label ids?
I can do this using brute force, but perhaps there is a better/faster method. I would greatly appreciate any help or pointers. If possible I am looking for a python function.

The best-known algorithm for finding the optimal matching is the Hungarian method.
Because it cannot be explained in a few sentences, I have to refer you to a book of your choice, or the Wikipedia article "Hungarian algorithm".
You can probably get good results (even perfect if the difference is indeed tiny) by simply picking the maximum of the correspondence matrix and then removing that row and column.
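For a concrete starting point: SciPy ships the Hungarian algorithm as scipy.optimize.linear_sum_assignment, which you can run directly on the contingency matrix of the two labelings. A minimal sketch, assuming both labelings use the ids 0..k-1 (match_labels is just an illustrative name, not part of any library):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

def match_labels(ref_cluster, map_cluster):
    # rows = ref labels, cols = map labels, entries = co-occurrence counts
    cont = contingency_matrix(ref_cluster, map_cluster)
    # maximizing the total overlap is the same as minimizing its negation
    row_ind, col_ind = linear_sum_assignment(-cont)
    # mapping[old_map_label] = best-matching ref label
    mapping = np.empty(cont.shape[1], dtype=int)
    mapping[col_ind] = row_ind
    return mapping[map_cluster]

This does the whole matching in one call, with no special handling needed for duplicated maxima.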

I have a function that works for me. But it may fail when the two cluster results are very inconsistent, which leads to duplicated max values in the contingency matrix. If your cluster results are about the same, it should work.
Here is my code:
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def align_cluster_index(ref_cluster, map_cluster):
    """
    Remap cluster indices according to ref_cluster.
    Both inputs must be numpy arrays and have the same number of unique cluster index values.
    Xin Niu Jan-15-2020
    """
    ref_values = np.unique(ref_cluster)
    map_values = np.unique(map_cluster)
    print(ref_values)
    print(map_values)
    num_values = ref_values.shape[0]
    if ref_values.shape[0] != map_values.shape[0]:
        print('error: both inputs must have the same number of unique cluster index values.')
        return
    switched_col = set()
    while True:
        cont_mat = contingency_matrix(ref_cluster, map_cluster)
        print(cont_mat)
        # divide the contingency matrix by its row and column sums to avoid potential duplicated values:
        col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
        row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
        print(col_sum)
        print(row_sum)
        cont_mat = cont_mat / (col_sum + row_sum)
        print(cont_mat)
        # ignore columns that have already been switched:
        cont_mat[:, list(switched_col)] = -1
        print(cont_mat)
        sort_0 = np.argsort(cont_mat, axis=0)
        sort_1 = np.argsort(cont_mat, axis=1)
        print('argsort contmat:')
        print(sort_0)
        print(sort_1)
        if np.array_equal(sort_1[:, -1], np.array(range(num_values))):
            break
        # switch values according to the max value in the contingency matrix:
        # get the position of the max value:
        idx_max = np.unravel_index(np.argmax(cont_mat, axis=None), cont_mat.shape)
        print(cont_mat)
        print(idx_max)
        if (cont_mat[idx_max] > 0) and (idx_max[0] not in switched_col):
            cluster_tmp = map_cluster.copy()
            print('switch', map_values[idx_max[1]], 'and:', ref_values[idx_max[0]])
            map_cluster[cluster_tmp == map_values[idx_max[1]]] = ref_values[idx_max[0]]
            map_cluster[cluster_tmp == map_values[idx_max[0]]] = ref_values[idx_max[1]]
            switched_col.add(idx_max[0])
            print(switched_col)
        else:
            break
    print('final argsort contmat:')
    print(sort_0)
    print(sort_1)
    print('final cont_mat:')
    cont_mat = contingency_matrix(ref_cluster, map_cluster)
    col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
    row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
    cont_mat = cont_mat / (col_sum + row_sum)
    print(cont_mat)
    return map_cluster
And here is some test code:
ref_cluster = np.array([2,2,3,1,0,0,0,1,2,1,2,2,0,3,3,3,3])
map_cluster = np.array([0,0,0,1,1,3,2,3,2,2,0,0,0,2,0,3,3])
c = align_cluster_index(ref_cluster, map_cluster)
print(ref_cluster)
print(c)
>>>[2 2 3 1 0 0 0 1 2 1 2 2 0 3 3 3 3]
>>>[2 2 2 1 1 3 0 3 0 0 2 2 2 0 2 3 3]

Related

How to select list elements based on criteria from other lists

I am new to Python, coming from SciLab (an open source MatLab ersatz), which I am using as a toolbox for my analyses (test data analysis, reliability, acoustics, ...); I am definitely not a computer science lad.
I have data in the form of lists of same length (vectors of same size in SciLab).
I use some of them as parameters in order to select data from another one, e.g.
t_v = [1:10]; // a parameter vector
p_v = [20:29]; // another parameter vector
res_v(t_v > 5 & p_v < 28); // the elements of res_v whose "corresponding" p_v and t_v values comply with my criteria; I can use it for analyses.
This is very direct and simple in SciLab; I did not find a way to achieve the same with Python, either "Pythonically" or simply translated.
Any idea that could help me, please?
Have a nice day,
Patrick.
You could use numpy arrays. It's easy:
import numpy as np
par1 = np.array([1,1,5,5,5,1,1])
par2 = np.array([-1,1,1,-1,1,1,1])
data = np.array([1,2,3,4,5,6,7])
print(par1)
print(par2)
print(data)
bool_filter = (par1[:]>1) & (par2[:]<0)
# example to do it directly in the array
filtered_data = data[ par1[:]>1 ]
print( filtered_data )
#filtering with the two parameters
filtered_data_twice = data[ bool_filter==True ]
print( filtered_data_twice )
output:
[1 1 5 5 5 1 1]
[-1 1 1 -1 1 1 1]
[1 2 3 4 5 6 7]
[3 4 5]
[4]
Note that it does not keep the same number of elements.
Here's my modified solution according to your last comment.
t_v = list(range(1, 10))
p_v = list(range(20, 29))
res_v = list(range(30, 39))

def first_idex_greater_than(search_number, lst):
    for count, number in enumerate(lst):
        if number > search_number:
            return count

def first_idex_lower_than(search_number, lst):
    for count, number in enumerate(lst[::-1]):
        if number < search_number:
            return len(lst) - count  # since I searched lst from top to bottom,
                                     # I need to also reverse count

t_v_index = first_idex_greater_than(5, t_v)
p_v_index = first_idex_lower_than(28, p_v)
print(res_v[min(t_v_index, p_v_index):max(t_v_index, p_v_index)])
It returns the list [35, 36, 37].
I'm sure you can optimize it better according to your needs.
The problem statement is not clearly defined, but this is what I interpret to be a likely solution.
import pandas as pd
tv = list(range(1, 11))
pv = list(range(20, 30))
res = list(range(30, 40))
df = pd.DataFrame({'tv': tv, 'pv': pv, 'res': res})
print(df)
def criteria(row, col1, a, col2, b):
    if (row[col1] > a) & (row[col2] < b):
        return True
    else:
        return False
df['select'] = df.apply(lambda row: criteria(row, 'tv', 5, 'pv', 28), axis=1)
selected_res = df.loc[df['select']]['res'].tolist()
print(selected_res)
# ... or another way ..
print(df.loc[(df.tv > 5) & (df.pv < 28)]['res'])
This builds a dataframe whose columns are the original lists, applies the selection criterion (based on columns tv and pv) to each row, and stores the outcome in a new boolean column identifying the rows where the criterion is satisfied.
[35, 36, 37]
5 35
6 36
7 37

Find where the slope changes in my data as a parameter that can be easily indexed and extracted

I have the following data:
0.8340502011561366
0.8423491600218922
0.8513456021654467
0.8458192388553084
0.8440111276014195
0.8489589671423143
0.8738088120491972
0.8845129900705279
0.8988298998926688
0.924633964692693
0.9544790734065157
0.9908034431246875
1.0236430466543138
1.061619773027915
1.1050038249835414
1.1371449802490126
1.1921182610371368
1.2752207659022576
1.344047620255176
1.4198117350668353
1.507943067143741
1.622137968203745
1.6814098429502085
1.7646810054280595
1.8485457435775694
1.919591124757554
1.9843144220593145
2.030158014640226
2.018184122476175
2.0323466012624207
2.0179200409023874
2.0316932950853723
2.013683870089898
2.03010703506514
2.0216151623726977
2.038855467786505
2.0453923522466093
2.03759031642753
2.019424996752278
2.0441806106428606
2.0607521369415136
2.059310067318373
2.0661157975162485
2.053216429539864
2.0715123971225564
2.0580473413362075
2.055814512721712
2.0808278560688964
2.0601637029377113
2.0539429365156003
2.0609648613513754
2.0585135712612646
2.087674625814453
2.062482961966647
2.066476100210777
2.0568444178944967
2.0587903943282266
2.0506399365756396
When plotted, the data rises and then levels off. I want to find the point where the slope changes sign (I circled it in black in the plot; it should be around index 26):
I need to find this point of change for several hundred files. So far I tried the recommendation from this post:
Finding the point of a slope change as a free parameter- Python
I think that because my data is a bit noisy, I am not getting a smooth transition in the change of the slope.
This is the code I have tried so far:
import sys
import numpy as np

# load 1-D data file
file = str(sys.argv[1])
y = np.loadtxt(file)
# create x based on file length
x = np.linspace(1, len(y), num=len(y))
# find first derivative
m = np.diff(y)/np.diff(x)
print(m)
# find second derivative
b = np.diff(m)
print(b)
# find index
index = 0
for difference in b:
    index += 1
    if difference < 0:
        print(index, difference)
Since my data is noisy, I am getting some negative values before the index I want. The index I want to retrieve in this case is around 26 (which is where my data becomes constant). Does anyone have any suggestions on what I can do to solve this issue? Thank you!
A gradient approach is useless in this case because you don't care about velocities or vector fields. Knowledge of the gradient doesn't add extra information for locating the maximum value, since the run is always positive and hence does not affect the sign of the gradient. A method based entirely on rise is suggested.
Detect the indices at which the data are decreasing, take the differences between those indices, and use the largest difference to locate the maximum. Then, by index manipulation, you can find the value at which the data has its maximum.
data = '0.8340502011561366 0.8423491600218922 0.8513456021654467 0.8458192388553084 0.8440111276014195 0.8489589671423143 0.8738088120491972 0.8845129900705279 0.8988298998926688 0.924633964692693 0.9544790734065157 0.9908034431246875 1.0236430466543138 1.061619773027915 1.1050038249835414 1.1371449802490126 1.1921182610371368 1.2752207659022576 1.344047620255176 1.4198117350668353 1.507943067143741 1.622137968203745 1.6814098429502085 1.7646810054280595 1.8485457435775694 1.919591124757554 1.9843144220593145 2.030158014640226 2.018184122476175 2.0323466012624207 2.0179200409023874 2.0316932950853723 2.013683870089898 2.03010703506514 2.0216151623726977 2.038855467786505 2.0453923522466093 2.03759031642753 2.019424996752278 2.0441806106428606 2.0607521369415136 2.059310067318373 2.0661157975162485 2.053216429539864 2.0715123971225564 2.0580473413362075 2.055814512721712 2.0808278560688964 2.0601637029377113 2.0539429365156003 2.0609648613513754 2.0585135712612646 2.087674625814453 2.062482961966647 2.066476100210777 2.0568444178944967 2.0587903943282266 2.0506399365756396'
data = data.split()
import numpy as np
a = np.array(data, dtype=float)
diff = np.diff(a)
neg_indeces = np.where(diff<0)[0]
neg_diff = np.diff(neg_indeces)
i_max_dif = np.where(neg_diff == neg_diff.max())[0][0] + 1
i_max = neg_indeces[i_max_dif] - 1 # -1 because each diff value arises from two consecutive data values
print(i_max, a[i_max])
Output
26 1.9843144220593145
Some details
print(neg_indeces) # all indeces of the negative values in the data
# [ 2 3 27 29 31 33 36 37 40 42 44 45 47 48 50 52 54 56]
print(neg_diff) # difference between such indices
# [ 1 24 2 2 2 3 1 3 2 2 1 2 1 2 2 2 2]
print(neg_diff.max()) # value with highest difference
# 24
print(i_max_dif) # position in neg_indeces of the index after the largest gap (neg_indeces[2] == 27)
# 2
print(i_max) # index of the max of the original data
# 26
When the first derivative changes sign, that's when the slope sign changes. I don't think you need the second derivative, unless you want to determine the rate of change of the slope. You also aren't getting the second derivative; you're just getting the difference of the first derivative.
Also, you seem to be assigning arbitrary x values. If your y-values represent points that are equally spaced apart, then it's OK; otherwise the derivative will be wrong.
Here's an example of how to get the first and second derivatives:
import numpy as np

x = np.linspace(1, 100, 1000)
y = np.cos(x)
# Find first derivative:
m = np.diff(y)/np.diff(x)
# Find second derivative
m2 = np.diff(m)/np.diff(x[:-1])
print(m)
print(m2)
# Get x-values where slope sign changes
c = len(m)
changes_index = []
for i in range(1, c):
    prev_val = m[i-1]
    val = m[i]
    if prev_val < 0 and val > 0:
        changes_index.append(i)
    elif prev_val > 0 and val < 0:
        changes_index.append(i)
for i in changes_index:
    print(x[i])
Notice I had to curtail the x values for the second derivative. That's because np.diff() returns one less point than the original input.
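As a side note, the sign-change loop can also be vectorized. This is a minimal sketch of the same test using np.sign and np.diff (like the loop above, it flags the index of the later of the two slope values):

import numpy as np

x = np.linspace(1, 100, 1000)
y = np.cos(x)
m = np.diff(y)/np.diff(x)  # first derivative
# nonzero differences between consecutive slope signs mark sign changes
sign_changes = np.where(np.diff(np.sign(m)) != 0)[0] + 1
print(x[sign_changes])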

Iterate the code in a shortest way for the whole dataset

I have a very big df:
df.shape = (106, 3364)
I want to calculate the so-called Fréchet distance by using this Frechet Distance between 2 curves, and it works well. Example:
x = df['1']
x1 = df['1.1']
p = np.array([x, x1])
y = df['2']
y1 = df['2.1']
q = np.array([y, y1])
P_final = list(zip(p[0], p[1]))
Q_final = list(zip(q[0], q[1]))
from frechetdist import frdist
frdist(P_final,Q_final)
But I cannot do it row by row, like:
`1 and 1.1` to `1 and 1.1` which is equal to 0
`1 and 1.1` to `2 and 2.1` which is equal to some number
...
`1 and 1.1` to `1682 and 1682.1` which is equal to some number
I want to create something (my first idea is a for loop, but maybe you have a better solution) to calculate this frdist(P_final, Q_final) between:
the first row and all rows (including itself)
the second row and all rows (including itself)
...
Finally, I am supposed to get a matrix of size (106, 106) with 0 on the diagonal (because the distance of a row to itself is 0):
matrix =
      0   1   2   3   4   5  ... 105
0     0
1         0
2             0
3                 0
4                     0
5                         0
...
105                              0
Not including my trial code because it is confusing everyone!
EDITED:
Sample data:
1 1.1 2 2.1 3 3.1 4 4.1 5 5.1
0 43.1024 6.7498 45.1027 5.7500 45.1072 3.7568 45.1076 8.7563 42.1076 8.7563
1 46.0595 1.6829 45.0595 9.6829 45.0564 4.6820 45.0533 8.6796 42.0501 3.6775
2 25.0695 5.5454 44.9727 8.6660 41.9726 2.6666 84.9566 3.8484 44.9566 1.8484
3 35.0281 7.7525 45.0322 3.7465 14.0369 3.7463 62.0386 7.7549 65.0422 7.7599
4 35.0292 7.5616 45.0292 4.5616 23.0292 3.5616 45.0292 7.5616 25.0293 7.5613
I just used my own sample data in your format (I hope):
import pandas as pd
from frechetdist import frdist
import numpy as np

# create sample data
df = pd.DataFrame([[1,2,3,4,5,6], [3,4,5,6,8,9], [2,3,4,5,2,2], [3,4,5,6,7,3]],
                  columns=['1', '1.1', '2', '2.1', '3', '3.1'])
# this matrix will hold the result
res = np.ndarray(shape=(df.shape[1] // 2, df.shape[1] // 2), dtype=np.float32)
for row in range(res.shape[0]):
    for col in range(row, res.shape[1]):
        # extract the two curves as lists of (x, y) points
        P = [*zip(df.loc[:, f'{row+1}'], df.loc[:, f'{row+1}.1'])]
        Q = [*zip(df.loc[:, f'{col+1}'], df.loc[:, f'{col+1}.1'])]
        # calculate distance
        dist = frdist(P, Q)
        # put the result back (it is symmetric)
        res[row, col] = dist
        res[col, row] = dist
# output
print(res)
Output:
[[0. 4. 7.5498343]
[4. 0. 5.5677643]
[7.5498343 5.5677643 0. ]]
Hope that helps
EDIT: Some general tips:
If speed matters: check whether frdist also handles a numpy array of shape (n_values, 2); then you could save the rather expensive zip-and-unpack operation and use the arrays directly, or build the data directly in a format your library needs (see the sketch below).
Generally, use better column names ('3' and '3.1' is not too obvious). Why don't you call them x3, y3, or x3 and f_x3?
I would actually put the data into two different matrices. If you look at the code, I had to do some not-so-obvious things, like iterating over the shape divided by two and building indices from string operations, because of the given table layout.
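To illustrate the first tip, here is a minimal sketch of building the two curves as (n_values, 2) numpy arrays with np.column_stack instead of zipping; whether frdist accepts arrays directly is exactly the assumption you would have to verify:

import numpy as np
import pandas as pd
from frechetdist import frdist

df = pd.DataFrame([[1, 2, 3, 4], [3, 4, 5, 6], [2, 3, 4, 5]],
                  columns=['1', '1.1', '2', '2.1'])
# stack the x and y columns of each curve into an (n_values, 2) array
P = np.column_stack((df['1'].to_numpy(), df['1.1'].to_numpy()))
Q = np.column_stack((df['2'].to_numpy(), df['2.1'].to_numpy()))
print(frdist(P, Q))  # if frdist rejects arrays, use frdist(P.tolist(), Q.tolist())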

Binning and then combining bins with minimum number of observations?

Let's say I create some data and then create bins of different sizes:
from __future__ import division
import numpy as np
import pandas as pd

x = np.random.rand(1,20)
new, = np.digitize(x, np.arange(1, x.shape[1]+1)/100)
new_series = pd.Series(new)
print(new_series.value_counts())
reveals:
20 17
16 1
4 1
2 1
dtype: int64
I basically want to transform the underlying data, if I set a minimum threshold of at least 2 per bin, so that new_series.value_counts() is this:
20 17
16 3
dtype: int64
EDITED:
x = np.random.rand(1,100)
bins = np.arange(1, x.shape[1]+1)/100
new = np.digitize(x, bins)
n = new.copy()[0]  # this will hold the result
threshold = 2
for i in np.unique(n):
    if sum(n == i) <= threshold:
        n[n == i] += 1
n = n.clip(0, bins.size)  # avoid adding beyond the last bin (clip is not in-place)
n = n.reshape(1,-1)
This can move counts up multiple times, until a bin is filled sufficiently.
Instead of using np.digitize, it might be simpler to use np.histogram, because it directly gives you the counts, so we don't need to sum them ourselves.
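A minimal sketch of that np.histogram variant (the underfilled bins can then be found directly, without value_counts):

import numpy as np

x = np.random.rand(1, 100)
bins = np.arange(1, x.shape[1]+1)/100
# counts[i] is the number of samples between edges[i] and edges[i+1]
counts, edges = np.histogram(x, bins=bins)
underfilled = np.where(counts < 2)[0]  # bins below the minimum of 2
print(counts)
print(underfilled)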

Find two disjoint pairs of pairs that sum to the same vector

This is a follow-up to Find two pairs of pairs that sum to the same value.
I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
m = 50
n = 50
A = np.random.randint(2, size=(m, n))
I would like to determine if the matrix has two disjoint pairs of pairs of columns which sum to the same column vector. I am looking for a fast method to do this. In the previous problem ((0,1), (0,2)) was acceptable as a pair of pairs of column indices but in this case it is not as 0 is in both pairs.
The accepted answer to the previous question is so cleverly optimised that, unfortunately, I can't see how to make this simple-looking change. (I am interested in columns rather than rows in this question, but I can always just do A.transpose().)
Here is some code that tests all 4 by 4 arrays.
n = 4
nxn = np.arange(n*n).reshape(n, -1)
count = 0
for i in xrange(2**(n*n)):
    A = (i >> nxn) % 2
    p = 1
    for firstpair in combinations(range(n), 2):
        for secondpair in combinations(range(n), 2):
            if firstpair < secondpair and not set(firstpair) & set(secondpair):
                if np.array_equal(A[firstpair[0]] + A[firstpair[1]], A[secondpair[0]] + A[secondpair[1]]):
                    if p:
                        count += 1
                        p = 0
print count
This should output 3136.
Here is my solution, extended to do what I believe you want. It isn't entirely clear, though: one may get an arbitrary number of row-pairs that sum to the same total, and there may exist unique subsets of rows within them that sum to the same value. For instance:
Given this set of row-pairs that sum to the same total
[[19 19 30 30]
[11 16 11 16]]
There exists a unique subset of these rows that may still be counted as valid; but should it?
[[19 30]
[16 11]]
Anyway, I hope those details are easy to deal with, given the code below.
import numpy as np

n = 20
# also works for non-square A
A = np.random.randint(2, size=(n*6, n)).astype(np.int8)
##A = np.array([[0, 0, 0], [1, 1, 1], [1, 1, 1]], np.uint8)
##A = np.zeros((6,6))
# force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]

def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        R = np.zeros(a.shape[1:], dtype)
        for col in columns:
            R *= base
            R += col
        yield R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)  # note: this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations

    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    # list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    # keep all packed columns; we might need them later
    columns = []
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  # loop over packed cols
        columns.append(packed_column)
        # compute rowsums only for a fixed number of columns at a time.
        # this is O(n^2) rather than O(n^3), and after considering the first column,
        # we can typically already exclude almost all combinations
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        # find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        # prune those combinations which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        # early exit; no pairs
        if len(active_combinations) == 0:
            return False

    """
    we now have a small set of relevant combinations, but we have lost the details of their particulars
    to see which combinations of rows do sum to the same value, we need to consider rows as a whole
    we can simply apply the same mechanism, but for all columns at the same time,
    and only for the selected subset of row combinations known to be relevant
    """
    # construct full packed matrix
    B = np.ascontiguousarray(np.vstack(columns).T)
    # perform all relevant sums, over all columns
    rowsums = sum(B[I[active_combinations]] for I in combinations_index)
    # find the unique rowsums, by viewing rows as a void object
    unique, count, inverse = unique_count(voidview(rowsums))
    # if not, we did something wrong in deciding on active combinations
    assert(np.all(count > 1))
    # loop over all sets of rows that sum to an identical unique value
    for i in xrange(len(unique)):
        # set of indexes into combinations_index;
        # note that there may be more than two combinations that sum to the same value; we grab them all here
        combinations_group = active_combinations[inverse == i]
        # associated row-combinations; array of shape=(multiplicity, group_size)
        row_combinations = combinations_index[:, combinations_group]
        # if no duplicate rows are involved, we have a match
        if len(np.unique(row_combinations[:, [0, -1]])) == multiplicity*2:
            print row_combinations
            return True
    # none of the identical rowsums met the uniqueness criteria
    return False

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i, j, k)
                    for i in xrange(n)
                    for j in xrange(n)
                    for k in xrange(n)
                    if i < j and j < k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n, -1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import clock
t = clock()
for i in xrange(1):
##    print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock() - t
Edit: code cleanup
