Transforming different arrays into a loop - python

I was wondering if it is possible to transform the following process into a loop so that I can use a single name for the result (not one variable per layer):
Data0 = np.zeros(dem0.shape, dtype=np.int32)
Data0[zipp[0] >= 0 ] = 1
Data1 = np.zeros(dem1.shape, dtype=np.int32)
Data1[zipp[1] >= 0 ] = 1
Data2 = np.zeros(dem2.shape, dtype=np.int32)
Data2[zipp[2] >= 0 ] = 1
Data3 = np.zeros(dem3.shape, dtype=np.int32)
Data3[zipp[3] >= 0 ] = 1
As you can see, there is one shape per layer (four layers in total). I am trying to match the corresponding position of the "zipp" vector to each dem.shape for each layer I have (in the vector zipp, each zipp[i] is the array of one dem).
What I want is to replace with the number 1 those values greater than or equal to zero in the array contained in zipp[i], for each layer/shape/dem.
However, I must deliver the result under a single name (a "word"), not as a vector or array, so I've been thinking of a loop but haven't figured it out just yet.
Thank you :)

I'm not quite sure what you mean by delivering the result "as a word, not a vector or array", but assuming all of these arrays have the same shape, you can reduce this to a couple of lines (maybe someone else knows how to do it in one):
data = np.zeros_like(zipp, dtype=np.int32)
data[zipp >= 0] = 1
If you just want a boolean array of where zipp is greater than or equal to 0, you can do that in one line like this (using a name that doesn't shadow the built-in bool):
mask = np.greater_equal(zipp, 0)
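If the four dem arrays have different shapes (so they can't be stacked into one array), a minimal loop-based sketch, assuming each zipp[i] has the same shape as its dem, could be:
import numpy as np

dems = [dem0, dem1, dem2, dem3]      # the four layers from the question
data = []                            # one result array per layer
for i, dem in enumerate(dems):
    layer = np.zeros(dem.shape, dtype=np.int32)
    layer[zipp[i] >= 0] = 1          # mark non-negative entries with 1
    data.append(layer)
# data[0] ... data[3] now play the roles of Data0 ... Data3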


Find where the slope changes in my data as a parameter that can be easily indexed and extracted

I have the following data:
0.8340502011561366
0.8423491600218922
0.8513456021654467
0.8458192388553084
0.8440111276014195
0.8489589671423143
0.8738088120491972
0.8845129900705279
0.8988298998926688
0.924633964692693
0.9544790734065157
0.9908034431246875
1.0236430466543138
1.061619773027915
1.1050038249835414
1.1371449802490126
1.1921182610371368
1.2752207659022576
1.344047620255176
1.4198117350668353
1.507943067143741
1.622137968203745
1.6814098429502085
1.7646810054280595
1.8485457435775694
1.919591124757554
1.9843144220593145
2.030158014640226
2.018184122476175
2.0323466012624207
2.0179200409023874
2.0316932950853723
2.013683870089898
2.03010703506514
2.0216151623726977
2.038855467786505
2.0453923522466093
2.03759031642753
2.019424996752278
2.0441806106428606
2.0607521369415136
2.059310067318373
2.0661157975162485
2.053216429539864
2.0715123971225564
2.0580473413362075
2.055814512721712
2.0808278560688964
2.0601637029377113
2.0539429365156003
2.0609648613513754
2.0585135712612646
2.087674625814453
2.062482961966647
2.066476100210777
2.0568444178944967
2.0587903943282266
2.0506399365756396
The data plotted looks like:
I want to find the point where the slope changes in sign (I circled it in black. Should be around index 26):
I need to find this point of change for several hundred files. So far I tried the recommendation from this post:
Finding the point of a slope change as a free parameter- Python
I think that since my data is a bit noisy I am not getting a smooth transition in the change of the slope.
This is the code I have tried so far:
import sys
import numpy as np

# load 1-D data file
file = str(sys.argv[1])
y = np.loadtxt(file)

# create x based on file length
x = np.linspace(1, len(y), num=len(y))

# find first derivative
m = np.diff(y) / np.diff(x)
print(m)

# find second derivative
b = np.diff(m)
print(b)

# find index
index = 0
for difference in b:
    index += 1
    if difference < 0:
        print(index, difference)
Since my data is noisy I am getting some negative values before the index I want. The index I want it to retrieve in this case is around 26 (which is where my data becomes constant). Does anyone have any suggestions on how to solve this issue? Thank you!
A gradient approach is not needed in this case because you don't care about velocities or vector fields. Knowing the gradient adds no extra information for locating the maximum value, since the run is always positive and hence does not affect the sign of the gradient. A method based entirely on the rise is suggested instead.
Detect the indices at which the data are decreasing, take the differences between those indices, and find the largest gap. Then, by index manipulation, you can find the index at which the data reach their maximum.
data = '0.8340502011561366 0.8423491600218922 0.8513456021654467 0.8458192388553084 0.8440111276014195 0.8489589671423143 0.8738088120491972 0.8845129900705279 0.8988298998926688 0.924633964692693 0.9544790734065157 0.9908034431246875 1.0236430466543138 1.061619773027915 1.1050038249835414 1.1371449802490126 1.1921182610371368 1.2752207659022576 1.344047620255176 1.4198117350668353 1.507943067143741 1.622137968203745 1.6814098429502085 1.7646810054280595 1.8485457435775694 1.919591124757554 1.9843144220593145 2.030158014640226 2.018184122476175 2.0323466012624207 2.0179200409023874 2.0316932950853723 2.013683870089898 2.03010703506514 2.0216151623726977 2.038855467786505 2.0453923522466093 2.03759031642753 2.019424996752278 2.0441806106428606 2.0607521369415136 2.059310067318373 2.0661157975162485 2.053216429539864 2.0715123971225564 2.0580473413362075 2.055814512721712 2.0808278560688964 2.0601637029377113 2.0539429365156003 2.0609648613513754 2.0585135712612646 2.087674625814453 2.062482961966647 2.066476100210777 2.0568444178944967 2.0587903943282266 2.0506399365756396'
data = data.split()
import numpy as np
a = np.array(data, dtype=float)
diff = np.diff(a)
neg_indices = np.where(diff < 0)[0]
neg_diff = np.diff(neg_indices)
i_max_dif = np.where(neg_diff == neg_diff.max())[0][0] + 1
i_max = neg_indices[i_max_dif] - 1  # -1 because each diff is taken between two consecutive values
print(i_max, a[i_max])
Output
26 1.9843144220593145
Some details
print(neg_indices)  # all indices where the data decrease
# [ 2 3 27 29 31 33 36 37 40 42 44 45 47 48 50 52 54 56]
print(neg_diff)  # differences between those indices
# [ 1 24 2 2 2 3 1 3 2 2 1 2 1 2 2 2 2]
print(neg_diff.max())  # largest gap
# 24
print(i_max_dif)  # position in neg_indices whose entry (27) follows the maximum
# 2
print(i_max)  # index of the maximum of the original data
# 26
When the first derivative changes sign, that's when the slope sign changes. I don't think you need the second derivative unless you want to determine the rate of change of the slope. You also aren't actually getting the second derivative; you're just getting the difference of the first derivative.
Also, you seem to be assigning arbitrary x values. If your y-values represent points that are equally spaced apart, then that's OK; otherwise the derivative will be wrong.
Here's an example of how to get the first and second derivatives:
import numpy as np
x = np.linspace(1, 100, 1000)
y = np.cos(x)
# Find first derivative:
m = np.diff(y)/np.diff(x)
#Find second derivative
m2 = np.diff(m)/np.diff(x[:-1])
print(m)
print(m2)
# Get x-values where slope sign changes
c = len(m)
changes_index = []
for i in range(1, c):
    prev_val = m[i - 1]
    val = m[i]
    if prev_val < 0 and val > 0:
        changes_index.append(i)
    elif prev_val > 0 and val < 0:
        changes_index.append(i)

for i in changes_index:
    print(x[i])
Notice that I had to curtail the x values for the second derivative. That's because np.diff() returns one less point than the original input.
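As a side note, the sign-change search in the loop above can also be written without the explicit loop; a small sketch (equivalent for strictly nonzero slopes):
# indices i where m[i] has a different sign than m[i-1]
sign_changes = np.where(np.diff(np.sign(m)) != 0)[0] + 1
print(x[sign_changes])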

Iterate the code in a shorter way for the whole dataset

I have a very big df:
df.shape = (106, 3364)
I want to calculate the so-called Fréchet distance using this Fréchet Distance between 2 curves implementation, and it works well. Example:
x = df['1']
x1 = df['1.1']
p = np.array([x, x1])
y = df['2']
y1 = df['2.1']
q = np.array([y, y1])
P_final = list(zip(p[0], p[1]))
Q_final = list(zip(q[0], q[1]))
from frechetdist import frdist
frdist(P_final,Q_final)
But I cannot do it row by row, like:
`1 and 1.1` to `1 and 1.1`, which is equal to 0
`1 and 1.1` to `2 and 2.1`, which is equal to some number
...
`1 and 1.1` to `1682 and 1682.1`, which is equal to some number
I want to create something (my first idea is a for loop, but maybe you have a better solution) to calculate this frdist(P_final, Q_final) between:
the first row and all rows (including itself)
the second row and all rows (including itself)
Finally, I am supposed to get a matrix of size (106, 106) with 0 on the diagonal (because the distance between a row and itself is 0):
matrix =
        0   1   2   3   4   5  ...  105
  0     0
  1         0
  2             0
  3                 0
  4                     0
  5                         0
  ...                           ...
  105                                 0
Not including my trial code because it is confusing everyone!
EDITED:
Sample data:
1 1.1 2 2.1 3 3.1 4 4.1 5 5.1
0 43.1024 6.7498 45.1027 5.7500 45.1072 3.7568 45.1076 8.7563 42.1076 8.7563
1 46.0595 1.6829 45.0595 9.6829 45.0564 4.6820 45.0533 8.6796 42.0501 3.6775
2 25.0695 5.5454 44.9727 8.6660 41.9726 2.6666 84.9566 3.8484 44.9566 1.8484
3 35.0281 7.7525 45.0322 3.7465 14.0369 3.7463 62.0386 7.7549 65.0422 7.7599
4 35.0292 7.5616 45.0292 4.5616 23.0292 3.5616 45.0292 7.5616 25.0293 7.5613
I just used my own sample data in your format (I hope):
import pandas as pd
from frechetdist import frdist
import numpy as np
# create sample data
df = pd.DataFrame([[1,2,3,4,5,6], [3,4,5,6,8,9], [2,3,4,5,2,2], [3,4,5,6,7,3]], columns=['1','1.1','2', '2.1', '3', '3.1'])
# this matrix will hold the result
res = np.ndarray(shape=(df.shape[1] // 2, df.shape[1] // 2), dtype=np.float32)
for row in range(res.shape[0]):
    for col in range(row, res.shape[1]):
        # extract the two curves
        P = [*zip([df.loc[:, f'{row+1}'], df.loc[:, f'{row+1}.1']])]
        Q = [*zip([df.loc[:, f'{col+1}'], df.loc[:, f'{col+1}.1']])]
        # calculate distance
        dist = frdist(P, Q)
        # put result back (it's symmetric)
        res[row, col] = dist
        res[col, row] = dist
# output
print(res)
Output:
[[0. 4. 7.5498343]
[4. 0. 5.5677643]
[7.5498343 5.5677643 0. ]]
Hope that helps
EDIT: Some general tips:
If speed matters: check whether frdist also handles a numpy array of shape (n_values, 2); then you could save the rather expensive zip-and-unpack operation and use the arrays directly, or build the data directly in the format your library needs (see the sketch after these tips).
Generally, use better column names (3 and 3.1 is not very descriptive). Why not call them x3, y3, or x3 and f_x3?
I would actually put the data into two different matrices. If you look at the code, I had to do some not-so-obvious things, like iterating over the shape divided by two and building indices from string operations, because of the given table layout.
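To illustrate the first tip, here is a minimal sketch that builds each curve as an (n_points, 2) array with np.column_stack and passes it straight to frdist. It assumes frdist accepts plain numpy arrays (worth verifying), and because it pairs each x column with its y column point by point, the numbers can differ from the output shown above:
import numpy as np
import pandas as pd
from frechetdist import frdist

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [3, 4, 5, 6, 8, 9],
                   [2, 3, 4, 5, 2, 2],
                   [3, 4, 5, 6, 7, 3]],
                  columns=['1', '1.1', '2', '2.1', '3', '3.1'])

def curve(frame, k):
    # hypothetical helper: stack the column pair 'k' / 'k.1' into an (n, 2) array
    return np.column_stack((frame[f'{k}'], frame[f'{k}.1']))

n = df.shape[1] // 2
res = np.zeros((n, n), dtype=np.float32)   # diagonal stays 0
for row in range(n):
    for col in range(row + 1, n):
        d = frdist(curve(df, row + 1), curve(df, col + 1))
        res[row, col] = res[col, row] = d  # symmetric
print(res)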

reordering cluster numbers for correct correspondence

I have a dataset that I clustered using two different clustering algorithms. The results are about the same, but the cluster numbers are permuted.
Now for displaying the color coded labels, I want the label ids to be same for the same clusters.
How can I get correct permutation between the two label ids?
I can do this using brute force, but perhaps there is a better/faster method. I would greatly appreciate any help or pointers. If possible I am looking for a python function.
The most well-known algorithm for finding the optimum matching is the Hungarian method.
Because it cannot be explained in a few sentences, I have to refer you to a book of your choice or the Wikipedia article "Hungarian algorithm".
You can probably get good results (even a perfect one, if the difference is indeed tiny) by simply picking the maximum of the correspondence matrix and then removing that row and column.
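For reference, a minimal sketch of the optimal matching in Python, assuming the labels in both results are 0..k-1; scipy.optimize.linear_sum_assignment solves exactly this kind of assignment problem:
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

# toy example: the same clustering with permuted label ids
labels_a = np.array([0, 0, 1, 1, 2, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1, 1])

# correspondence matrix: entry [i, j] counts points labeled i in a and j in b
C = contingency_matrix(labels_a, labels_b)

# maximize total agreement = minimize the negated matrix
row_ind, col_ind = linear_sum_assignment(-C)
mapping = dict(zip(col_ind, row_ind))          # label in b -> label in a
relabeled_b = np.array([mapping[l] for l in labels_b])
print(relabeled_b)                             # [0 0 1 1 2 2 2], matches labels_a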
I have a function that works for me. But it may fail when the two cluster results are very inconsistent, which leads to duplicated max values in the contingency matrix. If your cluster results are about the same, it should work.
Here is my code:
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def align_cluster_index(ref_cluster, map_cluster):
    """
    Remap cluster indices according to ref_cluster.
    Both inputs must be numpy arrays and have the same number of unique cluster index values.
    Xin Niu Jan-15-2020
    """
    ref_values = np.unique(ref_cluster)
    map_values = np.unique(map_cluster)
    print(ref_values)
    print(map_values)
    num_values = ref_values.shape[0]
    if ref_values.shape[0] != map_values.shape[0]:
        print('error: both inputs must have the same number of unique cluster index values.')
        return
    switched_col = set()
    while True:
        cont_mat = contingency_matrix(ref_cluster, map_cluster)
        print(cont_mat)
        # divide the contingency matrix by its row and column sums to avoid potential duplicated values:
        col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
        row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
        print(col_sum)
        print(row_sum)
        cont_mat = cont_mat / (col_sum + row_sum)
        print(cont_mat)
        # ignore columns that have been switched:
        cont_mat[:, list(switched_col)] = -1
        print(cont_mat)
        sort_0 = np.argsort(cont_mat, axis=0)
        sort_1 = np.argsort(cont_mat, axis=1)
        print('argsort contmat:')
        print(sort_0)
        print(sort_1)
        if np.array_equal(sort_1[:, -1], np.array(range(num_values))):
            break
        # switch values according to the max value in the contingency matrix:
        # get the position of the max value:
        idx_max = np.unravel_index(np.argmax(cont_mat, axis=None), cont_mat.shape)
        print(cont_mat)
        print(idx_max)
        if (cont_mat[idx_max] > 0) and (idx_max[0] not in switched_col):
            cluster_tmp = map_cluster.copy()
            print('switch', map_values[idx_max[1]], 'and:', ref_values[idx_max[0]])
            map_cluster[cluster_tmp == map_values[idx_max[1]]] = ref_values[idx_max[0]]
            map_cluster[cluster_tmp == map_values[idx_max[0]]] = ref_values[idx_max[1]]
            switched_col.add(idx_max[0])
            print(switched_col)
        else:
            break
    print('final argsort contmat:')
    print(sort_0)
    print(sort_1)
    print('final cont_mat:')
    cont_mat = contingency_matrix(ref_cluster, map_cluster)
    col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
    row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
    cont_mat = cont_mat / (col_sum + row_sum)
    print(cont_mat)
    return map_cluster
And here is some test code:
ref_cluster = np.array([2,2,3,1,0,0,0,1,2,1,2,2,0,3,3,3,3])
map_cluster = np.array([0,0,0,1,1,3,2,3,2,2,0,0,0,2,0,3,3])
c = align_cluster_index(ref_cluster, map_cluster)
print(ref_cluster)
print(c)
>>>[2 2 3 1 0 0 0 1 2 1 2 2 0 3 3 3 3]
>>>[2 2 2 1 1 3 0 3 0 0 2 2 2 0 2 3 3]

python: divide list into equal parts and add samples in each part together

The following is my script. Each equal part has self.number samples; in0 is the input sample. There is an error as follows:
pn[i] = pn[i] + d
IndexError: list index out of range
Is the problem the size of pn? How can I define a list with a certain size but no exact numbers in it yet?
for i in range(0, len(in0)/self.number):
    pn = []
    m = i*self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
    if pn[i] >= self.alpha:
        out[i] = 1
    elif pn[i] <= self.beta:
        out[i] = 0
    else:
        if pn[i] >= self.noise:
            out[i] = 1
        else:
            out[i] = 0
    if pn[i] >= self.noise:
        out[i] = 1
    else:
        out[i] = 0
There are a number of problems in the code as posted; however, the gist seems to be something you'd want to do with numpy arrays instead of iterating over lists.
For example, the set of if/else cases that check whether pn[i] >= some_value and then set a corresponding entry in another list with the result (true/false) can be done as a one-liner with an array operation, much faster than iterating over lists.
import numpy as np
# for example, assuming you have 9 numbers in your list
# and you want them divided into 3 sublists of 3 values each
# in0 is your original list, which for example might be:
in0 = [1.05, -0.45, -0.63, 0.07, -0.71, 0.72, -0.12, -1.56, -1.92]
# convert into array
in2 = np.array(in0)
# reshape to 3 rows, the -1 means that numpy will figure out
# what the second dimension must be.
in2 = in2.reshape((3,-1))
print(in2)
output:
[[ 1.05 -0.45 -0.63]
[ 0.07 -0.71 0.72]
[-0.12 -1.56 -1.92]]
With this 2-d array structure, element-wise summing is super easy. So is element-wise threshold checking. Plus 'vectorizing' these operations has big speed advantages if you are working with large data.
# add corresponding entries, we want to add the columns together,
# as each row should correspond to your sub-lists.
pn = in2.sum(axis=0) # you can sum row-wise or column-wise, or all elements
print(pn)
output: [ 1. -2.72 -1.83]
# it is also trivial to check the threshold conditions
# here I check each entry in pn against a scalar
alpha = 0.0
out1 = ( pn >= alpha )
print(out1)
output: [ True False False]
# you can easily convert booleans to 1/0
x = out1.astype('int') # or simply out1 * 1
print(x)
output: [1 0 0]
# if you have a list of element-wise thresholds
beta = np.array([0.0, 0.5, -2.0])
out2 = (pn >= beta)
print(out2)
output: [True False True]
I hope this helps. Using the correct data structures for your task can make the analysis much easier and faster. There is a wealth of documentation on numpy, which is the standard numeric library for python.
You initialize pn to an empty list just inside the for loop, never assign anything into it, and then attempt to access an index i. There is nothing at index i because there is nothing at any index in pn yet.
for i in range(0, len(in0) / self.number):
    pn = []
    m = i*self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
If you are trying to add the value d to the pn list, you should do this instead:
pn.append(d)
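Putting both points together, here is a minimal standalone sketch of what the loop probably intends (plain variables replace self.*, and the alpha/beta/noise values are made-up placeholders):
number = 3                      # samples per part
alpha, beta, noise = 2.0, -2.0, 0.0
in0 = [1.05, -0.45, -0.63, 0.07, -0.71, 0.72, -0.12, -1.56, -1.92]

pn = []                         # one sum per part
out = []
for i in range(len(in0) // number):      # // for integer division in Python 3
    total = sum(in0[i * number:(i + 1) * number])
    pn.append(total)                     # append instead of indexing an empty list
    if total >= alpha:
        out.append(1)
    elif total <= beta:
        out.append(0)
    else:
        out.append(1 if total >= noise else 0)
print(pn)    # roughly [-0.03, 0.08, -3.6]
print(out)   # [0, 1, 0]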

Find an easier way to compare two 2-d array's independence

My question
1. Intro
ka & kb are two 2-d arrays, both of shape 31*37.
They contain two values: 0 & 1.
Independence: the number of grid cells where only ka[i, j] = 1.
Using np.ma (masked arrays), they look like this:
http://i4.tietuku.com/29adccd90484fe34.png
code here:
ka_select = np.ma.masked_less(ka, 0.001)
pa = plt.pcolor(ka_select, cmap="Set1", alpha=0.7, facecolor="k", edgecolor='k', zorder=1)
kb_select = np.ma.masked_less(kb, 0.001)
pb = plt.pcolor(kb_select, cmap="Set1", alpha=0.7, facecolor="k", edgecolor='k', zorder=1)
2. My early work
Compare the two arrays ka & kb.
If the values at index [i, j] are both equal to 1, it means the two arrays overlap in this grid cell.
Count the overlapping frequency.
I have written some code for comparing two 2-d arrays:
### repeat is the indicator matrix marking overlap (or not) at position [i, j]
repeat = np.zeros(ka.shape[0] * ka.shape[1]).reshape(ka.shape[0], ka.shape[1])
rep = []
for i in range(0, ka.shape[0], 1):
    for j in range(0, ka.shape[1], 1):
        if (ka[i, j] == 1) & (kb[i, j] == 1):
            repeat[i, j] = 1
        else:
            repeat[i, j] = 0
rep.append(repeat.sum())
rep: the overlapping frequency for these two 2-d arrays.
http://i4.tietuku.com/7121ee003ce9d034.png
3. My question
When there are more than two 2-d numpy arrays, all of the same shape and with values (0, 1), how do I sum the overlapping frequency?
I can compare multiple arrays in sequence, but the overlapping grid cells would be counted more than once.
More explanation
I want to count the cells of array ka where ka = 1 but (kb & kc & ...) != 1 at grid [i, j] (which I call independence, as in the title).
If ka is only compared with kb, I can use rep to achieve that, but I haven't worked out a method for dealing with more than 2 arrays.
Why not use the sum of the arrays kb, ... and test the resulting elements?
An example with three grids:
import numpy
# some random 0/1 arrays (randint with high=2 draws values from {0, 1};
# this replaces the deprecated random_integers)
ka = numpy.random.randint(0, 2, 37*31).reshape(31, 37)
kb = numpy.random.randint(0, 2, 37*31).reshape(31, 37)
kc = numpy.random.randint(0, 2, 37*31).reshape(31, 37)
combined_rest = kb + kc
print("independence:", numpy.sum((ka == 1) & (combined_rest < 2)))
