Iterating over numpy arrays to compare columns of different arrays - python

I am new to programming and had a question. If I had two numpy arrays:
A = np.array([[1,0,3], [2,6,5], [3,4,1],[4,3,2],[5,7,9]], dtype=np.int64)
B = np.array([[3,4,5],[6,7,9],[1,0,3],[4,5,6]], dtype=np.int64)
I want to compare the last two columns of array A to the last two columns of array B, and then if they are equal, output the entire row to a new array. So, the output of these two arrays would be:
[[1,0,3],
 [1,0,3],
 [5,7,9],
 [6,7,9]]
Because even though the first element does not match for the last two rows, the last two elements do.
Here is my code so far, but it is not even close to working. Can anyone give me some tips?
column_two_A = A[:,1]
column_two_B = B[:,1]
column_three_A = A[:,2]
column_three_B = B[:,2]
column_four_A = A[:,3]
column_four_B = B[:,3]
times = A[:,0]

for elementA in column_three_A:
    for elementB in column_three_B:
        if elementA == elementB:
            continue
        for elementC in column_two_A:
            for elementD in column_two_B:
                if elementC == elementD:
                    continue
                for elementE in column_four_A:
                    for elementF in column_four_B:
                        if elementE == elementF:
                            continue
                        element.append(time)
                        print(element)

NumPy provides many functions for this kind of task. Here is a solution that checks whether the last two columns of each row of A match the last two columns of any row of B. Add print() statements and check what chk, chk2 and x are.
import numpy as np

A = np.array([[1,0,3], [2,6,5], [3,4,1], [4,3,2], [5,7,9]], dtype=np.int64)
B = np.array([[3,4,5], [6,7,9], [1,0,3], [4,5,6]], dtype=np.int64)

c = []
for k in A:
    chk = np.equal(k[-2:], B[:, -2:])   # compare the last two values of this row of A with those of every row of B
    chk2 = np.all(chk, axis=1)          # True for rows of B where both values match
    x = B[chk2, :]                      # the matching rows of B
    if x.size:
        c.append(x)
print(c)
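If you also want the matching rows of A, as in the expected output above, the whole comparison can be broadcast over every pair of rows at once. This is a minimal sketch of that idea (not part of the original answer), using the same A and B:

import numpy as np

A = np.array([[1,0,3], [2,6,5], [3,4,1], [4,3,2], [5,7,9]], dtype=np.int64)
B = np.array([[3,4,5], [6,7,9], [1,0,3], [4,5,6]], dtype=np.int64)

# matches[i, j] is True when the last two columns of A[i] equal those of B[j].
matches = (A[:, None, -2:] == B[None, :, -2:]).all(axis=-1)

# Row indices of the matching pairs in A and B.
i_idx, j_idx = np.nonzero(matches)

# Interleave the matching rows of A and B, as in the expected output.
out = np.hstack([A[i_idx], B[j_idx]]).reshape(-1, A.shape[1])
print(out)   # rows: [1 0 3], [1 0 3], [5 7 9], [6 7 9]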

I think I figured it out by staying up all night... thank you!
for i in range(len(A)):
    for j in range(len(B)):
        if A[i][1] == B[j][1]:
            if A[i][2] == B[j][2]:
                print(B[j])
                print(A[i])

Related

Fastest way to find locations from other numpy array (same shape) and calculate horizontal sum

Hi, let's say there are two NumPy 2D arrays A and B with the same shape.
I am trying to get the elements of A where the corresponding element of B equals x, and take the horizontal (row-wise) sum of those elements of A.
I tried the two following ways, but they were both slow (it is a big array).
Can somebody advise me on a faster way? Thank you.
First
farr = np.stack([np.where(rarr == d, varr, 0).sum(axis=1) for d in digits])
Second
from numba import njit

@njit
def final_array(varr, rarr, digits):
    farr = np.zeros((len(varr), len(digits)))
    for i, d in enumerate(digits):
        farr[:, i] = np.where(rarr == d, varr, 0).sum(axis=1)
    return farr

farr = final_array(varr, rarr, digits)
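One possible speed-up to try (a sketch, not an accepted answer; the input shapes below are just assumed stand-ins): build a one-hot mask of rarr against digits once, then do every row-wise sum in a single einsum. This trades memory (an extra n x m x len(digits) array) for avoiding the Python-level loop over digits.

import numpy as np

# Hypothetical small inputs for illustration.
varr = np.random.rand(1000, 50)
rarr = np.random.randint(0, 10, size=(1000, 50))
digits = np.arange(10)

# onehot[n, m, d] is 1.0 where rarr[n, m] == digits[d], else 0.0.
onehot = (rarr[..., None] == digits).astype(varr.dtype)

# Row-wise sum of varr restricted to each digit, shape (len(varr), len(digits)).
farr = np.einsum('nm,nmd->nd', varr, onehot)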

Compare rows with conditions and generate a new dataframe in Pandas

I have a very big dataframe with this structure:
Timestamp Val1
Here you can see a real sample:
Timestamp Temp
0 1622471518.92911 36.443
1 1622471525.034114 36.445
2 1622471531.148139 37.447
3 1622471537.284337 36.449
4 1622471543.622588 43.345
5 1622471549.734765 36.451
6 1622471556.2518 36.454
7 1622471562.361368 41.461
8 1622471568.472718 42.468
9 1622471574.826475 36.470
What I want to do is compare the Temp column with itself: if one value is higher than another by "X" (for example 4) and the time between them is lower than "Y" (for example 180 min), then I save some data about that pair.
Now I'm using two for loops, one inside the other, but this takes too much time, and usually Pandas has an option to avoid this.
This is my code:
import datetime as dt

cap_time, maxim = 180, 4
cap_time = cap_time * 60
temps = df['Temperature'].values
times = df['Timestamp'].values
results = []

for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        print(i, j, len(temps))
        if float(temps[j]) > float(temps[i]) * maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()
            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break

# Then I save it in a dataframe and perform other actions
Can Pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff() but I'm not sure it's what I want (or I don't know how to use it).
Thank you very much.
Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:
import numpy as np

temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)

results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j],
                            times[i], times[j],
                            times_diff[i, j]])
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
[36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]
To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to filter out the results you want:
import numpy as np

n = len(temps)
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)

results_full = np.stack([np.repeat(temps[:, None], n, axis=1),
                         np.tile(temps, (n, 1)),
                         np.repeat(times[:, None], n, axis=1),
                         np.tile(times, (n, 1)),
                         times_diff])

results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T
array([[ 3.64430000e+01,  4.33450000e+01,  1.62247152e+09,
         1.62247154e+09,  2.46934779e+01],
       ...
       [ 3.64540000e+01,  4.24680000e+01,  1.62247156e+09,
         1.62247157e+09,  1.22209179e+01],
       ...
      ])
As you can see, the resulting numbers are the same as above, although this time the results array will contain more rows, because we didn't use the shortcut of starting the inner loop at i+1.
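If you do want to mirror the j > i restriction of the loops without building the full 3-dimensional array, one option (a sketch, not part of the answer above, reusing the condition, temps, times and times_diff arrays defined earlier) is to keep only the strict upper triangle of the condition and gather the index pairs directly:

import numpy as np

# Keep only pairs with j > i, as in the original nested loops.
i_idx, j_idx = np.nonzero(np.triu(condition, k=1))

# One row per matching pair: temps[i], temps[j], times[i], times[j], time difference.
results = np.column_stack([temps[i_idx], temps[j_idx],
                           times[i_idx], times[j_idx],
                           times_diff[i_idx, j_idx]])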

Numpy where conditional statement along axis 0

I have a 1D vector Zc containing n elements that are 2D arrays. I want to find the index of each 2D array that equals np.ones(Zc[i].shape).
a = np.zeros((5,5))
b = np.ones((5,5))*4
c = np.ones((5,5))
d = np.ones((5,5))*2
Zc = np.stack((a,b,c,d))

for i in range(len(Zc)):
    a = np.ones(Zc[i].shape)
    b = Zc[i]
    if np.array_equal(a,b):
        print(i)
    else:
        pass
Which returns 2. The code above works and returns the correct answer, but I want to know if there a vectorized way to achieve the same result?
Going off of hpaulj's comment:
>>> allones = (Zc == np.array(np.ones(Zc[i].shape))).all(axis=(1,2))
>>> np.where(allones)[0][0]
2
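A slightly tidier equivalent (a sketch, not part of the answer above): since every element has to equal 1, the explicit ones array can be dropped in favour of a scalar comparison.

import numpy as np

# Indices of the 2D slices of Zc whose elements are all 1.
idx = np.where((Zc == 1).all(axis=(1, 2)))[0]
print(idx)   # -> [2]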

How to apply my own function along each row and column with NumPy

I'm using NumPy to store data into matrices.
I'm struggling to make the below Python code perform better.
RESULT is the data store I want to put the data into.
TMP = np.array([[1,1,0],[0,0,1],[1,0,0],[0,1,1]])
n_row, n_col = TMP.shape[0], TMP.shape[0]
RESULT = np.zeros((n_row, n_col))

def do_something(array1, array2):
    intersect_num = np.bitwise_and(array1, array2).sum()
    union_num = np.bitwise_or(array1, array2).sum()
    try:
        return intersect_num / float(union_num)
    except ZeroDivisionError:
        return 0

for i in range(n_row):
    for j in range(n_col):
        if i >= j:
            continue
        RESULT[i, j] = do_something(TMP[i], TMP[j])
I guess it would be much faster if I could use some NumPy built-in function instead of for-loops.
I was looking for the various questions around here, but I couldn't find the best fit for my problem.
Any suggestion? Thanks in advance!
Approach #1
You could do something like this as a vectorized solution -
# Store number of rows in TMP as a parameter
N = TMP.shape[0]

# Get the indices that would be used as row indices to select rows off TMP and
# also as row,column indices for setting the output array. These basically
# correspond to the iterators involved in the loopy implementation.
R, C = np.triu_indices(N, 1)

# Calculate intersect_num, union_num and division results across all iterations
I = np.bitwise_and(TMP[R], TMP[C]).sum(-1)
U = np.bitwise_or(TMP[R], TMP[C]).sum(-1)
vals = np.true_divide(I, U)

# Setup output array and assign vals into it
out = np.zeros((N, N))
out[R, C] = vals
Approach #2
For cases with TMP holding 1s and 0s, those np.bitwise_and and np.bitwise_or would be replaceable with dot-products and as such could be faster alternatives. So, with those we would have an implementation like so -
M = TMP.shape[1]
I = TMP.dot(TMP.T)
TMP_inv = 1-TMP
U = M - TMP_inv.dot(TMP_inv.T)
out = np.triu(np.true_divide(I,U),1)
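A quick sanity check (a sketch, not part of the original answers): with the TMP from the question, both vectorized approaches fill only the strict upper triangle and should reproduce the loopy RESULT.

# Both approaches leave the diagonal and lower triangle at zero,
# just like the loop that skips i >= j, so the outputs should coincide.
assert np.allclose(out, RESULT)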

Getting the row and column numbers that meets multiple conditions in Pandas

I am trying to get the row and column number, which meets three conditions in Pandas DataFrame.
I have a DataFrame of 0, 1, and -1 values (larger than 1850 x 1850); when I try to get the rows and columns, it takes forever to get the output.
The following is an example I have been trying to use:
import pandas as pd
import numpy as np

a = pd.DataFrame(np.random.randint(2, size=(1845,1850)))
b = pd.DataFrame(np.random.randint(2, size=(5,1850)))
b[b == 1] = -1
c = pd.concat([a,b], ignore_index=True)

column_positive = []
row_positive = []
column_negative = []
row_negative = []
column_zero = []
row_zero = []

for column in range(0, c.shape[0]):
    for row in range(0, c.shape[1]):
        if c.iloc[column, row] == 1:
            column_positive.append(column)
            row_positive.append(row)
        elif c.iloc[column, row] == -1:
            column_negative.append(column)
            row_negative.append(row)
        else:
            column_zero.append(column)
            row_zero.append(row)
I did some web searching and found that np.where() does something like this, but I have no idea how to do it.
Could anyone tell a better alternative?
You are right, np.where would be one way to do it. Here's an implementation with it -
# Extract the values from c into an array for ease in further processing
c_arr = c.values
# Use np.where to get row and column indices corresponding to three comparisons
column_zero, row_zero = np.where(c_arr==0)
column_negative, row_negative = np.where(c_arr==-1)
column_positive, row_positive = np.where(c_arr==1)
If you don't mind having rows and columns as a Nx2 shaped array, you could do it in a bit more concise manner, like so -
neg_idx, zero_idx, pos_idx = [np.argwhere(c_arr == item) for item in [-1,0,1]]
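For illustration (a sketch, following the question's column-first naming rather than any library convention), each of those index arrays has shape (N, 2) and can be unpacked back into separate index lists:

# The first column of each index array is the first axis ("column" in the
# question's naming), the second column is the second axis ("row").
column_negative, row_negative = neg_idx[:, 0], neg_idx[:, 1]
column_zero, row_zero = zero_idx[:, 0], zero_idx[:, 1]
column_positive, row_positive = pos_idx[:, 0], pos_idx[:, 1]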
