Optimizing indexing and retrieval of elements in numpy arrays in Python?

I'm trying to optimize the following code, potentially by rewriting it in Cython: it simply takes a low-dimensional but relatively long numpy array, looks in one of its columns for 0 values, and marks those rows as -1 in a result array. The code is:
import numpy as np

def get_data():
    data = np.array([[1,5,1]] * 5000 + [[1,0,5]] * 5000 + [[0,0,0]] * 5000)
    return data

def get_cols(K):
    cols = np.array([2] * K)
    return cols

def test_nonzero(data):
    K = len(data)
    result = np.array([1] * K)
    # Index into columns of data
    cols = get_cols(K)
    # Mark zero points with -1
    idx = np.nonzero(data[np.arange(K), cols] == 0)[0]
    result[idx] = -1

import time
t_start = time.time()
data = get_data()
for n in range(5000):
    test_nonzero(data)
t_end = time.time()
print(t_end - t_start)
data is the data, and cols is the array of column indices of data to check for zero values (for simplicity, I made them all the same column). The goal is to compute a numpy array, result, which has a 1 for each row whose column of interest is non-zero, and a -1 for each row whose column of interest is zero.
Running this function 5000 times on a not-so-large array of 15,000 rows by 3 columns takes about 20 seconds. Is there a way this can be sped up? It appears that most of the work goes into finding the nonzero elements and retrieving them with indices (the call to nonzero and the subsequent use of its indices). Can this be optimized, or is this the best that can be done?
How could a Cython implementation gain speed on this?

cols = np.array([2] * K)
That's going to be really slow: it creates a very large Python list and then converts it into a numpy array. Instead, do something like:
cols = np.ones(K, int) * 2
That'll be way faster.
result = np.array([1] * K)
Here you should do:
result = np.ones(K, int)
That will produce the numpy array directly.
idx = np.nonzero(data[np.arange(K), cols] == 0)[0]
result[idx] = -1
Here cols is an array, but since every entry is the same column, you can just pass 2. Furthermore, using nonzero adds an extra step; a boolean mask can be used directly:
idx = data[np.arange(K), 2] == 0
result[idx] = -1
Should have the same effect.
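Putting these suggestions together, a minimal sketch of the revised function might look like this (the name test_nonzero_fast is mine; it assumes the fixed column index 2 used in the question):
import numpy as np

def test_nonzero_fast(data):
    K = len(data)
    result = np.ones(K, int)           # build the output array directly
    mask = data[np.arange(K), 2] == 0  # boolean mask of rows whose column 2 is zero
    result[mask] = -1
    return result

Since every row uses the same column here, data[:, 2] == 0 would avoid building the np.arange(K) index array entirely.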

Related

Moving average in python array

I have an array 'aN' with shape (1000, 151). I need to calculate the average of every 10 rows, so I implemented this:
arr = aN[:]
window_size = 10
i = 0
moving_averages = []
while i < len(arr) - window_size + 1:
    window_average = round(np.sum(arr[i:i+window_size]) / window_size, 2)
    moving_averages.append(window_average)
    i += 10
The point is that my output is a list of 100 values, but I need an array with the same number of columns as the original array (151).
Any idea on how to get this outcome??
TIA!!
If you convert it to a pandas dataframe, you can use the rolling() function of pandas together with the mean() function. It should be able to accomplish what you need.
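A minimal sketch of that approach, assuming aN has shape (1000, 151) as in the question; taking every 10th row of the rolling mean is one way to keep only the non-overlapping 10-row blocks:
import numpy as np
import pandas as pd

aN = np.random.rand(1000, 151)  # stand-in for the original data
block_means = pd.DataFrame(aN).rolling(window=10).mean().iloc[9::10]
print(block_means.shape)        # (100, 151): one row per 10-row block, one column per original column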

Compare rows with conditions and generate a new dataframe in Pandas

I have a very big dataframe with this structure:
Timestamp Val1
Here you can see a real sample:
Timestamp Temp
0 1622471518.92911 36.443
1 1622471525.034114 36.445
2 1622471531.148139 37.447
3 1622471537.284337 36.449
4 1622471543.622588 43.345
5 1622471549.734765 36.451
6 1622471556.2518 36.454
7 1622471562.361368 41.461
8 1622471568.472718 42.468
9 1622471574.826475 36.470
What I want to do is compare the Temp column with itself, and if a value is higher than "X" (for example 4) and the time between the two rows is lower than "Y" (for example 180 min), then I save some data about that pair.
Right now I'm using two for loops, one inside the other, but this takes too much time, and pandas usually has an option to avoid that.
This is my code:
import datetime as dt

cap_time, maxim = 180, 4
cap_time = cap_time * 60
temps = df['Temperature'].values
times = df['Timestamp'].values
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        print(i, j, len(temps))
        if float(temps[j]) > float(temps[i]) * maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()
            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break
# Then I save the results in a dataframe and do some other actions
Can Pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff() but I'm not sure it's what I want (or I don't know how to use it).
Thank you very much.
Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:
import numpy as np

temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)

results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j],
                            times[i], times[j],
                            times_diff[i, j]])
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
[36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]
To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to filter out the results you want:
import numpy as np

n = len(temps)
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)

results_full = np.stack([np.repeat(temps[:, None], n, axis=1),
                         np.tile(temps, (n, 1)),
                         np.repeat(times[:, None], n, axis=1),
                         np.tile(times, (n, 1)),
                         times_diff])
results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T
array([[ 3.64430000e+01, 4.33450000e+01, 1.62247152e+09,
1.62247154e+09, 2.46934779e+01],
...
[ 3.64540000e+01, 4.24680000e+01, 1.62247156e+09,
1.62247157e+09, 1.22209179e+01],
...
])
As you can see, the resulting numbers are the same as above, although this time the results array will contain more rows, because we didn't use the shortcut of starting the inner loop at i+1.
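If you do want to reproduce that shortcut in the fully vectorized version, one option (a sketch, not part of the original answer) is to combine condition with an upper-triangular mask before building results_full, so that only pairs with j > i survive:
# keep only pairs where j > i, mirroring `for j in range(i+1, ...)`
upper = np.triu(np.ones_like(condition, dtype=bool), k=1)
condition = condition & upper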

How to apply my own function along each row and column with NumPy

I'm using NumPy to store data into matrices.
I'm struggling to make the below Python code perform better.
RESULT is the data store I want to put the data into.
import numpy as np

TMP = np.array([[1,1,0],[0,0,1],[1,0,0],[0,1,1]])
n_row, n_col = TMP.shape[0], TMP.shape[0]
RESULT = np.zeros((n_row, n_col))

def do_something(array1, array2):
    intersect_num = np.bitwise_and(array1, array2).sum()
    union_num = np.bitwise_or(array1, array2).sum()
    try:
        return intersect_num / float(union_num)
    except ZeroDivisionError:
        return 0

for i in range(n_row):
    for j in range(n_col):
        if i >= j:
            continue
        RESULT[i, j] = do_something(TMP[i], TMP[j])
I guess it would be much faster if I could use some NumPy built-in function instead of for-loops.
I was looking through the various questions around here, but I couldn't find the best fit for my problem.
Any suggestion? Thanks in advance!
Approach #1
You could do something like this as a vectorized solution -
# Store the number of rows in TMP as a parameter
N = TMP.shape[0]

# Get the indices that would be used as row indices to select rows off TMP and
# also as row, column indices for setting the output array. These basically correspond
# to the iterators involved in the loopy implementation
R,C = np.triu_indices(N,1)
# Calculate intersect_num, union_num and division results across all iterations
I = np.bitwise_and(TMP[R],TMP[C]).sum(-1)
U = np.bitwise_or(TMP[R],TMP[C]).sum(-1)
vals = np.true_divide(I,U)
# Setup output array and assign vals into it
out = np.zeros((N, N))
out[R,C] = vals
Approach #2
For cases with TMP holding 1s and 0s, those np.bitwise_and and np.bitwise_or would be replaceable with dot-products and as such could be faster alternatives. So, with those we would have an implementation like so -
M = TMP.shape[1]
I = TMP.dot(TMP.T)
TMP_inv = 1-TMP
U = M - TMP_inv.dot(TMP_inv.T)
out = np.triu(np.true_divide(I,U),1)
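As a quick sanity check (my addition, not part of the original answer), either approach should reproduce the RESULT array built by the loops in the question:
print(np.allclose(out, RESULT))  # expected: True for the sample TMP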

Transforming a 3 Column Matrix into an N x N Matrix in Numpy

I have a 2D numpy array with 3 columns. Columns 1 and 2 are a list of connections between IDs. Column 3 is the strength of that connection. I would like to transform this 3-column matrix into a weighted adjacency matrix (an N x N matrix where each cell represents the strength of the connection between two IDs).
I have already done this in my code below. matrix is the 3 column 2D array and t1 is the weighted adjacency matrix. My problem is this code is very slow because I am using nested for loops. I am familiar with the pandas function melt which does this, but I am not able to use pandas. Is there a faster implementation not using pandas?
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
flds = list(np.unique(matrix[:,0]))
flds.extend(list(np.unique(matrix[:,1])))
flds = np.asarray(flds)
flds = np.unique(flds)
#make lookup dict
lookup = dict(zip(np.arange(0,len(flds)), flds))
lookup_rev = dict(zip(flds, np.arange(0,len(flds))))
#make empty n by n matrix with unique lists
t1 = np.zeros([len(flds) , len(flds)])
#map values into the n by n matrix and make the rest 0
'''this takes a long time to run'''
#iterate through rows
for i in np.arange(0, len(lookup)):
    #iterate through columns
    for k in np.arange(0, len(lookup)):
        val = matrix[(matrix[:,0] == lookup[i]) & (matrix[:,1] == lookup[k])][:,2]
        if val:
            t1[i,k] = sum(val)
Assuming that I understood the question correctly and that val is a scalar, you could use a vectorized approach that involves initializing with zeros and then indexing, like so -
out = np.zeros((len(flds),len(flds)))
out[matrix[:,0].astype(int),matrix[:,1].astype(int)] = matrix[:,2]
Please note that by my observation it looks like you can avoid using lookup.
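If several rows of matrix can map to the same (row, column) pair and you want their weights summed, as the sum(val) in the question suggests, np.add.at accumulates instead of overwriting (a sketch, still assuming the IDs can be used directly as integer indices):
out = np.zeros((len(flds), len(flds)))
np.add.at(out, (matrix[:, 0].astype(int), matrix[:, 1].astype(int)), matrix[:, 2])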
You need to iterate your matrix only once:
import numpy as np
size = 2000
a = np.arange(size)
np.random.shuffle(a)
b = np.arange(size)
np.random.shuffle(b)
c = np.random.rand(size,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
fields = np.unique(matrix[:,:2])
n = len(fields)
#make reverse lookup dict
lookup = dict(zip(fields, range(n)))
#make empty n by n matrix
t1 = np.zeros([n, n])
for src, dest, val in matrix:
    i = lookup[src]
    j = lookup[dest]
    t1[i, j] += val
The main acceleration you can get is by not iterating through each element of the NxN matrix but instead iterating through your connection list, which is much smaller.
I tried to simplify your code a bit. It uses the list.index method, which can be slow, but it should still be faster than what you had.
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
lookup = np.unique(matrix[:,:2]).tolist() # You can call unique only once
t1 = np.zeros((len(lookup),len(lookup)))
for i, j, val in matrix:
    t1[lookup.index(i), lookup.index(j)] = val  # Fill the matrix
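If the repeated list.index calls turn out to be the bottleneck, np.searchsorted against the sorted unique array is a vectorized alternative (a sketch, not part of the original answer; it relies on every ID in the first two columns being present in the unique array, which np.unique guarantees here):
fields = np.unique(matrix[:, :2])             # sorted unique IDs
rows = np.searchsorted(fields, matrix[:, 0])  # vectorized replacement for list.index
cols = np.searchsorted(fields, matrix[:, 1])
t1 = np.zeros((len(fields), len(fields)))
t1[rows, cols] = matrix[:, 2]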

Numpy efficient construction of sparse coo_matrix or faster list extension

I have a list of 100k items and each item has a list of indices. I am trying to put this into a boolean sparse matrix for vector multiplication. My code isn't running as fast as I would like, so I am looking for performance tips or maybe alternative approaches for getting this data into a matrix.
import numpy as np
from scipy.sparse import coo_matrix

rows = []
cols = []
for i, item in enumerate(items):
    indices = item.getIndices()
    rows += [i] * len(indices)
    cols += indices
data = np.ones(len(rows), dtype='?')
mat = coo_matrix((data, (rows, cols)), shape=(len(items), totalIndices), dtype='?')
mat = mat.tocsr()
There wind up being 800k items in the rows/cols lists and just the extending of those lists seems to be taking up 16% and 13% of the building time. Converting to the coo_matrix then takes up 12%. Enumeration is taking up 13%. I got these stats from line_profiler and I am using python 3.3.
The best I can do is:
def foo3(items, totalIndices):
    N = len(items)
    cols = []
    cnts = []
    for item in items:
        indices = getIndices(item)
        cols += indices
        cnts.append(len(indices))
    rows = np.arange(N).repeat(cnts)  # main change
    data = np.ones(rows.shape, dtype=bool)
    mat = sparse.coo_matrix((data, (rows, cols)), shape=(N, totalIndices))
    mat = mat.tocsr()
    return mat
For 100000 items it's only a 50% increase in speed.
A lot of sparse matrix algorithms run twice through the data, once to figure out the size of the sparse matrix, the other to fill it in with the right values. So perhaps it is worth trying something like this:
total_len = 0
for item in items:
    total_len += len(item.getIndices())

rows = np.empty((total_len,), dtype=np.int32)
cols = np.empty((total_len,), dtype=np.int32)

total_len = 0
for i, item in enumerate(items):
    indices = item.getIndices()
    len_ = len(indices)
    rows[total_len:total_len + len_] = i
    cols[total_len:total_len + len_] = indices
    total_len += len_
Followed by the same you are currently doing. You can also build the CSR matrix directly, avoiding the COO one, which will save some time as well. After the first run to find out the total size you would do:
from scipy.sparse import csr_matrix

indptr = np.empty((len(items) + 1,), dtype=np.int32)
indptr[0] = 0
indices = np.empty((total_len,), dtype=np.int32)
for i, item in enumerate(items):
    item_indices = item.getIndices()
    len_ = len(item_indices)
    indptr[i+1] = indptr[i] + len_
    indices[indptr[i]:indptr[i+1]] = item_indices
data = np.ones((total_len,), dtype=bool)
mat = csr_matrix((data, indices, indptr))
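For a quick end-to-end test, here is a toy example (the Item class, its data, and totalIndices are hypothetical stand-ins for the questioner's objects):
import numpy as np
from scipy.sparse import csr_matrix

class Item:
    # hypothetical stand-in for the questioner's item type
    def __init__(self, idx):
        self._idx = idx
    def getIndices(self):
        return self._idx

items = [Item([0, 2]), Item([1]), Item([0, 3])]
totalIndices = 4
total_len = sum(len(it.getIndices()) for it in items)

indptr = np.empty((len(items) + 1,), dtype=np.int32)
indptr[0] = 0
indices = np.empty((total_len,), dtype=np.int32)
for i, item in enumerate(items):
    item_indices = item.getIndices()
    indptr[i + 1] = indptr[i] + len(item_indices)
    indices[indptr[i]:indptr[i + 1]] = item_indices
data = np.ones((total_len,), dtype=bool)
mat = csr_matrix((data, indices, indptr), shape=(len(items), totalIndices))
print(mat.toarray())  # 3 x 4 boolean matrix with True at each item's indices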
