fastest way to iterate a numpy array and update each element - python

This might seem weird, but I have this particular goal to achieve; the code goes as follows.
# A is a numpy array, dtype=int32;
# each element is actually an ID (int). The ID range can be wide,
# but the values that actually occur are far fewer than the dense range.
A = array([[379621, 552965, 192509],
           [509849, 252786, 710979],
           [379621, 718598, 591201],
           [509849,  35700, 951719]])
# and I need to map these sparse IDs to dense ones,
# my idea is to have a dict, mapping actual_sparse_ID -> dense_ID
M = {}
# so I iterate this numpy array, and check whether each sparse ID already has a dense one
for i in np.nditer(A, op_flags=['readwrite']):
    if i not in M:
        M[i] = len(M)  # sparse ID got a dense one
    i[...] = M[i]      # replace the sparse ID with the dense one
My goal could be achieved with np.unique(A, return_inverse=True), and the return_inverse result is what I want.
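For example, continuing with the small A above, that looks roughly like:
uniq, dense = np.unique(A, return_inverse=True)
dense = dense.reshape(A.shape)  # the inverse may come back flat, depending on the numpy version
print(uniq)    # the sorted sparse IDs that actually occur
print(dense)   # A with every sparse ID replaced by its index into uniq
# [[3 5 1]
#  [4 2 7]
#  [3 8 6]
#  [4 0 9]]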
However, the numpy array I have is too huge to fully load into memory, so I cannot run np.unique over the whole data, and this is why I came up with this dict-mapping idea...
Is this the right way to go? Any possible improvement?

I will make an attempt to provide an alternative way of doing this by using numpy.unique() on sub-arrays. This solution is not fully tested. I also did not do any side-by-side performance evaluation since your solution is not fully working for me.
Let's say we have an array c that we split into two smaller arrays. Let's create some test data, for example:
>>> a = np.array([[1,1,2,3,4],[1,2,6,6,2],[8,0,1,1,4]])
>>> b = np.array([[11,2,-1,12,6],[12,2,6,11,2],[7,0,3,1,3]])
>>> c = np.vstack([a, b])
>>> print(c)
[[ 1  1  2  3  4]
 [ 1  2  6  6  2]
 [ 8  0  1  1  4]
 [11  2 -1 12  6]
 [12  2  6 11  2]
 [ 7  0  3  1  3]]
Here we assume that c is the large array and a and b are sub-arrays. Of course, one could build c first and then extract sub-arrays.
Next step is to run numpy.unique() on the two sub-arrays:
>>> ua, ia = np.unique(a, return_inverse=True)
>>> ub, ib = np.unique(b, return_inverse=True)
>>> uc, ic = np.unique(c, return_inverse=True) # this is for future reference
Now, here is an algorithm for combining the results from subarrays:
def merge_unique(ua, ia, ub, ib):
    # make copies *if* changing inputs is undesirable:
    ua = ua.copy()
    ia = ia.copy()
    ub = ub.copy()
    ib = ib.copy()

    # find differences between unique values in the two arrays:
    diffab = np.setdiff1d(ua, ub, assume_unique=True)
    diffba = np.setdiff1d(ub, ua, assume_unique=True)

    # find indices in ua, ub where to insert "other" unique values:
    ssa = np.searchsorted(ua, diffba)
    ssb = np.searchsorted(ub, diffab)

    # throw away values that are too large:
    ssa = ssa[np.where(ssa < len(ua))]
    ssb = ssb[np.where(ssb < len(ub))]

    # increment indices past previously computed "insert" positions:
    for v in ssa[::-1]:
        ia[ia >= v] += 1
    for v in ssb[::-1]:
        ib[ib >= v] += 1

    # combine results:
    uc = np.union1d(ua, ub)  # or use ssa, ssb, diffba, diffab to update ua, ub
    ic = np.concatenate([ia, ib])
    return uc, ic
Now, let's run this function on the results of numpy.unique() from sub-arrays and then compare merged indices and unique values with the reference results uc and ic:
>>> uc2, ic2 = merge_unique(ua, ia, ub, ib)
>>> np.all(uc2 == uc)
True
>>> np.all(ic2 == ic)
True
Splitting into more than two sub-arrays can be handled with little additional work - simply keep accumulating "unique" values and indices, like this:
uacc, iacc = np.unique(subarr1, return_inverse=True)
ui, ii = np.unique(subarr2, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr3, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr4, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
................................ (etc.)
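If the sub-arrays come from a loop (for example, read from disk piece by piece), the same accumulation can be written compactly; `chunks` below is just a placeholder for whatever iterable yields the sub-arrays:
def unique_over_chunks(chunks):
    # fold merge_unique() over an iterable of sub-arrays
    it = iter(chunks)
    uacc, iacc = np.unique(next(it), return_inverse=True)
    for sub in it:
        ui, ii = np.unique(sub, return_inverse=True)
        uacc, iacc = merge_unique(uacc, iacc, ui, ii)
    return uacc, iacc

# e.g. unique_over_chunks([a, b]) reproduces uc2, ic2 from above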

Related

How to select list elements based on criteria from other lists

I am new to Python, coming from SciLab (an open source MatLab ersatz), which I am using as a toolbox for my analyses (test data analysis, reliability, acoustics, ...); I am definitely not a computer science lad.
I have data in the form of lists of same length (vectors of same size in SciLab).
I use some of them as parameter in order to select data from another one; e.g.
t_v = [1:10]; // a parameter vector
p_v = [20:29]; // another parameter vector
res_v(t_v > 5 & p_v < 28); // the res_v elements whose "corresponding" p_v and t_v values comply with my criteria; I can use them for analyses.
This is very direct and simple in SciLab; I have not found a way to achieve the same with Python, either "Pythonically" or simply translated.
Any idea that could help me, please?
Have a nice day,
Patrick.
You could use numpy arrays. It's easy:
import numpy as np
par1 = np.array([1,1,5,5,5,1,1])
par2 = np.array([-1,1,1,-1,1,1,1])
data = np.array([1,2,3,4,5,6,7])
print(par1)
print(par2)
print(data)
bool_filter = (par1[:]>1) & (par2[:]<0)
# example to do it directly in the array
filtered_data = data[ par1[:]>1 ]
print( filtered_data )
#filtering with the two parameters
filtered_data_twice = data[ bool_filter==True ]
print( filtered_data_twice )
output:
[1 1 5 5 5 1 1]
[-1 1 1 -1 1 1 1]
[1 2 3 4 5 6 7]
[3 4 5]
[4]
Note that it does not keep the same number of elements.
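If you would rather keep the original length (in case that is what you need), np.where can substitute a fill value instead of dropping elements:
# keep the original length, writing 0 wherever the filter is False
filled_same_length = np.where(bool_filter, data, 0)
print(filled_same_length)
output: [0 0 0 4 0 0 0]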
Here's my modified solution according to your last comment.
t_v = list(range(1,10))
p_v = list(range(20,29))
res_v = list(range(30,39))
def first_idex_greater_than(search_number, lst):
    for count, number in enumerate(lst):
        if number > search_number:
            return count

def first_idex_lower_than(search_number, lst):
    for count, number in enumerate(lst[::-1]):
        if number < search_number:
            return len(lst) - count  # since I searched lst from top to bottom,
                                     # I need to also reverse count

t_v_index = first_idex_greater_than(5, t_v)
p_v_index = first_idex_lower_than(28, p_v)
print(res_v[min(t_v_index, p_v_index):max(t_v_index, p_v_index)])
It prints the list [35, 36, 37].
I'm sure you can optimize it better according to your needs.
The problem statement is not clearly defined, but this is what I interpret to be a likely solution.
import pandas as pd
tv = list(range(1, 11))
pv = list(range(20, 30))
res = list(range(30, 40))
df = pd.DataFrame({'tv': tv, 'pv': pv, 'res': res})
print(df)
def criteria(row, col1, a, col2, b):
    if (row[col1] > a) & (row[col2] < b):
        return True
    else:
        return False
df['select'] = df.apply(lambda row: criteria(row, 'tv', 5, 'pv', 28), axis=1)
selected_res = df.loc[df['select']]['res'].tolist()
print(selected_res)
# ... or another way ..
print(df.loc[(df.tv > 5) & (df.pv < 28)]['res'])
This builds a dataframe whose columns are the original lists, applies the selection criterion (based on the tv and pv columns jointly) to each row, and stores the result in a new boolean column identifying the rows where the criterion is satisfied.
output:
[35, 36, 37]
5    35
6    36
7    37

What's a more efficient way to calculate the max of each row in a matrix excluding its own column?

Given a 2D matrix np.array([[1,3,1],[2,0,5]]), I need to calculate the max of each row excluding the element's own column, with the expected result np.array([[3,1,3],[5,5,2]]). What would be the most efficient way to do so?
Currently I implemented it with a loop to exclude its own col index:
n = x.shape[0]
row_max_mat = np.zeros((n, n))
rng = np.arange(n)
for i in rng:
    row_max_mat[:, i] = np.amax(s_a_array_sum[:, rng != i], axis=1)
Is there a faster way to do so?
Similar idea to yours (exclude columns one by one), but with indexing:
a = np.array([[1, 3, 1], [2, 0, 5]])   # the example matrix
cols = a.shape[1]
mask = ~np.eye(cols, dtype=bool)
a[:, np.where(mask)[1]].reshape((a.shape[0], a.shape[1]-1, -1)).max(1)
Output:
array([[3, 1, 3],
       [5, 5, 2]])
You could do this using np.maximum.accumulate. Compute the forward and backward accumulations of maximums along the horizontal axis and then combine them with an offset of one:
import numpy as np
m = np.array([[1,3,1],[2,0,5]])
fmax = np.maximum.accumulate(m,axis=1)
bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
r = np.full(m.shape,np.min(m))
r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
print(r)
# [[3 1 3]
#  [5 5 2]]
This will require 3x the size of your matrix to process (although you could take that down to 2x if you want an in-place update). Adding a 3rd and 4th dimension could also work using a mask, but that would require columns^2 times the matrix's size to process and will likely be slower.
If needed, you can apply the same technique column wise or to both dimensions (by combining row wise and column wise results).
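For instance, a rough sketch of the column-wise variant (the max of each column excluding the element's own row), obtained by running the same steps on the transpose; the helper name excl_max_by_row is just for this sketch:
def excl_max_by_row(m):
    # same forward/backward maximum accumulation as above
    fmax = np.maximum.accumulate(m, axis=1)
    bmax = np.maximum.accumulate(m[:, ::-1], axis=1)[:, ::-1]
    r = np.full(m.shape, np.min(m))
    r[:, :-1] = np.maximum(r[:, :-1], bmax[:, 1:])
    r[:, 1:] = np.maximum(r[:, 1:], fmax[:, :-1])
    return r

col_variant = excl_max_by_row(m.T).T  # max of each column excluding the element's own row
print(col_variant)
# [[2 0 5]
#  [1 3 1]]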
a = np.array([[1,3,1],[2,0,5]])
row_max = a.max(axis=1).reshape(-1,1)                  # per-row maximum, as a column vector
b = (((a // row_max)+1)%2)                             # 1 where the element is NOT the row max (assumes non-negative entries)
c = b*row_max                                          # row max placed at every non-max position
d = (a // row_max)*((a*b).max(axis=1).reshape(-1,1))   # second-largest value placed at the max position
c+d # result
Since we are looking for the max excluding each element's own column, the output has each row filled with that row's max, except at the position of the max itself, which must be filled with the second-largest value. As such, argpartition fits right in there. So, here's one solution with it -
def max_exclude_own_col(m):
    out = np.full(m.shape, m.max(1, keepdims=True))
    sidx = np.argpartition(-m, 2, axis=1)
    R = np.arange(len(sidx))
    s0, s1 = sidx[:, 0], sidx[:, 1]
    mask = m[R, s0] > m[R, s1]
    L1c, L2c = np.where(mask, s0, s1), np.where(mask, s1, s0)
    out[R, L1c] = m[R, L2c]
    return out
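A quick sanity check on the sample input from the question:
m = np.array([[1, 3, 1],
              [2, 0, 5]])
print(max_exclude_own_col(m))
# [[3 1 3]
#  [5 5 2]]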
Benchmarking
Other working solution(s) for large arrays -
# @Alain T.'s soln
def max_accum(m):
    fmax = np.maximum.accumulate(m, axis=1)
    bmax = np.maximum.accumulate(m[:, ::-1], axis=1)[:, ::-1]
    r = np.full(m.shape, np.min(m))
    r[:, :-1] = np.maximum(r[:, :-1], bmax[:, 1:])
    r[:, 1:] = np.maximum(r[:, 1:], fmax[:, :-1])
    return r
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
So, we will test out with large arrays of various shapes for timings and speedups -
In [54]: import benchit
In [55]: funcs = [max_exclude_own_col, max_accum]
In [170]: inputs = [np.random.randint(0,100,(100000,n)) for n in [10, 20, 50, 100, 200, 500]]
In [171]: T = benchit.timings(funcs, inputs, indexby='shape')
In [172]: T
Out[172]:
Functions   max_exclude_own_col  max_accum
Shape
100000x10              0.017721   0.014580
100000x20              0.028078   0.028124
100000x50              0.056355   0.089285
100000x100             0.103563   0.200085
100000x200             0.188760   0.407956
100000x500             0.439726   0.976510
# Speedups with max_exclude_own_col over max_accum
In [173]: T.speedups(ref_func_by_index=1)
Out[173]:
Functions   max_exclude_own_col  Ref:max_accum
Shape
100000x10              0.822783            1.0
100000x20              1.001660            1.0
100000x50              1.584334            1.0
100000x100             1.932017            1.0
100000x200             2.161241            1.0
100000x500             2.220725            1.0

python: divide list into equal parts and add samples in each part together

The following is my script. Each equal part has self.number samples, and in0 is the list of input samples. I get the following error:
pn[i] = pn[i] + d
IndexError: list index out of range
Is this a problem with the size of pn? How can I define a list with a certain size but without putting exact values in it yet?
for i in range(0, len(in0)/self.number):
    pn = []
    m = i*self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
    if pn[i] >= self.alpha:
        out[i] = 1
    elif pn[i] <= self.beta:
        out[i] = 0
    else:
        if pn[i] >= self.noise:
            out[i] = 1
        else:
            out[i] = 0
    if pn[i] >= self.noise:
        out[i] = 1
    else:
        out[i] = 0
There are a number of problems in the code as posted; however, the gist seems to be something you would want to do with numpy arrays instead of iterating over lists.
For example, the set of if/else cases that check whether pn[i] >= some_value and then set a corresponding entry in another list with the result (true/false) can be done as a one-liner with an array operation, much faster than iterating over lists.
import numpy as np
# for example, assuming you have 9 numbers in your list
# and you want them divided into 3 sublists of 3 values each
# in0 is your original list, which for example might be:
in0 = [1.05, -0.45, -0.63, 0.07, -0.71, 0.72, -0.12, -1.56, -1.92]
# convert into array
in2 = np.array(in0)
# reshape to 3 rows, the -1 means that numpy will figure out
# what the second dimension must be.
in2 = in2.reshape((3,-1))
print(in2)
output:
[[ 1.05 -0.45 -0.63]
 [ 0.07 -0.71  0.72]
 [-0.12 -1.56 -1.92]]
With this 2-d array structure, element-wise summing is super easy. So is element-wise threshold checking. Plus 'vectorizing' these operations has big speed advantages if you are working with large data.
# add corresponding entries, we want to add the columns together,
# as each row should correspond to your sub-lists.
pn = in2.sum(axis=0) # you can sum row-wise or column-wise, or all elements
print(pn)
output: [ 1. -2.72 -1.83]
# it is also trivial to check the threshold conditions
# here I check each entry in pn against a scalar
alpha = 0.0
out1 = ( pn >= alpha )
print(out1)
output: [ True False False]
# you can easily convert booleans to 1/0
x = out1.astype('int') # or simply out1 * 1
print(x)
output: [1 0 0]
# if you have a list of element-wise thresholds
beta = np.array([0.0, 0.5, -2.0])
out2 = (pn >= beta)
print(out2)
output: [True False True]
I hope this helps. Using the correct data structures for your task can make the analysis much easier and faster. There is a wealth of documentation on numpy, which is the standard numeric library for python.
You initialize pn to an empty list just inside the for loop, never assign anything into it, and then attempt to access an index i. There is nothing at index i because there is nothing at any index in pn yet.
for i in range(0, len(in0) / self.number):
    pn = []
    m = i*self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
If you are trying to add the value d to the pn list, you should do this instead:
pn.append(d)
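Putting those fixes together, a rough sketch of what the loop seems to be aiming for, assuming out should collect one decision per block and that self.alpha, self.beta and self.noise are the thresholds implied by the question:
pn = []   # one running sum per block
out = []  # one decision per block
for i in range(len(in0) // self.number):
    m = i * self.number
    block_sum = sum(in0[m: m + self.number])  # add the samples of this block together
    pn.append(block_sum)
    if block_sum >= self.alpha:
        out.append(1)
    elif block_sum <= self.beta:
        out.append(0)
    else:
        out.append(1 if block_sum >= self.noise else 0)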

Averaging out sections of a multiple row array in Python

I've got a 2-row array called C like this:
from numpy import *
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = vstack((A,B))
I want to take all the columns in C where the value in the first row falls between i and i+2, and average them. I can do this with just A no problem:
i = 0
A_avg = []
while(i < 6):
    selection = A[logical_and(A >= i, A < i+2)]
    A_avg.append(mean(selection))
    i += 2
then A_avg is:
[1.0,2.5,4.5]
I want to carry out the same process with my two-row array C, but I want to take the average of each row separately, while doing it in a way that's dictated by the first row. For example, for C, I want to end up with a 2 x 3 array that looks like:
[[1.0,2.5,4.5],
[50,35,15]]
Where the first row is A averaged in blocks between i and i+2 as before, and the second row is B averaged in the same blocks as A, regardless of the values it has. So the first entry is unchanged, the next two get averaged together, and the next two get averaged together, for each row separately. Anyone know of a clever way to do this? Many thanks!
I hope this is not too clever. TIL boolean indexing does not broadcast, so I had to manually do the broadcasting. Let me know if anything is unclear.
import numpy as np
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = np.vstack((A,B)) # float so that I can use np.nan
i = np.arange(0, 6, 2)[:, None]
selections = np.logical_and(A >= i, A < i+2)[None]
D, selections = np.broadcast_arrays(C[:, None], selections)
D = D.astype(float) # allows use of nan, and makes a copy to prevent repeated behavior
D[~selections] = np.nan # exclude these elements from mean
D = np.nanmean(D, axis=-1)
Then,
>>> D
array([[  1. ,   2.5,   4.5],
       [ 50. ,  35. ,  15. ]])
Another way, using np.histogram to bin your data. This may be faster for large arrays, but it is only useful for a few rows, since a histogram must be computed with different weights for each row:
bins = np.arange(0, 7, 2) # include the end
n = np.histogram(A, bins)[0] # number of columns in each bin
a_mean = np.histogram(A, bins, weights=A)[0]/n
b_mean = np.histogram(A, bins, weights=B)[0]/n
D = np.vstack([a_mean, b_mean])
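A quick check (printed as a plain list) confirms this matches the expected result from the question:
print(D.tolist())
# [[1.0, 2.5, 4.5], [50.0, 35.0, 15.0]]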

Find two disjoint pairs of pairs that sum to the same vector

This is a follow-up to Find two pairs of pairs that sum to the same value.
I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(m,n))
I would like to determine if the matrix has two disjoint pairs of pairs of columns which sum to the same column vector. I am looking for a fast method to do this. In the previous problem ((0,1), (0,2)) was acceptable as a pair of pairs of column indices but in this case it is not as 0 is in both pairs.
The accepted answer from the previous question is so cleverly optimised I can't see how to make this simple looking change unfortunately. (I am interested in columns rather than rows in this question but I can always just do A.transpose().)
Here is some code to show it testing all 4 by 4 arrays.
n = 4
nxn = np.arange(n*n).reshape(n, -1)
count = 0
for i in xrange(2**(n*n)):
    A = (i >> nxn) % 2
    p = 1
    for firstpair in combinations(range(n), 2):
        for secondpair in combinations(range(n), 2):
            if firstpair < secondpair and not set(firstpair) & set(secondpair):
                if (np.array_equal(A[firstpair[0]] + A[firstpair[1]], A[secondpair[0]] + A[secondpair[1]])):
                    if (p):
                        count += 1
                        p = 0
print count
This should output 3136.
Here is my solution, extended to do what I believe you want. It isn't entirely clear though; one may get an arbitrary number of row-pairs that sum to the same total; there may exist unique subsets of rows within them that sum to the same value. For instance:
Given this set of row-pairs that sum to the same total
[[19 19 30 30]
 [11 16 11 16]]
There exists a unique subset of these rows that may still be counted as valid; but should it?
[[19 30]
 [16 11]]
Anyway, I hope those details are easy to deal with, given the code below.
import numpy as np

n = 20
#also works for non-square A
A = np.random.randint(2, size=(n*6,n)).astype(np.int8)
##A = np.array( [[0, 0, 0], [1, 1, 1], [1, 1 ,1]], np.uint8)
##A = np.zeros((6,6))
#force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]


def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray( np.rollaxis(a, -1))
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        R = np.zeros(a.shape[1:], dtype)
        for col in columns:
            R *= base
            R += col
        yield R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)  #note; this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations

    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isnt strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    #list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    #keep all packed columns; we might need them later
    columns = []

    for packed_column in base_pack_lazy(A, base=multiplicity+1):  #loop over packed cols
        columns.append(packed_column)
        #compute rowsums only for a fixed number of columns at a time.
        #this is O(n^2) rather than O(n^3), and after considering the first column,
        #we can typically already exclude almost all combinations
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        #find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        #prune those combinations which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        #early exit; no pairs
        if len(active_combinations)==0:
            return False

    """
    we now have a small set of relevant combinations, but we have lost the details of their particulars
    to see which combinations of rows does sum to the same value, we do need to consider rows as a whole
    we can simply apply the same mechanism, but for all columns at the same time,
    but only for the selected subset of row combinations known to be relevant
    """
    #construct full packed matrix
    B = np.ascontiguousarray(np.vstack(columns).T)
    #perform all relevant sums, over all columns
    rowsums = sum(B[I[active_combinations]] for I in combinations_index)
    #find the unique rowsums, by viewing rows as a void object
    unique, count, inverse = unique_count(voidview(rowsums))
    #if not, we did something wrong in deciding on active combinations
    assert(np.all(count>1))

    #loop over all sets of rows that sum to an identical unique value
    for i in xrange(len(unique)):
        #set of indexes into combinations_index;
        #note that there may be more than two combinations that sum to the same value; we grab them all here
        combinations_group = active_combinations[inverse==i]
        #associated row-combinations
        #array of shape=(mulitplicity,group_size)
        row_combinations = combinations_index[:,combinations_group]
        #if no duplicate rows involved, we have a match
        if len(np.unique(row_combinations[:,[0,-1]])) == multiplicity*2:
            print row_combinations
            return True
    #none of identical rowsums met uniqueness criteria
    return False

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array( [(i,j,k)
                     for i in xrange(n)
                     for j in xrange(n)
                     for k in xrange(n)
                     if i<j and j<k], dtype=np.uint16)
    idx = np.ascontiguousarray( idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n,-1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)


from time import clock
t = clock()
for i in xrange(1):
##    print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock()-t
Edit: code cleanup
