Creating an array without certain ranges - python

In Python I have a numpy.ndarray called a and a list of indices called b. I want to get a list of all the values of a that are not within -10..+10 places of any index in b.
This is my current code, which takes a lot of time to run due to allocations of data (a is very big):
aa = a
# Remove all ranges backwards
for bb in b[::-1]:
    aa = np.delete(aa, range(bb-10, bb+10))
Is there a way to do it more efficiently? Preferably with few memory allocations.

np.delete will take an array of indices of any size. You can simply populate an entire array of indices up front and perform the delete once, so you only deallocate and reallocate once. (Not tested; possible typos.)
bb = np.empty((b.size, 21), dtype=int)
for i, v in enumerate(b):
    bb[i] = v + np.arange(-10, 11)
result = np.delete(a, bb.flat)  # np.delete returns a new array; looks like .flat is optional
Note, if your ranges overlap, you'll get a difference between this and your algorithm: because your successive deletes shift the remaining elements, yours can end up removing items that were originally more than 10 indices away.
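If b is already a NumPy integer array, the same index array can be built without the Python loop via broadcasting; a minimal sketch (the small a and b here are just stand-ins, and all shifted indices are assumed to stay in bounds):
import numpy as np

a = np.arange(100)          # stand-in for the real (large) data array
b = np.array([20, 50, 75])  # stand-in for the array of indices

# broadcast the offsets -10..10 against every index in b -> shape (len(b), 21),
# then flatten and delete in a single call
remove = (b[:, None] + np.arange(-10, 11)).ravel()
result = np.delete(a, remove)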

Could you find a certain number that you're sure will not be in a, and then set all indices around the b indices to that number, so that you can remove it afterwards?
import numpy as np
# b must be a NumPy array here so that b + i is a vectorized index array
for i in range(-10, 11):
    a[b + i] = number_not_in_a
values = set(np.unique(a)) - set([number_not_in_a])
This code will not allocate new memory for a at all, needs only one range object, and does the job in exactly 22 C-optimized numpy operations (well, 43 if you count the b + i operations), plus the cost of turning the array returned by unique into a set.
Beware: if b includes indices less than 10, the number_not_in_a "zone" around those indices will wrap around to the other end of the array (via negative indexing). If b includes indices larger than len(a) - 11, the operation will fail with an IndexError at some point.
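If those boundary cases matter, one way to keep the idea while avoiding the wrap-around and the IndexError is to clip the shifted indices to the valid range first; a small illustrative sketch (the arrays and the sentinel value are placeholders):
import numpy as np

a = np.arange(100, dtype=float)  # stand-in data
b = np.array([3, 50, 97])        # indices near both ends of a
number_not_in_a = -1.0           # sentinel assumed to be absent from a

for i in range(-10, 11):
    # clip keeps every shifted index inside [0, len(a) - 1],
    # so nothing wraps around and nothing goes out of bounds
    a[np.clip(b + i, 0, len(a) - 1)] = number_not_in_a

values = set(np.unique(a)) - {number_not_in_a}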

Related

Given a set t of tuples containing elements from the set S, what is the most efficient way to build another set whose members are not contained in t?

For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
     [7,3],
     [3,1]]
d = some_algorithm(S,t)
# one possible result:
# d = [[2,1],
#      [3,2],
#      [7,0]]
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, it rarely results in a full (m,2) array). I feel like there has to be some fancy tensor trick I can do to achieve this, or maybe I could build a large hash map of the values in t so that checking membership in t is O(1), but that just produces the same issue with memory instead. Is there a more efficient way?
An approximate solution is also okay.
My naive attempt would be a base-transformation function to reduce the problem to an integer-set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set, we can write each element of t as an M-digit number in base L, using the elements of S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; then floor(x/L) and repeat until all indices are extracted; finally look the values up in S and construct the tuple.
since you can now represent t as a set of integers (read: hash table), calculating the inverse set d becomes rather trivial:
loop over all codes from 0 to L^M - 1 and ask your hashtable whether each one is in t; every code that is not belongs to d.
if the size of S is too big for that, you can instead just draw random numbers and check them against the hashtable, sampling a subset of the inverse of t.
does this help you?
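A minimal sketch of that encoding in Python, assuming 2-tuples (M = 2) and that tuple order matters exactly as encoded:
S = [0, 1, 2, 3, 7]
t = [[0, 1], [7, 3], [3, 1]]

L = len(S)
index_of = {x: i for i, x in enumerate(S)}   # I(x)

def f(tup):                                  # tuple -> base-L integer
    return index_of[tup[1]] * L + index_of[tup[0]]

def f_inv(code):                             # integer -> tuple
    return (S[code % L], S[(code // L) % L])

encoded_t = {f(tup) for tup in t}            # t as an integer hash set

# enumerate the complement: every code in [0, L**2) whose tuple is not in t
d = [f_inv(code) for code in range(L ** 2) if code not in encoded_t]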
If |t| + |d| << |S|^2, then the probability that a randomly drawn tuple has already been chosen (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that by redrawing elements until each one is new, you need O((1/(1-C)) * |d|) draws on average to produce |d| new elements, which is O(|d|) if C is indeed a constant.
Checking whether an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup is constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; it will make some errors, claiming an element has been "seen" when it has not, but never the other way around, so you will still get only unique elements in d.
Sorting t in place and using binary search. This adds O(|t| log|t|) pre-processing and O(log|t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2)-time solution is to run a Fisher-Yates shuffle over the available choices and take the first |d| elements that do not appear in t.
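A minimal sketch of the redraw-until-new approach with a hash set over tuples (the sample data and the target size m are illustrative):
import random

S = [0, 1, 2, 3, 7]
t = {(0, 1), (7, 3), (3, 1)}   # already-seen tuples as a hash set, O(1) lookups
m = 4                          # how many new tuples we want

seen = set(t)
d = []
while len(d) < m:
    candidate = (random.choice(S), random.choice(S))
    if candidate not in seen:  # expected O(1/(1-C)) redraws per new element
        seen.add(candidate)
        d.append(candidate)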

Mapping between ids and indices using numpy arrays

I'm working on a graphical application that uses shapes such as quads, trias, lines etc. to represent geometry.
The input data is based on IDs.
A list of points is provided, each with an ID and coordinates (x, y, z)
A list of shapes is provided, each defined using the IDs from the list of points
So a tria is defined as N1, N2, N3, where the N's are IDs into the list of points
I'm using VTK to display the data, and it uses indices, not IDs.
So I have to convert the ID-based input to index-based input, and I use the following numpy array approach, which works REALLY well (and was provided by someone on this board, I think):
# nodes - numpy array of point ids
nodes=np.asarray([1, 15, 56, 101, 150]) # This array can be millions of ids long
nmx = nodes.max()
node_id_to_index = np.empty((nmx + 1,), dtype=np.uint32)
# Using the node id as an index, insert consecutive indices as values into the array
# This gives us an array that can be indexed by ID and return the compact index
node_id_to_index[nodes] = np.arange(len(nodes), dtype=np.uint32)
Now when I have a shape defined using ids I can easily convert it to use indices like this
elems_by_id = np.asarray([56,1,150,15,101,1]) # This array can be millions of ids long
elems_by_index = node_id_to_index[elems_by_id]
# gives [2, 0, 4, 1, 3, 0]
One weakness of the approach is that if the original list of IDs contains even one VERY large ID, I'm required to allocate a mapping array big enough to be indexed by it, even though I may not have anywhere near that many entries in the original ID list (the original ID list can have gaps in the IDs). I ran into this condition today...
So my question is - how can I modify this approach to handle lists that contain ids so large that I don't have enough memory to create the mapping array?
Any help will be gratefully received....
Doug
OK - I think I found a solution - credit to @Paul Panzer.
But first some additional info - the input nodes array is sorted and guaranteed to contain only unique IDs:
elems_by_index = nodes.searchsorted(elems_by_id)
This is only marginally slower than the original approach - so I'll just branch based on the max ID in nodes: use the original approach when I can easily allocate enough memory, and the searchsorted approach when the max ID is huge.
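A small sketch of that branching idea (build_lookup and the max_table_size threshold are illustrative names, not part of the original code):
import numpy as np

def build_lookup(nodes, elems_by_id, max_table_size=10_000_000):
    # nodes is assumed sorted and to contain only unique IDs
    nmx = nodes.max()
    if nmx + 1 <= max_table_size:
        # dense table: O(1) fancy-indexed lookups, memory proportional to the max ID
        table = np.empty(nmx + 1, dtype=np.uint32)
        table[nodes] = np.arange(len(nodes), dtype=np.uint32)
        return table[elems_by_id]
    # huge or sparse ID range: binary search instead, O(log n) per lookup
    return nodes.searchsorted(elems_by_id)
For example, build_lookup(np.asarray([1, 15, 56, 101, 150]), np.asarray([56, 1, 150])) gives [2 0 4] via the dense path.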
As I understand it, essentially you're looking for a fast way to find the index of a number in a list, e.g. you have a list like:
nodes = [932, 578, 41, ...]
and need a structure that would give
id_to_index[932] == 0
id_to_index[578] == 1
id_to_index[41] == 2
# etc.
(which can be done with something as simple as nodes.index(932), except that it wouldn't be fast at all). Your current solution is essentially the first part of the pigeonhole sort algorithm, and the problem is that the condition "the number of elements (n) and the length of the range of possible key values (N) are approximately the same" isn't met - the range is much bigger in your case, so too much memory is wasted on that auxiliary data structure.
Why not simply use a Python dictionary, by the way? E.g. id_to_index = {932: 0, 578: 1, 41: 2, ...} - is it too slow (your current lookup is O(1) with very low overhead; a dictionary lookup is also O(1) on average, but with considerably more per-element overhead)? Or is it because you want numpy indexing (e.g. id_to_index[[n1, n2, n3]] instead of one by one)? Perhaps, then, you can use SciPy sparse matrices (a single-row matrix instead of an array):
import numpy as np
import scipy.sparse as sp
nodes = np.array([9, 2, 7]) # a small test sample
# your solution with a numpy array
nmx = nodes.max()
node_id_to_index = np.empty((nmx + 1,), dtype=np.uint32)
node_id_to_index[nodes] = np.arange(len(nodes), dtype=np.uint32)
elems_by_id = [7, 2] # an even smaller test sample
elems_by_index = node_id_to_index[elems_by_id]
# gives [2 1]
print(elems_by_index)
# same with a 1-by-(nmx + 1) sparse matrix
m = sp.csr_matrix((1, nmx + 1), dtype=np.uint32) # 1 x 10 matrix, stores nothing
m[0, nodes] = np.arange(len(nodes), dtype=np.uint32) # 1 x 10 matrix, stores 3 elements
m_by_index = m[0, elems_by_id] # looking through the 0-th row
print(m_by_index.toarray()[0]) # also gives [2 1]
Not sure if I chose the optimal type of matrix for this; read the descriptions of the different sparse matrix formats to find the best one for the task.
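For comparison, the plain-dictionary variant mentioned above needs no upper bound on the IDs at all, at the cost of a Python-level loop during lookup; a minimal sketch using the sample data from the question:
import numpy as np

nodes = np.asarray([1, 15, 56, 101, 150])
elems_by_id = np.asarray([56, 1, 150, 15, 101, 1])

# id -> compact index, memory proportional to len(nodes), not to the max ID
id_to_index = {nid: i for i, nid in enumerate(nodes.tolist())}
elems_by_index = np.fromiter((id_to_index[e] for e in elems_by_id.tolist()),
                             dtype=np.uint32, count=len(elems_by_id))
# gives [2 0 4 1 3 0]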

Selecting randomly from two arrays based upon condition in Python

Suppose I have two arrays of equal length:
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
Now I want to pick elements from these two arrays, in the sequence given, so that they form a new array of the same length as a and b, randomly selecting values between a and b in the ratio a:b = 1:4.68, i.e. for every 1 value picked from a there should be about 4.68 values picked from b in the resultant array.
So effectively the resultant array could be something like :
res = [0,1,1,0,1, 1 (from a), 0 (from a), 1,1,0,0,1,1,0, 0 (from a), 0,0]
The res array here has: the first 5 values from b, the 6th and 7th from a, the 8th-14th from b, the 15th from a, and the 16th-17th from b.
The overall ratio in this res example is about 1:4.67 (3 values from a, 14 from b).
So values have to be chosen at random between the two arrays, but the sequence has to be maintained - I cannot take the 7th value from one array and the 3rd from the other. If the 3rd value of the resultant array is being populated, the choice is between the 3rd elements of both input arrays, at random. The overall ratio also needs to be maintained.
Can you please help me develop an efficient, Pythonic way of reaching this result? The solution need not be consistent from run to run with respect to the values chosen.
Borrowing the a_count calculation from Barmar's answer (because it seems to work and I can't be bothered to reinvent it), this solution preserves the ordering of the values chosen from a and b:
from future_builtins import zip # Only on Python 2, to avoid temporary list of tuples
import random
# int() unnecessary on Python 3
a_count = int(round(1/(1 + 4.68) * len(a)))
# Use range on Python 3, xrange on Python 2, to avoid making actual list
a_indices = frozenset(random.sample(xrange(len(a)), a_count))
res = [aval if i in a_indices else bval for i, (aval, bval) in enumerate(zip(a, b))]
The basic idea here is that you determine how many a values you need, get a unique sample of the possible indices of that size, then iterate a and b in parallel, keeping the a value for the selected indices, and the b value for all others.
If you don't like the complexity of the list comprehension, you could use a different approach, copying b, then filling in the a values one by one:
res = b[:] # Copy b in its entirety
# Replace selected indices with a values
# No need to convert to frozenset for efficiency here, and it's clean
# enough to just iterate the sample directly without storing it
for i in random.sample(xrange(len(a)), a_count):
    res[i] = a[i]
I believe this should work. You specify how many values you want from a (you can simply use your ratio to figure out that number), randomly generate a 'mask' of numbers, and choose from a or b based on a cutoff (notice that you only sort to figure out the cutoff, but you use the unsorted mask later).
import numpy as np
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
mask = np.random.random(len(a))
from_a = 3
cutoff = np.sort(mask)[-from_a]  # the from_a-th largest mask value, so exactly from_a positions satisfy mask >= cutoff
res = []
for i in range(len(a)):
    if mask[i] >= cutoff:
        res.append(a[i])
    else:
        res.append(b[i])
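The same selection can also be done without the Python loop; a vectorized sketch of the idea using np.where (assuming a and b are converted to NumPy arrays):
import numpy as np

a = np.array([0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1])
b = np.array([0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0])

from_a = 3                           # number of positions to take from a
mask = np.random.random(len(a))
cutoff = np.sort(mask)[-from_a]      # the from_a-th largest mask value
res = np.where(mask >= cutoff, a, b) # take a where the mask is large enough, b elsewhere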

More efficient way to find index of objects in Python array

I have a very large 400x300x60x27 array (let's call it 'A'). I took the maximum values along the last axis, which gives a 400x300x60 array called 'B'. Basically I need to find the index in 'A' of each value in 'B'. I have converted them both to lists and set up a for loop to find the indices, but it takes an absurdly long time to get through because there are over 7 million values. This is what I have:
B=np.zeros((400,300,60))
C=np.zeros((400*300*60))
B=np.amax(A,axis=3)
A=np.ravel(A)
A=A.tolist()
B=np.ravel(B)
B=B.tolist()
for i in range(0,400*300*60):
    C[i]=A.index(B[i])
Is there a more efficient way to do this? It's taking hours and hours and the program is still stuck on the last line.
You don't need amax, you need argmax. With argmax the array contains the indices rather than the values, and looking values up by index is computationally much cheaper than searching for the index of a value.
So I would recommend storing only the indices, before flattening the array:
instead of np.amax, run A.argmax(axis=3); this will contain the indices.
But before you flatten to 1D, you will need to map those last-axis indices to flattened 1D indices as well. That is a fairly simple problem, needing only some basic arithmetic, and while it has to be applied to every element, it is no longer a search problem and will save you a great deal of time.
You are getting those argmax indices, and because of the flattening you basically need the linear-index equivalents of those.
Thus, a solution would be to add the proper offsets into the argmax indices in steps, leveraging broadcasting at each one of them, like so -
m,n,r,s = A.shape
idx = A.argmax(axis=3)
idx += s*np.arange(r)
idx += r*s*np.arange(n)[:,None]
idx += n*r*s*np.arange(m)[:,None,None] # idx is your C output
Alternatively, a compact way to put it would be like so -
m,n,r,s = A.shape
I,J,K = np.ogrid[:m,:n,:r]
idx = n*r*s*I + r*s*J + s*K + A.argmax(axis=3)
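If you would rather let NumPy do the offset arithmetic for you, np.ravel_multi_index gives the same linear indices; a small sketch (the tiny random A is just a stand-in for the 400x300x60x27 array):
import numpy as np

A = np.random.rand(4, 3, 6, 5)  # small stand-in for the 400x300x60x27 array

m, n, r, s = A.shape
idx3 = A.argmax(axis=3)                             # argmax along the last axis, shape (m, n, r)
I, J, K = np.indices((m, n, r))                     # coordinate grids with the same shape as idx3
C = np.ravel_multi_index((I, J, K, idx3), A.shape)  # linear indices into A.ravel()

# sanity check: the values at those linear indices are the axis-3 maxima
assert np.array_equal(A.ravel()[C.ravel()], A.max(axis=3).ravel())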

Replace loop with broadcasting in numpy -> memory error

I have a 2D array (array1) with an arbitrary number of rows. Its first column holds strictly monotonically increasing numbers (but not linearly spaced), which represent positions in my system, while the second column gives a value representing the state of my system at and around the position in the first column.
Now I have a second array (array2); its range should usually be the same as that of the first column of the first array, but that does not matter too much, as you will see below.
I am now interested, for every element in array2, in the following:
1. Which index in array1[:,0] holds the value closest to the current element of array2?
2. What is the corresponding value array1[:,1] at that index?
Since array2 will usually be longer than array1 has rows, it is perfectly fine if the same index from array1 comes up more than once. In fact this is what I expect.
The values from 2. are written into the second and third columns, as you will see below.
My stripped-down code looks like this:
from numpy import arange, zeros, absolute, argmin, mod, newaxis, ones
ysize1 = 50
array1 = zeros((ysize1+1,2))
array1[:,0] = arange(ysize1+1)**2
# can be any strictly monotonic increasing array
array1[:,1] = mod(arange(ysize1+1),2)
# in my current case, but could also be something else
ysize2 = (ysize1)**2
array2 = zeros((ysize2+1,3))
array2[:,0] = arange(0,ysize2+1)
# is currently uniformly distributed over the whole range, but does not necessarily have to be
a = 0
for i, array2element in enumerate(array2[:,0]):
    a = argmin(absolute(array1[:,0]-array2element))
    array2[i,1] = array1[a,1]
It works, but takes a lot of time to process large arrays. I then tried to implement broadcasting, which seems to work with the following code:
indexarray = argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
array2[:,2]=array1[indexarray,1] # just to compare the results
Unfortunately now I seem to run into a different problem: I get a memory error from the broadcasting line for the array sizes I am using.
For small sizes it works, but for larger ones, where len(array2[:,0]) is something like 2**17 (and could be even larger) and len(array1[:,0]) is about 2**14, I gather that the intermediate array is bigger than the available memory. Is there an elegant way around that, or some other way to speed up the loop?
I do not need to store the intermediate array(s), I am just interested in the result.
Thanks!
First let's simplify this line:
argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
it should be:
a = array1[:, 0]
b = array2[:, 0]
argmin(abs(a - b[:, newaxis]), 1)
But even when simplified, you're creating two large temporary arrays. If a and b have sizes M and N, a - b[:, newaxis] and abs(...) each create a temporary array of shape (N, M). Because you've said that a is monotonically increasing, you can avoid the issue altogether by using a binary search (sorted search), which is much faster anyway. Take a look at the answer I wrote to this question a while back. Using the function from this answer, try this:
closest = find_closest(array1[:, 0], array2[:, 0])
array2[:, 2] = array1[closest, 1]
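The linked find_closest is not reproduced here, but a minimal version of the idea - np.searchsorted on the sorted first column, then picking whichever neighbour is nearer - could look roughly like this (using array1 and array2 from the question):
import numpy as np

def find_closest(sorted_vals, targets):
    # index of the element of sorted_vals closest to each target; sorted_vals must be ascending
    right = np.searchsorted(sorted_vals, targets)     # first position with value >= target
    right = np.clip(right, 1, len(sorted_vals) - 1)
    left = right - 1
    # choose whichever neighbour is nearer to the target
    choose_left = (targets - sorted_vals[left]) <= (sorted_vals[right] - targets)
    return np.where(choose_left, left, right)

closest = find_closest(array1[:, 0], array2[:, 0])
array2[:, 2] = array1[closest, 1]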
