Vectorize iteration over two large numpy arrays in parallel - python

I have two large arrays of type numpy.core.memmap.memmap, called data and new_data, with > 7 million float32 items.
I need to iterate over them both within the same loop, which I'm currently doing like this:
for i in range(len(data)):
    if new_data[i] == 0:
        continue
    combo = (data[i], new_data[i])
    if combo not in new_values_map:
        new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
However this is unreasonably slow, so I gather that using numpy's vectorising functions is the way to go.
Is it possible to vectorize with the index, so that the vectorised array can compare its items to the corresponding items in the other array?
I thought of zipping the two arrays but I guess this would cause unreasonable overhead to prepare?
Is there some other way to optimise this operation?
For context: the goal is to effectively merge the two arrays such that each unique combination of corresponding values between the two arrays is represented by a different value in the resulting array, except zeros in the new_data array which are ignored. The arrays represent 3D bitmap images.
EDIT: available_values is a set of values that have not yet been used in data and persists across calls to this loop. new_values_map on the other hand is reset to an empty dictionary before each time this loop is used.
EDIT2: the data array only contains whole numbers; that is, it is initialised as zeros, and with each use of this loop with a different new_data it is populated with more values drawn from available_values, which is initially a range of integers. new_data could theoretically be anything.

In answer to your question about vectorising, the answer is probably yes, though you need to clarify what available_values contains and how it's used, as that is the core of the vectorisation.
Your solution will probably look something like this...
indices = new_data != 0
data[indices] = available_values
In this case, if available_values can be treated as a sequence whose first value is allocated to the first position in data where new_data is not 0, and so on, that should work, as long as available_values is a numpy array.
Let's say new_data and data take values 0-255. Then you can construct an available_data array with a unique entry for every possible pair of values in new_data and data, like the following:
available_data = numpy.arange(256 * 256).reshape((256, 256))
indices = new_data != 0
# data and new_data hold whole numbers stored as floats, so cast them to int for indexing
data[indices] = available_data[data[indices].astype(int), new_data[indices].astype(int)]
Obviously, available_data can be whatever mapping you want. The above should be very quick whatever is in available_data (especially if you only construct available_data once).
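For illustration, a small end-to-end usage sketch of this lookup-table idea (the tiny arrays below are made up; only the approach is taken from the answer above):
import numpy

data = numpy.array([0., 3., 7., 0.], dtype=numpy.float32)
new_data = numpy.array([5., 0., 2., 9.], dtype=numpy.float32)

# One distinct code per (data, new_data) pair, built once
available_data = numpy.arange(256 * 256).reshape((256, 256))

indices = new_data != 0
data[indices] = available_data[data[indices].astype(int), new_data[indices].astype(int)]
# data is now [5., 3., 1794., 9.]; positions where new_data == 0 are untouched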

Python gives you powerful tools for handling large arrays of data: generators and iterators.
Basically, they allow you to access your data as if it were a regular list, without fetching it all into memory at once, but accessing it piece by piece.
In the case of accessing two large arrays at once, you can do:
from itertools import izip  # in Python 3, use the built-in zip instead

for item_a, item_b in izip(data, new_data):
    # ... do your stuff here
izip creates an iterator that walks over the elements of both arrays in parallel, but it picks up pieces as you need them, not all at once.
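As a sketch of how this applies to the loop in the question (assuming Python 2's izip; in Python 3 the built-in zip is already lazy), note that the per-item work still happens in Python, so this mainly avoids building intermediate lists rather than removing the loop overhead:
from itertools import izip  # Python 2; use plain zip in Python 3

for i, (d, nd) in enumerate(izip(data, new_data)):
    if nd == 0:
        continue
    combo = (d, nd)
    if combo not in new_values_map:
        new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]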

It seems that replacing the first two lines of the loop to produce:
for i in numpy.where(new_data != 0)[0]:
    combo = (data[i], new_data[i])
    if combo not in new_values_map:
        new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
has the desired effect.
So most of the time in the loop was spent skipping iterations where new_data was zero. I don't really understand why that many empty iterations were so expensive; maybe one day I will...
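For reference, the remapping can also be done without any Python-level loop by labelling the unique (data, new_data) pairs with numpy.unique. This is only a sketch, and it assumes the new values may simply be consecutive integers placed after the largest value already in data, rather than being drawn from the original available_values set:
import numpy as np

mask = new_data != 0
pairs = np.stack((data[mask], new_data[mask]), axis=1)

# Label each distinct (data, new_data) pair with an integer 0..k-1
_, inverse = np.unique(pairs, axis=0, return_inverse=True)

# Offset the labels so they don't collide with values already in data
# (assumption: any values above data.max() are acceptable as new codes)
data[mask] = inverse + data.max() + 1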

Related

List of lists (or numpy array): extracting data via SQL-like methods?

The problem: given a (large) Python list-of-lists, or, semi-equivalently, a numpy array, extract information from the array in a SQL-like manner, i.e., as if the array were a database.
For example: given a 4-column by (large) N-row array, extract the rows where the first column is equal to X. In SQL this would be:
SELECT * FROM array WHERE col_1_id = X
In Python, however... ¯\_(ツ)_/¯
An attempt to make the issue clearer:
The array in question holds in each sublist/row [M, a^2, b^2, c^2], where M is the sum of the squares. The list contains millions of entries, and M ranges from ~100 to ~10^6.
The desire is to extract from this data only the rows for which at least 8 different rows have the same sum. Naively we can do this with a loop:
Output = []
for i in range(10**6):
    newarray = []
    for row in array:
        if row[0] == i:
            newarray.append(row)
    if len(newarray) >= 8:
        Output.extend(newarray)
save(Output, 'outputfilename')  # placeholder for whatever save routine is used
This output is a much shorter and more workable array. But my understanding is that this is incredibly inefficient (we're looping through a million-row array a million times; that's a trillion calls, which seems problematic).
Were this data in a database, I could grab it with:
SELECT * FROM array WHERE col_1 = i AND COUNT(i) >= 8
(depending on which SQL this might take a different form).
So far as I can tell, neither Python nor numpy has built-in functions that act like this. I don't expect the language to parse a SQL query, but there must be some tool within the language that approximates this function.
Numpy has a select method that doesn't actually select rows in this way, and some other methods that sound like they might make these operations possible but seem to do nothing of the sort. As mentioned below, the documentation is very thin on examples.
I have seen things somewhat like this done using collections.Counter(), but I'm not sure this specific desire can be done with it and am uncertain how to do it. The documentation is... thin on examples.
I'm aware of the fact that this may be an XY question, and have hence attempted to leave out the X except as examples of what I've tried. I am, however, in need of tools using Python (via SageMath/Jupyter). If there's a way of directly storing numpy/Python data in a database-like format and hitting it with SQL-like queries, that would be great too.
This might not be exactly what you are looking for, but I hope it can be helpful either way. :) I wrote a loop implementation that should be more efficient than the one you provided since we only loop through the column twice. We use a dictionary to keep track of the number of times a specific value in the first column occurs.
countDict = {}

# Count the number of times each sum occurs in the first column of the array
for row in array:
    if row[0] in countDict:
        # If the row sum already exists in the dictionary, increment its count
        countDict[row[0]] += 1
    else:
        # Otherwise start its count at 1
        countDict[row[0]] = 1

output = []  # Output to generate

# Loop through the array again
for row in array:
    # If the sum value occurred at least 8 times, add the row to the output list
    if countDict[row[0]] >= 8:
        output.append(row)
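Since the question mentions both numpy and collections.Counter, here is a hedged sketch of a vectorised alternative, assuming the data is (or can be converted to) an (N, 4) numpy array named array:
import numpy as np

arr = np.asarray(array)  # assumes `array` is the N x 4 data from the question

# Count how many rows share each first-column value, then keep rows whose value occurs >= 8 times
_, inverse, counts = np.unique(arr[:, 0], return_inverse=True, return_counts=True)
output = arr[counts[inverse] >= 8]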

For-Loop over python float array

I am working with the IRIS dataset. I have two sets of data, (1 training set) (2 test set). Now I want to calculate the euclidean distance between every test set row and the train set rows. However, I only want to include the first 4 points of the row.
A working example would be:
dist = np.linalg.norm(inner1test[0][0:4] - inner1train[0][0:4])
print(dist)
# output: 3.034243
The problem is that I have 120 training set points and 30 test set points, so I would have to do 3600 operations manually; thus I thought about iterating through with a for-loop. Unfortunately, every one of my attempts fails.
This is my best attempt, which produces the error message below:
for i in inner1test:
    for number in inner1train:
        dist = np.linalg.norm(inner1test[i][0:4]-inner1train[number][0:4])
        print(dist)
(IndexError: arrays used as indices must be of integer (or boolean) type)
What would be the best solution to iterate through this array?
PS: I will also provide a screenshot for better visualisation.
From what I see, inner1test is a tuple of lists, so the i value will not be an index but the actual list.
You should use enumerate, which returns two variables, the index and the actual data.
for i, value in enumerate(inner1test):
    for j, number in enumerate(inner1train):
        dist = np.linalg.norm(inner1test[i][0:4] - inner1train[j][0:4])
        print(dist)
Also, if your lists start to get bigger, consider using a generator, which will run your calculations iteration by iteration and return only one value at a time, avoiding returning one big chunk of results that would occupy a lot of memory.
eg:
def my_calculation(inner1test, inner1train):
    for i, value in enumerate(inner1test):
        for j, number in enumerate(inner1train):
            dist = np.linalg.norm(inner1test[i][0:4] - inner1train[j][0:4])
            yield dist

for dist in my_calculation(inner1test, inner1train):
    print(dist)
You might also want to investigate Python list comprehensions, which are sometimes a more elegant way to handle for-loops over lists (see the sketch just below).
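For instance, a minimal list-comprehension sketch of the same nested loop (variable names carried over from the question):
distances = [np.linalg.norm(testvalue[0:4] - testtrain[0:4])
             for testvalue in inner1test
             for testtrain in inner1train]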
[EDIT]
Here's a probably simpler solution anyway, without the need for indexes, which won't fail when iterating over a numpy object:
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4] - testtrain[0:4])
[/EDIT]
This was the final solution with the correct output for me:
distanceslist = list()
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4] - testtrain[0:4])
        distances = (dist, testtrain[0:4])
        distanceslist.append(distances)
distanceslist
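As a side note, this kind of all-pairs Euclidean distance can also be computed without explicit loops via numpy broadcasting. A sketch, assuming inner1test and inner1train can be converted to 2-D float arrays:
import numpy as np

test = np.asarray(inner1test, dtype=float)[:, :4]    # 30 test rows, first 4 columns
train = np.asarray(inner1train, dtype=float)[:, :4]  # 120 training rows, first 4 columns

# dists[i, j] is the distance between test row i and training row j
dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)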

Speeding up fancy indexing with numpy

I have two numpy arrays, each with shape (10000, 10000).
One is a value array and the other is an index array.
Value=np.random.rand(10000,10000)
Index=np.random.randint(0,1000,(10000,10000))
I want to make a list (or 1D numpy array) by summing the "Value" array entries grouped by the "Index" array. For example, for each index i, find the positions where the index array equals i and sum the corresponding values:
NewArray = np.zeros(1000)
for i in range(1000):
    NewArray[i] = np.sum(Value[np.where(Index == i)])
However, this is too slow, since I have to run this loop over 300,000 such arrays.
I tried to come up with some logical indexing method like
NewArray[Index] += Value[Index]
But it didn't work.
The next thing I tried is using dictionary
for k, v in list(zip(Index.flatten(), Value.flatten())):
    NewDict[k].append(v)
and
for i in NewDict:
    NewDict[i] = np.sum(NewDict[i])
But it was slow too
Is there any smart way to speed up?
I had two thoughts. First, try masking; it speeds this up by about 4x:
for i in range(1000):
    NewArray[i] = np.sum(Value[Index == i])
Alternatively, you can sort your arrays to put the values you're adding together in contiguous memory. Masking or using where() has to gather all your values together each time you call sum on the slice. By front-loading this gathering, you might be able to speed things up considerably:
# flatten your arrays
vals = Value.ravel()
inds = Index.ravel()

s = np.argsort(inds)       # these are the indices that will sort your Index array
v_sorted = vals[s].copy()  # the copy here orders the values in memory instead of just providing a view
i_sorted = inds[s].copy()

# 1 greater than your max index; this gives you your array end
searches = np.searchsorted(i_sorted, np.arange(0, i_sorted[-1] + 2))

for i in range(len(searches) - 1):
    st = searches[i]
    nd = searches[i + 1]
    NewArray[i] = v_sorted[st:nd].sum()
This method takes 26 sec on my computer vs 400 using the old way. Good luck. If you want to read more about contiguous memory and performance check this discussion out.
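Another approach worth trying, not mentioned in the answer above, is np.bincount with weights, which does the grouped sum in a single vectorised call. A sketch, assuming the index values run from 0 to 999 as in the question:
import numpy as np

Value = np.random.rand(10000, 10000)
Index = np.random.randint(0, 1000, (10000, 10000))

# Sum the Value entries grouped by their Index label in one call
NewArray = np.bincount(Index.ravel(), weights=Value.ravel(), minlength=1000)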

More efficient way to find index of objects in Python array

I have a very large 400x300x60x27 array (let's call it 'A'). I took the maximum values along the last axis, which gives a 400x300x60 array called 'B'. Basically I need to find the index in 'A' of each value in 'B'. I have converted them both to lists and set up a for loop to find the indices, but it takes an absurdly long time to get through it because there are over 7 million values. This is what I have:
B=np.zeros((400,300,60))
C=np.zeros((400*300*60))
B=np.amax(A,axis=3)
A=np.ravel(A)
A=A.tolist()
B=np.ravel(B)
B=B.tolist()
for i in range(0, 400*300*60):
    C[i] = A.index(B[i])
Is there a more efficient way to do this? Its taking hours and hours and the program is still stuck on the last line.
You don't need amax, you need argmax. With argmax, the array will contain the indices rather than the values, and finding values from indices is computationally much cheaper than the reverse.
So I would recommend storing only the indices, before flattening the array.
Instead of np.amax, run A.argmax(axis=3); the result will contain the indices.
But before flattening to 1D, you will need a mapping that converts those per-axis indices to 1D (linear) indices as well. This is probably a trivial problem, as it only requires some basic arithmetic, and while that also takes some time since it runs over every element, it isn't a search problem and will save you quite some time.
You are getting those argmax indices, and because of the flattening you are basically converting them to their linear-index equivalents.
Thus, a solution would be to add the proper offsets to the argmax indices in steps, leveraging broadcasting at each one, like so -
m,n,r,s = A.shape
idx = A.argmax(axis=3)
idx += s*np.arange(r)
idx += r*s*np.arange(n)[:,None]
idx += n*r*s*np.arange(m)[:,None,None] # idx is your C output
Alternatively, a compact way to put it would be like so -
m,n,r,s = A.shape
I,J,K = np.ogrid[:m,:n,:r]
idx = n*r*s*I + r*s*J + s*K + A.argmax(axis=3)
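A quick sanity check for either variant (a sketch; a small random array stands in for the 400x300x60x27 one from the question):
import numpy as np

A = np.random.rand(4, 3, 5, 6)  # small stand-in for the 400x300x60x27 array
m, n, r, s = A.shape

I, J, K = np.ogrid[:m, :n, :r]
idx = n*r*s*I + r*s*J + s*K + A.argmax(axis=3)

# The linear indices should recover exactly the per-slice maxima
assert np.array_equal(A.ravel()[idx], A.max(axis=3))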

Permutations over subarray in python

I have an array of identifiers that have been grouped into threes. For each group, I would like to randomly assign them to one of three sets and to have those assignments stored in another array. So, for a given array of grouped identifiers (I presort them):
groupings = array([1,1,1,2,2,2,3,3,3])
A possible output would be
assignments = array([0,1,2,1,0,2,2,0,1])
Ultimately, I would like to be able to generate many of these assignment lists and to do so efficiently. My current method is just to create a zeros array and set each consecutive subarray of length 3 to a random permutation of 3.
assignment = numpy.zeros((12, 10), dtype=int)
for i in range(0, 12, 3):
    for j in range(10):
        assignment[i:i+3, j] = numpy.random.permutation(3)
Is there a better/faster way?
Two things I can think of:
Instead of visiting the 2D array as 3 rows by 1 column in your inner loop, try to visit it as 1 by 3. Accessing a 2D array row-first is usually faster than column-first, since it gives you better spatial locality, which is good for caching.
Instead of running numpy.random.permutation(3) each time, if 3 is fixed and small, generate the arrays of permutations beforehand and save them in a constant array of arrays like (array([0,1,2]), array([0,2,1]), array([1,0,2]), ...). You then just need to randomly pick one array from it each time (see the sketch below).
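A minimal sketch of the second idea (the group count and output shape are assumed from the question's 12 x 10 example):
import itertools
import numpy as np

# All 6 permutations of (0, 1, 2), generated once
perms = np.array(list(itertools.permutations(range(3))))

n_groups, n_cols = 4, 10  # 4 groups of 3 rows -> a 12 x 10 assignment array

# Randomly pick one precomputed permutation per (group, column)
choice = np.random.randint(len(perms), size=(n_groups, n_cols))
assignment = perms[choice].transpose(0, 2, 1).reshape(n_groups * 3, n_cols)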
