Iterating on data in two 3D arrays python - python

I'm trying to run a number of functions over a set of satellite imagery (in this example, similarity functions). My plan was to iterate through the pixels of both images simultaneously, each pixel containing 4 numbers, calculate one value per pixel pair from those two sets of numbers, e.g. scipy.spatial.distance.correlation(pixels_0, pixels_1), and write that value to an array.
The problem is that when I run this loop I can't get the results saved into a 1000x1000 array holding one value per pixel.
array_0 = # some array with dimensions (1000, 1000, 4)
array_1 = # some array with dimensions (1000, 1000, 4)
results_array = []
for rows_0, rows_1 in itertools.izip(array_0, array_1):
    for pixels_0, pixels_1 in itertools.izip(rows_0, rows_1):
        results = some_function(pixels_0, pixels_1)
        print results
        # successfully prints desired results
        results_array.append(results)
        # unsuccessful in creating the desired array
The results I want are printed in the run window, but I don't know how to put them back into an array that I could manipulate in a similar manner. Are my for loops the issue, or is this a simple issue with appending back to arrays? Any explanation on speeding it up would also be great, as I'm very new to Python and programming altogether.
a = np.random.rand(10, 10, 4)
b = np.random.rand(10, 10, 4)
def dotprod(T0, T1):
    return np.dot(T0, T1)/(np.linalg.norm(T0)*np.linalg.norm(T1))
results = dotprod(a.flatten(), b.flatten())
results = results.reshape(a.shape)
This now causes ValueError: total size of new array must be unchanged,
and when printing the first results value I receive only one number. Is this the fault of my own poorly constructed function or in how I am using numpy?

The best way is to use NumPy for your task. You should think in vectors, and you should write your some_function() to work in a vectorized manner. Here is an example:
array_0 = np.random.rand(1000,1000,4)
array_1 = np.random.rand(1000,1000,4)
results = some_function(array_0.flatten(), array_1.flatten()) ## this will be (1000*1000*4 X 1)
results = results.reshape(array_0.shape) ## reshaping to make it the way you want it.

Before investing any more effort into programming it this way, take a look at the numpy package. It will be many times faster!
About your code: shouldn't your results array also be multidimensional? In your inner (per row) loop you should be appending to a row, which you then append to your results matrix in your outer loop.
Try it with a small amount of data (e.g. 10 x 10 x 4) to learn from, but after that switch to numpy as soon as you can...
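As a rough sketch of what "vectorized" could look like for the per-pixel case in the question: the function below computes the correlation distance (one minus the Pearson correlation of the two 4-band vectors) for every pixel at once by reducing over the last axis only. The name correlation_distance and the 10 x 10 x 4 test arrays are assumptions made for illustration. Note that flattening both arrays first, as in the question's edit, collapses everything into a single number, which is why the reshape fails there.

```python
import numpy as np

def correlation_distance(a, b):
    """Per-pixel correlation distance along the last (band) axis.

    a, b: arrays of shape (rows, cols, bands). Returns shape (rows, cols).
    """
    # Centre each pixel's band values around that pixel's own mean.
    a_c = a - a.mean(axis=-1, keepdims=True)
    b_c = b - b.mean(axis=-1, keepdims=True)
    # The dot product and the norms are taken over the band axis only,
    # so one value per pixel survives.
    num = np.sum(a_c * b_c, axis=-1)
    den = np.linalg.norm(a_c, axis=-1) * np.linalg.norm(b_c, axis=-1)
    return 1.0 - num / den

rng = np.random.default_rng(0)
array_0 = rng.random((10, 10, 4))   # small stand-ins for the images
array_1 = rng.random((10, 10, 4))

result = correlation_distance(array_0, array_1)
print(result.shape)  # (10, 10)
```

The same pattern (keep the pixel axes, reduce over axis=-1) should carry over to other similarity measures such as the normalized dot product.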

Related

defining a for loop for operations on each row of a NumPy array

I have a data set of length (L) which I named "data".
data=raw_data.iloc[:,0]
I randomly generated 2000 sample series from "data" and named it "resamples" to have a NumPy matrix of len =2000 and cols=L of the "data".
resamples=[np.random.choice(data, size=len(data), replace=True) for i in range (2000)]
The code below shows two operations in Scipy.stats using "data" which is a single array. Now I need to perform the same operation on each one of those sample series (2000 rows) by defining a for loop. The challenge is two parameters (loc and scale) are calculated in the first step and they should be used for each row to perform the next one. My knowledge falls short in defining such a for loop. I was wondering if anyone could help me with this.
loc, scale=stats.gumbel_r.fit(data)
return_gumbel=stats.gumbel_r.ppf([0.9999,0.9995,0.999],loc=loc, scale=scale)
The description is a little unclear, but I think you just need:
alist = []
for data in resamples:
    loc, scale = stats.gumbel_r.fit(data)
    return_gumbel = stats.gumbel_r.ppf([0.9999,0.9995,0.999], loc=loc, scale=scale)
    alist.append(return_gumbel)
arr = np.array(alist)
You could also create arr first, and assign return_gumbel to the respective rows, but the list append is about the same speed. The loop could also be written as a list comprehension.
There was talk of vectorizing, but given the complex nature of the calculation I doubt if that is feasible - at least not without digging into the details of those stats functions. In numpy vectorizing means writing a function such that it works with all rows of the array at once, performing the actions in compiled numpy code.

Increasing performance of highly repeated numpy array index operations

In my program code I've got numpy value arrays and numpy index arrays. Both are preallocated and predefined during program initialization.
Each part of the program has one array values on which calculations are performed, and three index arrays idx_from_exch, idx_values and idx_to_exch. There is one global value array used to exchange the values of several parts: exch_arr.
The index arrays have between 2 and 5 indices most of the time; more indices are seldom (most probably never) needed. dtype=np.int32, shape and values are constant during the whole program run. Thus I set ndarray.flags.writeable=False after initialization, but this is optional. The index values of the index arrays idx_values and idx_to_exch are sorted in numerical order; idx_source may be sorted, but there is no way to guarantee that. All index arrays corresponding to one value array/part have the same shape.
The values arrays and also the exch_arr usually have between 50 and 1000 elements. shape and dtype=np.float64 are constant during the whole program run, the values of the arrays change in each iteration.
Here are the example arrays:
import numpy as np
import numba as nb
values = np.random.rand(100) * 100 # just some random numbers
exch_arr = np.random.rand(60) * 3 # just some random numbers
idx_values = np.array((0, 4, 55, -1), dtype=np.int32) # sorted but varying steps
idx_to_exch = np.array((7, 8, 9, 10), dtype=np.int32) # sorted and constant steps!
idx_from_exch = np.array((19, 4, 7, 43), dtype=np.int32) # not sorted and varying steps
The example indexing operations look like this:
values[idx_values] = exch_arr[idx_from_exch] # get values from exchange array
values *= 1.1 # some inplace array operations, this is just a dummy for more complex things
exch_arr[idx_to_exch] = values[idx_values] # pass some values back to exchange array
Since these operations are being applied once per iteration for several million iterations, speed is crucial. I've been looking into many different ways of increasing indexing speed in my previous question, but forgot to be specific enough considering my application (especially getting values by indexing with constant index arrays and passing them to another indexed array).
The best way to do it seems to be fancy indexing so far. I'm currently also experimenting with numba guvectorize, but it seems that it is not worth the effort since my arrays are quite small.
memoryviews would be nice, but since the index arrays do not necessarily have consistent steps, I know of no way to use memoryviews.
So is there any faster way to do repeated indexing? Some way of predefining memory address arrays for each indexing operation, as dtype and shape are always constant? ndarray.__array_interface__ gave me a memory address, but I wasn't able to use it for indexing. I thought about something like:
stride_exch = exch_arr.strides[0]
mem_address = exch_arr.__array_interface__['data'][0]
idx_to_exch = idx_to_exch * stride_exch + mem_address
Is that feasible?
I've also been looking into using strides directly with as_strided, but as far as I know only consistent strides are allowed and my problem would require inconsistent strides.
Any help is appreciated!
Thanks in advance!
edit:
I just corrected a massive error in my example calculation!
The operation values = values * 1.1 changes the memory address of the array. All operations in my program code are laid out so that they do not change the memory address of the arrays, because a lot of other operations rely on using memoryviews. Thus I replaced the dummy operation with the correct in-place operation: values *= 1.1
One solution to getting round expensive fancy indexing using numpy boolean arrays is using numba and skipping over the False values in your numpy boolean array.
Example implementation:
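# Signature and layout list all four arguments: arr1 (n), arr2 (m, n), boolean mask inds (n), output res (m, n)
@numba.guvectorize(['float64[:], float64[:,:], boolean[:], float64[:,:]'], '(n),(m,n),(n)->(m,n)', nopython=True, target="cpu")
def test_func(arr1, arr2, inds, res):
    for i in range(arr1.shape[0]):
        # skip columns whose mask entry is False (those columns of res are left unwritten)
        if not inds[i]:
            continue
        for j in range(arr2.shape[0]):
            res[j, i] = arr1[i] + arr2[j, i]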
Of course, play around with the numpy data types (smaller byte sizes will run faster) and target being "cpu" or "parallel".

Editing every value in a numpy matrix

I have a numpy matrix which I filled with data from a *.csv-file
csv = np.genfromtxt (file,skiprows=22)
matrix = np.matrix(csv)
This is a 64x64 matrix which looks like
print matrix
[[...,...,....]
[...,...,.....]
.....
]]
Now I need to take the logarithm math.log10() of every single value and save it into another 64x64 matrix.
How can I do this? I tried
matrix_lg = np.matrix(csv)
for i in range (0,len(matrix)):
    for j in range (0,len(matrix[0])):
        matrix_lg[i,j]=math.log10(matrix[i,j])
but this only edited the first array (meaning the first row) of my initial matrix.
It's my first time working with Python and I'm starting to get confused.
You can just do:
matrix_lg = numpy.log10(matrix)
And it will do it for you. It's also much faster to do it this vectorized way instead of looping over every entry in python. It will also handle domain errors more gracefully.
FWIW though, the issue with your posted code is that the len() for matrices don't work exactly the same as they do for nested lists. As suggested in the comments, you can just use matrix.shape to get the proper dims to iterate through:
matrix_lg = np.matrix(csv)
for i in range(0,matrix_lg.shape[0]):
    for j in range(0,matrix_lg.shape[1]):
        matrix_lg[i,j]=math.log10(matrix_lg[i,j])
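A small check of both points, using a made-up 3x4 matrix in place of the 64x64 one from the question (the values are an assumption for illustration only):

```python
import numpy as np

# Hypothetical 3x4 stand-in for the 64x64 matrix read from the CSV file.
matrix = np.matrix(np.arange(1.0, 13.0).reshape(3, 4))

# Indexing a np.matrix with one integer still gives a 2-D (1, 4) matrix,
# so len() reports the number of rows (1), not the number of columns.
row = matrix[0]
print(row.shape)  # (1, 4)
print(len(row))   # 1

# The vectorized call handles every entry in one go.
matrix_lg = np.log10(matrix)
print(matrix_lg.shape)  # (3, 4)
```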

Replace loop with broadcasting in numpy -> memory error

I have a 2D array (array1) with an arbitrary number of rows. The first column holds strictly monotonically increasing (but not linearly spaced) numbers, which represent a position in my system, while the second column holds a value that represents the state of my system at and around the position in the first column.
Now I have a second array (array2); its range should usually be the same as that of the first column of the first array, but this does not matter too much, as you will see below.
I am now interested for every element in array2:
1. What is the argument in array1[:,0], which has the closest value to the current element in array2?
2. What is the value (array1[:,1]) of those elements.
As usually array2 will be longer than the number of rows in array1 it is perfectly fine, if I get one argument from array1 more than one time. In fact this is what I expect.
The value from 2. is written in the second and third column, as you will see below.
My stripped-down code looks like this:
from numpy import arange, zeros, absolute, argmin, mod, newaxis, ones
ysize1 = 50
array1 = zeros((ysize1+1,2))
array1[:,0] = arange(ysize1+1)**2
# can be any strictly monotonic increasing array
array1[:,1] = mod(arange(ysize1+1),2)
# in my current case, but could also be something else
ysize2 = (ysize1)**2
array2 = zeros((ysize2+1,3))
array2[:,0] = arange(0,ysize2+1)
# is currently uniformly distributed over the whole range, but does not necessarily have to be
a = 0
for i, array2element in enumerate(array2[:,0]):
    a = argmin(absolute(array1[:,0]-array2element))
    array2[i,1] = array1[a,1]
It works, but takes quite a lot time to process large arrays. I then tried to implement broadcasting, which seems to work with the following code:
indexarray = argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
array2[:,2]=array1[indexarray,1] # just to compare the results
Unfortunately now I seem to run into a different problem: I get a memory error on the sizes of arrays I am using in the line of code with the broadcasting.
For small sizes it works, but for larger ones, where len(array2[:,0]) is something like 2**17 (and could be even larger) and len(array1[:,0]) is about 2**14, I get an error saying that the array is bigger than the available memory. Is there an elegant way around that, or a way to speed up the loop?
I do not need to store the intermediate array(s), I am just interested in the result.
Thanks!
First, let's simplify this line:
argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
it should be:
a = array1[:, 0]
b = array2[:, 0]
argmin(abs(a - b[:, newaxis]), 1)
But even when simplified, you're creating two large temporary arrays. If a and b have sizes M and N, a - b[:, newaxis] and abs(...) each create a temporary array of shape (N, M). With the sizes from the question (2**17 by 2**14 float64 values), each of those temporaries takes 16 GiB. Because you've said that a is monotonically increasing, you can avoid the issue altogether by using a binary search (sorted search), which is much faster anyway. Take a look at the answer I wrote to this question a while back. Using the function from this answer, try this:
closest = find_closest(array1[:, 0], array2[:, 0])
array2[:, 2] = array1[closest, 1]
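The linked answer is not reproduced here, so find_closest is undefined in this thread. The following is only a sketch of how such a function might be built on np.searchsorted; the body and parameter names are assumptions, not necessarily the original author's code. It needs memory proportional to the number of targets rather than an (N, M) temporary.

```python
import numpy as np

def find_closest(sorted_vals, targets):
    """Index of the entry in sorted_vals nearest to each target.

    Assumes sorted_vals is 1-D, sorted ascending, with at least two elements.
    """
    # Binary search: position where each target would be inserted.
    idx = np.searchsorted(sorted_vals, targets)
    # Keep idx in [1, len - 1] so that both neighbours exist.
    idx = np.clip(idx, 1, len(sorted_vals) - 1)
    left = sorted_vals[idx - 1]
    right = sorted_vals[idx]
    # Step back one position where the left neighbour is strictly closer.
    return idx - (targets - left < right - targets)

# Same data layout as in the question, at a small size.
array1_pos = np.arange(51.0) ** 2
array2_pos = np.arange(0.0, 2501.0)
closest = find_closest(array1_pos, array2_pos)
```

On an exact tie this sketch picks the right-hand neighbour, whereas argmin over the full difference array would pick the left-hand one; with the squared positions and integer targets used here no ties occur.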

Vectorizing the addition of results to a numpy array

I have a function that works something like this:
def Function(x):
    a = random.random()
    b = random.random()
    c = OtherFunctionThatReturnsAThreeColumnArray()
    results = np.zeros((1,5))
    results[0,0] = a
    results[0,1] = b
    results[0,2] = c[-1,0]
    results[0,3] = c[-1,1]
    results[0,4] = c[-1,2]
    return results
What I'm trying to do is run this function many, many times, appending the returned one-row, five-column results to a running data set. But the append function and a for-loop are both ruinously inefficient as I understand it. I'm trying to improve my code, and the number of runs is going to be large enough that this kind of inefficiency isn't doing me any favors.
What's the best way to do the following with the least overhead:
1. Create a new numpy array to hold the results
2. Insert the results of N calls of that function into the array from step 1
You're correct in thinking that numpy.append or numpy.concatenate are going to be expensive if repeated many times (each call allocates a new array and copies both of the previous arrays into it).
The best suggestion (if you know how much space you're going to need in total) would be to declare the array before you run your routine, and then just put the results in place as they become available.
If you're going to run this nrows times, then
results = np.zeros([nrows, 5])
and then add your results
def function(x, i, results):
    <.. snip ..>
    results[i,0] = a
    results[i,1] = b
    results[i,2] = c[-1,0]
    results[i,3] = c[-1,1]
    results[i,4] = c[-1,2]
Of course, if you don't know how many times you're going to be running function, this won't work. In that case, I'd suggest a less elegant approach:
Declare a possibly large results array and add to results[i, x] as above (keeping track of i and the size of results).
When you reach the size of results, do the numpy.append (or concatenate) with a new array. This is less bad than appending repeatedly and shouldn't destroy performance, but you will have to write some wrapper code.
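The wrapper code mentioned above might look roughly like the sketch below. The class name GrowingRows, its methods, and the choice to double the capacity on each growth are all assumptions made for illustration; any growth policy that keeps concatenations rare would serve the same purpose.

```python
import numpy as np

class GrowingRows:
    """Hypothetical wrapper: row buffer that grows in large chunks."""

    def __init__(self, ncols, capacity=1024):
        self._buf = np.empty((capacity, ncols))
        self._n = 0  # number of rows actually filled

    def append(self, row):
        if self._n == self._buf.shape[0]:
            # Buffer full: one concatenate doubles the capacity, so the
            # expensive copy happens only a handful of times overall.
            self._buf = np.concatenate([self._buf, np.empty_like(self._buf)])
        self._buf[self._n] = row
        self._n += 1

    def array(self):
        # View of the filled part only; the unused tail is hidden.
        return self._buf[:self._n]

store = GrowingRows(ncols=5, capacity=4)
for i in range(10):
    store.append(np.full(5, float(i)))
out = store.array()
print(out.shape)  # (10, 5)
```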
There are other ideas you could pursue. Off the top of my head you could
Write the results to disk, depending on the speed of OtherFunctionThatReturnsAThreeColumnArray and the size of your data this may not be too daft an idea.
Save your results in a list comprehension (forgetting numpy until after the run). If function returned (a, b, c) instead of results:
results = [function(x) for x in my_data]
and now do some shuffling to get results into the form you need.
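One way the "shuffling" could go, as a sketch: other_function below is a hypothetical stand-in for OtherFunctionThatReturnsAThreeColumnArray, and my_data is a placeholder input, neither of which comes from the original post.

```python
import random
import numpy as np

def other_function():
    """Hypothetical stand-in returning a three-column array."""
    return np.arange(12.0).reshape(4, 3)

def function(x):
    # Return the raw pieces instead of a preformatted (1, 5) array.
    a = random.random()
    b = random.random()
    c = other_function()
    return a, b, c

my_data = range(100)  # placeholder input
raw = [function(x) for x in my_data]

# Build each 5-value row from a, b and the last row of c, then convert
# the whole list to an array in a single step after the run.
results = np.array([[a, b, *c[-1]] for a, b, c in raw])
print(results.shape)  # (100, 5)
```

Appending to a Python list is cheap, and the single np.array call at the end avoids the repeated copying that np.append inside the loop would cause.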
