I'm trying to solve the following simplified problem. I have an array with data and two arrays with start and end indices. What I would like is to double the values of the dataset that fall between those indices (between and including start[0] and end[0], start[1] and end[1], etc.). I tried a nested loop as follows:
data = np.array([0,1,2,8,4,5,6,5,4,5,6,7,8])
start = np.array([0,5,7])
end = np.array([3,6,9])
new_data = np.zeros(len(data))
for i in range(len(start)):
    for j in range(len(data)):
        if (j >= start[i]) & (j <= end[i]):
            new_data[j] = data[j]*2
        else:
            new_data[j] = data[j]
The result should be [0,2,4,16,4,10,12,10,8,10,6,7,8], and yet the code returns:
[ 0. 1. 2. 8. 4. 5. 6. 10. 8. 10. 6. 7. 8.]
Only the part between the last pair of indices is doubled correctly. Any ideas why? And what if I want to triple the values that don't satisfy the if statement?
On every pass of the outer loop you assign to every element of new_data, so the else branch overwrites the values you doubled in earlier iterations.
I.e.:
new_data[j] = data[j]*2 # won't work
data[j] = data[j]*2 # will work.
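A minimal sketch of that in-place fix (my own illustration, assuming you're fine with mutating data itself; the else branch then becomes unnecessary):

import numpy as np

data = np.array([0, 1, 2, 8, 4, 5, 6, 5, 4, 5, 6, 7, 8])
start = np.array([0, 5, 7])
end = np.array([3, 6, 9])

for i in range(len(start)):
    for j in range(len(data)):
        if start[i] <= j <= end[i]:
            data[j] = data[j] * 2   # double in place; untouched elements keep their value

print(data)  # [ 0  2  4 16  4 10 12 10  8 10  6  7  8]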
I'd suggest using Python's zip function to iterate over the start and end indices in pairs.
Code explanation:
Since the data points that don't belong to any of the ranges stay the same, I used numpy's copy method to create a new copy of your original data. Then I multiplied your desired slices by two.
import numpy as np
data = np.array([0, 1, 2, 8, 4, 5, 6, 5, 4, 5, 6, 7, 8])
start = np.array([0, 5, 7])
end = np.array([3, 6, 9])
new_data = data.copy()  # Not to mess up our original dataset.
for s, e in zip(start, end):
    new_data[s:e+1] *= 2  # Because it's an inclusive set.
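Running this on the sample data reproduces the result you expected:

print(new_data)  # [ 0  2  4 16  4 10 12 10  8 10  6  7  8]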
Initialize new_data with data and remove the else.
Edit:
This should answer your question, and it's more efficient: your version iterates through the whole array once per interval, which will be slow for large inputs, whereas here you only visit the elements inside each interval.
And if you want to triple the other elements, a naive solution is to multiply new_data by 3 before the for loop, so put new_data = np.copy(data)*3.
import numpy as np
data = np.array([0,1,2,8,4,5,6,5,4,5,6,7,8])
start = np.array([0,5,7])
end = np.array([3,6,9])
new_data = np.copy(data) # change to new_data = np.copy(data)*3 to triple other elements.
for i in range(len(start)):
    for j in range(start[i], end[i]+1):
        new_data[j] = data[j]*2
One more note on your code: (j >= start[i]) & (j <= end[i]) can be simplified to start[i] <= j <= end[i], which is clearer and probably a bit faster.
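As a sketch of a fully vectorized alternative (my own addition, not part of the answers above): build a boolean mask for all intervals at once with broadcasting, then scale both groups without any Python loops.

import numpy as np

data = np.array([0, 1, 2, 8, 4, 5, 6, 5, 4, 5, 6, 7, 8])
start = np.array([0, 5, 7])
end = np.array([3, 6, 9])

idx = np.arange(len(data))
# mask[j] is True when index j falls inside any (start, end) interval, ends inclusive
mask = ((idx >= start[:, None]) & (idx <= end[:, None])).any(axis=0)

new_data = np.where(mask, data * 2, data)        # double inside the intervals
# new_data = np.where(mask, data * 2, data * 3)  # or triple the elements outside them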
Making a title that actually explains what I want is harder than I thought, so here is my attempt at explaining it.
I have an array filled with zeros that gets values added every time a condition is met, so after one time-step iteration I get something like this (minus the headers):
current_array =
bubble_size y_coord
14040 42
3943 71
6345 11
0 0
0 0
....
After this time step is complete, current_array gets saved as previous_array and is wiped back to zeros, because there is no guaranteed number of entries each time.
Now the real question: I want to check all rows in the first column of previous_array and see if the current bubble size is within, say, 5% either side, and if so, take the current y position away from the value associated with the matching bubble size in previous_array's second column.
Currently I have something like:
if bubble_size in current_array[:, 0]:
    do_whatever
but I don't know how to pull out the associated y_coord without using a loop. I'm fine with using a loop (there are about 100 rows in the array and at least 1000 time steps, so I want to make it as efficient as possible), but I would like to avoid it.
I have included my thoughts on the for loop (note that current_array and previous_array are actually current_frame and previous_frame):
for y in range(0, array_size):
    if previous_frame[y, 0] * 0.95 < bubble_size < previous_frame[y, 0] * 1.05:
        distance_travelled = current_y_coord - previous_frame[y, 0]
    y = y + 1
Any help is greatly appreciated :)
I may not have fully understood your issue, but if you first want to check whether each bubble size is within 5% of the corresponding entry in the previous array, you can use the following:
import numpy as np

def apply(p, c):  # For each element check the bubblesize grow
    if(p*0.95 < c < p*1.05):
        return 1
    else:
        return 0

def dist(p, c):  # Calculate the distance
    return c-p

def update(prev, cur):
    assert isinstance(
        cur, np.ndarray), 'Current array is not a valid numpy array'
    assert isinstance(
        prev, np.ndarray), 'Previous array is not a valid numpy array'
    assert prev.shape == cur.shape, 'Arrays size mismatch'
    applyvec = np.vectorize(apply)
    toapply = applyvec(prev[:, 0], cur[:, 0])
    print(toapply)
    distvec = np.vectorize(dist)
    distance = distvec(prev[:, 1], cur[:, 1])
    print(distance)

current = np.array([[14040, 42],
                    [3943, 71],
                    [6345, 11],
                    [0, 0],
                    [0, 0]])

previous = np.array([[14039, 32],
                     [3942, 61],
                     [6344, 1],
                     [0, 0],
                     [0, 0]])

update(previous, current)
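For the sample arrays above, this prints [1 1 1 0 0] (rows where the bubble size stayed within 5%) and [10 10 10 0 0] (the differences of the y coordinates).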
PS: Could you tell us what final array you are looking for, based on my example?
As I understand it (correct me if I'm wrong):
You have a current bubble size (integer) and a current y value (integer)
You have a 2D array (prev_array) that contains bubble sizes and y coords
You want to check whether your current bubble size is within 5% (either way) of each stored bubble size in prev_array
If they are within range, subtract your current y value from the stored y coord
This will result in a new array, containing only bubble sizes that are within range, and the newly subtracted y value
You want to do this without an explicit loop
You can do that using boolean indexing in numpy...
Set up the previous array:
prev_array = np.array([[14040, 42], [3943, 71], [6345, 11], [3945,0], [0,0]])
prev_array
array([[14040, 42],
[ 3943, 71],
[ 6345, 11],
[ 3945, 0],
[ 0, 0]])
You have your current bubble size that you want to compare against the stored sizes, and a current y coord value:
bubble_size = 3750
cur_y = 10
Next we can create a boolean mask where we only select rows of prev_array that meets the 5% criteria:
ind = (bubble_size > prev_array[:,0]*.95) & (bubble_size < prev_array[:,0]*1.05)
# ind is a boolean array that looks like this: [False, True, False, True, False]
Then we use ind to index prev_array, and calculate the new (subtracted) y coords:
new_array = prev_array[ind]
new_array[:,1] = cur_y - new_array[:,1]
Giving your final output array:
array([[3943, -61],
[3945, 10]])
As it's not clear what you want your output to actually look like, instead of creating a new array you can also just update prev_array with the new y values:
ind = (bubble_size > prev_array[:,0]*.95) & (bubble_size < prev_array[:,0]*1.05)
prev_array[ind,1] = cur_y - prev_array[ind,1]
Which gives:
array([[14040, 42],
[ 3943, -61],
[ 6345, 11],
[ 3945, 10],
[ 0, 0]])
Is there a way to get rid of the loop in the code below and replace it with a vectorized operation?
Given a data matrix, for each row I want to find the index of the minimal value that fits within ranges defined (per row) in a separate array.
Here's an example:
import numpy as np
np.random.seed(10)
# Values of interest, for this example a random 6 x 100 matrix
data = np.random.random((6,100))
# For each row, define an inclusive min/max range
ranges = np.array([[0.3, 0.4],
[0.35, 0.5],
[0.45, 0.6],
[0.52, 0.65],
[0.6, 0.8],
[0.75, 0.92]])
# For each row, find the index of the minimum value that fits inside the given range
result = np.zeros(6).astype(np.int)
for i in xrange(6):
    ind = np.where((ranges[i][0] <= data[i]) & (data[i] <= ranges[i][1]))[0]
    result[i] = ind[np.argmin(data[i,ind])]
print result
# Result: [35 8 22 8 34 78]
print data[np.arange(6),result]
# Result: [ 0.30070006 0.35065639 0.45784951 0.52885388 0.61393513 0.75449247]
Approach #1 : Using broadcasting and np.minimum.reduceat -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
r,c = np.nonzero(mask)
cut_idx = np.unique(r, return_index=1)[1]
out = np.minimum.reduceat(data[mask], cut_idx)
Improvement to avoid np.nonzero and compute cut_idx directly from mask :
cut_idx = np.concatenate(( [0], np.count_nonzero(mask[:-1],1).cumsum() ))
Approach #2 : Using broadcasting and filling invalid places with NaNs and then using np.nanargmin -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
result = np.nanargmin(np.where(mask, data, np.nan), axis=1)
out = data[np.arange(6),result]
Approach #3 : If you are not iterating enough (just like you have a loop of 6 iterations in the sample), you might want to stick to a loop for memory efficiency, but make use of more efficient masking with a boolean array instead -
out = np.zeros(6)
for i in xrange(6):
    mask_i = (ranges[i,0] <= data[i]) & (data[i] <= ranges[i,1])
    out[i] = np.min(data[i,mask_i])
Approach #4 : There is one more loopy solution possible here. The idea would be to sort each row of data. Then, use the two range limits for each row to decide on the start and stop indices with help from np.searchsorted. Further, we would use those indices to slice and then get the minimum values. Benefit with slicing that way is, we would be working with views and as such would be very efficient, both on memory and performance.
The implementation would look something like this -
out = np.zeros(6)
sdata = np.sort(data, axis=1)
for i in xrange(6):
    start = np.searchsorted(sdata[i], ranges[i,0])
    stop = np.searchsorted(sdata[i], ranges[i,1], 'right')
    out[i] = np.min(sdata[i,start:stop])
Furthermore, we could get those start, stop indices in a vectorized manner following an implementation of vectorized searchsorted.
Based on a suggestion by @Daniel F, for the case when we are dealing with ranges that are within the limits of the given data, we could simply use the start indices -
out[i] = sdata[i, start]
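As a sketch of the vectorized searchsorted idea mentioned above (my own addition; the helper name searchsorted2d and the offset trick are not from the original answer, and it assumes non-negative values as in this example): shifting each sorted row by a per-row offset keeps the flattened array sorted, so a single np.searchsorted call can handle all rows at once.

def searchsorted2d(a, b, side='left'):
    # a : (m, n) array with every row sorted; b : (m,) one search value per row
    m, n = a.shape
    offset = (max(a.max(), b.max()) + 1) * np.arange(m)   # keeps rows in disjoint value blocks
    idx = np.searchsorted((a + offset[:, None]).ravel(), b + offset, side=side)
    return idx - n * np.arange(m)                         # convert back to per-row indices

sdata = np.sort(data, axis=1)
starts = searchsorted2d(sdata, ranges[:, 0])
# if every range is guaranteed to contain at least one value, the row minima are simply:
out = sdata[np.arange(sdata.shape[0]), starts]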
Assuming at least one value in range, you don't even have to bother with the upper limit:
result = np.empty(6)
for i in xrange(6):
    lt = (ranges[i,0] >= data[i]).sum()
    result[i] = np.argpartition(data[i], lt)[lt]
Actually, you could even vectorize the whole thing using argpartition
lt = (ranges[:,None,0] >= data).sum(1)
result = np.argpartition(data, lt)[np.arange(data.shape[0]), lt]
Of course, this is only efficient if data.shape[0] << data.shape[1], as otherwise you're basically sorting
Thanks for taking the time to read this :) I'm implementing my own version of quick-sort in Python and I'm trying to get it to work within some restrictions from a previous school assignment. Note that the reason I've avoided using "in" is that it wasn't allowed in the project I worked on (not sure why :3).
It was working fine for integers and strings, but I can't manage to adapt it for my CounterList(), which is a list of nodes each containing an arbitrary integer and string, even though I'm only sorting by the integers contained in those nodes.
Pastebins:
My QuickSort: http://pastebin.com/mhAm3YYp.
The CounterList and CounterNode, code. http://pastebin.com/myn5xuv6.
from classes_1 import CounterNode, CounterList

def bulk_append(array1, array2):
    # takes all the items in array2 and appends them to array1
    itr = 0
    array = array1
    while itr < len(array2):
        array.append(array2[itr])
        itr += 1
    return array

def quickSort(array):
    lss = CounterList()
    eql = CounterList()
    mre = CounterList()
    if len(array) <= 1:
        return array  # Base case.
    else:
        pivot = array[len(array)-1].count  # Pivoting on the last item.
        itr = 0
        while itr < len(array)-1:
            # Essentially editing "for i in array:" to handle CounterLists
            if array[itr].count < pivot:
                lss.append(array[itr])
            elif array[itr].count > pivot:
                mre.append(array[itr])
            else:
                eql.append(array[itr])
            itr += 1
        # Recursive step and combining separate lists.
        lss = quickSort(lss)
        eql = quickSort(eql)
        mre = quickSort(mre)
        fnl = bulk_append(lss, eql)
        fnl = bulk_append(fnl, mre)
        return fnl
I know it is probably quite straightforward, but I just can't seem to see the issue.
(Pivoting on last item)
Here is the test I'm using:
a = CounterList()
a.append(CounterNode("ack", 11))
a.append(CounterNode("Boo", 12))
a.append(CounterNode("Cah", 9))
a.append(CounterNode("Doh", 7))
a.append(CounterNode("Eek", 5))
a.append(CounterNode("Fuu", 3))
a.append(CounterNode("qck", 1))
a.append(CounterNode("roo", 2))
a.append(CounterNode("sah", 4))
a.append(CounterNode("toh", 6))
a.append(CounterNode("yek", 8))
a.append(CounterNode("vuu", 10))
x = quickSort(a)
print("\nFinal List: \n", x)
And the resulting CounterList:
['qck': 1, 'Fuu': 3, 'Eek': 5, 'Doh': 7, 'Cah': 9, 'ack': 11]
Which, as you can tell, is missing multiple values.
Either way thanks for any advice, and your time.
There are two mistakes in the code:
You don't need the "eql = quickSort(eql)" line, because eql contains only equal values, so there is no need to sort it.
In every recursive call you lose the pivot (the reason for the missing entries), because you never append it to any list. You need to append it to eql. So after the code line shown below:
pivot = array[len(array)-1].count
insert this line:
eql.append(array[len(array)-1])
Also remove the line below from your code, as it can otherwise recurse forever and hit the recursion depth limit (only with arrays containing repeated values, if a repeated value is selected as pivot):
eql = quickSort(eql)
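A minimal sketch of the corrected function with both changes applied (assuming CounterList behaves as in the question, i.e. it supports append, len and indexing, and each node exposes a .count attribute):

def quickSort(array):
    lss = CounterList()
    eql = CounterList()
    mre = CounterList()
    if len(array) <= 1:
        return array  # Base case.
    pivot = array[len(array)-1].count   # Pivoting on the last item.
    eql.append(array[len(array)-1])     # Keep the pivot itself so it isn't lost.
    itr = 0
    while itr < len(array)-1:
        if array[itr].count < pivot:
            lss.append(array[itr])
        elif array[itr].count > pivot:
            mre.append(array[itr])
        else:
            eql.append(array[itr])
        itr += 1
    # Recurse only on the strictly smaller/larger partitions; eql needs no sorting.
    fnl = bulk_append(quickSort(lss), eql)
    fnl = bulk_append(fnl, quickSort(mre))
    return fnl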
I've got a 2-row array called C like this:
from numpy import *
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = vstack((A,B))
I want to take all the columns in C where the value in the first row falls between i and i+2, and average them. I can do this with just A no problem:
i = 0
A_avg = []
while(i<6):
    selection = A[logical_and(A >= i, A < i+2)]
    A_avg.append(mean(selection))
    i += 2
then A_avg is:
[1.0,2.5,4.5]
I want to carry out the same process with my two-row array C, but I want to take the average of each row separately, while doing it in a way that's dictated by the first row. For example, for C, I want to end up with a 2 x 3 array that looks like:
[[1.0,2.5,4.5],
[50,35,15]]
Where the first row is A averaged in blocks between i and i+2 as before, and the second row is B averaged in the same blocks as A, regardless of the values it has. So the first entry is unchanged, the next two get averaged together, and the next two get averaged together, for each row separately. Anyone know of a clever way to do this? Many thanks!
I hope this is not too clever. TIL boolean indexing does not broadcast, so I had to manually do the broadcasting. Let me know if anything is unclear.
import numpy as np
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = np.vstack((A,B)) # float so that I can use np.nan
i = np.arange(0, 6, 2)[:, None]
selections = np.logical_and(A >= i, A < i+2)[None]
D, selections = np.broadcast_arrays(C[:, None], selections)
D = D.astype(float) # allows use of nan, and makes a copy to prevent repeated behavior
D[~selections] = np.nan # exclude these elements from mean
D = np.nanmean(D, axis=-1)
Then,
>>> D
array([[ 1. , 2.5, 4.5],
[ 50. , 35. , 15. ]])
Another way is to use np.histogram to bin your data. This may be faster for large arrays, but it is only useful for a few rows, since a histogram must be computed with different weights for each row:
bins = np.arange(0, 7, 2) # include the end
n = np.histogram(A, bins)[0] # number of columns in each bin
a_mean = np.histogram(A, bins, weights=A)[0]/n
b_mean = np.histogram(A, bins, weights=B)[0]/n
D = np.vstack([a_mean, b_mean])
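For the example arrays this reproduces the desired output:

print(D)
# [[  1.    2.5   4.5]
#  [ 50.   35.   15. ]]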
I am working on performing image processing using Numpy, specifically a running standard deviation stretch. This reads in X columns, finds the standard deviation, and performs a percentage linear stretch. It then iterates to the next "group" of columns and performs the same operations. The input image is a 1 GB, 32-bit, single-band raster which is taking quite a long time to process (hours). Below is the code.
I realize that I have 3 nested for loops, which is presumably where the bottleneck is occurring. If I process the image in "boxes", that is to say loading an array that is [500,500] and iterating through the image, processing time is quite short. Unfortunately, camera error requires that I iterate in extremely long strips (52,000 x 4) (y,x) to avoid banding.
Any suggestions on speeding this up would be appreciated:
def box(dataset, outdataset, sampleSize, n):
    quiet = 0
    sample = sampleSize
    #iterate over all of the bands
    for j in xrange(1, dataset.RasterCount + 1): #1 based counter
        band = dataset.GetRasterBand(j)
        NDV = band.GetNoDataValue()
        print "Processing band: " + str(j)
        #define the interval at which blocks are created
        intervalY = int(band.YSize/1)
        intervalX = int(band.XSize/2000) #to be changed to sampleSize when working
        #iterate through the rows
        scanBlockCounter = 0
        for i in xrange(0,band.YSize,intervalY):
            #If the next i is going to fail due to the edge of the image/array
            if i + (intervalY*2) < band.YSize:
                numberRows = intervalY
            else:
                numberRows = band.YSize - i
            for h in xrange(0,band.XSize, intervalX):
                if h + (intervalX*2) < band.XSize:
                    numberColumns = intervalX
                else:
                    numberColumns = band.XSize - h
                scanBlock = band.ReadAsArray(h,i,numberColumns, numberRows).astype(numpy.float)
                standardDeviation = numpy.std(scanBlock)
                mean = numpy.mean(scanBlock)
                newMin = mean - (standardDeviation * n)
                newMax = mean + (standardDeviation * n)
                outputBlock = ((scanBlock - newMin)/(newMax-newMin))*255
                outRaster = outdataset.GetRasterBand(j).WriteArray(outputBlock,h,i) #array, xOffset, yOffset
                scanBlockCounter = scanBlockCounter + 1
                #print str(scanBlockCounter) + ": " + str(scanBlock.shape) + str(h)+ ", " + str(intervalX)
                if numberColumns == band.XSize - h:
                    break
            #update progress line
            if not quiet:
                gdal.TermProgress_nocb( (float(h+1) / band.YSize) )
Here is an update:
Without using the profile module (I did not want to start wrapping small sections of the code into functions), I used a mix of print and exit statements to get a really rough idea about which lines were taking the most time. Luckily (and I do understand how lucky I was) one line was dragging everything down.
outRaster = outdataset.GetRasterBand(j).WriteArray(outputBlock,h,i)#array, xOffset, yOffset
It appears that GDAL is quite inefficient when opening the output file and writing out the array. With this in mind I decided to append my modified arrays ("outputBlock") to a Python list, then write out chunks. Here is the segment that I changed:
The outputBlock was just modified ...
#Add the array to a list (tuple)
outputArrayList.append(outputBlock)

#Check the interval counter and if it is "time" write out the array
if len(outputArrayList) >= (intervalX * writeSize) or finisher == 1:
    #Convert the tuple to a numpy array. Here we horizontally stack the tuple of arrays.
    stacked = numpy.hstack(outputArrayList)
    #Write out the array
    outRaster = outdataset.GetRasterBand(j).WriteArray(stacked,xOffset,i) #array, xOffset, yOffset
    xOffset = xOffset + (intervalX*(intervalX * writeSize))
    #Cleanup to conserve memory
    outputArrayList = list()
    stacked = None
    finisher=0
Finisher is simply a flag that handles the edges. It took a bit of time to figure out how to build an array from the list: using numpy.array was creating a 3-d array (anyone care to explain why?), while WriteArray requires a 2-d array. Total processing time now varies from just under 2 minutes to 5 minutes. Any idea why that range of times might exist?
Many thanks to everyone who posted! The next step is to really get into Numpy and learn about vectorization for additional optimization.
One way to speed up operations over numpy data is to use vectorize. Essentially, vectorize takes a function f and creates a new function g that maps f over an array a. g is then called like so: g(a).
>>> sqrt_vec = numpy.vectorize(lambda x: x ** 0.5)
>>> sqrt_vec(numpy.arange(10))
array([ 0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
Without having the data you're working with available, I can't say for certain whether this will help, but perhaps you can rewrite the above as a set of functions that can be vectorized. Perhaps in this case you could vectorize over an array of indices into ReadAsArray(h,i,numberColumns, numberRows). Here's an example of the potential benefit:
>>> print setup1
import numpy
sqrt_vec = numpy.vectorize(lambda x: x ** 0.5)
>>> print setup2
import numpy
def sqrt_vec(a):
    r = numpy.zeros(len(a))
    for i in xrange(len(a)):
        r[i] = a[i] ** 0.5
    return r
>>> timeit.timeit(stmt='a = sqrt_vec(numpy.arange(1000000))', setup=setup1, number=1)
0.30318188667297363
>>> timeit.timeit(stmt='a = sqrt_vec(numpy.arange(1000000))', setup=setup2, number=1)
4.5400981903076172
A 15x speedup! Note also that numpy slicing handles the edges of ndarrays elegantly:
>>> a = numpy.arange(25).reshape((5, 5))
>>> a[3:7, 3:7]
array([[18, 19],
[23, 24]])
So if you could get your ReadAsArray data into an ndarray you wouldn't have to do any edge-checking shenanigans.
Regarding your question about reshaping -- reshaping doesn't fundamentally alter the data at all. It just changes the "strides" by which numpy indexes the data. When you call the reshape method, the value returned is a new view into the data; the data isn't copied or altered at all, and neither is the old view with its old stride information.
>>> a = numpy.arange(25)
>>> b = a.reshape((5, 5))
>>> a
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24])
>>> b
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
>>> a[5]
5
>>> b[1][0]
5
>>> a[5] = 4792
>>> b[1][0]
4792
>>> a.strides
(8,)
>>> b.strides
(40, 8)
Answered as requested.
If you are IO bound, you should chunk your reads/writes. Try dumping ~500 MB of data to an ndarray, process it all, write it out and then grab the next ~500 MB. Make sure to reuse the ndarray.
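A rough sketch of that chunking pattern with GDAL (my own illustration, not the poster's code; the strip height and the use of the first band are assumptions, and dataset / outdataset are the objects from the question). Ideally you would also reuse one preallocated buffer between reads rather than allocating a new array per chunk.

rows_per_chunk = 4000  # pick this so one chunk is a few hundred MB
band = dataset.GetRasterBand(1)
outband = outdataset.GetRasterBand(1)

for y in xrange(0, band.YSize, rows_per_chunk):
    nrows = min(rows_per_chunk, band.YSize - y)
    block = band.ReadAsArray(0, y, band.XSize, nrows).astype(numpy.float32)
    # ... process the whole chunk with vectorized numpy operations ...
    outband.WriteArray(block, 0, y)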
Without trying to completely understand exactly what you are doing, I notice that you aren't using any numpy slices or array broadcasting, both of which may speed up your code, or, at the very least, make it more readable. My apologies if these aren't germane to your problem.
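As a generic illustration of the broadcasting point (not tied to the GDAL code): per-column statistics and the stretch can be computed for a whole 2-D block at once, with the (n_rows, n_cols) block broadcasting against per-column values.

import numpy

block = numpy.random.random((52000, 4)).astype(numpy.float32)
n = 2.0
mean = block.mean(axis=0)   # per-column mean, shape (4,)
std = block.std(axis=0)     # per-column std, shape (4,)
new_min = mean - std * n
new_max = mean + std * n
stretched = (block - new_min) / (new_max - new_min) * 255   # (52000, 4) broadcasts against (4,)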