Merging two arrays using 'for loop' - python

I want to merge two arrays in Python 2.7 using a for loop. Given:
from array import *
ary_1 = array ('i',[11,12,13])
ary_2 = array ('i',[14,15,16])
ary_3 = array ('i')
the result should end up in ary_3, in this specific order:
ary_3 = array ('i',[11,12,13,14,15,16])
Here's my code so far:
from array import *
ary_1 = array ('i',[11,12,13])
ary_2 = array ('i',[14,15,16])
ary_3 = array ('i')
ary_len = len (ary_1) + len (ary_2)
for i in range (0,ary_len):
    ary_3.append (ary_1 [i])
    ary_3.append (ary_2 [i])
    if len (ary_3) == len (ary_1) + len (ary_2):
        print ary_3,
        break
Then the output was:
array('i',[11,14,12,15,13,16])
The result is not in order, and if I add a new integer to either ary_1 or ary_2, I get an "index out of range" error. So I found out that ary_1 and ary_2 must hold the same number of integers to prevent this error.

If you want to combine the arrays, you can use the built-in method .extend:
ary_1.extend(ary_2)
print ary_1 #array('i', [11, 12, 13, 14, 15, 16])
As SethMMorton points out in the comments, if you do not want to overwrite your first array:
ary_3 = ary_1 + ary_2
print ary_3 #array('i', [11, 12, 13, 14, 15, 16])
You should use one of the approaches above, but for learning purposes: in your original for loop you are (incorrectly) interleaving the two arrays by doing
ary_3.append (ary_1 [i])
ary_3.append (ary_2 [i])
If you wanted to keep the for loop, it should look something like:
ary_1_len = len(ary_1)
for i in range (0,ary_len):
    if i < ary_1_len:
        ary_3.append (ary_1 [i])
    else:
        ary_3.append (ary_2 [i-ary_1_len])
    if len (ary_3) == len (ary_1) + len (ary_2):
        print ary_3
        break
This populates the third array with the first array, and then with the second array.
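As a rough sketch of the same idea without any index arithmetic, the loop can iterate over the items of each array in turn. The helper name merge_arrays is an assumption made for illustration, and the second array is deliberately one item longer to show that unequal lengths are not a problem here.

```python
from array import array

def merge_arrays(first, second):
    """Return a new array with every item of first followed by every item of second."""
    merged = array(first.typecode)
    for source in (first, second):
        # Iterating over the items directly avoids indexing,
        # so the two inputs may have different lengths.
        for item in source:
            merged.append(item)
    return merged

ary_1 = array('i', [11, 12, 13])
ary_2 = array('i', [14, 15, 16, 17])  # deliberately one item longer
ary_3 = merge_arrays(ary_1, ary_2)
```

Neither input is modified; ary_3 holds the items of ary_1 followed by the items of ary_2.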

Related

Conditional Loop over a numpy array?

I am a beginner in Python and I usually program in C.
So, I have a numpy 2D array. I take the mean of the (i,j), (i+1,j), (i,j+1) and (i+1,j+1) values and add this mean to a running sum if it is at or above a chosen value.
This is my python code :
Z=np.array([[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]])
sum=0.
value=7.
for i in range(np.shape(Z)[0]-1):
    for j in range(np.shape(Z)[1]-1):
        a = (Z[i,j] + Z[i+1,j] + Z[i,j+1] + Z[i+1,j+1]) / 4
        if (a>=value):
            sum+=a
print (sum)
I know it does not look very Pythonic. How can I write it in a Pythonic way that speeds up this code on a large 2D numpy array?
Thanks for your answers.
I'd do it this way:
quads = Z[:-1,:-1] + Z[1:,:-1] + Z[:-1,1:] + Z[1:,1:]
sum = quads[quads >= value * 4].sum() / 4
The first line computes, for a Z of shape (x, y), the entire (x-1, y-1) array of sums over each 2x2 block of elements:
array([[16, 20, 24, 28],
       [36, 40, 44, 48]])
The second line compares each of those 8 elements with value * 4 rather than dividing quads by 4, which would create another array of the same size unnecessarily. That leaves a single scalar multiply and a single scalar divide at the end instead of an array divide. If you don't care about the optimization, you could also write it this way (the division goes into a new float array, because an in-place quads /= 4 on an integer array either truncates or raises an error, depending on the Python and numpy versions):
quads = quads / 4.0
sum = quads[quads >= value].sum()
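To check that the sliced version matches the original double loop, here is a small sketch that wraps both in functions and runs them on the array from the question. The function names quad_mean_sum and quad_mean_sum_loop are made up for this illustration.

```python
import numpy as np

def quad_mean_sum(Z, value):
    """Sum the means of all 2x2 blocks whose mean is at least `value`."""
    # Four shifted views of Z added together give every 2x2 block sum at once.
    quads = Z[:-1, :-1] + Z[1:, :-1] + Z[:-1, 1:] + Z[1:, 1:]
    # Compare block sums with value * 4, then divide the total once.
    return quads[quads >= value * 4].sum() / 4.0

def quad_mean_sum_loop(Z, value):
    """Reference implementation using the explicit double loop."""
    total = 0.0
    for i in range(Z.shape[0] - 1):
        for j in range(Z.shape[1] - 1):
            a = (Z[i, j] + Z[i + 1, j] + Z[i, j + 1] + Z[i + 1, j + 1]) / 4.0
            if a >= value:
                total += a
    return total

Z = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
fast = quad_mean_sum(Z, 7.0)
slow = quad_mean_sum_loop(Z, 7.0)
```

Both calls give 49.0 for this input: the qualifying block means are 7, 9, 10, 11 and 12.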

Nested loop with indices

I'm trying to solve the following simplified problem. I have an array with data and two arrays with start and end indices stored. What I would like is to double the values of the dataset that fall between the indices (so between and including start[0], end[0] and start[1], end[1] etc). I tried a nested loop as follows:
data = np.array([0,1,2,8,4,5,6,5,4,5,6,7,8])
start = np.array([0,5,7])
end = np.array([3,6,9])
new_data = np.zeros(len(data))
for i in range(len(start)):
    for j in range(len(data)):
        if (j >= start[i]) & (j <= end[i]):
            new_data[j] = data[j]*2
        else:
            new_data[j] = data[j]
The result should be [0,2,4,16,4,10,12,10,8,10,6,7,8], and yet the code returns:
[ 0. 1. 2. 8. 4. 5. 6. 10. 8. 10. 6. 7. 8.]
Only the part between the last indices is correct. Any ideas why? And what if I want to triple the values not satisfying the if statement?
You're repeatedly assigning to new_data, overwriting previous changes: every pass of the outer loop rewrites all of new_data from data, so only the doubling from the last interval survives.
I.e.:
new_data[j] = data[j]*2 # won't work
data[j] = data[j]*2 # will work.
I'd suggest using Python's built-in zip function to iterate over the start and end arrays in parallel.
Code explanation:
Since the data points which don't belong to any interval remain the same, I used numpy's copy method to create a new copy of your original data. Then I multiplied your desired slices by two.
import numpy as np
data = np.array([0, 1, 2, 8, 4, 5, 6, 5, 4, 5, 6, 7, 8])
start = np.array([0, 5, 7])
end = np.array([3, 6, 9])
new_data = data.copy() #Not to mess up our original dataset.
for s, e in zip(start, end):
    new_data[s:e+1] *= 2 #Because it's an inclusive set.
Initialize new_data with data and remove the else.
Edit:
The code below should answer your question. It is also more efficient: your version iterates through the whole array once per interval, which will be slow for large inputs, whereas here each element inside an interval is visited only once.
And if you want to triple the other elements, a naive solution is to multiply new_data by 3 before the for loop, i.e. use new_data = np.copy(data)*3.
import numpy as np
data = np.array([0,1,2,8,4,5,6,5,4,5,6,7,8])
start = np.array([0,5,7])
end = np.array([3,6,9])
new_data = np.copy(data) # change to new_data = np.copy(data)*3 to triple other elements.
for i in range(len(start)):
    for j in range(start[i], end[i]+1):
        new_data[j] = data[j]*2
Another note on your code: (j >= start[i]) & (j <= end[i]) can be simplified to the chained comparison start[i] <= j <= end[i], which is probably also faster.
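For completeness, here is a sketch of a version with no Python-level loop at all. It builds a boolean mask by broadcasting every position against every interval, then uses np.where to choose between the doubled and the untouched (or tripled) values. The names positions, inside, doubled and doubled_or_tripled are introduced only for this example, and the intermediate comparison has shape (number of intervals, len(data)), so it assumes the number of intervals is modest.

```python
import numpy as np

data = np.array([0, 1, 2, 8, 4, 5, 6, 5, 4, 5, 6, 7, 8])
start = np.array([0, 5, 7])
end = np.array([3, 6, 9])

positions = np.arange(len(data))
# Compare every position against every interval via broadcasting,
# then collapse to one flag per position: True if inside any interval.
inside = ((positions >= start[:, None]) & (positions <= end[:, None])).any(axis=0)

# Double inside the intervals, leave everything else unchanged.
doubled = np.where(inside, data * 2, data)
# Double inside the intervals, triple everything else.
doubled_or_tripled = np.where(inside, data * 2, data * 3)
```

Here doubled matches the expected result from the question, and doubled_or_tripled covers the follow-up about tripling the values that do not satisfy the condition.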

Value in an array between two numbers in python

So making a title that actually explains what I want is harder than I thought, so here goes my attempt at explaining it.
I have an array filled with zeros that gets a new row of values every time a condition is met, so after one time step I get something like this (minus the headers):
current_array =
bubble_size y_coord
14040 42
3943 71
6345 11
0 0
0 0
....
After this time step is complete, current_array is stored as previous_array and then reset to zeros, because the number of entries is not guaranteed to be the same each time.
Now the real question: I want to check every row in the first column of previous_array and see whether the current bubble size is within, say, 5% either side of it. If so, I want to subtract the y_coord in the second column of the matching previous_array row from the current y position.
Currently I have something like:
if bubble_size in current_array[:, 0]:
    do_whatever
but I don't know how to pull out the associated y_coord without using a loop. I am fine with using one, but would like to avoid it: there are about 100 rows in the array and at least 1000 time steps, so I want to make it as efficient as possible.
I have included my thoughts on the for loop (note that current_array and previous_array are actually called current_frame and previous_frame):
for y in range (0, array_size):
    # within 5% either side of the previous bubble size
    if previous_frame[y, 0] * .95 < bubble_size < previous_frame[y, 0] * 1.05:
        # column 1 holds the y_coord of the matching bubble
        distance_travelled = current_y_coord - previous_frame[y, 1]
Any help is greatly appreciated :)
I may not have fully understood your issue, but if you first want to check whether each current bubble size is within 5% of the previous bubble size in the same row, you can use the following:
import numpy as np
def apply(p, c): # For each element check the bubblesize grow
    if(p*0.95 < c < p*1.05):
        return 1
    else:
        return 0
def dist(p, c): # Calculate the distance
    return c-p
def update(prev, cur):
    assert isinstance(
        cur, np.ndarray), 'Current array is not a valid numpy array'
    assert isinstance(
        prev, np.ndarray), 'Previous array is not a valid numpy array'
    assert prev.shape == cur.shape, 'Arrays size mismatch'
    applyvec = np.vectorize(apply)
    toapply = applyvec(prev[:, 0], cur[:, 0])
    print(toapply)
    distvec = np.vectorize(dist)
    distance = distvec(prev[:, 1], cur[:, 1])
    print(distance)
current = np.array([[14040, 42],
                    [3943,71],
                    [6345,11],
                    [0,0],
                    [0,0]])
previous = np.array([[14039, 32],
                     [3942,61],
                     [6344,1],
                     [0,0],
                     [0,0]])
update(previous,current)
PS: Could you tell us what final array you expect, based on my examples?
As I understand it (correct me if I'm wrong):
You have a current bubble size (integer) and a current y value (integer)
You have a 2D array (prev_array) that contains bubble sizes and y coords
You want to check whether your current bubble size is within 5% (either way) of each stored bubble size in prev_array
If they are within range, subtract your current y value from the stored y coord
This will result in a new array, containing only bubble sizes that are within range, and the newly subtracted y value
You want to do this without an explicit loop
You can do that using boolean indexing in numpy...
Setup the previous array:
prev_array = np.array([[14040, 42], [3943, 71], [6345, 11], [3945,0], [0,0]])
prev_array
array([[14040,    42],
       [ 3943,    71],
       [ 6345,    11],
       [ 3945,     0],
       [    0,     0]])
You have your stored bubble size you want to use for comparison, and a current y coord value:
bubble_size = 3750
cur_y = 10
Next we can create a boolean mask that selects only the rows of prev_array that meet the 5% criteria:
ind = (bubble_size > prev_array[:,0]*.95) & (bubble_size < prev_array[:,0]*1.05)
# ind is a boolean array that looks like this: [False, True, False, True, False]
Then we use ind to index prev_array, and calculate the new (subtracted) y coords:
new_array = prev_array[ind]
new_array[:,1] = cur_y - new_array[:,1]
Giving your final output array:
array([[3943,  -61],
       [3945,   10]])
As it's not clear what you want your output to actually look like, instead of creating a new array you can also just update prev_array with the new y values:
ind = (bubble_size > prev_array[:,0]*.95) & (bubble_size < prev_array[:,0]*1.05)
prev_array[ind,1] = cur_y - prev_array[ind,1]
Which gives:
array([[14040,    42],
       [ 3943,   -61],
       [ 6345,    11],
       [ 3945,    10],
       [    0,     0]])
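The masking steps above can be wrapped in a small helper that is called once per bubble per time step. This is only a sketch: the function name distances_to_matches and the tolerance parameter are assumptions made for illustration, and the sign convention (current y minus stored y) follows the code above.

```python
import numpy as np

def distances_to_matches(prev_frame, bubble_size, cur_y, tolerance=0.05):
    """Return cur_y minus the stored y_coord for every row whose stored
    bubble size has bubble_size within +/- tolerance of it."""
    sizes = prev_frame[:, 0]
    within = (bubble_size > sizes * (1 - tolerance)) & (bubble_size < sizes * (1 + tolerance))
    # Boolean indexing picks the y_coord column of the matching rows only.
    return cur_y - prev_frame[within, 1]

prev_array = np.array([[14040, 42], [3943, 71], [6345, 11], [3945, 0], [0, 0]])
distances = distances_to_matches(prev_array, bubble_size=3750, cur_y=10)
```

With the example values used above, distances contains -61 and 10, and prev_array itself is left untouched because boolean indexing returns a copy.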

Quicksort in python3. Last Pivot

Thanks for taking the time to read this :) I'm implementing my own version of quicksort in Python and I'm trying to get it to work within some restrictions from a previous school assignment. Note that the reason I've avoided using "in" is that it wasn't allowed in the project I worked on (not sure why :3).
It was working fine for integers and strings, but I cannot manage to adapt it for my CounterList(), which is a list of nodes each containing an arbitrary integer and string, even though I'm only sorting by the integers contained in those nodes.
Pastebins:
My QuickSort: http://pastebin.com/mhAm3YYp.
The CounterList and CounterNode, code. http://pastebin.com/myn5xuv6.
from classes_1 import CounterNode, CounterList
def bulk_append(array1, array2):
    # takes all the items in array2 and appends them to array1
    itr = 0
    array = array1
    while itr < len(array2):
        array.append(array2[itr])
        itr += 1
    return array
def quickSort(array):
    lss = CounterList()
    eql = CounterList()
    mre = CounterList()
    if len(array) <= 1:
        return array # Base case.
    else:
        pivot = array[len(array)-1].count # Pivoting on the last item.
        itr = 0
        while itr < len(array)-1:
            # Essentially editing "for i in array:" to handle CounterLists
            if array[itr].count < pivot:
                lss.append(array[itr])
            elif array[itr].count > pivot:
                mre.append(array[itr])
            else:
                eql.append(array[itr])
            itr += 1
        # Recursive step and combining separate lists.
        lss = quickSort(lss)
        eql = quickSort(eql)
        mre = quickSort(mre)
        fnl = bulk_append(lss, eql)
        fnl = bulk_append(fnl, mre)
        return fnl
I know it is probably quite straightforward but I just can't seem to see the issue.
(Pivoting on last item)
Here is the test I'm using:
a = CounterList()
a.append(CounterNode("ack", 11))
a.append(CounterNode("Boo", 12))
a.append(CounterNode("Cah", 9))
a.append(CounterNode("Doh", 7))
a.append(CounterNode("Eek", 5))
a.append(CounterNode("Fuu", 3))
a.append(CounterNode("qck", 1))
a.append(CounterNode("roo", 2))
a.append(CounterNode("sah", 4))
a.append(CounterNode("toh", 6))
a.append(CounterNode("yek", 8))
a.append(CounterNode("vuu", 10))
x = quickSort(a)
print("\nFinal List: \n", x)
And the resulting CounterList:
['qck': 1, 'Fuu': 3, 'Eek': 5, 'Doh': 7, 'Cah': 9, 'ack': 11]
Which as you can tell, is missing multiple values?
Either way thanks for any advice, and your time.
There are two mistakes in the code:
You don't need the "eql = quickSort(eql)" line: eql contains only equal values, so there is nothing to sort.
In every recursive call you lose the pivot (the reason for the missing entries), because you never append it to any list. You need to append it to eql. So after the line shown below:
pivot = array[len(array)-1].count
insert this line:
eql.append(array[len(array)-1])
Also remove the line below from your code. Once the pivot is kept in eql, sorting a list of two or more equal values puts all of them back into eql, so the call never reaches the base case and hits the maximum recursion depth (this only happens when the array has repeated values and one of them is selected as the pivot):
eql = quickSort(eql)
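Putting both fixes together, a corrected version might look like the sketch below. CounterList and CounterNode live in the linked pastebin and are not reproduced here, so this uses plain lists and a hypothetical Node namedtuple as stand-ins; only the .count attribute matters to the sort. The sort itself keeps the while loop from the question, and list concatenation replaces bulk_append.

```python
from collections import namedtuple

# Hypothetical stand-in for CounterNode; only .count is used by the sort.
Node = namedtuple("Node", ["word", "count"])

def quick_sort(nodes):
    if len(nodes) <= 1:
        return nodes  # Base case.
    pivot_node = nodes[len(nodes) - 1]  # Pivoting on the last item.
    pivot = pivot_node.count
    lss, mre = [], []
    eql = [pivot_node]  # Keep the pivot itself so it is not lost.
    itr = 0
    while itr < len(nodes) - 1:
        if nodes[itr].count < pivot:
            lss.append(nodes[itr])
        elif nodes[itr].count > pivot:
            mre.append(nodes[itr])
        else:
            eql.append(nodes[itr])
        itr += 1
    # eql is not sorted again: all of its counts are equal.
    return quick_sort(lss) + eql + quick_sort(mre)

pairs = [("ack", 11), ("Boo", 12), ("Cah", 9), ("Doh", 7), ("Eek", 5), ("Fuu", 3),
         ("qck", 1), ("roo", 2), ("sah", 4), ("toh", 6), ("yek", 8), ("vuu", 10)]
result = quick_sort([Node(word, count) for word, count in pairs])
counts = [node.count for node in result]
```

With the twelve test nodes from the question, counts comes out as 1 through 12 with nothing missing.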

Speeding up iterating over Numpy Arrays

I am working on performing image processing using Numpy, specifically a running standard deviation stretch. This reads in X number of columns, finds the Std. and performs a percentage linear stretch. It then iterates to the next "group" of columns and performs the same operations. The input image is a 1GB, 32-bit, single band raster which is taking quite a long time to process (hours). Below is the code.
I realize that I have 3 nested for loops, which is presumably where the bottleneck is occurring. If I process the image in "boxes", that is to say loading an array that is [500,500] and iterating through the image, processing time is quite short. Unfortunately, camera error requires that I iterate in extremely long strips (52,000 x 4) (y,x) to avoid banding.
Any suggestions on speeding this up would be appreciated:
def box(dataset, outdataset, sampleSize, n):
    quiet = 0
    sample = sampleSize
    #iterate over all of the bands
    for j in xrange(1, dataset.RasterCount + 1): #1 based counter
        band = dataset.GetRasterBand(j)
        NDV = band.GetNoDataValue()
        print "Processing band: " + str(j)
        #define the interval at which blocks are created
        intervalY = int(band.YSize/1)
        intervalX = int(band.XSize/2000) #to be changed to sampleSize when working
        #iterate through the rows
        scanBlockCounter = 0
        for i in xrange(0,band.YSize,intervalY):
            #If the next i is going to fail due to the edge of the image/array
            if i + (intervalY*2) < band.YSize:
                numberRows = intervalY
            else:
                numberRows = band.YSize - i
            for h in xrange(0,band.XSize, intervalX):
                if h + (intervalX*2) < band.XSize:
                    numberColumns = intervalX
                else:
                    numberColumns = band.XSize - h
                scanBlock = band.ReadAsArray(h,i,numberColumns, numberRows).astype(numpy.float)
                standardDeviation = numpy.std(scanBlock)
                mean = numpy.mean(scanBlock)
                newMin = mean - (standardDeviation * n)
                newMax = mean + (standardDeviation * n)
                outputBlock = ((scanBlock - newMin)/(newMax-newMin))*255
                outRaster = outdataset.GetRasterBand(j).WriteArray(outputBlock,h,i)#array, xOffset, yOffset
                scanBlockCounter = scanBlockCounter + 1
                #print str(scanBlockCounter) + ": " + str(scanBlock.shape) + str(h)+ ", " + str(intervalX)
                if numberColumns == band.XSize - h:
                    break
                #update progress line
                if not quiet:
                    gdal.TermProgress_nocb( (float(h+1) / band.YSize) )
Here is an update:
Without using the profile module, as I did not want to start wrapping small sections of the code into functions, I used a mix of print and exit statements to get a really rough idea of which lines were taking the most time. Luckily (and I do understand how lucky I was), one line was dragging everything down.
outRaster = outdataset.GetRasterBand(j).WriteArray(outputBlock,h,i)#array, xOffset, yOffset
It appears that GDAL is quite inefficient when opening the output file and writing out the array. With this in mind I decided to add my modified arrays (outputBlock) to a Python list, then write them out in chunks. Here is the segment that I changed:
The outputBlock was just modified ...
#Add the array to a list (tuple)
outputArrayList.append(outputBlock)
#Check the interval counter and if it is "time" write out the array
if len(outputArrayList) >= (intervalX * writeSize) or finisher == 1:
    #Convert the tuple to a numpy array. Here we horizontally stack the tuple of arrays.
    stacked = numpy.hstack(outputArrayList)
    #Write out the array
    outRaster = outdataset.GetRasterBand(j).WriteArray(stacked,xOffset,i)#array, xOffset, yOffset
    xOffset = xOffset + (intervalX*(intervalX * writeSize))
    #Cleanup to conserve memory
    outputArrayList = list()
    stacked = None
    finisher=0
Finisher is simply a flag that handles the edges. It took a bit of time to figure out how to build an array from the list: using numpy.array was creating a 3-d array (anyone care to explain why?) and WriteArray requires a 2-d array. Total processing time now varies from just under 2 minutes to 5 minutes. Any idea why the range of times might exist?
Many thanks to everyone who posted! The next step is to really get into Numpy and learn about vectorization for additional optimization.
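One way to speed up operations over numpy data, compared with a hand-written Python loop, is to use vectorize. Essentially, vectorize takes a function f and creates a new function g that maps f over an array a. g is then called like so: g(a). Bear in mind that the numpy documentation describes vectorize as provided primarily for convenience, not performance, so built-in array operations are usually faster still.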
>>> sqrt_vec = numpy.vectorize(lambda x: x ** 0.5)
>>> sqrt_vec(numpy.arange(10))
array([ 0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
Without having the data you're working with available, I can't say for certain whether this will help, but perhaps you can rewrite the above as a set of functions that can be vectorized. Perhaps in this case you could vectorize over an array of indices into ReadAsArray(h,i,numberColumns, numberRows). Here's an example of the potential benefit:
>>> print setup1
import numpy
sqrt_vec = numpy.vectorize(lambda x: x ** 0.5)
>>> print setup2
import numpy
def sqrt_vec(a):
    r = numpy.zeros(len(a))
    for i in xrange(len(a)):
        r[i] = a[i] ** 0.5
    return r
>>> timeit.timeit(stmt='a = sqrt_vec(numpy.arange(1000000))', setup=setup1, number=1)
0.30318188667297363
>>> timeit.timeit(stmt='a = sqrt_vec(numpy.arange(1000000))', setup=setup2, number=1)
4.5400981903076172
A 15x speedup! Note also that numpy slicing handles the edges of ndarrays elegantly:
>>> a = numpy.arange(25).reshape((5, 5))
>>> a[3:7, 3:7]
array([[18, 19],
[23, 24]])
So if you could get your ReadAsArray data into an ndarray you wouldn't have to do any edge-checking shenanigans.
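As a point of comparison, the sketch below computes the same square roots three ways. numpy.vectorize still calls the Python lambda once per element, while an array expression or a ufunc such as numpy.sqrt runs the loop in compiled code. No timings are given here because they depend on the machine and numpy version; the variable names are chosen only for this example.

```python
import numpy as np

values = np.arange(10)

# numpy.vectorize wraps the lambda, which is still called once per element.
sqrt_vec = np.vectorize(lambda x: x ** 0.5)
per_element = sqrt_vec(values)

# Array expressions and ufuncs operate on the whole array at once.
whole_array = values ** 0.5
with_ufunc = np.sqrt(values)
```

All three give the same values, so where an operation can be written directly on the array, that form is usually the one worth timing first.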
Regarding your question about reshaping -- reshaping doesn't fundamentally alter the data at all. It just changes the "strides" by which numpy indexes the data. When you call the reshape method, the value returned is (whenever possible) a new view into the data; the data isn't copied or altered at all, nor is the old view with its old stride information.
>>> a = numpy.arange(25)
>>> b = a.reshape((5, 5))
>>> a
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24])
>>> b
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
>>> a[5]
5
>>> b[1][0]
5
>>> a[5] = 4792
>>> b[1][0]
4792
>>> a.strides
(8,)
>>> b.strides
(40, 8)
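On the related question of why numpy.array produced a 3-d array from the list of blocks: numpy.array treats a list of equally shaped 2-d arrays as one more level of nesting and adds a new leading axis, whereas numpy.hstack joins the blocks along the existing column axis. A small sketch, with block shapes made up purely for illustration:

```python
import numpy as np

# Three hypothetical output blocks, each 4 rows by 2 columns.
blocks = [np.zeros((4, 2)), np.ones((4, 2)), np.full((4, 2), 2.0)]

# A list of 2-d arrays becomes one 3-d array with a new leading axis.
as_array = np.array(blocks)

# hstack concatenates along the column axis and stays 2-d.
stacked = np.hstack(blocks)
```

Here as_array.shape is (3, 4, 2) and stacked.shape is (4, 6), which is why only the stacked form suits a writer that expects a 2-d array.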
Answered as requested.
If you are IO bound, you should chunk your reads/writes. Try dumping ~500 MB of data to an ndarray, process it all, write it out and then grab the next ~500 MB. Make sure to reuse the ndarray.
Without trying to completely understand exactly what you are doing, I notice that you aren't using any numpy slices or array broadcasting, both of which may speed up your code, or, at the very least, make it more readable. My apologies if these aren't germane to your problem.
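As an illustration of what slicing and broadcasting could look like for this kind of stretch, the sketch below processes every column strip of an in-memory array at once. It is written under several assumptions that are not in the original code: the data is already loaded into a 2-d ndarray, the strip width divides the number of columns evenly, no strip has zero standard deviation, and the function name stretch_strips is made up.

```python
import numpy as np

def stretch_strips(image, strip_width, n):
    """Apply a mean +/- n*std linear stretch to each vertical strip of
    `strip_width` columns, scaled to 0..255 as in the loop version."""
    rows, cols = image.shape
    if cols % strip_width != 0:
        raise ValueError("strip_width must divide the number of columns")
    # View the image as (rows, number of strips, strip_width).
    strips = image.astype(np.float64).reshape(rows, cols // strip_width, strip_width)
    # One mean and one std per strip, kept broadcastable against strips.
    mean = strips.mean(axis=(0, 2), keepdims=True)
    std = strips.std(axis=(0, 2), keepdims=True)
    new_min = mean - std * n
    new_max = mean + std * n
    stretched = (strips - new_min) / (new_max - new_min) * 255
    return stretched.reshape(rows, cols)

rng = np.random.default_rng(0)
image = rng.integers(0, 1000, size=(6, 8))
result = stretch_strips(image, strip_width=4, n=2.0)
```

Each strip gets its own statistics, as in the per-block loop, but the arithmetic runs as a handful of whole-array operations. Whether this helps in practice depends on how much of the raster fits in memory at once.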
