I'm looping through a very large dataframe (11361 x 22679) and converting the values of each row to a pixel image using pyplot. So in the end I should have 11361 images with 151 x 151 pixels (I add 0's to the end of each row to make it square).
allDF is a list of 33 DataFrames that correspond to the 33 subdirectories in newFileNames that the images need to be saved to.
I've tried deleting each DataFrame and image at the end of each iteration.
I've tried converting the float values to int.
I've tried gc.collect() at the end of each iteration (even though I know it's redundant)
I've taken measures not to store any additional values by always referencing the original data.
The only thing that helps is if I process one frame at a time. It still slows down, but because there are fewer iterations it's not as slow. So I think the inner loop or one of the functions is the issue.
def shape_pixels(imglist):
    # pad the row with zeros so the 22679 values reshape to 151 x 151
    for i in range(122):
        imglist.append(0.0)
    imgarr = np.array(imglist).reshape((151, 151))
    return imgarr
def create_rbg_image(subpath, imgarr, imgname):
    # create/save image
    img = plt.imshow(imgarr, cmap=rgbmap)
    plt.axis('off')
    plt.savefig(dirpath + subpath + imgname,
                transparent=True,
                bbox_inches=0, pad_inches=0)
for i in range(len(allDF)):
    for j in range(len(allDF[i])):
        fname = allDF[i]['File Name'].iloc[j][0:36]
        newlist = allDF[i].iloc[j][1:].tolist()
        newarr = shape_pixels(allDF[i].iloc[j][1:].tolist())
        create_rbg_image(newFileNames[i] + '\\',
                         shape_pixels(allDF[i].iloc[j][1:].tolist()),
                         allDF[i]['File Name'].iloc[j][0:36])
I'd like to be able to run the code for the entire dataset and just come back to it when it's done, but I ran it overnight and got less than 1/3 of the way through. If it continues to slow down I'll never be done.
The first minute generates over 150 images. The second generates 80. Then 48, 32, 27, and so on... eventually it takes several minutes to create just one.
plt.close('all') helped significantly, but I switched to using PIL and hex values instead. This was significantly more efficient, and I was able to generate all 11k+ images in under 20 minutes.
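For reference, a minimal sketch of that kind of PIL approach (the helper name, the zero padding, and the grayscale scaling are my assumptions, not the original code):

import numpy as np
from PIL import Image

def save_row_as_image(row_values, out_path, size=151):
    # hypothetical helper: pad the row with zeros so it fills a size x size grid
    padded = np.zeros(size * size)
    padded[:len(row_values)] = row_values
    # map the values to 0-255 and save as an 8-bit grayscale image;
    # swap in whatever value-to-color mapping (e.g. a hex palette) you prefer
    span = padded.max() - padded.min()
    scaled = ((padded - padded.min()) / (span if span else 1) * 255).astype(np.uint8)
    Image.fromarray(scaled.reshape(size, size), mode='L').save(out_path)

Because this never creates a matplotlib figure, there is nothing to close between iterations.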
import os
import shutil
import time

import cv2
# assumed import: scikit-image's structural_similarity is the ssim in use here
from skimage.metrics import structural_similarity as ssim

def SSIM_compute(files, WorkingFolder, DestinationAlikeFolder, DestinationUniqueFolder, results, start_time):
    NumberAlike = 0
    loop = 1
    while True:
        files = os.listdir(WorkingFolder)
        if files == []:
            break
        IsAlike = False
        CountAlike = 1
        print("Loop : " + str(loop) + " --- starttime : " + str(time.time() - start_time))
        # compare the first remaining file against every other remaining file
        for i in range(1, len(files)):
            #print("\ti= "+str(i)+" : "+str(time.time()-start_time))
            img1 = cv2.imread(WorkingFolder + "/" + files[0])
            img2 = cv2.imread(WorkingFolder + "/" + files[i])
            x1, y1 = img1.shape[:2]
            x2, y2 = img2.shape[:2]
            x = min(x1, x2)
            y = min(y1, y2)
            img1 = cv2.resize(img1, (x, y), 1)
            img2 = cv2.resize(img2, (x, y), 1)
            threshold = ssim(img1, img2, multichannel=True)
            if threshold > 0.8:
                IsAlike = True
                if os.path.exists((WorkingFolder + "/" + files[i])):
                    shutil.move((WorkingFolder + "/" + files[i]), DestinationAlikeFolder + "/alike" + str(NumberAlike) + "_" + str(CountAlike) + ".jpg")
                    CountAlike += 1
                    #results.write("ALIKE : " +files[0] +" --- " +files[i]+"\n")
                    results.write("ALIKE : /alike" + str(NumberAlike) + "_0" + ".jpg --- /alike" + str(NumberAlike) + "_" + str(CountAlike) + ".jpg -> " + str(threshold))
        if IsAlike:
            if os.path.exists((WorkingFolder + "/" + files[0])):
                shutil.move((WorkingFolder + "/" + files[0]), DestinationAlikeFolder + "/alike" + str(NumberAlike) + "_0" + ".jpg")
                NumberAlike += 1
        else:
            if os.path.exists((WorkingFolder + "/" + files[0])):
                shutil.move((WorkingFolder + "/" + files[0]), DestinationUniqueFolder)
        loop += 1
I have this code that must compare images to determine if they are identical or if some of them were modified (compression, artefact, etc...).
So, to check if two images are strictly similar, I just compute and compare their respective hashes (in another function not shown here), and to check if they are similar I compute the SSIM on those two files.
The next part is where the trouble begins: when I test this code on a quite small set of pictures (approx. 50), the execution time is decent, but if I make the set bigger (something like 200 pictures), the execution time becomes way too high (several hours), as expected considering I have two nested for loops.
As I'm not very creative, does anybody have ideas for reducing the execution time on a larger dataset? Maybe a method to avoid those nested loops?
Thank you for any provided help :)
You're comparing each image with every other image - you could pull reading the first image img1 out of the for loop and just do it once per file.
But as you're comparing each file with every other file, that's going to slow down as O(N^2/2), i.e. 200 files will be about 16x slower than 50. Maybe you could resize to a much smaller size like 64x64, which would be much quicker to compare with ssim(), and only do a full-size comparison if the images are similar at that small size?
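As an illustration, here is a minimal sketch of both ideas; find_alike is a hypothetical helper, and the 0.8 threshold and multichannel SSIM call are reused from the question:

import cv2
from skimage.metrics import structural_similarity as ssim

def find_alike(ref_path, candidate_paths, threshold=0.8, small=(64, 64)):
    # read and downscale the reference image once, outside the loop
    ref = cv2.imread(ref_path)
    ref_small = cv2.resize(ref, small)
    alike = []
    for path in candidate_paths:
        img = cv2.imread(path)
        # cheap pre-filter on 64x64 thumbnails
        if ssim(ref_small, cv2.resize(img, small), multichannel=True) < threshold:
            continue  # clearly different at thumbnail size, skip the slow step
        # only now do the full-size comparison
        h = min(ref.shape[0], img.shape[0])
        w = min(ref.shape[1], img.shape[1])
        if ssim(cv2.resize(ref, (w, h)), cv2.resize(img, (w, h)),
                multichannel=True) > threshold:
            alike.append(path)
    return alike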
I have a large NumPy array nodes = np.arange(100_000_000) and I need to rearrange this array by:
Recording and then removing the middle value in the array
Split the array into the left half and right half
Repeat Steps 1-2 for each half
Stop when all values are exhausted
So, for a smaller input example nodes = np.arange(10), the output would be:
[5 2 8 1 4 7 9 0 3 6]
This was accomplished by naively doing:
import numpy as np

def split(node, out):
    mid = len(node) // 2
    out.append(node[mid])
    return node[:mid], node[mid+1:]

def reorder(a):
    nodes = [a.tolist()]
    out = []
    while nodes:
        tmp = []
        for node in nodes:
            for n in split(node, out):
                if n:
                    tmp.append(n)
        nodes = tmp
    return np.array(out)

if __name__ == "__main__":
    nodes = np.arange(10)
    print(reorder(nodes))
However, this is way too slow for nodes = np.arange(100_000_000) and so I am looking for a much faster solution.
You can vectorize your function with Numpy by working on groups of slices.
Here is an implementation:
import numpy as np

# Similar to [e for tmp in zip(a, b) for e in tmp],
# but on Numpy arrays and much faster
def interleave(a, b):
    assert len(a) == len(b)
    return np.column_stack((a, b)).reshape(len(a) * 2)

# n is the length of the input range (len(a) in your example)
def fast_reorder(n):
    if n == 0:
        return np.empty(0, dtype=np.int32)
    startSlices = np.array([0], dtype=np.int32)
    endSlices = np.array([n], dtype=np.int32)
    allMidSlices = np.empty(n, dtype=np.int32)  # Similar to "out" in your implementation
    midInsertCount = 0  # Actual size of allMidSlices
    # Generate a bunch of middle values as long as there are valid slices to split
    while midInsertCount < n:
        # Generate the new mid/left/right slices
        midSlices = (endSlices + startSlices) // 2
        # Computing the next slices is not needed for the last step
        if midInsertCount + len(midSlices) < n:
            # Generate the next slices (possibly with invalid ones)
            newStartSlices = interleave(startSlices, midSlices + 1)
            newEndSlices = interleave(midSlices, endSlices)
            # Discard invalid slices
            isValidSlices = newStartSlices < newEndSlices
            startSlices = newStartSlices[isValidSlices]
            endSlices = newEndSlices[isValidSlices]
        # Fast appending
        allMidSlices[midInsertCount:midInsertCount + len(midSlices)] = midSlices
        midInsertCount += len(midSlices)
    return allMidSlices[0:midInsertCount]
On my machine, this is 89 times faster than your scalar implementation with the input np.arange(100_000_000), dropping from 2min35 to 1.75s. It also consumes far less memory (roughly 3~4 times less). Note that if you want faster code, then you probably need to use a native language like C or C++.
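As a quick sanity check against the example from the question (this snippet is my addition, assuming both reorder from the question and fast_reorder above are in scope):

import numpy as np

# fast_reorder takes the length of the range rather than the array itself
print(fast_reorder(10))  # [5 2 8 1 4 7 9 0 3 6]
assert np.array_equal(fast_reorder(10), reorder(np.arange(10)))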
Edit:
The question has been updated to have a much smaller input array, so I leave the text below for historical reasons. Basically, the original size was likely a typo, but we often get accustomed to computers working with insanely large numbers, and when memory is involved they can be a real problem.
There is already a numpy based solution submitted by someone else that I think fits the bill.
Your code requires an insane amount of RAM just to hold 100 billion 64-bit integers. Do you have 800 GB of RAM? Then you convert the NumPy array to a list, which will be substantially larger than the array (each packed 64-bit int in the NumPy array becomes a much less memory-efficient Python int object, and the list holds a pointer to each of those objects). Then you make a lot of slices of the list, which will not duplicate the data but will duplicate the pointers to the data and use even more RAM. You also append all the result values to a list one value at a time. Lists are generally very fast for adding items, but at such an extreme size this will not only be slow, but the way the list is allocated is likely to be extremely wasteful of RAM and contribute to major problems (I believe they double in size when they reach a certain level of fullness, so you will end up allocating more RAM than you need and doing many allocations and likely copies). What kind of machine are you running this on? There are ways to improve your code, but unless you're running it on a supercomputer I don't know that you're ever going to finish that calculation. I only (only?) have 32 GB of RAM, and I'm not even going to try to create a 100B int64 NumPy array, as I don't want to use up SSD write life on a mass of virtual memory.
As for improving your code: stick to NumPy arrays and don't convert to a Python list, as that will greatly increase the RAM you need. Preallocate a NumPy array to put the answer in. Then you need a new algorithm. Anything recursive, or recursive-like (i.e. a loop splitting the input), will require tracking a lot of state; your nodes list is going to be extraordinarily gigantic and again use a lot of RAM. You could use len(a) to mark values that are removed from your list and scan through the entire array each time to figure out what to do next, but that saves RAM in exchange for a tremendous amount of searching through a gigantic array. I feel like there is an algorithm to cut numbers from each end and place them in the output while only tracking the beginning and end, but I haven't figured it out, at least not yet.
I also think there is a simpler algorithm where you just track the number of splits you've done instead of making a giant list of slices and keeping it all in memory. Take the middle of the left half, then the middle of the right, then count up one; when you take the middle of the left half's left half you know you have to jump to the right half, and when the count is one you jump over to the original right half's left half, and so on. Based on the depth into the halves and the length of the input you should be able to jump around without scanning or tracking all of those slices, though I haven't been able to dedicate much time to thinking this through.
With a problem of this nature, if you really need to push the limits, you should consider using C/C++ so you can be as efficient as possible with RAM usage, and because you're doing an insane number of tiny operations, which doesn't map well to Python performance.
I have a list of data in python that represents amount of resources used per minute. I want to find the number of times it changes significantly in that data set. What I mean by significant change is a bit different from what I've read so far.
For example, if I have a dataset like
[10,15,17,20,30,40,50,70,80,60,40,20]
I say a significant change happens when data increases by double or reduces by half with respect to the previous normal.
For example, since the list starts with 10, that is our starting normal point.
Then when data doubles to 20, I count that as one significant change and set the normal to 20.
Then when data doubles to 40, it is considered a significant change and the normal is now 40
Then when data doubles to 80, it is considered a significant change and the normal is now 80
After that when data reduces by half to 40, it is considered as another significant change and the normal becomes 40
Finally when data reduces by half to 20, it is the last significant change
Here there are a total of 5 significant changes.
Is it similar to any other change detection algorithm? How can this be done efficiently in python?
This is relatively straightforward. You can do this with a single iteration through the list. We simply update our base when a 'significant' change occurs.
Note that my implementation will work for any iterable or container. This is useful if you want to, for example, read through a file without having to load it all into memory.
def gen_significant_changes(iterable, *, tol=2):
    iterable = iter(iterable)  # this is necessary if it is a container rather than a generator.
    # note that if the iterable is already a generator, iter(iterable) returns itself.
    base = next(iterable)
    for x in iterable:
        if x >= (base * tol) or x <= (base / tol):
            yield x
            base = x

my_list = [10,15,17,20,30,40,50,70,80,60,40,20]
print(list(gen_significant_changes(my_list)))
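If you only need the number of significant changes rather than the values themselves, you can consume the generator directly (this usage note is my addition):

count = sum(1 for _ in gen_significant_changes(my_list))
print(count)  # 5 for the example data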
I can't help with the Python part, but in terms of math, the problem you're asking is fairly simple to solve using log base 2. A significant change occurs when the current value divided by a constant can be reached by raising 2 to a different power (as an integer) than the previous value. (The constant is needed since the first value in the array forms the basis of comparison.)
For each element at t, compute:
current = math.log(Array[t] / Array[0], 2)
previous = math.log(Array[t-1] / Array[0], 2)
if math.floor(current) != math.floor(previous): a significant change has occurred
Using this method you do not need to keep track of a "normal point" at all; you just need the array. By removing the additional state variable, we enable the array to be processed in any order, and we could give portions of the array to different threads if the dataset were very large. You wouldn't be able to do that with your current method.
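A small sketch of that idea in Python (my wording of the pseudocode above, applied to the sample data from the question):

import math

def count_significant_changes(values):
    # a change is counted whenever floor(log2(value / first_value)) lands on a
    # different integer than it did for the previous element
    first = values[0]
    count = 0
    for t in range(1, len(values)):
        current = math.floor(math.log(values[t] / first, 2))
        previous = math.floor(math.log(values[t - 1] / first, 2))
        if current != previous:
            count += 1
    return count

print(count_significant_changes([10, 15, 17, 20, 30, 40, 50, 70, 80, 60, 40, 20]))  # 5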
I have a bunch of images (300 images of 400 X 400 pixels) with filenames like:
001.bmp
002.bmp
003.bmp
...
First, I tried reading one of them, e.g. using imread I get a (400L, 400L, 3L) matrix; the problem is the 3L (I think it is RGB format), so the question here is: how can I read them and get a (400L, 400L, 1L) matrix, which is what I need to process them?
Second, I tried to read the 300 images using a loop like the following:
data = np.zeros((400,400,300))
for i in range(300):
    data[:,:,i] = imread('{0}.bmp'.format(i))
but it doesn't work; very probably my code is wrong. What I actually want is to concatenate the data of each of the 300 images (400 x 400) into a matrix of (400 x 400 x 300).
When trying to use:
data[:,:,i] = imread('{0}.bmp'.format(i))
it searches for '1.bmp' and not '001.bmp', but since the filenames go from 000 to 299 I have a problem with that, and I can't write '00{0}.bmp'.format(i) to complete the filename, because for two- and three-digit numbers I get '0012.bmp' or '00123.bmp'.
Well, after hours, I got it working by doing this:
arrays = []
for number in range(0, 300):  # range end is exclusive, so 300 covers 000.bmp through 299.bmp
    numstr = str(number).zfill(3)
    fname = numstr + '.bmp'
    a = imread(fname, flatten=1)
    arrays.append(a)
data = np.array(arrays)
This code works well. Thank you for giving me clues!
First, you are right that the last dimension is the color channels. I assume you want a grayscale image, which you can get with:
data = imread(fname, flatten=1)
That comes from the imread documentation here.
Second, your issue with the loop can be due to a couple of things. First, I don't see indentation in the code in your post, so make sure that is there on the loop body in the code that you are actually trying to run. Second, the code has a ".txt" extension. Are you sure you don't actually want ".bmp"?
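Putting the two parts together, a minimal sketch that yields the (400, 400, 300) layout asked for in the question (assuming an imread with a flatten flag, e.g. the scipy.misc one referenced above):

import numpy as np
from scipy.misc import imread  # assumed: any grayscale-capable imread works

data = np.zeros((400, 400, 300))
for i in range(300):
    fname = '{0:03d}.bmp'.format(i)           # zero-pads to 000.bmp ... 299.bmp
    data[:, :, i] = imread(fname, flatten=1)  # flatten=1 gives a single grayscale channel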
I'm looking to cut up image data into regularly sized screen blocks. Currently the method I've been using is this:
def getScreenBlocksFastNew(bmpstr):
    pixelData = array.array('c')
    step = imgWidth * 4
    pixelCoord = (blockY * blockSizeY * imgWidth +
                  blockSizeX * blockX) * 4
    for y in range(blockSizeY):
        pixelData.extend(bmpstr[pixelCoord : pixelCoord + blockSizeX * 4])
        pixelCoord += step
    return pixelData
bmpstr is a string of the raw pixel data, stored as one byte per RGBA value. (I also have the option of using a tuple of ints; they seem to take about the same amount of time either way.) This creates an array for one block of pixels, selected by setting blockX, blockY and blockSizeX, blockSizeY. Currently blockSizeX = blockSizeY = 22, which is the optimal screen block size for what I am doing.
My problem is that this process takes .0045 seconds per 5 executions, and extrapolating that out to the 2000+ screen blocks to fill the picture resolution requires about 1.7 seconds per picture, which is far too slow.
I am looking to make this process faster, but I'm not sure what the proper algorithm will be. I am looking to have my pixelData array pre-created so I don't have to reinstantiate it every time. However this leaves me with a question: what is the fastest way to copy the pixel RGBA values from bmpstr to an array, without using extend or append? Do I need to set each value individually? That can't be the most efficient way.
For example, how can I copy values bmpstr[0:100] into pixelData[0:100] without using extend or setting each value individually?
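For illustration, a minimal sketch of one way to avoid extend entirely: reinterpret the raw RGBA bytes as a NumPy array once, then take block views by slicing (the names mirror the question, but np.frombuffer and the imgHeight parameter are my additions, not the original code):

import numpy as np

def get_screen_block(bmpstr, imgWidth, imgHeight, blockX, blockY,
                     blockSizeX=22, blockSizeY=22):
    # view the raw bytes as (height, width, 4) without copying
    pixels = np.frombuffer(bmpstr, dtype=np.uint8).reshape(imgHeight, imgWidth, 4)
    y0 = blockY * blockSizeY
    x0 = blockX * blockSizeX
    # slicing returns a view; call .copy() only if an independent buffer is needed
    return pixels[y0:y0 + blockSizeY, x0:x0 + blockSizeX]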