I am writing some Python code to process a huge amount of data (almost 6 million records!).
In the code, I'm using a huge for loop to process each record. In that loop, I'm using the same variables on every pass and overwriting them. When I ran the program, I noticed that the longer it ran, the slower it got. Yet when I experimented further, I found that running the loop over records 10,000 - 10,100 on their own was just as fast as running it over records 0 to 100. Thus I concluded that, since I was not creating more variables and was merely processing existing ones, every time I overwrote a variable it must be being saved somewhere by Python.
So:
Am I right? Is Python saving my overwritten variables somewhere?
Or am I wrong? Is something else happening?
Python uses reference counting to keep track of objects. When all references to an object are gone, that object can be reclaimed. The cyclic garbage collector, however, runs at its own pace, not necessarily right away.
It could be that your code is creating garbage faster than Python collects it, or that there is something else wrong with your code. Since you didn't post any of your code, there's no real way to know.
Python does not keep a copy of a variable's old value when you overwrite it.
Possibly you are seeing the effect of various caches as the program slows down, or, if you are creating objects, the garbage collector being invoked to delete objects that are no longer referenced.
Do you have example code that shows this behavior you are seeing?
For example:
import hashlib
import random
import time
def test():
    t = []
    for i in xrange(20000):
        if (i == 0) | (i == 100) | (i == 10000) | (i == 10100):
            t.append(time.time())
        for j in range(1, 10):
            a = hashlib.sha512(str(random.random()))
            b = hashlib.sha512(str(random.random()))
            c = hashlib.sha512(str(random.random()))
            d = hashlib.sha512(str(random.random()))
            e = hashlib.sha512(str(random.random()))
            f = hashlib.sha512(str(random.random()))
            g = hashlib.sha512(str(random.random()))
    print t[1] - t[0], t[3] - t[2]
Then running 10 times:
>>> for i in range(10):
...     test()
0.0153768062592 0.0147190093994
0.0148379802704 0.0147860050201
0.0145788192749 0.0147390365601
0.0147459506989 0.0146520137787
0.0147008895874 0.0147621631622
0.0145609378815 0.0146908760071
0.0144789218903 0.014506816864
0.0146539211273 0.0145659446716
0.0145878791809 0.0146989822388
0.0146920681 0.0147240161896
This gives nearly identical times, to within standard error (especially if I exclude the very first interval, which was slightly slower because it had to initialize a, b, c, d, e, f and g for the first time).
When I run this code in an AWS Lambda function whose memory allocation is set to the maximum (10240 MB):
df_compare = first_less_dupes[compare_columns].compare(second_less_dupes[compare_columns])
I'm seeing this error:
Unable to allocate 185. MiB for an array with shape (2697080, 9) and data type float64 - Error type:<class 'numpy.core._exceptions._ArrayMemoryError'>
I've run this code many times with smaller DataFrames without issue. So I began attacking this from a memory capacity/clean-up approach, my assumption being: I need to free up memory. I use two snippets of code to audit my memory usage:
import os
import psutil

def print_current_memory():
    '''
    Gets the current process and checks current memory usage
    '''
    process = psutil.Process(os.getpid())
    mbs = round(process.memory_info().rss / 1024 / 1024, 2)
    print('Current memory usage:', mbs, 'MB')
And
import sys

for obj_name in list(locals().keys()):
    size = sys.getsizeof(locals()[obj_name])
    mbs = round(size / 1024 / 1024, 2)
    print(f'{obj_name}: {mbs}MB. {size}B.')
The print_current_memory function does just what its docstring says. The loop prints a list of all local variables and their sizes. Using the loop, I identified several objects that I did not need. (Strangely, the summed sizes of the listed objects should have greatly exceeded the Lambda limit, even before the error.)
So I delete those objects and garbage collect (I understand gc may not be necessary).
print_current_memory()
print('Deleting first & limited')
del first
del first_limited
print('Deleting second & limited')
del second
del second_limited
print('Deleting both_df')
del both_df
print('Garbage collecting')
gc.collect()
print_current_memory()
After running this, I see that the current memory usage doesn't decrease, so I am clearly doing something wrong. And that is my main concern: how do I decrease memory usage to make space for this new DataFrame? But perhaps I'm asking the wrong question and need to question my assumptions, like: Can I monitor current memory usage in a Lambda the same way I would on Windows? Am I deleting objects the right way? My use of gc probably illustrates how little I know about it, so am I using that correctly?
del does not actually destroy an object; it simply drops the reference tied to the variable name being deleted. You must make sure that every other reference is dropped too.
Then you might still need to wait for garbage collection to happen. However, with pandas and NumPy the underlying data lives in C-managed buffers, so the memory should be released as soon as the last reference to it is dropped.
Since you are working in an AWS Lambda, the data you transform does not need to be kept around; you just want your result out. So it should be safe to use inplace operations to replace the original data with your processed data and free up space. Perhaps a tutorial on memory-efficient pandas could get you started.
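A minimal sketch of that idea, reusing the variable names from the question (first_less_dupes, second_less_dupes and compare_columns are assumed to exist upstream; the column subsetting and deletions are illustrative, not the library's prescribed approach):

import gc

# Keep only the columns needed for the comparison, then drop every other
# reference to the full DataFrames before calling compare().
first_subset = first_less_dupes[compare_columns].copy()
second_subset = second_less_dupes[compare_columns].copy()
del first_less_dupes, second_less_dupes
gc.collect()  # usually unnecessary, but harmless right before a large allocation

df_compare = first_subset.compare(second_subset)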
I wrote a python script that backs up my files while I'm sleeping at night. The program is designed to run whenever the computer is on and to automatically shut down the computer after the backups are done. My code looks like this:
from datetime import datetime
from os import system
from backup import backup

while True:
    today = datetime.now()
    # Perform backups on Monday at 3:00am.
    if today.weekday() == 0 and today.hour == 3:
        print('Starting backups...')
        # Perform backups.
        backup("C:\\Users\\Jeff Moorhead\\Desktop", "E:\\")
        backup("C:\\Users\\Jeff Moorhead\\Desktop", "F:\\")
        backup("C:\\Users\\Jeff Moorhead\\OneDrive\\Documents", "E:\\")
        backup("C:\\Users\\Jeff Moorhead\\OneDrive\\Documents", "F:\\")
        # Shutdown computer after backups finish.
        system('shutdown /s /t 10')
        break
    else:
        del today
        continue
The backup function is from another file that I wrote to perform more customized backups on a case-by-case basis. This code all works perfectly fine, but I'm wondering if the del statement
del today
is really necessary. I put it in thinking that it would prevent my memory from getting filled up by thousands of datetime objects, but then I read that Python uses garbage collection, similar to Java. Further, does the today variable automatically get replaced with each pass through the while loop? I know that the program works as intended with the del statement, but if it is unnecessary, then I would like to get rid of it, if only for the sake of brevity! What are its actual effects on memory?
I put it in thinking that it would prevent my memory from getting filled up by thousands of datetime objects
The del statement is not necessary, you may simply remove that block. Python will free the space from those local variables automatically.
... but then I read that Python uses garbage collection, similar to Java.
The above statement is misguided: this has nothing to do with the garbage collector, which exists to break up circular references. In CPython, the memory is released when the object reference count decreases to zero, and that would occur even if the garbage collector is disabled.
Further, does the today variable automatically get replaced with each pass through the while loop? I know that the program works as intended with the del statement, but if it is unnecessary, then I would like to get rid of it if only for the sake of brevity! What are its actual effects on memory?
A new datetime object is created on each iteration of the loop.
The today name in scope will be rebound to the newly created datetime instance. The old datetime instance will be deleted because no references to it remain (the only existing reference is lost once you rebind the name today to a different object). Once again, I stress that this is just ref-counting and has nothing to do with gc.
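A tiny sketch (not from the original question; Tracker is a throwaway class defined only for this illustration) that makes the rebinding behaviour visible:

class Tracker:
    def __del__(self):
        # Runs when the object's reference count drops to zero.
        print('previous object reclaimed')

obj = Tracker()
obj = Tracker()  # rebinding 'obj' drops the last reference to the first Tracker,
                 # so CPython reclaims it immediately -- no del or gc call needed.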
On an unrelated note, your program will busy-loop and consume an entire CPU with this while loop. You should consider adding a call to time.sleep into the loop so the process will remain mostly idle. Or, better yet, schedule the task to run periodically using cron.
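For example, a sketch of the same loop with a sleep added (the 60-second interval is an arbitrary choice, not something from your code):

import time
from datetime import datetime

while True:
    today = datetime.now()
    if today.weekday() == 0 and today.hour == 3:
        # ... run the backups and shut down, as in the original script ...
        break
    time.sleep(60)  # wake up once a minute instead of spinning the CPU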
I'm using an in-house Python library for scientific computing. I need to consecutively copy an object, modify it, and then delete it. The object is huge which causes my machine to run out of memory after a few cycles.
The first problem is that I use Python's del to delete the object, which apparently only removes the name's reference to the object rather than freeing up RAM.
The second problem is that even when I encapsulate the whole process in a function, after the function is invoked, the RAM is still not freed up. Here's a code snippet to better explain the issue.
import gc
import openpnm as op  # assumed import; the library that provides Workspace and Cubic

ws = op.core.Workspace()
net = op.network.Cubic(shape=[100, 100, 100], spacing=1e-6)
proj = net.project

def f():
    for i in range(5):
        clone = ws.copy_project(proj)
        result = do_something_with(clone)  # placeholder for the real processing
        del clone

f()
gc.collect()
>>> ws
{'sim_01': [<openpnm.network.Cubic object at 0x7fed1c417780>],
'sim_02': [<openpnm.network.Cubic object at 0x7fed1c417888>],
'sim_03': [<openpnm.network.Cubic object at 0x7fed1c417938>],
'sim_04': [<openpnm.network.Cubic object at 0x7fed1c417990>],
'sim_05': [<openpnm.network.Cubic object at 0x7fed1c4179e8>],
'sim_06': [<openpnm.network.Cubic object at 0x7fed1c417a40>]}
My question is how do I completely delete a Python object?
Thanks!
PS. In the code snippet, each time ws.copy_project is called, a copy of proj is stored in ws dictionary.
There are some really smart python people on here. They may be able to tell you better ways to keep your memory clear, but I have used leaky libraries before, and found one (so-far) foolproof way to guarantee that your memory gets cleared after use: execute the memory hog in another process.
To do this, you'd need to arrange for an easy way to make your long calculation be executable separately. I have done this by adding special flags to my existing python script that tells it just to run that function; you may find it easier to put that function in a separate .py file, e.g.:
do_something_with.py

import sys

import openpnm as op  # assumed import, matching the snippet in the question

def main(args):
    # Your example is still too vague. Clearly, something differentiates
    # each do_something_with call, otherwise you're just taking the
    # same inputs 5 times over.
    # Whatever the difference is, pass it in as an argument to this script.
    ws = op.core.Workspace()
    net = op.network.Cubic(shape=[100, 100, 100], spacing=1e-6)
    proj = net.project
    # You may not even need to clone anymore?
    clone = ws.copy_project(proj)
    result = do_something_with(clone)  # the heavy function from your own library
    return 0

# Whatever arg(s) you need to get to the function, just pass them in on the command line.
if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
You can do this using any of the python tools that handle subprocesses. In python 3.5+, the recommended way to do this is subprocess.run. You could change your bigger function to something like this:
import subprocess

def invoke_do_something(i):
    completed = subprocess.run(["python", "do_something_with.py", str(i)], check=False)
    return completed.returncode

results = list(map(invoke_do_something, range(5)))  # list() forces the lazy map to run
You'll obviously need to tailor this to fit your own situation, but by running in a subprocess, you're guaranteed not to have to worry about the memory getting cleaned up. As an added bonus, you could potentially use multiprocessing.Pool.map to use multiple processors at one time. (I deliberately coded this with map to make such a transition simple. You could still use your for loop if you prefer, and then you don't need the invoke... function.) Multiprocessing could speed up your processing, but since you're already worried about memory, it is almost certainly a bad idea: with multiple copies of the big memory hog running at once, your system will likely run out of memory and kill your processes.
Your example is fairly vague, so I've written this at a high level. I can answer some questions if you need.
For some background on my problem, I'm importing a module, data_read_module.pyd, written by someone else, and I cannot see the contents of that module.
I have one file, let's call it myfunctions. Ignore the ### for now; I'll comment on the commented portions later.
import data_read_module

def processData(fname):
    data = data_read_module.read_data(fname)
    ''' process data here '''
    return t, x
    ### return 1
I call this within the framework of a larger program, a Tkinter GUI specifically. For purposes of this post, I've pared it down to the bare essentials. Within the GUI code, I call the above as follows:
import myfunctions

class MyApplication:
    def __init__(self, parent):
        self.t = []
        self.x = []

    def openFileAndProcessData(self):
        # self.t = None
        # self.x = None
        self.t, self.x = myfunctions.processData(fname)
        ## myfunctions.processData(fname)
I noticed that every time I run openFileAndProcessData, Windows Task Manager reports that my memory usage increases, so I thought that I had a memory leak somewhere in my GUI application. So the first thing I tried is the
# self.t = None
# self.x = None
that you see commented above. Next, I tried calling myfunctions.processData without assigning the output to any variables as follows:
## myfunctions.processData(fname)
This also had no effect. As a last ditch effort, I changed the processData function so it simply returns 1 without even processing any of the data that comes from the module, data_read_module.pyd. Unfortunately, even this results in more memory being taken up with each successive call to processData, which narrows the problem down to data_read_module.read_data. I thought that within the Python framework, this is the exact type of thing that is automatically taken care of. Referring to this website, it seems that memory taken up by a function will be released when the function terminates. In my case, I would expect the memory used in processData to be released after a call [with the exception of the output that I am keeping track of with self.t and self.x]. I understand I won't get a fix to this kind of issue without access to data_read_module.pyd, but I'd like to understand how this can happen to begin with.
A .pyd file is basically a DLL. You're calling code written in C, C++, or another such compiled language. If that code allocates memory and doesn't release it properly, you will get a memory leak. The fact that the code is being called from Python won't magically fix it.
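If you want evidence that the growth is not coming from Python-level objects before blaming the extension, one option is a quick tracemalloc check. This is a generic sketch, not from the original post: tracemalloc only tracks allocations made through Python's allocator, so if its totals stay flat while Task Manager's number keeps climbing, the leak is inside the native code.

import tracemalloc

import myfunctions

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for _ in range(10):
    myfunctions.processData(fname)  # fname is assumed to be defined, as in the question

snapshot = tracemalloc.take_snapshot()
# Print the ten biggest growths in Python-level allocations since the baseline.
for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
    print(stat)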
I have a tab-separated data file with a little over 2 million lines and 19 columns.
You can find it, in US.zip: http://download.geonames.org/export/dump/.
I originally ran the following with for l in f.readlines(). I understand that just iterating over the file is supposed to be more efficient, so I'm posting that version below. Still, even with this small optimization, I'm using 30% of my memory on the process and have only done about 6.5% of the records. It looks like, at this pace, it will run out of memory like it did before. Also, the function I have is very slow. Is there anything obvious I can do to speed it up? Would it help to del the objects with each pass of the for loop?
def run():
    from geonames.models import POI
    f = file('data/US.txt')
    for l in f:
        li = l.split('\t')
        try:
            p = POI()
            p.geonameid = li[0]
            p.name = li[1]
            p.asciiname = li[2]
            p.alternatenames = li[3]
            p.point = "POINT(%s %s)" % (li[5], li[4])
            p.feature_class = li[6]
            p.feature_code = li[7]
            p.country_code = li[8]
            p.ccs2 = li[9]
            p.admin1_code = li[10]
            p.admin2_code = li[11]
            p.admin3_code = li[12]
            p.admin4_code = li[13]
            p.population = li[14]
            p.elevation = li[15]
            p.gtopo30 = li[16]
            p.timezone = li[17]
            p.modification_date = li[18]
            p.save()
        except IndexError:
            pass

if __name__ == "__main__":
    run()
EDIT, More details (the apparently important ones):
The memory consumption is going up as the script runs and saves more lines.
The .save() method is an adulterated Django model save() with a unique_slug snippet that writes to a PostgreSQL/PostGIS database.
SOLVED: DEBUG database logging in Django eats memory.
Make sure that Django's DEBUG setting is set to False
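For context: when DEBUG is True, Django keeps every executed SQL query in memory on django.db.connection.queries, so a long import loop grows without bound. A hedged sketch of the two usual fixes:

# settings.py -- the primary fix: never run bulk imports with DEBUG on.
DEBUG = False

# Alternatively, if DEBUG has to stay on, clear the stored query log
# periodically inside the import loop:
from django import db

db.reset_queries()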
This looks perfectly fine to me. Iterating over the file like that or using xreadlines() will read each line as needed (with sane buffering behind the scenes). Memory usage should not grow as you read in more and more data.
As for performance, you should profile your app. Most likely the bottleneck is somewhere in a deeper function, like POI.save().
There's no reason to worry based on the data you've given us: is memory consumption going UP as you read more and more lines? That would be cause for worry, but there's no indication it would happen in the code you've shown, assuming that p.save() saves the object to some database or file and not in memory, of course. There's nothing real to be gained by adding del statements, as the memory gets recycled on each leg of the loop anyway.
This could be sped up if there's a faster way to populate a POI instance than binding its attributes one by one -- e.g., passing those attributes (maybe as keyword arguments? positional would be faster...) to the POI constructor. But whether that's the case depends on that geonames.models module, of which I know nothing, so I can only offer very generic advice -- e.g., if the module lets you save a bunch of POIs in a single gulp, then making them (say) 100 at a time and saving them in bunches should yield a speedup (at the cost of slightly higher memory consumption).
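As a hedged sketch of that batching idea, here is one way it might look with Django's bulk_create (note this bypasses the custom save() mentioned in the edit, so the unique_slug logic would need separate handling; the field names come from the code above and the batch size of 100 is arbitrary):

from geonames.models import POI

BATCH_SIZE = 100

def run():
    batch = []
    with open('data/US.txt') as f:
        for line in f:
            li = line.rstrip('\n').split('\t')
            if len(li) < 19:
                continue  # skip malformed rows instead of catching IndexError
            batch.append(POI(
                geonameid=li[0], name=li[1], asciiname=li[2],
                alternatenames=li[3], point="POINT(%s %s)" % (li[5], li[4]),
                feature_class=li[6], feature_code=li[7], country_code=li[8],
                ccs2=li[9], admin1_code=li[10], admin2_code=li[11],
                admin3_code=li[12], admin4_code=li[13], population=li[14],
                elevation=li[15], gtopo30=li[16], timezone=li[17],
                modification_date=li[18],
            ))
            if len(batch) >= BATCH_SIZE:
                POI.objects.bulk_create(batch)  # one INSERT per batch of 100
                batch = []
    if batch:
        POI.objects.bulk_create(batch)  # flush the final partial batch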