RAM is not freed after a Python function is invoked - python

I'm using an in-house Python library for scientific computing. I need to consecutively copy an object, modify it, and then delete it. The object is huge which causes my machine to run out of memory after a few cycles.
The first problem is that I use python's del to delete the object, which apparently only dereferences the object, rather than freeing up RAM.
The second problem is that even when I encapsulate the whole process in a function, after the function is invoked, the RAM is still not freed up. Here's a code snippet to better explain the issue.
ws = op.core.Workspace()
net = op.network.Cubic(shape=[100,100,100], spacing=1e-6)
proj = net.project
def f():
for i in range(5):
clone = ws.copy_project(proj)
result = do_something_with(clone)
del clone
f()
gc.collect()
>>> ws
{'sim_01': [<openpnm.network.Cubic object at 0x7fed1c417780>],
'sim_02': [<openpnm.network.Cubic object at 0x7fed1c417888>],
'sim_03': [<openpnm.network.Cubic object at 0x7fed1c417938>],
'sim_04': [<openpnm.network.Cubic object at 0x7fed1c417990>],
'sim_05': [<openpnm.network.Cubic object at 0x7fed1c4179e8>],
'sim_06': [<openpnm.network.Cubic object at 0x7fed1c417a40>]}
My question is how do I completely delete a Python object?
Thanks!
PS. In the code snippet, each time ws.copy_project is called, a copy of proj is stored in ws dictionary.

There are some really smart python people on here. They may be able to tell you better ways to keep your memory clear, but I have used leaky libraries before, and found one (so-far) foolproof way to guarantee that your memory gets cleared after use: execute the memory hog in another process.
To do this, you'd need to arrange for an easy way to make your long calculation be executable separately. I have done this by adding special flags to my existing python script that tells it just to run that function; you may find it easier to put that function in a separate .py file, e.g.:
do_something_with.py
import sys
def do_something_with(i)
# Your example is still too vague. Clearly, something differentiates
# each do_something_with, otherwise you're just taking the
# same inputs 5 times over.
# Whatever the difference is, pass it in as an argument to the function
ws = op.core.Workspace()
net = op.network.Cubic(shape=[100,100,100], spacing=1e-6)
proj = net.project
# You may not even need to clone anymore?
clone = ws.copy_project(proj)
result = do_something_with(clone)
# Whatever arg(s) you need to get to the function, just pass it in on the command line
if __name__ == "__main__":
sys.exit(do_something_with(sys.args[1:]))
You can do this using any of the python tools that handle subprocesses. In python 3.5+, the recommended way to do this is subprocess.run. You could change your bigger function to something like this:
import subprocess
invoke_do_something(i):
completed_args = subprocess.run(["python", "do_something_with.py", str(i)], check=False)
return completed_args.returncode
results = map(invoke_do_something, range(5))
You'll obviously need to tailor this to fit your own situation, but by running in a subprocess, you're guaranteed to not have to worry about the memory getting cleaned up. As an added bonus, you could potentially use multiprocess.Pool.map to use multiple processors at one time. (I deliberately coded this to use map to make such a transition simple. You could still use your for loop if you prefer, and then you don't need the invoke... function.) Multiprocessing could speed up your processing, but since you're already worried about memory, is almost certainly a bad idea - with multiple processes of the big memory hog, your system itself will likely quickly run out of memory and kill your process.
Your example is fairly vague, so I've written this at a high level. I can answer some questions if you need.

Related

Deleted objects in Python 3 not being released from memory? [duplicate]

I wrote a python script that backs up my files while I'm sleeping at night. The program is designed to run whenever the computer is on and to automatically shut down the computer after the backups are done. My code looks like this:
from datetime import datetime
from os import system
from backup import backup
while True:
today = datetime.now()
# Perform backups on Monday at 3:00am.
if today.weekday() == 0 and today.hour == 3:
print('Starting backups...')
# Perform backups.
backup("C:\\Users\\Jeff Moorhead\\Desktop", "E:\\")
backup("C:\\Users\\Jeff Moorhead\\Desktop", "F:\\")
backup("C:\\Users\\Jeff Moorhead\\OneDrive\\Documents", "E:\\")
backup("C:\\Users\\Jeff Moorhead\\OneDrive\\Documents", "F:\\")
# Shutdown computer after backups finish.
system('shutdown /s /t 10')
break
else:
del today
continue
The backup function is from another file that I wrote to perform more customized backups on a case by by case basis. This code all works perfectly fine, but I'm wondering if the del statement
del today
is really necessary. I put it in thinking that it would prevent my memory from getting filled up by thousands of datetime objects, but then I read that Python uses garbage collection, similar to Java. Further, does the todayvariable automatically get replaced with each pass through the while loop? I know that the program works as intended with the del statement, but if it is unnecessary, then I would like to get rid of it if only for the sake of brevity! What are it's actual effects on memory?
I put it in thinking that it would prevent my memory from getting filled up by thousands of datetime objects
The del statement is not necessary, you may simply remove that block. Python will free the space from those local variables automatically.
... but then I read that Python uses garbage collection, similar to Java.
The above statement is misguided: this has nothing to do with the garbage collector, which exists to break up circular references. In CPython, the memory is released when the object reference count decreases to zero, and that would occur even if the garbage collector is disabled.
Further, does the today variable automatically get replaced with each pass through the while loop? I know that the program works as intended with the del statement, but if it is unnecessary, then I would like to get rid of it if only for the sake of brevity! What are it's actual effects on memory?
A new datetime object is created on each iteration of the loop.
The today name in scope will be rebound to the newly created datetime instance. The old datetime instance will be deleted because no reference exists on it (since the only existing reference is lost once you rebound the name today to a different object). Once again, I stress that this is just ref-counting and has nothing to do with gc.
On an unrelated note, your program will busy-loop and consume an entire CPU with this while loop. You should consider adding a call to time.sleep into the loop so the process will remain mostly idle. Or, better yet, schedule the task to run periodically using cron.

Segmentation fault when initializing array

I am getting a segmentation fault when initializing an array.
I have a callback function from when an RFID tag gets read
IDS = []
def readTag(e):
epc = str(e.epc, 'utf-8')
if not epc in IDS:
now = datetime.datetime.now().strftime('%m/%d/%Y %H:%M:%S')
IDS.append([epc, now, "name.instrument"])
and a main function from which it's called
def main():
for x in vals:
IDS.append([vals[0], vals[1], vals[2]])
for x in IDS:
print(x[0])
r = mercury.Reader("tmr:///dev/ttyUSB0", baudrate=9600)
r.set_region("NA")
r.start_reading(readTag, on_time=1500)
input("press any key to stop reading: ")
r.stop_reading()
The error occurs because of the line IDS.append([epc, now, "name.instrument"]). I know because when I replace it with a print call instead the program will run just fine. I've tried using different types for the array objects (integers), creating an array of the same objects outside of the append function, etc. For some reason just creating an array inside the "readTag" function causes the segmentation fault like row = [1,2,3]
Does anyone know what causes this error and how I can fix it? Also just to be a little more specific, the readTag function will work fine for the first two (only ever two) calls, but then it crashes and the Reader object that has the start_reading() function is from the mercury-api
This looks like a scoping issue to me; the mercury library doesn't have permission to access your list's memory address, so when it invokes your callback function readTag(e) a segfault occurs. I don't think that the behavior that you want is supported by that library
To extend Michael's answer, this appears to be an issue with scoping and the API you're using. In general pure-Python doesn't seg-fault. Or at least, it shouldn't seg-fault unless there's a bug in the interpreter, or some extension that you're using. That's not to say pure-Python won't break, it's just that a genuine seg-fault indicates the problem is probably the result of something messy outside of your code.
I'm assuming you're using this Python API.
In that case, the README.md mentions that the Reader.start_reader() method you're using is "asynchronous". Meaning it invokes a new thread or process and returns immediately and then the background thread continues to call your callback each time something is scanned.
I don't really know enough about the nitty gritty of CPython to say exactly what going on, but you've declared IDS = [] as a global variable and it seems like the background thread is running the callback with a different context to the main program. So when it attempts to access IDS it's reading memory it doesn't own, hence the seg-fault.
Because of how restrictive the callback is and the apparent lack of a buffer, this might be an oversight on the behalf of the developer. If you really need asynchronous reads it's worth sending them an issue report.
Otherwise, considering you're just waiting for input you probably don't need the asynchronous reads, and you could use the synchronous Reader.read() method inside your own busy loop instead with something like:
try:
while True:
readTags(r.read(timeout=10))
except KeyboardInterrupt: ## break loop on SIGINT (Ctrl-C)
pass
Note that r.read() returns a list of tags rather than just one, so you'd need to modify your callback slightly, and if you're writing more than just a quick script you probably want to use threads to interrupt the loop properly as SIGINT is pretty hacky.

Python Memory Leak - Why is it happening?

For some background on my problem, I'm importing a module, data_read_module.pyd, written by someone else, and I cannot see the contents of that module.
I have one file, let's called it myfunctions. Ignore the ### for now, I'll comment on the commented portions later.
import data_read_module
def processData(fname):
data = data_read_module.read_data(fname)
''' process data here '''
return t, x
### return 1
I call this within the framework of a larger program, a TKinter GUI specifically. For purposes of this post, I've pared down to the bare essentials. Within the GUI code, I call the above as follows:
import myfunctions
class MyApplication:
def __init__(self,parent):
self.t = []
self.x = []
def openFileAndProcessData(self):
# self.t = None
# self.x = None
self.t,self.x = myfunctions.processData(fname)
## myfunctions.processData(fname)
I noticed what every time I run openFileAndProcessData, Windows Task Manager reports that my memory usage increases, so I thought that I had a memory leak somewhere in my GUI application. So the first thing I tried is the
# self.t = None
# self.x = None
that you see commented above. Next, I tried calling myfunctions.processData without assigning the output to any variables as follows:
## myfunctions.processData(fname)
This also had no effect. As a last ditch effort, I changed the processData function so it simply returns 1 without even processing any of the data that comes from the module, data_read_module.pyd. Unfortunately, even this results in more memory being taken up with each successive call to processData, which narrows the problem down to data_read_module.read_data. I thought that within the Python framework, this is the exact type of thing that is automatically taken care of. Referring to this website, it seems that memory taken up by a function will be released when the function terminates. In my case, I would expect the memory used in processData to be released after a call [with the exception of the output that I am keeping track of with self.t and self.x]. I understand I won't get a fix to this kind of issue without access to data_read_module.pyd, but I'd like to understand how this can happen to begin with.
A .pyd file is basically a DLL. You're calling code written in C, C++, or another such compiled language. If that code allocates memory and doesn't release it properly, you will get a memory leak. The fact that the code is being called from Python won't magically fix it.

Python Mock Process for Unit Testing

Background:
I am currently writing a process monitoring tool (Windows and Linux) in Python and implementing unit test coverage. The process monitor hooks into the Windows API function EnumProcesses on Windows and monitors the /proc directory on Linux to find current processes. The process names and process IDs are then written to a log which is accessible to the unit tests.
Question:
When I unit test the monitoring behavior I need a process to start and terminate. I would love if there would be a (cross-platform?) way to start and terminate a fake system process that I could uniquely name (and track its creation in a unit test).
Initial ideas:
I could use subprocess.Popen() to open any system process but this runs into some issues. The unit tests could falsely pass if the process I'm using to test is run by the system as well. Also, the unit tests are run from the command line and any Linux process I can think of suspends the terminal (nano, etc.).
I could start a process and track it by its process ID but I'm not exactly sure how to do this without suspending the terminal.
These are just thoughts and observations from initial testing and I would love it if someone could prove me wrong on either of these points.
I am using Python 2.6.6.
Edit:
Get all Linux process IDs:
try:
processDirectories = os.listdir(self.PROCESS_DIRECTORY)
except IOError:
return []
return [pid for pid in processDirectories if pid.isdigit()]
Get all Windows process IDs:
import ctypes, ctypes.wintypes
Psapi = ctypes.WinDLL('Psapi.dll')
EnumProcesses = self.Psapi.EnumProcesses
EnumProcesses.restype = ctypes.wintypes.BOOL
count = 50
while True:
# Build arguments to EnumProcesses
processIds = (ctypes.wintypes.DWORD*count)()
size = ctypes.sizeof(processIds)
bytes_returned = ctypes.wintypes.DWORD()
# Call enum processes to find all processes
if self.EnumProcesses(ctypes.byref(processIds), size, ctypes.byref(bytes_returned)):
if bytes_returned.value &lt size:
return processIds
else:
# We weren't able to get all the processes so double our size and try again
count *= 2
else:
print "EnumProcesses failed"
sys.exit()
Windows code is from here
edit: this answer is getting long :), but some of my original answer still applies, so I leave it in :)
Your code is not so different from my original answer. Some of my ideas still apply.
When you are writing Unit Test, you want to only test your logic. When you use code that interacts with the operating system, you usually want to mock that part out. The reason being that you don't have much control over the output of those libraries, as you found out. So it's easier to mock those calls.
In this case, there are two libraries that are interacting with the sytem: os.listdir and EnumProcesses. Since you didn't write them, we can easily fake them to return what we need. Which in this case is a list.
But wait, in your comment you mentioned:
"The issue I'm having with it however is that it really doesn't test
that my code is seeing new processes on the system but rather that the
code is correctly monitoring new items in a list."
The thing is, we don't need to test the code that actually monitors the processes on the system, because it's a third party code. What we need to test is that your code logic handles the returned processes. Because that's the code you wrote. The reason why we are testing over a list, is because that's what your logic is doing. os.listir and EniumProcesses return a list of pids (numeric strings and integers, respectively) and your code acts on that list.
I'm assuming your code is inside a Class (you are using self in your code). I'm also assuming that they are isolated inside their own methods (you are using return). So this will be sort of what I suggested originally, except with actual code :) Idk if they are in the same class or different classes, but it doesn't really matter.
Linux method
Now, testing your Linux process function is not that difficult. You can patch os.listdir to return a list of pids.
def getLinuxProcess(self):
try:
processDirectories = os.listdir(self.PROCESS_DIRECTORY)
except IOError:
return []
return [pid for pid in processDirectories if pid.isdigit()]
Now for the test.
import unittest
from fudge import patched_context
import os
import LinuxProcessClass # class that contains getLinuxProcess method
def test_LinuxProcess(self):
"""Test the logic of our getLinuxProcess.
We patch os.listdir and return our own list, because os.listdir
returns a list. We do this so that we can control the output
(we test *our* logic, not a built-in library's functionality).
"""
# Test we can parse our pdis
fakeProcessIds = ['1', '2', '3']
with patched_context(os, 'listdir', lamba x: fakeProcessIds):
myClass = LinuxProcessClass()
....
result = myClass.getLinuxProcess()
expected = [1, 2, 3]
self.assertEqual(result, expected)
# Test we can handle IOERROR
with patched_context(os, 'listdir', lamba x: raise IOError):
myClass = LinuxProcessClass()
....
result = myClass.getLinuxProcess()
expected = []
self.assertEqual(result, expected)
# Test we only get pids
fakeProcessIds = ['1', '2', '3', 'do', 'not', 'parse']
.....
Windows method
Testing your Window's method is a little trickier. What I would do is the following:
def prepareWindowsObjects(self):
"""Create and set up objects needed to get the windows process"
...
Psapi = ctypes.WinDLL('Psapi.dll')
EnumProcesses = self.Psapi.EnumProcesses
EnumProcesses.restype = ctypes.wintypes.BOOL
self.EnumProcessses = EnumProcess
...
def getWindowsProcess(self):
count = 50
while True:
.... # Build arguments to EnumProcesses and call enun process
if self.EnumProcesses(ctypes.byref(processIds),...
..
else:
return []
I separated the code into two methods to make it easier to read (I believe you are already doing this). Here is the tricky part, EnumProcesses is using pointers and they are not easy to play with. Another thing is, that I don't know how to work with pointers in Python, so I couldn't tell you of an easy way to mock that out =P
What I can tell you is to simply not test it. Your logic there is very minimal. Besides increasing the size of count, everything else in that function is creating the space EnumProcesses pointers will use. Maybe you can add a limit to the count size but other than that, this method is short and sweet. It returns the windows processes and nothing more. Just what I was asking for in my original comment :)
So leave that method alone. Don't test it. Make sure though, that anything that uses getWindowsProcess and getLinuxProcess get's mocked out as per my original suggestion.
Hopefully this makes more sense :) If it doesn't let me know and maybe we can have a chat session or do a video call or something.
original answer
I'm not exactly sure how to do what you are asking, but whenever I need to test code that depends on some outside force (external libraries, popen or in this case processes) I mock out those parts.
Now, I don't know how your code is structured, but maybe you can do something like this:
def getWindowsProcesses(self, ...):
'''Call Windows API function EnumProcesses and
return the list of processes
'''
# ... call EnumProcesses ...
return listOfProcesses
def getLinuxProcesses(self, ...):
'''Look in /proc dir and return list of processes'''
# ... look in /proc ...
return listOfProcessses
These two methods only do one thing, get the list of processes. For Windows, it might just be a call to that API and for Linux just reading the /proc dir. That's all, nothing more. The logic for handling the processes will go somewhere else. This makes these methods extremely easy to mock out since their implementations are just API calls that return a list.
Your code can then easy call them:
def getProcesses(...):
'''Get the processes running.'''
isLinux = # ... logic for determining OS ...
if isLinux:
processes = getLinuxProcesses(...)
else:
processes = getWindowsProcesses(...)
# ... do something with processes, write to log file, etc ...
In your test, you can then use a mocking library such as Fudge. You mock out these two methods to return what you expect them to return.
This way you'll be testing your logic since you can control what the result will be.
from fudge import patched_context
...
def test_getProcesses(self, ...):
monitor = MonitorTool(..)
# Patch the method that gets the processes. Whenever it gets called, return
# our predetermined list.
originalProcesses = [....pids...]
with patched_context(monitor, "getLinuxProcesses", lamba x: originalProcesses):
monitor.getProcesses()
# ... assert logic is right ...
# Let's "add" some new processes and test that our logic realizes new
# processes were added.
newProcesses = [...]
updatedProcesses = originalProcessses + (newProcesses)
with patched_context(monitor, "getLinuxProcesses", lamba x: updatedProcesses):
monitor.getProcesses()
# ... assert logic caught new processes ...
# Let's "kill" our new processes and test that our logic can handle it
with patched_context(monitor, "getLinuxProcesses", lamba x: originalProcesses):
monitor.getProcesses()
# ... assert logic caught processes were 'killed' ...
Keep in mind that if you test your code this way, you won't get 100% code coverage (since your mocked methods won't be run), but this is fine. You're testing your code and not third party's, which is what matters.
Hopefully this might be able to help you. I know it doesn't answer your question, but maybe you can use this to figure out the best way to test your code.
Your original idea of using subprocess is a good one. Just create your own executable and name it something that identifies it as a testing thing. Maybe make it do something like sleep for a while.
Alternately, you could actually use the multiprocessing module. I've not used python in windows much, but you should be able to get process identifying data out of the Process object you create:
p = multiprocessing.Process(target=time.sleep, args=(30,))
p.start()
pid = p.getpid()

Save memory in Python. How to iterate over the lines and save them efficiently with a 2million line file?

I have a tab-separated data file with a little over 2 million lines and 19 columns.
You can find it, in US.zip: http://download.geonames.org/export/dump/.
I started to run the following but with for l in f.readlines(). I understand that just iterating over the file is supposed to be more efficient so I'm posting that below. Still, with this small optimization, I'm using 30% of my memory on the process and have only done about 6.5% of the records. It looks like, at this pace, it will run out of memory like it did before. Also, the function I have is very slow. Is there anything obvious I can do to speed it up? Would it help to del the objects with each pass of the for loop?
def run():
from geonames.models import POI
f = file('data/US.txt')
for l in f:
li = l.split('\t')
try:
p = POI()
p.geonameid = li[0]
p.name = li[1]
p.asciiname = li[2]
p.alternatenames = li[3]
p.point = "POINT(%s %s)" % (li[5], li[4])
p.feature_class = li[6]
p.feature_code = li[7]
p.country_code = li[8]
p.ccs2 = li[9]
p.admin1_code = li[10]
p.admin2_code = li[11]
p.admin3_code = li[12]
p.admin4_code = li[13]
p.population = li[14]
p.elevation = li[15]
p.gtopo30 = li[16]
p.timezone = li[17]
p.modification_date = li[18]
p.save()
except IndexError:
pass
if __name__ == "__main__":
run()
EDIT, More details (the apparently important ones):
The memory consumption is going up as the script runs and saves more lines.
The method, .save() is an adulterated django model method with unique_slug snippet that is writing to a postgreSQL/postgis db.
SOLVED: DEBUG database logging in Django eats memory.
Make sure that Django's DEBUG setting is set to False
This looks perfectly fine to me. Iterating over the file like that or using xreadlines() will read each line as needed (with sane buffering behind the scenes). Memory usage should not grow as you read in more and more data.
As for performance, you should profile your app. Most likely the bottleneck is somewhere in a deeper function, like POI.save().
There's no reason to worry in the data you've given us: is memory consumption going UP as you read more and more lines? Now that would be cause for worry -- but there's no indication that this would happen in the code you've shown, assuming that p.save() saves the object to some database or file and not in memory, of course. There's nothing real to be gained by adding del statements, as the memory is getting recycled at each leg of the loop anyway.
This could be sped up if there's a faster way to populate a POI instance than binding its attributes one by one -- e.g., passing those attributes (maybe as keyword arguments? positional would be faster...) to the POI constructor. But whether that's the case depends on that geonames.models module, of which I know nothing, so I can only offer very generic advice -- e.g., if the module lets you save a bunch of POIs in a single gulp, then making them (say) 100 at a time and saving them in bunches should yield a speedup (at the cost of slightly higher memory consumption).

Categories