clean up re.match objects - python

This loop is used in barcode-scanning software. It runs once per scanned barcode, which is hundreds of times an hour.
# locpats is a list of regular expression patterns of possible depot locations
for pat in locpats:
    q = re.match(pat, scannedcode)
    if q:
        print(q)
        return True
q is a Match object. The print(q) tells me that every match object gets its own little piece of memory. They'll add up, and I have no idea to what total amount.
I don't need the Match object anymore once inside the if. Should I wipe it, like so?
q = re.match(pat, scannedcode)
if q:
    q = None
    return True
Or is there a cleaner way? Should I bother at all?
If I understand right (from this), garbage collection with gc.collect() won't happen until the process terminates, which in my case is at the end of the day when the user is done scanning. Until then, these objects won't even be regarded as garbage.

CPython uses reference counting (plus some cyclic reference detection, not applicable here) to handle garbage collection of objects. Once an object reaches 0 extant references, it will be immediately gc'd.
In the case of your loop:
for pat in locpats:
    q = re.match(pat, scannedcode)
Each successive pat in locpats binds a new Match object to q. The previous Match object then has 0 remaining references and is immediately garbage collected. A similar situation applies when you return from your function.
This is all an implementation detail of CPython; other flavors of Python will handle gc differently. In all cases, don't prematurely optimize. Unless you can pinpoint a specific reason to do so, leaving the gc alone is likely to be the most performant solution.
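A minimal sketch of that behaviour, using a stand-in class with a __del__ hook (Tracked is purely illustrative; a Match object is reclaimed the same way under CPython's refcounting):

class Tracked:
    def __del__(self):          # runs when the object is reclaimed
        print("collected")

q = Tracked()
q = Tracked()   # rebinding q drops the first object's last reference;
                # on CPython, "collected" prints immediately
q = None        # the second instance goes the same way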

This is not a problem, since q is local, and therefore won't persist after you return.
If you want to make yourself feel better, you can try
if re.match(pat, scannedcode):
    return True
which will do what you're doing now without ever naming the match - but it won't change your memory footprint.
(I'm assuming that you don't care about the printed value at all, it's just diagnostic)
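If you want to go one step further, the whole loop collapses into a single any() call, which short-circuits at the first matching pattern and never binds a Match object to a name (a sketch; is_depot_location is a hypothetical wrapper name):

import re

def is_depot_location(scannedcode, locpats):
    # any() stops at the first pattern that matches; each Match object
    # is discarded as soon as its truthiness has been tested
    return any(re.match(pat, scannedcode) for pat in locpats)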

If your print statement is showing that each match is getting its own piece of memory then it looks like one of two things is happening:
1) As others have mentioned you are not using CPython as your interpreter and the interpreter you have chosen is doing something strange with garbage collection
2) There is code you haven't shown us here which is keeping a reference to the match object so that the GC code never frees it as the reference count to the match object never reaches zero
Is either of these the case?
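For what it's worth, one quick way to rule out the first possibility is to ask the interpreter what it is:

import platform

print(platform.python_implementation())  # 'CPython', 'PyPy', 'IronPython', 'Jython', ...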

Related

Is there a way to check if a generator is in use anywhere globally? Such that an active generator will bail if no one is using it

Is there a way to check if a generator is in use anywhere globally? Such that an active generator will bail if no one is using it.
This is mostly academic but I can think of numerous situations where it would be good to detect this. So you understand, here is an example:
def accord():
    _accord = None
    _inuse = lambda: someutilmodule.scopes_using(_accord) > 1
    def gen():
        uid = 0
        while _inuse():
            uid += 1
            yield uid
        else:
            print("I'm done, although you obviously forgot about me.")
    _accord = gen()
    return _accord
a = accord()
a.__next__()
a.__next__()
a.__next__()
a = None
"""
<<< 1
<<< 2
<<< 3
<<< I'm done, although you obviously forgot about me.
"""
The triple quote is the text I would expect to see if someutilmodule.scopes_using reported the number of uses of the variable. By uses I mean how many copies or references exist.
Note that the generator has an infinite loop, which is generally bad practice, but in cases like a unique-id generator that isn't widely or complexly used, it is often useful and won't create huge overhead. Obviously another way would simply be to expose a function or method that sets the flag the loop uses as its condition. But again, it's good to know the various ways to do things.
In this case, when you do
a = accord()
A reference counter behind the scenes keeps track of the fact that a variable is referencing that generator object. This keeps it in memory because there's a chance it may be needed in the future.
Once you do this however:
a = None
The reference to the generator is lost, and the reference counter associated with it is decremented. Once it reaches 0 (which it would, because you only had one reference to it), the system knows that nothing can ever refer to that object again, which frees the data associated with that object up for garbage collection.
This is all handled behind the scenes. There's no need for you to intervene.
The best way to see what's going on, for better or worse, is to examine the relevant source code for CPython. Ultimately, _Py_DECREF is called when references are lost. You can see a little further down, after interpreting some convoluted logic, that once the reference is 0, _Py_Dealloc(op); is called on PyObject *op. I can't for the life of me find the actual call to free though that I'm sure ultimately results from _Py_Dealloc. It seems to be somewhere in the Py_TRASHCAN_END macro, but good lord. That's one of the longest rabbit holes I've ever gone down where I have nothing to show for it.
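For the behaviour the question actually asks for, there is a standard hook that gets most of the way there without a hypothetical someutilmodule: when CPython garbage-collects a suspended generator, it calls its close(), which raises GeneratorExit at the paused yield. A sketch under that assumption:

def accord():
    def gen():
        uid = 0
        try:
            while True:
                uid += 1
                yield uid
        except GeneratorExit:
            # close() is invoked for us when the generator's refcount hits zero
            print("I'm done, although you obviously forgot about me.")
    return gen()

a = accord()
print(next(a))  # 1
print(next(a))  # 2
print(next(a))  # 3
a = None        # on CPython: last reference gone -> close() -> GeneratorExit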

Creating a dynamic array name in python

I've been running multiple threads (one per "symbol" below) but have encountered a weird issue: there appears to be a potential memory leak depending on which thread gets processed first. I believe the issue is due to me using the same field name / array name in each thread.
Below is an example of the code I am running to assign values to an array:
for i in range(level+1):
    accounting_price[i] = yahoo_prices[j]['accg'][i][0]
It works fine, but when I query multiple "symbols" and run a thread for each symbol, I sometimes get symbol A's "accounting_price[i]" returned in symbol C's results and vice versa. Not sure if this could be a memory leak from one thread to the other, but the only quick solution I have is to make "accounting_price[i]" unique to each symbol. Would it be correct if I implement the below?
symbol = "AAPL"
d = {}
for i in range(level+1):
d['accounting_price_{}'.format(symbol)][i] = yahoo_prices[j]['accg'][i][0]
When I run it, I get an error thrown up.
I would be extremely grateful for a solution on how to dynamically create unique arrays to each thread. Alternatively, a solution to the "memory leak".
Thanks!
If you think there’s a race here causing conflicting writes to the dict, using a lock is both the best way to rule that out, and probably the best solution if you’re right.
I should point out that in general, thanks to the global interpreter lock, simple assignments to dict and list members are already thread-safe. But it’s not always easy to prove that your case is one of the “in general” ones.
Anyway, if you have a mutable global object that’s being shared, you need to have a global lock that’s shared along with it, and you need to acquire that lock around every access (read and write) to the object.
If at all possible, you should do this using a with statement to ensure that it’s impossible to abandon the lock (which could cause other threads to block forever waiting for the same lock).
It’s also important to make sure you don’t do any expensive work, like downloading and parsing a web page, with the lock acquired (which could cause all of your threads to end up serialized instead of usefully running in parallel).
So, at the global level, where you create accounting_price, create a corresponding lock:
accounting_price = […etc.…]
accounting_price_lock = threading.Lock()
Then, inside the thread, wherever you use it:
with accounting_price_lock:
    setup_info = accounting_price[...]

yahoo_prices = do_expensive_stuff(setup_info)

with accounting_price_lock:
    for i in range(level+1):
        accounting_price[i] = yahoo_prices[j]['accg'][i][0]
If you end up often having lots of reads and few writes, this can cause excessive and unnecessary contention, but you can fix that by just replacing the generic lock with a read-write lock. They’re a bit slower in general, but a lot faster if a bunch of threads want to read in parallel.
The error is presumably a KeyError, right? It's because you're indexing two levels into your dictionary when only one exists. Try this:
symbol = "AAPL"
d = {}
for i in range(level+1):
name = 'accounting_price_{}'.format(symbol)
d[name] = {}
d[name][i] = yahoo_prices[j]['accg'][i][0]
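A sketch of an alternative that sidesteps the initialization entirely, using collections.defaultdict (symbol, level, j, and yahoo_prices are assumed from the question):

from collections import defaultdict

d = defaultdict(dict)  # a missing key gets a fresh {} automatically
for i in range(level + 1):
    d['accounting_price_{}'.format(symbol)][i] = yahoo_prices[j]['accg'][i][0]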

Managing Memory with Python Reading Objects of Varying Sizes from OODB's

I'm reading in a collection of objects (tables like sqlite3 tables or dataframes) from an object-oriented database, most of which are small enough that the Python garbage collector can handle them without incident. However, when they get larger in size (less than 10 MB), the GC doesn't seem to be able to keep up.
Pseudocode looks like this:
walk = walkgenerator('/path')
objs = objgenerator(walk)
with db.transaction(bundle=True, maxSize=10000, maxParts=10):
    oldobj = None
    oldtable = None
    for obj in objs:
        currenttable = obj.table
        if oldtable and oldtable in currenttable:
            db.delete(oldobj.path)
        del oldtable
        oldtable = currenttable
        del oldobj
        oldobj = obj
        if not count % 100:
            gc.collect()
I'm looking for an elegant way to manage memory while allowing Python to handle it when possible.
Perhaps embarrassingly, I've tried using del to help clean up reference counts.
I've tried gc.collect() at varying modulo counts in my for loops:
100 (no difference),
1 (slows loop quite a lot, and I will still get a memory error of some type),
3 (loop is still slow but memory still blows up eventually)
Suggestions are appreciated!!!
Particularly, if you can give me tools to assist with introspection. I've used Windows Task Manager here, and it seems to more or less randomly spring a memory leak. I've limited the transaction size as much as I feel comfortable, and that seems to help a little bit.
There's not enough info here to say much, but what I do have to say wouldn't fit in a comment so I'll post it here ;-)
First, and most importantly, in CPython garbage collection is mostly based on reference counting. gc.collect() won't do anything for you (except burn time) unless trash objects are involved in reference cycles (an object A can be reached from itself by following a chain of pointers transitively reachable from A). You create no reference cycles in the code you showed, but perhaps the database layer does.
So, after you run gc.collect(), does memory use go down at all? If not, running it is pointless.
I expect it's most likely that the database layer is holding references to objects longer than necessary, but digging into that requires digging into exact details of how the database layer is implemented.
One way to get clues is to print the result of sys.getrefcount() applied to various large objects:
>>> import sys
>>> bigobj = [1] * 1000000
>>> sys.getrefcount(bigobj)
2
As the docs say, the result is generally 1 larger than you might hope, because the refcount of getrefcount()'s argument is temporarily incremented by 1 simply because it is being used (temporarily) as an argument.
So if you see a refcount greater than 2, del won't free the object.
Another way to get clues is to pass the object to gc.get_referrers(). That returns a list of objects that directly refer to the argument (provided that a referrer participates in Python's cyclic gc).
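A sketch of both tools together (run at module level, where the module's globals dict will typically show up among the referrers):

import gc
import sys

big = [1] * 1000000
holder = {'data': big}       # a second, easy-to-forget reference

print(sys.getrefcount(big))  # 3: the name big, holder['data'], and the argument itself
for ref in gc.get_referrers(big):
    print(type(ref))         # typically the holder dict and the module's globals dict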
BTW, you need to be clearer about what you mean by "doesn't seem to work" and "blows up eventually". Can't guess. What exactly goes wrong? For example, is MemoryError raised? Something else? Tracebacks often yield a world of useful clues.

what happens to a dict with no pointer

I have a pointer to a dict() and I keep updating it, like this:
import time

def creating_new_dict():
    MyDict = dict()
    # ... rest of code - get data into MyDict
    return MyDict

def main():
    while(1):
        MyDict = creating_new_dict()
        time.sleep(20)
I know that MyDict is a pointer to the dict so My question is what happens to the old data in MyDict?
Is it deleted or should I make sure it's gone?
Coming back around to answer your question since nobody else did:
I know that MyDict is a pointer to the dict
Ehhh. MyDict is a reference to your dict object, not a pointer. Slight digression follows:
You can't mutate immutable types within a function, because you're handling references, not pointers. See:
def adder(x):
    x = x + 1

a = 0
adder(a)
a
Out[4]: 0
I won't go too far down the rabbit hole, but just this: there is nothing I could have put into the body of adder that would have caused a to change. Because variables are references, not pointers.
Okay, we're back.
so My question is what happens to the old data in MyDict?
Well, it's still there, in some object somewhere, but you blew up the reference to it. The reference is no more, it has ceased to be. In fact, there are exactly zero extant references to the old dict, which will be significant in the next paragraph!
Is it deleted
It's not deleted per se, but you could basically think of it as deleted. It has no remaining references, which means it is eligible for garbage collection. Python's gc, when triggered, will come along and get rid of all the objects that have no remaining references, which includes all your orphaned dicts.
But like I said in my comment: gc will not be triggered -now-, but some time in the (possibly very near) future. How soon that happens depends on.. stuff, including what flavor of Python you're running. CPython (the one you're probably running) will usually gc the object immediately, but this is not guaranteed and should not be relied upon - it is left up to the implementation. What you can count on - the gc will come along and clean up eventually. (well, as long as you haven't disabled it)
or should I make sure it's gone?
Nooooo. Nope. Don't do the garbage collector's job. Only when you have a very specific performance reason should you do so.
I still don't believe you, show me an example.
Er, you said that, right? Okay:
class DeleteMe:
    def __del__(self):  # called when the object is destroyed, i.e. gc'd
        print(':(')

d = DeleteMe()
d = DeleteMe()  # first one will be garbage collected, leading to sad face
:(
And, like I said before, you shouldn't count on :( happening immediately, just that gc will happen eventually.
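If you'd rather observe collection without writing a __del__ method, weakref.finalize (standard library, Python 3.4+) does the same job; a sketch:

import weakref

class DeleteMe:
    pass

d = DeleteMe()
weakref.finalize(d, print, ':(')  # callback fires when the object is collected
d = DeleteMe()                    # on CPython the first instance dies here -> ':('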

Deletion of a list in python with and without ':' operator

I've been working with Python for quite a bit of time and I'm confused regarding a few issues in the areas of garbage collection and memory management, as well as the real deal with the deletion of variables and freeing of memory.
>>> pop = range(1000)
>>> p = pop[100:700]
>>> del pop[:]
>>> pop
[]
>>> p
[100, 101, 102, ..., 699]
In the above piece of code, this happens. But,
>>> pop = range(1000)
>>> k = pop
>>> del pop[:]
>>> pop
[]
>>> k
[]
Here in the 2nd case, it implies that k is just pointing to the list 'pop'.
First Part of the question :
But, what's happening in the 1st code block? Is the memory containing [100:700] elements not getting deleted or is it duplicated when list 'p' is created?
Second Part of the question :
Also, I've tried including gc.enable() and gc.collect() calls in between wherever possible, but there's no change in memory utilization in either case. This is kind of puzzling. Isn't it bad that Python is not returning free memory back to the OS? Correct me if I'm wrong in the little research I've done. Thanks in advance.
Slicing a sequence results in a new sequence, with a shallow copy of the appropriate elements.
Returning the memory to the OS might be bad, since the script may turn around and create new objects, at which point Python would have to request the memory from the OS again.
1st part:
In the 1st code block, you create a new list object into which the sliced elements are copied before the original is emptied.
In the 2nd code block, however, you just assign a reference to the same object to another variable. Then you empty the list, which, of course, is visible via both references.
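A compact sketch of both behaviours side by side (list(range(...)) is used so it also runs on Python 3, where range() no longer returns a list):

pop = list(range(10))
p = pop[2:5]    # slicing builds a new list holding copies of the references
k = pop         # plain assignment: a second name for the same list object

del pop[:]      # empties the list in place

print(pop)      # []
print(k)        # [] - k names the same, now-empty, list
print(p)        # [2, 3, 4] - the slice copy is untouched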
2nd part: Memory is returned when appropriate, but not always. Under the hood of Python, there is a memory allocator which has control over where the memory comes from. There are 2 ways: via the brk()/sbrk() mechanism (for smaller memory blocks) and via mmap() (larger blocks).
Here we have rather smaller blocks which get allocated directly at the end of the data segment:
data data data | object1 object1 | object2 object2
If we only free object1, we have a memory gap which can be reused for the next object, but it cannot easily be freed and returned to the OS.
If we free both objects, the memory could be returned. But there is probably a threshold for keeping memory around for a while, because returning everything immediately is not always best.
