I've been running multiple threads (by "symbol" below) but have encountered a weird issue where there appears to be a potential memory leak depending on which gets processed first. I believe the issue is due to me using the same field name / array name in each thread.
Below is an example of the code I am running to assign values to an array:
for i in range(level+1):
    accounting_price[i] = yahoo_prices[j]['accg'][i][0]
It works fine, but when I query multiple "symbols" and run a thread for each symbol, I sometimes get symbol A's "accounting_price[i]" returned in symbol C's results and vice versa. I'm not sure if this is a memory leak from one thread to the other, but the only quick solution I have is to make "accounting_price[i]" unique to each symbol. Would the below be correct?
symbol = "AAPL"
d = {}
for i in range(level+1):
    d['accounting_price_{}'.format(symbol)][i] = yahoo_prices[j]['accg'][i][0]
When I run it, I get an error thrown up.
I would be extremely grateful for a solution on how to dynamically create unique arrays to each thread. Alternatively, a solution to the "memory leak".
Thanks!
If you think there’s a race here causing conflicting writes to the dict, using a lock is both the best way to rule that out, and probably the best solution if you’re right.
I should point out that in general, thanks to the global interpreter lock, simple assignments to dict and list members are already thread-safe. But it’s not always easy to prove that your case is one of the “in general” ones.
Anyway, if you have a mutable global object that’s being shared, you need to have a global lock that’s shared along with it, and you need to acquire that lock around every access (read and write) to the object.
If at all possible, you should do this using a with statement to ensure that it’s impossible to abandon the lock (which could cause other threads to block forever waiting for the same lock).
It’s also important to make sure you don’t do any expensive work, like downloading and parsing a web page, with the lock acquired (which could cause all of your threads to end up serialized instead of usefully running in parallel).
So, at the global level, where you create accounting_price, create a corresponding lock:
import threading

accounting_price = […etc.…]
accounting_price_lock = threading.Lock()
Then, inside the thread, wherever you use it:
with accounting_price_lock:
    setup_info = accounting_price[...]
yahoo_prices = do_expensive_stuff(setup_info)
with accounting_price_lock:
    for i in range(level+1):
        accounting_price[i] = yahoo_prices[j]['accg'][i][0]
If you end up often having lots of reads and few writes, this can cause excessive and unnecessary contention, but you can fix that by just replacing the generic lock with a read-write lock. They’re a bit slower in general, but a lot faster if a bunch of threads want to read in parallel.
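Python's standard library doesn't ship a read-write lock, but a minimal sketch of one (my own naming, not taken from the original code) can be built on threading.Condition:

```python
import threading

class RWLock:
    """Minimal reader-writer lock: many concurrent readers, or one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        # Readers only hold the condition lock long enough to register.
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a waiting writer

    def acquire_write(self):
        # The writer keeps the condition lock (blocking new readers)
        # and waits for the existing readers to drain.
        self._cond.acquire()
        while self._readers:
            self._cond.wait()

    def release_write(self):
        self._cond.release()
```

A writer in acquire_write keeps the underlying lock held, so new readers block until release_write; readers never block each other.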
The error is presumably a KeyError, right? It's because you're indexing two levels into your dictionary when only one exists. Create the inner dict once, before the loop, then assign into it:
symbol = "AAPL"
d = {}
name = 'accounting_price_{}'.format(symbol)
d[name] = {}
for i in range(level+1):
    d[name][i] = yahoo_prices[j]['accg'][i][0]
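If you'd rather not create the inner dict by hand, collections.defaultdict does it on first access. The sample data below is hypothetical, just mimicking the shape of the real yahoo_prices structure:

```python
from collections import defaultdict

# Hypothetical stand-in for the real yahoo_prices structure
yahoo_prices = [{'accg': [[100.0], [101.5], [102.3]]}]
j, level, symbol = 0, 2, "AAPL"

d = defaultdict(dict)  # a missing key gets a fresh inner dict automatically
for i in range(level + 1):
    d['accounting_price_{}'.format(symbol)][i] = yahoo_prices[j]['accg'][i][0]
```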
I am using LLDB Python scripting support to add custom variable formatting for a complex C++ class type in Xcode.
This is working well for simple situations, but I have hit a wall when I need to call a method which uses a pass-by-reference parameter, which it populates with results. This would require me to create a variable to pass here, but I can't find a way to do this?
I have tried using the target's CreateValueFromData method, as below, but this doesn't seem to work.
import lldb

def MyClass(valobj, internal_dict):
    class2_type = valobj.target.FindFirstType('class2')
    process = valobj.process
    class2Data = [0]
    data = lldb.SBData.CreateDataFromUInt32Array(process.GetByteOrder(), process.GetAddressByteSize(), class2Data)
    valobj.target.CreateValueFromData("testClass2", data, class2_type)
    valobj.EvaluateExpression("getType(testClass2)")
    class2Val = valobj.frame.FindVariable("testClass2")
    if not class2Val.error.success:
        return class2Val.error.description
    return class2Val.GetValueAsUnsigned()
Is there some way to be able to achieve what I'm trying to do?
SBValue names are just labels for the SBValue; they aren't guaranteed to exist as symbols in the target. For instance, if the value you are formatting is an ivar of some other object, its name will be the ivar name... And lldb does not inject new SBValues' names into the symbol table, since that would end up causing lots of name collisions. So they don't exist in the namespace the expression evaluator queries when looking up names.
If the variable you are formatting is a pointer, you can get the pointer value and cons up an expression that casts the pointer value to the appropriate type for your getType function, and pass that to your function. If the value is not a pointer, you can still use SBValue.AddressOf to get the memory location of the value. If the value exists only in lldb (AddressOf will return an invalid address) then you would have to push it to the target with SBProcess.AllocateMemory/WriteMemory, but that should only happen if you have another data formatter that makes these objects out of whole cloth for its own purposes.
It's better not to call functions in formatters if you can help it. But if you really must call a function in your data formatter, you should do so judiciously.
They can cause performance problems: if you have an array of 100 elements of this type, your formatter will require 100 function calls in the target to render the array on every step operation. That's 200 context switches between your process and the debugger, plus a bunch of memory reads and writes.
Also, since you can't ensure that the data in your value is correct (it might represent a variable that has not been initialized yet, or already deallocated) you either need to have your function handle bad data, or at least be prepared for the expression to crash. lldb can clean up the stack and suppress the exception from crashes, but it can't undo any side-effects the expression might have had before crashing.
For instance, if the function you called took some lock before crashing that it was expecting to release on the way out, your formatter will damage the state of the program. So you have to be careful what you call...
And by default, EvaluateExpression will allow all threads to run so that expressions don't deadlock against a lock held by another thread. You probably don't want that to happen, since it means looking at the locals of one thread will "change" the state of another thread. So you really should only call functions you are sure don't take locks. And use the version of EvaluateExpression that takes an SBExpressionOptions, on which you call SetStopOthers(True) and SetTryAllThreads(False).
Is there a way to check if a generator is in use anywhere globally, such that an active generator will bail when no one is using it?
This is mostly academic but I can think of numerous situations where it would be good to detect this. So you understand, here is an example:
def accord():
    _accord = None
    _inuse = lambda: someutilmodule.scopes_using(_accord) > 1
    def gen():
        uid = 0
        while _inuse():
            uid += 1
            yield uid
        else:
            print("I'm done, although you obviously forgot about me.")
    _accord = gen()
    return _accord

a = accord()
a.__next__()
a.__next__()
a.__next__()
a = None
"""
<<< 1
<<< 2
<<< 3
<<< I'm done, although you obviously forgot about me.
"""
The triple quote is the text I would expect to see if someutilmodule.scopes_using reported the number of uses of the variable. By uses I mean how many copies or references exist.
Note that the generator has an infinite loop, which is generally bad practice, but in cases like a unique-id generator that isn't widely or complexly used, it is often useful and won't create huge overhead. Obviously another way would simply be to expose a function or method that sets the flag the loop uses as its condition. But again, it's good to know the various ways to do things.
In this case, when you do
a = accord()
A reference counter behind the scenes keeps track of the fact that a variable is referencing that generator object. This keeps it in memory because there's a chance it may be needed in the future.
Once you do this however:
a = None
The reference to the generator is lost, and the reference counter associated with it is decremented. Once it reaches 0 (which it would, because you only had one reference to it), the system knows that nothing can ever refer to that object again, which frees the data associated with that object up for garbage collection.
This is all handled behind the scenes. There's no need for you to intervene.
The best way to see what's going on, for better or worse, is to examine the relevant source code for CPython. Ultimately, _Py_DECREF is called when references are lost. You can see a little further down, after interpreting some convoluted logic, that once the reference count reaches 0, _Py_Dealloc(op); is called on PyObject *op. I can't for the life of me find the actual call to free that I'm sure ultimately results from _Py_Dealloc, though. It seems to be somewhere in the Py_TRASHCAN_END macro, but good lord. That's one of the longest rabbit holes I've ever gone down where I have nothing to show for it.
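You can watch the collection happen with a weakref, which observes the generator without contributing to its reference count (immediate reclamation is a CPython refcounting detail, not a language guarantee):

```python
import weakref

def gen():
    uid = 0
    while True:
        uid += 1
        yield uid

a = gen()
r = weakref.ref(a)        # does not keep the generator alive
next(a), next(a), next(a)
a = None                  # last strong reference dropped
print(r() is None)        # → True on CPython: freed immediately
```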
I have a quite complex Python (2.7 on Ubuntu) code which is leaking memory unexpectedly. To break it down: it is a method which is repeatedly called (and itself calls different methods) and returns a very small object. After the method finishes, the used memory is not released. As far as I know it is not unusual to reserve some memory for later use, but if I use big enough input my machine eventually consumes all memory and freezes. This is not the case if I use a subprocess with concurrent.futures' ProcessPoolExecutor, thus I have to assume it is not my code but some underlying problem?!
Is this a known issue? Might it be a problem in 3rd party libraries I am using (e.g. PyQgis)? Where should I start to search for the problem?
Some more Background to eliminate silly reasons (because I am still somewhat of a beginner):
The method uses some global variables, but in my understanding these should only be visible in the file where they are declared, and in any case they should be overwritten on the next call of the method?!
To clarify in pseudocode:
def main():
    load input from file
    for x in input:
        result = extra_file.initialization(x)
        # here is the point where memory should get released in my opinion

# extra file
def initialization(x):
    global input
    input = x
    result_container = []
    while not result:
        part_of_result = method1()
        result_container.append(part_of_result)
        if result_container fulfills condition to be the final result:
            result = result_container
    del input
    return result

def method1():
    # do stuff
    method2()
    # do stuff
    return part_of_result

def method2():
    # do stuff with input, not altering it
Should I try using garbage collection? All references after finishing the method should be deleted and python itself should take care of it?
Definitely try using garbage collection. I don't believe it's a known problem.
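Garbage collection matters mainly when reference cycles are involved, since reference counting alone can never reclaim those; a minimal sketch:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

a, b = Node(), Node()
a.ref, b.ref = b, a       # a reference cycle: refcounts never reach zero
del a, b                  # names gone, but the cycle keeps both objects alive
collected = gc.collect()  # the cycle detector finds and frees them
```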
I am trying to write a python module which checks consistency of the mac addresses stored in the HW memory. The scale could go upto 80K mac addresses. But when I make multiple calls to get a list of mac addresses through a python method, the memory does not get freed up and eventually I am running out of memory.
An example of what I am doing is:
import resource
import copy

def get_list():
    list1 = []
    for j in range(1, 10):
        for i in range(0, 1000000):
            list1.append('abcdefg')
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)
    return list1

for i in range(0, 5):
    x = get_list()
On executing the script, I get:
45805
53805
61804
69804
77803
85803
93802
101801
109805
118075
126074
134074
142073
150073
158072
166072
174071
182075
190361
198361
206360
214360
222359
230359
238358
246358
254361
262365
270364
278364
286363
294363
302362
310362
318361
326365
334368
342368
350367
358367
366366
374366
382365
390365
398368
i.e. the memory usage reported keeps going up.
Is it that I am looking at the memory usage in a wrong way?
And if not, is there a way to not have the memory usage go up between function calls in a loop? (In my case with mac addresses, I do not fetch the same list of mac addresses again; I get the list from a different section of the HW memory. That is, all the calls to get mac addresses are valid, but after each call the data obtained is useless and can be discarded.)
Python is a managed language. Memory is, generally speaking, the concern of the implementation rather than the average developer. The system is designed to reclaim memory that you are no longer using automatically.
If you are using CPython, an object will be destroyed when its reference count reaches zero, or when the cyclic garbage collector finds and collects it. If you want to reclaim the memory belonging to an object, you need to ensure that no references to it remain, or at least that it is not reachable from any stack frame's variables. That is to say, it should not be possible to refer to the data you want reclaimed, either directly or through some expression such as foo.bar[42], from any currently executing function.
If you are using another implementation, such as PyPy, the rules may vary. In particular, reference counting is not required by the Python language standard, so objects may not go away until the next garbage collection run (and then you may have to wait for the right generation to be collected).
For older versions of Python (prior to Python 3.4), you also need to worry about reference cycles which involve finalizers (__del__() methods). The old garbage collector cannot collect such cycles, so they will (basically) get leaked. Most built-in types do not have finalizers, are not capable of participating in reference cycles, or both, but this is a legitimate concern if you are creating your own classes.
For your use case, you should empty or replace the list when you no longer need its contents (with e.g. list1 = [] or del list1[:]), or return from the function which created it (assuming it's a local variable, rather than a global variable or some other such thing). If you find that you are still running out of memory after that, you should either switch to a lower-overhead language like C or invest in more memory. For more complicated cases, you can use the gc module to test and evaluate how the garbage collector is interacting with your program.
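To make the emptying-versus-rebinding point concrete:

```python
list1 = ['abcdefg'] * 1000

del list1[:]    # empties the existing list in place (same object)
# list1.clear() does the same thing in Python 3.3+

list1 = ['abcdefg'] * 1000
list1 = []      # rebinds the name; the old list becomes unreachable
```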
Try this; it might not always free the memory, as it may still be in use. See if it works:
gc.collect()
I have a dictionary being updated by one thread, and in another thread I'd like to iterate over its values. Normally I'd use a lock, but this code is very performance-critical, and I want to avoid that if at all possible.
A special feature of my case is that I don't care about absolute correctness of the iterator; if it misses entries that were removed after iteration started, or picks up ones added afterwards, that's fine. I only require that it doesn't raise any sort of 'dictionary size changed during iteration' exception.
Given this relaxed constraint on correctness, is there an efficient way to iterate the dictionary without using a lock?
Note: I'm aware that keys() is threadsafe in Python 2.x, but since that behavior has changed in 3.x, I want to avoid it.
No personal experience with this, but I read this some time ago: http://www.python.org/dev/peps/pep-3106/
These operations are thread-safe only to the extent that using them in a thread-unsafe way may cause an exception but will not cause corruption of the internal representation.
As in Python 2.x, mutating a dict while iterating over it using an iterator has an undefined effect and will in most cases raise a RuntimeError exception. (This is similar to the guarantees made by the Java Collections Framework.)
I would consider using a lock long enough to retrieve the values that you want to iterate over:
with lock:
    values = the_dict.values()  # Python 2
    # values = list(the_dict.values())  # Python 3
for value in values:
    # do stuff
Or, you could try it without a lock and catch RuntimeError, and if you get one, try to retrieve the values again.
[edit] Below slightly rephrased per J.F. Sebastian's suggestion:
while True:
    try:
        values = list(the_dict.values())
        break
    except RuntimeError:
        pass
I personally would go with the lock.
Two things:
1. Dump the keys into a queue and read that safely.
2. Performance-critical code should probably not be using Python threads.
Sometimes an example is better than words.
Array iteration is NOT thread-safe, see live example for Python 3.6
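The failure itself is easy to reproduce single-threaded; with real threads the timing only makes it intermittent:

```python
d = {1: 'a', 2: 'b'}
caught = ''
try:
    for k in d:
        d[3] = 'c'   # mutate the dict mid-iteration
except RuntimeError as e:
    caught = str(e)
print(caught)  # → dictionary changed size during iteration
```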