Why does Python not release the memory? [duplicate]

I have a program written in Python 2.6 that creates a large number of short-lived instances (it is a classic producer-consumer problem). I noticed that the memory usage as reported by top and pmap seems to increase when these instances are created and never goes back down. I was concerned that some Python module I was using might be leaking memory, so I carefully isolated the problem in my code. I then proceeded to reproduce it in as short an example as possible. I came up with this:
class LeaksMemory(list):
    timesDelCalled = 0

    def __del__(self):
        LeaksMemory.timesDelCalled += 1

def leakSomeMemory():
    l = []
    for i in range(0, 500000):
        ml = LeaksMemory()
        ml.append(float(i))
        ml.append(float(i*2))
        ml.append(float(i*3))
        l.append(ml)
import gc
import os
leakSomeMemory()
print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(gc.collect()) +" objects collected")
print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(os.getpid()) + " : check memory usage with pmap or top")
If you run this with something like 'python2.6 -i memoryleak.py' it will stop at an interactive prompt and you can use pmap -x PID to check the memory usage. I added the __del__ method so I could verify that garbage collection was occurring. It is not there in my actual program and does not appear to make any functional difference. Each call to leakSomeMemory() increases the amount of memory consumed by this program. I fear I am making some simple error and that references are getting kept by accident, but cannot identify it.

Python will release the objects, but it will not release the memory back to the operating system immediately. Instead, it will re-use the same segments for future allocations within the same interpreter.
Here's a blog post about the issue: http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
UPDATE: I tested this myself with Python 2.6.4 and didn't notice persistent increases in memory usage. Some invocations of leakSomeMemory() caused the memory footprint of the Python process to increase, and some made it decrease again. So it all depends on how the allocator is re-using the memory.
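If you want to watch this yourself, here is a minimal sketch (my addition, Linux-only since it reads /proc; it assumes the leakSomeMemory() function from the question is defined in the same script). Repeated calls should mostly reuse the memory freed by earlier calls rather than grow the footprint each time:
import gc

def rss_kb():
    # read this process's resident set size from /proc (Linux-specific)
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])   # reported in kB

print('baseline: %d kB' % rss_kb())
leakSomeMemory()
gc.collect()
print('after 1st call: %d kB' % rss_kb())
leakSomeMemory()
gc.collect()
print('after 2nd call: %d kB' % rss_kb())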

According to Alex Martelli:
"The only really reliable way to
ensure that a large but temporary use
of memory DOES return all resources to
the system when it's done, is to have
that use happen in a subprocess, which
does the memory-hungry work then
terminates."
So, in your situation it sounds like it would make sense to use the multiprocessing module to run the short-lived functions in separate processes to ensure the return of resources when the process finishes.
import multiprocessing as mp

def NOT_leakSomeMemory(i):
    # do stuff with i and build the result
    return result

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(NOT_leakSomeMemory, range(500000))
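As a side note (my addition, not part of the original answer): on Python 2.7+/3.2+ the Pool also accepts a maxtasksperchild argument, which recycles each worker after a fixed number of tasks, so a worker that has grown large exits and hands its memory back to the OS:
# recycle every worker after 1000 tasks (the number is only illustrative)
pool = mp.Pool(maxtasksperchild=1000)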
For more ideas on how to set things up using multiprocessing, see Doug Hellmann's tutorial.

Related

Python multithreading - memory not released when ran using While statement

I built a scraper (worker) launched XX times through multithreading (via Jupyter Notebook, python 2.7, anaconda).
Script is of the following format, as described on python.org:
def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for item in source():
    q.put(item)

q.join()    # block until all tasks are done
When I run the script as is, there are no issues. Memory is released after the script finishes.
However, I want to run the said script 20 times (a kind of batching),
so I turned the script above into a function, and run the function using the code below:
def multithreaded_script():
    # my script (code from above)

x = 0
while x < 20:
    x += 1
    multithreaded_script()
Memory builds up with each iteration, and eventually the system starts writing it to disk.
Is there a way to clear out the memory after each run?
I tried:
setting all the variables to None
adding sleep(30) at the end of each iteration (in case it takes time for the RAM to be released)
and nothing seems to help.
Any ideas on what else I can try to get the memory to clear out after each run within the While statement?
If not, is there a better way to execute my script XX times, that would not eat up the ram?
Thank you in advance.
TL;DR Solution: Make sure to end each function with return to ensure all local variables are destroyed from RAM.
Per Pavel's suggestion, I used a memory tracker (unfortunately the suggested memory tracker didn't work for me, so I used Pympler).
Implementation was fairly simple:
from pympler.tracker import SummaryTracker

tracker = SummaryTracker()

# ~~~~~~~~~ YOUR CODE ~~~~~~~~~

tracker.print_diff()
The tracker gave a nice output, which made it obvious that local variables generated by functions were not being destroyed.
Adding "return" at the end of every function fixed the issue.
Takeaway:
If you are writing a function that processes info/generates local variables, but doesn't pass those local variables to anything else -> make sure to end the function with return anyway. This will prevent issues that you may run into with memory leaks.
Additional notes on memory usage & BeautifulSoup:
If you are using BeautifulSoup / BS4 with multithreading and multiple workers, and have a limited amount of free RAM, you can also use soup.decompose() to destroy the soup object right after you are done with it, instead of waiting for the function to return / the code to stop running.
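As an illustration only (my sketch, not from the original answer; it assumes bs4 is installed and that html holds one fetched page), decompose() tears the parse tree apart so its memory can be reclaimed as soon as you are done with it:
from bs4 import BeautifulSoup

def extract_titles(html):
    soup = BeautifulSoup(html, 'html.parser')
    titles = [tag.get_text() for tag in soup.find_all('h1')]
    soup.decompose()   # free the tree now instead of waiting for the function to exit
    return titles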

Python multiprocessing unexpected high memory usage

In our system we have started to experience problems with a task that was working ok, but now it seems to be hanging and uses a high amount of memory (so other tasks fail and raise MemoryError).
Context: the example code below used to run without problems, but with a new database huge_dataframe has become a lot bigger. The odd thing is that if I run both parts separately, it works; running them together in process_data_task leads to the problem. This runs under Python 2.7 on Linux.
I suspect it is something to do with fork(), but how could each sub-process take so much memory? huge_dataframe is deleted before multiprocessing starts. It is also strange that do_recalculations hangs only when it is called inside process_data_task (a child not joining?), yet it doesn't throw any exception. Any explanation or ideas for troubleshooting?
def process_data_task():
    # part 1, high memory usage
    huge_dataframe = retrieve_data()
    process_table(huge_dataframe)

    # added this trying to fix the problem but it didn't help
    del huge_dataframe
    gc.collect()

    # part 2, heavy CPU usage, multiprocessing
    do_recalculations()    # recalculate items in parallel

# Multiprocessing done here
def do_recalculations():
    processes = cpu_count()
    items_to_update = [...]    # query database
    work_total_list = chunkify(items_to_update, processes)
    p = Pool(processes)
    result = p.map(sub_process_func, work_total_list)
    p.close()
    p.join()

I/O slowdown with multithreading in python

I have a python script which works on the following scheme: read a large file (e.g., a movie) - compose selected information from it into a number of small temporary files - spawn a C++ application in subprocesses to perform the file processing/calculations (separately for each file) - read the application output. To speed up the script I used multiprocessing. However, it has a major drawback: each process has to maintain in RAM a whole copy of the large input file, so I can run only a few processes before running out of memory. Thus I decided to try multithreading instead (or some combination of multiprocessing and multithreading), because threads share the address space. As the Python part mostly works with file I/O or waits for the C++ application to complete, I thought that the GIL should not be an issue here. Nevertheless, instead of some gain in performance, I observe a drastic slowdown, mainly owing to the I/O part.
I illustrate the problem with the following code (saved as test.py):
import sys, threading, tempfile, time

nthreads = int(sys.argv[1])

class IOThread(threading.Thread):
    def __init__(self, thread_id, obj):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.obj = obj

    def run(self):
        run_io(self.thread_id, self.obj)

def gen_object(nlines):
    obj = []
    for i in range(nlines):
        obj.append(str(i) + '\n')
    return obj

def run_io(thread_id, obj):
    ntasks = 100 // nthreads + (1 if thread_id < 100 % nthreads else 0)
    for i in range(ntasks):
        tmpfile = tempfile.NamedTemporaryFile('w+')
        with open(tmpfile.name, 'w') as ofile:
            for elem in obj:
                ofile.write(elem)
        with open(tmpfile.name, 'r') as ifile:
            content = ifile.readlines()
        tmpfile.close()

obj = gen_object(100000)
starttime = time.time()
threads = []
for thread_id in range(nthreads):
    threads.append(IOThread(thread_id, obj))
    threads[thread_id].start()
for thread in threads:
    thread.join()
runtime = time.time() - starttime
print('Runtime: {:.2f} s'.format(runtime))
When I run it with different number of threads, I get this:
$ python3 test.py 1
Runtime: 2.84 s
$ python3 test.py 1
Runtime: 2.77 s
$ python3 test.py 1
Runtime: 3.34 s
$ python3 test.py 2
Runtime: 6.54 s
$ python3 test.py 2
Runtime: 6.76 s
$ python3 test.py 2
Runtime: 6.33 s
Can someone explain this result to me, and give some advice on how to effectively parallelize I/O using multithreading?
EDIT:
The slowdown is not due to HDD performance, because:
1) the files are getting cached to RAM anyway
2) the same operations with multiprocessing (not multithreading) do indeed get faster (almost by a factor of the number of CPUs)
As I delved deeper into the problem, I made comparison benchmarks for 4 different parallelisation methods, 3 of which use python and 1 uses java (the purpose of the test was not to compare the I/O machinery between different languages but to see whether multithreading can boost I/O operations). The test was performed on Ubuntu 14.04.3, and all files were placed on a RAM disk.
Although the data are quite noisy, the clear trend is evident (see the chart; n=5 for each bar, error bars represent SD): python multithreading fails to boost the I/O performance. The most probable reason is the GIL, and therefore there is no way around it.
I think your performance measures don't lie: you're asking your hard disk to do many things at the same time: reads, writes, fsync when closing the files, and so on, across several files at once. That triggers a lot of physical hardware operations, and the more files you write at the same time, the more contention you get.
So the CPU is waiting for the disk operations to finish...
Moreover, maybe you don't have an SSD, so the syncs actually mean physical head movement.
EDIT: it could be a GIL problem. When you iterate over elem in obj in run_io, you execute Python code between each write. ofile.write probably releases the GIL so that the I/O doesn't block the other threads, but the lock is released/acquired on every iteration. So maybe your writes don't really run "concurrently".
EDIT2: to test the hypothesis, you can try to replace:
for elem in obj:
    ofile.write(elem)
with:
ofile.write("".join(obj))
and see if performance improves.

Multiprocessing Pool in Python - Only single CPU is utilized

Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
    return x

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    for x in xrange(1, 11):
        res = list(mapper(f, bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: it is a simple n-gram generation from a string.
#!/usr/bin/python

import time
import itertools
import threading
import multiprocessing
import random

def f(x):
    return x

def ngrams(input_tmp, n):
    input = input_tmp.split()
    if n > len(input):
        n = len(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    num = 100000000    # 100
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))

if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: ' + str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.
First, ngrams itself takes a lot of time. While that's happening, it's obviously only using one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
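To see that overhead dominate for yourself, here is a quick sketch (my illustration, not from the original answer): mapping a do-nothing function over a large list through a Pool versus doing the same thing in-process. The Pool version is typically far slower because every item has to be pickled, queued, and unpickled.
import multiprocessing, time

def f(x):
    return x

if __name__ == '__main__':
    data = list(range(1000000))

    t0 = time.time()
    local = [f(x) for x in data]          # plain in-process loop
    print('in-process: %.2f s' % (time.time() - t0))

    p = multiprocessing.Pool()
    t0 = time.time()
    pooled = p.map(f, data)               # same work, plus pickling + queue traffic
    print('via Pool:   %.2f s' % (time.time() - t0))
    p.close()
    p.join()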
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem (see the sketch after this list).
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
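Here is a hypothetical sketch of that initializer idea (my addition; the names _init_worker and count_ngrams are illustrative, not from the question): the big read-only string is shipped to each worker once, and every task then carries only a small integer.
import multiprocessing

_big_text = None    # populated once per worker by the initializer

def _init_worker(text):
    global _big_text
    _big_text = text

def count_ngrams(n):
    words = _big_text.split()
    n = min(n, len(words))
    return len(words) - n + 1    # cheap result, so return pickling stays small

if __name__ == '__main__':
    big_string = ' '.join(str(i) for i in range(1000000))
    pool = multiprocessing.Pool(initializer=_init_worker, initargs=(big_string,))
    counts = pool.map(count_ngrams, range(1, 100))
    pool.close()
    pool.join()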
The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
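A minimal sketch of the first option (again my illustration, with simplified stand-ins for the question's functions): call the expensive ngrams() inside the worker processes by mapping over n. Here each worker builds its own smaller random string; if every n must see the same string, combine this with the initializer approach sketched above.
import multiprocessing, random

def ngrams(text, n):
    words = text.split()
    n = min(n, len(words))
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def worker(n):
    rand_list = random.sample(range(100000), 10000)
    rand_str = ' '.join(str(i) for i in rand_list)
    return len(ngrams(rand_str, n))    # return a small summary, not the n-grams themselves

if __name__ == '__main__':
    p = multiprocessing.Pool()
    counts = p.map(worker, range(1, 100))
    p.close()
    p.join()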

How do I use subprocesses to force Python to release memory?

I was reading up on Python memory management and would like to reduce the memory footprint of my application. It was suggested that subprocesses would go a long way in mitigating the problem, but I'm having trouble conceptualizing what needs to be done. Could someone please provide a simple example of how to turn this...
import copy

def my_function():
    x = range(1000000)
    y = copy.deepcopy(x)
    del x
    return y

@subprocess_witchcraft
def my_function_dispatcher(*args):
    return my_function()
...into a real subprocessed function that doesn't store an extra "free-list"?
Bonus Question:
Does this "free-list" concept apply to python c-extensions as well?
The important thing about the optimization suggestion is to make sure that my_function() is only invoked in a subprocess. The deepcopy and del are irrelevant: once you create five million distinct integers in a process, holding onto all of them at the same time, it's game over. Even if you stop referring to those objects, Python will not free them; instead it keeps references to five million empty integer-object-sized fields in a limbo where they await reuse for the next function that wants to create five million integers. This is the free list mentioned in the other answer, and it buys blindingly fast allocation and deallocation of ints and floats. It is only fair to Python to note that this is not a memory leak, since the memory is definitely made available for further allocations. However, that memory will not get returned to the system until the process ends, nor will it be reused for anything other than allocating numbers of the same type.
Most programs don't have this problem because most programs do not create pathologically huge lists of numbers, free them, and then expect to reuse that memory for other objects. Programs using numpy are also safe because numpy stores numeric data of its arrays in tightly packed native format. For programs that do follow this usage pattern, the way to mitigate the problem is by not creating a large number of the integers at the same time in the first place, at least not in the process which needs to return memory to the system. It is unclear what exact use case you have, but a real-world solution will likely require more than a "magic decorator".
This is where subprocesses come in: if the list of numbers is created in another process, then all the memory associated with the list, including but not limited to the storage of the ints, is both freed and returned to the system by the mere act of terminating the subprocess. Of course, you must design your program so that the list can be both created and processed in the subprocess, without requiring the transfer of all these numbers. The subprocess can receive the information needed to create the data set, and can send back the information obtained from processing the list.
To illustrate the principle, let's upgrade your example so that the whole list actually needs to exist - say we're benchmarking sorting algorithms. We want to create a huge list of integers, sort it, and reliably free the memory associated with the list, so that the next benchmark can allocate memory for its own needs without worrying about running out of RAM. To spawn the subprocess and communicate, this uses the multiprocessing module:
# To run this, save it to a file that looks like a valid Python module, e.g.
# "foo.py" - multiprocessing requires being able to import the main module.
# Then run it with "python foo.py".

import multiprocessing, random, sys, os, time

def create_list(size):
    # utility function for clarity - runs in subprocess
    maxint = sys.maxint
    randrange = random.randrange
    return [randrange(maxint) for i in xrange(size)]

def run_test(state):
    # this function is run in a separate process
    size = state['list_size']
    print 'creating a list with %d random elements - this can take a while... ' % size,
    sys.stdout.flush()
    lst = create_list(size)
    print 'done'
    t0 = time.time()
    lst.sort()
    t1 = time.time()
    state['time'] = t1 - t0

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    state = manager.dict(list_size=5*1000*1000)    # shared state
    p = multiprocessing.Process(target=run_test, args=(state,))
    p.start()
    p.join()
    print 'time to sort: %.3f' % state['time']
    print 'my PID is %d, sleeping for a minute...' % os.getpid()
    time.sleep(60)
    # at this point you can inspect the running process to see that it
    # does not consume excess memory
Bonus Answer
It is hard to provide an answer to the bonus question, since the question is unclear. The "free list concept" is exactly that, a concept: an implementation strategy that needs to be explicitly coded on top of the regular Python allocator. Most Python types do not use that allocation strategy; for example, it is not used for instances of classes created with the class statement. Implementing a free list is not hard, but it is fairly advanced and rarely undertaken without good reason. If an extension author has chosen to use a free list for one of their types, it can be expected that they are aware of the tradeoff a free list offers: extra-fast allocation/deallocation at the cost of some additional space (for the objects on the free list and the free list itself) and the inability to reuse that memory for anything else.
