I have been playing with memory_profiler for some time and got this interesting but confusing results from the small program below:
import pandas as pd
import numpy as np
#profile
def f(p):
tmp = []
for _, frame in p.iteritems():
tmp.append([list(record) for record in frame.to_records(index=False)])
# initialize a list of pandas panels
lp = []
for j in xrange(50):
d = {}
for i in xrange(50):
df = pd.DataFrame(np.random.randn(200, 50))
d[i] = df
lp.append(pd.Panel(d))
# execution (iteration)
for panel in lp:
f(panel)
Then if I use memory_profiler's mprof to analyze the memory usage during runtime, mprof run test.py without any other parameters, I get this:
.
There seems to be memory unreleased after each function call f().
tmp is just a local list and should be reassigned and memory reallocated each time f() is called. Obviously there is some discrepancy here in the graph attached. I know that python has its own memory management blocks and also has free list for int and other types, and gc.collect() should do the magic. It turns out that explicit gc.collect() doesn't work. (Maybe because we are working with pandas objects, panels and frames? I don't know.)
The most confusing part is, I don't change or modify any variable in f(). All it does is just put some list representation copies in a local list. Therefore python doesn't need to make a copy of anything. Then why and how does this happen?
=================
Some other observations:
1) If I call f() with f(panel.copy()) (last line of code), passing the copy instead of the original object reference, I have a totally different memory usage result: . Is python that smart to tell that this value passed is a copy so that it could do some internal tricks to release the memory after each function call?
2) I think it might be because of df.to_records(). Well if I change it to frame.values, I would get similar flat memory curve, just like memory_profiling_results_2.png shown above, during the iteration (Although I do need to_records() because it maintains the column dtype, while .values messes the dtypes up). But I looked into frame.py's implementation on to_records(). I don't see why it would hold on the memory out there, while .values would work just fine.
I am running the program on Windows, with python 2.7.8, memory_profiler 0.43 and psutil 5.0.1.
This is not a memory leak. What you are seeing is a side effect of pandas.core.NDFrame caching some results. This allows it to return the same information the second time you ask for it without running the calculations again. Change the end of your sample code to look like the following code and run it. You should find that the second time through the memory increase will not happen, and the execution time will be less.
import time
# execution (iteration)
start_time = time.time()
for panel in lp:
f(panel)
print(time.time() - start_time)
print('-------------------------------------')
start_time = time.time()
for panel in lp:
f(panel)
print(time.time() - start_time)
Related
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed
import numpy as np
import time
#creating iterable
testDict = {}
for i in range(1000):
testDict[i] = np.random.randint(1,10)
#default method
stime = time.time()
newdict = []
for k, v in testDict.items():
for i in range(1000):
v = np.tanh(v)
newdict.append(v)
etime = time.time()
print(etime - stime)
#output: 1.1139910221099854
#multi processing
stime = time.time()
testresult = []
def f(item):
x = item[1]
for i in range(1000):
x = np.tanh(x)
return x
def main(testDict):
with ProcessPoolExecutor(max_workers = 8) as executor:
futures = [executor.submit(f, item) for item in testDict.items()]
for future in as_completed(futures):
testresult.append(future.result())
if __name__ == '__main__':
main(testDict)
etime = time.time()
print(etime - stime)
#output: 3.4509658813476562
Learning multiprocessing and testing stuff. Ran a test to check if I have implemented this correctly. Looking at the output time taken, concurrent method is 3 times slower. So what's wrong?
My objective is to parallelize a script which mostly operates on a dictionary of around 500 items. Each loop, values of those 500 items are processed and updated. This loops for let's say 5000 generations. None of the k,v pairs interact with other k,v pairs. [Its a genetic algorithm].
I am also looking at guidance on how to parallelize the above described objective. If I use the correct concurrent futures method on each of my function in my genetic algorithm code, where each function takes an input of a dictionary and outputs a new dictionary, will it be useful? Any guides/resources/help is appreciated.
Edit: If I run this example: https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor-example, it takes 3 times more to solve than a default for loop check.
There are a couple basic problems here, you're using numpy but you're not vectorizing your calculations. You'll not benefit from numpy's speed benefit with the way you write your code here, and might as well just use the standard library math module, which is faster than numpy for this style of code:
# 0.089sec
import math
for k, v in testDict.items():
for i in range(1000):
v = math.tanh(v)
newdict.append(v)
Once you vectorise the operation, only then you see the benefit of numpy:
# 0.016sec
for k, v in testDict.items():
arr = no.full(1000, v)
arr2 = np.tanh(arr)
newdict.append(arr2[-1])
For comparison, your original single threaded code runs in 1.171sec on my test machine. As you can see here, when it's not used properly, NumPy can be a couple orders of magnitude slower than even pure Python.
Now on to why you're seeing what you're seeing.
To be honest, I can't replicate your timing results. Your original multiprocessing code runs in 0.299sec for me macOS on Python 3.6), which is faster than the single process code. But if I have to take a guess, you're probably using Windows? In some platforms like Windows, creating a child process and setting up an environment to run multiprocessing task is very expensive, so using multiprocessing for a task that lasts less than a few seconds is of dubious benefit. If your are interested in why, read here.
Also, in platforms that lacks a usable fork() like MacOS after Python 3.8 or Windows, when you use multiprocessing, the child process has to reimport the module, so if you put both code in the same file, it has to run your single threaded code in the child processes before it can run the multiprocessing code. You'll likely want to put your test code in a function and protect the top level code with if __name__ == "__main__" block. On Mac with Python 3.8 or higher, you can also revert to using fork method by calling multiprocessing.set_start_method("fork") if you're not calling into Mac's non-fork-safe framework libraries.
With that out of the way, on to your title question.
When you use multiprocessing, you need to copy data to the child process and back to the main process to retrieve the result and there's a cost to spawning child processes. To benefit from multiprocessing, you need to design your workload so that this part of the cost is negligible.
If your data comes from external source, try loading the data in the child processes, rather than having the main process load the data then transfer it to the child process, have the main process tell the child how to fetch its slice of data. Here you're generating the testDict in the main process, so if you can, parallelize that and move them to the children instead.
Also, since you're using numpy, if you vectorise your operations properly, numpy will release the GIL while doing vectorised operations, so you may be able to just use multithreading instead. Since numpy doesn't hold GIL during vector operation, you can take advantage of multiple threads in a single Python process, and you don't need to fork or copy data over to child processes, as threads share memory.
I have been working on a python script using dask to speed up the processing time. At a high level, the script calls a dask delayed function a number of times to perform new computations. Each time the dask delayed function is called, it has no relation to the previous call. A simplified example of what I have is shown below.
def func1(list1):
for subList in list1:
t2 = dask.delayed(func2)(subList)
output.append(t2)
output = dask.compute(*output)
return output
def func2(subList):
#Some operations
return tuple2 # Large Tuple with combination of lists and numpy arrays
if __name__ == '__main__':
largeList = [..]
for list1 in largeList:
output = func1(list1)
print(output)
I noticed that as this program was being executed, each time func1 was called, the time to completion gradually got longer. At first I believed this was a memory issue because the variable, output, is typically a large tuple with many arrays and lists. However, looking at the Dask dashboard while the program was running, the 'Bytes Stored' plot didn't seem to max out. I know this is a very vague example, but does anybody have any ideas about why the func1 slows down the more times it is called? My hunch is still that the large tuple output has something to do with the issue. If so, how can I fix this problem? Any feedback on this would be greatly appreciated.
I was reading up on Python Memory Management and would like to reduce the memory footprint of my application. It was suggested that subprocesses would go a long way in mitigating the problem; but i'm having trouble conceptualizing what needs to be done. Could some one please provide a simple example of how to turn this...
def my_function():
x = range(1000000)
y = copy.deepcopy(x)
del x
return y
#subprocess_witchcraft
def my_function_dispatcher(*args):
return my_function()
...into a real subprocessed function that doesn't store an extra "free-list"?
Bonus Question:
Does this "free-list" concept apply to python c-extensions as well?
The important thing about the optimization suggestion is to make sure that my_function() is only invoked in a subprocess. The deepcopy and del are irrelevant — once you create five million distinct integers in a process, holding onto all of them at the same time, it's game over. Even if you stop referring to those objects, Python will free them by keeping references to five million empty integer-object-sized fields in a limbo where they await reuse for the next function that wants to create five million integers. This is the free list mentioned in the other answer, and it buys blindingly fast allocation and deallocation of ints and floats. It is only fair to Python to note that this is not a memory leak since the memory is definitely made available for further allocations. However, that memory will not get returned to the system until the process ends, nor will it be reused for anything other than allocating numbers of the same type.
Most programs don't have this problem because most programs do not create pathologically huge lists of numbers, free them, and then expect to reuse that memory for other objects. Programs using numpy are also safe because numpy stores numeric data of its arrays in tightly packed native format. For programs that do follow this usage pattern, the way to mitigate the problem is by not creating a large number of the integers at the same time in the first place, at least not in the process which needs to return memory to the system. It is unclear what exact use case you have, but a real-world solution will likely require more than a "magic decorator".
This is where subprocess come in: if the list of numbers is created in another process, then all the memory associated with the list, including but not limited to storage of ints, is both freed and returned to the system by the mere act of terminating the subprocess. Of course, you must design your program so that the list can be both created and processed in the subsystem, without requiring the transfer of all these numbers. The subprocess can receive information needed to create the data set, and can send back the information obtained from processing the list.
To illustrate the principle, let's upgrade your example so that the whole list actually needs to exist - say we're benchmarking sorting algorithms. We want to create a huge list of integers, sort it, and reliably free the memory associated with the list, so that the next benchmark can allocate memory for its own needs without worrying of running out of RAM. To spawn the subprocess and communicate, this uses the multiprocessing module:
# To run this, save it to a file that looks like a valid Python module, e.g.
# "foo.py" - multiprocessing requires being able to import the main module.
# Then run it with "python foo.py".
import multiprocessing, random, sys, os, time
def create_list(size):
# utility function for clarity - runs in subprocess
maxint = sys.maxint
randrange = random.randrange
return [randrange(maxint) for i in xrange(size)]
def run_test(state):
# this function is run in a separate process
size = state['list_size']
print 'creating a list with %d random elements - this can take a while... ' % size,
sys.stdout.flush()
lst = create_list(size)
print 'done'
t0 = time.time()
lst.sort()
t1 = time.time()
state['time'] = t1 - t0
if __name__ == '__main__':
manager = multiprocessing.Manager()
state = manager.dict(list_size=5*1000*1000) # shared state
p = multiprocessing.Process(target=run_test, args=(state,))
p.start()
p.join()
print 'time to sort: %.3f' % state['time']
print 'my PID is %d, sleeping for a minute...' % os.getpid()
time.sleep(60)
# at this point you can inspect the running process to see that it
# does not consume excess memory
Bonus Answer
It is hard to provide an answer to the bonus question, since the question is unclear. The "free list concept" is exactly that, a concept, an implementation strategy that needs to be explicitly coded on top of the regular Python allocator. Most Python types do not use that allocation strategy, for example it is not used for instances of classes created with the class statement. Implementing a free list is not hard, but it is fairly advanced and rarely undertaken without good reason. If some extension author has chosen to use a free list for one of its types, it can be expected that they are aware of the tradeoff a free list offers — gaining extra-fast allocation/deallocation at the cost of some additional space (for the objects on the free list and the free list itself) and inability to reuse the memory for something else.
I've hit the common problem of getting a pickle error when using the multiprocessing module.
My exact problem is that I need to give the function I'm calling some state before I call it in the pool.map function, but in doing so, I cause the attribute lookup __builtin__.function failed error found here.
Based on the linked SO answer, it looks like the only way to use a function in pool.map is to call the defined function itself so that it is looked up outside the scope of the current function.
I feel like I explained the above poorly, so here is the issue in code. :)
Testing without pool
# Function to be called by the multiprocessing pool
def my_func(x):
massive_list, medium_list, index1, index2 = x
result = [massive_list[index1 + x][index2:] for x in xrange(10)]
return result in medium_list
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = map(my_func, to_crunch)
This works A-OK and just as expected. The only thing "wrong" with it is that it's slow.
Pool Attempt 1
# (Note: my_func() remains the same)
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
pool = multiprocessing.Pool(2)
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = pool.map(my_func, to_crunch)
This technically works, but it is a stunning 18x slower! The slow down must be coming from not only copying the two massive data structures on each call, but also pickling/unpickling them as they get passed around. The non-pool version benefits from only having to pass the reference to the massive list around, rather than the actual list.
So, having tracked down the bottleneck, I try to store the two massive lists as state inside of my_func. That way, if I understand correctly, it will only need to be copied once for each worker (in my case, 4).
Pool Attempt 2:
I wrap up my_func in a closure passing in the two lists as stored state.
def build_myfunc(m,s):
def my_func(x):
massive_list = m # close the state in there
small_list = s
index1, index2 = x
result = [massive_list[index1 + x][index2:] for x in xrange(10)]
return result in medium_list
return my_func
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
modified_func = build_myfunc(data, source)
pool = multiprocessing.Pool(2)
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = pool.map(modified_func, to_crunch)
However, this returns the pickle error as (based on the above linked SO question) you cannot call a function with multiprocessing from inside of the same scope.
Error:
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
So, is there a way around this problem?
Map is a way to distribute workload. If you store the data in the func i think you vanish the initial purpose.
Let's try to find why it is slower. It's not normal and there must be something else.
First, the number of processes must be suitable for the machine running them. In your example you're using a pool of 2 processes so a total of 3 processes is involved. How many cores are on the system you're using? What else is running? What's the system load while crunching data?
What does the function do with the data? Does it access disk? Or maybe it uses DB which means there is probably another process accessing disk and cores.
What about memory? Is it sufficient for storing the initial lists?
The right implementation is your Attempt 1.
Try to profile the execution using iostat for example. This way you can spot the bottlenecks.
If it stalls on the cpu then you can try some tweaks to the code.
From another answer on Stackoverflow (by me so no problem copy and pasting it here :P ):
You're using .map() which collect the results and then returns. So for large dataset probably you're stuck in the collecting phase.
You can try using .imap() which is the iterator version on .map() or even the .imap_unordered() if the order of results is not important (as it seems from your example).
Here's the relevant documentation. Worth noting the line:
For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
I have three large lists. First contains bitarrays (module bitarray 0.8.0) and the other two contain arrays of integers.
l1=[bitarray 1, bitarray 2, ... ,bitarray n]
l2=[array 1, array 2, ... , array n]
l3=[array 1, array 2, ... , array n]
These data structures take quite a bit of RAM (~16GB total).
If i start 12 sub-processes using:
multiprocessing.Process(target=someFunction, args=(l1,l2,l3))
Does this mean that l1, l2 and l3 will be copied for each sub-process or will the sub-processes share these lists? Or to be more direct, will I use 16GB or 192GB of RAM?
someFunction will read some values from these lists and then performs some calculations based on the values read. The results will be returned to the parent-process. The lists l1, l2 and l3 will not be modified by someFunction.
Therefore i would assume that the sub-processes do not need and would not copy these huge lists but would instead just share them with the parent. Meaning that the program would take 16GB of RAM (regardless of how many sub-processes i start) due to the copy-on-write approach under linux?
Am i correct or am i missing something that would cause the lists to be copied?
EDIT:
I am still confused, after reading a bit more on the subject. On the one hand Linux uses copy-on-write, which should mean that no data is copied. On the other hand, accessing the object will change its ref-count (i am still unsure why and what does that mean). Even so, will the entire object be copied?
For example if i define someFunction as follows:
def someFunction(list1, list2, list3):
i=random.randint(0,99999)
print list1[i], list2[i], list3[i]
Would using this function mean that l1, l2 and l3 will be copied entirely for each sub-process?
Is there a way to check for this?
EDIT2 After reading a bit more and monitoring total memory usage of the system while sub-processes are running, it seems that entire objects are indeed copied for each sub-process. And it seems to be because reference counting.
The reference counting for l1, l2 and l3 is actually unneeded in my program. This is because l1, l2 and l3 will be kept in memory (unchanged) until the parent-process exits. There is no need to free the memory used by these lists until then. In fact i know for sure that the reference count will remain above 0 (for these lists and every object in these lists) until the program exits.
So now the question becomes, how can i make sure that the objects will not be copied to each sub-process? Can i perhaps disable reference counting for these lists and each object in these lists?
EDIT3 Just an additional note. Sub-processes do not need to modify l1, l2 and l3 or any objects in these lists. The sub-processes only need to be able to reference some of these objects without causing the memory to be copied for each sub-process.
Because this is still a very high result on google and no one else has mentioned it yet, I thought I would mention the new possibility of 'true' shared memory which was introduced in python version 3.8.0: https://docs.python.org/3/library/multiprocessing.shared_memory.html
I have here included a small contrived example (tested on linux) where numpy arrays are used, which is likely a very common use case:
# one dimension of the 2d array which is shared
dim = 5000
import numpy as np
from multiprocessing import shared_memory, Process, Lock
from multiprocessing import cpu_count, current_process
import time
lock = Lock()
def add_one(shr_name):
existing_shm = shared_memory.SharedMemory(name=shr_name)
np_array = np.ndarray((dim, dim,), dtype=np.int64, buffer=existing_shm.buf)
lock.acquire()
np_array[:] = np_array[0] + 1
lock.release()
time.sleep(10) # pause, to see the memory usage in top
print('added one')
existing_shm.close()
def create_shared_block():
a = np.ones(shape=(dim, dim), dtype=np.int64) # Start with an existing NumPy array
shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
# # Now create a NumPy array backed by shared memory
np_array = np.ndarray(a.shape, dtype=np.int64, buffer=shm.buf)
np_array[:] = a[:] # Copy the original data into shared memory
return shm, np_array
if current_process().name == "MainProcess":
print("creating shared block")
shr, np_array = create_shared_block()
processes = []
for i in range(cpu_count()):
_process = Process(target=add_one, args=(shr.name,))
processes.append(_process)
_process.start()
for _process in processes:
_process.join()
print("Final array")
print(np_array[:10])
print(np_array[10:])
shr.close()
shr.unlink()
Note that because of the 64 bit ints this code can take about 1gb of ram to run, so make sure that you won't freeze your system using it. ^_^
Generally speaking, there are two ways to share the same data:
Multithreading
Shared memory
Python's multithreading is not suitable for CPU-bound tasks (because of the GIL), so the usual solution in that case is to go on multiprocessing. However, with this solution you need to explicitly share the data, using multiprocessing.Value and multiprocessing.Array.
Note that usually sharing data between processes may not be the best choice, because of all the synchronization issues; an approach involving actors exchanging messages is usually seen as a better choice. See also Python documentation:
As mentioned above, when doing concurrent programming it is usually
best to avoid using shared state as far as possible. This is
particularly true when using multiple processes.
However, if you really do need to use some shared data then
multiprocessing provides a couple of ways of doing so.
In your case, you need to wrap l1, l2 and l3 in some way understandable by multiprocessing (e.g. by using a multiprocessing.Array), and then pass them as parameters.
Note also that, as you said you do not need write access, then you should pass lock=False while creating the objects, or all access will be still serialized.
For those interested in using Python3.8 's shared_memory module, it still has a bug (github issue link here) which hasn't been fixed and is affecting Python3.8/3.9/3.10 by now (2021-01-15). The bug affects posix systems and is about resource tracker destroys shared memory segments when other processes should still have valid access. So take care if you use it in your code.
If you want to make use of copy-on-write feature and your data is static(unchanged in child processes) - you should make python don't mess with memory blocks where your data lies. You can easily do this by using C or C++ structures (stl for instance) as containers and provide your own python wrappers that will use pointers to data memory (or possibly copy data mem) when python-level object will be created if any at all.
All this can be done very easy with almost python simplicity and syntax with cython.
# pseudo cython
cdef class FooContainer:
cdef char * data
def __cinit__(self, char * foo_value):
self.data = malloc(1024, sizeof(char))
memcpy(self.data, foo_value, min(1024, len(foo_value)))
def get(self):
return self.data
# python part
from foo import FooContainer
f = FooContainer("hello world")
pid = fork()
if not pid:
f.get() # this call will read same memory page to where
# parent process wrote 1024 chars of self.data
# and cython will automatically create a new python string
# object from it and return to caller
The above pseudo-code is badly written. Dont use it. In place of self.data should be C or C++ container in your case.
You can use memcached or redis and set each as a key value pair
{'l1'...