In my code, I create a list of objects inside a for-loop, something like this (the get_size function was obtained from here: How do I determine the size of an object in Python?):
import os
import psutil

def mem_now():
    mem = psutil.Process(os.getpid()).memory_info()
    mem = mem[0] / (1024. ** 2)  # resident set size in MiB
    return mem

class lista():
    def __init__(self, val):
        self.x = val
        self.y = val**2
        self.z = val - 1

intlst = []
objlst = []
for i in range(int(1e5)):
    g = lista(i)
    objlst.append(g)
    intlst.append(g.z)
print 'int_list_size = {:.2f} Mb'.format(get_size(intlst)/1e6)
print 'obj_list_size = {:.2f} Mb'.format(get_size(objlst)/1e6)
print 'mem_now (no del) = {:.2f} Mb'.format(mem_now())
del intlst
print 'mem_now (del int) = {:.2f} Mb'.format(mem_now())
del objlst
print 'mem_now (del obj) = {:.2f} Mb'.format(mem_now())
and the output is as follows:
mem_now = 45.11 Mb
int_list_size = 3.22 Mb
obj_list_size = 43.22 Mb
mem_now (no del) = 103.35 Mb
mem_now (del int) = 103.35 Mb
mem_now (del obj) = 69.10 Mb
As you can see, after deleting the lists, I still haven't cleared all the memory. The more attributes and the more iterations I run, the higher the final memory value. I want to completely delete all the objects I created; how can I do that?
I want to completely delete all the objects I created, how can I do that?
You did. You don't even need to use del (all that del does is remove a reference; the underlying object is cleaned up when there are no more references to it).
Consider a simpler example:
def alloc():
global x
x = list(range(1000000))
for i in range(1000):
alloc()
We repeatedly create the list from scratch in a loop. Even though we did nothing to "delete" the list from previous iterations (you do not need to, and should not try), we do not run out of memory - because the cleanup is automatic. (If you actually did run out of memory, Python would raise MemoryError for a failed allocation.) x takes about 36 megabytes to represent on my system (8 for the list, and 28 for the integers); I do not have 36 gigabytes of RAM, but there is no problem. It just takes a bit of time (about 19 seconds on my system); memory usage does not creep upwards over time.
When I then del x, on my system the process memory usage - as reported by Task Manager - promptly drops. Here's the thing, though: it might not do so consistently, or at all, or immediately, or it may depend on what tool you use to check. It's likely to depend on your operating system and other platform details. In any event, it's not something you could dream of controlling completely from Python, and I suspect a lot of the time you couldn't even do it from assembly. Don't worry about it. The operating system knows better than you how to assign memory to processes. That's why your computer has one.
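If you want to convince yourself that the objects really are reclaimed, here is a small sketch using weakref (my addition, not part of the original question) that makes the cleanup visible:
import gc
import weakref

class Thing:
    pass

obj = Thing()
ref = weakref.ref(obj)   # a weak reference does not keep obj alive
print(ref() is obj)      # True: the object still exists

del obj                  # drop the only strong reference
gc.collect()             # not strictly needed here; CPython frees non-cyclic objects immediately
print(ref())             # None: the object has been reclaimed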
I got the following code:
import sys
from sentence_transformers import InputExample
from lib import DataLoader as DL
def load_train_data():
train_sentences = DL.load_entire_corpus(data_corpus_path) # Loading from 9GB files
# 16GB of memory is now allocated by this process
train_data = []
for i in range(len(train_sentences)):
s = train_sentences.pop() # Use pop to release item for garbage collector
train_data.append(InputExample(texts=[s, s])) # Problem is around here I guess
return train_data
train_data = load_train_data()
The files loaded in DL.load_entire_corpus contain lists of sentences.
The code crashes because more than 32GB of RAM is allocated during the process. Up to the for-loop, around 16GB is allocated; during the for-loop it rises to 32GB, which leads to a crash or a hanging system.
print(sys.getsizeof(train_sentences) + sys.getsizeof(train_data)) within the for loop is never over 10GB. There is no other process that can allocate RAM.
What am I missing?
getsizeof() generally goes only one level deep. For a list, it returns a measure of the bytes consumed by the list structure itself, but not bytes consumed by the objects the list contains. For example,
>>> import sys
>>> x = list(range(2000))
>>> sys.getsizeof(x)
16056
>>> for i in range(len(x)):
... x[i] = None
>>> sys.getsizeof(x)
16056
See? The result doesn't change regardless of what the list contains. The only thing that matters to getsizeof() is len(the_list).
So the "missing" RAM is almost certainly being consumed by the InputExample(texts=[s, s]) objects you're appending to train_data.
I have a class that has a cache implemented as a dict for numpy arrays, which can occupy GBs of data.
class WorkOperations(object):
    def __init__(self):
        self.data_cache: Dict[str, Dict[str, Tuple[np.ndarray, np.ndarray]]] = {}

    def get_data(self, key):
        if key not in self.data_cache:
            self.add_data(key)
        return self.data_cache[key]

    def add_data(self, key):
        result = run_heavy_calculation(key)
        self.data_cache[key] = result
I am testing the code with this function -
import gc
def perform_operations():
    work_operations = WorkOperations()
    # input_keys gives a list of keys to process
    for key in input_keys():
        data = work_operations.get_data(key)
        do_some_operation(data)
    del work_operations
perform_operations()
gc.collect()
The result of run_heavy_calculation is heavy in memory and soon data_cache grows and occupies memory in GBs (which is expected).
But memory does not get released even after perform_operations() is done. I tried adding del work_operations and invoking gc.collect() but that did not help either. I checked memory of the process after several hours, but the memory was still not freed up.
If I don't use caching (data_cache) at all (at the cost of latency), memory never goes high.
I am wondering what it is that is taking up the memory. I tried running tracemalloc, but it just showed lines occupying memory in KBs. I also took a memory dump with gdb by looking at memory addresses from the process pmap and /proc/<pid>/smaps, but that is really long and even with a hex editor I couldn't figure out much.
I am measuring the memory used by the process with the top command, looking at RES. I also tried outputting memory at the end from within the Python process as well, with -
import psutil
import gc
import logging
GIGABYTE = 1024.0 * 1024.0 * 1024.0
perform_operations()
gc.collect()
memory_full_info = psutil.Process().memory_full_info()
logging.info(f"process memory after running the process {memory_full_info.uss / GIGABYTE}")
Could not reproduce on Ubuntu with this:
import itertools
import time
import os
from typing import Dict, Tuple
import numpy as np
import psutil # installed with pip
process = psutil.Process(os.getpid())
SIZE = 10**7
def run_heavy_calculation(key):
array1 = np.zeros(SIZE)
array2 = np.zeros(SIZE)
    # Linux overcommits memory: the pages for the arrays are reserved but not
    # actually backed by physical RAM until we touch them, so write some 1s into them
    # cf: https://stackoverflow.com/q/29516888/11384184
for i in range(0, SIZE, 1000):
array1[i] = 1
array2[i] = 1
return {key: (array1, array2)}
class WorkOperations(object):
def __init__(self):
self.data_cache: Dict[str, Dict[str, Tuple[np.ndarray, np.ndarray]]] = {}
def get_data(self, key):
if key not in self.data_cache:
self.add_data(key)
return self.data_cache[key]
def add_data(self, key):
result = run_heavy_calculation(key)
self.data_cache[key] = result
def perform_operations(input_keys):
work_operations = WorkOperations()
for key in input_keys():
data = work_operations.get_data(key)
time.sleep(0.2)
print(key, process.memory_info().rss / 10**9)
del work_operations
perform_operations(lambda: map(str, itertools.product("abcdefgh", "0123456789"))) # dummy keys
print("after operations", process.memory_info().rss / 10**9)
input("pause")
('a', '0') 0.113106944
('a', '1') 0.195014656
('a', '2') 0.276926464
...
('h', '7') 6.421118976
('h', '8') 6.503030784
('h', '9') 6.584942592
after operations 0.031363072
pause
It climbed up to about 6.5 GB of RAM, then the function returned and all of it was released.
You can add a finalizer (__del__) to the class WorkOperations:
def __del__(self):
print("deleted")
I see it printed between the last operation's print and the one after.
Although it is not guaranteed that this will always be the case (cf this question), it strongly indicates that everything is working as intended: even without the del, the function returns (hence the scope is lost), so the reference count for work_operations drops to 0, and the object gets GC'ed.
It can be checked with sys.getrefcount too:
print(sys.getrefcount(work_operations) - 1) # see https://stackoverflow.com/a/510417/11384184
del work_operations
which for me prints 1.
Please provide a Minimal Reproducible Example and info on your system.
The following code fills all my memory:
from sys import getsizeof
import numpy
# from http://stackoverflow.com/a/2117379/272471
def getSize(array):
return getsizeof(array) + len(array) * getsizeof(array[0])
class test():
def __init__(self):
pass
def t(self):
temp = numpy.zeros([200,100,100])
A = numpy.zeros([200], dtype = numpy.float64)
for i in range(200):
A[i] = numpy.sum( temp[i].diagonal() )
return A
a = test()
memory_usage("before")
c = [a.t() for i in range(100)]
del a
memory_usage("After")
print("Size of c:", float(getSize(c))/1000.0)
The output is:
('>', 'before', 'memory:', 20588, 'KiB ')
('>', 'After', 'memory:', 1583456, 'KiB ')
('Size of c:', 8.92)
Why am I using ~1.5 GB of memory if c is ~ 9 KiB? Is this a memory leak? (Thanks)
The memory_usage function was posted on SO and is reported here for clarity:
def memory_usage(text = ''):
"""Memory usage of the current process in kilobytes."""
status = None
result = {'peak': 0, 'rss': 0}
try:
# This will only work on systems with a /proc file system
# (like Linux).
status = open('/proc/self/status')
for line in status:
parts = line.split()
key = parts[0][2:-1].lower()
if key in result:
result[key] = int(parts[1])
finally:
if status is not None:
status.close()
print('>', text, 'memory:', result['rss'], 'KiB ')
return result['rss']
The implementation of diagonal() failed to decrement a reference counter. This issue had been previously fixed, but the change didn't make it into 1.7.0.
Upgrading to 1.7.1 solves the problem! The release notes contain various useful identifiers, notably issue 2969.
The solution was provided by Sebastian Berg and Charles Harris on the NumPy mailing list.
Python allocates memory from the OS when it needs it.
If it doesn't need it any longer, it may or may not return it.
But if it doesn't return it, the memory will be reused on subsequent allocations. You should check that; presumably the memory consumption won't keep increasing.
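A quick way to check this on your own system (psutil and the sizes here are my additions, not part of the question):
import os
import psutil
import numpy as np

def rss_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 2**20

data = [np.ones(2_000_000) for _ in range(10)]    # roughly 160 MB of float64
print('allocated  :', rss_mb())
del data
print('after del  :', rss_mb())                   # may or may not drop, depending on the allocator
data = [np.ones(2_000_000) for _ in range(10)]
print('reallocated:', rss_mb())                   # close to the first figure, not double it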
About your estimations of memory consumption: As azorius already wrote, your temp array consumes 16 MB, while your A array consumes about 200 * 8 = 1600 bytes (+ 40 for internal reasons). If you take 100 of them, you are at 164000 bytes (plus some for the list).
Besides that, I have no explanation for the memory consumption you have.
I don't think sys.getsizeof returns what you expect
your numpy vector A is 64 bit (8 bytes) - so it takes up (at least)
8 * 200 * 100 * 100 * 100 / (2.0**30) = 1.49 GiB
so at minimum you should use about 1.5 GB on the 100 arrays; the last few hundred MB are all the integers used for indexing the large numpy data and the 100 objects
It seems that sys.getsizeof always returns 80 no matter how large a numpy array is:
sys.getsizeof(np.zeros([200,1000,100])) # return 80
sys.getsizeof(np.zeros([20,100,10])) # return 80
In your code you delete a, which is a tiny factory object whose t method returns huge numpy arrays; you store these huge arrays in a list called c.
Try deleting c, then you should regain most of your RAM.
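If you want to see how much data a numpy array actually holds (since getsizeof may not tell you), ndarray.nbytes reports the size of the data buffer itself. A quick check, using the same shape as temp in the question:
import numpy as np

a = np.zeros([200, 100, 100])
print(a.nbytes)           # 16000000 bytes: 2,000,000 float64 elements * 8 bytes each
print(a.nbytes / 2**20)   # about 15.3 MiB per temp-sized array; 100 of them is ~1.5 GiB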
I found an answer about resizing an array in Python ctypes:
from ctypes import *
list = (c_int*1)()
def customresize(array, new_size):
resize(array, sizeof(array._type_)*new_size)
return (array._type_*new_size).from_address(addressof(array))
list[0] = 123
list = customresize(list, 5)
>>> list[0]
123
>>> list[4]
0
If I call it again:
list = customresize(list, 40)
it gives error:
ValueError: Memory cannot be resized because this object doesn't own it
Why does it work only the first time you call customresize()?
I also saw someone post another answer:
def customresize(array, new_size):
return (array._type_*new_size).from_address(addressof(array))
Here customresize() works no matter how many times you call it.
But it raises another question: I found that my python.exe does not use more memory when I resize the list to a larger size, which means no memory is actually allocated for the resized list. Isn't it dangerous to give access to memory without allocating it? Why is ctypes.resize designed this way? I'm really confused...
from_address is dangerous because it does not ensure that you have allocated the memory you are accessing, so it can lead to your application crashing or worse. In addition it does not own the memory it points to so if the original owner is deleted the memory could be reused for something else.
One option here would be to keep a reference to the original array:
def customresize(array, new_size):
    base = getattr(array, 'base', array)
    resize(base, sizeof(array._type_) * new_size)
    new_array = (array._type_ * new_size).from_address(addressof(base))
    new_array.base = base
    return new_array
Alternatively and much safer, you can just create a new array from the old one:
list = (c_int * 5)(*list)
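For illustration, a sketch (variable names are mine) of how the base-keeping version behaves across repeated calls, assuming customresize returns new_array as above:
from ctypes import addressof, c_int, resize, sizeof

def customresize(array, new_size):
    base = getattr(array, 'base', array)             # the object that actually owns the memory
    resize(base, sizeof(array._type_) * new_size)    # grow the owner's buffer; contents are kept
    new_array = (array._type_ * new_size).from_address(addressof(base))
    new_array.base = base                            # keep the owner alive alongside the view
    return new_array

buf = (c_int * 1)()
buf[0] = 123
buf = customresize(buf, 5)     # first call: buf itself owns the memory
buf = customresize(buf, 40)    # works again: we resize the remembered base, not the view
print(buf[0])                  # 123 - existing contents are preserved by resize()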
Let me summarize this issue myself, but please credit @ecatmur and the others.
The resize() function can be used to resize the memory buffer of an existing ctypes object. The function takes the object as its first argument and the requested size in bytes as its second argument. However, the resized object still has limited accessibility to the memory buffer, based on its original size. To solve the problem, 3 different functions are defined:
def customresize1(array, new_size):
resize(array, sizeof(array._type_)*new_size)
return (array._type_*new_size).from_address(addressof(array))
def customresize2(array, new_size):
return (array._type_*new_size).from_address(addressof(array))
def customresize3(array, new_size):
    base = getattr(array, 'base', array)
    resize(base, sizeof(array._type_) * new_size)
    new_array = (array._type_ * new_size).from_address(addressof(base))
    new_array.base = base
    return new_array
All functions return an object that shares the memory of the original owner; the returned object does not own the memory and cannot itself be resized (which is why calling customresize1 a second time gives the ValueError).
customresize2 does return a resized array, but keep in mind that from_address does not allocate any memory for the resize.
customresize3 keeps a record of the base object that owns the memory, but the returned object is still not the owner of the memory.
Since Python allocates memory dynamically and garbage-collects it, if you want to resize something you can simply re-create it with the new size, e.g.:
list = (c_int * NEW_SIZE)()
or you may want to keep the original values then:
list = (c_int * NEW_SIZE)(*list)
Suppose I have a large in-memory numpy array, and a function func that takes this giant array as input (together with some other parameters). func with different parameters can be run in parallel. For example:
def func(arr, param):
# do stuff to arr, param
# build array arr
pool = Pool(processes = 6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]
If I use the multiprocessing library, that giant array will be copied multiple times into different processes.
Is there a way to let different processes share the same array? This array object is read-only and will never be modified.
What's more complicated, if arr is not an array, but an arbitrary python object, is there a way to share it?
[EDITED]
I read the answer but I am still a bit confused. Since fork() is copy-on-write, we should not incur any additional cost when spawning new processes with the Python multiprocessing library. But the following code suggests there is a huge overhead:
from multiprocessing import Pool, Manager
import numpy as np;
import time
def f(arr):
return len(arr)
t = time.time()
arr = np.arange(10000000)
print "construct array = ", time.time() - t;
pool = Pool(processes = 6)
t = time.time()
res = pool.apply_async(f, [arr,])
res.get()
print "multiprocessing overhead = ", time.time() - t;
output (and by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):
construct array = 0.0178790092468
multiprocessing overhead = 0.252444982529
Why is there such huge overhead, if we didn't copy the array? And what part does the shared memory save me?
If you use an operating system that uses copy-on-write fork() semantics (like any common unix), then as long as you never alter your data structure it will be available to all child processes without taking up additional memory. You will not have to do anything special (except make absolutely sure you don't alter the object).
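A minimal sketch of that approach (module and variable names are mine), assuming a fork-based start method such as the default on Linux; the array is created before the pool so the workers inherit it instead of receiving a pickled copy:
import multiprocessing as mp
import numpy as np

# Created at module level, before the Pool forks, so children inherit it
# via copy-on-write instead of getting a pickled copy as an argument.
arr = np.arange(10_000_000)

def func(param):
    # Only read the inherited global; as long as nothing writes to it,
    # the pages stay shared between the processes.
    return int(arr[param]) + param

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        output = pool.map(func, range(6))
    print(output)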
The most efficient thing you can do for your problem would be to pack your array into an efficient array structure (using numpy or array), place that in shared memory, wrap it with multiprocessing.Array, and pass that to your functions. This answer shows how to do that.
If you want a writeable shared object, then you will need to wrap it with some kind of synchronization or locking. multiprocessing provides two methods of doing this: one using shared memory (suitable for simple values, arrays, or ctypes) or a Manager proxy, where one process holds the memory and a manager arbitrates access to it from other processes (even over a network).
The Manager approach can be used with arbitrary Python objects, but will be slower than the equivalent using shared memory because the objects need to be serialized/deserialized and sent between processes.
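A rough sketch of the shared-memory option (names and sizes are mine, not the linked answer's exact code): wrap the buffer in multiprocessing.Array, hand it to the workers at pool creation time, and view it as a numpy array inside each worker:
import multiprocessing as mp
import numpy as np

def init_worker(shared):
    # Make the shared Array available as a global in each worker process.
    global shared_arr
    shared_arr = shared

def worker(i):
    a = np.frombuffer(shared_arr.get_obj())   # numpy view over the shared buffer, no copy
    with shared_arr.get_lock():               # synchronize writes
        a[i] = i * i

if __name__ == '__main__':
    shared = mp.Array('d', 10)                # 10 doubles in shared memory, with a lock
    with mp.Pool(4, initializer=init_worker, initargs=(shared,)) as pool:
        pool.map(worker, range(10))
    print(np.frombuffer(shared.get_obj()))    # every process wrote into the same buffer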
There are a wealth of parallel processing libraries and approaches available in Python. multiprocessing is an excellent and well rounded library, but if you have special needs perhaps one of the other approaches may be better.
I ran into the same problem and wrote a little shared-memory utility class to work around it.
I'm using multiprocessing.RawArray (lock-free), and access to the arrays is not synchronized at all (lock-free), so be careful not to shoot yourself in the foot.
With this solution I get speedups by a factor of approximately 3 on a quad-core i7.
Here's the code:
Feel free to use and improve it, and please report back any bugs.
'''
Created on 14.05.2013
@author: martin
'''
import multiprocessing
import ctypes
import numpy as np
class SharedNumpyMemManagerError(Exception):
pass
'''
Singleton Pattern
'''
class SharedNumpyMemManager:
_initSize = 1024
_instance = None
def __new__(cls, *args, **kwargs):
if not cls._instance:
cls._instance = super(SharedNumpyMemManager, cls).__new__(
cls, *args, **kwargs)
return cls._instance
def __init__(self):
self.lock = multiprocessing.Lock()
self.cur = 0
self.cnt = 0
self.shared_arrays = [None] * SharedNumpyMemManager._initSize
def __createArray(self, dimensions, ctype=ctypes.c_double):
self.lock.acquire()
# double size if necessary
if (self.cnt >= len(self.shared_arrays)):
self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)
# next handle
self.__getNextFreeHdl()
# create array in shared memory segment
        shared_array_base = multiprocessing.RawArray(ctype, int(np.prod(dimensions)))
        # convert to a numpy array via ctypeslib
        self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)
        # reshape to the requested dimensions; reshape returns an array with the
        # same data and a new shape, and the result is a view on the original array
        self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)
# update cnt
self.cnt += 1
self.lock.release()
# return handle to the shared memory numpy array
return self.cur
def __getNextFreeHdl(self):
orgCur = self.cur
while self.shared_arrays[self.cur] is not None:
self.cur = (self.cur + 1) % len(self.shared_arrays)
if orgCur == self.cur:
raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')
def __freeArray(self, hdl):
self.lock.acquire()
# set reference to None
if self.shared_arrays[hdl] is not None: # consider multiple calls to free
self.shared_arrays[hdl] = None
self.cnt -= 1
self.lock.release()
def __getArray(self, i):
return self.shared_arrays[i]
    @staticmethod
def getInstance():
if not SharedNumpyMemManager._instance:
SharedNumpyMemManager._instance = SharedNumpyMemManager()
return SharedNumpyMemManager._instance
    @staticmethod
def createArray(*args, **kwargs):
return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)
    @staticmethod
def getArray(*args, **kwargs):
return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)
    @staticmethod
def freeArray(*args, **kwargs):
return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)
# Init Singleton on module load
SharedNumpyMemManager.getInstance()
if __name__ == '__main__':
import timeit
N_PROC = 8
INNER_LOOP = 10000
N = 1000
def propagate(t):
i, shm_hdl, evidence = t
a = SharedNumpyMemManager.getArray(shm_hdl)
for j in range(INNER_LOOP):
a[i] = i
class Parallel_Dummy_PF:
def __init__(self, N):
self.N = N
self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)
self.pool = multiprocessing.Pool(processes=N_PROC)
def update_par(self, evidence):
self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))
def update_seq(self, evidence):
for i in range(self.N):
propagate((i, self.arrayHdl, evidence))
def getArray(self):
return SharedNumpyMemManager.getArray(self.arrayHdl)
def parallelExec():
pf = Parallel_Dummy_PF(N)
print(pf.getArray())
pf.update_par(5)
print(pf.getArray())
def sequentialExec():
pf = Parallel_Dummy_PF(N)
print(pf.getArray())
pf.update_seq(5)
print(pf.getArray())
t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")
print("Sequential: ", t1.timeit(number=1))
print("Parallel: ", t2.timeit(number=1))
This is the intended use case for Ray, which is a library for parallel and distributed Python. Under the hood, it serializes objects using the Apache Arrow data layout (which is a zero-copy format) and stores them in a shared-memory object store so they can be accessed by multiple processes without creating copies.
The code would look like the following.
import numpy as np
import ray
ray.init()
@ray.remote
def func(array, param):
# Do stuff.
return 1
array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)
result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)
If you don't call ray.put then the array will still be stored in shared memory, but that will be done once per invocation of func, which is not what you want.
Note that this will work not only for arrays but also for objects that contain arrays, e.g., dictionaries mapping ints to arrays as below.
You can compare the performance of serialization in Ray versus pickle by running the following in IPython.
import numpy as np
import pickle
import ray
ray.init()
x = {i: np.ones(10**7) for i in range(20)}
# Time Ray.
%time x_id = ray.put(x) # 2.4s
%time new_x = ray.get(x_id) # 0.00073s
# Time pickle.
%time serialized = pickle.dumps(x) # 2.6s
%time deserialized = pickle.loads(serialized) # 1.9s
Serialization with Ray is only slightly faster than pickle, but deserialization is 1000x faster because of the use of shared memory (this number will of course depend on the object).
See the Ray documentation. You can read more about fast serialization using Ray and Arrow. Note I'm one of the Ray developers.
Like Robert Nishihara mentioned, Apache Arrow makes this easy, specifically with the Plasma in-memory object store, which is what Ray is built on.
I made brain-plasma specifically for this reason - fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickle'd bytestrings generated by pickle.dumps(...).
The key difference from Apache Ray and Plasma is that brain-plasma keeps track of object IDs for you. Any processes, threads, or programs running locally can share the variables' values by referring to the name through any Brain object.
$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma
from brain_plasma import Brain
brain = Brain(path='/tmp/plasma/')
brain['a'] = [1]*10000
brain['a']
# >>> [1,1,1,1,...]