I am currently trying to store some data in .h5 files. I quickly realised that I might have to store the data in parts, as it is not possible to process all of it in RAM. I started out using numpy.array to reduce memory usage, but that resulted in days spent on formatting data.
So I went back to using lists, but made the program monitor the memory usage;
when it rises above a specified value, a part is stored in numpy format, so that another process can load it and make use of it. The problem with doing this is that what I thought would release my memory isn't releasing it. For some reason the reported memory stays the same even though I reset the variable and del the variable. Why isn't the memory being released here?
import numpy as np
import os
import resource
import sys
import gc
import math
import h5py
import SecureString
import objgraph
from numpy.lib.stride_tricks import as_strided as ast
total_frames = 15
total_frames_with_deltas = total_frames*3
dim = 40
window_height = 5
def store_file(file_name,data):
    with h5py.File(file_name,'w') as f:
        f["train_input"] = np.concatenate(data,axis=1)

def load_data_overlap(saved):
    #os.chdir(numpy_train)
    print "Inside function!..."
    if saved == False:
        train_files = np.random.randint(255,size=(1,40,690,4))
        train_input_data_interweawed_normalized = []
        print "Storing train pic to numpy"
        part = 0
        for i in xrange(100000):
            print resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            if resource.getrusage(resource.RUSAGE_SELF).ru_maxrss > 2298842112/10:
                print "Max ram storing part: " + str(part) + " At entry: " + str(i)
                print "Storing Train input"
                file_name = 'train_input_'+'part_'+str(part)+'_'+str(dim)+'_'+str(total_frames_with_deltas)+'_window_height_'+str(window_height)+'.h5'
                store_file(file_name,train_input_data_interweawed_normalized)
                part = part + 1
                del train_input_data_interweawed_normalized
                gc.collect()
                train_input_data_interweawed_normalized = []
                raw_input("something")
            for plot in train_files:
                overlaps_reshaped = np.random.randint(10,size=(45,200,5,3))
                for ind_plot in overlaps_reshaped.reshape(overlaps_reshaped.shape[1],overlaps_reshaped.shape[0],overlaps_reshaped.shape[2],overlaps_reshaped.shape[3]):
                    ind_plot_reshaped = ind_plot.reshape(ind_plot.shape[0],1,ind_plot.shape[1],ind_plot.shape[2])
                    train_input_data_interweawed_normalized.append(ind_plot_reshaped)
        print len(train_input_data_interweawed_normalized)
    return train_input_data_interweawed_normalized
#------------------------------------------------------------------------------------------------------------------------------------------------------------
saved = False
train_input = load_data_overlap(saved)
output:
.....
223662080
224772096
225882112
226996224
228106240
229216256
230326272
Max ram storing part: 0 At entry: 135
Storing Train input
something
377118720
Max ram storing part: 1 At entry: 136
Storing Train input
something
377118720
Max ram storing part: 2 At entry: 137
Storing Train input
something
You need to explicitly force garbage collection; see here:
According to the Python official documentation, you can force the garbage collector to release unreferenced memory with gc.collect().
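A minimal sketch of that pattern applied to a chunked-storage loop like the one in the question (the chunk threshold, array shapes and file names here are placeholders, not the original values):

import gc
import numpy as np
import h5py

def store_part(file_name, data):
    # Same role as store_file() in the question.
    with h5py.File(file_name, 'w') as f:
        f["train_input"] = np.concatenate(data, axis=1)

chunk, part = [], 0
for i in range(5000):
    chunk.append(np.random.randint(10, size=(200, 1, 5, 3)))
    if len(chunk) >= 500:                    # stand-in for the RAM threshold check
        store_part('train_input_part_' + str(part) + '.h5', chunk)
        part += 1
        del chunk                            # drop the only reference to the list
        gc.collect()                         # ask the collector to reclaim it now
        chunk = []                           # continue with a fresh, empty list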
Related
I got the following code:
import sys
from sentence_transformers import InputExample
from lib import DataLoader as DL
def load_train_data():
    train_sentences = DL.load_entire_corpus(data_corpus_path)  # Loading from 9GB files
    # 16GB of memory is now allocated by this process
    train_data = []
    for i in range(len(train_sentences)):
        s = train_sentences.pop()  # Use pop to release item for garbage collector
        train_data.append(InputExample(texts=[s, s]))  # Problem is around here I guess
    return train_data

train_data = load_train_data()
The files loaded in DL.load_entire_corpus contain lists of sentences.
The code crashes because more than 32GB of RAM gets allocated during the process. Up to the for loop, about 16GB is allocated; during the for loop it rises towards 32GB, which leads to a crash or a hanging system.
print(sys.getsizeof(train_sentences) + sys.getsizeof(train_data)) within the for loop never reports more than 10GB. There is no other process that could be allocating the RAM.
What am I missing?
getsizeof() generally goes only one level deep. For a list, it returns a measure of the bytes consumed by the list structure itself, but not bytes consumed by the objects the list contains. For example,
>>> import sys
>>> x = list(range(2000))
>>> sys.getsizeof(x)
16056
>>> for i in range(len(x)):
...     x[i] = None
...
>>> sys.getsizeof(x)
16056
See? The result doesn't change regardless of what the list contains. The only thing that matters to getsizeof() is len(the_list).
So the "missing" RAM is almost certainly being consumed by the InputExample(texts=[s, s]) objects you're appending to train_data.
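If you want to see where that memory is actually going, you have to measure the contained objects too. One way is the third-party pympler package; a rough sketch (the sentences here are placeholder data, and the numbers are only estimates):

import sys
from pympler import asizeof

sentences = ["sentence %d" % i for i in range(3)]   # placeholder data
pairs = [[s, s] for s in sentences]

print(sys.getsizeof(pairs))      # shallow: just the outer list structure
print(asizeof.asizeof(pairs))    # deep: outer list + inner lists + strings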
In the normal code everything is fine; I can read and write correctly.
for i in range(10):
    sleep(1)
    m = mmap.mmap(-1, 1024, access=mmap.ACCESS_WRITE, tagname='share_mmap')
    m.seek(0)
    cnt = m.read_byte()
    if cnt == 0:
        print("Load data to memory")
        m.seek(0)
        m.write(b"FFFFFFFFFFFFFFFFFF")
    else:
        m.seek(0)
        info_str = m.read().translate(None, b'\x00').decode()
        print("The data is in memory: ", info_str)
Result:
> Load data to memory
> The data is in memory: FFFFFFFFFFFFFFFFFF
> The data is in memory: FFFFFFFFFFFFFFFFFF
But if I wrap it with contextlib.closing, like all the tutorials' code does, then I can't read the data back any more.
for i in range(10):
    sleep(1)
    # !!!! LOOK AT HERE !!!! I only changed this line.
    with contextlib.closing(mmap.mmap(-1, 1024, access=mmap.ACCESS_WRITE,
                                      tagname='share_mmap')) as m:
        m.seek(0)
        cnt = m.read_byte()
        if cnt == 0:
            print("Load data to memory")
            m.seek(0)
            m.write(b"FFFFFFFFFFFFFFFFFF")
        else:
            m.seek(0)
            info_str = m.read().translate(None, b'\x00').decode()
            print("The data is in memory: ", info_str)
Result:
> Load data to memory
> Load data to memory
> Load data to memory
Why?
Also, why does everyone do it this way? Haven't they encountered this problem?
You're mapping anonymous memory as opposed to a file, so the only thing that's making you get the same mapping each time is the tag name. If you already have a mapping with a given tag name, then mmap gives you the same mapping again, but if not, you get a fresh mapping. Your first snippet works because the old mappings are still there (since you never closed them), but in your second snippet, the old mapping is gone before the new one gets created, and since it's anonymous memory, that means the data is gone too.
To fix the problem, either do the mapping outside the loop or map file-backed memory instead.
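A minimal sketch of the first option (the mapping is created once, outside the loop, so it stays alive between iterations; everything inside the loop is unchanged from the question):

import contextlib
import mmap
from time import sleep

# Create the anonymous mapping once and keep it open for the whole loop.
with contextlib.closing(mmap.mmap(-1, 1024, access=mmap.ACCESS_WRITE,
                                  tagname='share_mmap')) as m:
    for i in range(10):
        sleep(1)
        m.seek(0)
        cnt = m.read_byte()
        if cnt == 0:
            print("Load data to memory")
            m.seek(0)
            m.write(b"FFFFFFFFFFFFFFFFFF")
        else:
            m.seek(0)
            info_str = m.read().translate(None, b'\x00').decode()
            print("The data is in memory: ", info_str)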
Apparently it doesn't work with anonymous memory (fileno -1), since it does work with a real file, as illustrated below. Not sure if this is a bug or just an undocumented limitation.
import contextlib
import mmap
from time import sleep
import tempfile

with tempfile.TemporaryFile() as mmapfile:
    for i in range(5):
        sleep(1)
        # Works if you use a real file.
        with contextlib.closing(mmap.mmap(mmapfile.fileno(), 1024, access=mmap.ACCESS_WRITE,
                                          tagname='share_mmap')) as m:
            m.seek(0)
            cnt = m.read_byte()
            if cnt == 0:
                print("Load data to memory")
                m.seek(0)
                m.write(b"FFFFFFFFFFFFFFFFFF")
            else:
                m.seek(0)
                info_str = m.read().translate(None, b'\x00').decode()
                print("The data is in memory: ", info_str)
Output:
Load data to memory
The data is in memory: FFFFFFFFFFFFFFFFFF
The data is in memory: FFFFFFFFFFFFFFFFFF
The data is in memory: FFFFFFFFFFFFFFFFFF
The data is in memory: FFFFFFFFFFFFFFFFFF
Ok, here is my problem: I have a nested for loop in my program which runs on a single core. Since the program spends over 99% of its run time in this nested for loop, I would like to parallelize it. Right now I have to wait 9 days for the computation to finish. I tried to implement a parallel for loop using the multiprocessing library, but I can only find very basic examples and cannot transfer them to my problem. Here are the nested loops with random data:
import numpy as np

dist_n = 100
nrm = np.linspace(1,10,dist_n)
data_Y = 11000
data_I = 90000
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))

for t in range(data_Y):
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i,p] = np.sum(d**nrm[p])/nrm[p]

print(dist)
Please give me some advice on how to make it parallel.
There's a small overhead in starting a process (50 ms+, depending on data size), so it's generally best to multiprocess the largest block of code possible. From your comment it sounds like each iteration over t is independent, so we should be free to parallelize this.
When Python creates a new process you get a copy of the main process, so you have all your global data available, but when each process writes that data, it writes to its own local copy. This means dist[i,p] won't be available to the main process unless you explicitly pass it back with a return, which has some overhead (a sketch of that variant follows after the timing example below). In your situation, if each process writes dist[i,p] to a file then you should be fine; just don't try to write to the same file unless you implement some type of mutex access control.
#!/usr/bin/python
import time
import multiprocessing as mp
import numpy as np

data_Y = 11 #11000
data_I = 90 #90000
dist_n = 100
nrm = np.linspace(1,10,dist_n)
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))

def worker(t):
    st = time.time()
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            dist[i,p] = np.sum(d**nrm[p])/nrm[p]
    # Here - each worker opens a different file and writes to it
    print 'Worker time %4.3f mS' % (1000.*(time.time()-st))

if 1: # single threaded
    st = time.time()
    for x in map(worker, range(data_Y)):
        pass
    print 'Single-process total time is %4.3f seconds' % (time.time()-st)
    print

if 1: # multi-threaded
    pool = mp.Pool(28) # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
    print
If you increase data_Y/data_I back to their original sizes, the speed-up should approach the theoretical limit.
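For completeness, a minimal sketch of the return-based variant mentioned above, where each worker returns its block of results and the parent assembles them (the function name compute_block is mine, not from the original code):

import multiprocessing as mp
import numpy as np

dist_n = 100
nrm = np.linspace(1, 10, dist_n)
data_Y, data_I = 11, 90            # small sizes for the demo
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)

def compute_block(t):
    # Compute the full (data_I, dist_n) block for one value of t and
    # return it instead of writing into a shared global array.
    block = np.zeros((data_I, dist_n))
    for i in range(data_I):
        d = np.abs(I[i] - Y[t])
        for p in range(dist_n):
            block[i, p] = np.sum(d**nrm[p]) / nrm[p]
    return t, block

if __name__ == '__main__':
    results = {}
    pool = mp.Pool()                   # defaults to the number of CPUs
    for t, block in pool.imap_unordered(compute_block, range(data_Y)):
        results[t] = block             # assembled in the parent process
    pool.close()
    pool.join()

Returning the blocks has some pickling overhead, as noted above, but it keeps all results in the parent process without any file handling.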
The following code fills all my memory:
from sys import getsizeof
import numpy
# from http://stackoverflow.com/a/2117379/272471
def getSize(array):
    return getsizeof(array) + len(array) * getsizeof(array[0])

class test():
    def __init__(self):
        pass
    def t(self):
        temp = numpy.zeros([200,100,100])
        A = numpy.zeros([200], dtype = numpy.float64)
        for i in range(200):
            A[i] = numpy.sum( temp[i].diagonal() )
        return A

a = test()
memory_usage("before")
c = [a.t() for i in range(100)]
del a
memory_usage("After")
print("Size of c:", float(getSize(c))/1000.0)
The output is:
('>', 'before', 'memory:', 20588, 'KiB ')
('>', 'After', 'memory:', 1583456, 'KiB ')
('Size of c:', 8.92)
Why am I using ~1.5 GB of memory if c is ~ 9 KiB? Is this a memory leak? (Thanks)
The memory_usage function was posted on SO and is reported here for clarity:
def memory_usage(text = ''):
    """Memory usage of the current process in kilobytes."""
    status = None
    result = {'peak': 0, 'rss': 0}
    try:
        # This will only work on systems with a /proc file system
        # (like Linux).
        status = open('/proc/self/status')
        for line in status:
            parts = line.split()
            key = parts[0][2:-1].lower()
            if key in result:
                result[key] = int(parts[1])
    finally:
        if status is not None:
            status.close()
    print('>', text, 'memory:', result['rss'], 'KiB ')
    return result['rss']
The implementation of diagonal() failed to decrement a reference counter. This issue had been previously fixed, but the change didn't make it into 1.7.0.
Upgrading to 1.7.1 solves the problem! The release notes contain various useful identifiers, notably issue 2969.
The solution was provided by Sebastian Berg and Charles Harris on the NumPy mailing list.
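If upgrading is not immediately possible, a quick sanity check (my addition, not from the mailing-list thread) is to verify whether the running interpreter has the affected release:

import numpy

# The leak described above is specific to the 1.7.0 release;
# 1.7.1 contains the fix (see issue 2969 in the release notes).
if numpy.__version__ == '1.7.0':
    print('This NumPy release leaks references in ndarray.diagonal(); '
          'upgrade to 1.7.1 or later.')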
Python allocates memory from the OS when it needs some.
If it doesn't need that memory any longer, it may or may not return it to the OS.
But if it doesn't return it, the memory will be reused on subsequent allocations. You should check that; but supposedly the memory consumption won't increase any further.
About your estimations of memory consumption: As azorius already wrote, your temp array consumes 16 MB, while your A array consumes about 200 * 8 = 1600 bytes (+ 40 for internal reasons). If you take 100 of them, you are at 164000 bytes (plus some for the list).
Besides that, I have no explanation for the memory consumption you have.
I don't think sys.getsizeof returns what you expect
your numpy arrays are 64-bit floats (8 bytes per element), so 100 of them at 200 * 100 * 100 elements each take up (at least)
8 * 200 * 100 * 100 * 100 / (2.0**30) ≈ 1.49 GB
so at a minimum you would use about 1.5 GB for the 100 arrays; the last few hundred MB are the integers used for indexing the large numpy data and the 100 objects
It seems that sys.getsizeof always returns 80 no matter how large a numpy array is:
sys.getsizeof(np.zeros([200,1000,100])) # return 80
sys.getsizeof(np.zeros([20,100,10])) # return 80
In your code you delete a, which is a tiny factory object whose t method returns huge numpy arrays; you store these huge arrays in a list called c.
Try deleting c instead; then you should regain most of your RAM.
I have 43 text files that consume "232.2 MB on disk (232,129,355 bytes) for 43 items". I want to read them into memory (see code below). The problem I am having is that each file, which is about 5.3 MB on disk, causes Python to use an additional 100 MB of system memory. When I check the size of the dict with getsizeof() (see sample output below), the numbers are tiny: when Python is up to 3 GB of system memory, getsizeof() reports the dict is only using 6424 bytes. I don't understand what is using the memory.
What is using up all the memory?
The related link is different in that the reported memory use by python was "correct"
related question
I am not very interested in other solutions (a DB, ...); I am more interested in understanding what is happening so that I know how to avoid it in the future. That said, suggestions to use other Python built-ins (e.g. array rather than lists) are great if they help.
I have heard suggestions of using guppy to find what is using the memory.
sample output:
Loading into memory: ME49_800.txt
ME49_800.txt has 228484 rows of data
ME49_800.txt has 0 rows of masked data
ME49_800.txt has 198 rows of outliers
ME49_800.txt has 0 modified rows of data
280bytes of memory used for ME49_800.txt
43 files of 43 using 12568 bytes of memory
120
Sample data:
CellHeader=X Y MEAN STDV NPIXELS
0 0 120.0 28.3 25
1 0 6924.0 1061.7 25
2 0 105.0 17.4 25
Code:
import csv, os, glob
import sys

def read_data_file(filename):
    reader = csv.reader(open(filename, "U"),delimiter='\t')
    fname = os.path.split(filename)[1]
    data = []
    mask = []
    outliers = []
    modified = []

    maskcount = 0
    outliercount = 0
    modifiedcount = 0

    for row in reader:
        if '[MASKS]' in row:
            maskcount = 1
        if '[OUTLIERS]' in row:
            outliercount = 1
        if '[MODIFIED]' in row:
            modifiedcount = 1

        if row:
            if not any((maskcount, outliercount, modifiedcount)):
                data.append(row)
            elif not any((not maskcount, outliercount, modifiedcount)):
                mask.append(row)
            elif not any((not maskcount, not outliercount, modifiedcount)):
                outliers.append(row)
            elif not any((not maskcount, not outliercount, not modifiedcount)):
                modified.append(row)
            else: print '***something went wrong***'

    data = data[1:]
    mask = mask[3:]
    outliers = outliers[3:]
    modified = modified[3:]
    filedata = dict(zip((fname + '_data', fname + '_mask', fname + '_outliers', fname + '_modified'), (data, mask, outliers, modified)))
    return filedata

def ImportDataFrom(folder):
    alldata = {}
    infolder = glob.glob( os.path.join(folder, '*.txt') )
    numfiles = len(infolder)
    print 'Importing files from: ', folder
    print 'Importing ' + str(numfiles) + ' files from: ', folder

    for infile in infolder:
        fname = os.path.split(infile)[1]
        print "Loading into memory: " + fname
        filedata = read_data_file(infile)
        alldata.update(filedata)

        print fname + ' has ' + str(len(filedata[fname + '_data'])) + ' rows of data'
        print fname + ' has ' + str(len(filedata[fname + '_mask'])) + ' rows of masked data'
        print fname + ' has ' + str(len(filedata[fname + '_outliers'])) + ' rows of outliers'
        print fname + ' has ' + str(len(filedata[fname + '_modified'])) + ' modified rows of data'
        print str(sys.getsizeof(filedata)) +'bytes'' of memory used for '+ fname

    print str(len(alldata)/4) + ' files of ' + str(numfiles) + ' using ' + str(sys.getsizeof(alldata)) + ' bytes of memory'
    #print alldata.keys()
    print str(sys.getsizeof(ImportDataFrom))
    print ' '
    return alldata

ImportDataFrom("/Users/vmd/Dropbox/dna/data/rawdata")
The dictionary itself is very small - the bulk of the data is the whole content of the files stored in lists, containing one tuple per line. The 20x size increase is bigger than I expected but seems to be real. Splitting a 27-bytes line from your example input into a tuple, gives me 309 bytes (counting recursively, on a 64-bit machine). Add to this some unknown overhead of memory allocation, and 20x is not impossible.
Alternatives: for a more compact representation, you want to convert the strings to integers/floats and to pack them tightly (without all those pointers and separate objects). I'm talking not just one row (although that's a start), but a whole list of rows together - so each file will be represented by just four 2D arrays of numbers. The array module is a start, but really what you need here are numpy arrays:
# Using explicit field types for compactness and access by name
# (e.g. data[i]['mean'] == data[i][2]).
fields = [('x', int), ('y', int), ('mean', float),
('stdv', float), ('npixels', int)]
# The simplest way is to build lists as you do now, and convert them
# to numpy array when done.
data = numpy.array(data, dtype=fields)
mask = numpy.array(mask, dtype=fields)
...
This gives me 40 bytes spent per row (measured on the .data attribute; sys.getsizeof reports that the array has a constant overhead of 80 bytes, but doesn't see the actual data used). This is still ~1.5× the size of the original files, but should easily fit into RAM.
I see 2 of your fields are labeled "x" and "y" - if your data is dense, you could arrange it by them - data[x,y]==... - instead of just storing (x,y,...) records. Besides being slightly more compact, it would be the most sensible structure, allowing easier processing.
If you need to handle even more data than your RAM will fit, pytables is a good library for efficient access to compact (even compressed) tabular data in files. (It's much better at this than general SQL DBs.)
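For illustration, here is a small self-contained example of what such a structured array looks like once built; the rows are hand-converted from the sample data shown in the question:

import numpy

fields = [('x', int), ('y', int), ('mean', float),
          ('stdv', float), ('npixels', int)]

# One tuple per data line, already converted from strings to numbers.
rows = [(0, 0, 120.0, 28.3, 25),
        (1, 0, 6924.0, 1061.7, 25),
        (2, 0, 105.0, 17.4, 25)]

data = numpy.array(rows, dtype=fields)

print(data['mean'])         # all MEAN values as one compact float column
print(data[1]['npixels'])   # a single record, accessed by index and name
print(data.nbytes)          # bytes actually used by the packed records (40 per row)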
This line specifically gets the size of the function object:
print str(sys.getsizeof(ImportDataFrom))
that's unlikely to be what you're interested in.
The size of a container does not include the size of the data it contains. Consider, for example:
>>> import sys
>>> d={}
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*99
>>> sys.getsizeof(d)
140
>>> d['foo'] = 'x'*9999
>>> sys.getsizeof(d)
140
If you want the size of the container plus the size of all contained things you have to write your own (presumably recursive) function that reaches inside containers and digs for every byte. Or, you can use third-party libraries such as Pympler or guppy.
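A minimal sketch of such a recursive function; it only understands the common built-in containers, so treat the result as an estimate rather than an exact byte count:

import sys

def deep_getsizeof(obj, seen=None):
    """Approximate size of obj including the objects it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:              # don't count shared objects twice
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

d = {'foo': 'x' * 9999}
print(sys.getsizeof(d))      # size of the dict structure only
print(deep_getsizeof(d))     # dict plus its keys and values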