Python "random" MemoryError - python

I would like to understand what is happening with a MemoryError that seems to occur more or less randomly.
I'm running a Python 3 program under Docker and on an Azure VM (2 CPUs & 7 GB RAM).
To keep it simple, the program deals with binary files that are read by a specific library (no problem there), then I merge them in pairs of files and finally insert the data into a database.
The dataset that I get after the merge (and before the db insert) is a Pandas dataframe that contains around 2.8M rows and 36 columns.
For the insertion into the database, I'm using a REST API that requires me to insert the file in chunks.
Before that, I'm transforming the dataframe into a StringIO buffer using this function:
# static method from the Utils class
@staticmethod
def df_to_buffer(my_df):
    count_row, count_col = my_df.shape
    buffer = io.StringIO()  # create an empty buffer
    my_df.to_csv(buffer, index=False)  # fill that buffer
    LOGGER.info('Current data contains %d rows and %d columns, for a total '
                'buffer size of %d bytes.', count_row, count_col, buffer.tell())
    buffer.seek(0)  # set to the start of the stream
    return buffer
So in my "main" program the behaviour is :
# transform the dataframe to a StringIO buffer
file_data = Utils.df_to_buffer(file_df)
buffer_chunk_size = 32000000 #32MB
while True:
data = file_data.read(buffer_chunk_size)
if data:
...
# do the insert stuff
...
else:
# whole file has been loaded
break
# loop is over, close the buffer before processing a new file
file_data.close()
The problem:
Sometimes I am able to insert 2 or 3 files in a row. Sometimes a MemoryError occurs at a random moment (but always when it's about to insert a new file).
The error occurs at the first iteration of a file insert (never in the middle of a file). It specifically crashes on the line that does the chunked read, file_data.read(buffer_chunk_size).
I'm monitoring the memory during the process (using htop): it never goes higher than 5.5 GB, and when the crash occurs there is only around 3.5 GB of used memory at that moment...
Any information or advice would be appreciated,
thanks. :)
EDIT
I was able to debug and more or less identify the problem, but I have not solved it yet.
It occurs when I read the StringIO buffer by chunk. The data chunk increases RAM consumption a lot, as it is a big str that contains 32,000,000 characters of the file.
I tried to reduce the chunk size from 32000000 to 16000000. I was able to insert some files, but after some time the MemoryError occurs again... I'm trying to reduce it to 8000000 right now.
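If the whole CSV does not have to live in memory at once, one alternative is to serialize the dataframe in row slices and send each slice separately, so no single giant string (or full StringIO buffer) is ever built. This is only a minimal sketch; insert_chunk is a hypothetical stand-in for the REST insert step:

def insert_df_in_chunks(my_df, rows_per_chunk=250_000):
    """Serialize and insert the dataframe slice by slice.

    Only one slice's CSV text is held in memory at a time, instead of a
    StringIO buffer containing the whole ~2.8M-row file.
    """
    for start in range(0, len(my_df), rows_per_chunk):
        chunk = my_df.iloc[start:start + rows_per_chunk]
        # include the header only in the first slice
        csv_text = chunk.to_csv(index=False, header=(start == 0))
        insert_chunk(csv_text)  # hypothetical REST insert call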

Related

How to optimize sequential writes with h5py to increase speed when reading the file afterwards?

I process some input data which, if I did it all at once, would give me a dataset of float32s and typical shape (5000, 30000000). (The length of the 0th axis is fixed, the 1st varies, but I do know what it will be before I start).
Since that's ~600GB and won't fit in memory I have to cut it up along the 1st axis and process it in blocks of (5000, blocksize). I cannot cut it up along the 0th axis, and due to RAM constraints blocksize is typically around 40000. At the moment I'm writing each block to an hdf5 dataset sequentially, creating the dataset like:
fout = h5py.File(fname, "a")
blocksize = 40000
block_to_write = np.random.random((5000, blocksize))
fout.create_dataset("data", data=block_to_write, maxshape=(5000, None))
and then looping through blocks and adding to it via
fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
fout["data"][:, -blocksize:] = block_to_write
This works and runs in an acceptable amount of time.
The end product I need to feed into the next step is a binary file for each row of the output. It's someone else's software so unfortunately I have no flexibility there.
The problem is that reading in one row like
fin = h5py.File(fname, 'r')
data = fin['data']
a = data[0,:]
takes ~4min and with 5000 rows, that's way too long!
Is there any way I can alter my write so that my read is faster? Or is there anything else I can do instead?
Should I make each individual row its own data set within the hdf5 file? I assumed that doing lots of individual writes would be too slow but maybe it's better?
I tried writing the binary files directly - opening them outside of the loop, writing to them during the loops, and then closing them afterwards - but I ran into OSError: [Errno 24] Too many open files. I haven't tried it but I assume opening the files and closing them inside the loop would make it way too slow.
Your question is similar to a previous SO/h5py question I recently answered: h5py extremely slow writing. Apparently you are getting acceptable write performance, and want to improve read performance.
The 2 most important factors that affect h5py I/O performance are: 1) chunk size/shape, and 2) size of the I/O data block. h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. I have also found write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. The size of your read data block is certainly large enough.
So, my initial hunch was to investigate chunk size influence on I/O performance. Setting the optimal chunk size is a bit of an art. Best way to tune the value is to enable chunking, let h5py define the default size, and see if you get acceptable performance. You didn't define the chunks parameter. However, because you defined the maxshape parameter, chunking was automatically enabled with a default size (based on the dataset's initial size). (Without chunking, I/O on a file of this size would be painfully slow.) An additional consideration for your problem: the optimal chunk size has to balance the size of the write data blocks (5000 x 40_000) vs the read data blocks (1 x 30_000_000).
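For reference, enabling chunking explicitly just means passing the chunks parameter when the dataset is created. A minimal sketch of the pattern (the chunk shape here is only an example value to tune, not a recommendation, and float32 is assumed as in the question):

import h5py
import numpy as np

n_rows, blocksize = 5_000, 40_000
first_block = np.random.random((n_rows, blocksize)).astype('f4')

with h5py.File("chunk_demo.h5", "w") as fout:
    # chunks=(10, blocksize) is 10*40_000*4 bytes, i.e. ~1.5 MB per chunk;
    # the shape has to balance the (5000 x 40_000) writes against the
    # (1 x N) single-row reads.
    fout.create_dataset("data", data=first_block, dtype='f4',
                        maxshape=(n_rows, None), chunks=(10, blocksize))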
I parameterized your code so I could tinker with the dimensions. When I did, I discovered something interesting. Reading the data is much faster when I run it as a separate process after creating the file. And, the default chunk size seems to give adequate read performance. (Initially I was going to benchmark different chunk size values.)
Note: I only created a 78GB file (4_000_000 columns). This takes >13mins to run on my Windows system. I didn't want to wait 90mins to create a 600GB file. You can modify n_blocks=750 if you want to test 30_000_000 columns. :-) All code at the end of this post.
Next I created a separate program to read the data. Read performance was fast with the default chunk size: (40, 625). Timing output below:
Time to read first row: 0.28 (in sec)
Time to read last row: 0.28
Interestingly, I did not get the same read times with every test. Values above were pretty consistent, but occasionally I would get a read time of 7-10 seconds. Not sure why that happens.
I ran 3 tests (In all cases block_to_write.shape=(500,40_000)):
default chunksize=(40,625) [95KB]; for 500x40_000 dataset (resized)
default chunksize=(10,15625) [596KB]; for 500x4_000_000 dataset (not resized)
user defined chunksize=(10,40_000) [1.526MB]; for 500x4_000_000 dataset (not resized)
Larger chunks improve read performance, but the speed with the default values is pretty fast. (Chunk size has a very small effect on write performance.) Output for all 3 below.
dataset chunkshape: (40, 625)
Time to read first row: 0.28
Time to read last row: 0.28
dataset chunkshape: (10, 15625)
Time to read first row: 0.05
Time to read last row: 0.06
dataset chunkshape: (10, 40000)
Time to read first row: 0.00
Time to read last row: 0.02
Code to create my test file below:
with h5py.File(fname, 'w') as fout:
    blocksize = 40_000
    n_blocks = 100
    n_rows = 5_000
    block_to_write = np.random.random((n_rows, blocksize))
    start = time.time()
    for cnt in range(n_blocks):
        incr = time.time()
        print(f'Working on loop: {cnt}', end='')
        if "data" not in fout:
            fout.create_dataset("data", shape=(n_rows, blocksize),
                                maxshape=(n_rows, None))  # , chunks=(10,blocksize))
        else:
            fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
        fout["data"][:, cnt*blocksize:(cnt+1)*blocksize] = block_to_write
        print(f' - Time to add block: {time.time()-incr:.2f}')

print(f'Done creating file: {fname}')
print(f'Time to create {n_blocks}x{blocksize:,} columns: {time.time()-start:.2f}\n')
Code to read 2 different arrays from the test file below:
with h5py.File(fname, 'r') as fin:
    print(f'dataset shape: {fin["data"].shape}')
    print(f'dataset chunkshape: {fin["data"].chunks}')
    start = time.time()
    data = fin["data"][0,:]
    print(f'Time to read first row: {time.time()-start:.2f}')
    start = time.time()
    data = fin["data"][-1,:]
    print(f'Time to read last row: {time.time()-start:.2f}')

Why does reading a large csv in pandas without low_memory lead to an out of memory error despite having enough RAM? [duplicate]

I'm trying to import a large tab/txt (size = 3 gb) file into Python using pandas pd.read_csv("file.txt",sep="\t"). The file I load was a ".tab" file of which I changed the extension to ".txt" to import it with read_csv(). It is a file with 305 columns and +/- 1 000 000 rows.
When I execute the code, after some time Python returns a MemoryError. I searched for some information and this basically means that there is not enough RAM available. When I specify nrows = 20 in read_csv() it works fine.
The computer I'm using has 46 GB of RAM, of which roughly 20 GB was available for Python.
My question: how is it possible that a 3 GB file needs more than 20 GB of RAM to be imported into Python using pandas read_csv()? Am I doing anything wrong?
EDIT: When executing df.dtypes the types are a mix of object, float64, and int64
UPDATE: I used the following code to overcome the problem and perform my calculations:
summed_cols = pd.DataFrame(columns=["sample", "read sum"])
x = 0  # column index counter
while x < 352:
    x = x + 1
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    summed_cols = summed_cols.append(
        pd.DataFrame({"sample": [sample_col.columns[0]],
                      "read sum": sum(sample_col[sample_col.columns[0]])}))
    del sample_col
It now selects a column, performs the calculation, stores the result in a dataframe, deletes the current column, and moves on to the next column.
Pandas is cutting up the file and storing the data individually. I don't know the data types, so I'll assume the worst: strings.
In Python (on my machine), an empty string needs 49 bytes, with an additional byte for each character if it's ASCII (or 74 bytes plus 2 extra bytes per character if it's Unicode). That's roughly 15 KB for a row of 305 empty fields. A million and a half such rows would take roughly 22 GB in memory, while they would take about 437 MB in a CSV file.
Pandas/numpy are good with numbers, as they can represent a numerical series very compactly (like a C program would). As soon as you step away from C-compatible datatypes, pandas uses memory the way Python does, which is... not very frugal.
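As a rough illustration (a sketch, not the asker's actual data: the column names below are hypothetical), you can measure the real in-memory footprint with memory_usage(deep=True) and shrink object columns by giving read_csv explicit dtypes or converting them to category:

import pandas as pd

# deep=True counts the Python string objects themselves,
# not just the 8-byte pointers stored in the object columns.
df = pd.read_csv("file.txt", sep="\t", nrows=100_000)
print(df.memory_usage(deep=True).sum() / 1e9, "GB for 100k rows")

# Re-read with leaner types: float32 instead of float64, and category
# for low-cardinality string columns.
dtypes = {"sample_id": "category", "score": "float32"}
df = pd.read_csv("file.txt", sep="\t", dtype=dtypes)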

Pickling and loading a pandas dataframe with each loop to save progress ...bad idea?

I have a for loop that does an API call over thousands of rows.
(I know for loops are not recommended but this api is rate limited so slow is better. I know I can also do iterrows but this is just an example)
Sometimes I come back to find the loop has failed or there's something wrong with the api and I need to stop the loop. This means I lose all my data.
I was thinking of pickling the dataframe at the end of each loop, and re-loading it at the start. This would save all updates made to the dataframe.
Fake example (not working code - this is just a 'what if'):
for i in range(len(df1)):
    # check if df pickle file in directory
    if pickle in directory:
        # load file
        df1 = pickle.load(df1)
        # append new data
        df1.loc[i, 'api_result'] = requests(http/api/call/data/)
        # dump it to file
        pickle.dump(df1)
    else:
        # start of loop
        # append new data
        df1.loc[i, 'api_result'] = requests(http/api/call/data/)
        # dump to file
        pickle.dump(df1)
And if this is not a good way to keep the updated file in case of failure or early stoppage, what is?
I think a good solution would be to append all the updates to a file as you go:
with open("updates.txt", "a") as f_o:
for i in range(len(df1)):
# append new data
f_o.write(requests(http/api/call/data/)+"\n")
If all the rows are present in the file, you can do a bulk update. If not, restart the updates from the last record that failed.
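As a concrete sketch of that resume logic (fetch_result below is a hypothetical stand-in for the rate-limited API call):

import os

RESULTS_FILE = "updates.txt"

def run_with_resume(df1):
    # Count how many rows were already written by a previous run.
    done = 0
    if os.path.exists(RESULTS_FILE):
        with open(RESULTS_FILE) as f:
            done = sum(1 for _ in f)

    # Append one line per row; a crash loses at most the current row.
    with open(RESULTS_FILE, "a") as f_o:
        for i in range(done, len(df1)):
            result = fetch_result(df1.iloc[i])  # hypothetical API call
            f_o.write(str(result) + "\n")
            f_o.flush()  # make sure the line hits disk before the next call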

Optimize BLF-reader for Python CAN performance

I have a big blf file, blf_file.blf, and an associated dbc file, dbc_file.dbc. I need to read and decode all the messages and store them in a list. For this I use the python-can library:
import can
import cantools

decoded_mess = []
db = cantools.db.load_file('dbc_file.dbc')

with can.BLFReader('blf_file.blf') as can_log:
    for msg in can_log:
        decoded_mess.append(
            db.decode_message(msg.arbitration_id, msg.data)
        )
However, for my BLF files (> 100 MB), this takes up to 5 minutes.
Is there a way to speed this up? In the end, I want to store every signal in a separate list, so a list comprehension is not an option.

Is this a good way to export all entities of a type to a csv file?

I have millions of entities of a particular type that I would like to export to a CSV file. The following code writes entities in batches of 1000 to a blob while keeping the blob open and deferring the next batch to the task queue. When there are no more entities to be fetched, the blob is finalized. This seems to work for most of my local testing, but I wanted to know:
If I am missing out on any gotchas or corner cases before running it on my production data and incurring $s for datastore reads.
If the deadline is exceeded or the memory runs out while the batch is being written to the blob, this code defaults to the start of the current batch when running the task again, which may cause a lot of duplication. Any suggestions to fix that?
def entities_to_csv(entity_type, blob_file_name='', cursor='', batch_size=1000):
    more = True
    next_curs = None
    q = entity_type.query()
    results, next_curs, more = q.fetch_page(batch_size, start_cursor=Cursor.from_websafe_string(cursor))
    if results:
        try:
            if not blob_file_name:
                blob_file_name = files.blobstore.create(mime_type='text/csv',
                                                        _blob_uploaded_filename='%s.csv' % entity_type.__name__)
            rows = [e.to_dict() for e in results]
            with files.open(blob_file_name, 'a') as f:
                writer = csv.DictWriter(f, restval='', extrasaction='ignore', fieldnames=results[0].keys())
                writer.writerows(rows)
            if more:
                deferred.defer(entities_to_csv, entity_type, blob_file_name, next_curs.to_websafe_string())
            else:
                files.finalize(blob_file_name)
        except DeadlineExceededError:
            deferred.defer(entities_to_csv, entity_type, blob_file_name, cursor)
Later in the code, something like:
deferred.defer(entities_to_csv,Song)
The problem with your current solution is that memory usage increases with every write you perform to the blobstore. The blobstore is immutable and writes all the data at once from memory.
You need to run the job on a backend that can hold all the records in memory: define a backend in your application and call defer with _target='<backend name>'.
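For illustration, the defer call would then look something like this (the backend name 'export-backend' is just a placeholder for whatever is declared in your backends configuration):

from google.appengine.ext import deferred

# Route the task to a dedicated backend instance with more memory.
deferred.defer(entities_to_csv, Song, _target='export-backend')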
Check out this Google I/O video; it pretty much describes what you want to do using MapReduce, starting at around the 23:15 mark. The code you want is at 27:19:
https://developers.google.com/events/io/sessions/gooio2012/307/
