When inserting a huge pandas DataFrame into sqlite via SQLAlchemy and pandas to_sql with a specified chunksize, I would get memory errors.
At first I thought it was an issue with to_sql, but I tried a workaround where instead of using chunksize I used for i in range(100): df.iloc[i*100000:(i+1)*100000].to_sql(...), and that still resulted in an error.
It seems that, under certain conditions, there is a memory leak with repeated insertions to sqlite via SQLAlchemy.
I had a hard time replicating, through a minimal example, the memory leak that occurred when converting my data, but this gets pretty close.
import string
import numpy as np
import pandas as pd
from random import randint
import random

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    dfAll = pd.DataFrame(data)
    return dfAll

dfAll = alt(randint(1000, 2000), 10000)

for i in range(330):
    print('step ', i)
    data = alt(randint(1000, 2000), 10000)
    df = pd.DataFrame(data)
    dfAll = pd.concat([df, dfAll])
import sqlalchemy
from sqlalchemy import create_engine

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

for i in range(500):
    print('step', i)
    dfAll.iloc[(i%330)*10000:((i%330)+1)*10000].to_sql('test_table22', engine, index=False, if_exists='append')
This was run on a Google Colab CPU environment.
The database itself isn't causing the memory leak, because I can restart my environment and the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping to_sql or one to_sql call with chunksize specified.
Is there a way that this code could be run without causing an eventual increase in memory usage?
Edit:
To fully reproduce the error, run this notebook
https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS
The notebook requires you to import this folder into the main directory of your Google Drive
https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8
The notebook will also mount your Google Drive; you need to give it authorization to access your Google Drive. Since the data is hosted on my Google Drive, importing the data should not take up any of your allocated storage.
The Google Colab instance starts with about 12.72GB of RAM available.
After creating the DataFrame, theBigList, about 9.99GB of RAM has been used.
Already this is a rather uncomfortable situation to be in, since it is not unusual for
Pandas operations to require as much additional space as the DataFrame they operate on.
So we should strive to avoid using even this much RAM if possible, and fortunately there is an easy way to do this: simply load each .npy file and store its data in the sqlite database one at a time without ever creating theBigList (see below).
However, if we use the code you posted, we can see that the RAM usage slowly increases
as chunks of theBigList are stored in the database iteratively.
theBigList DataFrame stores the strings in a NumPy array. But in the process
of transferring the strings to the sqlite database, the NumPy strings are
converted into Python strings. This takes additional memory.
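As a rough, standalone illustration of that overhead (a minimal sketch, not code from the notebook), compare the single fixed-width NumPy buffer with the per-object cost of the equivalent Python strings:

import sys
import numpy as np

arr = np.array(['ABCDE12345'] * 1000, dtype='U10')   # one contiguous fixed-width buffer
py_strs = arr.tolist()                               # 1000 separate Python str objects

print(arr.nbytes)                                    # 40000 bytes (1000 * 10 chars * 4 bytes)
print(sum(sys.getsizeof(s) for s in py_strs))        # noticeably larger, due to per-object overhead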
Per this Theano tutorial, which discusses Python's internal memory management:
To speed-up memory allocation (and reuse) Python uses a number of lists for
small objects. Each list will contain objects of similar size: there will be a
list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object
needs to be created, either we reuse a free block in the list, or we allocate a
new one.
... The important point is that those lists never shrink.
Indeed: if an item (of size x) is deallocated (freed by lack of reference) its
location is not returned to Python’s global memory pool (and even less to the
system), but merely marked as free and added to the free list of items of size
x. The dead object’s location will be reused if another object of compatible
size is needed. If there are no dead objects available, new ones are created.
If small objects memory is never freed, then the inescapable conclusion is that,
like goldfishes, these small object lists only keep growing, never shrinking,
and that the memory footprint of your application is dominated by the largest
number of small objects allocated at any given point.
I believe this accurately describes the behavior you are seeing as this loop executes:
for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')
Even though many dead objects' locations are being reused for new strings, with essentially random strings such as those in theBigList, extra space is occasionally needed, and so the memory footprint keeps growing.
The process eventually hits Google Colab's 12.72GB RAM limit and the kernel is killed with a memory error.
In this case, the easiest way to avoid large memory usage is to never instantiate the entire DataFrame -- instead, just load and process small chunks of the DataFrame one at a time:
import numpy as np
import pandas as pd
import matplotlib.cbook as mc
import sqlalchemy as SA

def load_and_store(dbpath):
    engine = SA.create_engine("sqlite:///{}".format(dbpath))
    for i in range(0, 47):
        print('step {}: {}'.format(i, mc.report_memory()))
        for letter in list('ABCDEF'):
            path = '/content/gdrive/My Drive/SummarizationTempData/CS2Part{}{:02}.npy'.format(letter, i)
            comb = np.load(path, allow_pickle=True)
            toPD = pd.DataFrame(comb).drop([0, 2, 3], axis=1).astype(str)
            toPD.columns = ['title', 'abstract']
            toPD = toPD.loc[toPD['abstract'] != '']
            toPD.to_sql('CS_table', engine, index=False, if_exists='append')

dbpath = '/content/gdrive/My Drive/dbfile/CSSummaries.db'
load_and_store(dbpath)
which prints
step 0: 132545
step 1: 176983
step 2: 178967
step 3: 181527
...
step 43: 190551
step 44: 190423
step 45: 190103
step 46: 190551
The last number on each line is the amount of memory consumed by the process as reported by
matplotlib.cbook.report_memory. There are a number of different measures of memory usage. On Linux, mc.report_memory() is reporting
the size of the physical pages of the core image of the process (including text, data, and stack space).
By the way, another basic trick you can use to manage memory is to use functions.
Local variables inside the function are deallocated when the function terminates.
This relieves you of the burden of manually calling del and gc.collect().
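For instance, a minimal sketch of that pattern (the names here are just illustrative):

def summarize_chunk():
    tmp = [str(i) for i in range(10**6)]   # large temporary local object
    return len(tmp)                        # tmp becomes unreachable when the function returns

n = summarize_chunk()
# no explicit del or gc.collect() is needed for tmp itself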
Related
Let's say I've got some data in an existing dataset which has no column partitioning but is still 200+ files. I want to rewrite that data into a Hive partition. The dataset is too big to open in its entirety in memory. In this case the dataset came from a CETAS in Azure Synapse, but I don't think that matters.
I do:
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(datadir, format="parquet", filesystem=abfs)
new_part = ds.partitioning(pa.schema([("nodename", pa.string())]), flavor="hive")
scanner = dataset.scanner()
ds.write_dataset(scanner, newdatarootdir,
                 format="parquet", partitioning=new_part,
                 existing_data_behavior="overwrite_or_ignore",
                 max_partitions=2467)
In this case there are 2467 unique nodenames and so I want there to be 2467 directories each with 1 file. However what I get is 2467 directories each with 100+ files of about 10KB each. How do I get just 1 file per partition?
I could do a second step:
import os
import pyarrow.parquet as pq

for node in nodes:
    fullpath = f"{datadir}\\{node}"
    curnodeds = ds.dataset(fullpath, format="parquet")
    curtable = curnodeds.to_table()
    os.makedirs(f"{newbase}\\{node}")
    pq.write_table(curtable, f"{newbase}\\{node}\\part-0.parquet",
                   version="2.6", flavor="spark", data_page_version="2.0")
Is there a way to incorporate the second step into the first step?
Edit:
This is with pyarrow 7.0.0
Edit(2):
Using max_open_files=3000 did result in one file per partition. The metadata comparison between the two is that the two-step approach has (for one partition)...
<pyarrow._parquet.FileMetaData object at 0x000001C7FA414A40>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 8
num_rows: 24840
num_row_groups: 1
format_version: 2.6
serialized_size: 1673
size on disk: 278kb
and the one step...
<pyarrow._parquet.FileMetaData object at 0x000001C7FA414A90>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 8
num_rows: 24840
num_row_groups: 188
format_version: 1.0
serialized_size: 148313
size on disk: 1.04MB
Obviously, in the two-step version I'm explicitly setting the version to 2.6, so that explains that difference. The total size of my data after the two-step process is 655MB vs 2.6GB for the one-step. The time difference was pretty significant too: the two-step took something like 20 minutes for the first step and 40 minutes for the second, while the one-step took about 5 minutes for the whole thing.
The remaining question is, how to set version="2.6" and data_page_version="2.0" in write_dataset? I'll still wonder why the row_groups is so different if they're so different when setting those parameters but I'll defer that question.
The dataset writer was changed significantly in 7.0.0. Previously, it would always create 1 file per partition. Now there are a couple of settings that could cause it to write multiple files. Furthermore, it looks like you are ending up with a lot of small row groups, which is not ideal and is probably the reason the one-step process is both slower and larger.
The first significant setting is max_open_files. Some systems limit how many file descriptors can be open at one time. Linux defaults to 1024, so pyarrow defaults to ~900 (with the assumption that some file descriptors will be open for scanning, etc.). When this limit is exceeded, pyarrow will close the least recently used file. For some datasets this works well. However, if each batch has data for each file, this doesn't work well at all. In that case you probably want to increase max_open_files to be greater than your number of partitions (with some wiggle room, because you will have some files open for reading too). You may need to adjust OS-specific settings to allow this (generally, these OS limits are pretty conservative and raising them is fairly harmless).
I'll still wonder why the row_groups is so different if they're so different when setting those parameters but I'll defer that question.
In addition, the 7.0.0 release adds min_rows_per_group, max_rows_per_group and max_rows_per_file parameters to the write_dataset call. Setting min_rows_per_group to something like 1 million will cause the writer to buffer rows in memory until it has enough to write. This will allow you to create files with 1 row group instead of 188 row groups. This should bring down your file size and fix your performance issues.
However, there is a memory cost associated with this which is going to be min_rows_per_group * num_files * size_of_row_in_bytes.
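A sketch of the one-step write with those settings combined (reusing the variables from your first snippet; the exact values are just starting points to tune against your row size and memory budget):

import pyarrow.dataset as ds

# dataset, new_part and newdatarootdir defined as in the question
ds.write_dataset(dataset.scanner(), newdatarootdir,
                 format="parquet", partitioning=new_part,
                 existing_data_behavior="overwrite_or_ignore",
                 max_partitions=2467,
                 max_open_files=3000,          # more than the 2467 partitions, with wiggle room
                 min_rows_per_group=20000,     # buffer rows so each file gets a few larger row groups
                 max_rows_per_group=1000000)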
The remaining question is, how to set version="2.6" and data_page_version="2.0" in write_dataset?
The write_dataset call works on several different formats (e.g. csv, ipc, orc) and so the call only has generic options that apply regardless of format.
Format-specific settings can instead be set using the file_options parameter:
# It does not appear to be documented but make_write_options
# should accept most of the kwargs that write_table does
file_options = ds.ParquetFileFormat().make_write_options(version='2.6', data_page_version='2.0')
ds.write_dataset(..., file_options=file_options)
I would like to understand what is happening with a MemoryError that seems to occur more or less randomly.
I'm running a Python 3 program under Docker on an Azure VM (2 CPUs & 7GB RAM).
To keep it simple, the program deals with binary files that are read by a specific library (there's no problem there); then I merge them by pair of files and finally insert the data into a database.
The dataset that I get after the merge (and before the database insert) is a pandas DataFrame that contains around ~2.8M rows and 36 columns.
For the insertion into the database, I'm using a REST API that obliges me to insert the file in chunks.
Before that, I'm transforming the DataFrame into a StringIO buffer using this function:
# static method from the Utils class
@staticmethod
def df_to_buffer(my_df):
    count_row, count_col = my_df.shape
    buffer = io.StringIO()  # creating an empty buffer
    my_df.to_csv(buffer, index=False)  # filling that buffer
    LOGGER.info('Current data contains %d rows and %d columns, for a total '
                'buffer size of %d bytes.', count_row, count_col, buffer.tell())
    buffer.seek(0)  # set to the start of the stream
    return buffer
So in my "main" program the behaviour is :
# transform the dataframe to a StringIO buffer
file_data = Utils.df_to_buffer(file_df)
buffer_chunk_size = 32000000 #32MB
while True:
data = file_data.read(buffer_chunk_size)
if data:
...
# do the insert stuff
...
else:
# whole file has been loaded
break
# loop is over, close the buffer before processing a new file
file_data.close()
The problem:
Sometimes I am able to insert 2 or 3 files in a row. Sometimes a MemoryError occurs at a random moment (but always when it's about to insert a new file).
The error occurs on the first iteration of a file insert (never in the middle of a file). It specifically crashes on the line that does the chunked read: file_data.read(buffer_chunk_size).
I'm monitoring the memory during the process (using htop): it never goes higher than 5.5GB, and when the crash occurs the process is using around ~3.5GB of memory...
Any information or advice would be appreciated,
thanks. :)
EDIT
I was able to debug and more or less identify the problem, but I have not solved it yet.
It occurs when I read the StringIO buffer by chunk. Each data chunk greatly increases RAM consumption, because it is a big str containing 32000000 characters of the file.
I tried to reduce the chunk size from 32000000 to 16000000. I was able to insert some files, but after some time the MemoryError occurred again... I'm trying to reduce it to 8000000 right now.
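For what it's worth, a small standalone sketch that shows the cost of a single chunk read (the 'x' payload just stands in for the CSV text):

import io
import sys

buf = io.StringIO('x' * 32000000)   # stand-in for the CSV buffer
chunk = buf.read(32000000)          # the same read the insert loop performs
print(sys.getsizeof(chunk))         # roughly 32 MB for an ASCII-only str,
                                    # allocated on top of the buffer itself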
I am trying to run a series of operations on a JSON file using Dask and read_text, but I find that when I check the Linux system monitor, only one core is ever used, at 100%. How do I know whether the operations I am performing on a Dask Bag can be parallelized? Here is the basic layout of what I am doing:
import dask.bag as db
import json
js = db.read_text('path/to/json').map(json.loads).filter(lambda d: d['field'] == 'value')
result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords, stop=stop).compute()
The basic premise is to extract text entries from the JSON file and then perform some cleaning operations on them. This seems like something that can be parallelized, since each piece of text could be handed off to a processor: each text, and the cleaning of each text, is independent of the others. Is this an incorrect thought? Is there something I should be doing differently?
Thanks.
The read_text function breaks up a file into chunks based on byte ranges. My guess is that your file is small enough to fit into one chunk. You can check this by looking at the .npartitions attribute.
>>> js.npartitions
1
If so, then you might consider reducing the blocksize to increase the number of partitions:
>>> js = db.read_text(..., blocksize=1e6)  # 1MB chunks
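Putting it together, a sketch assuming the same line-delimited JSON file and field names from your snippet:

import json
import dask.bag as db

js = db.read_text('path/to/json', blocksize=int(1e6)).map(json.loads)
print(js.npartitions)    # should now be greater than 1

result = js.filter(lambda d: d['field'] == 'value').pluck('field')
print(result.take(5))    # the work is now spread across partitions, and hence across cores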
So I'm writing a Python script for indexing the Bitcoin blockchain by addresses, using a leveldb database (py-leveldb), and it keeps eating more and more memory until it crashes. I've replicated the behaviour in the code example below. When I run the code it continues to use more and more memory until it's exhausted the available RAM on my system and the process is either killed or throws "std::bad_alloc".
Am I doing something wrong? I keep writing to the batch object and commit it every once in a while, but the memory usage keeps increasing even though I commit the data in the WriteBatch object. I even delete the WriteBatch object after committing it, so as far as I can see that can't be what is causing the memory leak.
Is my code using WriteBatch in the wrong way, or is there a memory leak in py-leveldb?
The code requires py-leveldb to run, get it from here: https://pypi.python.org/pypi/leveldb
WARNING: RUNNING THIS CODE WILL EXHAUST YOUR MEMORY IF IT RUNS LONG ENOUGH. DO NOT RUN IT ON A CRITICAL SYSTEM. Also, it will write the data to a folder in the same folder the script runs in; on my system this folder contains about 1.5GB worth of database files before memory is exhausted (it ends up consuming over 3GB of RAM).
Here's the code:
import leveldb, random, string

RANDOM_DB_NAME = "db-DetmREnTrKjd"
KEYLEN = 10
VALLEN = 30
num_keys = 1000
iterations = 100000000
commit_every = 1000000

leveldb.DestroyDB(RANDOM_DB_NAME)
db = leveldb.LevelDB(RANDOM_DB_NAME)
batch = leveldb.WriteBatch()

#generate a random list of keys to be used
key_list = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(KEYLEN)) for i in range(0, num_keys)]

for k in xrange(iterations):
    #select a random key from the key list
    key_index = random.randrange(0, 1000)
    key = key_list[key_index]

    try:
        prev_val = db.Get(key)
    except KeyError:
        prev_val = ""

    random_val = ''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(VALLEN))

    #write the current random value plus any value that might already be there
    batch.Put(key, prev_val + random_val)

    if k % commit_every == 0:
        print "Comitting batch %d/%d..." % (k/commit_every, iterations/commit_every)
        db.Write(batch, sync=True)
        del batch
        batch = leveldb.WriteBatch()

db.Write(batch, sync=True)
You should really try Plyvel instead. See https://plyvel.readthedocs.org/. It has way cleaner code, more features, more speed, and a lot more tests. I've used it for bulk writing to quite large databases (20+ GB) without any issues.
(Full disclosure: I'm the author.)
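A minimal sketch of the same batched-write pattern in Plyvel (the path and counts are just placeholders):

import plyvel

db = plyvel.DB('/tmp/plyvel-test', create_if_missing=True)
with db.write_batch() as wb:           # the batch is applied atomically when the block exits
    for i in range(100000):
        wb.put(b'key-%d' % i, b'value-%d' % i)
db.close()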
I use http://code.google.com/p/leveldb-py/
I don't have enough information to participate in a Python leveldb driver bake-off, but I love the simplicity of leveldb-py. It is a single Python file using ctypes. I've used it to store documents under about 3 million keys, totalling about 10GB, and never noticed memory problems.
To your actual problem:
You may try working with the batch size.
Your code, using leveldb-py and doing a put for every key, worked fine on my system using less than 20MB of memory.
I take from here (http://ayende.com/blog/161412/reviewing-leveldb-part-iii-writebatch-isnt-what-you-think-it-is) that there are quite a few memory copies going on under the hood in leveldb.
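For example, a sketch that keeps the py-leveldb calls from your script but commits much smaller batches (the database name and commit interval below are just placeholders to experiment with):

import leveldb, random, string

db = leveldb.LevelDB("db-batch-test")          # throwaway test database
commit_every = 10000                           # instead of 1000000

batch = leveldb.WriteBatch()
for k in range(1000000):
    key = ''.join(random.choice(string.ascii_uppercase) for _ in range(10))
    batch.Put(key, ''.join(random.choice(string.ascii_uppercase) for _ in range(30)))
    if k % commit_every == 0:
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()           # fresh batch so its buffered copies can be released
db.Write(batch, sync=True)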
I have a Python program that processes fairly large NumPy arrays (in the hundreds of megabytes), which are stored on disk in pickle files (one ~100MB array per file). When I want to run a query on the data I load the entire array, via pickle, and then perform the query (so that from the perspective of the Python program the entire array is in memory, even if the OS is swapping it out). I did this mainly because I believed that being able to use vectorized operations on NumPy arrays would be substantially faster than using for loops through each item.
I'm running this on a web server which has memory limits that I quickly run up against. I have many different kinds of queries that I run on the data so writing "chunking" code which loads portions of the data from separate pickle files, processes them, and then proceeds to the next chunk would likely add a lot of complexity. It'd definitely be preferable to make this "chunking" transparent to any function that processes these large arrays.
It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
PyTables is a package for managing hierarchical datasets. It is designed to solve this problem for you.
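A minimal sketch of the sort of workflow PyTables supports (the file name and shapes are placeholders): append chunks to an extendable, disk-backed array once, then slice it later without loading everything.

import numpy as np
import tables

# write: append chunks to a disk-backed, extendable array
with tables.open_file('arrays.h5', mode='w') as f:
    earr = f.create_earray(f.root, 'data', atom=tables.Float64Atom(), shape=(0, 10))
    for _ in range(5):
        earr.append(np.random.rand(1000, 10))

# query: only the rows you slice are read into memory
with tables.open_file('arrays.h5', mode='r') as f:
    chunk = f.root.data[2000:3000]        # NumPy array of just that slice
    print(chunk.mean())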
NumPy's memory-mapped data structure (memmap) might be a good choice here.
You access your NumPy arrays from a binary file on disk, without loading the entire file into memory at once.
(Note: I believe, but am not certain, that NumPy's memmap object is not the same as Python's mmap; in particular, NumPy's is array-like, Python's is file-like.)
The method signature is:
A = NP.memmap(filename, dtype, mode, shape, order='C')
All of the arguments are straightforward (i.e., they have the same meaning as used elsewhere in NumPy) except for 'order', which refers to the order of the ndarray's memory layout. I believe the default is 'C', and the (only) other option is 'F', for Fortran; as elsewhere, these two options represent row-major and column-major order, respectively.
The two relevant operations are:
flush (which writes to disk any changes you make to the array); and
closing the underlying memory map (deleting the memmap object, or letting it go out of scope, flushes any pending changes and releases the map).
Example use:
import numpy as NP
from tempfile import mkdtemp
import os.path as PH

my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
my_data = NP.array(my_data, dtype="float")

fname = PH.join(mkdtemp(), 'tempfile.dat')
mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=(1000, 10))

# now write the data to the memmap array:
mm_obj[:] = my_data[:]

# reload the memmap:
mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))

# verify that it's there!:
print(mm_obj[:20, :])
It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
Yes, but not by keeping the arrays on disk in a single pickle -- the pickle protocol just isn't designed for "incremental deserialization".
You can write multiple pickles to the same open file, one after the other (use dump, not dumps), and then the "lazy evaluator for iteration" just needs to use pickle.load each time.
Example code (Python 3.1 -- in 2.any you'll want cPickle instead of pickle and a -1 for protocol, etc, of course;-):
>>> import pickle
>>> lol = [range(i) for i in range(5)]
>>> fp = open('/tmp/bah.dat', 'wb')
>>> for subl in lol: pickle.dump(subl, fp)
...
>>> fp.close()
>>> fp = open('/tmp/bah.dat', 'rb')
>>> def lazy(fp):
... while True:
... try: yield pickle.load(fp)
... except EOFError: break
...
>>> list(lazy(fp))
[range(0, 0), range(0, 1), range(0, 2), range(0, 3), range(0, 4)]
>>> fp.close()