I'm trying to write large HDF5 files into MongoDB. I'm following the example in this tutorial: http://api.mongodb.org/python/current/examples/bulk.html. I have a generator that loops through each row of the HDF file and yields a dictionary:
import pymongo

def gen():
    for file in files:
        data = load_file(file)
        for row in data:
            ob = dict()
            ob['a'] = int(row['a'])
            ob['b'] = int(row['b'])
            ob['c'] = int(row['c'])
            ob['d'] = row['d'].tolist()
            ob['e'] = row['e'].tolist()
            ob['f'] = row['f'].tolist()
            ob['g'] = row['g'].tolist()
            yield ob

def main():
    data = gen()
    db = pymongo.MongoClient().data_db
    db.data.insert(data)
This works fine, but as time goes on the Python process takes up more and more RAM, until it reaches 10GB and threatens to use up all memory. I think PyMongo is buffering this data in memory while it waits to write it to the database. Is there a way I can limit how big this buffer is, instead of letting it grow uncontrollably? It's strange that the default settings would cause me to run out of RAM.
PyMongo is designed to work the way you want: it iterates your generator until it has one batch of data (16 or 32MB, depending on MongoDB version). Someone contributed this ability to PyMongo last year. What MongoDB and PyMongo versions are you using?
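If you want a hard cap on client-side buffering regardless of driver version, one option is to drain the generator yourself in fixed-size chunks and insert each chunk explicitly. This is only a sketch, not the asker's code, and it assumes a PyMongo version that provides insert_many:
import itertools
import pymongo

def insert_in_chunks(collection, generator, chunk_size=1000):
    '''Pull at most chunk_size documents from the generator per insert.'''
    while True:
        chunk = list(itertools.islice(generator, chunk_size))
        if not chunk:
            break
        collection.insert_many(chunk)

db = pymongo.MongoClient().data_db
insert_in_chunks(db.data, gen())
With this approach the client never holds more than chunk_size documents at once, whatever batching the driver does internally.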
Related
I am looking for the most efficient way of saving text from PDF files into my database. Currently I am using pdfplumber with standard code looking like this:
my_string = ''
with pdfplumber.open(text_file_path) as pdf:
    for page in pdf.pages:
        if page.extract_text():
            my_string += str(page.extract_text().replace('\n', ' ').split(' '))
But the current code is literally killing my machine (it takes around 3 to 6 GB of RAM for a PDF with 600 pages), and my goal is to eventually host it on mobile phones.
I did some tests and it seems that reading the PDF is not a problem, but saving or storing those words is. I tried creating a dict where each page's string is one key/value, but it wasn't much better.
Maybe I should try yielding each page into a txt file and then just reading the string back from that txt file?
I will be grateful for any tips, thanks!
EDIT:
with pdfplumber.open(text_file_path) as pdf:
    for page in pdf.pages:
        connection = sqlite3.connect('my_db.db')
        cursor = connection.cursor()
        cursor.execute("INSERT INTO temp_text VALUES (?, ?)",
                       (text_file_path, str(page.extract_text()).replace('\n', ' ')))
        connection.commit()
        connection.close()
I changed the code to that, and it is a little bit better (now it takes up to around 2.9 GB of RAM), but it is still a lot. Can I do anything more about it?
The issue is that you're storing the data long-term: as you incrementally process more and more data, you keep referencing all of it in memory. This is exactly what a database is meant to prevent: efficient storage and retrieval of data without needing to hold it all in RAM. A simple example using PyMongo (for an iOS app, you're more likely to want SQLite) is the following:
import pdfplumber
import pymongo
import os

def process_file(path, collection):
    '''Process a single file, from a path.'''
    basename = os.path.splitext(os.path.basename(path))[0]
    with pdfplumber.open(path) as pdf:
        for index, page in enumerate(pdf.pages):
            # Don't store any long-term references to the data
            text = page.extract_text()
            data = {'text': text, 'filename': basename, 'page': index}
            collection.insert_one(data)

def main(paths):
    '''Just a dummy entry point, pass args normally here.'''
    client = pymongo.MongoClient('localhost', 27017)
    database = client['myapp']
    collection = database['pdfs']
    # Sort by filename, then by page.
    collection.create_index([('filename', 1), ('page', 1)])
    for path in paths:
        process_file(path, collection)
    # Do what you want here
# Do what you want here
As you can see, we create a connection to the local client, create or access the database we're using, and create a collection for PDF storage. We then index by the filename, then the page number.
We then iterate over all the paths and process them one at a time. We don't store text for more than a single page at a time, and we write the data to the database on every loop iteration. For performance this might not be optimal (although the engine might optimize it decently anyway), but it will minimize the memory required.
Avoid using global state that processes many gigabytes of data: you're forcing Python to keep a reference to all that data when it doesn't need it.
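For completeness, here is a rough sketch of the same per-page pattern using sqlite3 instead of MongoDB, since that is what you would more likely ship on a phone. The table name and schema are just illustrative, not something from your code:
import os
import sqlite3
import pdfplumber

def process_file(path, connection):
    '''Store one row per page; nothing is kept in memory after the INSERT.'''
    basename = os.path.splitext(os.path.basename(path))[0]
    with pdfplumber.open(path) as pdf:
        for index, page in enumerate(pdf.pages):
            text = page.extract_text() or ''
            connection.execute(
                'INSERT INTO pdf_pages (filename, page, text) VALUES (?, ?, ?)',
                (basename, index, text))
    connection.commit()

def main(paths):
    connection = sqlite3.connect('pdfs.db')
    connection.execute(
        'CREATE TABLE IF NOT EXISTS pdf_pages (filename TEXT, page INTEGER, text TEXT)')
    for path in paths:
        process_file(path, connection)
    connection.close()
Committing once per file (rather than once per page) keeps the number of transactions down while still bounding memory to a single page of text.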
I am comparing the performance of the two databases, plus csv. The data is 1 million rows by 5 columns of floats, bulk-inserted into sqlite/mongodb/csv from Python.
import csv
import sqlite3
import pymongo
import numpy as np
from time import time

N, M = 1000000, 5
data = np.random.rand(N, M)
docs = [{str(j): data[i, j] for j in range(len(data[i]))} for i in range(N)]
writing to csv takes 6.7 seconds:
%%time
with open('test.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for i in range(N):
        writer.writerow(data[i])
writing to sqlite3 takes 3.6 seconds:
%%time
con = sqlite3.connect('test.db')
con.execute('create table five(a, b, c, d, e)')
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
writing to mongo takes 14.2 seconds:
%%time
with pymongo.MongoClient() as client:
    start_w = time()
    client['warmup']['warmup'].insert_many(docs)
    start_w = time()
    db = client['test']
    coll = db['test']
    start = time()
    coll.insert_many(docs)
    end = time()
I am still new to this, but is it expected that mongodb could be 4x slower than sqlite, and 2x slower than csv, in similar scenarios? This is based on mongodb v4.4 with the WiredTiger engine, and Python 3.8.
I know mongodb excels when there is no fixed schema, but when each document has exactly the same key:value pairs, like the above example, are there methods to speed up the bulk insert?
EDIT: I tested adding a warmup in front of the 'real' write, as @D. SM suggested. It helps, but overall it is still the slowest of the pack. What I meant is: total wall time 23.9s (warmup 14.2 + real insert 9.6). What's interesting is that CPU time totals 18.1s, meaning 23.9 - 18.1 = 5.8s was spent inside the .insert_many() method waiting for TCP/IO? That sounds like a lot.
In any case, even if I use the warmup and disregard the IO wait time, the time remaining for the actual write is still likely larger than the csv write, which is a million write() calls! Apparently the csv writer does a much better job of buffering/caching. Did I get something seriously wrong here?
Another, somewhat related question: the size of the collection file (/var/lib/mongodb/collection-xxx) does not seem to grow linearly. Starting from batch one, for each million inserts the size goes up by 57MB, 15MB, 75MB, 38MB, 45MB, 68MB. Sizes of compressed random data can vary, I understand, but the variation seems quite large. Is this expected?
MongoDB clients connect to the servers in the background. If you want to benchmark inserts, a more accurate test would be something like this:
with pymongo.MongoClient() as client:
    client['warmup']['warmup'].insert_many(docs)
    db = client['test']
    coll = db['test']
    start = time()
    coll.insert_many(docs)
    end = time()
Keep in mind that insert_many performs a bulk write and there are limits on bulk write sizes; in particular, there can be only 1000 commands per bulk write. If you are sending 1 million inserts you could be looking at 2000 splits of that bulk write, all of which involve data copies. Test inserting 1000 documents at a time vs other batch sizes, as sketched below.
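For example, a rough way to run that experiment (this sketch reuses the docs list and the coll handle from the code above; the batch sizes are just illustrative):
from time import time

def timed_insert(coll, docs, batch_size):
    '''Insert docs in explicit slices of batch_size and return the elapsed time.'''
    coll.delete_many({})
    start = time()
    for i in range(0, len(docs), batch_size):
        coll.insert_many(docs[i:i + batch_size])
    return time() - start

for batch_size in (1000, 10000, 100000):
    print(batch_size, timed_insert(coll, docs, batch_size))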
Working test:
import csv
import sqlite3
import pymongo, random, time

N, M = 1000000, 5
docs = [{'_id': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}] * N

for i in range(len(docs)):
    docs[i] = dict(docs[i])
    docs[i]['_id'] = i

data = [tuple(doc.values()) for doc in docs]

with open('test.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    start = time.time()
    for i in range(N):
        writer.writerow(data[i])
    end = time.time()
    print('%f' % (end - start))

con = sqlite3.connect('test.db')
con.execute('drop table if exists five')
con.execute('create table five(a, b, c, d, e)')
start = time.time()
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
end = time.time()
print('%f' % (end - start))

with pymongo.MongoClient() as client:
    client['warmup']['warmup'].delete_many({})
    client['test']['test'].delete_many({})
    client['warmup']['warmup'].insert_many(docs)
    db = client['test']
    coll = db['test']
    start = time.time()
    coll.insert_many(docs)
    end = time.time()
    print('%f' % (end - start))
Results:
risque% python3 test.py
0.001464
0.002031
0.022351
risque% python3 test.py
0.013875
0.019704
0.153323
risque% python3 test.py
0.147391
0.236540
1.631367
risque% python3 test.py
1.492073
2.063393
16.289790
MongoDB is about 8x the sqlite time.
Is this expected? Perhaps. The comparison between sqlite and mongodb doesn't reveal much besides that sqlite is markedly faster. But, naturally, this is expected since mongodb utilizes a client/server architecture and sqlite is an in-process database, meaning:
The client has to serialize the data to send to the server
The server has to deserialize that data
The server then has to parse the request and figure out what to do
The server needs to write the data in a scalable/concurrent way (sqlite simply fails with concurrent-write errors, from what I remember of it)
The server needs to compose a response back to the client, serialize that response, write it to the network
Client needs to read the response, deserialize, check it for success
5.8s was spent inside .insert_many() method waiting for TCP/IO? That sounds a lot.
Compared to what - an in-process database that does not do any network i/o?
the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls
The physical write calls are a small part of what goes into data storage by a modern database.
Besides which, neither case involves a million of them. When you write to a file, the writes are buffered by Python's standard library before they are even sent to the kernel; you would have to call flush() after each line to actually produce a million writes. In a database the writes are similarly performed on a page-by-page basis, not for individual documents.
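A quick sketch of that buffering point, if you want to see it for yourself (the file names and row count here are arbitrary):
import csv

rows = [[i, i * 2.0] for i in range(100000)]

# Default: writes are buffered in user space, so there are far fewer write() syscalls than rows.
with open('buffered.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

# Flushing per row actually produces one write() per row -- and is much slower.
with open('unbuffered.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
        f.flush()
Running both versions under strace -c makes the difference in write() counts obvious.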
When inserting a huge pandas dataframe into sqlite via sqlalchemy and pandas to_sql with a specified chunksize, I would get memory errors.
At first I thought it was an issue with to_sql, but I tried a workaround where instead of using chunksize I used for i in range(100): df.iloc[i * 100000:(i+1) * 100000].to_sql(...) and that still resulted in an error.
It seems that, under certain conditions, there is a memory leak with repeated insertions into sqlite via sqlalchemy.
I had a hard time replicating, through a minimal example, the memory leak that occurred when converting my data, but this gets pretty close.
import string
import numpy as np
import pandas as pd
from random import randint
import random

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    dfAll = pd.DataFrame(data)
    return dfAll

dfAll = alt(randint(1000, 2000), 10000)

for i in range(330):
    print('step ', i)
    data = alt(randint(1000, 2000), 10000)
    df = pd.DataFrame(data)
    dfAll = pd.concat([ df, dfAll ])

import sqlalchemy
from sqlalchemy import create_engine
engine = sqlalchemy.create_engine('sqlite:///testtt.db')

for i in range(500):
    print('step', i)
    dfAll.iloc[(i%330)*10000:((i%330)+1)*10000].to_sql('test_table22', engine, index=False, if_exists='append')
This was run in a Google Colab CPU environment.
The database itself isn't causing the memory leak: I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping to_sql, or one to_sql call with chunksize specified.
Is there a way that this code could be run without causing an eventual increase in memory usage?
Edit:
To fully reproduce the error, run this notebook
https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS
The notebook requires you to import this folder into the main directory of your Google Drive
https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8
The notebook will also mount your Google drive, you need to give it authorization to access your Google drive. Since the data is hosted on my Google drive, importing the data should not take up any of your allocated data.
The Google Colab instance starts with about 12.72GB of RAM available.
After creating the DataFrame, theBigList, about 9.99GB of RAM have been used.
Already this is a rather uncomfortable situation to be in, since it is not unusual for
Pandas operations to require as much additional space as the DataFrame it is operating on.
So we should strive to avoid using even this much RAM if possible, and fortunately there is an easy way to do this: simply load each .npy file and store its data in the sqlite database one at a time without ever creating theBigList (see below).
However, if we use the code you posted, we can see that the RAM usage slowly increases
as chunks of theBigList are stored in the database iteratively.
theBigList DataFrame stores the strings in a NumPy array. But in the process
of transferring the strings to the sqlite database, the NumPy strings are
converted into Python strings. This takes additional memory.
Per this Theano tutorial, which discusses Python's internal memory management:
To speed-up memory allocation (and reuse) Python uses a number of lists for
small objects. Each list will contain objects of similar size: there will be a
list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object
needs to be created, either we reuse a free block in the list, or we allocate a
new one.
... The important point is that those lists never shrink.
Indeed: if an item (of size x) is deallocated (freed by lack of reference) its
location is not returned to Python’s global memory pool (and even less to the
system), but merely marked as free and added to the free list of items of size
x. The dead object’s location will be reused if another object of compatible
size is needed. If there are no dead objects available, new ones are created.
If small objects memory is never freed, then the inescapable conclusion is that,
like goldfishes, these small object lists only keep growing, never shrinking,
and that the memory footprint of your application is dominated by the largest
number of small objects allocated at any given point.
I believe this accurately describes the behavior you are seeing as this loop executes:
for i in range(0, 588):
    theBigList.iloc[i*10000:(i+1)*10000].to_sql(
        'CS_table', engine, index=False, if_exists='append')
Even though many dead objects' locations are being reused for new strings, it is
not implausible with essentially random strings such as those in theBigList that extra space will occasionally be
needed and so the memory footprint keeps growing.
The process eventually hits Google Colab's 12.72GB RAM limit and the kernel is killed with a memory error.
In this case, the easiest way to avoid large memory usage is to never instantiate the entire DataFrame -- instead, just load and process small chunks of the DataFrame one at a time:
import numpy as np
import pandas as pd
import matplotlib.cbook as mc
import sqlalchemy as SA

def load_and_store(dbpath):
    engine = SA.create_engine("sqlite:///{}".format(dbpath))
    for i in range(0, 47):
        print('step {}: {}'.format(i, mc.report_memory()))
        for letter in list('ABCDEF'):
            path = '/content/gdrive/My Drive/SummarizationTempData/CS2Part{}{:02}.npy'.format(letter, i)
            comb = np.load(path, allow_pickle=True)
            toPD = pd.DataFrame(comb).drop([0, 2, 3], 1).astype(str)
            toPD.columns = ['title', 'abstract']
            toPD = toPD.loc[toPD['abstract'] != '']
            toPD.to_sql('CS_table', engine, index=False, if_exists='append')

dbpath = '/content/gdrive/My Drive/dbfile/CSSummaries.db'
load_and_store(dbpath)
which prints
step 0: 132545
step 1: 176983
step 2: 178967
step 3: 181527
...
step 43: 190551
step 44: 190423
step 45: 190103
step 46: 190551
The last number on each line is the amount of memory consumed by the process as reported by
matplotlib.cbook.report_memory. There are a number of different measures of memory usage. On Linux, mc.report_memory() is reporting
the size of the physical pages of the core image of the process (including text, data, and stack space).
By the way, another basic trick you can use to manage memory is to use functions.
Local variables inside a function are deallocated when the function returns.
This relieves you of the burden of manually calling del and gc.collect().
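A minimal illustration of that trick (the names and sizes here are made up):
import numpy as np
import pandas as pd

def build_and_summarize(n):
    df = pd.DataFrame(np.random.rand(n, 5))   # large local object
    return df.mean().to_dict()                # only the small summary escapes

summary = build_and_summarize(1000000)        # the big DataFrame is unreachable here
print(summary)
Once build_and_summarize() returns, its local DataFrame is no longer referenced and the interpreter is free to reclaim that memory without any explicit del or gc.collect().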
So I'm writing a Python script for indexing the Bitcoin blockchain by addresses, using a leveldb database (py-leveldb), and it keeps eating more and more memory until it crashes. I've replicated the behaviour in the code example below. When I run the code it continues to use more and more memory until it's exhausted the available RAM on my system and the process is either killed or throws "std::bad_alloc".
Am I doing something wrong? I keep writing to the batch object and commit it every once in a while, but the memory usage keeps increasing even though I commit the data in the WriteBatch object. I even delete the WriteBatch object after committing it, so as far as I can see it can't be what is causing the memory leak.
Is my code using WriteBatch in a wrong way or is there a memory leak in py-leveldb?
The code requires py-leveldb to run, get it from here: https://pypi.python.org/pypi/leveldb
WARNING: RUNNING THIS CODE WILL EXHAUST YOUR MEMORY IF IT RUNS LONG ENOUGH. DO NOT RUN IT ON A CRITICAL SYSTEM. Also, it will write the data to a folder next to the script; on my system this folder contains about 1.5GB worth of database files before memory is exhausted (it ends up consuming over 3GB of RAM).
Here's the code:
import leveldb, random, string

RANDOM_DB_NAME = "db-DetmREnTrKjd"
KEYLEN = 10
VALLEN = 30
num_keys = 1000
iterations = 100000000
commit_every = 1000000

leveldb.DestroyDB(RANDOM_DB_NAME)
db = leveldb.LevelDB(RANDOM_DB_NAME)
batch = leveldb.WriteBatch()

#generate a random list of keys to be used
key_list = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(KEYLEN)) for i in range(0,num_keys)]

for k in xrange(iterations):
    #select a random key from the key list
    key_index = random.randrange(0,1000)
    key = key_list[key_index]

    try:
        prev_val = db.Get(key)
    except KeyError:
        prev_val = ""

    random_val = ''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(VALLEN))

    #write the current random value plus any value that might already be there
    batch.Put(key, prev_val + random_val)

    if k % commit_every == 0:
        print "Comitting batch %d/%d..." % (k/commit_every, iterations/commit_every)
        db.Write(batch, sync=True)
        del batch
        batch = leveldb.WriteBatch()

db.Write(batch, sync=True)
You should really try Plyvel instead. See https://plyvel.readthedocs.org/. It has way cleaner code, more features, more speed, and a lot more tests. I've used it for bulk writing to quite large databases (20+ GB) without any issues.
(Full disclosure: I'm the author.)
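For reference, here is a rough sketch of the question's write pattern using Plyvel's batch API (the key/value generation mirrors the original code; Python 3, so keys and values are bytes):
import random
import string

import plyvel

def rand_str(n):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(n))

db = plyvel.DB('plyvel-test-db', create_if_missing=True)
key_list = [rand_str(10).encode() for _ in range(1000)]

with db.write_batch() as batch:      # the batch is written when the block exits
    for _ in range(100000):
        key = random.choice(key_list)
        prev_val = db.get(key) or b''
        batch.put(key, prev_val + rand_str(30).encode())

db.close()
For very long runs you would still want to break the work into multiple write_batch blocks so the pending batch itself stays small.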
I use http://code.google.com/p/leveldb-py/
I don't have enough information to participate in a Python leveldb driver bake-off, but I love the simplicity of leveldb-py. It is a single Python file using ctypes. I've used it to store documents across about 3 million keys, storing about 10GB, and never noticed memory problems.
To your actual problem:
You may try working with the batch size.
Your code, using leveldb-py and doing a put for every key, worked fine on my system using less than 20MB of memory.
I take from here (http://ayende.com/blog/161412/reviewing-leveldb-part-iii-writebatch-isnt-what-you-think-it-is) that there are quite a few memory copies going on under the hood in leveldb.
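As a concrete example of "working with the batch size", here is a hedged sketch using the same py-leveldb calls as the question, just committing far more often so the pending WriteBatch never accumulates a million entries (Python 3 style, keys/values as bytes):
import leveldb

db = leveldb.LevelDB('db-small-batches')
batch = leveldb.WriteBatch()
commit_every = 10000                     # much smaller than the 1,000,000 in the question

for k in range(1000000):
    batch.Put(str(k).encode(), b'some value')
    if (k + 1) % commit_every == 0:
        db.Write(batch, sync=False)      # sync=True on every commit also costs a lot
        batch = leveldb.WriteBatch()     # start a fresh, empty batch

db.Write(batch, sync=True)               # flush whatever is left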
I have millions of entities of a particular type that I would like to export to a csv file. The following code writes entities in batches of 1000 to a blob while keeping the blob open and deferring the next batch to the task queue. When there are no more entities to be fetched, the blob is finalized. This seems to work for most of my local testing, but I wanted to know:
Whether I am missing any gotchas or corner cases before running it on my production data and incurring $s for datastore reads.
If the deadline is exceeded or the memory runs out while the batch is being written to the blob, this code falls back to the start of the current batch when the task runs again, which may cause a lot of duplication. Any suggestions to fix that?
def entities_to_csv(entity_type, blob_file_name='', cursor='', batch_size=1000):
    more = True
    next_curs = None
    q = entity_type.query()
    results, next_curs, more = q.fetch_page(batch_size, start_cursor=Cursor.from_websafe_string(cursor))
    if results:
        try:
            if not blob_file_name:
                blob_file_name = files.blobstore.create(mime_type='text/csv', _blob_uploaded_filename='%s.csv' % entity_type.__name__)
            rows = [e.to_dict() for e in results]
            with files.open(blob_file_name, 'a') as f:
                writer = csv.DictWriter(f, restval='', extrasaction='ignore', fieldnames=results[0].keys())
                writer.writerows(rows)
            if more:
                deferred.defer(entities_to_csv, entity_type, blob_file_name, next_curs.to_websafe_string())
            else:
                files.finalize(blob_file_name)
        except DeadlineExceededError:
            deferred.defer(entities_to_csv, entity_type, blob_file_name, cursor)
Later in the code, something like:
deferred.defer(entities_to_csv,Song)
The problem with your current solution is that your memory usage increases with every write you perform to the blobstore. The blobstore is immutable and writes all the data at once from memory.
You need to run the job on a backend that can hold all the records in memory: define a backend in your application and call defer with _target='<backend name>', as sketched below.
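A hedged sketch of what that call looks like (the backend name is hypothetical and must match an entry in your backends.yaml):
from google.appengine.ext import deferred

# 'export-backend' is a made-up name; it has to be declared in backends.yaml.
deferred.defer(entities_to_csv, Song, _target='export-backend')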
Check out this Google I/O video; it pretty much describes what you want to do using MapReduce, starting at around the 23:15 mark in the video. The code you want is at 27:19:
https://developers.google.com/events/io/sessions/gooio2012/307/