I am comparing performance of the two dbs, plus csv - data is 1 million row by 5 column float, bulk insert into sqlite/mongodb/csv, done in python.
import csv
import sqlite3
import pymongo
N, M = 1000000, 5
data = np.random.rand(N, M)
docs = [{str(j): data[i, j] for j in range(len(data[i]))} for i in range(N)]
writing to csv takes 6.7 seconds:
%%time
with open('test.csv', 'w', newline='') as file:
writer = csv.writer(file, delimiter=',')
for i in range(N):
writer.writerow(data[i])
writing to sqlite3 takes 3.6 seconds:
%%time
con = sqlite3.connect('test.db')
con.execute('create table five(a, b, c, d, e)')
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
writing to mongo takes 14.2 seconds:
%%time
with pymongo.MongoClient() as client:
start_w = time()
client['warmup']['warmup'].insert_many(docs)
start_w = time()
db = client['test']
coll = db['test']
start = time()
coll.insert_many(docs)
end = time()
I am still new to this, but is it expected that mongodb could be 4x slower sqlite, and 2x slower vs csv, in similar scenarios? It is based on mongodb v4.4 with WiredTiger engine, and python3.8.
I know mongodb excels when there is no fixed schema, but when each document has exactly the same key:value pairs, like the above example, are there methods to speed up the bulk insert?
EDIT: I tested adding a warmup in front of the 'real' write, as #D. SM suggested. It helps, but overall it is still the slowest of the pack. What I meant is, total Wall time 23.9s, (warmup 14.2 + real insert 9.6). What's interesting is that CPU times total 18.1s, meaning 23.9-18.1 = 5.8s was spent inside .insert_many() method waiting for TCP/IO? That sounds a lot.
In any case, even if I use warmup and disregard the IO wait time, the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls! Apparently the csv writer does much better job in buffering/caching. Did I get something seriously wrong here?
Another question somewhat related: the size of the collection file (/var/lib/mongodb/collection-xxx) does not seem to grow linearly, start from batch one, for each million insert, the size goes up by 57MB, 15MB, 75MB, 38MB, 45MB, 68MB. Sizes of compressed random data can vary, I understand, but the variation seems quite large. Is this expected?
MongoDB clients connect to the servers in the background. If you want to benchmark inserts, a more accurate test would be something like this:
with pymongo.MongoClient() as client:
client['warmup']['warmup'].insert_many(docs)
db = client['test']
coll = db['test']
start = time()
coll.insert_many(docs)
end = time()
Keep in mind that insert_many performs a bulk write and there are limits on bulk write sizes, in particular there can be only 1000 commands per bulk write. If you are sending 1 million inserts you could be looking at 2000 splits per bulk write which all involve data copies. Test inserting 1000 documents at a time vs other batch sizes.
Working test:
import csv
import sqlite3
import pymongo, random, time
N, M = 1000000, 5
docs = [{'_id':1,'b':2,'c':3,'d':4,'e':5}]*N
i=1
for i in range(len(docs)):
docs[i]=dict(docs[i])
docs[i]['_id'] = i
data=[tuple(doc.values())for doc in docs]
with open('test.csv', 'w', newline='') as file:
writer = csv.writer(file, delimiter=',')
start = time.time()
for i in range(N):
writer.writerow(data[i])
end = time.time()
print('%f' %( end-start))
con = sqlite3.connect('test.db')
con.execute('drop table if exists five')
con.execute('create table five(a, b, c, d, e)')
start = time.time()
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
end = time.time()
print('%f' %( end-start))
with pymongo.MongoClient() as client:
client['warmup']['warmup'].delete_many({})
client['test']['test'].delete_many({})
client['warmup']['warmup'].insert_many(docs)
db = client['test']
coll = db['test']
start = time.time()
coll.insert_many(docs)
end = time.time()
print('%f' %( end-start))
Results:
risque% python3 test.py
0.001464
0.002031
0.022351
risque% python3 test.py
0.013875
0.019704
0.153323
risque% python3 test.py
0.147391
0.236540
1.631367
risque% python3 test.py
1.492073
2.063393
16.289790
MongoDB is about 8x the sqlite time.
Is this expected? Perhaps. The comparison between sqlite and mongodb doesn't reveal much besides that sqlite is markedly faster. But, naturally, this is expected since mongodb utilizes a client/server architecture and sqlite is an in-process database, meaning:
The client has to serialize the data to send to the server
The server has to deserialize that data
The server then has to parse the request and figure out what to do
The server needs to write the data in a scalable/concurrent way (sqlite simply errors with concurrent write errors from what I remember of it)
The server needs to compose a response back to the client, serialize that response, write it to the network
Client needs to read the response, deserialize, check it for success
5.8s was spent inside .insert_many() method waiting for TCP/IO? That sounds a lot.
Compared to what - an in-process database that does not do any network i/o?
the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls
The physical write calls are a small part of what goes into data storage by a modern database.
Besides which, neither case involves a million of them. When you write to file the writes are buffered by python's standard library before they are even sent to the kernel - you have to use flush() after each line to actually produce a million writes. In a database the writes are similarly performed on a page by page basis and not for individual documents.
Related
I have the following python code which runs multiple SQL Queries in Oracle database and combines them into one dataframe.
The queries exist in a txt file and every row is a separate SQL query. The loop runs sequentially the queries. I want to cancel any SQL queries that run for more than 10 secs so as not to create an overhead in the database.
The following code doesnt actually me give the results that i want. More specifically this bit of the code really help me on my issue:
if (time.time() - start) > 10:
connection.cancel()
Full python code is the following. Probably it is an oracle function that can be called so as to cancel the query.
import pandas as pd
import cx_Oracle
import time
ip = 'XX.XX.XX.XX'
port = XXXX
svc = 'XXXXXX'
dsn_tns = cx_Oracle.makedsn(ip, port, service_name = svc)
connection = cx_Oracle.connect(user='XXXXXX'
, password='XXXXXX'
, dsn=dsn_tns
, encoding = "UTF-8"
, nencoding = "UTF-8"
)
filepath = 'C:/XXXXX'
appended_data = []
with open(filepath + 'sql_queries.txt') as fp:
line = fp.readline()
while line:
start = time.time()
df = pd.read_sql(line, con=connection)
if (time.time() - start) > 10:
connection.cancel()
print("Cancel")
appended_data.append(df)
df_combined = pd.concat(appended_data, axis=0)
line = fp.readline()
print(time.time() - start)
fp.close()
A better approach would be to spend some time tuning the queries to make them as efficient as necessary. As #Andrew points out we can't easily kill a database query from outside the database - or even from another session inside the database (it requires DBA level privileges).
Indeed, most DBAs would rather you ran a query for 20 seconds rather than attempt to kill every query which runs more than 10. Apart from anything else, having a process which polls you query to see how long it's been running for is itself a waste of database resources.
I suggest you discuss this with your DBA. You may find you're worrying about nothing.
Look at cx_Oracle 7's Connection.callTimeout setting. You'll need to be using Oracle client libraries 18+. (These will connect to Oracle DB 11.2+). The doc for the equivalent node-oracledb parameter explains the fine print behind the Oracle behavior and round trips.
I'm scraping some JSON data from a website, and need to do this ~50,000 times (all data is for distinct zip codes over a 3-year period). I timed out the program for about 1,000 calls, and the average time per call was 0.25 seconds, leaving me with about 3.5 hours of runtime for the whole range (all 50,000 calls).
How can I distribute this process across all of my cores? The core of my code is pretty much this:
with open("U:/dailyweather.txt", "r+") as f:
f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
writeData(zips, zip_weather_links, daypart)
Where writeData() looks like this:
def writeData(zipcodes, links, dayparttime):
for z in zipcodes:
for pair in links:
## do some logic ##
f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (var1, var2, var3, var4, var5,
var6, var7, var8, var9))
zips looks like this:
zips = ['55111', '56789', '68111', ...]
and zip_weather_links is just a dictionary of (URL, date) for each zip code:
zip_weather_links['55111'] = [('https://website.com/55111/data', datetime.datetime(2013, 1, 1, 0, 0, 0), ...]
How can I distribute this using Pool or multiprocessing? Or would distribution even save time?
You want to "Distribute web-scraping write-to-file to parallel processes in Python".
For a start let's look where the most time is used for Web-Scraping.
The latency for the HTTP-Requests is much higher than for Harddisks. Link: Latency comparison. Small writes to a Harddisk are significantly slower than bigger writes. SSDs have a much higher random write speed so this effect doesn't affect them so much.
Distribute the HTTP-Requests
Collect all the results
Write all the results at once to disk
some example code with IPython parallel:
from ipyparallel import Client
import requests
rc = Client()
lview = rc.load_balanced_view()
worklist = ['http://xkcd.com/614/info.0.json',
'http://xkcd.com/613/info.0.json']
#lview.parallel()
def get_webdata(w):
import requests
r = requests.get(w)
if not r.status_code == 200:
return (w, r.status_code,)
return (w, r.json(),)
#get_webdata will be called once with every element of the worklist
proc = get_webdata.map(worklist)
results = proc.get()
# results is a list with all the return values
print(results[1])
# TODO: write the results to disk
You have to start the IPython parallel workers first:
(py35)River:~ rene$ ipcluster start -n 20
I have to import csv data from the URL which is giving me the data in the chunks of stream to a mongoDB server. I have tried following way to import the data in python :
response = urllib2.urlopen(url)
cr = csv.DictReader(response)
if seperator is not "":
cr = csv.DictReader(response,delimiter=seperator, quoting=csv.QUOTE_ALL)
cr.next()
ct = 0
#1 make the data objects
rows = list(cr)
totalrows = len(rows)
for i,row in enumerate(rows):
# I am creating the mongo documents here
# once the documents are ready insert them into respective mongo collections
since I have a large number of urls and each url having almost 200 MB data so I tried multiprocessing to do so but still my script is taking too long to execute( takes 5-7 hours to import 0.6million with only 5-10% CPU usage and ~70% memory usage). My server is 4core CPU with 8GB RAM.
Please suggest me to achieve the best performance with python.
I'm trying to write a large HDF5 files into MongoDB. I'm following the example in this tutorial: http://api.mongodb.org/python/current/examples/bulk.html. I have a generator that loops through each row of the HDF file and yields a dictionary:
def gen():
for file in files:
data = load_file(file)
for row in data:
ob = dict()
ob['a'] = int(row['a'])
ob['b'] = int(row['b'])
ob['c'] = int(row['c'])
ob['d'] = row['d'].tolist()
ob['e'] = row['e'].tolist()
ob['f'] = row['f'].tolist()
ob['g'] = row['g'].tolist()
yield ob
def main():
data = gen()
db = pymongo.MongoClient().data_db
db.data.insert(data)
This works fine but as time goes on, the Python process takes up more and more RAM until it reaches 10GB and threatens to use up all memory. I think PyMongo is buffering this data in memory and as it waits to write it to the database. Is there a way I can limit how big this buffer is instead of letting it grow uncontrollably? It's strange how the default settings would cause me to run out of RAM.
PyMongo is designed to work the way you want: it iterates your generator until it has one batch of data (16 or 32MB, depending on MongoDB version). Someone contributed this ability to PyMongo last year. What MongoDB and PyMongo versions are you using?
I have a piece of code that queries a server that returns a big json object(elasticsearch, BTW),
It takes a lot of time to read the results. parsing the json object is very fast.
tic = time.time()
req_resp = urllib2.urlopen(req, timeout = 60)
toc=time.time()
a = toc-tic
tic = time.time()
json_str = req_resp.read()
toc=time.time()
b = toc-tic
tic = time.time()
resp = json.loads(json_str)
toc=time.time()
c = toc-tic
print 'Fetch %.1f Process %.1f, load Json %.1f' %(a,b,c)
Output:
Fetch 0.5 Process 3.5, load Json 0.0
It seems strange that this takes so much time, while loading the json is fast. What am I doing wrong? any way to do this faster?
FYI this is a query for 1000 documents in elasticsearch, returning a few string fields which are several words long.
I am using python 2.7
The socket module relies on _socket which is written in C++ (I think?). Presumably there is an overhead transferring large amounts of data between C++ and Python. I also get an oddly large overhead with .read() thou I have not tried it with huge data sets so it wasn't bigger than the fetch time. I'm not sure there is any thing you can do apart from switching to a different language. I will do some more testing and get back to you if I find any thing else.