I am comparing performance of the two dbs, plus csv - data is 1 million row by 5 column float, bulk insert into sqlite/mongodb/csv, done in python.
import csv
import sqlite3
import pymongo
N, M = 1000000, 5
data = np.random.rand(N, M)
docs = [{str(j): data[i, j] for j in range(len(data[i]))} for i in range(N)]
writing to csv takes 6.7 seconds:
%%time
with open('test.csv', 'w', newline='') as file:
writer = csv.writer(file, delimiter=',')
for i in range(N):
writer.writerow(data[i])
writing to sqlite3 takes 3.6 seconds:
%%time
con = sqlite3.connect('test.db')
con.execute('create table five(a, b, c, d, e)')
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
writing to mongo takes 14.2 seconds:
%%time
with pymongo.MongoClient() as client:
start_w = time()
client['warmup']['warmup'].insert_many(docs)
start_w = time()
db = client['test']
coll = db['test']
start = time()
coll.insert_many(docs)
end = time()
I am still new to this, but is it expected that mongodb could be 4x slower sqlite, and 2x slower vs csv, in similar scenarios? It is based on mongodb v4.4 with WiredTiger engine, and python3.8.
I know mongodb excels when there is no fixed schema, but when each document has exactly the same key:value pairs, like the above example, are there methods to speed up the bulk insert?
EDIT: I tested adding a warmup in front of the 'real' write, as #D. SM suggested. It helps, but overall it is still the slowest of the pack. What I meant is, total Wall time 23.9s, (warmup 14.2 + real insert 9.6). What's interesting is that CPU times total 18.1s, meaning 23.9-18.1 = 5.8s was spent inside .insert_many() method waiting for TCP/IO? That sounds a lot.
In any case, even if I use warmup and disregard the IO wait time, the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls! Apparently the csv writer does much better job in buffering/caching. Did I get something seriously wrong here?
Another question somewhat related: the size of the collection file (/var/lib/mongodb/collection-xxx) does not seem to grow linearly, start from batch one, for each million insert, the size goes up by 57MB, 15MB, 75MB, 38MB, 45MB, 68MB. Sizes of compressed random data can vary, I understand, but the variation seems quite large. Is this expected?
MongoDB clients connect to the servers in the background. If you want to benchmark inserts, a more accurate test would be something like this:
with pymongo.MongoClient() as client:
client['warmup']['warmup'].insert_many(docs)
db = client['test']
coll = db['test']
start = time()
coll.insert_many(docs)
end = time()
Keep in mind that insert_many performs a bulk write and there are limits on bulk write sizes, in particular there can be only 1000 commands per bulk write. If you are sending 1 million inserts you could be looking at 2000 splits per bulk write which all involve data copies. Test inserting 1000 documents at a time vs other batch sizes.
Working test:
import csv
import sqlite3
import pymongo, random, time
N, M = 1000000, 5
docs = [{'_id':1,'b':2,'c':3,'d':4,'e':5}]*N
i=1
for i in range(len(docs)):
docs[i]=dict(docs[i])
docs[i]['_id'] = i
data=[tuple(doc.values())for doc in docs]
with open('test.csv', 'w', newline='') as file:
writer = csv.writer(file, delimiter=',')
start = time.time()
for i in range(N):
writer.writerow(data[i])
end = time.time()
print('%f' %( end-start))
con = sqlite3.connect('test.db')
con.execute('drop table if exists five')
con.execute('create table five(a, b, c, d, e)')
start = time.time()
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
end = time.time()
print('%f' %( end-start))
with pymongo.MongoClient() as client:
client['warmup']['warmup'].delete_many({})
client['test']['test'].delete_many({})
client['warmup']['warmup'].insert_many(docs)
db = client['test']
coll = db['test']
start = time.time()
coll.insert_many(docs)
end = time.time()
print('%f' %( end-start))
Results:
risque% python3 test.py
0.001464
0.002031
0.022351
risque% python3 test.py
0.013875
0.019704
0.153323
risque% python3 test.py
0.147391
0.236540
1.631367
risque% python3 test.py
1.492073
2.063393
16.289790
MongoDB is about 8x the sqlite time.
Is this expected? Perhaps. The comparison between sqlite and mongodb doesn't reveal much besides that sqlite is markedly faster. But, naturally, this is expected since mongodb utilizes a client/server architecture and sqlite is an in-process database, meaning:
The client has to serialize the data to send to the server
The server has to deserialize that data
The server then has to parse the request and figure out what to do
The server needs to write the data in a scalable/concurrent way (sqlite simply errors with concurrent write errors from what I remember of it)
The server needs to compose a response back to the client, serialize that response, write it to the network
Client needs to read the response, deserialize, check it for success
5.8s was spent inside .insert_many() method waiting for TCP/IO? That sounds a lot.
Compared to what - an in-process database that does not do any network i/o?
the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls
The physical write calls are a small part of what goes into data storage by a modern database.
Besides which, neither case involves a million of them. When you write to file the writes are buffered by python's standard library before they are even sent to the kernel - you have to use flush() after each line to actually produce a million writes. In a database the writes are similarly performed on a page by page basis and not for individual documents.
I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract paragraph text, then conduct standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
Continue as it is (compute scores in the URL for-loop itself)
Extract all of the text from URLs first, and then separately iterate over the list / column of text ?
Or does it not make any difference?
Currently running calculations within the loop itself. Each DF has about 15,000 - 20,000 URLs so it's taking an insane amount of time too!
# DFs are stored on a website
# I extract links to each .csv file and store it as a list in "df_links"
for link in df_links:
cleaned_articles = []
df = pd.read_csv(link, sep="\t", header=None)
# Conduct df cleaning
# URLs for articles to scrape are stored in 1 column, which I iterate over as...
for url in df['article_url']:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
para_text = [text.get_text() for text in soup.findAll('p')]
text = " ".join(para_text)
words = text.split()
if len(words) > 500:
# Conduct Text Cleaning & Scores Computations
# Cleaned text stored as a variable "clean_text"
cleaned_articles.append(clean_text)
df['article_text'] = cleaned_articles
df.to_csv('file_name.csv')
To answer the question, it shouldn't make too much of a difference if you download the data and then apply analysis to it. You'd just be re arranging the order in which you do a set of tasks that would effectively take the same time.
The only difference may be if the text corpus' are rather large and then read write time to disk will start to play a part so could be a little faster running the analytics all in memory. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project you'll need to pip install multiprocess if you're using a ipython notebook (like Jupyter) or import multiprocessing if using a python script. This is because of the way python passes information between processes, don't worry though the API's for both multiprocess and multiprocessing are identical!
A basic and easy way to speed up your analysis is to indent you for loop and put it in a function. That function can then be passed to a multiprocessing map which can spawn multiple processes and do the analysis on several urls all at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd
num_cpus = os.cpu_count()
def analytics_function(*args):
#Your full function including fetching data goes here and accepts a array of links
return something
df_links_split = np.array_split(df_links, num_cpus * 2) #I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2) #Start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full cpu. You'll not be able to do much else on your computer, and you'll need to have your resource monitor open to check you're not maxing our your memory and slowing down/crashing the process. But it should significantly speed up your analysis by roughly a factor of num_cpus * 2!!
Extracting all of the texts then processing all of it or extracting one text then processing it before extracting the next wont do any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might however be interested in using threads or asynchronous requests to fetch all of the data in parallel.
I need to shorten multiple urls using the Google Shortener API. Since each shortening process is not interdependent I decided to use multiprocessing library in Python.
raw_data is a data frame which contains my long url. Api_Key is a list which contains api key from multiple google accounts(as i dont want to hit the limit of api usage for a day)
Here is the underlying code.
import pandas as pd
import time
from multiprocessing import Pool
import math
import requests
raw_data["Shortened"] = "xx"
Api_Key_List1 = ['Key1',
'Key2',
'Key3']
raw_data = raw_data[0:3]
LongUrl = raw_data['SMS Click traker'].tolist()
args = [[LongUrl[index], Api_Key_List1[index]] for index,value in enumerate(LongUrl)]
print("xxxx")
def goo_shorten_url(url,key):
post_url = 'https://www.googleapis.com/urlshortener/v1/url?key=%s'%key
payload = {'longUrl': url}
headers = {'content-type': 'application/json'}
time.sleep(2)
r = requests.post(post_url, data=json.dumps(payload), headers=headers)
return(json.loads(r.text)["id"])
def helper(args):
return goo_shorten_url(*args)
if __name__ == '__main__':
p = Pool(processes = 3)
data = p.map(helper,args)
print("main")
raw_data["Shortened"]=data
p.close()
print(raw_data)
Using this code i am successfully shorten multiple urls in one go thus saving a lot of time. Here are some questions which i am having difficulty in finding answers to:
What is the optimal number of sub-processes that I shall be running? (I am using a simple quadcore processor 8 gb ram laptop, Windows 7)
Sometimes due to some issues the api does not return a shortened url. The code is stuck forever. This was my experience when i was not using multiprocess. What happens to the process in such case?
Assuming the subprocess will get stuck for a long set of urls (70 000). How do I kill it and also ensure the code safely returns me all remaining shortened urls with proper matching in the data frame?
How can I identify those urls which were not Shortened?
I am a newbie to python programming, bear with me. This will help me better understand the inner working of multiprocessing.
I'm trying to write a large HDF5 files into MongoDB. I'm following the example in this tutorial: http://api.mongodb.org/python/current/examples/bulk.html. I have a generator that loops through each row of the HDF file and yields a dictionary:
def gen():
for file in files:
data = load_file(file)
for row in data:
ob = dict()
ob['a'] = int(row['a'])
ob['b'] = int(row['b'])
ob['c'] = int(row['c'])
ob['d'] = row['d'].tolist()
ob['e'] = row['e'].tolist()
ob['f'] = row['f'].tolist()
ob['g'] = row['g'].tolist()
yield ob
def main():
data = gen()
db = pymongo.MongoClient().data_db
db.data.insert(data)
This works fine but as time goes on, the Python process takes up more and more RAM until it reaches 10GB and threatens to use up all memory. I think PyMongo is buffering this data in memory and as it waits to write it to the database. Is there a way I can limit how big this buffer is instead of letting it grow uncontrollably? It's strange how the default settings would cause me to run out of RAM.
PyMongo is designed to work the way you want: it iterates your generator until it has one batch of data (16 or 32MB, depending on MongoDB version). Someone contributed this ability to PyMongo last year. What MongoDB and PyMongo versions are you using?
I have a large ASCII file (~100GB) which consists of roughly 1.000.000 lines of known formatted numbers which I try to process with python. The file is too large to read in completely into memory, so I decided to process the file line by line:
fp = open(file_name)
for count,line in enumerate(fp):
data = np.array(line.split(),dtype=np.float)
#do stuff
fp.close()
It turns out, that I spend most of the run time of my program in the data = line. Are there any ways to speed up that line? Also, the execution speed seem much slower than what I could get from an native FORTRAN program with formated read (see this question, I've implemented a FORTRAN string processor and used it with f2py, but the run time was only comparable with the data = line. I guess the I/O handling and type conversions between Python/FORTRAN killed what I gained from FORTRAN)
Since I know the formatting, shouldn't there be a better and faster way as to use split()? Something like:
data = readf(line,'(1000F20.10)')
I tried the fortranformat package, which worked well, but in my case was three times slower than thee split() approach.
P.S. As suggested by ExP and root I tried the np.fromstring and made this quick and dirtry benchmark:
t1 = time.time()
for i in range(500):
data=np.array(line.split(),dtype=np.float)
t2 = time.time()
print (t2-t1)/500
print data.shape
print data[0]
0.00160977363586
(9002,)
0.0015162509
and:
t1 = time.time()
for i in range(500):
data = np.fromstring(line,sep=' ',dtype=np.float,count=9002)
t2 = time.time()
print (t2-t1)/500
print data.shape
print data[0]
0.00159792804718
(9002,)
0.0015162509
so fromstring is in fact slightly slower in my case.
Have you tried numpyp.fromstring?
np.fromstring(line, dtype=np.float, sep=" ")
The np.genfromtxt function is a speed champion if you can get it to match you input format.
If not, then you may already be using the fastest method. Your line-by-line split-into-array approach exactly matches the SciPy Cookbook examples.