I'm scraping some JSON data from a website, and I need to do this ~50,000 times (all data is for distinct zip codes over a 3-year period). I timed the program over about 1,000 calls; the average time per call was 0.25 seconds, which works out to about 3.5 hours of runtime for the whole range (all 50,000 calls).
How can I distribute this process across all of my cores? The core of my code is pretty much this:
with open("U:/dailyweather.txt", "r+") as f:
    f.write("var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7\tvar8\tvar9\n")
    writeData(zips, zip_weather_links, daypart)
Where writeData() looks like this:
def writeData(zipcodes, links, dayparttime):
    for z in zipcodes:
        for pair in links:
            ## do some logic ##
            f.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (var1, var2, var3, var4, var5,
                                                              var6, var7, var8, var9))
zips looks like this:
zips = ['55111', '56789', '68111', ...]
and zip_weather_links is just a dictionary mapping each zip code to a list of (URL, date) pairs:
zip_weather_links['55111'] = [('https://website.com/55111/data', datetime.datetime(2013, 1, 1, 0, 0, 0)), ...]
How can I distribute this using Pool or multiprocessing? Or would distribution even save time?
You want to "Distribute web-scraping write-to-file to parallel processes in Python".
For a start, let's look at where most of the time goes when web scraping.
The latency of HTTP requests is much higher than that of hard disks (see: Latency comparison). Small writes to a hard disk are significantly slower than bigger writes; SSDs have a much higher random-write speed, so this effect hurts them much less. The approach is therefore:
Distribute the HTTP-Requests
Collect all the results
Write all the results at once to disk
Some example code with IPython parallel:

from ipyparallel import Client
import requests

rc = Client()
lview = rc.load_balanced_view()

worklist = ['http://xkcd.com/614/info.0.json',
            'http://xkcd.com/613/info.0.json']

@lview.parallel()
def get_webdata(w):
    import requests
    r = requests.get(w)
    if not r.status_code == 200:
        return (w, r.status_code,)
    return (w, r.json(),)

# get_webdata will be called once with every element of the worklist
proc = get_webdata.map(worklist)
results = proc.get()
# results is a list with all the return values
print(results[1])
# TODO: write the results to disk
You have to start the IPython parallel workers first:
(py35)River:~ rene$ ipcluster start -n 20
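One way to finish the TODO above is to collect everything first and then write it out in a single pass. A minimal sketch, assuming the results list from the code above and a hypothetical output file name:

with open("results.txt", "w") as f:
    for url, payload in results:
        # each entry is either (url, json) or (url, status_code), as returned by get_webdata
        f.write("%s\t%s\n" % (url, payload))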
I am comparing the performance of the two databases, plus CSV: the data is 1 million rows by 5 columns of floats, bulk-inserted into sqlite/mongodb/csv from Python.
import csv
import sqlite3

import numpy as np
import pymongo

N, M = 1000000, 5
data = np.random.rand(N, M)
docs = [{str(j): data[i, j] for j in range(len(data[i]))} for i in range(N)]
writing to csv takes 6.7 seconds:
%%time
with open('test.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for i in range(N):
        writer.writerow(data[i])
writing to sqlite3 takes 3.6 seconds:
%%time
con = sqlite3.connect('test.db')
con.execute('create table five(a, b, c, d, e)')
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
writing to mongo takes 14.2 seconds:
%%time
from time import time

with pymongo.MongoClient() as client:
    start_w = time()
    client['warmup']['warmup'].insert_many(docs)
    end_w = time()  # end of warmup
    db = client['test']
    coll = db['test']
    start = time()
    coll.insert_many(docs)
    end = time()
I am still new to this, but is it expected that mongodb could be 4x slower than sqlite, and 2x slower than csv, in similar scenarios? This is based on mongodb v4.4 with the WiredTiger engine, and Python 3.8.
I know mongodb excels when there is no fixed schema, but when each document has exactly the same key:value pairs, as in the example above, are there methods to speed up the bulk insert?
EDIT: I tested adding a warmup in front of the 'real' write, as @D. SM suggested. It helps, but overall it is still the slowest of the pack. What I mean is: total wall time is 23.9 s (warmup 14.2 + real insert 9.6). What's interesting is that CPU time totals 18.1 s, meaning 23.9 - 18.1 = 5.8 s was spent inside the .insert_many() method waiting for TCP/IO? That sounds like a lot.
In any case, even if I use the warmup and disregard the IO wait time, the time left for the actual write is still likely larger than the csv write, which is a million write() calls! Apparently the csv writer does a much better job of buffering/caching. Did I get something seriously wrong here?
Another, somewhat related question: the size of the collection file (/var/lib/mongodb/collection-xxx) does not seem to grow linearly; starting from batch one, for each million inserts the size goes up by 57MB, 15MB, 75MB, 38MB, 45MB, 68MB. Sizes of compressed random data can vary, I understand, but the variation seems quite large. Is this expected?
MongoDB clients connect to the servers in the background. If you want to benchmark inserts, a more accurate test would be something like this:
with pymongo.MongoClient() as client:
    client['warmup']['warmup'].insert_many(docs)
    db = client['test']
    coll = db['test']
    start = time()
    coll.insert_many(docs)
    end = time()
Keep in mind that insert_many performs a bulk write and there are limits on bulk write sizes; in particular, there can be only 1000 commands per bulk write. If you are sending 1 million inserts you could be looking at 2000 splits of the bulk write, which all involve data copies. Test inserting 1000 documents at a time vs. other batch sizes, as in the sketch below.
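For example, a hedged sketch of timing explicit batches (the batch size is arbitrary; coll and docs are the objects from the surrounding code):

batch_size = 1000
start = time()
for i in range(0, len(docs), batch_size):
    # one insert_many call per fixed-size slice of the document list
    coll.insert_many(docs[i:i + batch_size], ordered=False)
end = time()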
Working test:
import csv
import sqlite3
import time

import pymongo

N, M = 1000000, 5

docs = [{'_id': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}] * N
for i in range(len(docs)):
    docs[i] = dict(docs[i])
    docs[i]['_id'] = i
data = [tuple(doc.values()) for doc in docs]

with open('test.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    start = time.time()
    for i in range(N):
        writer.writerow(data[i])
    end = time.time()
print('%f' % (end - start))

con = sqlite3.connect('test.db')
con.execute('drop table if exists five')
con.execute('create table five(a, b, c, d, e)')
start = time.time()
con.executemany('insert into five(a, b, c, d, e) values (?,?,?,?,?)', data)
end = time.time()
print('%f' % (end - start))

with pymongo.MongoClient() as client:
    client['warmup']['warmup'].delete_many({})
    client['test']['test'].delete_many({})
    client['warmup']['warmup'].insert_many(docs)
    db = client['test']
    coll = db['test']
    start = time.time()
    coll.insert_many(docs)
    end = time.time()
print('%f' % (end - start))
Results:
risque% python3 test.py
0.001464
0.002031
0.022351
risque% python3 test.py
0.013875
0.019704
0.153323
risque% python3 test.py
0.147391
0.236540
1.631367
risque% python3 test.py
1.492073
2.063393
16.289790
MongoDB is about 8x the sqlite time.
Is this expected? Perhaps. The comparison between sqlite and mongodb doesn't reveal much besides that sqlite is markedly faster. But, naturally, this is expected since mongodb utilizes a client/server architecture and sqlite is an in-process database, meaning:
The client has to serialize the data to send to the server
The server has to deserialize that data
The server then has to parse the request and figure out what to do
The server needs to write the data in a scalable/concurrent way (sqlite simply fails with concurrent-write errors, from what I remember of it)
The server needs to compose a response back to the client, serialize that response, and write it to the network
The client needs to read the response, deserialize it, and check it for success
5.8 s was spent inside the .insert_many() method waiting for TCP/IO? That sounds like a lot.
Compared to what - an in-process database that does not do any network i/o?
the remaining time left for the actual write is still likely larger than csv write, which is a million write() calls
The physical write calls are a small part of what goes into data storage by a modern database.
Besides which, neither case involves a million of them. When you write to a file, the writes are buffered by Python's standard library before they are even sent to the kernel - you would have to call flush() after each line to actually produce a million writes. In a database, writes are similarly performed on a page-by-page basis, not for individual documents.
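To make the buffering point concrete, here is a small illustrative sketch (the file name is made up): the csv writer's rows land in Python's in-memory file buffer and only reach the OS in large chunks, unless a flush is forced per row.

import csv

with open('buffering_demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for i in range(1000000):
        writer.writerow((i, i * 2))
        # f.flush()  # uncommenting this forces a write to the OS per row and is far slower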
I have 1080 .txt files, each of which contains over 100k rows of values in three columns. I have to compute the average of the first column in each of these .txt files.
Any method that uses looping is proving to be too slow, since numpy.loadtxt only loads one file at a time.
The kicker is that I have 38 of these folders on which I need to perform this operation, so 38*1080 files in total. Using the time module to measure each numpy.loadtxt call gives me around 1.7 seconds per file, so the total time to run over all folders is over 21 hours, which seems a bit too much.
So this has me wondering if there is a way to perform multiple operations at once, by opening multiple txt files and averaging their first columns in parallel, while also storing the averages in the order corresponding to the txt files, since the order is important.
Since I am a beginner, I'm not sure if this even is the fastest way. Thanks in advance.
import numpy as np
import glob
import os

i = 0
while i < 39:
    source_directory = "something/" + str(i)  # go to the specific numbered folder
    hw_array = sorted(glob.glob(source_directory + "/data_*.txt"))  # read paths of 1080 txt files
    velocity_array = np.zeros((30, 36, 3))
    for j, probe in enumerate(hw_array):
        x = 35 - int((j - 0.0001) / 30)  # position of the probe where velocities are measured
        y = (30 - int(round(j % 30))) % 30
        velocity_column = np.loadtxt(probe, usecols=(0))  # the step that takes the most time
        average_array = np.mean(velocity_column, axis=0)
        velocity_array[y, x, 0] = average_array
        velocity_array[y, x, 1] = y * (2 / 29)
        velocity_array[y, x, 2] = x * 0.5
    np.save(r"C:/Users/md101/Desktop/AE2/Project2/raw/" + "R29" + "/data" + "R29", velocity_array)  # save velocity array for use in later analysis
    i += 1
Python has pretty slow I/O, and most of your time is spent talking to the operating system and paying the other costs associated with opening files.
Threading in Python is strange and only provides an improvement in certain situations, but here is why it is good for your case. The way threading works in Python is that a thread only does work while it holds the GIL (the global interpreter lock; read about it). While it is waiting for something, such as I/O, it hands the GIL to another thread. So one thread can operate on a file (averaging the first column) while it holds the GIL, and while another file is being opened the GIL is passed on so a different thread can do its work.
It's completely possible to write a function that loads the files from one directory, spawn one process per directory with multiprocessing, and get it done in close to 1/39th of the time it was taking. Or, don't parallelize by directory but by file: queue up the work and have workers read from that work queue.
Some pseudocode:
import multiprocessing
import os

pool = multiprocessing.Pool()
workers = []
for d in os.listdir("."):
    for f in os.listdir(d):
        workers.append(pool.apply_async(counter_function_for_file, (os.path.join(d, f),)))

s = sum(worker.get() for worker in workers)
...and put your code for reading from the file in that counter_function_for_file(filename) function.
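A fuller, hedged sketch of the same idea, filling counter_function_for_file in with the averaging step from the question (the folder layout and file pattern are assumptions taken from the posted code):

import glob
import multiprocessing
import os

import numpy as np

def counter_function_for_file(filename):
    # load only the first column and return its average
    return np.loadtxt(filename, usecols=(0,)).mean()

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        for i in range(39):
            files = sorted(glob.glob(os.path.join("something", str(i), "data_*.txt")))
            # pool.map preserves order, so the averages line up with the sorted file list
            averages = pool.map(counter_function_for_file, files)
            # ...fill velocity_array from averages here, as in the original loop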
I'm iterating over M dataframes, each containing a column with N URLs. For each URL, I extract paragraph text, then conduct standard cleaning for textual analysis before calculating "sentiment" scores.
Is it more efficient for me to:
Continue as it is (compute scores in the URL for-loop itself)
Extract all of the text from URLs first, and then separately iterate over the list / column of text ?
Or does it not make any difference?
Currently I'm running the calculations within the loop itself. Each DF has about 15,000 - 20,000 URLs, so it's taking an insane amount of time!
# DFs are stored on a website
# I extract links to each .csv file and store it as a list in "df_links"

for link in df_links:
    cleaned_articles = []
    df = pd.read_csv(link, sep="\t", header=None)
    # Conduct df cleaning
    # URLs for articles to scrape are stored in 1 column, which I iterate over as...
    for url in df['article_url']:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        para_text = [text.get_text() for text in soup.findAll('p')]
        text = " ".join(para_text)
        words = text.split()
        if len(words) > 500:
            # Conduct text cleaning & scores computations
            # Cleaned text stored as a variable "clean_text"
            cleaned_articles.append(clean_text)
    df['article_text'] = cleaned_articles
    df.to_csv('file_name.csv')
To answer the question directly: it shouldn't make much of a difference whether you download all the data first and then analyse it, or do both inside the loop. You'd just be rearranging the order of a set of tasks that take effectively the same total time.
The only difference would be if the text corpora are rather large, in which case read/write time to disk starts to play a part, so it could be a little faster to run the analytics all in memory. But this still isn't going to really solve your problem.
May I be so bold as to reinterpret your question as: "My analysis is taking too long, help me speed it up!"
This sounds like a perfect use case for multiprocessing! Since this sounds like a data science project, you'll need to pip install multiprocess if you're using an IPython notebook (like Jupyter), or import multiprocessing if you're using a plain Python script. This is because of the way Python passes information between processes; don't worry though, the APIs for multiprocess and multiprocessing are identical!
A basic and easy way to speed up your analysis is to put your for loop inside a function. That function can then be passed to a multiprocessing map, which can spawn multiple processes and do the analysis on several URLs at once:
from multiprocess import Pool
import numpy as np
import os
import pandas as pd

num_cpus = os.cpu_count()

def analytics_function(*args):
    # Your full function, including fetching the data, goes here; it accepts an array of links
    return something

df_links_split = np.array_split(df_links, num_cpus * 2)  # I normally just use 2 as a rule of thumb
pool = Pool(num_cpus * 2)  # Start a pool with num_cpus * 2 processes
list_of_returned = pool.map(analytics_function, df_links_split)
This will spin up a load of processes and utilise your full CPU. You won't be able to do much else on your computer, and you'll need to keep your resource monitor open to check you're not maxing out your memory and slowing down/crashing the process. But it should significantly speed up your analysis, roughly by a factor of num_cpus * 2!
Extracting all of the texts and then processing them, or extracting one text and processing it before extracting the next, won't make any difference.
Doing ABABAB takes as much time as doing AAABBB.
You might however be interested in using threads or asynchronous requests to fetch all of the data in parallel.
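For example, a hedged sketch using a thread pool to fetch the pages concurrently (the article_url column comes from the question; the worker count and timeout are arbitrary):

from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    return requests.get(url, timeout=30).text

with ThreadPoolExecutor(max_workers=20) as executor:
    # executor.map preserves order, so the pages line up with df['article_url']
    html_pages = list(executor.map(fetch, df['article_url']))

# html_pages can now be parsed and scored in the existing loop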
I am witnessing some strange runtime issues with PyCharm that are explained below. The code was run on a machine with 20 cores and 256 GB of RAM, and there is plenty of memory to spare. I am not showing any of the real functions as it is a reasonably large project, but I am more than happy to add details upon request.
In short, I have a .py file project with the following structure:
import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

def collect_results(result_list):
    return pd.DataFrame({'start_time': result_list[0::4],
                         'arrival_time': result_list[1::4],
                         'tour_id': result_list[2::4],
                         'trip_id': result_list[3::4]})

if __name__ == '__main__':
    # Run the serial code
    st = starttimes.StartTimesCreate(prng)
    temp_df, two_trips_df, time_dist_arr = st.run()

    # Prepare the dataframe to sample start times. Create groups from the input dataframe
    temp_df1 = st.prepare_two_trips_more_df(temp_df, two_trips_df)
    validation.logger.info("Dataframe prepared for multiprocessing")
    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):  ### problem lies here in runtimes
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" % len(grp_list))

    ################ PARALLEL CODE BELOW #################
The for loop runs on a dataframe of 10.5 million rows and 18 columns. In its current form it takes about 25 minutes to create the list of groups (2.8M groups). The groups are then fed to a multiprocessing pool, the code for which is not shown.
The 25 minutes is quite long, because the following test run takes only 7 minutes. Essentially, I saved the temp_df1 dataframe to a CSV, then simply read the pre-saved file back in and ran the same for loop as before.
import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

def collect_results(result_list):
    return pd.DataFrame({'start_time': result_list[0::4],
                         'arrival_time': result_list[1::4],
                         'tour_id': result_list[2::4],
                         'trip_id': result_list[3::4]})

if __name__ == '__main__':
    # Run the serial code
    st = starttimes.StartTimesCreate(prng)
    temp_df1 = pd.read_csv(r"c:\\...\\temp_df1.csv")
    time_dist = pd.read_csv(r"c:\\...\\start_time_distribution_treso_1.csv")
    time_dist_arr = np.array(time_dist.to_records())

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" % len(grp_list))
QUESTION
So, what is it that causes the code to run 3 times faster when I simply read the file in from CSV, versus when the dataframe is created by a function further upstream?
Thanks in advance and please let me know how I can further clarify.
I am answering my own question, as I stumbled upon the answer while doing a bunch of tests; thankfully, when I googled the solution, someone else had had the same issue. The explanation of why having categorical columns is a bad idea when doing groupby operations can be found at that link, so I am not going to post it here. Thanks.
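For readers hitting the same symptom, a hedged sketch of the usual workarounds (the tour_id column name is taken from the question; which option the linked answer recommends is not shown here):

# Option 1: keep the categorical dtype but only build groups that actually occur in the data
grp_list = [group for _, group in temp_df1.groupby('tour_id', observed=True)]

# Option 2: drop the categorical dtype on the grouping key before grouping
temp_df1['tour_id'] = temp_df1['tour_id'].astype(str)
grp_list = [group for _, group in temp_df1.groupby('tour_id')]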
I'm trying to write large HDF5 files into MongoDB. I'm following the example in this tutorial: http://api.mongodb.org/python/current/examples/bulk.html. I have a generator that loops through each row of the HDF file and yields a dictionary:
def gen():
    for file in files:
        data = load_file(file)
        for row in data:
            ob = dict()
            ob['a'] = int(row['a'])
            ob['b'] = int(row['b'])
            ob['c'] = int(row['c'])
            ob['d'] = row['d'].tolist()
            ob['e'] = row['e'].tolist()
            ob['f'] = row['f'].tolist()
            ob['g'] = row['g'].tolist()
            yield ob

def main():
    data = gen()
    db = pymongo.MongoClient().data_db
    db.data.insert(data)
This works fine, but as time goes on the Python process takes up more and more RAM, until it reaches 10GB and threatens to use up all memory. I think PyMongo is buffering this data in memory while it waits to write it to the database. Is there a way I can limit how big this buffer gets, instead of letting it grow uncontrollably? It's strange that the default settings would cause me to run out of RAM.
PyMongo is designed to work the way you want: it iterates your generator until it has one batch of data (16 or 32MB, depending on MongoDB version). Someone contributed this ability to PyMongo last year. What MongoDB and PyMongo versions are you using?
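If you would rather cap memory use explicitly, regardless of driver version, a hedged sketch of batching the generator yourself (the batch size is arbitrary):

from itertools import islice

def insert_in_batches(collection, doc_iter, batch_size=1000):
    # pull at most batch_size documents from the generator per insert_many call
    while True:
        batch = list(islice(doc_iter, batch_size))
        if not batch:
            break
        collection.insert_many(batch)

insert_in_batches(db.data, gen())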