I ran into a pickle (literally) while parallelizing the following Python code and could really use some help.
First of all, the input is a CSV file consisting of a list of website links that I need to scrape with the function scrape_code(). The original code is as follows and runs perfectly:
with open('C:\\links.csv','r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

m = []
for i in inputlist:
    m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))  # remove the quotes around the link strings, otherwise it results in URLError
print(m)
I then tried to parallelize this code using joblib as follows:
from joblib import Parallel, delayed
import multiprocessing

with open('C:\\links.csv','r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))) for i in inputlist)
However, this would result in a weird error:
File "C:\Users\...\joblib\pool.py", line 371, in send
CustomizablePickler(buffer, self._reducers).dump(obj)
AttributeError: Can't pickle local object 'delayed.<locals>.delayed_function'
Any idea what I did wrong here? If I try to put the append in a separate function like below then the error would go away, but the execution would then freeze and hang indefinitely:
def process(k):
    a = []
    a.append(scrape_code(re.sub("\'|\[|\]", '', str(k))))
    return a

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(process)(i) for i in inputlist)
The input list has tens of thousands of pages, so parallel processing would be a huge benefit.
If you really need it in separate processes, the easiest way is to just create a process pool and let it deal with distributing the links to your function, e.g.:
import csv
from multiprocessing import Pool

if __name__ == "__main__":  # multiprocessing guard
    with open("c:\\links.csv", "r", newline="") as f:  # open the CSV
        reader = csv.reader(f)  # create a reader
        links = [r[0] for r in reader]  # collect only the first column

    with Pool() as pool:  # create a pool, it will make a pool with all your CPU cores...
        results = pool.map(scrape_code, links)  # distribute your links to scrape_code

    print(results)
NOTE: I'm assuming your links.csv actually holds the link in its first column based on how you're pre-processing the links in your code.
However, as I've stated in my comment, this doesn't necessarily have to be faster than plain threading, so I'd try threads first. Fortunately, the multiprocessing module includes a threading interface, multiprocessing.dummy, so you just need to replace from multiprocessing import Pool with from multiprocessing.dummy import Pool and see which variant runs faster for your workload.
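For reference, a minimal, untested sketch of that thread-based swap, assuming the same scrape_code function and CSV layout as above:

import csv
from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool

if __name__ == "__main__":
    with open("c:\\links.csv", "r", newline="") as f:
        links = [r[0] for r in csv.reader(f)]   # first column holds the link

    with Pool() as pool:                        # defaults to one thread per CPU core
        results = pool.map(scrape_code, links)  # scrape_code comes from the question
    print(results)

Since scraping is I/O-bound, threads usually keep all workers busy without the pickling and process-spawn overhead.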
Related
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set values directly in materials. The function has to use return to send the result back to the main process, which has to wait for the result and collect it.
It can be simpler with Pool: it doesn't need a manually managed queue, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it won't hog the CPU and block other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could look something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW:
Other modules: concurrent.futures, joblib, ray.
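For example, the same idea with concurrent.futures could look roughly like this; it's an untested sketch that still relies on the process_dict helper from the question:

import os
import json
from concurrent.futures import ProcessPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def parse_json(filename):
    # same parsing as asyncJSONs; process_dict comes from the question
    with open(os.path.join(folder, filename)) as f:
        data = json.load(f)
    properties = process_dict(data, {})
    properties['name'] = filename.split('.')[0]
    return properties

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    with ProcessPoolExecutor(max_workers=5) as executor:
        materials = list(executor.map(parse_json, all_jsons))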
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need, and append it to some target file in ndjson/jsonlines format. That's a format where, instead of objects being part of a JSON array [{},{}...], you have separate objects on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP; you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with the append flag, dumps the data, and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and produce conflicting writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
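If you do end up needing one regular JSON array, the join step is tiny in Python as well; a sketch with placeholder file names:

import json

# combine a jsonlines file back into a single JSON array (file names are placeholders)
with open('output.ndjson') as src:
    records = [json.loads(line) for line in src if line.strip()]

with open('combined.json', 'w') as dst:
    json.dump(records, dst)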
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function stuff the results into the queue, while your main code waits for stuff to magically appear in the queue.
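A minimal sketch of that queue pattern; the worker body and file list are placeholders:

from multiprocessing import Process, Queue

def worker(filename, q):
    result = {'name': filename}   # placeholder for the real parsing work
    q.put(result)                 # send the result back to the main process

if __name__ == '__main__':
    q = Queue()
    files = ['a.json', 'b.json']  # hypothetical file list
    procs = [Process(target=worker, args=(f, q)) for f in files]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]  # drain the queue before joining to avoid deadlocks
    for p in procs:
        p.join()
    print(results)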
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
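In skeleton form, that pattern looks something like this:

def main():
    # one-time setup and all the Process/Pool orchestration live here,
    # so child processes re-importing the file only see definitions
    print("running one-time setup")

if __name__ == "__main__":
    main()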
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
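If it does turn out to be I/O-bound, a thread-based sketch reusing the question's jsons list and asyncJSONs function is a cheap experiment; note that with threads the materials[index] assignment actually works, because threads share memory:

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    # jsons, materials and asyncJSONs are assumed to be defined as in the question
    with ThreadPoolExecutor(max_workers=8) as executor:        # 8 is an arbitrary starting point
        list(executor.map(asyncJSONs, jsons, range(len(jsons))))
    print(materials[:3])                                       # spot-check the first few results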
I wrote a simple Python multiprocessing script that reads a bunch of lines from a CSV, calls an API, and then writes to a new CSV. However, what I see is that the performance of this program is the same as sequential execution. Changing the pool size has no effect. What is going wrong?
from multiprocessing import Pool
from random import randint
from time import sleep
import csv
import requests
import json

def orders_v4(order_number):
    response = requests.request("GET", url, headers=headers, params=querystring, verify=False)
    return response.json()

newcsvFile = open('gom_acr_status.csv', 'w')
writer = csv.writer(newcsvFile)

def process_line(row):
    ol_key = row['\ufeffORDER_LINE_KEY']
    order_number = row['ORDER_NUMBER']
    orders_json = orders_v4(order_number)
    oms_order_key = orders_json['oms_order_key']
    order_lines = orders_json["order_lines"]
    for order_line in order_lines:
        if ol_key == order_line['order_line_key']:
            print(order_number)
            print(ol_key)
            ftype = order_line['fulfillment_spec']['fulfillment_type']
            status_desc = order_line['statuses'][0]['status_description']
            print(ftype)
            print(status_desc)
            listrow = [ol_key, order_number, ftype, status_desc]
            writer.writerow(listrow)
            newcsvFile.flush()

def get_next_line():
    with open("gom_acr.csv", 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row

f = get_next_line()
t = Pool(processes=50)
for i in f:
    t.map(process_line, (i,))
t.join()
t.close()
EDIT: I just noticed you call map inside a loop. You need to call it only once; map is a blocking function, it is not async! Check out the docs for examples of correct usage:
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
Original answer:
The fact that all processes write to the output file causes file-system contention.
If your process_line function just returned the rows (e.g. as a list of strings), and the main process wrote all of them after map returned, you should see a performance boost.
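A rough, untested sketch of that shape, with the API call and filtering elided; it also folds in the single-map fix from the edit above:

from multiprocessing import Pool
import csv

def process_line(row):
    # ... call orders_v4 and pick out ftype / status_desc as in the question ...
    ol_key = row['\ufeffORDER_LINE_KEY']
    order_number = row['ORDER_NUMBER']
    return [ol_key, order_number]              # return the row instead of writing it here

if __name__ == '__main__':
    with open('gom_acr.csv', newline='') as src:
        rows = list(csv.DictReader(src))

    with Pool(processes=8) as pool:            # start near your core count rather than 50
        results = pool.map(process_line, rows) # one blocking map call over all rows

    with open('gom_acr_status.csv', 'w', newline='') as dst:
        csv.writer(dst).writerows(results)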
Also, two notes:
Try different numbers of processes, starting from the number of cores and going up; maybe 50 is too many.
The work done in each process seems (to me, at first glance) pretty short; it is possible that the overhead of spawning new processes and orchestrating them is simply too big to benefit the task at hand.
Is there a good way to download a lot of files en masse using python? This code is speedy enough for downloading about 100 or so files. But I need to download 300,000 files. Obviously they are all very small files (or I wouldn't be downloading 300,000 of them :) ) so the real bottleneck seems to be this loop. Does anyone have any thoughts? Maybe using MPI or threading?
Do I just have to live with the bottleneck? Or is there a faster way, maybe not even using Python?
(I included the full beginning of the code just for completeness sake)
from __future__ import division
import pandas as pd
import numpy as np
import urllib2
import os
import linecache

#we start with a huge file of urls
data = pd.read_csv("edgar.csv")
datatemp2 = data[data['form'].str.contains("14A")]
datatemp3 = data[data['form'].str.contains("14C")]

#data2 is the cut-down file
data2 = datatemp2.append(datatemp3)
flist = np.array(data2['filename'])
print len(flist)
print flist

###below we have a script to download all of the files in the data2 database
###here you will need to create a new directory named edgar14A14C in your CWD
original = os.getcwd().copy()
os.chdir(str(os.getcwd()) + str('/edgar14A14C'))

for i in xrange(len(flist)):
    url = "ftp://ftp.sec.gov/" + str(flist[i])
    file_name = str(url.split('/')[-1])
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()
    print i
The usual pattern with multiprocessing is to create a job() function that takes arguments and performs some potentially CPU bound work.
Example: (based on your code)
from multiprocessing import Pool

def job(url):
    file_name = str(url.split('/')[-1])
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()

pool = Pool()
urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
pool.map(job, urls)
This will do a number of things:
Create a multiprocessing pool with as many workers as you have CPUs or CPU cores
Create a list of inputs to the job() function.
Map the list of input urls to job() and wait for all jobs to complete.
Python's multiprocessing.Pool.map will take care of splitting up your input across the no. of workers in the pool.
Another neat little thing I've done for this kind of work is to use the progress library, like this:
from multiprocessing import Pool
from progress.bar import Bar

def job(input):
    pass  # do some work

pool = Pool()
inputs = range(100)
bar = Bar('Processing', max=len(inputs))
for i in pool.imap(job, inputs):
    bar.next()
bar.finish()
This gives you a nice progress bar on your console as your jobs are progressing so you have some idea of progress and eta, etc.
I also find the requests library very useful here, with a much nicer API for dealing with web resources and downloading content.
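requests doesn't speak FTP, so the sketch below assumes the same files are also reachable over HTTP(S); treat the URL scheme as an assumption. A version of job() rewritten with requests could look like this:

import requests

def job(url):
    file_name = url.split('/')[-1]
    resp = requests.get(url, timeout=30)   # assumes an http(s):// URL
    resp.raise_for_status()                # fail loudly on HTTP errors
    with open(file_name, 'wb') as f:
        f.write(resp.content)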
I'm using multiprocessing in Python for parallelizing.
I'm trying to parallelize the process on chunks of data read from an excel file using pandas.
I'm new to multiprocessing and parallel processing. While trying it out on some simple code,
import time;
import os;
from multiprocessing import Process
import pandas as pd

print os.getpid();
df = pd.read_csv('train.csv', sep=',', usecols=["POLYLINE"], iterator=True, chunksize=2);
print "hello";

def my_function(chunk):
    print chunk;

count = 0;
processes = [];
for chunk in df:
    if __name__ == '__main__':
        p = Process(target=my_function, args=(chunk,));
        processes.append(p);
        if(count == 4):
            break;
        count = count + 1;
The print "hello" is being executed multiple times, I'm guessing the individual process created should work on the target rather than main code.
Can anyone suggest me where I'm wrong.
The way that multiprocessing works is to create a new process and then import the file containing the target function. Since your outermost scope has print statements, they will get executed once for every process.
By the way you should use a Pool instead of Processes directly. Here's a cleaned up example:
import os
import time
from multiprocessing import Pool

import pandas as pd

NUM_PROCESSES = 4

def process_chunk(chunk):
    # do something
    return chunk

if __name__ == '__main__':
    df = pd.read_csv('train.csv', sep=',', usecols=["POLYLINE"], iterator=True, chunksize=2)

    pool = Pool(NUM_PROCESSES)
    for result in pool.map(process_chunk, df):
        print result
Using multiprocessing is probably not going to speed up reading data from disk, since disk access is much slower than e.g. RAM access or calculations. And the different pieces of the file will end up in different processes.
Using mmap could help speed up data access.
If you do a read-only mmap of the data file before starting e.g. a Pool.map, each worker could read its own slice of data from the shared memory mapped file and process it.
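A rough Python 3 sketch of that idea, splitting a hypothetical data file into byte ranges; a real CSV would additionally need the slices aligned on line boundaries:

import os
import mmap
from multiprocessing import Pool

FILENAME = 'train.csv'   # placeholder data file

def process_slice(byte_range):
    start, end = byte_range
    with open(FILENAME, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[start:end]   # each worker reads only its own slice of the mapping
    return len(chunk)               # placeholder for real processing

if __name__ == '__main__':
    size = os.path.getsize(FILENAME)
    step = max(1, size // 4)
    ranges = [(i, min(i + step, size)) for i in range(0, size, step)]
    with Pool(4) as pool:
        print(pool.map(process_slice, ranges))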