I wrote a simple Python multiprocessing script that reads a bunch of lines from a CSV, calls an API, and then writes the results to a new CSV. However, the performance of this program is the same as sequential execution, and changing the pool size has no effect. What is going wrong?
from multiprocessing import Pool
from random import randint
from time import sleep
import csv
import requests
import json

def orders_v4(order_number):
    response = requests.request("GET", url, headers=headers, params=querystring, verify=False)
    return response.json()

newcsvFile = open('gom_acr_status.csv', 'w')
writer = csv.writer(newcsvFile)

def process_line(row):
    ol_key = row['\ufeffORDER_LINE_KEY']
    order_number = row['ORDER_NUMBER']
    orders_json = orders_v4(order_number)
    oms_order_key = orders_json['oms_order_key']
    order_lines = orders_json["order_lines"]
    for order_line in order_lines:
        if ol_key == order_line['order_line_key']:
            print(order_number)
            print(ol_key)
            ftype = order_line['fulfillment_spec']['fulfillment_type']
            status_desc = order_line['statuses'][0]['status_description']
            print(ftype)
            print(status_desc)
            listrow = [ol_key, order_number, ftype, status_desc]
            #(writer)
            writer.writerow(listrow)
            newcsvFile.flush()

def get_next_line():
    with open("gom_acr.csv", 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row

f = get_next_line()
t = Pool(processes=50)
for i in f:
    t.map(process_line, (i,))
t.join()
t.close()
EDIT: I just noticed that you call map inside a loop. You need to call it only once. map is a blocking function, it is not async! Check out the docs for examples of correct usage:
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
Original answer:
The fact that all processes write to the output file causes file-system contention.
If your process_line function just returned the rows (e.g. as a list of strings) and the main process wrote them all out after map returned, you should see a performance boost.
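A minimal sketch of that approach (assuming the orders_v4 helper and column names from the question, and an arbitrary pool size of 8): call map once over all rows and do the writing in the main process after it returns.

from multiprocessing import Pool
import csv

def process_line(row):
    # build and return the output row instead of writing it in the worker
    ol_key = row['\ufeffORDER_LINE_KEY']
    order_number = row['ORDER_NUMBER']
    orders_json = orders_v4(order_number)   # API helper from the question
    for order_line in orders_json["order_lines"]:
        if ol_key == order_line['order_line_key']:
            ftype = order_line['fulfillment_spec']['fulfillment_type']
            status_desc = order_line['statuses'][0]['status_description']
            return [ol_key, order_number, ftype, status_desc]
    return None  # no matching order line found

if __name__ == '__main__':
    with open("gom_acr.csv", newline='') as csvfile:
        rows = list(csv.DictReader(csvfile))

    with Pool(processes=8) as pool:   # a single map call over all rows
        results = pool.map(process_line, rows)

    with open('gom_acr_status.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerows(r for r in results if r is not None)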
Also, two notes:
Try different numbers of processes, starting from the number of cores and going up. Maybe 50 is too many.
The work done in each process seems (to me, at first glance) pretty short; it is possible that the overhead of spawning new processes and orchestrating them is just too big to benefit the task at hand.
Related
I have the following code
import random
import csv
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def generate_username(id, job_location):
    number = "{:03d}".format(random.randrange(1, 999))
    return "".join([id, job_location.strip(), str(number)])

def append_to_list(l, idx, job_location):
    l.append([idx, generate_username(str(idx), job_location)])

def generate_csv(filepath, df):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]
    for idx, location in zip(ids, locations):
        rows.append([idx, generate_username(str(idx), location)])
    with open(filepath, 'w') as file:
        writer = csv.writer(file)
        writer.writerows(rows)
And this is the multithreading implementation
def generate_csv_threads(filepath, df, n):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]
    with ThreadPoolExecutor(max_workers=n) as executor:
        executor.map(append_to_list, rows, ids, locations)
        executor.shutdown(wait=True)
    with open(filepath, 'w') as file:
        writer = csv.writer(file)
        writer.writerows(rows)
I have several questions regarding this. I saw that append is thread-safe, so I would not need a lock. However, the CSV generated is the following:
[['EMP_ID', 'username', [234687, '234687Oregon696']]]
(I have more than one user to generate)
If generate_username is a very fast, CPU-bound operation (as it seems to be), then you won't get any benefit from multithreading in Python. Worse, it will likely make the program slower and introduce subtle concurrency issues.
The list .append() method is thread-safe in CPython (the most common Python implementation) because of the GIL (Global Interpreter Lock), but you generally should not rely on that: the GIL may be removed in a future version, and it does not exist in some other Python implementations.
You can use the concurrent.futures.as_completed() function to process the results of the thread pool in a sequential manner.
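For illustration only, a hedged sketch of that pattern using the question's generate_username helper and column names: submit one task per (id, location) pair and collect each result as it completes, keeping all the writing in the main thread.

from concurrent.futures import ThreadPoolExecutor, as_completed
import csv

def generate_csv_as_completed(filepath, df, n):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]
    with ThreadPoolExecutor(max_workers=n) as executor:
        # submit one task per (id, location) pair
        futures = {executor.submit(generate_username, str(idx), loc): idx
                   for idx, loc in zip(ids, locations)}
        # collect each result as soon as its thread finishes
        for future in as_completed(futures):
            rows.append([futures[future], future.result()])
    with open(filepath, 'w', newline='') as file:
        csv.writer(file).writerows(rows)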
The first thing to do before trying multithreading or multiprocessing is to benchmark your program's execution. Only choose to use multithreading or multiprocessing if you have clearly identified parts that can be accelerated by such tools.
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
            properties = process_dict(data, {})
            properties['name'] = file.split('.')[0]
            materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, and the main process has to wait for the result and collect it.
It can be simpler with Pool. It doesn't need to use a queue manually, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it will not block the CPU for other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW:
Other modules: concurrent.futures, joblib, ray.
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need, and append it to some target file in ndjson/jsonlines format. That's just a format where, instead of objects being part of a JSON array [{},{}...], you have a separate object on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
Spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP; you could use a tool like rush to parallelize.
Each worker parses its data and generates the output dict.
Each worker opens the output file with the append flag, dumps the data, and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is smaller than the buffer size of your kernel (usually several MB), your different processes won't stomp on each other and corrupt each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a much better data format for long lists of objects, since you don't have to parse the whole thing in memory.
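As a small illustration of why that helps (a sketch, assuming a hypothetical output.ndjson file produced as above), the records can be consumed one line at a time without ever loading the whole list:

import json

# stream records one line at a time instead of loading one huge JSON array
with open("output.ndjson") as fp:   # hypothetical file name
    for line in fp:
        record = json.loads(line)
        # process each record here, e.g. filter or aggregate
        print(record)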
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function stuff the results into the queue, while your main code waits for stuff to magically appear in it.
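A minimal sketch of that queue pattern, with a stand-in parse_one function in place of the question's JSON parsing (the names and file list here are illustrative, not the asker's actual code):

import multiprocessing as mp

def parse_one(filename, queue):
    # stand-in for the real JSON parsing; put (filename, result) on the queue
    result = {"name": filename.split('.')[0]}
    queue.put((filename, result))

if __name__ == "__main__":
    files = ["a.json", "b.json", "c.json"]   # illustrative file list
    queue = mp.Queue()
    workers = [mp.Process(target=parse_one, args=(f, queue)) for f in files]
    for w in workers:
        w.start()
    # collect exactly one result per worker before joining
    results = dict(queue.get() for _ in workers)
    for w in workers:
        w.join()
    print(results)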
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
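For illustration, a sketch of the shape that guard usually takes; the main() body here is just a placeholder for the asker's setup and result handling:

import multiprocessing as mp

def worker(task):
    # placeholder for the per-file work
    return task

def main():
    # one-time setup and orchestration live here, not at module level
    tasks = ["a.json", "b.json"]          # illustrative
    with mp.Pool() as pool:
        results = pool.map(worker, tasks)
    print(results)

if __name__ == "__main__":
    main()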
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
I ran into a pickle (literally) in parallelizing the following Python code and could really use some help.
First of all, the input is a CSV file consisting of a list of website links that I need to scrape with the function scrape_code(). The original code is as follows and runs perfectly:
with open('C:\\links.csv','r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

m = []
for i in inputlist:
    m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))  # remove the quotes around the link strings otherwise it results in URLError
print(m)
I then tried to parallelize this code using joblib as follows:
from joblib import Parallel, delayed
import multiprocessing

with open('C:\\links.csv','r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))) for i in inputlist)
However, this would result in a weird error:
File "C:\Users\...\joblib\pool.py", line 371, in send
CustomizablePickler(buffer, self._reducers).dump(obj)
AttributeError: Can't pickle local object 'delayed.<locals>.delayed_function'
Any idea what I did wrong here? If I try to put the append in a separate function like below, the error goes away, but the execution then freezes and hangs indefinitely:
def process(k):
    a = []
    a.append(scrape_code(re.sub("\'|\[|\]", '', str(k))))
    return a

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(process)(i) for i in inputlist)
The input list has 10000s of pages so parallel processing would be a huge benefit.
If you really need it in separate processes, the easiest way is to just create a process pool and let it deal with distributing the links to your function, e.g.:
import csv
from multiprocessing import Pool

if __name__ == "__main__":  # multiprocessing guard
    with open("c:\\links.csv", "r", newline="") as f:  # open the CSV
        reader = csv.reader(f)  # create a reader
        links = [r[0] for r in reader]  # collect only the first column

    with Pool() as pool:  # create a pool that uses all your CPU cores by default
        results = pool.map(scrape_code, links)  # distribute your links to scrape_code

    print(results)
NOTE: I'm assuming your links.csv actually holds the link in its first column based on how you're pre-processing the links in your code.
However, as I've stated in my comment, this isn't necessarily faster than plain threading, so I'd first try it using threads. Fortunately, the multiprocessing module includes a thread-based interface, multiprocessing.dummy, so you just need to replace from multiprocessing import Pool with from multiprocessing.dummy import Pool and see in which regime your code works faster.
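A sketch of that thread-based variant, assuming the same scrape_code function and links list as above; only the import changes:

from multiprocessing.dummy import Pool  # thread pool with the same Pool API

if __name__ == "__main__":
    with Pool(20) as pool:  # thread count is a tuning knob, not a CPU-core limit
        results = pool.map(scrape_code, links)  # scrape_code and links as above
    print(results)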
I am trying to figure out how to write a program that performs computations in parallel such that the result of each computation can be written to a file in a specific order. My problem is size: I would like to do what I've outlined in the sample program below - save the large output as the value of a dictionary whose keys record the ordering. But my program keeps breaking because it can't store/pass around so many bytes.
Is there a set way to approach such problems? I'm new to dealing with both multiprocessing and large data.
from multiprocessing import Process, Manager

def eachProcess(i, d):
    LARGE_BINARY_OBJECT = # perform some computation resulting in millions of bytes
    d[i] = LARGE_BINARY_OBJECT

def main():
    manager = Manager()
    d = manager.dict()
    maxProcesses = 10
    for i in range(maxProcesses):
        process = Process(target=eachProcess, args=(i, d))
        process.start()

    counter = 0
    while counter < maxProcesses:
        file1 = open("test.txt", "wb")
        if counter in d:
            file1.write(d[counter])
            counter += 1

if __name__ == '__main__':
    main()
Thank you.
When dealing with large data, there are usually two approaches:
The local file system, if the problem is simple enough
Remote data storage, if more complex handling of the data is needed
As your problem seems pretty simple, I'd suggest the following solution. Each process writes its partial solution to a local file. Once all processing is done, the main process combines all result files together.
from multiprocessing import Pool
from tempfile import NamedTemporaryFile

def worker_function(partial_result_path):
    data = produce_large_binary()  # placeholder for the expensive computation
    with open(partial_result_path, 'wb') as partial_result_file:
        partial_result_file.write(data)

if __name__ == '__main__':
    max_processes = 10
    # storing partial results in temporary files; delete=False keeps each file
    # on disk so the workers can reopen it by name
    partial_result_paths = []
    for _ in range(max_processes):
        tmp = NamedTemporaryFile(delete=False)
        tmp.close()  # close it so the worker can reopen it by path
        partial_result_paths.append(tmp.name)

    with Pool(max_processes) as pool:
        pool.map(worker_function, partial_result_paths)

    # the main process combines all partial results into the final file
    with open('test.txt', 'wb') as result_file:
        for partial_result_path in partial_result_paths:
            with open(partial_result_path, 'rb') as partial_result_file:
                result_file.write(partial_result_file.read())
I have created a class which loops through a file and, after checking whether a line is valid, writes that line to another file. Checking each line is a lengthy process, making it very slow. I need to implement either threading or multiprocessing in the process_file function; I do not know which library is best suited for speeding this function up or how to implement it.
class FileProcessor:
    def process_file(self):
        with open('file.txt', 'r') as f:
            with open('outfile.txt', 'w') as output:
                for line in f:
                    # There's some string manipulation code here...
                    validate = FileProcessor.do_stuff(self, line)
                    # If true write line to output.txt

    def do_stuff(self, line):
        # Does stuff...
        pass
Extra Information:
The code goes through a proxy list checking whether it is online. This is a lengthy and time consuming process.
Thank you for any insight or help!
The code goes through a proxy list checking whether it is online
It sounds like what takes a long time is connecting to the internet, meaning your task is IO bound and thus threads can help speed it up. Multiple processes are always applicable but can be harder to use.
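A hedged sketch of a thread-based version of process_file, keeping the class shape from the question; the do_stuff body here is only a stand-in for the real proxy check:

from concurrent.futures import ThreadPoolExecutor

class FileProcessor:
    def do_stuff(self, line):
        # stand-in for the real proxy check; return True if the proxy is online
        return bool(line.strip())

    def process_file(self, workers=20):
        with open('file.txt') as f:
            lines = f.readlines()
        with open('outfile.txt', 'w') as output, \
             ThreadPoolExecutor(max_workers=workers) as executor:
            # run the slow checks concurrently; results come back in input order
            for line, is_valid in zip(lines, executor.map(self.do_stuff, lines)):
                if is_valid:
                    output.write(line)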
This seems like a job for a multiprocessing Pool.map.
import multiprocessing

def process_file(filename):
    pool = multiprocessing.Pool(4)
    with open(filename) as fd:
        results = pool.imap_unordered(do_stuff, (line for line in fd))
        with open("output.txt", "w") as fd:
            for r in results:
                fd.write(r)

def do_stuff(item):
    return "I did something with %s\n" % item

process_file(__file__)
You can also use multiprocessing.dummy.Pool instead if you want to use threads (which might be preferable in this case since you are I/O bound).
Essentially you are passing an iterable to imap_unordered (or imap if order matters) and farming out portions of it to other processes (or threads if using dummy). You can tune the chunksize of the map to help with efficiency.
If you want to encapsulate this into a class, you'll need to use multiprocessing.dummy. (Otherwise it can't pickle the instance method.)
You do have to wait until the map finishes before you can process the results, although you could write the results in do_stuff instead -- just be sure to open the file in append mode, and you'll likely want to lock the file.
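For the thread-based (multiprocessing.dummy) variant, a sketch of that last idea, with a module-level lock guarding the appends; the lock and output file name are illustrative:

import threading
from multiprocessing.dummy import Pool

write_lock = threading.Lock()  # illustrative module-level lock

def do_stuff(item):
    result = "I did something with %s\n" % item
    # append mode plus a lock so concurrent writers don't interleave lines
    with write_lock:
        with open("output.txt", "a") as fd:
            fd.write(result)

def process_file(filename):
    with open(filename) as fd:
        with Pool(4) as pool:
            pool.map(do_stuff, fd)  # each worker writes its own result

process_file(__file__)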