I have the following code
import random
import csv
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
def generate_username(id, job_location):
    number = "{:03d}".format(random.randrange(1, 999))
    return "".join([id, job_location.strip(), str(number)])

def append_to_list(l, idx, job_location):
    l.append([idx, generate_username(str(idx), job_location)])

def generate_csv(filepath, df):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]
    for idx, location in zip(ids, locations):
        rows.append([idx, generate_username(str(idx), location)])
    with open(filepath, 'w') as file:
        writer = csv.writer(file)
        writer.writerows(rows)
And this is the multithreading implementation
def generate_csv_threads(filepath, df, n):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]
    with ThreadPoolExecutor(max_workers=n) as executor:
        executor.map(append_to_list, rows, ids, locations)
        executor.shutdown(wait=True)
    with open(filepath, 'w') as file:
        writer = csv.writer(file)
        writer.writerows(rows)
I have several questions about this. I read that list.append is thread-safe, so I would not need a lock. However, the CSV generated looks like this:
[['EMP_ID', 'username', [234687, '234687Oregon696']]]
(I have more than one user to generate)
If generate_username is a very fast, CPU-bound operation (as it appears to be), then you won't get any benefit from multithreading in Python. Worse, it will likely make the program slower and introduce subtle concurrency issues.
The .append() method is thread-safe in CPython (the most common Python implementation) because of the GIL (Global Interpreter Lock), but you generally should not rely on that: the GIL may be removed in a future version, and it does not exist in some other Python implementations.
You can use the concurrent.futures.as_completed() function to collect the results of the thread pool sequentially in the main thread, instead of having the worker threads append to a shared list.
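For illustration, a minimal sketch of that approach, reusing generate_username and the DataFrame columns from your question (note the rows end up in completion order, not input order):

from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_csv_threads(filepath, df, n):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]
    with ThreadPoolExecutor(max_workers=n) as executor:
        # one task per employee; remember which EMP_ID each future belongs to
        futures = {executor.submit(generate_username, str(idx), loc): idx
                   for idx, loc in zip(ids, locations)}
        # collect results as they finish; only the main thread touches rows
        for future in as_completed(futures):
            rows.append([futures[future], future.result()])
    with open(filepath, 'w', newline='') as file:
        csv.writer(file).writerows(rows)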
The first thing to do before reaching for multithreading or multiprocessing is to benchmark your program. Only choose multithreading or multiprocessing if you have clearly identified parts that can be accelerated by those tools.
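Even something as simple as this, run before and after any change, will tell you whether the change actually helped (the output file name here is a placeholder):

import time

start = time.perf_counter()
generate_csv("usernames.csv", df)  # the plain single-threaded version
print(f"single-threaded: {time.perf_counter() - start:.3f}s")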
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed
import numpy as np
import time
# creating iterable
testDict = {}
for i in range(1000):
    testDict[i] = np.random.randint(1, 10)

# default method
stime = time.time()
newdict = []
for k, v in testDict.items():
    for i in range(1000):
        v = np.tanh(v)
    newdict.append(v)
etime = time.time()
print(etime - stime)
# output: 1.1139910221099854

# multiprocessing
stime = time.time()
testresult = []

def f(item):
    x = item[1]
    for i in range(1000):
        x = np.tanh(x)
    return x

def main(testDict):
    with ProcessPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(f, item) for item in testDict.items()]
        for future in as_completed(futures):
            testresult.append(future.result())

if __name__ == '__main__':
    main(testDict)

etime = time.time()
print(etime - stime)
# output: 3.4509658813476562
I'm learning multiprocessing and testing things out. I ran this test to check whether I have implemented it correctly. Looking at the time taken, the concurrent method is 3 times slower. So what's wrong?
My objective is to parallelize a script which mostly operates on a dictionary of around 500 items. In each loop, the values of those 500 items are processed and updated. This loops for, say, 5000 generations. None of the k,v pairs interact with any other k,v pairs. [It's a genetic algorithm.]
I am also looking for guidance on how to parallelize the objective described above. If I use the correct concurrent.futures method on each function in my genetic algorithm code, where each function takes a dictionary as input and outputs a new dictionary, will it be useful? Any guides/resources/help is appreciated.
Edit: If I run this example: https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor-example, it takes 3 times longer to solve than a plain for loop.
There are a couple of basic problems here: you're using NumPy, but you're not vectorizing your calculations. You won't get NumPy's speed benefit with code written this way, and you might as well just use the standard library math module, which is faster than NumPy for this style of scalar code:
# 0.089sec
import math

for k, v in testDict.items():
    for i in range(1000):
        v = math.tanh(v)
    newdict.append(v)
Only once you vectorise the operation do you see the benefit of NumPy:
# 0.016sec
for k, v in testDict.items():
    arr = np.full(1000, v)
    arr2 = np.tanh(arr)
    newdict.append(arr2[-1])
For comparison, your original single-threaded code runs in 1.171sec on my test machine. As you can see, when it's not used properly, NumPy can be a couple of orders of magnitude slower than even pure Python.
Now on to why you're seeing what you're seeing.
To be honest, I can't replicate your timing results. Your original multiprocessing code runs in 0.299sec for me (macOS on Python 3.6), which is faster than the single-process code. If I had to guess, you're probably using Windows. On some platforms, like Windows, creating a child process and setting up the environment to run a multiprocessing task is very expensive, so using multiprocessing for a task that lasts less than a few seconds is of dubious benefit. If you are interested in why, read here.
Also, on platforms that lack a usable fork(), like macOS from Python 3.8 onward or Windows, when you use multiprocessing the child process has to re-import the module. So if you put both pieces of code in the same file, it has to run your single-threaded code in the child processes before it can run the multiprocessing code. You'll likely want to put your test code in a function and protect the top-level code with an if __name__ == "__main__": block. On macOS with Python 3.8 or higher, you can also revert to the fork start method by calling multiprocessing.set_start_method("fork"), provided you're not calling into macOS's non-fork-safe framework libraries.
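As a rough sketch of that structure (the function names here are mine, not from your code):

import multiprocessing

def run_single_process_benchmark():
    ...  # the plain for-loop version goes here

def run_multiprocessing_benchmark():
    ...  # the ProcessPoolExecutor version goes here

if __name__ == "__main__":
    # Optional, macOS + Python 3.8+ only, and only if no non-fork-safe
    # framework libraries are involved:
    # multiprocessing.set_start_method("fork")
    run_single_process_benchmark()
    run_multiprocessing_benchmark()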
With that out of the way, on to your title question.
When you use multiprocessing, you need to copy data to the child process and back to the main process to retrieve the result, and there's a cost to spawning child processes. To benefit from multiprocessing, you need to design your workload so that this part of the cost is negligible.
If your data comes from an external source, try loading it in the child processes rather than having the main process load it and then transfer it to the children; have the main process tell each child how to fetch its own slice of the data. Here you're generating testDict in the main process, so if you can, parallelize that and move the generation into the children instead.
Also, since you're using NumPy: if you vectorise your operations properly, NumPy releases the GIL during vectorised operations, so you may be able to just use multithreading instead. Because NumPy doesn't hold the GIL during a vector operation, you can take advantage of multiple threads in a single Python process, and you don't need to fork or copy data over to child processes, as threads share memory.
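A sketch of that idea, assuming the work can be expressed as NumPy calls on reasonably large arrays (the array size and worker count below are arbitrary):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def heavy_vector_op(seed):
    # NumPy releases the GIL inside calls like np.tanh on large arrays,
    # so several of these can run in parallel within one process.
    arr = np.full(1_000_000, float(seed))
    return np.tanh(arr).sum()

if __name__ == "__main__":
    seeds = np.random.randint(1, 10, size=16)
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(heavy_vector_op, seeds))
    print(results)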
I wrote a simple Python multiprocessing script that reads a bunch of lines from a CSV, calls an API, and then writes to a new CSV. However, what I see is that the performance of this program is the same as sequential execution. Changing the pool size has no effect. What is going wrong?
from multiprocessing import Pool
from random import randint
from time import sleep
import csv
import requests
import json

def orders_v4(order_number):
    response = requests.request("GET", url, headers=headers, params=querystring, verify=False)
    return response.json()

newcsvFile = open('gom_acr_status.csv', 'w')
writer = csv.writer(newcsvFile)

def process_line(row):
    ol_key = row['\ufeffORDER_LINE_KEY']
    order_number = row['ORDER_NUMBER']
    orders_json = orders_v4(order_number)
    oms_order_key = orders_json['oms_order_key']
    order_lines = orders_json["order_lines"]
    for order_line in order_lines:
        if ol_key == order_line['order_line_key']:
            print(order_number)
            print(ol_key)
            ftype = order_line['fulfillment_spec']['fulfillment_type']
            status_desc = order_line['statuses'][0]['status_description']
            print(ftype)
            print(status_desc)
            listrow = [ol_key, order_number, ftype, status_desc]
            # (writer)
            writer.writerow(listrow)
            newcsvFile.flush()

def get_next_line():
    with open("gom_acr.csv", 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row

f = get_next_line()
t = Pool(processes=50)
for i in f:
    t.map(process_line, (i,))
t.join()
t.close()
EDIT: I just noticed that you call map inside a loop. You need to call it only once; map is a blocking function, it is not async! Check out the docs for examples of correct usage:
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
Original answer:
The fact that all processes write to the output file causes file-system contention.
If your process_line function just returned the rows (e.g. as a list of strings) and the main process wrote all of them out after map returned, you should see a performance boost.
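A sketch of that restructuring, reusing orders_v4 and the column names from your code (the pool size of 8 is just a placeholder to tune):

import csv
from multiprocessing import Pool

def process_line(row):
    ol_key = row['\ufeffORDER_LINE_KEY']
    order_number = row['ORDER_NUMBER']
    orders_json = orders_v4(order_number)
    for order_line in orders_json["order_lines"]:
        if ol_key == order_line['order_line_key']:
            ftype = order_line['fulfillment_spec']['fulfillment_type']
            status_desc = order_line['statuses'][0]['status_description']
            return [ol_key, order_number, ftype, status_desc]
    return None  # no matching order line found

if __name__ == '__main__':
    with open("gom_acr.csv", newline='') as src:
        rows = list(csv.DictReader(src))
    with Pool(processes=8) as pool:
        results = pool.map(process_line, rows)  # one blocking map call over all rows
    with open('gom_acr_status.csv', 'w', newline='') as out:
        csv.writer(out).writerows(r for r in results if r is not None)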
Also, two notes:
Try different numbers of processes, starting from the number of cores and going up; 50 is probably too many (a minimal starting point is sketched after these notes).
The work done in each process seems (to me, at first glance) pretty short, so it is possible that the overhead of spawning new processes and orchestrating them is simply too big to benefit the task at hand.
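On the first note, a baseline like this is a reasonable place to start (cpu_count() is just a starting value to tune from):

from multiprocessing import Pool, cpu_count

# start near the number of cores and adjust from there rather than guessing 50
t = Pool(processes=cpu_count())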
I ran into a pickle (literally) while parallelizing the following Python code and could really use some help.
First of all, the input is a CSV file consisting of a list of website links that I need to scrape with the function scrape_code(). The original code is as follows and runs perfectly:
with open('C:\\links.csv', 'r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

m = []
for i in inputlist:
    m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))  # remove the quotes around the link strings, otherwise it results in URLError
print(m)
I then tried to parallelize this code using joblib as follows:
from joblib import Parallel, delayed
import multiprocessing

with open('C:\\links.csv', 'r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))) for i in inputlist)
However, this would result in a weird error:
File "C:\Users\...\joblib\pool.py", line 371, in send
CustomizablePickler(buffer, self._reducers).dump(obj)
AttributeError: Can't pickle local object 'delayed.<locals>.delayed_function'
Any idea what I did wrong here? If I put the append in a separate function like below, the error goes away, but the execution then freezes and hangs indefinitely:
def process(k):
    a = []
    a.append(scrape_code(re.sub("\'|\[|\]", '', str(k))))
    return a

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(process)(i) for i in inputlist)
The input list has 10000s of pages so parallel processing would be a huge benefit.
If you really need it in separate processes, the easiest way is to just create a process pool and let it deal with distributing the links to your function, e.g.:
import csv
from multiprocessing import Pool

if __name__ == "__main__":  # multiprocessing guard
    with open("c:\\links.csv", "r", newline="") as f:  # open the CSV
        reader = csv.reader(f)  # create a reader
        links = [r[0] for r in reader]  # collect only the first column
    with Pool() as pool:  # create a pool with a worker per CPU core
        results = pool.map(scrape_code, links)  # distribute your links to scrape_code
    print(results)
NOTE: I'm assuming your links.csv actually holds the link in its first column based on how you're pre-processing the links in your code.
However, as I've stated in my comment, this doesn't necessarily have to be faster than plain threading, so I'd first try it using threads. Fortunately, the multiprocessing module includes a threading interface, multiprocessing.dummy, so you just need to replace from multiprocessing import Pool with from multiprocessing.dummy import Pool and see in which regime your code works faster.
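If you want to try the thread-based variant first, the only change to the sketch above is the import; links and scrape_code stay exactly as before:

from multiprocessing.dummy import Pool  # thread pool with the same Pool API

with Pool() as pool:  # threads share memory, so nothing needs to be pickled
    results = pool.map(scrape_code, links)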
I'm trying to read a CSV file and measure the time required with linear processing, with threads, and with multiprocessing, but my code doesn't seem to give me the correct output. It would be great if someone could help me with the code.
Multiprocessing and linear processing:
import csv
import time
import multiprocessing
from multiprocessing import Process

proces = []
number_of_processes = 2

class mulprocess():
    def data(self):
        with open('test.csv', 'r+') as f:
            reader = csv.reader(f)
            for row in reader:
                print row

    def processdata(self):
        for i in range(2):
            start_time = time.time()
            proces.append(multiprocessing.Process(target=self.data(), args=()))
        for p in proces:
            p.start()
        for p in proces:
            p.join()
        end_time = time.time()
        print end_time - start_time

a = mulprocess()
a.data()
a.processdata()
Threading:
import csv
import time
import threading

thread_count = 2
threads = []

class Operation():
    def data(self):
        with open('test.csv', 'r+') as f:
            reader = csv.reader(f)
            for row in reader:
                print row

    def filedata(self):
        for i in range(thread_count):
            threads.append(threading.Thread(target=self.data, args=()))
        start_time = time.time()
        for t in threads:
            t.start()
            t.join()
        end_time = time.time()
        print end_time - start_time

a = Operation()
a.filedata()
Ignoring the pointlessness of loading the same file over and over in different processes (which, by the way, might cause concurrency errors because you're opening the file in r+ mode, effectively locking it) and the generally ill-advised approach of running instance methods as separate processes, the crux of your problem lies in this line:
proces.append(multiprocessing.Process(target=self.data(), args=()))
What you're telling the multiprocessing.Process instance there is to set the target to whatever your self.data() call returns, which is None. So the process doesn't get anything to work with or call, everything executes in your main process, and of course it takes double the time. Check this answer to see how to set up a multiprocessing benchmark properly.
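For reference, a sketch of the corrected method, with the timing also moved out of the append loop (that rearrangement is mine, not part of your original):

def processdata(self):
    start_time = time.time()
    for i in range(number_of_processes):
        # pass the bound method itself; calling it here would run it in the parent
        proces.append(multiprocessing.Process(target=self.data, args=()))
    for p in proces:
        p.start()
    for p in proces:
        p.join()
    print(time.time() - start_time)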
As for threading: there's no reason to try it here, as it will be slower than a single-threaded approach for this kind of processing. The only advantage threading can give you is semi-parallelizing I/O operations.
I'm using multiprocessing in Python for parallelizing.
I'm trying to parallelize the processing of chunks of data read from an Excel file using pandas.
I'm new to multiprocessing and parallel processing. While trying it out on this simple code:
import time
import os
from multiprocessing import Process
import pandas as pd

print os.getpid()
df = pd.read_csv('train.csv', sep=',', usecols=["POLYLINE"], iterator=True, chunksize=2)
print "hello"

def my_function(chunk):
    print chunk

count = 0
processes = []
for chunk in df:
    if __name__ == '__main__':
        p = Process(target=my_function, args=(chunk,))
        processes.append(p)
    if count == 4:
        break
    count = count + 1
The print "hello" is being executed multiple times, I'm guessing the individual process created should work on the target rather than main code.
Can anyone suggest me where I'm wrong.
The way multiprocessing works is to create a new process and then import the file containing the target function. Since your outermost scope has print statements, they get executed once for every process.
By the way, you should use a Pool instead of Process objects directly. Here's a cleaned-up example:
import os
import time
from multiprocessing import Pool
import pandas as pd

NUM_PROCESSES = 4

def process_chunk(chunk):
    # do something
    return chunk

if __name__ == '__main__':
    df = pd.read_csv('train.csv', sep=',', usecols=["POLYLINE"], iterator=True, chunksize=2)
    pool = Pool(NUM_PROCESSES)
    for result in pool.map(process_chunk, df):
        print result
Using multiprocessing is probably not going to speed up reading data from disk, since disk access is much slower than e.g. RAM access or calculations. And the different pieces of the file will end up in different processes.
Using mmap could help speed up data access.
If you do a read-only mmap of the data file before starting e.g. a Pool.map, each worker could read its own slice of data from the shared memory mapped file and process it.
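A rough sketch of that pattern, where each worker mmaps the file read-only and handles its own byte range (the line counting is placeholder work, and a real version would align slice boundaries to newlines):

import mmap
import os
from multiprocessing import Pool

FILENAME = 'train.csv'  # the input file from the question

def process_slice(byte_range):
    start, end = byte_range
    with open(FILENAME, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            data = mm[start:end]      # read only this worker's slice
            return data.count(b'\n')  # placeholder work: count lines in the slice

if __name__ == '__main__':
    size = os.path.getsize(FILENAME)
    n_workers = 4
    step = size // n_workers
    ranges = [(i * step, size if i == n_workers - 1 else (i + 1) * step)
              for i in range(n_workers)]
    with Pool(n_workers) as pool:
        print(sum(pool.map(process_slice, ranges)))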