Pass around large amounts of data with multiprocessing - python

I am trying to figure out how to write a program that performs computations in parallel such that the result of each computation can be written to a file in a specific order. My problem is size; I would like to do what I've outlined in the sample program below - save the large output as the value of a dictionary which stores the ordering system in its keys. But my program keeps breaking because it can't store/pass around so many bytes.
Is there a set way to approach such problems? I'm new to dealing with both multiprocessing and large data.
from multiprocessing import Process, Manager
def eachProcess(i, d):
LARGE_BINARY_OBJECT = #perform some computation resulting in millions of bytes
d[i] = LARGE_BINARY_OBJECT
def main():
manager = Manager()
d = manager.dict()
maxProcesses = 10
for i in range(maxProcesses):
process = Process(target=eachProcess, args=(i,d))
process.start()
counter = 0
while counter < maxProcesses:
file1 = open("test.txt", "wb")
if counter in d:
file1.write(d[counter])
counter += 1
if __name__ == '__main__':
main()
Thank you.

When dealing with large data usually the approaches are two:
Local file system if the problem is simple enough
Remote data storage if more complex support over data is needed
As your problem seems pretty simple, I'd suggest the following solution. Each process writes its partial solution to a local file. Once all processing is done, the main process combines all result files together.
from multiprocessing import Pool
from tempfile import NamedTemporaryFile
def worker_function(partial_result_path):
data = produce_large_binary()
with open(partial_result_path, 'wb') as partial_result_file:
partial_result_file.write(data)
# storing partial results in temporary files
partial_result_paths = [NamedTemporaryFile() for i in range(max_processes)]
pool = Pool(max_processes)
pool.map(worker_function, partial_result_paths)
with open('test.txt', 'wb') as result_file:
for partial_result_path in partial_result_paths:
with open(partial_result_path) as partial_result_file:
result_file.write(partial_result_file.read())

Related

Python multitprocessing to process files

I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp
jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)
def asyncJSONs(file, index):
try:
with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
data = json.loads(f.read())
properties = process_dict(data, {})
properties['name'] = file.split('.')[0]
materials[index] = properties
except:
print("Error parsing at {}".format(file))
process_list = []
i = 0
for file in tqdm(jsons):
p = mp.Process(target=asyncJSONs,args=(file,i))
p.start()
process_list.append(p)
i += 1
for process in process_list:
process.join()
Everything in that relating to multiprocessing was cobbled together from a collection of google searches and articles, so I wouldn't be surprised if it wasn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers - processes don't share memory and you can't set value directly in materials. Function has to use return to send result back to main process and it has to wait for result and get it.
It can be simpler with Pool. It doesn't need to use queue manually. And it should return results in the same order as data in all_jsons. And you can set how many processes to run at the same time so it will not block CPU for other processes in system.
But it can't use tqdm.
I couldn't test it but it can be something like this
import os
import json
from multiprocessing import Pool
# --- functions ---
def asyncJSONs(filename):
try:
fullpath = os.path.join(folder, filename)
with open(fullpath) as f:
data = json.loads(f.read())
properties = process_dict(data, {})
properties['name'] = filename.split('.')[0]
return properties
except:
print("Error parsing at {}".format(filename))
# --- main ---
# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'
if __name__ == '__main__':
# code only for main process
all_jsons = os.listdir(folder)
with Pool(5) as p:
materials = p.map(asyncJSONs, all_jsons)
for item in materials:
print(item)
BTW:
Other modules: concurrent.futures, joblib, ray,
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need, and append it to some target file in ndjson/jsonlines format. That's just where, instead of objects part of a json array [{},{}...], you have separate objects on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
print(json.dumps(data), file=fp, flush=True)
Flush ensure that as long as your data is less than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and conflict writes. If they do get conflicted, you may need to write to a separate output file for each worker, and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can using a multiprocessing.queue. Have the function stuff the results in a queue, while your main code waits for stuff to magically appear in the queue.
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.

Python multiprocessing not increasing performance

I wrote a simple python multiprocessing, in which it reads a bunch of lines from csv, calls an api and then writes to new csv. However, what I see is that performance of this program is same as sequential execution. Changing the pool size does not have any effect. What is going wrong?
from multiprocessing import Pool
from random import randint
from time import sleep
import csv
import requests
import json
def orders_v4(order_number):
response = requests.request("GET", url, headers=headers, params=querystring, verify=False)
return response.json()
newcsvFile=open('gom_acr_status.csv', 'w')
writer = csv.writer(newcsvFile)
def process_line(row):
ol_key = row['\ufeffORDER_LINE_KEY']
order_number=row['ORDER_NUMBER']
orders_json = orders_v4(order_number)
oms_order_key = orders_json['oms_order_key']
order_lines = orders_json["order_lines"]
for order_line in order_lines:
if ol_key==order_line['order_line_key']:
print(order_number)
print(ol_key)
ftype = order_line['fulfillment_spec']['fulfillment_type']
status_desc = order_line['statuses'][0]['status_description']
print(ftype)
print(status_desc)
listrow = [ol_key, order_number, ftype, status_desc]
#(writer)
writer.writerow(listrow)
newcsvFile.flush()
def get_next_line():
with open("gom_acr.csv", 'r') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
yield row
f = get_next_line()
t = Pool(processes=50)
for i in f:
t.map(process_line, (i,))
t.join()
t.close()
EDIT: I just noticed you call map inside a loop. you need to call it only once. is is a blocking function, it is not async! check out the docs for examples of correct usage.
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
Original answer:
The fact that all processes write to the output file causes file-system contention.
If your process_line function would just return the rows (e.g. as a list of strings), then the main processes would write all of those after map returned them all, then you should experience a performance boost.
also, 2 notes:
try different numbers of processes, starting from # of cores and going up. maybe 50 is too much.
the work done in each process seems (to me, at first glance) pretty short, it is possible that the overhead of spawning new processes and orchestrating them is just too big to benefit the task at hand.

Multiprocessing in python, multiple process running same instructions

I'm using multiprocessing in Python for parallelizing.
I'm trying to parallelize the process on chunks of data read from an excel file using pandas.
I'm new to multiprocessing and parallel processing. During implementation on simple code,
import time;
import os;
from multiprocessing import Process
import pandas as pd
print os.getpid();
df = pd.read_csv('train.csv', sep=',',usecols=["POLYLINE"],iterator=True,chunksize=2);
print "hello";
def my_function(chunk):
print chunk;
count = 0;
processes = [];
for chunk in df:
if __name__ == '__main__':
p = Process(target=my_function,args=(chunk,));
processes.append(p);
if(count==4):
break;
count = count + 1;
The print "hello" is being executed multiple times, I'm guessing the individual process created should work on the target rather than main code.
Can anyone suggest me where I'm wrong.
The way that multiprocessing works is create a new process and then import the file with the target function. Since your outermost scope has print statements, it will get executed once for every process.
By the way you should use a Pool instead of Processes directly. Here's a cleaned up example:
import os
import time
from multiprocessing import Pool
import pandas as pd
NUM_PROCESSES = 4
def process_chunk(chunk):
# do something
return chunk
if __name__ == '__main__':
df = pd.read_csv('train.csv', sep=',', usecols=["POLYLINE"], iterator=True, chunksize=2)
pool = Pool(NUM_PROCESSES)
for result in pool.map(process_chunk, df):
print result
Using multiprocessing is probably not going to speed up reading data from disk, since disk access is much slower than e.g. RAM access or calculations. And the different pieces of the file will end up in different processes.
Using mmap could help speed up data access.
If you do a read-only mmap of the data file before starting e.g. a Pool.map, each worker could read its own slice of data from the shared memory mapped file and process it.

Python multiprocessing safely writing to a file

I am trying to solve a big numerical problem which involves lots of subproblems, and I'm using Python's multiprocessing module (specifically Pool.map) to split up different independent subproblems onto different cores. Each subproblem involves computing lots of sub-subproblems, and I'm trying to effectively memoize these results by storing them to a file if they have not been computed by any process yet, otherwise skip the computation and just read the results from the file.
I'm having concurrency issues with the files: different processes sometimes check to see if a sub-subproblem has been computed yet (by looking for the file where the results would be stored), see that it hasn't, run the computation, then try to write the results to the same file at the same time. How do I avoid writing collisions like this?
#GP89 mentioned a good solution. Use a queue to send the writing tasks to a dedicated process that has sole write access to the file. All the other workers have read only access. This will eliminate collisions. Here is an example that uses apply_async, but it will work with map too:
import multiprocessing as mp
import time
fn = 'c:/temp/temp.txt'
def worker(arg, q):
'''stupidly simulates long running process'''
start = time.clock()
s = 'this is a test'
txt = s
for i in range(200000):
txt += s
done = time.clock() - start
with open(fn, 'rb') as f:
size = len(f.read())
res = 'Process' + str(arg), str(size), done
q.put(res)
return res
def listener(q):
'''listens for messages on the q, writes to file. '''
with open(fn, 'w') as f:
while 1:
m = q.get()
if m == 'kill':
f.write('killed')
break
f.write(str(m) + '\n')
f.flush()
def main():
#must use Manager queue here, or will not work
manager = mp.Manager()
q = manager.Queue()
pool = mp.Pool(mp.cpu_count() + 2)
#put listener to work first
watcher = pool.apply_async(listener, (q,))
#fire off workers
jobs = []
for i in range(80):
job = pool.apply_async(worker, (i, q))
jobs.append(job)
# collect results from the workers through the pool result queue
for job in jobs:
job.get()
#now we are done, kill the listener
q.put('kill')
pool.close()
pool.join()
if __name__ == "__main__":
main()
It looks to me that you need to use Manager to temporarily save your results to a list and then write the results from the list to a file. Also, use starmap to pass the object you want to process and the managed list. The first step is to build the parameter to be passed to starmap, which includes the managed list.
from multiprocessing import Manager
from multiprocessing import Pool
import pandas as pd
def worker(row, param):
# do something here and then append it to row
x = param**2
row.append(x)
if __name__ == '__main__':
pool_parameter = [] # list of objects to process
with Manager() as mgr:
row = mgr.list([])
# build list of parameters to send to starmap
for param in pool_parameter:
params.append([row,param])
with Pool() as p:
p.starmap(worker, params)
From this point you need to decide how you are going to handle the list. If you have tons of RAM and a huge data set feel free to concatenate using pandas. Then you can save of the file very easily as a csv or a pickle.
df = pd.concat(row, ignore_index=True)
df.to_pickle('data.pickle')
df.to_csv('data.csv')

Processing single file from multiple processes

I have a single big text file in which I want to process each line ( do some operations ) and store them in a database. Since a single simple program is taking too long, I want it to be done via multiple processes or threads.
Each thread/process should read the DIFFERENT data(different lines) from that single file and do some operations on their piece of data(lines) and put them in the database so that in the end, I have whole of the data processed and my database is dumped with the data I need.
But I am not able to figure it out that how to approach this.
What you are looking for is a Producer/Consumer pattern
Basic threading example
Here is a basic example using the threading module (instead of multiprocessing)
import threading
import Queue
import sys
def do_work(in_queue, out_queue):
while True:
item = in_queue.get()
# process
result = item
out_queue.put(result)
in_queue.task_done()
if __name__ == "__main__":
work = Queue.Queue()
results = Queue.Queue()
total = 20
# start for workers
for i in xrange(4):
t = threading.Thread(target=do_work, args=(work, results))
t.daemon = True
t.start()
# produce data
for i in xrange(total):
work.put(i)
work.join()
# get the results
for i in xrange(total):
print results.get()
sys.exit()
You wouldn't share the file object with the threads. You would produce work for them by supplying the queue with lines of data. Then each thread would pick up a line, process it, and then return it in the queue.
There are some more advanced facilities built into the multiprocessing module to share data, like lists and special kind of Queue. There are trade-offs to using multiprocessing vs threads and it depends on whether your work is cpu bound or IO bound.
Basic multiprocessing.Pool example
Here is a really basic example of a multiprocessing Pool
from multiprocessing import Pool
def process_line(line):
return "FOO: %s" % line
if __name__ == "__main__":
pool = Pool(4)
with open('file.txt') as source_file:
# chunk the work into batches of 4 lines at a time
results = pool.map(process_line, source_file, 4)
print results
A Pool is a convenience object that manages its own processes. Since an open file can iterate over its lines, you can pass it to the pool.map(), which will loop over it and deliver lines to the worker function. Map blocks and returns the entire result when its done. Be aware that this is an overly simplified example, and that the pool.map() is going to read your entire file into memory all at once before dishing out work. If you expect to have large files, keep this in mind. There are more advanced ways to design a producer/consumer setup.
Manual "pool" with limit and line re-sorting
This is a manual example of the Pool.map, but instead of consuming an entire iterable in one go, you can set a queue size so that you are only feeding it piece by piece as fast as it can process. I also added the line numbers so that you can track them and refer to them if you want, later on.
from multiprocessing import Process, Manager
import time
import itertools
def do_work(in_queue, out_list):
while True:
item = in_queue.get()
line_no, line = item
# exit signal
if line == None:
return
# fake work
time.sleep(.5)
result = (line_no, line)
out_list.append(result)
if __name__ == "__main__":
num_workers = 4
manager = Manager()
results = manager.list()
work = manager.Queue(num_workers)
# start for workers
pool = []
for i in xrange(num_workers):
p = Process(target=do_work, args=(work, results))
p.start()
pool.append(p)
# produce data
with open("source.txt") as f:
iters = itertools.chain(f, (None,)*num_workers)
for num_and_line in enumerate(iters):
work.put(num_and_line)
for p in pool:
p.join()
# get the results
# example: [(1, "foo"), (10, "bar"), (0, "start")]
print sorted(results)
Here's a really stupid example that I cooked up:
import os.path
import multiprocessing
def newlinebefore(f,n):
f.seek(n)
c=f.read(1)
while c!='\n' and n > 0:
n-=1
f.seek(n)
c=f.read(1)
f.seek(n)
return n
filename='gpdata.dat' #your filename goes here.
fsize=os.path.getsize(filename) #size of file (in bytes)
#break the file into 20 chunks for processing.
nchunks=20
initial_chunks=range(1,fsize,fsize/nchunks)
#You could also do something like:
#initial_chunks=range(1,fsize,max_chunk_size_in_bytes) #this should work too.
with open(filename,'r') as f:
start_byte=sorted(set([newlinebefore(f,i) for i in initial_chunks]))
end_byte=[i-1 for i in start_byte] [1:] + [None]
def process_piece(filename,start,end):
with open(filename,'r') as f:
f.seek(start+1)
if(end is None):
text=f.read()
else:
nbytes=end-start+1
text=f.read(nbytes)
# process text here. createing some object to be returned
# You could wrap text into a StringIO object if you want to be able to
# read from it the way you would a file.
returnobj=text
return returnobj
def wrapper(args):
return process_piece(*args)
filename_repeated=[filename]*len(start_byte)
args=zip(filename_repeated,start_byte,end_byte)
pool=multiprocessing.Pool(4)
result=pool.map(wrapper,args)
#Now take your results and write them to the database.
print "".join(result) #I just print it to make sure I get my file back ...
The tricky part here is to make sure that we split the file on newline characters so that you don't miss any lines (or only read partial lines). Then, each process reads it's part of the file and returns an object which can be put into the database by the main thread. Of course, you may even need to do this part in chunks so that you don't have to keep all of the information in memory at once. (this is quite easily accomplished -- just split the "args" list into X chunks and call pool.map(wrapper,chunk) -- See here)
well break the single big file into multiple smaller files and have each of them processed in separate threads.

Categories