I am trying to solve a big numerical problem which involves lots of subproblems, and I'm using Python's multiprocessing module (specifically Pool.map) to split up different independent subproblems onto different cores. Each subproblem involves computing lots of sub-subproblems, and I'm trying to effectively memoize these results by storing them to a file if they have not been computed by any process yet, otherwise skip the computation and just read the results from the file.
I'm having concurrency issues with the files: different processes sometimes check to see if a sub-subproblem has been computed yet (by looking for the file where the results would be stored), see that it hasn't, run the computation, then try to write the results to the same file at the same time. How do I avoid writing collisions like this?
#GP89 mentioned a good solution. Use a queue to send the writing tasks to a dedicated process that has sole write access to the file. All the other workers have read-only access. This will eliminate collisions. Here is an example that uses apply_async, but it will work with map too:
import multiprocessing as mp
import time

fn = 'c:/temp/temp.txt'

def worker(arg, q):
    '''stupidly simulates long running process'''
    start = time.time()
    s = 'this is a test'
    txt = s
    for i in range(200000):
        txt += s
    done = time.time() - start
    with open(fn, 'rb') as f:
        size = len(f.read())
    res = 'Process' + str(arg), str(size), done
    q.put(res)
    return res

def listener(q):
    '''listens for messages on the q, writes to file. '''
    with open(fn, 'w') as f:
        while 1:
            m = q.get()
            if m == 'kill':
                f.write('killed')
                break
            f.write(str(m) + '\n')
            f.flush()

def main():
    # must use Manager queue here, or will not work
    manager = mp.Manager()
    q = manager.Queue()
    pool = mp.Pool(mp.cpu_count() + 2)

    # put listener to work first
    watcher = pool.apply_async(listener, (q,))

    # fire off workers
    jobs = []
    for i in range(80):
        job = pool.apply_async(worker, (i, q))
        jobs.append(job)

    # collect results from the workers through the pool result queue
    for job in jobs:
        job.get()

    # now we are done, kill the listener
    q.put('kill')
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
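As noted above, the same pattern works with Pool.map. Here is a minimal sketch (untested) that reuses the worker, listener, and fn definitions from the example and binds the queue with functools.partial so that map only has to supply the argument:
from functools import partial
import multiprocessing as mp

def main_with_map():
    manager = mp.Manager()
    q = manager.Queue()
    pool = mp.Pool(mp.cpu_count() + 2)

    # dedicated writer process, exactly as before
    watcher = pool.apply_async(listener, (q,))

    # map blocks until every worker has finished
    pool.map(partial(worker, q=q), range(80))

    # tell the listener to stop, then shut the pool down
    q.put('kill')
    pool.close()
    pool.join()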
It looks to me like you need to use a Manager to temporarily save your results to a list and then write the results from the list to a file. Also, use starmap to pass both the object you want to process and the managed list. The first step is to build the parameter list to be passed to starmap, which includes the managed list.
from multiprocessing import Manager
from multiprocessing import Pool
import pandas as pd

def worker(row, param):
    # do something here and then append it to row
    x = param**2
    row.append(x)

if __name__ == '__main__':
    pool_parameter = []  # list of objects to process
    with Manager() as mgr:
        row = mgr.list([])

        # build list of parameters to send to starmap
        params = []
        for param in pool_parameter:
            params.append([row, param])

        with Pool() as p:
            p.starmap(worker, params)
From this point you need to decide how you are going to handle the list. If you have tons of RAM and a huge data set, feel free to concatenate it using pandas. Then you can save the file very easily as a CSV or a pickle.
df = pd.concat(row, ignore_index=True)
df.to_pickle('data.pickle')
df.to_csv('data.csv')
I'm reading a chunk from a big file, loading it in memory as a list of lines, then processing a task on every line.
The sequential solution was taking too long so I started looking at how to parallelize it.
The first solution I came up with uses Process, managing each subprocess's slice of the list.
import multiprocessing as mp

BIG_FILE_PATH = 'big_file.txt'
CHUNKSIZE = 1000000
N_PROCESSES = mp.cpu_count()

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open(BIG_FILE_PATH, encoding="Latin-1") as file:
    for piece in read_in_chunks(file, CHUNKSIZE):
        jobs = []
        piece_list = piece.splitlines()
        piece_list_len = len(piece_list)
        item_delta = round(piece_list_len / N_PROCESSES)
        start = 0

        for process in range(N_PROCESSES):
            finish = start + item_delta
            p = mp.Process(target=work, args=(piece_list[start:finish],))
            start = finish
            jobs.append(p)
            p.start()

        for job in jobs:
            job.join()
It completes each chunk in roughly 2498ms.
Then I discovered the Pool tool to automatically manage the slices.
import multiprocessing as mp

BIG_FILE_PATH = 'big_file.txt'
CHUNKSIZE = 1000000
N_PROCESSES = mp.cpu_count()

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open(BIG_FILE_PATH, encoding="Latin-1") as file:
    with mp.Pool(N_PROCESSES) as pool:
        for piece in read_in_chunks(file, CHUNKSIZE):
            piece_list = piece.splitlines()
            pool.map(work, piece_list)
It completes each chunk in roughly 15540ms, 6 times slower than manual but still faster than sequential.
Am I using the Pool wrong?
Is there a better or faster way to do this?
Thank you for reading.
Update
The Pool has quite the overhead as Hannu suggested.
The work function called by the Process method is expecting a list of lines.
The work function called by the Pool method is expecting a single line because of how the Pool decides the slices.
I'm not quite sure how to make the Pool give a certain worker more than one line at a time. That should solve the problem, right?
Update 2
Final question, is there a 3rd better way to do it?
I am not entirely sure about this but it appears to me that your programs are materially different in what they submit to workers.
In your Process method you seem to be submitting a large chunk of rows:
p = mp.Process(target=work, args=(piece_list[start:finish],))
but then when you use Pool, you do this:
for piece in read_in_chunks(file, CHUNKSIZE):
    piece_list = piece.splitlines()
    pool.map(work, piece_list)
You read your file in chunks, but once you call splitlines, your piece_list iterable submits units of one line.
This means that in your Process approach you submit as many subtasks as you have CPUs, but in your Pool approach you submit as many tasks as your source data has lines. If you have a lot of lines, this creates massive orchestration overhead in your Pool, as each worker only processes one line at a time, then finishes and returns its result, and the Pool then submits another line to the newly freed worker.
If this is what is going on here, it definitely explains why Pool takes much longer to complete.
What happens if you use your reader as the iterable and skip the line splitting part:
pool.map(work, read_in_chunks(file, CHUNKSIZE))
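Fleshed out, that suggestion looks roughly like the sketch below (untested). It assumes work is rewritten to take a whole chunk of text and split it into lines itself, and it inherits the same caveat as the original code: a line can straddle a chunk boundary.
import multiprocessing as mp

BIG_FILE_PATH = 'big_file.txt'
CHUNKSIZE = 1000000
N_PROCESSES = mp.cpu_count()

def work(chunk):
    # placeholder: the real per-line processing would go here
    for line in chunk.splitlines():
        pass

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

if __name__ == '__main__':
    with open(BIG_FILE_PATH, encoding="Latin-1") as file:
        with mp.Pool(N_PROCESSES) as pool:
            # each task is now one whole chunk instead of one line
            pool.map(work, read_in_chunks(file, CHUNKSIZE))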
I do not know if this is going to work, but could you try this?
if __name__ == "__main__":
    with open(BIG_FILE_PATH, encoding="Latin-1") as file:
        with mp.Pool(N_PROCESSES) as pool:
            for piece in read_in_chunks(file, CHUNKSIZE):
                piece_list = piece.splitlines()
                pool.map(work, piece_list)
My reasoning:
1. pool.map() only needs to be called once, and your code is calling it in a loop.
2. My guess is that the loop is what makes it slower.
3. Because parallel processing should be faster, hehe.
Oh boy! This was quite a ride to figure out, but very fun nonetheless.
The Pool.map is getting, pickling and passing every item individually from the iterator to each one of the workers. Once a worker is done, rinse and repeat, get -> pickle -> pass. This creates a noticeable overhead cost.
This is actually intended, because Pool.map isn't smart enough to know the length of the iterator, nor is it able to build a list of lists itself and pass each inner list (chunk) to a worker.
But, it can be helped.
Simply transforming the list to a list of chunks (lists) with a list comprehension works like a charm and reduces the overhead to the same level as the Process method.
import multiprocessing as mp

BIG_FILE_PATH = 'big_file.txt'
CHUNKSIZE = 1000000
N_PROCESSES = mp.cpu_count()

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open(BIG_FILE_PATH, encoding="Latin-1") as file:
    with mp.Pool(N_PROCESSES) as pool:
        for piece in read_in_chunks(file, CHUNKSIZE):
            piece_list = piece.splitlines()
            piece_list_len = len(piece_list)
            item_delta = round(piece_list_len / N_PROCESSES)
            pool.map(work, [piece_list[i:i + item_delta] for i in range(0, piece_list_len, item_delta)])
This Pool with a list-of-lists iterable has the exact same running time as the Process method.
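For reference (this is not part of the original timing comparison), Pool.map also has a built-in chunksize argument that batches dispatch in a similar way while still calling work on one line at a time. A rough equivalent, reusing the names defined in the snippet above and assuming the per-line work function from the earlier Pool attempt:
with open(BIG_FILE_PATH, encoding="Latin-1") as file:
    with mp.Pool(N_PROCESSES) as pool:
        for piece in read_in_chunks(file, CHUNKSIZE):
            piece_list = piece.splitlines()
            # chunksize controls how many lines are shipped to a worker per task
            pool.map(work, piece_list,
                     chunksize=max(1, len(piece_list) // N_PROCESSES))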
I have a JSON file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
My problem is that it takes 8 minutes for a 12 GB dataset, but the requirement is to scale the code so that it can run on a 100 GB dataset. Any pointers on how to do this? Should I use multi-threading or multi-processing in Python to achieve this, or some other method?
This is the code:
import json
import time

""" This class contains the business logic for identifying the duplicates and creating an output file for further processing """
class BusinessService:

    """ The method identifies the duplicates """
    def service(ipPath, opPath):
        start_time = time.time()  # We start the timer to see how much time the method takes to work
        uniqueHandleSet = set()  # Creating a set to store unique values
        try:
            duplicateHandles = open(opPath, 'w+', encoding='utf-8')  # Opening and creating an output file to catch the duplicate handles
            with open(ipPath, buffering=200000000, encoding='utf-8') as infile:  # Reading the JSON file with a 200MB buffer as it is too big to read at once
                for line in infile:
                    tweetJsonObject = json.loads(line)
                    if tweetJsonObject["name"] not in uniqueHandleSet:
                        uniqueHandleSet.add(tweetJsonObject["name"])
                    else:
                        duplicateHandles.write(line)
            print("--- %s seconds --- memory 200mb while buffering" % (time.time() - start_time))  # Printing the total time required to execute
        except:
            print("Error")
        finally:
            duplicateHandles.close()
To scale it, you would need queues for feeding multiple processes and two shared lists to keep track of your results. The main idea is to feed the file line by line into a queue that is subsequently processed by some consumer processes. These processes, however, share two lists to store the intermediate results. The Manager is responsible for the synchronization between the processes.
The following code is just a rough guideline and not really tested:
from multiprocessing import Process, Manager, Queue

def findDuplicate(inputQueue, uniqueValues, duplicates):
    for line in iter(inputQueue.get, 'STOP'):  # get line from queue, stop if 'STOP' is received
        if line not in uniqueValues:  # check if duplicate
            uniqueValues.append(line)
        else:
            duplicates.append(line)  # store it

manager = Manager()  # get a new SyncManager
uniqueValues = manager.list()  # handle for shared list
duplicates = manager.list()  # a 2nd handle for a shared list
inputQueue = Queue()  # a queue to provide tasks to the processes

# setup workers, provide shared lists and tasks
numProc = 4
process = [Process(target=findDuplicate,
                   args=(inputQueue, uniqueValues, duplicates)) for x in range(numProc)]

# start processes, they will idle if nothing is in queue
for p in process:
    p.start()

with open(ipPath) as f:
    for line in f:
        inputQueue.put(line, block=True)  # put line in queue, only if a free slot is available

for p in process:
    inputQueue.put('STOP')  # signal workers to stop as no further input

# wait for processes to finish
for p in process:
    p.join()
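One piece the sketch leaves out is writing the collected duplicates back to disk. After the joins, the parent process could do something like the following (opPath as in the question's code; the lines still carry their trailing newlines, so writelines is enough):
# dump the shared duplicates list to the output file
with open(opPath, 'w', encoding='utf-8') as duplicateHandles:
    duplicateHandles.writelines(duplicates)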
I am trying to figure out how to write a program that performs computations in parallel such that the result of each computation can be written to a file in a specific order. My problem is size; I would like to do what I've outlined in the sample program below - save the large output as the value of a dictionary which stores the ordering system in its keys. But my program keeps breaking because it can't store/pass around so many bytes.
Is there a set way to approach such problems? I'm new to dealing with both multiprocessing and large data.
from multiprocessing import Process, Manager

def eachProcess(i, d):
    LARGE_BINARY_OBJECT = ...  # perform some computation resulting in millions of bytes
    d[i] = LARGE_BINARY_OBJECT

def main():
    manager = Manager()
    d = manager.dict()
    maxProcesses = 10
    for i in range(maxProcesses):
        process = Process(target=eachProcess, args=(i, d))
        process.start()

    counter = 0
    with open("test.txt", "wb") as file1:
        while counter < maxProcesses:
            if counter in d:
                file1.write(d[counter])
                counter += 1

if __name__ == '__main__':
    main()
Thank you.
When dealing with large data, there are usually two approaches:
Local file system if the problem is simple enough
Remote data storage if more complex support over data is needed
As your problem seems pretty simple, I'd suggest the following solution. Each process writes its partial solution to a local file. Once all processing is done, the main process combines all result files together.
from multiprocessing import Pool
from tempfile import NamedTemporaryFile

def worker_function(partial_result_path):
    data = produce_large_binary()
    with open(partial_result_path, 'wb') as partial_result_file:
        partial_result_file.write(data)

# storing partial results in temporary files: delete=False keeps them around,
# and passing just the path avoids trying to pickle file objects
partial_result_paths = [NamedTemporaryFile(delete=False).name
                        for i in range(max_processes)]

pool = Pool(max_processes)
pool.map(worker_function, partial_result_paths)

with open('test.txt', 'wb') as result_file:
    for partial_result_path in partial_result_paths:
        with open(partial_result_path, 'rb') as partial_result_file:
            result_file.write(partial_result_file.read())
I'm trying to run a function with multiprocessing. This is the code:
import multiprocessing as mu

output = []

def f(x):
    output.append(x*x)

jobs = []
np = mu.cpu_count()

for n in range(np*500):
    p = mu.Process(target=f, args=(n,))
    jobs.append(p)

running = []
for i in range(np):
    p = jobs.pop()
    running.append(p)
    p.start()

while jobs != []:
    for r in running:
        if r.exitcode == 0:
            try:
                running.remove(r)
                p = jobs.pop()
                p.start()
                running.append(p)
            except IndexError:
                break

print "Done:"
print output
The output is [], while it should be [1,4,9,...]. Does anyone see where I'm making a mistake?
You are using multiprocessing, not threading. So your output list is not shared between the processes.
There are several possible solutions:
Retain most of your program, but use a multiprocessing.Queue instead of a list. Let the workers put their results in the queue, and read it from the main program. It will copy data from process to process, so for big chunks of data this will have significant overhead. A sketch of this option follows after this list.
You could use shared memory in the form of multiprocessing.Array. This might be the best solution if the processed data is large.
Use a Pool. This takes care of all the process management for you. Just like with a queue, it copies data from process to process. It is probably the easiest to use. IMO this is the best option if the data sent to/from each worker is small.
Use threading so that the output list is shared between threads. Threading in CPython has the restriction that only one thread at a time can be executing Python bytecode, so you might not get as much performance benefit as you'd expect. And unlike the multiprocessing solutions it will not take advantage of multiple cores.
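A minimal sketch of the first option (the queue), untested and written for Python 3; it keeps the question's idea of running at most cpu_count() processes at a time and does the same squaring work:
import multiprocessing as mu

def f(x, q):
    q.put(x * x)

if __name__ == '__main__':
    q = mu.Queue()
    batch = mu.cpu_count()
    n_tasks = batch * 500

    procs = [mu.Process(target=f, args=(n, q)) for n in range(n_tasks)]

    output = []
    # run the processes in batches of cpu_count(), draining the queue as we go
    for i in range(0, n_tasks, batch):
        running = procs[i:i + batch]
        for p in running:
            p.start()
        for p in running:
            output.append(q.get())  # one result per worker in the batch
        for p in running:
            p.join()

    print(sorted(output))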
Edit:
Thanks to #Roland Smith for pointing this out.
The main problem is the function f(x). When the child processes call it, they are unable to find the output variable (since it is not shared).
Edit:
Just as #cdarke said, in multiprocessing you have to carefully control the shared objects that the child processes can access (perhaps with a lock), and that gets pretty complicated and hard to debug.
Personally, I suggest using the Pool.map method for this.
For instance, assuming that you run this code directly rather than importing it as a module, your code would be:
import multiprocessing as mu

def f(x):
    return x*x

if __name__ == '__main__':
    np = mu.cpu_count()
    args = [n for n in range(np*500)]

    pool = mu.Pool(processes=np)
    result = pool.map(f, args)
    pool.close()
    pool.join()

    print result
But there is something you must know: if you run this file directly rather than importing it as a module, the if __name__ == '__main__': guard is important, since Python will load this file as a module in the other processes. If you don't place the function f outside the if __name__ == '__main__': block, the child processes will not be able to find your function f.
Edit: thanks to #Roland Smith for pointing out that we could use a tuple.
If you have more than one argument for the function f, then you might need a tuple to do so, for instance:
def f((x, y)):
    return x*y

args = [(n, 1) for n in range(np*500)]
result = pool.map(f, args)
or check here for more detailed discussion
I have a single big text file in which I want to process each line ( do some operations ) and store them in a database. Since a single simple program is taking too long, I want it to be done via multiple processes or threads.
Each thread/process should read DIFFERENT data (different lines) from that single file, do some operations on its piece of data (lines), and put the results in the database, so that in the end all of the data has been processed and my database holds the data I need.
But I am not able to figure out how to approach this.
What you are looking for is a Producer/Consumer pattern
Basic threading example
Here is a basic example using the threading module (instead of multiprocessing)
import threading
import Queue
import sys

def do_work(in_queue, out_queue):
    while True:
        item = in_queue.get()

        # process
        result = item

        out_queue.put(result)
        in_queue.task_done()

if __name__ == "__main__":
    work = Queue.Queue()
    results = Queue.Queue()
    total = 20

    # start for workers
    for i in xrange(4):
        t = threading.Thread(target=do_work, args=(work, results))
        t.daemon = True
        t.start()

    # produce data
    for i in xrange(total):
        work.put(i)

    work.join()

    # get the results
    for i in xrange(total):
        print results.get()

    sys.exit()
You wouldn't share the file object with the threads. You would produce work for them by supplying the queue with lines of data. Then each thread would pick up a line, process it, and then return it in the queue.
There are some more advanced facilities built into the multiprocessing module to share data, like lists and special kind of Queue. There are trade-offs to using multiprocessing vs threads and it depends on whether your work is cpu bound or IO bound.
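Concretely, the "# produce data" part of the threading example above could be replaced with lines read from a file, something like this sketch (same work and results queues and do_work threads as above; 'source.txt' is just a placeholder name):
# feed lines instead of integers
with open('source.txt') as f:
    total = 0
    for line in f:
        work.put(line)
        total += 1

work.join()

# get the results
for i in range(total):
    print(results.get())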
Basic multiprocessing.Pool example
Here is a really basic example of a multiprocessing Pool
from multiprocessing import Pool

def process_line(line):
    return "FOO: %s" % line

if __name__ == "__main__":
    pool = Pool(4)
    with open('file.txt') as source_file:
        # chunk the work into batches of 4 lines at a time
        results = pool.map(process_line, source_file, 4)

    print results
A Pool is a convenience object that manages its own processes. Since an open file can iterate over its lines, you can pass it to pool.map(), which will loop over it and deliver lines to the worker function. Map blocks and returns the entire result when it's done. Be aware that this is an overly simplified example, and that pool.map() is going to read your entire file into memory all at once before dishing out work. If you expect to have large files, keep this in mind. There are more advanced ways to design a producer/consumer setup.
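One lighter-weight variation (not from the original answer) is Pool.imap, which pulls items from the iterable lazily, chunksize lines per task, instead of materializing the whole file up front. A sketch, assuming the same process_line worker:
from multiprocessing import Pool

def process_line(line):
    return "FOO: %s" % line

if __name__ == "__main__":
    pool = Pool(4)
    with open('file.txt') as source_file:
        # lines are pulled lazily and results come back in input order
        for result in pool.imap(process_line, source_file, chunksize=100):
            print(result)
    pool.close()
    pool.join()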
Manual "pool" with limit and line re-sorting
This is a manual example of the Pool.map, but instead of consuming an entire iterable in one go, you can set a queue size so that you are only feeding it piece by piece as fast as it can process. I also added the line numbers so that you can track them and refer to them if you want, later on.
from multiprocessing import Process, Manager
import time
import itertools

def do_work(in_queue, out_list):
    while True:
        item = in_queue.get()
        line_no, line = item

        # exit signal
        if line == None:
            return

        # fake work
        time.sleep(.5)
        result = (line_no, line)

        out_list.append(result)

if __name__ == "__main__":
    num_workers = 4

    manager = Manager()
    results = manager.list()
    work = manager.Queue(num_workers)

    # start for workers
    pool = []
    for i in xrange(num_workers):
        p = Process(target=do_work, args=(work, results))
        p.start()
        pool.append(p)

    # produce data
    with open("source.txt") as f:
        iters = itertools.chain(f, (None,)*num_workers)
        for num_and_line in enumerate(iters):
            work.put(num_and_line)

    for p in pool:
        p.join()

    # get the results
    # example: [(1, "foo"), (10, "bar"), (0, "start")]
    print sorted(results)
Here's a really stupid example that I cooked up:
import os.path
import multiprocessing

def newlinebefore(f, n):
    f.seek(n)
    c = f.read(1)
    while c != '\n' and n > 0:
        n -= 1
        f.seek(n)
        c = f.read(1)
    f.seek(n)
    return n

filename = 'gpdata.dat'  # your filename goes here.
fsize = os.path.getsize(filename)  # size of file (in bytes)

# break the file into 20 chunks for processing.
nchunks = 20
initial_chunks = range(1, fsize, fsize/nchunks)

# You could also do something like:
# initial_chunks = range(1, fsize, max_chunk_size_in_bytes)  # this should work too.

with open(filename, 'r') as f:
    start_byte = sorted(set([newlinebefore(f, i) for i in initial_chunks]))

end_byte = [i - 1 for i in start_byte][1:] + [None]

def process_piece(filename, start, end):
    with open(filename, 'r') as f:
        f.seek(start + 1)
        if end is None:
            text = f.read()
        else:
            nbytes = end - start + 1
            text = f.read(nbytes)

    # process text here, creating some object to be returned.
    # You could wrap text into a StringIO object if you want to be able to
    # read from it the way you would a file.
    returnobj = text
    return returnobj

def wrapper(args):
    return process_piece(*args)

filename_repeated = [filename]*len(start_byte)
args = zip(filename_repeated, start_byte, end_byte)

pool = multiprocessing.Pool(4)
result = pool.map(wrapper, args)

# Now take your results and write them to the database.
print "".join(result)  # I just print it to make sure I get my file back ...
The tricky part here is to make sure that we split the file on newline characters so that you don't miss any lines (or only read partial lines). Then, each process reads its part of the file and returns an object which can be put into the database by the main thread. Of course, you may even need to do this part in chunks so that you don't have to keep all of the information in memory at once. (This is quite easily accomplished -- just split the "args" list into X chunks and call pool.map(wrapper, chunk) -- See here.)
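For example, that chunked variant might look like the sketch below, reusing args, wrapper, and pool from the code above (X is just an illustrative batch size):
# process `args` in chunks of X so only one chunk's worth of results
# is held in memory at a time
X = 4
for i in range(0, len(args), X):
    chunk = args[i:i + X]
    partial_result = pool.map(wrapper, chunk)
    # write partial_result to the database here before moving on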
Well, break the single big file into multiple smaller files and have each of them processed in separate threads.
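A rough sketch of this suggestion (the file names and the process_file helper are hypothetical stand-ins; the real per-line work and database insert would go inside process_file):
import threading

def split_file(path, n_parts):
    # round-robin the lines of the big file across n_parts smaller files
    part_paths = ['part_%d.txt' % i for i in range(n_parts)]
    parts = [open(p, 'w') for p in part_paths]
    with open(path) as src:
        for i, line in enumerate(src):
            parts[i % n_parts].write(line)
    for p in parts:
        p.close()
    return part_paths

def process_file(part_path):
    with open(part_path) as f:
        for line in f:
            pass  # do the per-line work and database insert here

if __name__ == '__main__':
    threads = []
    for part_path in split_file('big_file.txt', 4):
        t = threading.Thread(target=process_file, args=(part_path,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()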