I have built a very single program to understand how parallelization may help (save time).
This program simply write a list of integer in a file, and I split the treatment by range of integer.
Reading relevant question, I have put Lock, but no improvement.
The program run, but instead of having in my file, for instance, 1000000 lines, I will have 99996xx...and the number of worker do not change anything (even with one worker, the problem curiously remains...however no competition problem between worker).
May somebody explain me what is wrong ? (I have tried with opening and writing a file within the worker and then concatenate all : it runs. but I'd like to understand why I do not manage to write in a shared file with multiprocessing).
Please find below the code:
import csv
import time
import multiprocessing
SizeOfSet=8000000
PathCorpus='/home'
Myfile='output.csv'
MyFile=PathCorpus+'/'+Myfile
step=1000000
nb_worker=SizeOfSet//step
def worker_verrou(verrou,start,end):
verrou.acquire()
for k in range(start,end):
try:
writerfile.writerow([k])
except:
print('pb with writerow %i',k)
verrou.release()
return
pointeurwritefile = open(MyFile, "wt")
writerfile = csv.writer(pointeurwritefile)
verrou=multiprocessing.Lock()
start_time = time.time()
jobs=[]
for k in range(nb_worker):
print('Starting process number %i', k)
p = multiprocessing.Process(target=worker_verrou, args=(verrou, k*step,(k+1)*step))
jobs.append(p)
p.start()
for j in jobs:
j.join()
print '%s.exitcode = %s' % (j.name, j.exitcode)
interval = time.time() - start_time
print('Total time in seconds:', interval)
print('ended')
Related
I have some text files that I need to read with Python. The text files keep an array of floats only (ie no strings) and the size of the array is 2000-by-2000. I tried to use the multiprocessing package but for some reason it now runs slower. The times I have on my pc for the code attached below are
Multi thread: 73.89 secs
Single thread: 60.47 secs
What am I doing wrong here, is there a way to speed up this task? My pc is powered by an Intel Core i7 processor and in real life I have several hundreds of these text files, 600 or even more.
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
import os
import time
from datetime import datetime
def read_from_disk(full_path):
print('%s reading %s' % (datetime.now().strftime('%H:%M:%S'), full_path))
out = np.genfromtxt(full_path, delimiter=',')
return out
def make_single_path(n):
return r"./dump/%d.csv" % n
def save_flatfiles(n):
for i in range(n):
temp = np.random.random((2000, 2000))
_path = os.path.join('.', 'dump', str(i)+'.csv')
np.savetxt(_path, temp, delimiter=',')
if __name__ == "__main__":
# make some text files
n = 10
save_flatfiles(n)
# list with the paths to the text files
file_list = [make_single_path(d) for d in range(n)]
pool = ThreadPool(8)
start = time.time()
results = pool.map(read_from_disk, file_list)
pool.close()
pool.join()
print('finished multi thread in %s' % (time.time()-start))
start = time.time()
for d in file_list:
out = read_from_disk(d)
print('finished single thread in %s' % (time.time() - start))
print('Done')
You are using multiprocessing.dummy which replicates the API of multiprocessing but actually it is a wrapper around the threading module.
So, basically you are using Threads instead of Process. And threads in python are not useful( Due to GIL) when you want to perform computational tasks.
So Replace:
from multiprocessing.dummy import Pool as ThreadPool
With:
from multiprocessing import Pool
I've tried running your code on my machine having a i5 processor, it finished execution in 45 seconds. so i would say that's a big improvement.
Hope this clears your understanding.
I'm working on a python 2.7 program that performs these actions in parallel using multiprocessing:
reads a line from file 1 and file 2 at the same time
applies function(line_1, line_2)
writes the function output to a file
I am new to multiprocessing and I'm not extremely expert with python in general. Therefore, I read a lot of already asked questions and tutorials: I feel close to the point but I am now probably missing something that I can't really spot.
The code is structured like this:
from itertools import izip
from multiprocessing import Queue, Process, Lock
nthreads = int(mp.cpu_count())
outq = Queue(nthreads)
l = Lock()
def func(record_1, record_2):
result = # do stuff
outq.put(result)
OUT = open("outputfile.txt", "w")
IN1 = open("infile_1.txt", "r")
IN2 = open("infile_2.txt", "r")
processes = []
for record_1, record_2 in izip(IN1, IN2):
proc = Process(target=func, args=(record_1, record_2))
processes.append(proc)
proc.start()
for proc in processes:
proc.join()
while (not outq.empty()):
l.acquire()
item = outq.get()
OUT.write(item)
l.release()
OUT.close()
IN1.close()
IN2.close()
To my understanding (so far) of multiprocessing as package, what I'm doing is:
creating a queue for the results of the function that has a size limit compatible with the number of cores of the machine.
filling this queue with the results of func().
reading the queue items until the queue is empty, writing them to the output file.
Now, my problem is that when I run this script it immediately becomes a zombie process. I know that the function works because without the multiprocessing implementation I had the results I wanted.
I'd like to read from the two files and write to output at the same time, to avoid generating a huge list from my input files and then reading it (input files are huge). Do you see anything gross, completely wrong or improvable?
The biggest issue I see is that you should pass the queue object through the process instead of trying to use it as a global in your function.
def func(record_1, record_2, queue):
result = # do stuff
queue.put(result)
for record_1, record_2 in izip(IN1, IN2):
proc = Process(target=func, args=(record_1, record_2, outq))
Also, as currently written, you would still be pulling all that information into memory (aka the queue) and waiting for the read to finish before writing to the output file. You need to move the p.join loop until after reading through the queue, and instead of putting all the information in the queue at the end of the func it should be filling the queue with chucks in a loop over time, or else it's the same as just reading it all into memory.
You also don't need a lock unless you are using it in the worker function func, and if you do, you will again want to pass it through.
If you want to not to read / store a lot in memory, I would write out the same time I am iterating through the input files. Here is a basic example of combining each line of the files together.
with open("infile_1.txt") as infile1, open("infile_2.txt") as infile2, open("out", "w") as outfile:
for line1, line2 in zip(infile1, infile2):
outfile.write(line1 + line2)
I don't want to write to much about all of these, just trying to give you ideas. Let me know if you want more detail about something. Hope it helps!
I'm new to python and I'm having trouble understanding how threading works. By skimming through the documentation, my understanding is that calling join() on a thread is the recommended way of blocking until it completes.
To give a bit of background, I have 48 large csv files (multiple GB) which I am trying to parse in order to find inconsistencies. The threads share no state. This can be done single threadedly in a reasonable ammount of time for a one-off, but I am trying to do it concurrently as an exercise.
Here's a skeleton of the file processing:
def process_file(data_file):
with open(data_file) as f:
print "Start processing {0}".format(data_file)
line = f.readline()
while line:
# logic omitted for brevity; can post if required
# pretty certain it works as expected, single 'thread' works fine
line = f.readline()
print "Finished processing file {0} with {1} errors".format(data_file, error_count)
def process_file_callable(data_file):
try:
process_file(data_file)
except:
print >> sys.stderr, "Error processing file {0}".format(data_file)
And the concurrent bit:
def partition_list(l, n):
""" Yield successive n-sized partitions from a list.
"""
for i in xrange(0, len(l), n):
yield l[i:i+n]
partitions = list(partition_list(data_files, 4))
for partition in partitions:
threads = []
for data_file in partition:
print "Processing file {0}".format(data_file)
t = Thread(name=data_file, target=process_file_callable, args = (data_file,))
threads.append(t)
t.start()
for t in threads:
print "Joining {0}".format(t.getName())
t.join(5)
print "Joined the first chunk of {0}".format(map(lambda t: t.getName(), threads))
I run this as:
python -u datautils/cleaner.py > cleaner.out 2> cleaner.err
My understanding is that join() should block the calling thread waiting for the thread it's called on to finish, however the behaviour I'm observing is inconsistent with my expectation.
I never see errors in the error file, but I also never see the expected log messages on stdout.
The parent process does not terminate unless I explicitly kill it from the shell. If I check how many prints I have for Finished ... it's never the expected 48, but somewhere between 12 and 15. However, having run this single-threadedly, I can confirm that the multithreaded run is actually processing everything and doing all the expected validation, only it does not seem to terminate cleanly.
I know I must be doing something wrong, but I would really appreciate if you can point me in the right direction.
I can't understand where mistake in your code. But I can recommend you to refactor it a little bit.
First at all, threading in python is not concurrent at all. It's just illusion, because there is a Global Interpreter Lock, so only one thread can be executed in same time. That's why I recommend you to use multiprocessing module:
from multiprocessing import Pool, cpu_count
pool = Pool(cpu_count)
for partition in partition_list(data_files, 4):
res = pool.map(process_file_callable, partition)
print res
At second, you are using not pythonic way to read file:
with open(...) as f:
line = f.readline()
while line:
... # do(line)
line = f.readline()
Here is pythonic way:
with open(...) as f:
for line in f:
... # do(line)
This is memory efficient, fast, and leads to simple code. (c) PyDoc
By the way, I have only one hypothesis what can happen with your program in multithreading way - app became more slower, because unordered access to hard disk drive is significantly slower than ordered. You can try to check this hypothesis using iostat or htop, if you are using Linux.
If your app does not finish work, and it doesn't do anything in process monitor (cpu or disk is not active), it means you have some kind of deadlock or blocked access to same resource.
Thanks everybody for your input and sorry for not replying sooner - I'm working on this on and off as a hobby project.
I've managed to write a simple example that proves it was my bad:
from itertools import groupby
from threading import Thread
from random import randint
from time import sleep
for key, partition in groupby(range(1, 50), lambda k: k//10):
threads = []
for idx in list(partition):
thread_name = 'thread-%d' % idx
t = Thread(name=thread_name, target=sleep, args=(randint(1, 5),))
threads.append(t)
print 'Starting %s' % t.getName()
t.start()
for t in threads:
print 'Joining %s' % t.getName()
t.join()
print 'Joined the first group of %s' % map(lambda t: t.getName(), threads)
The reason it was failing initially was the while loop the 'logic omitted for brevity' was working fine, however some of the input files that were being fed in were corrupted (had jumbled lines) and the logic went into an infinite loop on them. This is the reason some threads were never joined. The timeout for the join made sure that they were all started, but some never finished hence the inconsistency between 'starting' and 'joining'. The other fun fact was that the corruption was on the last line, so all the expected data was being processed.
Thanks again for your advice - the comment about processing files in a while instead of the pythonic way pointed me in the right direction, and yes, threading behaves as expected.
I am trying to solve a big numerical problem which involves lots of subproblems, and I'm using Python's multiprocessing module (specifically Pool.map) to split up different independent subproblems onto different cores. Each subproblem involves computing lots of sub-subproblems, and I'm trying to effectively memoize these results by storing them to a file if they have not been computed by any process yet, otherwise skip the computation and just read the results from the file.
I'm having concurrency issues with the files: different processes sometimes check to see if a sub-subproblem has been computed yet (by looking for the file where the results would be stored), see that it hasn't, run the computation, then try to write the results to the same file at the same time. How do I avoid writing collisions like this?
#GP89 mentioned a good solution. Use a queue to send the writing tasks to a dedicated process that has sole write access to the file. All the other workers have read only access. This will eliminate collisions. Here is an example that uses apply_async, but it will work with map too:
import multiprocessing as mp
import time
fn = 'c:/temp/temp.txt'
def worker(arg, q):
'''stupidly simulates long running process'''
start = time.clock()
s = 'this is a test'
txt = s
for i in range(200000):
txt += s
done = time.clock() - start
with open(fn, 'rb') as f:
size = len(f.read())
res = 'Process' + str(arg), str(size), done
q.put(res)
return res
def listener(q):
'''listens for messages on the q, writes to file. '''
with open(fn, 'w') as f:
while 1:
m = q.get()
if m == 'kill':
f.write('killed')
break
f.write(str(m) + '\n')
f.flush()
def main():
#must use Manager queue here, or will not work
manager = mp.Manager()
q = manager.Queue()
pool = mp.Pool(mp.cpu_count() + 2)
#put listener to work first
watcher = pool.apply_async(listener, (q,))
#fire off workers
jobs = []
for i in range(80):
job = pool.apply_async(worker, (i, q))
jobs.append(job)
# collect results from the workers through the pool result queue
for job in jobs:
job.get()
#now we are done, kill the listener
q.put('kill')
pool.close()
pool.join()
if __name__ == "__main__":
main()
It looks to me that you need to use Manager to temporarily save your results to a list and then write the results from the list to a file. Also, use starmap to pass the object you want to process and the managed list. The first step is to build the parameter to be passed to starmap, which includes the managed list.
from multiprocessing import Manager
from multiprocessing import Pool
import pandas as pd
def worker(row, param):
# do something here and then append it to row
x = param**2
row.append(x)
if __name__ == '__main__':
pool_parameter = [] # list of objects to process
with Manager() as mgr:
row = mgr.list([])
# build list of parameters to send to starmap
for param in pool_parameter:
params.append([row,param])
with Pool() as p:
p.starmap(worker, params)
From this point you need to decide how you are going to handle the list. If you have tons of RAM and a huge data set feel free to concatenate using pandas. Then you can save of the file very easily as a csv or a pickle.
df = pd.concat(row, ignore_index=True)
df.to_pickle('data.pickle')
df.to_csv('data.csv')
I have a single big text file in which I want to process each line ( do some operations ) and store them in a database. Since a single simple program is taking too long, I want it to be done via multiple processes or threads.
Each thread/process should read the DIFFERENT data(different lines) from that single file and do some operations on their piece of data(lines) and put them in the database so that in the end, I have whole of the data processed and my database is dumped with the data I need.
But I am not able to figure it out that how to approach this.
What you are looking for is a Producer/Consumer pattern
Basic threading example
Here is a basic example using the threading module (instead of multiprocessing)
import threading
import Queue
import sys
def do_work(in_queue, out_queue):
while True:
item = in_queue.get()
# process
result = item
out_queue.put(result)
in_queue.task_done()
if __name__ == "__main__":
work = Queue.Queue()
results = Queue.Queue()
total = 20
# start for workers
for i in xrange(4):
t = threading.Thread(target=do_work, args=(work, results))
t.daemon = True
t.start()
# produce data
for i in xrange(total):
work.put(i)
work.join()
# get the results
for i in xrange(total):
print results.get()
sys.exit()
You wouldn't share the file object with the threads. You would produce work for them by supplying the queue with lines of data. Then each thread would pick up a line, process it, and then return it in the queue.
There are some more advanced facilities built into the multiprocessing module to share data, like lists and special kind of Queue. There are trade-offs to using multiprocessing vs threads and it depends on whether your work is cpu bound or IO bound.
Basic multiprocessing.Pool example
Here is a really basic example of a multiprocessing Pool
from multiprocessing import Pool
def process_line(line):
return "FOO: %s" % line
if __name__ == "__main__":
pool = Pool(4)
with open('file.txt') as source_file:
# chunk the work into batches of 4 lines at a time
results = pool.map(process_line, source_file, 4)
print results
A Pool is a convenience object that manages its own processes. Since an open file can iterate over its lines, you can pass it to the pool.map(), which will loop over it and deliver lines to the worker function. Map blocks and returns the entire result when its done. Be aware that this is an overly simplified example, and that the pool.map() is going to read your entire file into memory all at once before dishing out work. If you expect to have large files, keep this in mind. There are more advanced ways to design a producer/consumer setup.
Manual "pool" with limit and line re-sorting
This is a manual example of the Pool.map, but instead of consuming an entire iterable in one go, you can set a queue size so that you are only feeding it piece by piece as fast as it can process. I also added the line numbers so that you can track them and refer to them if you want, later on.
from multiprocessing import Process, Manager
import time
import itertools
def do_work(in_queue, out_list):
while True:
item = in_queue.get()
line_no, line = item
# exit signal
if line == None:
return
# fake work
time.sleep(.5)
result = (line_no, line)
out_list.append(result)
if __name__ == "__main__":
num_workers = 4
manager = Manager()
results = manager.list()
work = manager.Queue(num_workers)
# start for workers
pool = []
for i in xrange(num_workers):
p = Process(target=do_work, args=(work, results))
p.start()
pool.append(p)
# produce data
with open("source.txt") as f:
iters = itertools.chain(f, (None,)*num_workers)
for num_and_line in enumerate(iters):
work.put(num_and_line)
for p in pool:
p.join()
# get the results
# example: [(1, "foo"), (10, "bar"), (0, "start")]
print sorted(results)
Here's a really stupid example that I cooked up:
import os.path
import multiprocessing
def newlinebefore(f,n):
f.seek(n)
c=f.read(1)
while c!='\n' and n > 0:
n-=1
f.seek(n)
c=f.read(1)
f.seek(n)
return n
filename='gpdata.dat' #your filename goes here.
fsize=os.path.getsize(filename) #size of file (in bytes)
#break the file into 20 chunks for processing.
nchunks=20
initial_chunks=range(1,fsize,fsize/nchunks)
#You could also do something like:
#initial_chunks=range(1,fsize,max_chunk_size_in_bytes) #this should work too.
with open(filename,'r') as f:
start_byte=sorted(set([newlinebefore(f,i) for i in initial_chunks]))
end_byte=[i-1 for i in start_byte] [1:] + [None]
def process_piece(filename,start,end):
with open(filename,'r') as f:
f.seek(start+1)
if(end is None):
text=f.read()
else:
nbytes=end-start+1
text=f.read(nbytes)
# process text here. createing some object to be returned
# You could wrap text into a StringIO object if you want to be able to
# read from it the way you would a file.
returnobj=text
return returnobj
def wrapper(args):
return process_piece(*args)
filename_repeated=[filename]*len(start_byte)
args=zip(filename_repeated,start_byte,end_byte)
pool=multiprocessing.Pool(4)
result=pool.map(wrapper,args)
#Now take your results and write them to the database.
print "".join(result) #I just print it to make sure I get my file back ...
The tricky part here is to make sure that we split the file on newline characters so that you don't miss any lines (or only read partial lines). Then, each process reads it's part of the file and returns an object which can be put into the database by the main thread. Of course, you may even need to do this part in chunks so that you don't have to keep all of the information in memory at once. (this is quite easily accomplished -- just split the "args" list into X chunks and call pool.map(wrapper,chunk) -- See here)
well break the single big file into multiple smaller files and have each of them processed in separate threads.