I'm using the following code for parallel CSV processing:
#!/usr/bin/env python
import csv
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from multiprocessing import current_process
from pprint import pprint as pp

def init_worker(x):
    sleep(.5)
    print "(%s,%s)" % (x[0],x[1])
    x.append(int(x[0])**2)
    return x

def parallel_csv_processing(inputFile, outputFile, header=["Default", "header", "please", "change"], separator=",", skipRows = 0, cpuCount = 1):
    # OPEN FH FOR READING INPUT FILE
    inputFH = open(inputFile, "rt")
    csvReader = csv.reader(inputFH, delimiter=separator)

    # SKIP HEADERS
    for skip in xrange(skipRows):
        csvReader.next()

    # PARALLELIZE COMPUTING INTENSIVE OPERATIONS - CALL FUNCTION HERE
    try:
        p = Pool(processes = cpuCount)
        results = p.map(init_worker, csvReader, chunksize = 10)
        p.close()
        p.join()
    except KeyboardInterrupt:
        p.close()
        p.join()
        p.terminate()

    # CLOSE FH FOR READING INPUT
    inputFH.close()

    # OPEN FH FOR WRITING OUTPUT FILE
    outputFH = open(outputFile, "wt")
    csvWriter = csv.writer(outputFH, lineterminator='\n')

    # WRITE HEADER TO OUTPUT FILE
    csvWriter.writerow(header)

    # WRITE RESULTS TO OUTPUT FILE
    [csvWriter.writerow(row) for row in results]

    # CLOSE FH FOR WRITING OUTPUT
    outputFH.close()

    print pp(results)
    # print len(results)

def main():
    inputFile = "input.csv"
    outputFile = "output.csv"
    parallel_csv_processing(inputFile, outputFile, cpuCount = cpu_count())

if __name__ == '__main__':
    main()
I would like to somehow measure the progress of the script (just plain text, no fancy ASCII art). The one option that comes to my mind is to compare the lines that have been successfully processed by init_worker against all the lines in input.csv, and print the current state, e.g. every second. Can you please point me to the right solution? I've found several articles describing a similar problem, but I was not able to adapt any of them to my needs because none of them used the Pool class and the map method.
I would also like to ask about the p.close(), p.join() and p.terminate() methods: I've seen them mainly with the Process class, not with Pool. Are they necessary with the Pool class, and have I used them correctly? The use of p.terminate() was intended to kill the process with Ctrl+C, but that is a different story which does not have a happy ending yet. Thank you.
PS: My input.csv looks like this, if it matters:
0,0
1,3
2,6
3,9
...
...
48,144
49,147
PPS: as I said, I'm a newbie with multiprocessing and the code I've put together just works. The one drawback I can see is that the whole CSV is stored in memory, so if you guys have a better idea, do not hesitate to share it.
Edit
In reply to J.F.Sebastian:
Here is my actual code based on your suggestions:
#!/usr/bin/env python
import csv
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from multiprocessing import current_process
from pprint import pprint as pp
from tqdm import tqdm

def do_job(x):
    sleep(.5)
    # print "(%s,%s)" % (x[0],x[1])
    x.append(int(x[0])**2)
    return x

def parallel_csv_processing(inputFile, outputFile, header=["Default", "header", "please", "change"], separator=",", skipRows = 0, cpuCount = 1):
    # OPEN FH FOR READING INPUT FILE
    inputFH = open(inputFile, "rb")
    csvReader = csv.reader(inputFH, delimiter=separator)

    # SKIP HEADERS
    for skip in xrange(skipRows):
        csvReader.next()

    # OPEN FH FOR WRITING OUTPUT FILE
    outputFH = open(outputFile, "wt")
    csvWriter = csv.writer(outputFH, lineterminator='\n')

    # WRITE HEADER TO OUTPUT FILE
    csvWriter.writerow(header)

    # PARALLELIZE COMPUTING INTENSIVE OPERATIONS - CALL FUNCTION HERE
    try:
        p = Pool(processes = cpuCount)
        # results = p.map(do_job, csvReader, chunksize = 10)
        for result in tqdm(p.imap_unordered(do_job, csvReader, chunksize=10)):
            csvWriter.writerow(result)
        p.close()
        p.join()
    except KeyboardInterrupt:
        p.close()
        p.join()

    # CLOSE FH FOR READING INPUT
    inputFH.close()

    # CLOSE FH FOR WRITING OUTPUT
    outputFH.close()

    print pp(result)
    # print len(result)

def main():
    inputFile = "input.csv"
    outputFile = "output.csv"
    parallel_csv_processing(inputFile, outputFile, cpuCount = cpu_count())

if __name__ == '__main__':
    main()
Here is the output of tqdm:
1 [elapsed: 00:05, 0.20 iters/sec]
What does this output mean? On the page you referred to, tqdm is used in a loop in the following way:
>>> import time
>>> from tqdm import tqdm
>>> for i in tqdm(range(100)):
... time.sleep(1)
...
|###-------| 35/100 35% [elapsed: 00:35 left: 01:05, 1.00 iters/sec]
This output makes sense, but what does my output mean? Also, it does not seem that the Ctrl+C problem is fixed: after hitting Ctrl+C the script throws a traceback, if I hit Ctrl+C again I get a new traceback, and so on. The only way to kill it is to send it to the background (Ctrl+Z) and then kill it (kill %1).
To show the progress, replace pool.map with pool.imap_unordered:
from tqdm import tqdm  # $ pip install tqdm

for result in tqdm(pool.imap_unordered(init_worker, csvReader, chunksize=10)):
    csvWriter.writerow(result)
The tqdm part is optional; see Text Progress Bar in the Console.
Incidentally, it also fixes your "whole csv is stored in memory" and "KeyboardInterrupt is not raised" problems.
Here's a complete code example:
#!/usr/bin/env python
import itertools
import logging
import multiprocessing
import time

def compute(i):
    time.sleep(.5)
    return i**2

if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)-15s %(levelname)s %(message)s",
                        datefmt="%F %T", level=logging.DEBUG)
    pool = multiprocessing.Pool()
    try:
        for square in pool.imap_unordered(compute, itertools.count(), chunksize=10):
            logging.debug(square)  # report progress by printing the result
    except KeyboardInterrupt:
        logging.warning("got Ctrl+C")
    finally:
        pool.terminate()
        pool.join()
You should see the output in batches every .5 * chunksize seconds. If you press Ctrl+C, you should see KeyboardInterrupt raised in the child processes and in the main process. In Python 3, the main process exits immediately. In Python 2, the KeyboardInterrupt is delayed until the next batch would have been printed (a bug in Python).
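Since the original question asked for plain text rather than a bar, here is a minimal sketch of the same imap_unordered pattern with a hand-rolled counter that is printed at most once per second; the items list, the way the total is obtained, and the once-per-second threshold are illustrative assumptions, not part of the code above:

#!/usr/bin/env python
import time
import multiprocessing

def compute(i):
    time.sleep(.5)
    return i ** 2

if __name__ == "__main__":
    items = list(range(100))   # stand-in for the rows read from input.csv
    total = len(items)         # for a file: sum(1 for _ in open("input.csv"))
    pool = multiprocessing.Pool()
    done = 0
    last_report = time.time()
    for result in pool.imap_unordered(compute, items, chunksize=10):
        done += 1
        now = time.time()
        if now - last_report >= 1:
            # plain-text progress, printed at most once per second
            print("%d/%d items processed" % (done, total))
            last_report = now
    pool.close()
    pool.join()
    print("%d/%d items processed" % (done, total))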
Related
At the moment I have this code:
#!/usr/bin/env python3
import sys
import requests
import random
from multiprocessing.dummy import Pool
from pathlib import Path

requests.urllib3.disable_warnings()

print ('Give name of txt file on _listeNDD directory (without.txt)'),
file = str(input())

if Path('_listeNDD/'+file+'.txt').is_file():
    print ('--------------------------------------------------------')
    print ("Found")
    print ('--------------------------------------------------------')
    print ('Choose name for the output list (without .txt)'),
    nomRez = str(input())
    filename = '_listeNDD/'+file+'.txt'
    domains = [i.strip() for i in open(filename , mode='r').readlines()]
else:
    print ('--------------------------------------------------------')
    exit('No txt found with this name')

def check(domain):
    try:
        r = requests.get('https://'+domain+'/test', timeout=5, allow_redirects = False)
        if "[core]" in r.text:
            with open('_rez/'+nomRez+'.txt', "a+") as f:
                print('https://'+domain+'/test', file=f)
    except:
        pass

mp = Pool(100)
mp.map(check, domains)
mp.close()
mp.join()
exit('finished')
(Screenshot of the root directory omitted.)
With this code, it opens a text file from the "_listeNDD" directory and writes a new text file to the "_rez" directory.
Obviously it's super fast for ten elements, but when the list gets bigger I would like a progress bar so I know whether I have time to make a coffee or not.
I personally tried using tqdm from GitHub, but unfortunately it shows a progress bar for every job it does, while I only want one for everything...
Any idea?
Thank you
EDIT: Using this post, I did not succeed with:
if __name__ == '__main__':
    p = Pool(100)
    r = p.map(check, tqdm.tqdm(range(0, 30)))
    p.close()
    p.join()
I don't have a high enough Python level to master this, so I may have integrated it badly into my code.
I also saw:
if __name__ == '__main__':
    r = process_map(check, range(0, 30), max_workers=2)
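For what it's worth (this is a sketch, not an answer taken from the linked post), the tqdm + imap_unordered idea from the earlier answer could be adapted to this script roughly as follows; it assumes check and domains exactly as defined in the code above, and that tqdm is installed:

from multiprocessing.dummy import Pool
from tqdm import tqdm  # $ pip install tqdm

p = Pool(100)
# total=len(domains) gives tqdm a known length, so it draws a single overall bar
for _ in tqdm(p.imap_unordered(check, domains), total=len(domains)):
    pass  # check() writes its own output file; the loop only consumes results to advance the bar
p.close()
p.join()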
How can I redirect prints that occur within a multiprocessing Pool into a StringIO()?
I am redirecting sys.stdout into a StringIO(); this works well as long as I don't use a Pool from the multiprocessing library.
This toy code is an example:
import io
import sys
from multiprocessing import Pool

print_file = io.StringIO()
sys.stdout = print_file

def a_print_func(some_string):
    print(some_string)

pool = Pool(2)
out = pool.map(a_print_func, [['test_1','test_1'],['test_2','test_2']])

a_print_func('no_pool')
print('no_pool, no_func')

fd = open('file.txt', 'w')
fd.write(print_file.getvalue())
fd.close()
file.txt only contains:
no_pool
no_pool, no_func
instead of:
test_1
test_1
test_2
test_2
no_pool
no_pool, no_func
Here's a solution for directing the output from all the child processes to a single file, using an initializer:
import io
import sys
from multiprocessing import Pool

print_file = io.StringIO()
print_file = open("file.txt", "w")

def a_print_func(some_string):
    print(some_string)

def foo(*args):
    sys.stdout = print_file

pool = Pool(2, initializer = foo)
out = pool.map(a_print_func, [['test_1','test_1'],['test_2','test_2']])

a_print_func('no_pool')
print('no_pool, no_func')
The output of the program is
no_pool
no_pool, no_func
And, the content of file.txt at the end of the execution is:
['test_1', 'test_1']
['test_2', 'test_2']
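If the goal really is to collect the children's prints into a StringIO in the parent (rather than into a shared file), one possible alternative is to capture stdout inside each worker call and return the captured text; a minimal sketch, where the captured helper is illustrative and not part of the original code:

import io
from contextlib import redirect_stdout
from multiprocessing import Pool

def a_print_func(some_string):
    print(some_string)

def captured(some_string):
    # run the real function with its stdout captured, and return the captured text
    buf = io.StringIO()
    with redirect_stdout(buf):
        a_print_func(some_string)
    return buf.getvalue()

if __name__ == '__main__':
    print_file = io.StringIO()
    with Pool(2) as pool:
        for text in pool.map(captured, [['test_1', 'test_1'], ['test_2', 'test_2']]):
            print_file.write(text)
    print(print_file.getvalue())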
I have to parse 30 days of access logs from the server based on client IP and accessed hosts, and I need to know the top 10 accessed sites. The log files are around 10-20 GB in size, which takes a lot of time for a single-threaded execution of the script. Initially, I wrote a script which was working fine, but it takes a lot of time due to the large log file size. Then I tried to implement the multiprocessing library for parallel processing, but it is not working; it seems my use of multiprocessing repeats the tasks instead of processing them in parallel. I'm not sure what is wrong in the code. Can someone please help with this? Thank you so much in advance for your help.
Code:
from datetime import datetime, timedelta
import commands
import os
import string
import sys
import multiprocessing

def ipauth (slave_list, static_ip_list):
    file_record = open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'a')
    count = 1
    while (count <= 30):
        Nth_days = datetime.now() - timedelta(days=count)
        date = Nth_days.strftime("%Y%m%d")
        yr_month = Nth_days.strftime("%Y/%m")
        file_name = 'local2' + '.' + date
        with open(slave_list) as file:
            for line in file:
                string = line.split()
                slave = string[0]
                proxy = string[1]
                log_path = "/LOGS/%s/%s" %(slave, yr_month)
                try:
                    os.path.exists(log_path)
                    file_read = os.path.join(log_path, file_name)
                    with open(file_read) as log:
                        for log_line in log:
                            log_line = log_line.strip()
                            if proxy in log_line:
                                file_record.write(log_line + '\n')
                except IOError:
                    pass
        count = count + 1

    file_log = open('/home/access/top10_domain_accessed/logs/ipauth_logs.txt', 'a')
    with open(static_ip_list) as ip:
        for line in ip:
            with open('/home/access/top10_domain_accessed/logs/combined_log.txt','r') as f:
                for content in f:
                    log_split = content.split()
                    client_ip = log_split[7]
                    if client_ip in line:
                        content = str(content).strip()
                        file_log.write(content + '\n')
    return

if __name__ == '__main__':
    slave_list = sys.argv[1]
    static_ip_list = sys.argv[2]
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=ipauth, args=(slave_list, static_ip_list))
        jobs.append(p)
        p.start()
        p.join()
UPDATE AFTER CONVERSATION WITH OP, PLEASE SEE COMMENTS
My take: Split the file into smaller chunks and use a process pool to work on those chunks:
import multiprocessing

def chunk_of_lines(fp, n):
    # read n lines from file
    # then yield
    pass

def process(lines):
    pass  # do stuff to a file

p = multiprocessing.Pool()
fp = open(slave_list)

for f in chunk_of_lines(fp, 10):
    p.apply_async(process, [f, static_ip_list])

p.close()
p.join()  # Wait for all child processes to close.
There are many ways to implement the chunk_of_lines method: you could iterate over the file's lines with a simple for loop, or do something more advanced like calling fp.read().
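For instance, a minimal sketch of the chunk_of_lines stub using itertools.islice, under the assumption that a chunk is simply a list of up to n consecutive lines:

from itertools import islice

def chunk_of_lines(fp, n):
    # yield successive lists of up to n lines from an open file
    while True:
        chunk = list(islice(fp, n))
        if not chunk:
            break
        yield chunk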
imap version:
import os
import multiprocessing as mp
import timeit
import string
import random

PROCESSES = 5
FILE = 'test_imap.txt'

def remove_file():
    try:
        os.remove(FILE)
    except FileNotFoundError:
        pass

def produce(i):
    return [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(32)) for i in range(100000)]

def imap_version():
    with mp.Pool(PROCESSES) as p:
        with open(FILE, 'a') as fp:
            for lines in p.imap_unordered(produce, range(5)):
                for line in lines:
                    fp.write(line + '\n')

if __name__ == '__main__':
    remove_file()
    imap_version_result = timeit.repeat("imap_version()", setup="from __main__ import imap_version", repeat=5, number=5)
    print('imap result:', imap_version_result)
apply_async version:
import os
import multiprocessing as mp
import timeit
import string
import random

PROCESSES = 5
FILE = 'test_apply.txt'

def remove_file():
    try:
        os.remove(FILE)
    except FileNotFoundError:
        pass

def produce():
    return [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(32)) for i in range(100000)]

def worker():
    lines = produce()
    with open(FILE, 'a') as fp:
        for line in lines:
            fp.write(line + '\n')

def apply_version():
    with mp.Pool(PROCESSES) as p:
        processes = []
        for i in range(5):
            processes.append(p.apply_async(worker))
        while True:
            if all((p.ready() for p in processes)):
                break

if __name__ == '__main__':
    remove_file()
    apply_version_result = timeit.repeat("apply_version()", setup="from __main__ import apply_version", repeat=5, number=5)
    print('apply result', apply_version_result)
Results:
imap result: [62.71130559899029, 62.65627204600605, 62.534730065002805, 62.67373917000077, 62.74415319500258]
apply result [72.03727042900573, 72.17959955699916, 72.2304800950078, 72.02653418600676, 72.11620796499483]
I expected imap to be slower because the child processes need to pickle the results back to the main process, which then writes them to the file, whereas each child process in the apply_async version writes its results to the file directly. Instead, imap is faster than apply_async.
Why is this so?
nb: This was done using Python 3.4.3 on Mac OS X 10.11
A quick glance at your source code shows that imap_version() opens the output file once per call (in the main process), whereas apply_version() opens it once per worker task, i.e. 5 times per call, because the open sits inside your range(5) loop.
with open(FILE, 'a') as fp is called 125 times in your async version vs 25 times in your imap version.
My guess is the busy loop is the culprit (besides it being an anti-pattern in its own right).
By checking the state yourself, you do redundant work: multiprocessing's machinery does pretty much the same thing with the work queue behind the scenes (in multiprocessing.pool.Pool._handle_workers(), which runs in a separate thread). On the other hand, IMapIterator.next uses threading.Condition(threading.Lock()) to suspend the main thread's execution until an item is ready, so _handle_workers runs unhindered (remember that only one thread can run Python code at any given moment).
Anyway, this is just another guess. The only decisive evidence would be a profiling result.
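One way to test that guess (purely a sketch, reusing the worker, mp and PROCESSES names from the apply_async version above) is to drop the busy loop and block on the AsyncResult objects instead:

def apply_version_without_busy_loop():
    with mp.Pool(PROCESSES) as p:
        results = [p.apply_async(worker) for _ in range(5)]
        for r in results:
            r.wait()  # block until the task finishes; r.get() would also re-raise worker exceptions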
I'm having the following problem in Python.
I need to do some calculations in parallel whose results I need to be written sequentially to a file. So I created a function that receives a multiprocessing.Queue and a file handle, does the calculation, and prints the result to the file:
import multiprocessing
from multiprocessing import Process, Queue
from mySimulation import doCalculation

# doCalculation(pars) is a function I must run for many different sets of
# parameters and collect the results in a file

def work(queue, fh):
    while True:
        try:
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, string
        except:
            break

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fh = open("foo", "w")
    workQueue = Queue()
    parList = # list of conditions for which I want to run doCalculation()
    for x in parList:
        workQueue.put(x)
    processes = [Process(target = writefh, args = (workQueue, fh)) for i in range(nthreads)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    fh.close()
But the file ends up empty after the script runs. I tried to change the worker() function to:
def work(queue, filename):
    while True:
        try:
            fh = open(filename, "a")
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, string
            fh.close()
        except:
            break
and pass the filename as parameter. Then it works as I intended. When I try to do the same thing sequentially, without multiprocessing, it also works normally.
Why didn't it work in the first version? I can't see the problem.
Also: can I guarantee that two processes won't try to write the file simultaneously?
EDIT:
Thanks. I got it now. This is the working version:
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep
from random import uniform

def doCalculation(par):
    t = uniform(0, 2)
    sleep(t)
    return par * par  # just to simulate some calculation

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block = False)
            print "dealing with ", par, ""
            res = doCalculation(par)
            queueOut.put((par, res))
        except:
            break

def write(queue, fname):
    fhandle = open(fname, "w")
    while True:
        try:
            par, res = queue.get(block = False)
            print >>fhandle, par, res
        except:
            break
    fhandle.close()

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fname = "foo"
    workerQueue = Queue()
    writerQueue = Queue()
    parlist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    feedProc = Process(target = feed, args = (workerQueue, parlist))
    calcProc = [Process(target = calc, args = (workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target = write, args = (writerQueue, fname))

    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()

    feedProc.join()
    for p in calcProc:
        p.join()
    writProc.join()
You really should use two queues and three separate kinds of processing.
1. Put stuff into Queue #1.
2. Get stuff out of Queue #1 and do calculations, putting stuff into Queue #2. You can have many of these, since they get from one queue and put into another queue safely.
3. Get stuff out of Queue #2 and write it to a file. You must have exactly 1 of these and no more. It "owns" the file, guarantees atomic access, and absolutely assures that the file is written cleanly and consistently.
If anyone is looking for a simple way to do the same, this can help you.
I don't think there are any disadvantages to doing it this way. If there are, please let me know.
import multiprocessing
import re

def mp_worker(item):
    # Do something with item; as a stand-in, count the word-like tokens in it
    count = len(re.findall(r'\w+', item))
    return item, count

def mp_handler():
    cpus = multiprocessing.cpu_count()
    p = multiprocessing.Pool(cpus)
    # The below 2 lines populate the list. This listX will later be accessed in parallel.
    # This can be replaced as long as listX is passed on to the next step.
    with open('ExampleFile.txt') as f:
        listX = [line for line in (l.strip() for l in f) if line]
    with open('results.txt', 'w') as f:
        for result in p.imap(mp_worker, listX):
            # (item, count) tuples from worker
            f.write('%s: %d\n' % result)

if __name__ == '__main__':
    mp_handler()
Source: Python: Writing to a single file with queue while using multiprocessing Pool
There is a mistake in the write worker code: if block is False, the writer will most likely never get any data, because it gives up as soon as the queue is momentarily empty (which it is right at the start, while the calc processes are still working). It should be as follows:
par, res = queue.get(block = True)
You can check this by adding the line
print "QSize", queueOut.qsize()
after the
queueOut.put((par, res))
line. With block=False you would see the queue length keep growing until the producers have filled it, unlike with block=True, where it always stays around "1".
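One caveat worth adding: with block=True and no timeout, the write worker will itself block forever once the producers are done, because nothing else ever arrives on the queue. A common way around this is a sentinel value that tells the writer to stop; here is a minimal sketch (the STOP sentinel and this rewritten write() are illustrative, not part of the original code):

STOP = None  # sentinel pushed onto the writer queue after all producers have finished

def write(queue, fname):
    fhandle = open(fname, "w")
    while True:
        item = queue.get(block=True)  # blocks until an item (or the sentinel) arrives
        if item is STOP:
            break
        par, res = item
        fhandle.write("%s %s\n" % (par, res))
    fhandle.close()

# in the main process, after feedProc and all the calc processes have been joined:
#     writerQueue.put(STOP)
#     writProc.join()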