Why imap and write is faster than apply_async and write?

imap version:
import os
import multiprocessing as mp
import timeit
import string
import random

PROCESSES = 5
FILE = 'test_imap.txt'

def remove_file():
    try:
        os.remove(FILE)
    except FileNotFoundError:
        pass

def produce(i):
    return [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(32)) for i in range(100000)]

def imap_version():
    with mp.Pool(PROCESSES) as p:
        with open(FILE, 'a') as fp:
            for lines in p.imap_unordered(produce, range(5)):
                for line in lines:
                    fp.write(line + '\n')

if __name__ == '__main__':
    remove_file()
    imap_version_result = timeit.repeat("imap_version()", setup="from __main__ import imap_version", repeat=5, number=5)
    print('imap result:', imap_version_result)
apply_async version:
import os
import multiprocessing as mp
import timeit
import string
import random

PROCESSES = 5
FILE = 'test_apply.txt'

def remove_file():
    try:
        os.remove(FILE)
    except FileNotFoundError:
        pass

def produce():
    return [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(32)) for i in range(100000)]

def worker():
    lines = produce()
    with open(FILE, 'a') as fp:
        for line in lines:
            fp.write(line + '\n')

def apply_version():
    with mp.Pool(PROCESSES) as p:
        processes = []
        for i in range(5):
            processes.append(p.apply_async(worker))
        while True:
            if all((p.ready() for p in processes)):
                break

if __name__ == '__main__':
    remove_file()
    apply_version_result = timeit.repeat("apply_version()", setup="from __main__ import apply_version", repeat=5, number=5)
    print('apply result', apply_version_result)
Results:
imap result: [62.71130559899029, 62.65627204600605, 62.534730065002805, 62.67373917000077, 62.74415319500258]
apply result [72.03727042900573, 72.17959955699916, 72.2304800950078, 72.02653418600676, 72.11620796499483]
I expected imap to be slower because the child processes have to pickle their results back to the main process, which then writes them to the file, whereas in the apply_async version each child process writes its results to the file directly. Instead, imap is faster than apply_async.
Why is this so?
nb: This was done using Python 3.4.3 on Mac OS X 10.11

A quick glance at your source code shows that imap_version() opens the output file once per call (in the parent process), while apply_version() opens it once per task, i.e. five times per call, because the open happens inside worker() and worker() is submitted inside your range(5) loop.
Over the whole timeit run (repeat=5, number=5), with open(FILE, 'a') as fp is therefore executed 125 times in your async version vs 25 times in your imap version.
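One way to test that hypothesis is a variant that opens the file only once per worker process via a Pool initializer, so the apply_async version no longer reopens the file for every task. This is a sketch, not code from the question; init_worker_fp, worker_once and apply_version_single_open are hypothetical helpers, and produce(), PROCESSES and FILE are taken from the apply_async script above:

import multiprocessing as mp

_fp = None   # per-process file handle, set by the initializer

def init_worker_fp(path):
    global _fp
    _fp = open(path, 'a')

def worker_once():
    for line in produce():   # produce() as defined in the apply_async version
        _fp.write(line + '\n')
    _fp.flush()              # make sure nothing is lost if the pool is terminated

def apply_version_single_open():
    with mp.Pool(PROCESSES, initializer=init_worker_fp, initargs=(FILE,)) as p:
        tasks = [p.apply_async(worker_once) for _ in range(5)]
        for t in tasks:
            t.wait()

If the gap between the two versions stays roughly the same when timed the same way, the extra opens were not the main cause.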

My guess is that the busy loop is the culprit (besides being an anti-pattern in its own right).
By checking the state yourself, you do redundant work: multiprocessing's machinery does pretty much the same with the work queue behind the scenes (in multiprocessing.pool.Pool._handle_workers(), which runs in a separate thread). IMapIterator.next, on the other hand, uses threading.Condition(threading.Lock()) to suspend the main thread's execution until an item is ready, so _handle_workers runs unhindered (remember that only one thread can run Python code at any given moment).
Anyway, this is but another guess. The only decisive evidence would be a profiling result.
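A minimal way to gather such evidence is to run each version once under cProfile and inspect where the parent process spends its time, for example by replacing the timeit call at the bottom of the imap script above with something like this sketch (and doing the same for the apply script):

import cProfile
import pstats

if __name__ == '__main__':
    remove_file()
    # cProfile only profiles the parent process; the children are not included,
    # but the parent is where the two versions behave differently
    # (unpickling results and writing vs. busy-waiting).
    cProfile.run('imap_version()', 'imap.prof')
    pstats.Stats('imap.prof').sort_stats('cumulative').print_stats(10)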

Related

How to process access log using python multiprocessing library?

I have to parse 30 days of access logs from the server based on client IP and accessed hosts, and I need to know the top 10 accessed sites. The log file is around 10-20 GB in size, which takes a lot of time for single-threaded execution of the script. Initially, I wrote a script which was working fine, but it takes a lot of time due to the large log file size. Then I tried to implement the multiprocessing library for parallel processing, but it is not working. It seems the multiprocessing implementation is repeating the tasks instead of doing parallel processing. I am not sure what is wrong in the code. Can someone please help with this? Thank you so much in advance for your help.
Code:
from datetime import datetime, timedelta
import commands
import os
import string
import sys
import multiprocessing

def ipauth(slave_list, static_ip_list):
    file_record = open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'a')
    count = 1
    while (count <= 30):
        Nth_days = datetime.now() - timedelta(days=count)
        date = Nth_days.strftime("%Y%m%d")
        yr_month = Nth_days.strftime("%Y/%m")
        file_name = 'local2' + '.' + date
        with open(slave_list) as file:
            for line in file:
                string = line.split()
                slave = string[0]
                proxy = string[1]
                log_path = "/LOGS/%s/%s" % (slave, yr_month)
                try:
                    os.path.exists(log_path)
                    file_read = os.path.join(log_path, file_name)
                    with open(file_read) as log:
                        for log_line in log:
                            log_line = log_line.strip()
                            if proxy in log_line:
                                file_record.write(log_line + '\n')
                except IOError:
                    pass
        count = count + 1
    file_log = open('/home/access/top10_domain_accessed/logs/ipauth_logs.txt', 'a')
    with open(static_ip_list) as ip:
        for line in ip:
            with open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'r') as f:
                for content in f:
                    log_split = content.split()
                    client_ip = log_split[7]
                    if client_ip in line:
                        content = str(content).strip()
                        file_log.write(content + '\n')
    return

if __name__ == '__main__':
    slave_list = sys.argv[1]
    static_ip_list = sys.argv[2]
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=ipauth, args=(slave_list, static_ip_list))
        jobs.append(p)
        p.start()
        p.join()
UPDATE AFTER CONVERSATION WITH OP, PLEASE SEE COMMENTS
My take: Split the file into smaller chunks and use a process pool to work on those chunks:
import multiprocessing

def chunk_of_lines(fp, n):
    # read n lines from file
    # then yield
    pass

def process(lines, static_ip_list):
    pass # do stuff to a file

p = multiprocessing.Pool()
fp = open(slave_list)
for f in chunk_of_lines(fp, 10):
    p.apply_async(process, [f, static_ip_list])
p.close()
p.join() # Wait for all child processes to close.
There are many ways to implement the chunk_of_lines method: you could iterate over the file's lines with a simple for loop, or do something more advanced such as calling fp.read().
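For example, here is a minimal sketch of chunk_of_lines based on itertools.islice (just one possible implementation, not the only one):

import itertools

def chunk_of_lines(fp, n):
    # Yield successive lists of at most n lines from an open file object.
    while True:
        chunk = list(itertools.islice(fp, n))
        if not chunk:
            break
        yield chunk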

Multiprocessing Pool not working - For loop inside function

I'm trying to get this function to work asynchronously (I have tried asyncio, ThreadPoolExecutor and ProcessPoolExecutor and still no luck).
It takes around 11 seconds on my PC to complete a batch of 500 items and there is no difference compared to a plain for loop, so I assume it doesn't work as expected (in parallel).
Here is the function:
from unidecode import unidecode
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(4)

def is_it_bad(word):
    for item in all_names:
        if str(word) in str(item['name']):
            return item
    item = {'name': word, 'gender': 2}
    return item

def check_word(arr):
    fname = unidecode(str(arr[1]['fullname'] + ' ' + arr[1]['username'])).replace('([^a-z ]+)', ' ').lower()
    fname = fname + ' ' + fname.replace(' ', '')
    fname = fname.split(' ')
    genders = []
    for chunk in fname:
        if len(chunk) > 2:
            genders.append(int(is_it_bad('_' + chunk + '_')['gender']))
    if set(genders) == {2}:
        followers[arr[0]]['gender'] = 2
        #results_new.append(name)
    elif set([0,1]).issubset(genders):
        followers[arr[0]]['gender'] = 2
        #results_new.append(name)
    else:
        if 0 in genders:
            followers[arr[0]]['gender'] = 0
            #results_new.append(name)
        else:
            followers[arr[0]]['gender'] = 1
            #results_new.append(name)

results = pool.map(check_word, [(idx, name) for idx, name in enumerate(names)])
Can you please help me with this?
You are using the module "multiprocessing.dummy"
According to the documentation provided here,
multiprocessing.dummy replicates the API of multiprocessing but is no more than
a wrapper around the threading module.
The threading module does not provide the same speedup as the multiprocessing module for CPU-bound work, because (due to the GIL) its threads execute Python bytecode one at a time. For more information on how to use the multiprocessing module, visit this tutorial (no affiliation).
In it, the author uses both multiprocessing.dummy and multiprocessing to accomplish two different tasks. You'll notice multiprocessing is the module used to provide the speedup. Just switch to that module and you should see an increase.
I am unable to run your code due to the unidecode package, but here is how I have used multiprocessing in previous projects, adapted to your code:
import multiprocessing
#get maximum threads
max_threads = multiprocessing.cpu_count()
#max_threads = multiprocessing.cpu_count()-1 #I prefer to use 1 less core if i still wish to use my device
#create pool with max_threads
p = multiprocessing.Pool(max_threads)
#execute pool with function
results = p.map(check_word, [(idx, name) for idx, name in enumerate(names)])
Let me know if this works or helps!
Edit: Added some comments to the code
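One caveat to keep in mind (an added note, with a placeholder body for check_word and hypothetical input data): with real processes the mapped function and its arguments must be picklable, and on platforms that spawn new interpreters instead of forking (Windows, and macOS on recent Python versions) the pool should be created under the if __name__ == '__main__': guard. Also, assignments to module-level globals such as followers made inside the workers are not visible in the parent process, so it is usually better to return the computed values. A minimal sketch of that shape:

import multiprocessing

def check_word(arg):
    idx, name = arg                      # same (index, name) tuples as above
    return idx, len(str(name))           # placeholder work instead of the real gender lookup

if __name__ == '__main__':
    names = ['alice', 'bob', 'carol']    # hypothetical input
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        results = p.map(check_word, list(enumerate(names)))
    print(results)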

Progress measuring with python's multiprocessing Pool and map function

The following is the code I'm using for parallel CSV processing:
#!/usr/bin/env python

import csv
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from multiprocessing import current_process
from pprint import pprint as pp

def init_worker(x):
    sleep(.5)
    print "(%s,%s)" % (x[0], x[1])
    x.append(int(x[0])**2)
    return x

def parallel_csv_processing(inputFile, outputFile, header=["Default", "header", "please", "change"], separator=",", skipRows = 0, cpuCount = 1):
    # OPEN FH FOR READING INPUT FILE
    inputFH = open(inputFile, "rt")
    csvReader = csv.reader(inputFH, delimiter=separator)

    # SKIP HEADERS
    for skip in xrange(skipRows):
        csvReader.next()

    # PARALLELIZE COMPUTING INTENSIVE OPERATIONS - CALL FUNCTION HERE
    try:
        p = Pool(processes = cpuCount)
        results = p.map(init_worker, csvReader, chunksize = 10)
        p.close()
        p.join()
    except KeyboardInterrupt:
        p.close()
        p.join()
        p.terminate()

    # CLOSE FH FOR READING INPUT
    inputFH.close()

    # OPEN FH FOR WRITING OUTPUT FILE
    outputFH = open(outputFile, "wt")
    csvWriter = csv.writer(outputFH, lineterminator='\n')

    # WRITE HEADER TO OUTPUT FILE
    csvWriter.writerow(header)

    # WRITE RESULTS TO OUTPUT FILE
    [csvWriter.writerow(row) for row in results]

    # CLOSE FH FOR WRITING OUTPUT
    outputFH.close()

    print pp(results)
    # print len(results)

def main():
    inputFile = "input.csv"
    outputFile = "output.csv"
    parallel_csv_processing(inputFile, outputFile, cpuCount = cpu_count())

if __name__ == '__main__':
    main()
I would like to somehow measure the progress of the script (just plain text, not any fancy ASCII art). The one option that comes to my mind is to compare the lines that were successfully processed by init_worker to all the lines in input.csv, and print the current state, e.g. every second. Can you please point me to the right solution? I've found several articles with a similar problem, but I was not able to adapt them to my needs because none of them used the Pool class and the map method. I would also like to ask about the p.close(), p.join() and p.terminate() methods: I've seen them mainly with the Process class, not Pool. Are they necessary with the Pool class, and have I used them correctly? Using p.terminate() was intended to kill the process with ctrl+c, but that is a different story which does not have a happy ending yet. Thank you.
PS: My input.csv looks like this, if it matters:
0,0
1,3
2,6
3,9
...
...
48,144
49,147
PPS: as I said, I'm a newbie at multiprocessing and the code I've put together just works. The one drawback I can see is that the whole csv is stored in memory, so if you guys have a better idea, do not hesitate to share it.
Edit
in reply to #J.F.Sebastian
Here is my actual code based on your suggestions:
#!/usr/bin/env python

import csv
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from multiprocessing import current_process
from pprint import pprint as pp
from tqdm import tqdm

def do_job(x):
    sleep(.5)
    # print "(%s,%s)" % (x[0],x[1])
    x.append(int(x[0])**2)
    return x

def parallel_csv_processing(inputFile, outputFile, header=["Default", "header", "please", "change"], separator=",", skipRows = 0, cpuCount = 1):
    # OPEN FH FOR READING INPUT FILE
    inputFH = open(inputFile, "rb")
    csvReader = csv.reader(inputFH, delimiter=separator)

    # SKIP HEADERS
    for skip in xrange(skipRows):
        csvReader.next()

    # OPEN FH FOR WRITING OUTPUT FILE
    outputFH = open(outputFile, "wt")
    csvWriter = csv.writer(outputFH, lineterminator='\n')

    # WRITE HEADER TO OUTPUT FILE
    csvWriter.writerow(header)

    # PARALLELIZE COMPUTING INTENSIVE OPERATIONS - CALL FUNCTION HERE
    try:
        p = Pool(processes = cpuCount)
        # results = p.map(do_job, csvReader, chunksize = 10)
        for result in tqdm(p.imap_unordered(do_job, csvReader, chunksize=10)):
            csvWriter.writerow(result)
        p.close()
        p.join()
    except KeyboardInterrupt:
        p.close()
        p.join()

    # CLOSE FH FOR READING INPUT
    inputFH.close()

    # CLOSE FH FOR WRITING OUTPUT
    outputFH.close()

    print pp(result)
    # print len(result)

def main():
    inputFile = "input.csv"
    outputFile = "output.csv"
    parallel_csv_processing(inputFile, outputFile, cpuCount = cpu_count())

if __name__ == '__main__':
    main()
Here is output of tqdm:
1 [elapsed: 00:05, 0.20 iters/sec]
what does this output mean? On the page you've referred tqdm is used in loop following way:
>>> import time
>>> from tqdm import tqdm
>>> for i in tqdm(range(100)):
...     time.sleep(1)
...
|###-------| 35/100 35% [elapsed: 00:35 left: 01:05, 1.00 iters/sec]
This output makes sense, but what does my output mean? Also, it does not seem that the ctrl+c problem is fixed: after hitting ctrl+c the script throws some traceback, and if I hit ctrl+c again I get a new traceback, and so on. The only way to kill it is sending it to the background (ctrl+z) and then killing it (kill %1).
To show the progress, replace pool.map with pool.imap_unordered:
from tqdm import tqdm # $ pip install tqdm
for result in tqdm(pool.imap_unordered(init_worker, csvReader, chunksize=10)):
    csvWriter.writerow(result)
The tqdm part is optional; see Text Progress Bar in the Console.
Incidentally, it fixes your "whole csv is stored in memory" and "KeyboardInterrupt is not raised" problems.
Here's a complete code example:
#!/usr/bin/env python
import itertools
import logging
import multiprocessing
import time

def compute(i):
    time.sleep(.5)
    return i**2

if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)-15s %(levelname)s %(message)s",
                        datefmt="%F %T", level=logging.DEBUG)
    pool = multiprocessing.Pool()
    try:
        for square in pool.imap_unordered(compute, itertools.count(), chunksize=10):
            logging.debug(square) # report progress by printing the result
    except KeyboardInterrupt:
        logging.warning("got Ctrl+C")
    finally:
        pool.terminate()
        pool.join()
You should see the output in batches every .5 * chunksize seconds. If you press Ctrl+C, you should see KeyboardInterrupt raised in the child processes and in the main process. In Python 3, the main process exits immediately. In Python 2, the KeyboardInterrupt is delayed until the next batch should have been printed (a bug in Python).
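Regarding the bare "1 [elapsed: ...]" line in the edit above: tqdm falls back to a plain counter because it cannot know the length of the iterator returned by imap_unordered. If the number of rows is known (or cheap to count), it can be passed explicitly via the total argument. A sketch, assuming counting the input lines up front is acceptable and reusing the names from the code above (inputFile, skipRows, p, do_job, csvReader, csvWriter):

from tqdm import tqdm

with open(inputFile) as f:
    total = sum(1 for _ in f) - skipRows   # rows that will actually be processed

for result in tqdm(p.imap_unordered(do_job, csvReader, chunksize=10), total=total):
    csvWriter.writerow(result)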

Python MultiProcessing and Directory Creation

I am using the Python multiprocessing module to scrape a website. This website has over 100,000 pages. What I am trying to do is put every 500 pages I retrieve into a separate folder. The problem is that although I successfully create a new folder, my script only populates the previous one. Here is the code:
a = 1
b = 500

def fetchAfter(y):
    global a
    global b
    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"
    if (os.path.exists( os.path.join( "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\", str(y) + ".html" )) == 0):
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1,3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))
        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a,b))
        pool.close()
        pool.join()
        a = b
        b = b + 500
    print time.time()-start
It is best for the worker function to only rely on the single argument it gets for determining what to do. Because that is the only information it gets from the parent process every time it is called. This argument can be almost any Python object (including a tuple, dict, list) so you're not really limited in the amount of information you pass to a worker.
So make a list of 2-tuples. Each 2-tuple should consist of (1) the file to get and (2) the directory where to stash it. Feed that list of tuples to map(), and let it rip.
I'm not sure if it is useful to specify the number of processes you want to use. Pool generally uses as many processes as your CPU has cores. That is usually enough to max out all the cores. :-)
BTW, you should only call map() once. And since map() blocks until everything is done, there is no need to call join().
Edit: Added example code below.
import multiprocessing
import requests
import os

def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg

    # It might be a good idea to put a random delay of a few seconds here,
    # so we don't hammer the webserver!

    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(savedir + '/' + fname, 'w+') as outfile:
        outfile.write(r.text)

def main():
    """Main program.
    """
    # This list of tuples should hold all the pages...
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'),
                ('http://www.foo.org/page2.html', 'dir1'),
                ('http://www.foo.org/page3.html', 'dir2'),
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        os.makedirs(d)
    p = multiprocessing.Pool()
    # Let'er rip!
    p.map(processfile, worklist)
    p.close()

if __name__ == '__main__':
    main()
Multiprocessing, as the name implies, uses separate processes. The processes you create with your Pool do not have access to the original values of a and b that you are adding 500 to in the main program. See this previous question.
The easiest solution is to just refactor your code so that you pass a and b to fetchAfter (in addition to passing y).
Here's one way to implement it:
#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib

def download_page(url_path):
    try:
        urllib.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))

def generate_url_path(rootdir, urls_per_dir=500):
    for i in xrange(100*1000):
        if i % urls_per_dir == 0: # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath) # stop if it fails
        url = 'http://example.com/page?' + urllib.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path

def main():
    mp.log_to_stderr().setLevel(logging.INFO)

    pool = mp.Pool(4)  # number of processes is unrelated to the number of CPUs
                       # because the task is IO-bound
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass

if __name__ == '__main__':
    main()
See also Python multiprocessing pool.map for multiple arguments, and the code "Brute force basic http authorization using httplib and multiprocessing" from How to make HTTP in Python faster?

Writing to a file with multiprocessing

I'm having the following problem in python.
I need to do some calculations in parallel whose results need to be written sequentially to a file. So I created a function that receives a multiprocessing.Queue and a file handle, does the calculation, and prints the result to the file:
import multiprocessing
from multiprocessing import Process, Queue
from mySimulation import doCalculation

# doCalculation(pars) is a function I must run for many different sets of parameters and collect the results in a file

def work(queue, fh):
    while True:
        try:
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, result
        except:
            break

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fh = open("foo", "w")
    workQueue = Queue()
    parList = # list of conditions for which I want to run doCalculation()
    for x in parList:
        workQueue.put(x)
    processes = [Process(target = work, args = (workQueue, fh)) for i in range(nthreads)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    fh.close()
But the file ends up empty after the script runs. I tried to change the work() function to:
def work(queue, filename):
    while True:
        try:
            fh = open(filename, "a")
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, result
            fh.close()
        except:
            break
and pass the filename as a parameter. Then it works as I intended. When I try to do the same thing sequentially, without multiprocessing, it also works normally.
Why didn't it work in the first version? I can't see the problem.
Also: can I guarantee that two processes won't try to write to the file simultaneously?
EDIT:
Thanks. I got it now. This is the working version:
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep
from random import uniform

def doCalculation(par):
    t = uniform(0,2)
    sleep(t)
    return par * par # just to simulate some calculation

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block = False)
            print "dealing with ", par, ""
            res = doCalculation(par)
            queueOut.put((par,res))
        except:
            break

def write(queue, fname):
    fhandle = open(fname, "w")
    while True:
        try:
            par, res = queue.get(block = False)
            print >>fhandle, par, res
        except:
            break
    fhandle.close()

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fname = "foo"
    workerQueue = Queue()
    writerQueue = Queue()
    parlist = [1,2,3,4,5,6,7,8,9,10]
    feedProc = Process(target = feed , args = (workerQueue, parlist))
    calcProc = [Process(target = calc , args = (workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target = write, args = (writerQueue, fname))

    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()

    feedProc.join()
    for p in calcProc:
        p.join()
    writProc.join()
You really should use two queues and three separate kinds of processing.
Put stuff into Queue #1.
Get stuff out of Queue #1 and do calculations, putting stuff in Queue #2. You can have many of these, since they get from one queue and put into another queue safely.
Get stuff out of Queue #2 and write it to a file. You must have exactly 1 of these and no more. It "owns" the file, guarantees atomic access, and absolutely assures that the file is written cleanly and consistently.
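For completeness, here is a small sketch of that layout using sentinel values instead of non-blocking get() calls (a variation with hypothetical helper names, not code from the answers), so the calc workers and the writer exit cleanly once the input is exhausted:

import multiprocessing

SENTINEL = None

def calc(queue_in, queue_out):
    # Blocks on get() and stops when the feeder's sentinel arrives.
    for par in iter(queue_in.get, SENTINEL):
        queue_out.put((par, par * par))   # placeholder calculation
    queue_out.put(SENTINEL)               # tell the writer this worker is done

def write(queue_out, fname, n_workers):
    done = 0
    with open(fname, 'w') as fh:
        while done < n_workers:
            item = queue_out.get()        # blocking get, no busy loop
            if item is SENTINEL:
                done += 1                 # one calc worker finished
            else:
                fh.write('%s %s\n' % item)

if __name__ == '__main__':
    n = multiprocessing.cpu_count()
    qin = multiprocessing.Queue()
    qout = multiprocessing.Queue()
    writer = multiprocessing.Process(target=write, args=(qout, 'foo', n))
    writer.start()
    workers = [multiprocessing.Process(target=calc, args=(qin, qout)) for _ in range(n)]
    for w in workers:
        w.start()
    for par in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
        qin.put(par)
    for _ in workers:
        qin.put(SENTINEL)                 # one stop signal per calc worker
    for w in workers:
        w.join()
    writer.join()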
If anyone is looking for a simple way to do the same, this can help you.
I don't think there are any disadvantages to doing it in this way. If there are, please let me know.
import multiprocessing
import re

def mp_worker(item):
    # Do something
    return item, count

def mp_handler():
    cpus = multiprocessing.cpu_count()
    p = multiprocessing.Pool(cpus)
    # The below 2 lines populate the list. This listX will later be accessed in parallel.
    # This can be replaced as long as listX is passed on to the next step.
    with open('ExampleFile.txt') as f:
        listX = [line for line in (l.strip() for l in f) if line]
    with open('results.txt', 'w') as f:
        for result in p.imap(mp_worker, listX):
            # (item, count) tuples from worker
            f.write('%s: %d\n' % result)

if __name__ == '__main__':
    mp_handler()
Source: Python: Writing to a single file with queue while using multiprocessing Pool
There is a mistake in the write worker code: with block set to False, get() raises Queue.Empty as soon as the queue happens to be empty (which is likely right after start-up, before the calc workers have produced anything), so the writer exits before it gets any data. It should be as follows:
par, res = queue.get(block = True)
You can check it by adding line
print "QSize",queueOut.qsize()
after the
queueOut.put((par,res))
With block=False you would see the queue length keep growing until it fills up, unlike with block=True, where you always get "1".
