I am using the Python multiprocessing module to scrape a website. This website has over 100,000 pages. What I am trying to do is put every 500 pages I retrieve into a separate folder. The problem is that although I successfully create a new folder, my script only populates the previous one. Here is the code:
global a = 1
global b = 500

def fetchAfter(y):
    global a
    global b

    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"

    if (os.path.exists( os.path.join( "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\", str(y) + ".html" )) == 0):
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1,3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))

        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a,b))
        pool.close()
        pool.join()

        a = b
        b = b + 500

    print time.time()-start
It is best for the worker function to rely only on the single argument it gets for determining what to do, because that is the only information it receives from the parent process each time it is called. This argument can be almost any Python object (including a tuple, dict, or list), so you're not really limited in the amount of information you pass to a worker.
So make a list of 2-tuples. Each 2-tuple should consist of (1) the file to get and (2) the directory where to stash it. Feed that list of tuples to map(), and let it rip.
I'm not sure if it is useful to specify the number of processes you want to use. Pool generally uses as many processes as your CPU has cores. That is usually enough to max out all the cores. :-)
BTW, you should only call map() once. And since map() blocks until everything is done, there is no need to call join().
Edit: Added example code below.
import multiprocessing
import requests
import os


def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg

    # It might be a good idea to put a random delay of a few seconds here,
    # so we don't hammer the webserver!

    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(savedir + '/' + fname, 'w+') as outfile:
        outfile.write(r.text)


def main():
    """Main program.
    """
    # This list of tuples should hold all the pages...
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'),
                ('http://www.foo.org/page2.html', 'dir1'),
                ('http://www.foo.org/page3.html', 'dir2'),
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        os.makedirs(d)
    p = multiprocessing.Pool()
    # Let'er rip!
    p.map(processfile, worklist)
    p.close()


if __name__ == '__main__':
    main()
Multiprocessing, as the name implies, uses separate processes. The processes you create with your Pool do not have access to the original values of a and b that you are adding 500 to in the main program. See this previous question.
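A tiny, self-contained demonstration of this (a hypothetical snippet, not from the original post) shows that rebinding a module-level variable in the parent after the pool has been created is not seen by the workers:

from multiprocessing import Pool

x = 1

def show(_):
    return x  # each worker sees the value x had when the worker process was set up

if __name__ == '__main__':
    pool = Pool(2)   # the worker processes are created here, with x == 1
    x = 999          # this rebinding happens only in the parent process
    print(pool.map(show, range(4)))  # prints [1, 1, 1, 1], not [999, 999, 999, 999]
    pool.close()
    pool.join()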
The easiest solution is to just refactor your code so that you pass a and b to fetchAfter (in addition to passing y).
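A minimal sketch of that refactoring (the worker body is only a placeholder, no actual scraping) passes each page number together with its target directory as a single tuple:

import os
from multiprocessing import Pool

def fetchAfter(args):
    # The worker gets everything it needs from its single argument.
    y, dirpath = args
    strfile = os.path.join(dirpath, '%d.html' % y)
    if not os.path.exists(strfile):
        open(strfile, 'w').close()  # placeholder for the real download and write

if __name__ == '__main__':
    a, b = 1, 500
    dirpath = r'E:\A\B\%d-%d' % (a, b)
    if not os.path.isdir(dirpath):
        os.makedirs(dirpath)
    pool = Pool()
    pool.map(fetchAfter, [(y, dirpath) for y in range(a, b)])
    pool.close()
    pool.join()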
Here's one way to implement it:
#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib


def download_page(url_path):
    try:
        urllib.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))


def generate_url_path(rootdir, urls_per_dir=500):
    for i in xrange(100*1000):
        if i % urls_per_dir == 0:  # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath)  # stop if it fails
        url = 'http://example.com/page?' + urllib.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path


def main():
    mp.log_to_stderr().setLevel(logging.INFO)
    pool = mp.Pool(4)  # the number of processes is unrelated to the number of
                       # CPUs, because the task is I/O-bound
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass


if __name__ == '__main__':
    main()
See also Python multiprocessing pool.map for multiple arguments, and the code Brute force basic http authorization using httplib and multiprocessing from How to make HTTP in Python faster?
Related
I currently have this code:
#!/usr/bin/env python3
import sys
import requests
import random
from multiprocessing.dummy import Pool
from pathlib import Path

requests.urllib3.disable_warnings()

print('Give name of txt file on _listeNDD directory (without .txt)')
file = str(input())

if Path('_listeNDD/' + file + '.txt').is_file():
    print('--------------------------------------------------------')
    print("Found")
    print('--------------------------------------------------------')
    print('Choose name for the output list (without .txt)')
    nomRez = str(input())
    filename = '_listeNDD/' + file + '.txt'
    domains = [i.strip() for i in open(filename, mode='r').readlines()]
else:
    print('--------------------------------------------------------')
    exit('No txt found with this name')


def check(domain):
    try:
        r = requests.get('https://' + domain + '/test', timeout=5, allow_redirects=False)
        if "[core]" in r.text:
            with open('_rez/' + nomRez + '.txt', "a+") as f:
                print('https://' + domain + '/test', file=f)
    except:
        pass


mp = Pool(100)
mp.map(check, domains)
mp.close()
mp.join()
exit('finished')
[Screenshot of the project's root directory]
With this code, it opens a text file in the "_listeNDD" directory and writes a new text file in the "_rez" directory.
Obviously this is super fast for ten elements, but when the list gets bigger I would like a progress bar so I know whether I have time to make a coffee or not.
I have personally tried tqdm (from GitHub), but unfortunately it shows a progress bar for every job it does, while I only want one for the whole run...
Any idea?
Thank you
EDIT: Following this post, I did not succeed with:
if __name__ == '__main__':
    p = Pool(100)
    r = p.map(check, tqdm.tqdm(range(0, 30)))
    p.close()
    p.join()
My Python level is not high enough to master this, so I may have integrated it into my code badly.
I also saw:
if __name__ == '__main__':
    r = process_map(check, range(0, 30), max_workers=2)
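For reference, a pattern that is often suggested for getting a single overall bar is to wrap pool.imap (or imap_unordered) in one tqdm call with an explicit total; a minimal sketch assuming the check function and domains list defined above:

import tqdm
from multiprocessing.dummy import Pool

if __name__ == '__main__':
    p = Pool(100)
    # One bar for the whole run: tqdm advances as each result comes back.
    for _ in tqdm.tqdm(p.imap_unordered(check, domains), total=len(domains)):
        pass
    p.close()
    p.join()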
I have to parse 30 days of access logs from the server based on client IP and accessed hosts, and I need to know the top 10 accessed sites. The log file will be around 10-20 GB in size, which takes a lot of time for single-threaded execution of the script. Initially, I wrote a script that worked fine, but it took a lot of time due to the large log file size. Then I tried to implement the multiprocessing library for parallel processing, but it is not working: it seems my multiprocessing implementation repeats the same work in every process instead of splitting it up. Not sure what is wrong in the code. Can someone please help with this? Thank you so much in advance for your help.
Code:
from datetime import datetime, timedelta
import commands
import os
import string
import sys
import multiprocessing


def ipauth(slave_list, static_ip_list):
    file_record = open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'a')
    count = 1
    while (count <= 30):
        Nth_days = datetime.now() - timedelta(days=count)
        date = Nth_days.strftime("%Y%m%d")
        yr_month = Nth_days.strftime("%Y/%m")
        file_name = 'local2' + '.' + date
        with open(slave_list) as file:
            for line in file:
                string = line.split()
                slave = string[0]
                proxy = string[1]
                log_path = "/LOGS/%s/%s" % (slave, yr_month)
                try:
                    os.path.exists(log_path)
                    file_read = os.path.join(log_path, file_name)
                    with open(file_read) as log:
                        for log_line in log:
                            log_line = log_line.strip()
                            if proxy in log_line:
                                file_record.write(log_line + '\n')
                except IOError:
                    pass
        count = count + 1

    file_log = open('/home/access/top10_domain_accessed/logs/ipauth_logs.txt', 'a')
    with open(static_ip_list) as ip:
        for line in ip:
            with open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'r') as f:
                for content in f:
                    log_split = content.split()
                    client_ip = log_split[7]
                    if client_ip in line:
                        content = str(content).strip()
                        file_log.write(content + '\n')
    return


if __name__ == '__main__':
    slave_list = sys.argv[1]
    static_ip_list = sys.argv[2]
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=ipauth, args=(slave_list, static_ip_list))
        jobs.append(p)
        p.start()
        p.join()
UPDATE AFTER CONVERSATION WITH OP, PLEASE SEE COMMENTS
My take: Split the file into smaller chunks and use a process pool to work on those chunks:
import multiprocessing

def chunk_of_lines(fp, n):
    # read n lines from file
    # then yield
    pass

def process(lines, static_ip_list):
    pass  # do stuff to a chunk of the file

p = multiprocessing.Pool()
fp = open(slave_list)
for f in chunk_of_lines(fp, 10):
    p.apply_async(process, [f, static_ip_list])
p.close()
p.join()  # Wait for all child processes to close.
There are many ways to implement the chunk_of_lines method: you could iterate over the file's lines with a simple for loop, or do something more advanced such as calling fp.read().
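For instance, a minimal chunk_of_lines built on itertools.islice (just one possible sketch, not the only way) could look like this:

from itertools import islice

def chunk_of_lines(fp, n):
    """Yield successive lists of up to n lines from the open file fp."""
    while True:
        chunk = list(islice(fp, n))
        if not chunk:  # end of file reached
            break
        yield chunk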
I'm trying to get this function to work asynchronously (I have tried asyncio, ThreadPoolExecutor and ProcessPoolExecutor and still no luck).
It takes around 11 seconds on my PC to complete a batch of 500 items, and there is no difference compared to a plain for loop, so I assume it doesn't work as expected (in parallel).
Here is the function:
from unidecode import unidecode
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(4)

def is_it_bad(word):
    for item in all_names:
        if str(word) in str(item['name']):
            return item
    item = {'name': word, 'gender': 2}
    return item

def check_word(arr):
    fname = unidecode(str(arr[1]['fullname'] + ' ' + arr[1]['username'])).replace('([^a-z ]+)', ' ').lower()
    fname = fname + ' ' + fname.replace(' ', '')
    fname = fname.split(' ')
    genders = []
    for chunk in fname:
        if len(chunk) > 2:
            genders.append(int(is_it_bad('_' + chunk + '_')['gender']))
    if set(genders) == {2}:
        followers[arr[0]]['gender'] = 2
        #results_new.append(name)
    elif set([0, 1]).issubset(genders):
        followers[arr[0]]['gender'] = 2
        #results_new.append(name)
    else:
        if 0 in genders:
            followers[arr[0]]['gender'] = 0
            #results_new.append(name)
        else:
            followers[arr[0]]['gender'] = 1
            #results_new.append(name)

results = pool.map(check_word, [(idx, name) for idx, name in enumerate(names)])
Can you please help me with this?
You are using the module "multiprocessing.dummy"
According to the documentation provided here,
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
The threading module does not provide the same speedup as the multiprocessing module, because the GIL allows only one thread to execute Python bytecode at a time, so CPU-bound work in threads is effectively serialized. For more information on how to use the multiprocessing module, visit this tutorial (no affiliation).
In it, the author uses both multiprocessing.dummy and multiprocessing to accomplish two different tasks. You'll notice multiprocessing is the module used to provide the speedup. Just switch to that module and you should see an increase.
I am unable to run your code because of the unidecode package, but here is how I have used multiprocessing in previous projects, adapted to your code:
import multiprocessing

# Get the number of available CPU cores
max_threads = multiprocessing.cpu_count()
# max_threads = multiprocessing.cpu_count() - 1  # I prefer to use 1 less core if I still wish to use my device

# Create a pool with max_threads worker processes
p = multiprocessing.Pool(max_threads)

# Execute the pool with the function
results = p.map(check_word, [(idx, name) for idx, name in enumerate(names)])
Let me know if this works or helps!
Edit: Added some comments to the code
I want to download 20 CSV files, about 5 MB in total.
Here is the first version of my code:
import os
from bs4 import BeautifulSoup
import urllib.request
import datetime


def get_page(url):
    try:
        return urllib.request.urlopen(url).read()
    except:
        print("[warn] %s" % (url))
        raise


def get_all_links(page):
    soup = BeautifulSoup(page)
    links = []
    for link in soup.find_all('a'):
        url = link.get('href')
        if '.csv' in url:
            return url
    print("[warn] Can't find a link with CSV file!")


def get_csv_file(company):
    link = 'http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices'
    g = link.find('s=')
    name = link[g + 2:g + 6]
    link = link.replace(name, company)
    urllib.request.urlretrieve(get_all_links(get_page(link)), os.path.join('prices', company + '.csv'))
    print("[info][" + company + "] Download is complete!")


if __name__ == "__main__":
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for security in security_list:
        get_csv_file(security)
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))
This code downloads the 20 CSV files (about 5 MB in total) in 1.2 minutes.
Then I tried to use multiprocessing to make the downloads faster.
Here is version 2:
if __name__ == "__main__":
    import multiprocessing
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for i in range(20):
        p = multiprocessing.Process(target=hP.get_csv_files([index] + security_list), args=(i,))
        p.start()
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))
But, unfortunately, version 2 downloads the same 20 CSV files (5 MB in total) in 2.4 minutes.
Why does multiprocessing slow my program down?
What am I doing wrong?
What is the best way to download these files faster than now?
Thank you!
I don't know what exactly you are trying to start with Process in your example (I think you have a few typos). I think you want something like this:
processes = []
for security in security_list:
    p = multiprocessing.Process(target=get_csv_file, args=(security,))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
You iterate over the security list this way, create a new process for each security name, and put the process in a list.
After you have started all the processes, you loop over them and wait for them to finish using join.
There is also a simpler way to do this, using Pool and its parallel map implementation.
pool = multiprocessing.Pool(processes=5)
pool.map(get_csv_file, security_list)
You create a Pool of processes (if you omit the argument, it will create a number equal to your processor count), and then you apply your function to each element in the list using map. The pool will take care of the rest.
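As a side note, on Python 3 the pool is often used as a context manager so that it is shut down automatically when the block exits; a minimal sketch using the same get_csv_file and security_list as above:

import multiprocessing

if __name__ == "__main__":
    # The with-block tears the pool down once map() has returned all results.
    with multiprocessing.Pool(processes=5) as pool:
        pool.map(get_csv_file, security_list)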
I'm having the following problem in Python.
I need to do some calculations in parallel whose results I need to write sequentially to a file. So I created a function that receives a multiprocessing.Queue and a file handle, does the calculation, and prints the result to the file:
import multiprocessing
from multiprocessing import Process, Queue
from mySimulation import doCalculation

# doCalculation(pars) is a function I must run for many different sets of
# parameters and collect the results in a file

def work(queue, fh):
    while True:
        try:
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, string
        except:
            break

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fh = open("foo", "w")
    workQueue = Queue()
    parList = # list of conditions for which I want to run doCalculation()
    for x in parList:
        workQueue.put(x)
    processes = [Process(target = writefh, args = (workQueue, fh)) for i in range(nthreads)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    fh.close()
But the file ends up empty after the script runs. I tried to change the worker() function to:
def work(queue, filename):
    while True:
        try:
            fh = open(filename, "a")
            parameter = queue.get(block = False)
            result = doCalculation(parameter)
            print >>fh, string
            fh.close()
        except:
            break
and pass the filename as a parameter. Then it works as I intended. When I try to do the same thing sequentially, without multiprocessing, it also works normally.
Why didn't it work in the first version? I can't see the problem.
Also: can I guarantee that two processes won't try to write to the file simultaneously?
EDIT:
Thanks. I got it now. This is the working version:
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep
from random import uniform

def doCalculation(par):
    t = uniform(0, 2)
    sleep(t)
    return par * par  # just to simulate some calculation

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block = False)
            print "dealing with ", par, ""
            res = doCalculation(par)
            queueOut.put((par, res))
        except:
            break

def write(queue, fname):
    fhandle = open(fname, "w")
    while True:
        try:
            par, res = queue.get(block = False)
            print >>fhandle, par, res
        except:
            break
    fhandle.close()

if __name__ == "__main__":
    nthreads = multiprocessing.cpu_count()
    fname = "foo"
    workerQueue = Queue()
    writerQueue = Queue()
    parlist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    feedProc = Process(target = feed, args = (workerQueue, parlist))
    calcProc = [Process(target = calc, args = (workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target = write, args = (writerQueue, fname))

    feedProc.start()
    for p in calcProc:
        p.start()
    writProc.start()

    feedProc.join()
    for p in calcProc:
        p.join()
    writProc.join()
You really should use two queues and three separate kinds of processing.

1. Put stuff into Queue #1.
2. Get stuff out of Queue #1 and do calculations, putting stuff into Queue #2. You can have many of these, since they get from one queue and put into another queue safely.
3. Get stuff out of Queue #2 and write it to a file. You must have exactly 1 of these and no more. It "owns" the file, guarantees atomic access, and absolutely assures that the file is written cleanly and consistently.
If anyone is looking for a simple way to do the same, this can help you.
I don't think there are any disadvantages to doing it in this way. If there are, please let me know.
import multiprocessing
import re

def mp_worker(item):
    # Do something
    count = len(item)  # placeholder: replace with your real calculation
    return item, count

def mp_handler():
    cpus = multiprocessing.cpu_count()
    p = multiprocessing.Pool(cpus)
    # The two lines below populate listX, which is later processed in parallel.
    # They can be replaced with anything, as long as listX is passed on to the next step.
    with open('ExampleFile.txt') as f:
        listX = [line for line in (l.strip() for l in f) if line]
    with open('results.txt', 'w') as f:
        for result in p.imap(mp_worker, listX):
            # (item, count) tuples from worker
            f.write('%s: %d\n' % result)

if __name__ == '__main__':
    mp_handler()
Source: Python: Writing to a single file with queue while using multiprocessing Pool
There is a mistake in the write worker code: if block is False, the writer can find the queue momentarily empty and exit before the calculation workers have produced any data. It should be as follows:
par, res = queue.get(block = True)
You can check this by adding the line
print "QSize", queueOut.qsize()
after the
queueOut.put((par, res))
With block=False you would see the queue length keep growing until the run ends, because the writer has already exited; with block=True the reported size stays at about 1.
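For completeness, a common alternative (not from the original answers) is to sidestep the blocking question entirely by sending a sentinel value through the output queue, so the writer knows when to stop; a minimal sketch of a write function in that style, assuming the same writerQueue setup as in the working version above:

SENTINEL = None  # assumed marker value; anything the calc workers never produce

def write(queue, fname):
    # Single writer process: it owns the file and exits when the sentinel arrives.
    fhandle = open(fname, "w")
    while True:
        item = queue.get()        # block=True (the default): wait for data
        if item is SENTINEL:      # the main process signals "no more results"
            break
        par, res = item
        fhandle.write("%s %s\n" % (par, res))
    fhandle.close()

# In the __main__ block, after all calc processes have been joined:
#     writerQueue.put(SENTINEL)
#     writProc.join()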