I want to download 20 CSV files whose combined size is about 5 MB.
Here is the first version of my code:
import os
from bs4 import BeautifulSoup
import urllib.request
import datetime

def get_page(url):
    try:
        return urllib.request.urlopen(url).read()
    except:
        print("[warn] %s" % (url))
        raise

def get_all_links(page):
    soup = BeautifulSoup(page)
    links = []
    for link in soup.find_all('a'):
        url = link.get('href')
        if '.csv' in url:
            return url
    print("[warn] Can't find a link with CSV file!")

def get_csv_file(company):
    link = 'http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices'
    g = link.find('s=')
    name = link[g + 2:g + 6]
    link = link.replace(name, company)
    urllib.request.urlretrieve(get_all_links(get_page(link)), os.path.join('prices', company + '.csv'))
    print("[info][" + company + "] Download is complete!")

if __name__ == "__main__":
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for security in security_list:
        get_csv_file(security)
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))
This code downloads the 20 CSV files (about 5 MB in total) in roughly 1.2 minutes.
Then I tried to use multiprocessing to make the download faster.
Here is version 2:
if __name__ == "__main__":
    import multiprocessing
    start = datetime.datetime.now()
    security_list = ["AAPL", "ADBE", "AMD", "AMZN", "CRM", "EXPE", "FB", "GOOG", "GRPN", "INTC", "LNKD", "MCD", "MSFT", "NFLX", "NVDA", "NVTL", "ORCL", "SBUX", "STX"]
    for i in range(20):
        p = multiprocessing.Process(target=hP.get_csv_files([index] + security_list), args=(i,))
        p.start()
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end-start))
But, unfortunately, version 2 downloads the same 20 CSV files (about 5 MB in total) in 2.4 minutes.
Why does multiprocessing slow my program down?
What am I doing wrong?
What is the best way to download these files faster than now?
Thank you.
I don't know what exactly you are trying to start with Process in your example (I think you have a few typos). I think you want something like this:
processes = []
for security in security_list:
    p = multiprocessing.Process(target=get_csv_file, args=(security,))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
This iterates over the security list, creates a new process for each security name, and appends the process to a list.
After all the processes have started, you loop over them and wait for them to finish using join.
There is also a simpler way to do this, using Pool and its parallel map implementation.
pool = multiprocessing.Pool(processes=5)
pool.map(get_csv_file, security_list)
You create a Pool of processes (if you omit the argument, it will create a number equal to your processor count), and then you apply your function to each element in the list using map. The pool will take care of the rest.
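To tie this back to the script in the question, here is a minimal sketch (untested; get_csv_file and security_list are the ones defined above) of the Pool variant. The if __name__ == "__main__": guard matters because multiprocessing may start fresh interpreter processes that re-import the module:

import datetime
import multiprocessing

if __name__ == "__main__":
    start = datetime.datetime.now()
    # security_list and get_csv_file as defined in the question
    with multiprocessing.Pool(processes=5) as pool:  # a handful of workers is plenty for an I/O-bound job
        pool.map(get_csv_file, security_list)
    end = datetime.datetime.now()
    print('[success] Total time: ' + str(end - start))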
Actually, I have this code:
#!/usr/bin/env python3
import sys
import requests
import random
from multiprocessing.dummy import Pool
from pathlib import Path

requests.urllib3.disable_warnings()

print ('Give name of txt file on _listeNDD directory (without.txt)'),
file = str(input())

if Path('_listeNDD/'+file+'.txt').is_file():
    print ('--------------------------------------------------------')
    print ("Found")
    print ('--------------------------------------------------------')
    print ('Choose name for the output list (without .txt)'),
    nomRez = str(input())
    filename = '_listeNDD/'+file+'.txt'
    domains = [i.strip() for i in open(filename, mode='r').readlines()]
else:
    print ('--------------------------------------------------------')
    exit('No txt found with this name')

def check(domain):
    try:
        r = requests.get('https://'+domain+'/test', timeout=5, allow_redirects=False)
        if "[core]" in r.text:
            with open('_rez/'+nomRez+'.txt', "a+") as f:
                print('https://'+domain+'/test', file=f)
    except:
        pass

mp = Pool(100)
mp.map(check, domains)
mp.close()
mp.join()
exit('finished')
(Screenshot of the root directory omitted.)
This code opens a text file in the "_listeNDD" directory and writes a new text file to the "_rez" directory.
Obviously it is very fast for ten elements, but when the list gets bigger I would like a progress bar, so I know whether I have time to make a coffee or not.
I have tried tqdm (from GitHub), but unfortunately it shows a progress bar for every job it does, while I only want a single bar for everything...
Any idea?
Thank you
EDIT: Following this post, I did not succeed with:

if __name__ == '__main__':
    p = Pool(100)
    r = p.map(check, tqdm.tqdm(range(0, 30)))
    p.close()
    p.join()
My Python level is not high enough to master this, so I may have integrated it badly into my code.
I also saw:

if __name__ == '__main__':
    r = process_map(check, range(0, 30), max_workers=2)
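For reference, the usual way to get a single overall bar is to wrap a lazy pool iterator such as imap_unordered in tqdm, so the bar advances once per finished job. A sketch, reusing check and domains from the code above (untested against this exact script):

from multiprocessing.dummy import Pool
from tqdm import tqdm

if __name__ == '__main__':
    pool = Pool(100)
    # total= lets tqdm size the bar up front instead of showing an open-ended count
    for _ in tqdm(pool.imap_unordered(check, domains), total=len(domains)):
        pass
    pool.close()
    pool.join()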
I have to parse 30 days of access logs from the server, based on client IP and accessed hosts, and I need to find the top 10 accessed sites. The log files are around 10-20 GB in size, which takes a lot of time for single-threaded execution of the script. Initially I wrote a script that worked fine, but it took a lot of time due to the large log file size. Then I tried to use the multiprocessing library for parallel processing, but it is not working; it seems my use of multiprocessing repeats the task instead of running it in parallel. I am not sure what is wrong in the code. Can someone please help with this? Thank you so much in advance for your help.
Code:
from datetime import datetime, timedelta
import commands
import os
import string
import sys
import multiprocessing

def ipauth(slave_list, static_ip_list):
    file_record = open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'a')
    count = 1
    while (count <= 30):
        Nth_days = datetime.now() - timedelta(days=count)
        date = Nth_days.strftime("%Y%m%d")
        yr_month = Nth_days.strftime("%Y/%m")
        file_name = 'local2' + '.' + date
        with open(slave_list) as file:
            for line in file:
                string = line.split()
                slave = string[0]
                proxy = string[1]
                log_path = "/LOGS/%s/%s" % (slave, yr_month)
                try:
                    os.path.exists(log_path)
                    file_read = os.path.join(log_path, file_name)
                    with open(file_read) as log:
                        for log_line in log:
                            log_line = log_line.strip()
                            if proxy in log_line:
                                file_record.write(log_line + '\n')
                except IOError:
                    pass
        count = count + 1
    file_log = open('/home/access/top10_domain_accessed/logs/ipauth_logs.txt', 'a')
    with open(static_ip_list) as ip:
        for line in ip:
            with open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'r') as f:
                for content in f:
                    log_split = content.split()
                    client_ip = log_split[7]
                    if client_ip in line:
                        content = str(content).strip()
                        file_log.write(content + '\n')
    return

if __name__ == '__main__':
    slave_list = sys.argv[1]
    static_ip_list = sys.argv[2]
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=ipauth, args=(slave_list, static_ip_list))
        jobs.append(p)
        p.start()
        p.join()
UPDATE AFTER CONVERSATION WITH OP, PLEASE SEE COMMENTS
My take: Split the file into smaller chunks and use a process pool to work on those chunks:
import multiprocessing

def chunk_of_lines(fp, n):
    # read n lines from the file
    # then yield them
    pass

def process(lines, static_ip_list):
    pass  # do stuff to a chunk of the file

p = multiprocessing.Pool()
fp = open(slave_list)
for f in chunk_of_lines(fp, 10):
    p.apply_async(process, [f, static_ip_list])
p.close()
p.join()  # Wait for all child processes to close.
There are many ways to implement the chunk_of_lines method: you could iterate over the file's lines with a simple for loop, or do something more advanced like calling fp.read().
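For illustration, the "simple for" variant could look like this (a sketch; it yields lists of up to n lines from the open file object):

def chunk_of_lines(fp, n):
    """Yield successive lists of up to n lines read from the open file fp."""
    chunk = []
    for line in fp:
        chunk.append(line)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:  # flush the last, possibly shorter, batch
        yield chunk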
I'm trying to get this function to work asynchronously (I have tried asyncio, ThreadPoolExecutor, ProcessPoolExecutor and still no luck).
It takes around 11 seconds on my PC to complete a batch of 500 items, and there is no difference compared to a plain for loop, so I assume it doesn't work as expected (in parallel).
Here is the function:
from unidecode import unidecode
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(4)

def is_it_bad(word):
    for item in all_names:
        if str(word) in str(item['name']):
            return item
    item = {'name': word, 'gender': 2}
    return item

def check_word(arr):
    fname = unidecode(str(arr[1]['fullname'] + ' ' + arr[1]['username'])).replace('([^a-z ]+)', ' ').lower()
    fname = fname + ' ' + fname.replace(' ', '')
    fname = fname.split(' ')
    genders = []
    for chunk in fname:
        if len(chunk) > 2:
            genders.append(int(is_it_bad('_' + chunk + '_')['gender']))
    if set(genders) == {2}:
        followers[arr[0]]['gender'] = 2
        #results_new.append(name)
    elif set([0, 1]).issubset(genders):
        followers[arr[0]]['gender'] = 2
        #results_new.append(name)
    else:
        if 0 in genders:
            followers[arr[0]]['gender'] = 0
            #results_new.append(name)
        else:
            followers[arr[0]]['gender'] = 1
            #results_new.append(name)

results = pool.map(check_word, [(idx, name) for idx, name in enumerate(names)])
Can you please help me with this?
You are using the module "multiprocessing.dummy"
According to the documentation provided here,
multiprocessing.dummy replicates the API of multiprocessing but is no more than
a wrapper around the threading module.
The threading module does not provide the same speedup as the multiprocessing module, because CPython's global interpreter lock (GIL) lets only one thread execute Python bytecode at a time; threads mainly help when the work is I/O-bound. For more information on how to use the multiprocessing module, visit this tutorial (no affiliation).
In it, the author uses both multiprocessing.dummy and multiprocessing to accomplish two different tasks. You'll notice multiprocessing is the module used to provide the speedup. Just switch to that module and you should see an increase.
I am unable to run your code because of the unidecode package, but here is how I have used multiprocessing in previous projects, applied to your code:
import multiprocessing
#get maximum threads
max_threads = multiprocessing.cpu_count()
#max_threads = multiprocessing.cpu_count()-1 #I prefer to use 1 less core if i still wish to use my device
#create pool with max_threads
p = multiprocessing.Pool(max_threads)
#execute pool with function
results = p.map(check_word, [(idx, name) for idx, name in enumerate(names)])
Let me know if this works or helps!
Edit: Added some comments to the code
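As a small aside (a sketch, assuming check_word, names, and followers are defined as in the question): on Python 3 the Pool is often used as a context manager behind a __main__ guard, so the workers are cleaned up automatically. Note that with real processes the global followers dict is not shared, so check_word would need to return its result for the parent to collect.

import multiprocessing

if __name__ == '__main__':
    # One worker per CPU core; the pool is closed and joined automatically
    # when the with-block exits.
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        results = p.map(check_word, [(idx, name) for idx, name in enumerate(names)])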
An MP3 file is accessible via two different URLs. I'm trying to use Python to figure out which URL is fastest to download from.
For example, I want to time how long https://cpx.podbean.com/mf/download/a6bxxa/LAF_15min_044_mindfulness.mp3 takes to download and compare that to how long http://cpx.podbean.com/mf/play/a6bxxa/LAF_15min_044_mindfulness.mp3 takes to download.
To download the mp3 I'm currently using:
urllib.request.urlretrieve(mp3_url, mp3_filename)
You could essentially do something like:

from datetime import datetime

starttime = datetime.now()
urllib.request.urlretrieve(mp3_url, mp3_filename)  # Whatever code you're using...
finishtime = datetime.now()
runtime = finishtime - starttime
print(runtime)

This will print a timestamp like 0:03:19.356798, in the format [hours]:[minutes]:[seconds.microseconds].
My bad... I didn't realize you're trying to figure out which link is the fastest. I have no clue how you're storing your mp3_url and mp3_filename elements, but try something like this (adjust accordingly):
from datetime import datetime

mp3_list = {
    'file1.mp3': 'http://www.url1.com',
    'file2.mp3': 'http://www.url2.com',
    'file3.mp3': 'http://www.url3.com',
}

runtimes = []
for mp3_filename, mp3_url in mp3_list.items():  # keys are filenames, values are URLs here; modify this line to match however you store them...
    starttime = datetime.now()
    urllib.request.urlretrieve(mp3_url, mp3_filename)  # Whatever code you're using...
    finishtime = datetime.now()
    runtime = finishtime - starttime
    runtimes.append({'runtime': runtime, 'url': mp3_url, 'filename': mp3_filename})

fastest_mp3_url = sorted(runtimes, key=lambda k: k['runtime'])[0]['url']
fastest_mp3_filename = sorted(runtimes, key=lambda k: k['runtime'])[0]['filename']
print(fastest_mp3_url)
print(fastest_mp3_filename)
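If you prefer a plain number of seconds instead of a timedelta, time.perf_counter works too. A minimal sketch for the two URLs from the question (keep in mind that a single run is noisy, so repeating the measurement is a good idea):

import time
import urllib.request

urls = [
    'https://cpx.podbean.com/mf/download/a6bxxa/LAF_15min_044_mindfulness.mp3',
    'http://cpx.podbean.com/mf/play/a6bxxa/LAF_15min_044_mindfulness.mp3',
]

timings = []
for i, url in enumerate(urls):
    start = time.perf_counter()
    urllib.request.urlretrieve(url, 'candidate_%d.mp3' % i)  # same download call as above
    timings.append((time.perf_counter() - start, url))

fastest_seconds, fastest_url = min(timings)
print('%s took %.2f s' % (fastest_url, fastest_seconds))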
It's simple; there are plenty of methods to do this (Python 3.x).
using win64pyinstaller with progress
from win64pyinstaller import install
install("your_url", "destination_folder_with_file_name")
using urllib.request with progress
modifying PabloG's solution (which is in Python 2.x) from
How to download a file over HTTP?
from sys import stdout
from urllib.request import urlopen

def _restart_line():
    stdout.write('\r')
    stdout.flush()

url = "your_url"
file_name = url.split('/')[-1]
u = urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.get("Content-Length"))
print(f"Downloading: {file_name} Bytes: {file_size}")

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = f"done - {(file_size_dl/1000000):.2f} MB, {(file_size_dl * 100 / file_size):.2f} %"
    status = status + chr(8)*(len(status)+1)
    stdout.write(status)
    stdout.flush()
    _restart_line()

f.close()
There are more ways to do it; I hope you got your answer. Thank you!
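For comparison, a similar progress loop can be written with the requests package (a sketch, assuming requests is installed, the "your_url" placeholder is filled in, and the server sends a Content-Length header):

import requests

url = "your_url"
file_name = url.split('/')[-1]

with requests.get(url, stream=True) as r:
    r.raise_for_status()
    file_size = int(r.headers.get("Content-Length", 0))
    downloaded = 0
    with open(file_name, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
            downloaded += len(chunk)
            if file_size:
                print(f"\rdone - {downloaded/1000000:.2f} MB, {downloaded * 100 / file_size:.2f} %", end='', flush=True)
print()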
I am using Python Multiprocessing module to scrape a website. Now this website has over 100,000 pages. What I am trying to do is to put every 500 pages I retrieve into a separate folder. The problem is that though I successfully create a new folder, my script only populates the previous folder. Here is the code:
import os
import time
from multiprocessing import Pool

a = 1
b = 500

def fetchAfter(y):
    global a
    global b
    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"
    if (os.path.exists(os.path.join("E:\\A\\B\\" + str(a) + "-" + str(b) + "\\", str(y) + ".html")) == 0):
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1, 3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))
        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a, b))
        pool.close()
        pool.join()
        a = b
        b = b + 500
    print time.time()-start
It is best for the worker function to only rely on the single argument it gets for determining what to do. Because that is the only information it gets from the parent process every time it is called. This argument can be almost any Python object (including a tuple, dict, list) so you're not really limited in the amount of information you pass to a worker.
So make a list of 2-tuples. Each 2-tuple should consist of (1) the file to get and (2) the directory where to stash it. Feed that list of tuples to map(), and let it rip.
I'm not sure if it is useful to specify the number of processes you want to use. Pool generally uses as many processes as your CPU has cores. That is usually enough to max out all the cores. :-)
BTW, you should only call map() once. And since map() blocks until everything is done, there is no need to call join().
Edit: Added example code below.
import multiprocessing
import requests
import os

def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg
    # It might be a good idea to put a random delay of a few seconds here,
    # so we don't hammer the webserver!
    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(savedir + '/' + fname, 'w+') as outfile:
        outfile.write(r.text)

def main():
    """Main program.
    """
    # This list of tuples should hold all the pages...
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'),
                ('http://www.foo.org/page2.html', 'dir1'),
                ('http://www.foo.org/page3.html', 'dir2'),
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        os.makedirs(d)
    p = multiprocessing.Pool()
    # Let'er rip!
    p.map(processfile, worklist)
    p.close()

if __name__ == '__main__':
    main()
Multiprocessing, as the name implies, uses separate processes. The processes you create with your Pool do not have access to the original values of a and b that you are adding 500 to in the main program. See this previous question.
The easiest solution is to just refactor your code so that you pass a and b to fetchAfter (in addition to passing y).
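A rough sketch of that refactor (hypothetical; it keeps the names from the question but simplifies the paths to a single base folder): each work item is a (y, a, b) tuple, so the worker builds its own target path without relying on module-level globals:

import os
from multiprocessing import Pool

def fetchAfter(args):
    # Each work item carries the page number plus the folder bounds.
    y, a, b = args
    folder = "E:\\A\\B\\" + str(a) + "-" + str(b)
    strfile = os.path.join(folder, str(y) + ".html")
    if not os.path.exists(strfile):
        open(strfile, "w").close()

if __name__ == '__main__':
    a, b = 1, 500
    for _ in range(1, 3):
        os.makedirs("E:\\A\\B\\" + str(a) + "-" + str(b), exist_ok=True)
        pool = Pool(processes=12)
        pool.map(fetchAfter, [(y, a, b) for y in range(a, b)])
        pool.close()
        pool.join()
        a, b = b, b + 500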
Here's one way to implement it:
#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib

def download_page(url_path):
    try:
        urllib.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))

def generate_url_path(rootdir, urls_per_dir=500):
    for i in xrange(100*1000):
        if i % urls_per_dir == 0:  # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath)  # stop if it fails
        url = 'http://example.com/page?' + urllib.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path

def main():
    mp.log_to_stderr().setLevel(logging.INFO)
    pool = mp.Pool(4)  # number of processes is unrelated to the number of CPUs
                       # because the task is IO-bound
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass

if __name__ == '__main__':
    main()
See also Python multiprocessing pool.map for multiple arguments, and the code Brute force basic HTTP authorization using httplib and multiprocessing from How to make HTTP in Python faster?