I need to download a large number of files and save them to my local machine. Given the introduction of async/await in Python, is it best to use that, or should I continue to use concurrent.futures with something like:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
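For comparison, a minimal asyncio/aiohttp sketch of the same download might look roughly like this (it assumes aiohttp is installed and Python 3.7+; the TCPConnector limit stands in for max_workers=5):

import asyncio
import aiohttp

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

async def load_url(session, url):
    # Fetch one page and return its body as bytes.
    async with session.get(url) as resp:
        return await resp.read()

async def main():
    timeout = aiohttp.ClientTimeout(total=60)
    connector = aiohttp.TCPConnector(limit=5)  # roughly equivalent to max_workers=5
    async with aiohttp.ClientSession(timeout=timeout, connector=connector) as session:
        results = await asyncio.gather(
            *(load_url(session, url) for url in URLS), return_exceptions=True)
    for url, data in zip(URLS, results):
        if isinstance(data, Exception):
            print('%r generated an exception: %s' % (url, data))
        else:
            print('%r page is %d bytes' % (url, len(data)))

asyncio.run(main())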
I'm learning how to use concurrent.futures with executor.map() and executor.submit().
I have a list that contains 20 URLs and I want to send 20 requests at the same time. The problem is that .submit() returns results in a different order than the given list. I've read that map() does what I need, but I don't know how to write code with it.
The code below worked perfectly for me.
Question: is there a map() code block equivalent to the code below, or any sorting method that can order the result list from submit() to match the order of the given list?
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
Here is the map version of your existing code. Note that the callback now accepts a tuple as a parameter. I added a try/except in the callback so the results will not throw an error. The results are ordered according to the input list.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://www.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(tt):  # (url, timeout)
    url, timeout = tt
    try:
        with urllib.request.urlopen(url, timeout=timeout) as conn:
            return (url, conn.read())
    except Exception as ex:
        print("Error:", url, ex)
        return (url, "")  # error, return empty string

with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(load_url, [(u, 60) for u in URLS])  # pass url and timeout as tuple to callback
    executor.shutdown(wait=True)  # wait for all complete

print("Results:")
for r in results:  # ordered results, will throw exception here if not handled in callback
    print(' %r page is %d bytes' % (r[0], len(r[1])))
Output
Error: http://www.wsj.com/ HTTP Error 404: Not Found
Results:
'http://www.foxnews.com/' page is 320028 bytes
'http://www.cnn.com/' page is 1144916 bytes
'http://www.wsj.com/' page is 0 bytes
'http://www.bbc.co.uk/' page is 279418 bytes
'http://some-made-up-domain.com/' page is 64668 bytes
Without using the map method, you can use enumerate to build the future_to_url dict so that each value holds not just the URL but also its index in the list. You can then collect the futures returned by concurrent.futures.as_completed(future_to_url) into a dict keyed by index, and finally iterate over the indices in order to read the results in the same order as the corresponding items in the original list:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {
        executor.submit(load_url, url, 60): (i, url) for i, url in enumerate(URLS)
    }
    futures = {}
    for future in concurrent.futures.as_completed(future_to_url):
        i, url = future_to_url[future]
        futures[i] = url, future
    for i in range(len(futures)):
        url, future = futures[i]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
How can I use the requests module to go through a bunch of URLs and, if a URL in the list takes too long to load or the connection times out, skip that particular URL and move on to the next one?
import time
import requests

def req():
    with open('demofile.txt', 'r') as http:
        for url in http.readlines():
            req = url.strip()
            print(req)
            page = requests.get("http://" + req, verify=False)
            if page.status_code == 400:
                break
            else:
                continue
            time.sleep(1)
requests raises an exception when a request times out; you can catch it and then continue with the next request from the finally block:
import requests
import logging

timeout = 0.00001
try:
    response = requests.get(url="https://google.com", timeout=timeout)
except requests.exceptions.ConnectTimeout as e:
    logging.error("Time out!")
finally:
    # continue request here
    print("hello")

Output:
ERROR:root:Time out!
hello
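Applied to the loop over URLs from your question, a minimal sketch might look like this (the 5-second timeout is illustrative):

import time
import requests

def req():
    with open('demofile.txt', 'r') as http:
        for line in http:
            url = "http://" + line.strip()
            try:
                page = requests.get(url, timeout=5, verify=False)
            except (requests.exceptions.Timeout,
                    requests.exceptions.ConnectionError) as exc:
                print("Skipping", url, "because of", exc)
                continue  # move on to the next URL
            print(url, page.status_code)
            time.sleep(1)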
I have a server which waits for a request containing a picture:
@app.route("/uploader_ios", methods=['POST'])
def upload_file_ios():
    imagefile = request.files['imagefile']
I can submit a POST request quite easily using requests in Python like so:
url = "<myserver>/uploader_ios"
files = {'imagefile': open(fname, 'rb')}
%time requests.post(url, files=files).json() # 2.77s
However, what I would like to do is submit 1,000 or perhaps 100,000 requests at the same time. I wanted to try to do this using asyncio because I have been able to use it for GET requests without a problem. However, I can't seem to create a valid POST request that the server accepts.
My attempt is below:
import aiohttp
import asyncio
import json

# Testing with small amount
concurrent = 2
url_list = ['<myserver>/uploader_ios'] * 10

def handle_req(data):
    return json.loads(data)['English']

def chunked_http_client(num_chunks, s):
    # Use semaphore to limit number of requests
    semaphore = asyncio.Semaphore(num_chunks)

    @asyncio.coroutine
    # Return co-routine that will work asynchronously and respect
    # locking of semaphore
    def http_get(url):
        nonlocal semaphore
        with (yield from semaphore):
            # Attach files
            files = aiohttp.FormData()
            files.add_field('imagefile', open(fname, 'rb'))
            response = yield from s.request('post', url, data=files)
            print(response)
            body = yield from response.content.read()
            yield from response.wait_for_close()
        return body
    return http_get

def run_experiment(urls, _session):
    http_client = chunked_http_client(num_chunks=concurrent, s=_session)
    # http_client returns futures, save all the futures to a list
    tasks = [http_client(url) for url in urls]
    dfs_route = []
    # wait for futures to be ready then iterate over them
    for future in asyncio.as_completed(tasks):
        data = yield from future
        try:
            out = handle_req(data)
            dfs_route.append(out)
        except Exception as err:
            print("Error {0}".format(err))
    return dfs_route

with aiohttp.ClientSession() as session:  # We create a persistent connection
    loop = asyncio.get_event_loop()
    calc_routes = loop.run_until_complete(run_experiment(url_list, session))
The issue is that the response I get is:
.../uploader_ios) [400 BAD REQUEST]>
I am assuming this is because I am not correctly attaching the image file.
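For reference, with a current aiohttp the file can be attached through FormData using async/await; here is a minimal sketch (this is not necessarily the cause of your 400, the filename and content type are illustrative guesses, and fname and url_list are from your code above):

import asyncio
import aiohttp

async def post_image(session, url, path, sem):
    # Limit concurrency with the semaphore, then send the file as multipart form data.
    async with sem:
        form = aiohttp.FormData()
        form.add_field('imagefile', open(path, 'rb'),
                       filename='image.jpg',       # illustrative
                       content_type='image/jpeg')  # illustrative
        async with session.post(url, data=form) as resp:
            return await resp.json()

async def run(urls, path, concurrent=2):
    sem = asyncio.Semaphore(concurrent)
    async with aiohttp.ClientSession() as session:
        tasks = [post_image(session, url, path, sem) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

# results = asyncio.run(run(url_list, fname))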
Since my scraper is running so slowly (one page at a time), I'm trying to use threads to make it work faster. I have a function scrape(website) that takes in a website to scrape, so I can easily create each thread and call start() on each of them.
Now I want to implement a num_threads variable that is the number of threads I want to run at the same time. What is the best way to handle those multiple threads?
For example: suppose num_threads = 5. My goal is to start 5 threads and grab the first 5 websites in the list and scrape them; then, when thread #3 finishes, it should immediately grab the 6th website from the list to scrape, not wait until the other threads end.
Any recommendation for how to handle it? Thank you.
It depends.
If your code is spending most of its time waiting for network operations (likely, in a web scraping application), threading is appropriate. The best way to implement a thread pool is to use concurrent.futures (available since Python 3.2). Failing that, you can create a queue.Queue object and write each thread as a loop that consumes work items from the queue and processes them.
If your code is spending most of its time processing data after you've downloaded it, threading is useless due to the GIL. concurrent.futures also provides support for process concurrency, but again only in 3.2+. For older Pythons, use multiprocessing. It provides a Pool type which simplifies the process of creating a process pool.
You should profile your code (using cProfile) to determine which of those two scenarios you are experiencing.
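For reference, a minimal sketch of the queue-based thread pool described above, assuming your existing scrape(website) function and Python 3 (the module is named Queue in Python 2):

import threading
import queue

def worker(q):
    # Each thread keeps pulling websites off the queue until it is empty.
    while True:
        try:
            website = q.get_nowait()
        except queue.Empty:
            return
        try:
            scrape(website)  # your existing scrape function
        finally:
            q.task_done()

def scrape_all(websites, num_threads=5):
    q = queue.Queue()
    for site in websites:
        q.put(site)
    threads = [threading.Thread(target=worker, args=(q,)) for _ in range(num_threads)]
    for t in threads:
        t.start()
    q.join()  # block until every website has been processed

# scrape_all(website_list, num_threads=5)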
If you're using Python 3, have a look at concurrent.futures.ThreadPoolExecutor.
Example pulled from the docs' ThreadPoolExecutor example:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
If you're using Python 2, there is a backport available:
ThreadPoolExecutor Example:
from concurrent import futures
import urllib2

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib2.urlopen(url, timeout=timeout).read()

with futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = dict((executor.submit(load_url, url, 60), url)
                         for url in URLS)
    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (url,
                                                     future.exception()))
        else:
            print('%r page is %d bytes' % (url, len(future.result())))
I have a need for a callback kind of functionality in Python where I am sending a request to a webservice multiple times, with a change in the parameter each time. I want these requests to happen concurrently instead of sequentially, so I want the function to be called asynchronously.
It looks like asyncore is what I might want to use, but the examples I've seen of how it works all look like overkill, so I'm wondering if there's another path I should be going down. Any suggestions on modules/process? Ideally I'd like to use these in a procedural fashion instead of creating classes but I may not be able to get around that.
Starting in Python 3.2, you can use concurrent.futures for launching parallel tasks.
Check out this ThreadPoolExecutor example:
http://docs.python.org/dev/library/concurrent.futures.html#threadpoolexecutor-example
It spawns threads to retrieve HTML and acts on responses as they are received.
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
The above example uses threading. There is also a similar ProcessPoolExecutor that uses a pool of processes, rather than threads:
http://docs.python.org/dev/library/concurrent.futures.html#processpoolexecutor-example
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure processes are cleaned up promptly;
# the main guard is required because workers are separate processes
if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page is %d bytes' % (url, len(data)))
Do you know about eventlet? It lets you write what appears to be synchronous code, but have it operate asynchronously over the network.
Here's an example of a super minimal crawler:
urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "https://wiki.secondlife.com/w/images/secondlife.jpg",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

import eventlet
from eventlet.green import urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
    print "got body", len(body)
The Twisted framework is just the ticket for that. But if you don't want to take that on, you might also use pycurl, a wrapper for libcurl, which has its own async event loop and supports callbacks.
(Although this thread is about server-side Python, since this question was asked a while back, others might stumble on it while looking for a similar answer on the client side.)
For a client-side solution, you might want to take a look at the Async.js library, especially the "Control-Flow" section:
https://github.com/caolan/async#control-flow
By combining "Parallel" with "Waterfall" you can achieve your desired result:
WaterFall( Parallel(TaskA, TaskB, TaskC) -> PostParallelTask)
If you look at the example under Control-Flow, "Auto", they give an example of the above:
https://github.com/caolan/async#autotasks-callback
where "write-file" depends on "get_data" and "make_folder", and "email_link" depends on "write-file".
Please note that all of this happens on the client side (unless you're using Node.js on the server side).
For server-side Python, look at PyCURL: https://github.com/pycurl/pycurl/blob/master/examples/basicfirst.py
By combining the linked example with pyCurl, you can achieve the non-blocking, multi-threaded functionality you're after.
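For reference, here is a rough sketch of what driving several transfers concurrently can look like with pycurl's CurlMulti interface (pycurl must be installed; the URLs are illustrative):

import pycurl
from io import BytesIO

urls = ["http://www.example.com/", "http://www.python.org/"]

multi = pycurl.CurlMulti()
handles = []
for url in urls:
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)        # libcurl writes the response body into buf
    c.setopt(pycurl.FOLLOWLOCATION, True)
    multi.add_handle(c)
    handles.append((c, buf))

# Pump libcurl's event loop until every transfer has finished.
num_active = len(handles)
while num_active:
    ret, num_active = multi.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        multi.select(1.0)                  # wait for sockets to become ready

for c, buf in handles:
    print(c.getinfo(pycurl.EFFECTIVE_URL), len(buf.getvalue()), "bytes")
    multi.remove_handle(c)
    c.close()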