I'm learning how to use concurrent.futures with executor.map() and executor.submit().
I have a list that contains 20 URLs and I want to send 20 requests at the same time. The problem is that .submit() returns results in a different order than the list they came from. I've read that map() does what I need, but I don't know how to write the code with it.
The code below worked perfectly for me.
Questions: is there a map() code block equivalent to the code below, or a sorting method that can sort the result list from submit() into the order of the given list?
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
Here is the map() version of your existing code. Note that the callback now accepts a tuple as its parameter. I added a try/except in the callback so iterating the results won't raise an error. The results are ordered according to the input list.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://www.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(tt):  # tt is a (url, timeout) tuple
    url, timeout = tt
    try:
        with urllib.request.urlopen(url, timeout=timeout) as conn:
            return (url, conn.read())
    except Exception as ex:
        print("Error:", url, ex)
        return (url, "")  # on error, return an empty string

with ThreadPoolExecutor(max_workers=5) as executor:
    # pass url and timeout to the callback as a tuple
    results = executor.map(load_url, [(u, 60) for u in URLS])
    executor.shutdown(wait=True)  # wait for all downloads to complete

print("Results:")
for r in results:  # ordered results; iterating re-raises any exception not handled in the callback
    print(' %r page is %d bytes' % (r[0], len(r[1])))
Output
Error: http://www.wsj.com/ HTTP Error 404: Not Found
Results:
'http://www.foxnews.com/' page is 320028 bytes
'http://www.cnn.com/' page is 1144916 bytes
'http://www.wsj.com/' page is 0 bytes
'http://www.bbc.co.uk/' page is 279418 bytes
'http://some-made-up-domain.com/' page is 64668 bytes
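As an aside, executor.map() also accepts multiple iterables and pairs them up element-wise, so the tuple packing is optional. A minimal sketch reusing a two-argument load_url like the one in the first snippet (the empty-body fallback on error is an assumption, adjust as you see fit):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ['http://www.foxnews.com/', 'http://www.bbc.co.uk/']  # any list of URLs

def load_url(url, timeout):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as conn:
            return url, conn.read()
    except Exception as ex:
        print("Error:", url, ex)
        return url, b""  # keep the iteration going instead of re-raising

with ThreadPoolExecutor(max_workers=5) as executor:
    # map(fn, urls, timeouts) calls fn(url, timeout) for each pair, preserving input order
    for url, body in executor.map(load_url, URLS, [60] * len(URLS)):
        print('%r page is %d bytes' % (url, len(body)))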
Without using map(), you can use enumerate() to build the future_to_url dict so that each value holds not just the URL but also its index in the list. As the futures complete, store them in a second dict keyed by that index; you can then iterate the indices in order to read the results in the same order as the corresponding items in the original list:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its index and URL
    future_to_url = {
        executor.submit(load_url, url, 60): (i, url) for i, url in enumerate(URLS)
    }
    futures = {}
    for future in concurrent.futures.as_completed(future_to_url):
        i, url = future_to_url[future]
        futures[i] = url, future
    for i in range(len(futures)):
        url, future = futures[i]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
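A simpler variant of the same idea, if you only need the results once everything is done: submit in list order, keep the futures in a list, and call result() on them in that same order. A minimal sketch, reusing load_url and URLS from above:

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # The futures list is in the same order as URLS
    futures = [executor.submit(load_url, url, 60) for url in URLS]

for url, future in zip(URLS, futures):
    try:
        data = future.result()  # blocks until this particular download finishes
    except Exception as exc:
        print('%r generated an exception: %s' % (url, exc))
    else:
        print('%r page is %d bytes' % (url, len(data)))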
Related
I'm trying to speed up API requests using multithreading.
I don't understand why, but I often get the same API response for different calls (they should not have the same response). In the end I get a lot of duplicates in my new file and a lot of rows are missing.
Example: request.post("id=5555") --> I get the response for request.post("id=444") instead of request.post("id=5555").
It looks like the workers pick up the wrong responses.
Has anybody faced this issue?
def request_data(id, useragent):
    # - ADD ID to data and useragent to headers -
    time.sleep(0.2)
    resp = requests.post(
        -URL-,
        params=params,
        headers=headerstemp,
        cookies=cookies,
        data=datatemp,
    )
    return resp

df = pd.DataFrame(columns=["ID", "prenom", "nom", "adresse", "tel", "mail", "prem_dispo", "capac_acc", "tarif_haut", "tarif_bas", "presentation", "agenda"])
ids = pd.read_csv('ids.csv')
ids.drop_duplicates(inplace=True)
ids = list(ids['0'].to_numpy())

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_to_url = {executor.submit(request_data, id, usera): id for id in ids}
    for future in concurrent.futures.as_completed(future_to_url):
        ok = False
        while not ok:
            try:
                resp = future.result()
                ok = True
            except Exception as e:
                print(e)
        df.loc[len(df)] = parse(json.loads(resp))
I tried asyncio, following the first response from "Multiple async requests simultaneously", but it returned the request object and not the API response...
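One common cause of this symptom is that the worker function mutates shared module-level dictionaries (here, apparently headerstemp and datatemp) from several threads at once, so one thread's id can be overwritten by another before its request is actually sent. A hedged sketch of a thread-safe variant; the exact params, headers and URL are assumptions standing in for whatever your real request needs:

import requests

def request_data(record_id, useragent):
    # Build fresh per-call dicts instead of mutating shared ones,
    # so each thread posts its own id rather than whatever was written last.
    headers = {"User-Agent": useragent}   # assumption: the user agent goes in the headers
    data = {"id": record_id}              # assumption: the id goes in the form data
    resp = requests.post(
        "https://example.com/api",        # placeholder, not the real endpoint
        headers=headers,
        data=data,
        timeout=30,
    )
    return record_id, resp                # return the id so results can be paired reliably

Returning the id alongside the response also lets the as_completed loop match each row to the right record regardless of completion order.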
I am trying to test an entire list of websites to see if the URLs are valid, and I want to know which ones are not.
import urllib2

filename = open(argfile, 'r')
f = filename.readlines()
filename.close()

def urlcheck():
    for line in f:
        try:
            urllib2.urlopen()
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)

urlcheck()
You have to pass the URL to urlopen:
def urlcheck():
    for line in f:
        try:
            urllib2.urlopen(line)
            print line, "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print line, "SITE IS NOT FUNCTIONAL"
            print(e.code)
        except urllib2.URLError, e:
            print line, "SITE IS NOT FUNCTIONAL"
            print(e.args)
        except Exception, e:
            print line, "Invalid URL"
Some edge cases or things to consider
A little bit on error codes and HTTPError:
Every HTTP response from the server contains a numeric “status code”.
Sometimes the status code indicates that the server is unable to
fulfil the request. The default handlers will handle some of these
responses for you (for example, if the response is a “redirection”
that requests the client fetch the document from a different URL,
urllib2 will handle that for you). For those it can’t handle, urlopen
will raise an HTTPError. Typical errors include ‘404’ (page not
found), ‘403’ (request forbidden), and ‘401’ (authentication
required).
Even if HTTPError is raised, you can still check the error code.
So sometimes, even if the URL is valid and reachable, it may raise HTTPError with codes such as 403 or 401.
Sometimes valid URLs return 5xx codes due to temporary server errors.
I would suggest you use the requests library.
import requests

resp = requests.get('your url')
if not resp.ok:
    print resp.status_code
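Building on the status-code caveat above, you may want to treat 401/403 (and transient 5xx) as "the site exists" rather than "the URL is invalid". A minimal requests-based sketch; the set of acceptable codes is an assumption to tune for your use case:

import requests

def url_is_reachable(url):
    # Connection-level failures (DNS errors, refused connections, timeouts) mean the URL is unusable
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return False
    # 401/403 mean the server answered but restricts access; 5xx may be temporary
    if resp.ok or resp.status_code in (401, 403):
        return True
    return resp.status_code >= 500

print(url_is_reachable('http://www.bbc.co.uk/'))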
You have to pass the URL as a parameter to the urlopen function.
import urllib2

filename = open(argfile, 'r')
f = filename.readlines()
filename.close()

def urlcheck():
    for line in f:
        try:
            urllib2.urlopen(line)  # the URL must be passed here
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)

urlcheck()
import urllib2

def check(url):
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'  # fetch only the headers, not the body (faster)
    request.add_header('Accept-Encoding', 'gzip, deflate, br')  # allow compressed responses (faster)
    try:
        response = urllib2.urlopen(request)
        return response.getcode() <= 400
    except Exception:
        return False

'''
Contents of "/tmp/urls.txt"
http://www.google.com
https://fb.com
http://not-valid
http://not-valid.nvd
not-valid
'''
filename = open('/tmp/urls.txt', 'r')
urls = filename.readlines()
filename.close()

for url in urls:
    print url + ' ' + str(check(url))
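If you are on Python 3, a roughly equivalent sketch of the HEAD-request check using urllib.request; the error handling is collapsed into a single except on the assumption that any failure should count as "not valid":

import urllib.request

def check(url):
    # HEAD asks for the headers only, which is usually enough to prove the URL resolves and responds
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.getcode() < 400
    except Exception:
        # Covers HTTPError (4xx/5xx), URLError (DNS/connection) and ValueError (malformed URL)
        return False

for url in ['http://www.google.com', 'http://not-valid.nvd', 'not-valid']:
    print(url, check(url))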
I would probably write it like this:
import urllib2

with open('urls.txt') as f:
    urls = [url.strip() for url in f.readlines()]

def urlcheck():
    for url in urls:
        try:
            urllib2.urlopen(url)
        except (ValueError, urllib2.URLError) as e:
            print('invalid url: {}'.format(url))

urlcheck()
some changes from the OP's original implementation:
use a context manager to open/close data file
strip newlines from URLs as they are read from file
use better variable names
switch to more modern exception handling style
also catch ValueError for malformed URLs
display a more useful error message
example output:
$ python urlcheck.py
invalid url: http://www.google.com/wertbh
invalid url: htp:/google.com
invalid url: google.com
invalid url: https://wwwbad-domain-zzzz.com
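Since this whole thread is about concurrent.futures, the same check also parallelizes naturally; a minimal Python 3 sketch (the function and file names are assumptions):

import concurrent.futures
import urllib.request

def check_url(url):
    try:
        urllib.request.urlopen(url, timeout=10)
        return url, None
    except Exception as exc:  # URLError, HTTPError, or ValueError for malformed URLs
        return url, exc

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map() keeps the results in the same order as the input file
    for url, error in executor.map(check_url, urls):
        if error is not None:
            print('invalid url: {} ({})'.format(url, error))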
I need to download a large number of files and save them to my local machine. Given the introduction of async/await in Python, is it better to use that, or to continue using concurrent.futures with something like:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
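For comparison, a rough asyncio version of the same download loop. This is only a sketch: it wraps the blocking urllib call in threads via asyncio.to_thread (Python 3.9+), which is the simplest migration path; a genuinely async HTTP client such as aiohttp would avoid threads entirely.

import asyncio
import urllib.request

URLS = ['http://www.foxnews.com/', 'http://www.bbc.co.uk/']

def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

async def main():
    # Each blocking download runs in the default thread pool; gather() preserves input order
    tasks = [asyncio.to_thread(load_url, url, 60) for url in URLS]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for url, result in zip(URLS, results):
        if isinstance(result, Exception):
            print('%r generated an exception: %s' % (url, result))
        else:
            print('%r page is %d bytes' % (url, len(result)))

asyncio.run(main())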
Since my scraper is running so slowly (one page at a time), I'm trying to use threads to make it work faster. I have a function scrape(website) that takes a website to scrape, so I can easily create each thread and call start() on it.
Now I want to implement a num_threads variable, the number of threads that I want to run at the same time. What is the best way to handle those multiple threads?
For example: suppose num_threads = 5. My goal is to start 5 threads and grab the first 5 websites in the list to scrape; then, as soon as thread #3 finishes, it should grab the 6th website from the list immediately, not wait until the other threads end.
Any recommendation for how to handle it? Thank you.
It depends.
If your code is spending most of its time waiting for network operations (likely, in a web scraping application), threading is appropriate. The simplest way to get a thread pool is concurrent.futures, available in the standard library since Python 3.2 (and backported to Python 2 as the futures package on PyPI). Failing that, you can create a threading.Queue object and write each thread as a loop that consumes work items from the queue and processes them.
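The queue-based fallback mentioned above looks roughly like this (a minimal sketch; scrape is a stand-in for your real per-site function):

import threading
import queue  # named Queue in Python 2

def scrape(website):
    print('scraping', website)  # placeholder for the real work

def worker(work_queue):
    # Each thread loops, pulling the next site as soon as it is free
    while True:
        website = work_queue.get()
        if website is None:      # sentinel: time to exit
            break
        try:
            scrape(website)
        finally:
            work_queue.task_done()

websites = ['http://example.com/%d' % i for i in range(20)]
num_threads = 5
work_queue = queue.Queue()
threads = [threading.Thread(target=worker, args=(work_queue,)) for _ in range(num_threads)]
for t in threads:
    t.start()
for site in websites:
    work_queue.put(site)
work_queue.join()           # wait until every queued site has been processed
for _ in threads:
    work_queue.put(None)     # one sentinel per thread
for t in threads:
    t.join()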
If your code is spending most of its time processing data after it is downloaded, threading won't help because of the GIL. concurrent.futures also supports process-based concurrency via ProcessPoolExecutor. For older Pythons, use multiprocessing; it provides a Pool type which simplifies creating a process pool.
You should profile your code (using cProfile) to determine which of those two scenarios you are experiencing.
If you're using Python 3, have a look at concurrent.futures.ThreadPoolExecutor
Example pulled from the docs ThreadPoolExecutor Example:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
If you're using Python 2, there is a backport available:
ThreadPoolExecutor Example:
from concurrent import futures
import urllib2

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib2.urlopen(url, timeout=timeout).read()

with futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = dict((executor.submit(load_url, url, 60), url)
                         for url in URLS)
    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (url,
                                                     future.exception()))
        else:
            print('%r page is %d bytes' % (url, len(future.result())))
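Applied to the scraping question, max_workers is exactly the num_threads knob: the pool starts the next site as soon as any worker frees up. A short sketch (scrape is a stand-in for your real function):

from concurrent.futures import ThreadPoolExecutor

def scrape(website):
    # stand-in for the real scraping function
    return website, 'ok'

websites = ['http://example.com/page%d' % i for i in range(20)]
num_threads = 5

with ThreadPoolExecutor(max_workers=num_threads) as executor:
    for website, status in executor.map(scrape, websites):
        print(website, status)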
I have a need for a callback kind of functionality in Python where I am sending a request to a webservice multiple times, with a change in the parameter each time. I want these requests to happen concurrently instead of sequentially, so I want the function to be called asynchronously.
It looks like asyncore is what I might want to use, but the examples I've seen of how it works all look like overkill, so I'm wondering if there's another path I should be going down. Any suggestions on modules/process? Ideally I'd like to use these in a procedural fashion instead of creating classes but I may not be able to get around that.
Starting in Python 3.2, you can use concurrent.futures for launching parallel tasks.
Check out this ThreadPoolExecutor example:
http://docs.python.org/dev/library/concurrent.futures.html#threadpoolexecutor-example
It spawns threads to retrieve HTML and acts on responses as they are received.
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
The above example uses threading. There is also a similar ProcessPoolExecutor that uses a pool of processes, rather than threads:
http://docs.python.org/dev/library/concurrent.futures.html#processpoolexecutor-example
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure processes are cleaned up promptly
with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
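One caveat: ProcessPoolExecutor launches worker processes, so on platforms that use the spawn start method (for example Windows) the pool must be created under an import guard, otherwise the child processes try to re-execute the module. A minimal sketch of the required structure:

import concurrent.futures
import urllib.request

def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

if __name__ == '__main__':
    # Creating the pool only under __main__ keeps child processes from re-running this block
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        for data in executor.map(load_url, ['http://www.bbc.co.uk/'], [60]):
            print('%d bytes' % len(data))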
Do you know about eventlet? It lets you write what appears to be synchronous code, but have it operate asynchronously over the network.
Here's an example of a super minimal crawler:
urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
        "https://wiki.secondlife.com/w/images/secondlife.jpg",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

import eventlet
from eventlet.green import urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()

for body in pool.imap(fetch, urls):
    print "got body", len(body)
The Twisted framework is just the ticket for that. But if you don't want to take that on, you might also use pycurl, a wrapper for libcurl, which has its own async event loop and supports callbacks.
(Although this thread is about server-side Python and the question was asked a while back, others might stumble on it while looking for a similar answer on the client side.)
For a client-side solution, you might want to take a look at the Async.js library, especially the "Control-Flow" section.
https://github.com/caolan/async#control-flow
By combining "Parallel" with "Waterfall" you can achieve your desired result.
WaterFall( Parallel(TaskA, TaskB, TaskC) -> PostParallelTask)
The example under Control-Flow - "Auto" illustrates this:
https://github.com/caolan/async#autotasks-callback
where "write-file" depends on "get_data" and "make_folder", and "email_link" depends on "write-file".
Please note that all of this happens on the client side (unless you're doing Node.js on the server side).
For server-side Python, look at PyCURL: https://github.com/pycurl/pycurl/blob/master/examples/basicfirst.py
By combining that example with pycurl, you can achieve non-blocking, multi-threaded functionality.