Example of urllib3 and threading in Python

I am trying to use urllib3 in a simple thread pool to fetch several wiki pages.
The script will create one connection for every thread (I don't understand why) and hang forever.
Any tip, advice, or simple example of urllib3 and threading would be appreciated.
import threadpool
from urllib3 import connection_from_url

HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True)

def fetch(url, fields):
    kwargs = {'retries': 6}
    return HTTP_POOL.get_url(url, fields, **kwargs)

pool = threadpool.ThreadPool(5)
requests = threadpool.makeRequests(fetch, iterable)
[pool.putRequest(req) for req in requests]
Lennart's script got this error (the URLs printed by the worker threads are interleaved with the tracebacks):

http://en.wikipedia.org/wiki/2010-11_Premier_League
http://en.wikipedia.org/wiki/List_of_MythBusters_episodes
http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes
http://en.wikipedia.org/wiki/List_of_Unicode_characters

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "crawler.py", line 9, in fetch
    print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'

(the same traceback is printed once per worker thread)
After adding import threadpool, import urllib3 and tpool = threadpool.ThreadPool(4), user318904's code got this error:

Traceback (most recent call last):
  File "crawler.py", line 21, in <module>
    tpool.map_async(fetch, urls)
AttributeError: ThreadPool instance has no attribute 'map_async'

Here is my take: a more current solution using Python 3 and concurrent.futures.ThreadPoolExecutor.
import urllib3
from concurrent.futures import ThreadPoolExecutor

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]

def download(url, cmanager):
    response = cmanager.request('GET', url)
    if response and response.status == 200:
        print("+++++++++ url: " + url)
        print(response.data[:1024])

connection_mgr = urllib3.PoolManager(maxsize=5)
thread_pool = ThreadPoolExecutor(5)
for url in urls:
    thread_pool.submit(download, url, connection_mgr)
Some remarks:
My code is based on a similar example from the Python Cookbook by Beazley and Jones.
I particularly like the fact that you only need a standard-library module besides urllib3.
The setup is extremely simple, and if you are only going for side effects in download (like printing, saving to a file, etc.), there is no additional effort in joining the threads.
If you want something different, ThreadPoolExecutor.submit actually returns whatever download would return, wrapped in a Future; a sketch of collecting those results follows below.
I found it helpful to align the number of threads in the thread pool with the number of HTTPConnection objects in the connection pool (via maxsize). Otherwise you might encounter (harmless) warnings when all threads try to access the same server (as in the example).
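For example, collecting those Future results could look roughly like the sketch below; the value returned by download here (URL plus page size) is an illustrative assumption, not part of the answer above:

from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib3

def download(url, cmanager):
    # hypothetical variant that returns a value instead of printing
    response = cmanager.request('GET', url)
    return url, len(response.data)

connection_mgr = urllib3.PoolManager(maxsize=5)
with ThreadPoolExecutor(5) as thread_pool:
    futures = [thread_pool.submit(download, url, connection_mgr) for url in urls]
    for future in as_completed(futures):
        url, size = future.result()   # re-raises any exception from the worker
        print(url, size)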

Obviously it will create one connection per thread; how else would each thread be able to fetch a page? And you try to use the same connection, made from one URL, for all URLs. That can hardly be what you intended.
This code worked just fine:
import threadpool
from urllib3 import connection_from_url

def fetch(url):
    kwargs = {'retries': 6}
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True)
    print url, conn.get_url(url)
    print "Done!"

pool = threadpool.ThreadPool(4)
urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]
requests = threadpool.makeRequests(fetch, urls)
[pool.putRequest(req) for req in requests]
pool.wait()
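Note that conn.get_url is not available in current urllib3 releases (that is exactly the AttributeError shown in the question). With today's urllib3 API the same per-URL-pool idea would look roughly like this sketch (threadpool is the same third-party module used in the question, and urls is the list defined above):

import threadpool
from urllib3 import connection_from_url

def fetch(url):
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True)
    # request() replaces the old get_url(); retries is passed through to urlopen()
    response = conn.request('GET', url, retries=6)
    print(url, response.status)
    print("Done!")

pool = threadpool.ThreadPool(4)
requests = threadpool.makeRequests(fetch, urls)
[pool.putRequest(req) for req in requests]
pool.wait()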

Thread programming is hard, so I wrote workerpool to make exactly what you're doing easier.
More specifically, see the Mass Downloader example.
To do the same thing with urllib3, it looks something like this:
import urllib3
import workerpool

# connection pool, kept under a distinct name so the WorkerPool below does not shadow it
http_pool = urllib3.connection_from_url("foo", maxsize=3)

def download(url):
    r = http_pool.get_url(url)
    # TODO: Do something with r.data
    print "Downloaded %s" % url

# Initialize a worker pool, 5 threads in this case
pool = workerpool.WorkerPool(size=5)

# The ``download`` method will be called with a line from the second
# parameter for each job.
pool.map(download, open("urls.txt").readlines())

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
For more sophisticated code, have a look at workerpool.EquippedWorker (and the tests here for example usage). You can make the pool be the toolbox you pass in.

I use something like this:
# excluding setup for threadpool etc.
upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True)

urls = ['/wiki/2010-11_Premier_League',
        '/wiki/List_of_MythBusters_episodes',
        '/wiki/List_of_Top_Gear_episodes',
        '/wiki/List_of_Unicode_characters',
        ]

def fetch(path):
    # add error checking
    return upool.get_url(path).data

tpool = ThreadPool()
tpool.map_async(fetch, urls)
# either wait on the result object or give map_async a callback function for the results
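The answer does not say which ThreadPool it imports; the map_async call suggests multiprocessing.pool.ThreadPool (also reachable as multiprocessing.dummy.Pool), not the third-party threadpool module the question uses. Assuming that, and using the modern urllib3 request() in place of the old get_url, waiting on the result object or passing a callback looks roughly like this sketch:

import urllib3
from multiprocessing.pool import ThreadPool

upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True)

def fetch(path):
    # request() is the current urllib3 spelling; get_url no longer exists
    return upool.request('GET', path).data

tpool = ThreadPool(4)

# Option 1: keep the AsyncResult and wait on it.
result = tpool.map_async(fetch, urls)
pages = result.get(timeout=60)          # list of page bodies, in request order

# Option 2 (alternative): let map_async deliver the finished list to a callback.
# tpool.map_async(fetch, urls, callback=handle_pages)   # handle_pages is a callback you define

tpool.close()
tpool.join()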

Related

Multiprocessing Manager failing on very simple example with pool.apply_async

I'm seeing some unexpected behavior in my code related to python multiprocessing, and the Manager class in particular. I wrote out a super simple example to try and better understand what's going on:
import multiprocessing as mp
from collections import defaultdict

def process(d):
    print('doing the process')
    d['a'] = []
    d['a'].append(1)
    d['a'].append(2)

def main():
    pool = mp.Pool(mp.cpu_count())
    with mp.Manager() as manager:
        d = manager.dict({'c': 2})
        result = pool.apply_async(process, args=(d))
        print(result.get())
        pool.close()
        pool.join()
        print(d)

if __name__ == '__main__':
    main()
This fails, and the stack trace printed from result.get() is as follows:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "<string>", line 2, in __iter__
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 825, in _callmethod
    proxytype = self._manager._registry[token.typeid][-1]
AttributeError: 'NoneType' object has no attribute '_registry'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "mp_test.py", line 34, in <module>
    main()
  File "mp_test.py", line 25, in main
    print(result.get())
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
AttributeError: 'NoneType' object has no attribute '_registry'
I'm still unclear on what's happening here. This seems to me to be a very, very straightforward application of the Manager class. It's nearly a copy of the actual example used in the official Python documentation, with the only difference being that I'm using a pool and running the process with apply_async. I'm doing this because that's what I'm using in my actual project.
To clarify, I wouldn't get a stack trace if I didn't have the result = and print(result.get()) in there. I would just see {'c': 2} printed when I run the script, which indicated to me that something was going wrong but wasn't being shown.
A couple things to start with: first, this isn't the code you ran. The code you posted has
result = pool.apply_async(process2, args=(d))
but there is no process2() defined. Assuming process was intended, the next thing is the
args=(d)
part. That's the same as typing
args=d
but that's not what's needed. You need to pass a sequence of the intended arguments. So you need to change that part to
args=(d,) # build a 1-tuple
or
args=[d] # build a list
Then the output changes, to
{'c': 2, 'a': []}
Why aren't 1 and 2 in the 'a' list? Because it's only the dict itself that lives on the manager server.
d['a'].append(1)
first gets the mapping for 'a' from the server, which is an empty list. But that empty list is not shared in any way - it's local to process(). You append 1 to it, and then it's thrown away - the server knows nothing about it. Same thing for 2.
To get what you want, you need to "do something" to tell the manager server about what you changed; e.g.,
d['a'] = L = []
L.append(1)
L.append(2)
d['a'] = L
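Putting both fixes together (the args 1-tuple and reassigning the list), a minimal corrected version of the posted script could look like this; it is a sketch based on the answer above, not code from the original post:

import multiprocessing as mp

def process(d):
    print('doing the process')
    L = []                 # build the list locally ...
    L.append(1)
    L.append(2)
    d['a'] = L             # ... then assign it so the manager server sees it

def main():
    pool = mp.Pool(mp.cpu_count())
    with mp.Manager() as manager:
        d = manager.dict({'c': 2})
        result = pool.apply_async(process, args=(d,))   # note the 1-tuple
        result.get()       # raises if the worker failed
        pool.close()
        pool.join()
        print(dict(d))     # {'c': 2, 'a': [1, 2]}

if __name__ == '__main__':
    main()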

Why does concurrent.futures executor map throw error when using with futures.as_completed after all the futures are complete?

I am trying to send HTTP requests concurrently. In order to do so, I am using concurrent.futures.
Here is the simple code:
import requests
from concurrent import futures

data = range(10)

def send_request(item):
    requests.get("https://httpbin.org/ip")
    print("Request {} complete.".format(item))

executor = futures.ThreadPoolExecutor(max_workers=25)
futures_ = executor.map(send_request, data)

for f in futures.as_completed(futures_):
    f.result()
If I run it, I can see that the requests are sent asynchronously, which is exactly what I want. However, when all the requests are complete, I get the following error:
Request 0 complete.
Request 6 complete.
...
Request 7 complete.
Request 9 complete.
Request 3 complete.
Traceback (most recent call last):
  File "send_thread.py", line 18, in <module>
    for f in futures.as_completed(futures_):
  File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 219, in as_completed
    with _AcquireFutures(fs):
  File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 146, in __enter__
    future._condition.acquire()
AttributeError: 'NoneType' object has no attribute '_condition'
This is quite a strange error. executor.map seems to be the problem here. If I replace map with the following line, it works as expected.
futures_ = [executor.submit(send_request, x) for x in data]
What am I missing? I tried to find the difference between the two, but can't seem to understand what could cause the above issue. Any input would be highly appreciated.
Executor.map does not return a list of futures but a generator of results, so instead of:

futures_ = executor.map(send_request, data)
for f in futures.as_completed(futures_):
    f.result()

you should run:

results = executor.map(send_request, data)
for r in results:
    print(r)
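If the goal is specifically to process results in completion order, the submit variant from the question is the way to go, since as_completed needs actual Future objects; roughly:

from concurrent import futures

executor = futures.ThreadPoolExecutor(max_workers=25)
futures_ = [executor.submit(send_request, x) for x in data]

for f in futures.as_completed(futures_):
    f.result()      # re-raises any exception from send_request

executor.shutdown()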

Error Web Scraping with ThreadPoolExecutor

I'm working on a simple web scraping program, but I can't even seem to download a simple set of pages and get their sizes.
Here is my code:
from concurrent.futures import ThreadPoolExecutor as Executor

urls = """reddit twitter tumblr instagram linkedin""".split()

def fetch(url):
    from urllib import request, error
    try:
        data = request.urlopen(url).read()
        return '{}: length {}'.format(url, len(data))
    except error.HTTPError as e:
        return '{}: {}'.format(url, e)

with Executor(max_workers=4) as exe:
    template = 'http://www.{}.com'
    jobs = [exe.submit(
        fetch, template.format(u)) for u in urls]
    results = [job.result() for job in jobs]
    print('\n'.join(results))
In the command line I'm running
python scrape.py
but I'm getting the error
Traceback (most recent call last):
  File "scrape.py", line 1, in <module>
    from concurrent.futures import ThreadPoolExecutor as Executor
ImportError: No module named concurrent.futures
What do I need to do to surmount this error?
Use Python 3.
https://docs.python.org/3/library/concurrent.futures.html
New in version 3.2.
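For what it's worth, if the script has to stay on Python 2, there is a backport of the module on PyPI named futures (installed with pip install futures), after which the import works unchanged. A minimal guard for either case, as an illustrative sketch:

import sys

try:
    from concurrent.futures import ThreadPoolExecutor as Executor
except ImportError:
    sys.exit("concurrent.futures is missing; run under Python 3.2+ "
             "or install the Python 2 backport:  pip install futures")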

Python - multiprocessing.pool.MaybeEncodingError while downloading images [duplicate]

Why does the code below work only with multiprocessing.dummy, but not with plain multiprocessing?
import urllib.request
#from multiprocessing.dummy import Pool  # this works
from multiprocessing import Pool

urls = ['http://www.python.org', 'http://www.yahoo.com', 'http://www.scala.org', 'http://www.google.com']

if __name__ == '__main__':
    with Pool(5) as p:
        results = p.map(urllib.request.urlopen, urls)
Error:

Traceback (most recent call last):
  File "urlthreads.py", line 31, in <module>
    results = p.map(urllib.request.urlopen, urls)
  File "C:\Users\patri\Anaconda3\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\patri\Anaconda3\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<http.client.HTTPResponse object at 0x0000016AEF204198>]'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
What's missing so that it works without "dummy" ?
The http.client.HTTPResponse object you get back from urlopen() has an _io.BufferedReader object attached, and this object cannot be pickled:

pickle.dumps(urllib.request.urlopen('http://www.python.org').fp)

Traceback (most recent call last):
  ...
    pickle.dumps(urllib.request.urlopen('http://www.python.org').fp)
TypeError: cannot serialize '_io.BufferedReader' object
multiprocessing.Pool needs to pickle (serialize) the results to send them back to the parent process, and this is what fails here. Since dummy uses threads instead of processes, there is no pickling, because threads in the same process share their memory naturally.
A general solution to this TypeError is to read out the buffer and save the content (if needed), and then remove the reference to the _io.BufferedReader from the object you are trying to pickle.
In your case, calling .read() on the http.client.HTTPResponse will empty and remove the buffer, so a function for converting the response into something pickleable could simply do this:
def read_buffer(response):
    response.text = response.read()
    return response
Example:
r = urllib.request.urlopen('http://www.python.org')
r = read_buffer(r)
pickle.dumps(r)
# Out: b'\x80\x03chttp.client\nHTTPResponse\...
Before you consider this approach, make sure you really want to use multiprocessing instead of multithreading. For an I/O-bound task like this one, multithreading is sufficient, since most of the time is spent waiting for the response (which needs no CPU time) anyway. Multiprocessing and the IPC involved also introduce substantial overhead. A thread-based sketch of the same download follows below.
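As the commented-out import in the question already hints, the thread-based pool is a drop-in replacement; a minimal sketch:

import urllib.request
from multiprocessing.dummy import Pool   # thread-based Pool, same interface

urls = ['http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com']

def fetch(url):
    # read inside the worker thread; no pickling is involved with threads
    with urllib.request.urlopen(url) as response:
        return url, len(response.read())

if __name__ == '__main__':
    with Pool(5) as p:
        for url, size in p.map(fetch, urls):
            print(url, size)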

Unit testing bottle py application that uses request body results in KeyError: 'wsgi.input'

When unit testing a bottle py route function:
from bottle import request, run, post

@post("/blah/<boo>")
def blah(boo):
    body = request.body.readline()
    return "body is %s" % body

blah("booooo!")
The following exception is raised:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in blah
  File "bottle.py", line 1197, in body
    self._body.seek(0)
  File "bottle.py", line 166, in __get__
    if key not in storage: storage[key] = self.getter(obj)
  File "bottle.py", line 1164, in _body
    read_func = self.environ['wsgi.input'].read
KeyError: 'wsgi.input'
The code works when running as a server via bottle's run function; the problem only appears when I call the function directly, e.g. in a unit test.
What am I missing? How can I invoke this as a normal Python function inside a unit test?
I eventually worked out what the problem is. I needed to "fake" the request environment for bottle to play nicely:
from bottle import request, run, post, tob
from io import BytesIO
body = "abc"
request.environ['CONTENT_LENGTH'] = str(len(tob(body)))
request.environ['wsgi.input'] = BytesIO()
request.environ['wsgi.input'].write(tob(body))
request.environ['wsgi.input'].seek(0)
# Now call your route function and assert
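Continuing the snippet above, the call and assertion could then look like this (the expected string assumes the blah route from the question and Python 2 string semantics):

# continuing from the setup above; blah() is the route function from the question
result = blah("booooo!")
assert result == "body is abc", result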
Another issue is that Bottle uses thread locals and reads the BytesIO object you put into request.environ the first time you access the body property on request. Therefore, if you run multiple tests with POST data, e.g. in a TestCase, reading the body in your request callback will only return the value it was initially given, not your updated value.
The solution is to scrub all the values stored on the request object before each test, so in your setUp you can do something like this:

class MyTestCase(TestCase):
    def setUp(self):
        # Flush any cached values
        request.bind({})
Check out https://pypi.python.org/pypi/boddle. In your test you could do:
from bottle import request, run, post
from boddle import boddle

@post("/blah/<boo>")
def blah(boo):
    body = request.body.readline()
    return "body is %s" % body

with boddle(body='woot'):
    print blah("booooo!")
Which would print body is woot.
Disclaimer: I authored it. (Wrote it for work.)
