Python - multiprocessing.pool.MaybeEncodingError while downloading images [duplicate] - python

Why does the code below work only with multiprocessing.dummy, but not with plain multiprocessing?
import urllib.request
#from multiprocessing.dummy import Pool #this works
from multiprocessing import Pool
urls = ['http://www.python.org', 'http://www.yahoo.com','http://www.scala.org', 'http://www.google.com']
if __name__ == '__main__':
    with Pool(5) as p:
        results = p.map(urllib.request.urlopen, urls)
Error :
Traceback (most recent call last):
File "urlthreads.py", line 31, in <module>
results = p.map(urllib.request.urlopen, urls)
File "C:\Users\patri\Anaconda3\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\patri\Anaconda3\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<http.client.HTTPResponse object at 0x0000016AEF204198>]'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
What's missing so that it works without "dummy"?

The http.client.HTTPResponse-object you get back from urlopen() has a _io.BufferedReader-object attached, and this object cannot be pickled.
pickle.dumps(urllib.request.urlopen('http://www.python.org').fp)
Traceback (most recent call last):
...
pickle.dumps(urllib.request.urlopen('http://www.python.org').fp)
TypeError: cannot serialize '_io.BufferedReader' object
multiprocessing.Pool needs to pickle (serialize) the results to send them back to the parent process, and this fails here. Since dummy uses threads instead of processes, there is no pickling, because threads in the same process share their memory naturally.
A general solution to this TypeError is:
read out the buffer and save the content (if needed)
remove the reference to '_io.BufferedReader' from the object you are trying to pickle
In your case, calling .read() on the http.client.HTTPResponse will empty and remove the buffer, so a function for converting the response into something pickleable could simply do this:
def read_buffer(response):
    response.text = response.read()
    return response
Example:
r = urllib.request.urlopen('http://www.python.org')
r = read_buffer(r)
pickle.dumps(r)
# Out: b'\x80\x03chttp.client\nHTTPResponse\...
Before you consider this approach, make sure you really want to use multiprocessing instead of multithreading. For I/O-bound tasks like the one you have here, multithreading would be sufficient, since most of the time is spent waiting for the response anyway (no CPU time needed). Multiprocessing and the IPC involved also introduce substantial overhead.
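If you do stay with processes, another way around the pickling problem is to read the body inside the worker and send back only plain data. The following is a minimal sketch of that idea, not the only way to do it; fetch and fetch_all are hypothetical names introduced for this example.

```python
import urllib.request
from multiprocessing import Pool


def fetch(url):
    """Download one URL inside the worker and return only picklable data."""
    with urllib.request.urlopen(url) as response:
        # bytes can be pickled; the response object (with its buffer) cannot
        return url, response.read()


def fetch_all(urls, processes=5):
    """Hypothetical helper: run fetch() over all urls in a process pool."""
    with Pool(processes) as pool:
        return pool.map(fetch, urls)
```

Call fetch_all(urls) from under the usual if __name__ == '__main__': guard; each result is a (url, body) tuple, so nothing unpicklable has to cross the process boundary.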

Related

Multiprocessing Manager failing on very simple example with pool.apply_async

I'm seeing some unexpected behavior in my code related to python multiprocessing, and the Manager class in particular. I wrote out a super simple example to try and better understand what's going on:
import multiprocessing as mp
from collections import defaultdict
def process(d):
    print('doing the process')
    d['a'] = []
    d['a'].append(1)
    d['a'].append(2)
def main():
    pool = mp.Pool(mp.cpu_count())
    with mp.Manager() as manager:
        d = manager.dict({'c': 2})
        result = pool.apply_async(process, args=(d))
        print(result.get())
        pool.close()
        pool.join()
        print(d)
if __name__ == '__main__':
    main()
This fails, and the stack trace printed from result.get() is as follows:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "<string>", line 2, in __iter__
File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 825, in _callmethod
proxytype = self._manager._registry[token.typeid][-1]
AttributeError: 'NoneType' object has no attribute '_registry'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "mp_test.py", line 34, in <module>
main()
File "mp_test.py", line 25, in main
print(result.get())
File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
AttributeError: 'NoneType' object has no attribute '_registry'
I'm still unclear on what's happening here. This seems to me to be a very straightforward application of the Manager class. It's nearly a copy of the example used in the official Python documentation, with the only difference being that I'm using a pool and running the process with apply_async. I'm doing this because that's what I'm using in my actual project.
To clarify, I wouldn't get a stack trace if I didn't have the result = and print(result.get()) in there. I just see {'c': 2} printed when I run the script, which indicated to me that something was going wrong and wasn't being shown.
A couple things to start with: first, this isn't the code you ran. The code you posted has
result = pool.apply_async(process2, args=(d))
but there is no process2() defined. Assuming "process" was intended, the next thing is the
args=(d)
part. That's the same as typing
args=d
but that's not what's needed. You need to pass a sequence of the intended arguments. So you need to change that part to
args=(d,) # build a 1-tuple
or
args=[d] # build a list
Then the output changes, to
{'c': 2, 'a': []}
Why aren't 1 and 2 in the 'a' list? Because it's only the dict itself that lives on the manager server.
d['a'].append(1)
first gets the mapping for 'a' from the server, which is an empty list. But that empty list is not shared in any way - it's local to process(). You append 1 to it, and then it's thrown away - the server knows nothing about it. Same thing for 2.
To get what you want, you need to "do something" to tell the manager server about what you changed; e.g.,
d['a'] = L = []
L.append(1)
L.append(2)
d['a'] = L
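Putting both fixes together (the 1-tuple for args and reassigning the list to the proxy), a complete version could look roughly like this. It is a sketch under the assumption that a pool of two workers is enough; the local name items is just illustrative.

```python
import multiprocessing as mp


def process(d):
    items = []        # build the list locally in the worker
    items.append(1)
    items.append(2)
    d['a'] = items    # reassigning is what sends the change to the manager


def main():
    with mp.Manager() as manager, mp.Pool(2) as pool:
        d = manager.dict({'c': 2})
        # args must be a sequence of arguments, hence the 1-tuple
        pool.apply_async(process, args=(d,)).get()
        return d.copy()   # plain dict snapshot, taken while the manager is alive
```

Call main() from under an if __name__ == '__main__': guard; the returned dict should then contain both 'c' and the list stored under 'a'.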

How to properly use dask's upload_file() to pass local code to workers

I have functions in a local_code.py file that I would like to pass to workers through dask. I've seen answers to questions on here saying that this can be done using the upload_file() function, but I can't seem to get it working because I'm still getting a ModuleNotFoundError.
The relevant part of the code is as follows.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from local_code import *
helper_file = '/absolute/path/to/local_code.py'
def main():
    with SLURMCluster(**slurm_params) as cluster:
        cluster.scale(n_workers)
        with Client(cluster) as client:
            client.upload_file(helper_file)
            mapping = client.map(myfunc, data)
            client.gather(mapping)
if __name__ == '__main__':
    main()
Note, myfunc is imported from local_code, and there's no error importing it to map. The function myfunc also depends on other functions that are defined in local_code.
With this code, I'm still getting this error
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95+\x00\x00\x00\x00\x00\x00\x00\x8c\x11local_code\x94\x8c\x$
Traceback (most recent call last):
File "/home/gallagher.r/.local/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 61, in loads
return pickle.loads(x)
ModuleNotFoundError: No module named 'local_code'
Using upload_file() seems so straightforward that I'm not sure what I'm doing wrong. I must have it in the wrong place or not be understanding correctly what is passed to it.
I'd appreciate any help with this. Please let me know if you need any other information or if there's anything else I can supply from the error file.
The upload_file method only uploads the file to the currently available workers. If a worker arrives after you call upload_file then that worker won't have the provided file.
In your situation the easiest thing to do is probably to wait until all of the workers have arrived before you call upload_file:
cluster.scale(n)
with Client(cluster) as client:
    client.wait_for_workers(n)
    client.upload_file(...)
Another option when you have workers going in/out is to use the Client.register_worker_callbacks to hook into whenever a new worker is registered/added. The one caveat is you will need to serialize your file(s) in the callback partial:
import functools
def _worker_upload(dask_worker, *, data, fname):
    dask_worker.loop.add_callback(
        callback=dask_worker.upload_file,
        comm=None,  # not used
        filename=fname,
        data=data,
        load=True)
fname = ...
with open(fname, 'rb') as f:
    data = f.read()
# _worker_upload has to be defined before it is wrapped in the partial
client.register_worker_callbacks(
    setup=functools.partial(
        _worker_upload, data=data, fname=fname,
    )
)
This will also upload the file the first time the callback is registered so you can avoid calling client.upload_file entirely.

Why does concurrent.futures executor map throw error when using with futures.as_completed after all the futures are complete?

I am trying to send HTTP requests concurrently. In order to do so, I am using concurrent.futures
Here is simple code:
import requests
from concurrent import futures
data = range(10)
def send_request(item):
    requests.get("https://httpbin.org/ip")
    print("Request {} complete.".format(item))
executor = futures.ThreadPoolExecutor(max_workers=25)
futures_ = executor.map(send_request, data)
for f in futures.as_completed(futures_):
    f.result()
If I run it, I can see requests are sent asynchronously, which is exactly what I want to do. However, when all the requests are complete, I get following error:
Request 0 complete.
Request 6 complete.
...
Request 7 complete.
Request 9 complete.
Request 3 complete.
Traceback (most recent call last):
File "send_thread.py", line 18, in <module>
for f in futures.as_completed(futures_):
File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 219, in as_completed
with _AcquireFutures(fs):
File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 146, in __enter__
future._condition.acquire()
AttributeError: 'NoneType' object has no attribute '_condition'
This is quite a strange error. executor.map seems to be the problem here: if I replace map with the following line, it works as expected.
futures_ = [executor.submit(send_request, x) for x in data]
What am I missing? I tried to find the difference between the two, but can't understand what could cause the above issue. Any input would be highly appreciated.
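Executor.map does not return a list of futures but an iterator of results. Because send_request returns None, as_completed ends up being handed None values instead of Future objects, which is where the "'NoneType' object has no attribute '_condition'" error comes from. So instead of:
futures_ = executor.map(send_request, data)
for f in futures.as_completed(futures_):
    f.result()
you should run:
results = executor.map(send_request, data)
for r in results:
    print(r)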
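To make the difference concrete, here is a small sketch without any network access. The function work is a stand-in invented for this example, not the asker's send_request.

```python
from concurrent import futures


def work(item):
    # stand-in for the real request; returns a value so there is a result to see
    return item * item


with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns Future objects, which is what as_completed() expects
    future_to_item = {executor.submit(work, x): x for x in range(10)}
    squares = {}
    for f in futures.as_completed(future_to_item):
        squares[future_to_item[f]] = f.result()

    # map() yields the plain results directly, in input order
    ordered = list(executor.map(work, range(10)))
```

Use submit plus as_completed when you want to handle results as soon as they finish, and map when you only need the results in input order.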

multiprocessing broken pipe after a long time

I am developing a crawler based on multiprocessing.
It uses multiprocessing.Queue to store the URL infos that need to be crawled, the page contents that need to be parsed, and some more; multiprocessing.Event to control the subprocesses; and multiprocessing.Manager.dict to store the hashes of crawled URLs. Each multiprocessing.Manager.dict instance uses a multiprocessing.Lock to control access.
All three types of parameters are shared between the subprocesses and the parent process, and they are all organized in a class. I use an instance of that class to pass the shared parameters from the parent process to the subprocesses, like this:
MGR = SyncManager()
class Global_Params():
    Queue_URL = multiprocessing.Queue()
    URL_RESULY = MGR.dict()
    URL_RESULY_Mutex = multiprocessing.Lock()
    STOP_EVENT = multiprocessing.Event()
global_params = Global_Params()
In my own timeout mechanism, I use process.terminate to stop a process that has not stopped by itself for a long time.
In my test case there are 2500+ target sites (some are out of service, some are huge).
The crawler goes through the target sites file site by site.
At the beginning the crawler works well, but after a long time (sometimes 8 hours, sometimes 2 hours, sometimes more than 15 hours), when it has crawled more than 100 sites (the exact number varies), I get the error "Errno 32 broken pipe".
I have tried the following to locate and solve the problem:
I located the site A that the crawler broke on and crawled that site separately; the crawler worked well. Even when I took a fragment (such as 20 sites) of the target sites file that contains site A, the crawler worked well.
I added "-X /tmp/pymp-* 240 /tmp" to /etc/cron.daily/tmpwatch.
When the break occurred, the file /tmp/pymp-* was still there.
I used multiprocessing.managers.SyncManager in place of multiprocessing.Manager and ignored most signals except SIGKILL and SIGTERM.
For each target site, I clear most of the shared parameters (queues, dicts and event); if an error occurs, I create a new instance:
while global_params.Queue_url.qsize()>0:
    try:
        global_params.Queue_url.get(block=False)
    except Exception,e:
        print_info(str(e))
        print_info("Clear Queue_url error!")
        time.sleep(1)
        global_params.Queue_url = Queue()
        pass
The following is the traceback; print_info is a function I defined myself to print and store debug info:
[Errno 32] Broken pipe
Traceback (most recent call last):
File "Spider.py", line 613, in <module>
main(args)
File "Spider.py", line 565, in main
spider.start()
File "Spider.py", line 367, in start
print_info("STATIC_RESULT size:%d" % len(global_params.STATIC_RESULT))
File "<string>", line 2, in __len__
File "/usr/local/python2.7.3/lib/python2.7/multiprocessing/managers.py", line 769, in _callmethod
kind, result = conn.recv()
EOFError
I can't understand why. Does anyone know the reason?
I don't know if this fixes your problem, but there is one point worth mentioning:
global_params.Queue_url.get(block=False)
... raises a Queue.Empty exception if the queue is empty. It's not worth recreating the queue because of an Empty exception.
Recreating the queue can lead to race conditions.
From my point of view, you have two possibilities:
get rid of the "queue recreation" code block
switch to another Queue implementation
use:
from Queue import Queue
instead of:
from multiprocessing import Queue
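For the first possibility, clearing a queue without ever replacing it could be sketched as follows. The example uses the Python 3 module name queue (it is Queue in Python 2.7, which the traceback above comes from), and drain is a hypothetical helper name. multiprocessing.Queue raises the same Empty exception, though items that were just put may take a moment to become visible there.

```python
import queue


def drain(q):
    """Remove and return everything currently in q, without replacing q."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            # an empty queue is the normal way for this loop to end
            return items


q = queue.Queue()
for url in ("a", "b", "c"):
    q.put(url)
print(drain(q))   # ['a', 'b', 'c']
print(drain(q))   # []
```

Because the same queue object stays in place, other processes or threads holding a reference to it are not left talking to a stale queue.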

example urllib3 and threading in python

I am trying to use urllib3 with simple threads to fetch several wiki pages.
The script will
create 1 connection for every thread (I don't understand why) and hang forever.
Any tips, advice, or a simple example of urllib3 and threading would be appreciated.
import threadpool
from urllib3 import connection_from_url
HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True)
def fetch(url, fiedls):
    kwargs={'retries':6}
    return HTTP_POOL.get_url(url, fields, **kwargs)
pool = threadpool.ThreadPool(5)
requests = threadpool.makeRequests(fetch, iterable)
[pool.putRequest(req) for req in requests]
#Lennart's script got this error:
http://en.wikipedia.org/wiki/2010-11_Premier_LeagueTraceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
http://en.wikipedia.org/wiki/List_of_MythBusters_episodeshttp://en.wikipedia.org/wiki/List_of_Top_Gear_episodes http://en.wikipedia.org/wiki/List_of_Unicode_characters result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
result = request.callable(*request.args, **request.kwds)
File "crawler.py", line 9, in fetch
print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
After adding import threadpool; import urllib3 and tpool = threadpool.ThreadPool(4) #user318904's code got this error:
Traceback (most recent call last):
File "crawler.py", line 21, in <module>
tpool.map_async(fetch, urls)
AttributeError: ThreadPool instance has no attribute 'map_async'
Here is my take, a more current solution using Python3 and concurrent.futures.ThreadPoolExecutor.
import urllib3
from concurrent.futures import ThreadPoolExecutor
urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]
def download(url, cmanager):
    response = cmanager.request('GET', url)
    if response and response.status == 200:
        print("+++++++++ url: " + url)
        print(response.data[:1024])
connection_mgr = urllib3.PoolManager(maxsize=5)
thread_pool = ThreadPoolExecutor(5)
for url in urls:
    thread_pool.submit(download, url, connection_mgr)
Some remarks
My code is based on a similar example from the Python Cookbook by Beazley and Jones.
I particularly like the fact that you only need a standard module besides urllib3.
The setup is extremely simple, and if you are only going for side-effects in download (like printing, saving to a file, etc.), there is no additional effort in joining the threads.
If you want something different, ThreadPoolExecutor.submit actually returns whatever download would return, wrapped in a Future.
I found it helpful to align the number of threads in the thread pool with the number of HTTPConnections in a connection pool (via maxsize). Otherwise you might encounter (harmless) warnings when all threads try to access the same server (as in the example).
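If you do want the return values, one possible way to collect them per URL, including any exceptions, is sketched below. download_all and fake_fetch are hypothetical names for this example; with urllib3 you would pass something like lambda u: connection_mgr.request('GET', u).data as the fetch argument.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=5):
    """Run fetch(url) in a thread pool; return ({url: result}, {url: error})."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # result() re-raises whatever fetch raised in the worker thread
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    # stand-in for a real download so the sketch runs without network access
    if "bad" in url:
        raise ValueError("cannot fetch " + url)
    return len(url)


results, errors = download_all(["http://a", "http://bad", "http://ccc"], fake_fetch)
print(results)
print(sorted(errors))
```

Leaving the with block waits for all submitted jobs, so no separate join step is needed.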
Obviously it will create one connection per thread; how else should each thread be able to fetch a page? And you try to use the same connection, made from one URL, for all URLs. That can hardly be what you intended.
This code worked just fine:
import threadpool
from urllib3 import connection_from_url
def fetch(url):
    kwargs={'retries':6}
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True)
    print url, conn.get_url(url)
    print "Done!"
pool = threadpool.ThreadPool(4)
urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]
requests = threadpool.makeRequests(fetch, urls)
[pool.putRequest(req) for req in requests]
pool.wait()
Thread programming is hard, so I wrote workerpool to make exactly what you're doing easier.
More specifically, see the Mass Downloader example.
To do the same thing with urllib3, it looks something like this:
import urllib3
import workerpool
# Use a distinct name so the WorkerPool below does not shadow the connection pool
http_pool = urllib3.connection_from_url("foo", maxsize=3)
def download(url):
    r = http_pool.get_url(url)
    # TODO: Do something with r.data
    print "Downloaded %s" % url
# Initialize a pool, 5 threads in this case
pool = workerpool.WorkerPool(size=5)
# The ``download`` method will be called with a line from the second
# parameter for each job.
pool.map(download, open("urls.txt").readlines())
# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
For more sophisticated code, have a look at workerpool.EquippedWorker (and the tests here for example usage). You can make the pool be the toolbox you pass in.
I use something like this:
#excluding setup for threadpool etc
upool = urllib3.HTTPConnectionPool('en.wikipedia.org', block=True)
urls = ['/wiki/2010-11_Premier_League',
        '/wiki/List_of_MythBusters_episodes',
        '/wiki/List_of_Top_Gear_episodes',
        '/wiki/List_of_Unicode_characters',
        ]
def fetch(path):
    # add error checking
    return upool.get_url(path).data
tpool = ThreadPool()
tpool.map_async(fetch, urls)
# either wait on the result object or give map_async a callback function for the results
# either wait on the result object or give map_async a callback function for the results
