multiprocessing broken pipe after a long time - python

I am developing a crawler based on the multiprocessing model.
It uses multiprocessing.Queue to store the URL infos that need to be crawled, the page contents that need to be parsed, and more; it uses multiprocessing.Event to control the sub processes; it uses multiprocessing.Manager.dict to store hashes of crawled URLs; and each multiprocessing.Manager.dict instance uses a multiprocessing.Lock to control access.
All three kinds of parameters are shared between the parent process and all sub processes, and they are organized in a class. I use an instance of that class to transfer the shared parameters from the parent process to the sub processes. Like this:
MGR = SyncManager()

class Global_Params():
    Queue_URL = multiprocessing.Queue()
    URL_RESULY = MGR.dict()
    URL_RESULY_Mutex = multiprocessing.Lock()
    STOP_EVENT = multiprocessing.Event()

global_params = Global_Params()
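The instance is then handed to every sub process, roughly like this (simplified; the worker body is only illustrative, not my exact code):

def crawl_worker(params):
    # each worker polls the shared queue until the stop event is set
    while not params.STOP_EVENT.is_set():
        url_info = params.Queue_URL.get()
        # ... crawl the url, push page contents to the other queues ...

worker = multiprocessing.Process(target=crawl_worker, args=(global_params,))
worker.start()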
In my own timeout mechanism, I use process.terminate() to stop any process that cannot stop by itself within a reasonable time.
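Simplified, the timeout handling looks roughly like this (illustrative, not my exact code):

def run_with_timeout(target, args, timeout):
    # start a worker and kill it if it does not finish within `timeout` seconds
    proc = multiprocessing.Process(target=target, args=args)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()   # the worker could not stop by itself
        proc.join()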
In my test case there are 2500+ target sites (some are out of service, some are huge).
The crawler goes through the target-sites file and crawls site by site.
At the beginning the crawler works well, but after a long time (sometimes 8 hours, sometimes 2 hours, sometimes more than 15 hours), when it has crawled more than 100 sites (the exact number varies), I get the error "Errno 32 broken pipe".
I have tried the following methods to locate and solve the problem:
locate the site A on which the crawler broke, then use the crawler to crawl that site separately; the crawler worked well. Even when I took a fragment (e.g. 20 sites) of the target-sites file that contained site A, the crawler worked well!
add "-X /tmp/pymp-* 240 /tmp" to /etc/cron.daily/tmpwatch
(when the break occurred, the /tmp/pymp-* files were still there)
use multiprocessing.managers.SyncManager instead of multiprocessing.Manager and ignore most signals except SIGKILL and SIGTERM
for each target site, clear most of the shared params (Queues, dicts and Event) and, if an error occurs, create new instances:
while global_params.Queue_url.qsize() > 0:
    try:
        global_params.Queue_url.get(block=False)
    except Exception as e:
        print_info(str(e))
        print_info("Clear Queue_url error!")
        time.sleep(1)
        global_params.Queue_url = Queue()
The following is the traceback info (print_info is a function I defined myself to print and store debug info):
[Errno 32] Broken pipe
Traceback (most recent call last):
File "Spider.py", line 613, in <module>
main(args)
File "Spider.py", line 565, in main
spider.start()
File "Spider.py", line 367, in start
print_info("STATIC_RESULT size:%d" % len(global_params.STATIC_RESULT))
File "<string>", line 2, in __len__
File "/usr/local/python2.7.3/lib/python2.7/multiprocessing/managers.py", line 769, in _callmethod
kind, result = conn.recv()
EOFError
I can't understand why. Does anyone know the reason?

I don't know if this fixes your problem, but there is one point worth mentioning:
global_params.Queue_url.get(block=False)
... throws a Queue.Empty exception if the Queue is empty. It's not worth recreating the Queue just because of an empty-queue exception.
The recreation of the queue can lead to race conditions.
From my point of view, you have two possibilities:
get rid of the "queue recreation" code block (see the sketch below)
switch to another Queue implementation
use:
from Queue import Queue
instead of:
from multiprocessing import Queue
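For the first option, the loop that clears the queue could simply stop when the queue is empty instead of recreating it, for example (a sketch reusing your names):

import Queue  # the Queue module provides the Empty exception

while True:
    try:
        global_params.Queue_url.get(block=False)
    except Queue.Empty:
        break  # the queue is drained, no need to build a new Queue object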

Related

Why do I get a value error when using multiprocessing on websocket streams validated with pydantic?

Why am I getting a missing value error when I try to run 2 websockets simultaneously using ProcessPoolExecutor? It's an issue I've been banging my head against for a week. I figured someone smarter than I am may have an answer or insight readily available.
Courtesy of Alpaca's API, I am trying to stream market data and trade updates simultaneously using the alpaca-py SDK. Both streams run on asyncio event loops via a websocket. The code to initialize each stream is similar. Here is what it looks like for live market data:
from alpaca.data.live.stock import StockDataStream

ticker = "SPY"
stock_data_stream = StockDataStream('API_KEY', 'API_SECRET')

# Data handler where each update will arrive
async def stock_data_handler(data_stream):
    # Do something with data_stream here
    pass

stock_data_stream.subscribe_bars(stock_data_handler, "SPY")  # Subscribe to data type
stock_data_stream.run()  # Starts asyncio event loop
The above stream runs in an infinite loop. I could not figure out a way to run another stream in parallel to receive live trade updates. I found that the market stream blocks the trade updates stream and vice versa.
I experimented a lot with running each stream on its own new event loop. I also tried to run/schedule each streaming coroutine in its own thread using to_thread() and run_coroutine_threadsafe(), to no avail. This is when I decided to turn to a more parallel approach.
So, I wrapped each stream implementation in its own function and then created a concurrent future process for each function using the ProcessPoolExecutor. A trading update should fire every time an order is canceled, so I cooked up place_trade_loop() to provide an infinite loop and ran that as a process as well.
import concurrent.futures
import time

# trading_client, trading_stream and order_request are defined elsewhere (not shown)
def place_trade_loop(order_request):
    while True:
        order_response = trading_client.submit_order(order_request)
        time.sleep(2)
        cancel_orders_response = trading_client.cancel_order_by_id(order_id=order_response.id)

def stock_stream_wrapper():
    async def stock_data_handler(data_stream):
        print(data_stream)
    stock_data_stream.subscribe_quotes(stock_data_handler, ticker)
    stock_data_stream.run()

def trading_stream_wrapper():
    async def trading_stream_handler(trade_stream):
        print(trade_stream)
    trading_stream.subscribe_trade_updates(trading_stream_handler)
    trading_stream.run()

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        f1 = executor.submit(stock_stream_wrapper)
        f2 = executor.submit(trading_stream_wrapper)
        f3 = executor.submit(place_trade_loop, order_request)
The place trade loop and the market data stream play perfectly well together. However, the following error results when an order is canceled. Again, a canceled order should result in the trading_stream_handler receiving a trade_stream.
error during websocket communication: 1 validation error for TradeUpdate
execution_id
field required (type=value_error.missing)
Traceback (most recent call last):
File "C:\Users\zachm\AppData\Local\Programs\Python\Python310\lib\site-packages\alpaca\trading\stream.py", line 172, in _run_forever
await self._consume()
File "C:\Users\zachm\AppData\Local\Programs\Python\Python310\lib\site-packages\alpaca\trading\stream.py", line 145, in _consume
await self._dispatch(msg)
File "C:\Users\zachm\AppData\Local\Programs\Python\Python310\lib\site-packages\alpaca\trading\stream.py", line 89, in _dispatch
await self._trade_updates_handler(self._cast(msg))
File "C:\Users\zachm\AppData\Local\Programs\Python\Python310\lib\site-packages\alpaca\trading\stream.py", line 103, in _cast
result = TradeUpdate(**msg.get("data"))
File "pydantic\main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for TradeUpdate
execution_id
field required (type=value_error.missing)
For reference:
alpaca-py/alpaca/trading/stream.py - Line 172
alpaca-py/alpaca/trading/stream.py - Line 145
alpaca-py/alpaca/trading/stream.py - Line 89
alpaca-py/alpaca/trading/stream.py - Line 103
alpaca-py/alpaca/trading/models.py - Line 510
pydantic/pydantic/main.py - Line 341
execution_id is part of the TradeUpdate model as shown in link #5. I believe the error is arising when **msg.get("data") is passed to TradeUpdate (link #4). There could also be something going on with the Alpaca platform because I noticed many of my canceled orders are listed as 'pending_cancel,' which seemed to be an issue others are dealing with. Alpaca gave a response near the end of that thread.
Finally, the following is from the ProcessPoolExecutor documentation, which may have something to do with the error?
The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.
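For what it's worth, here is a minimal illustration (unrelated to Alpaca, just my understanding) of what that picklability constraint means in practice:

import concurrent.futures

def top_level_task(x):
    # module-level functions can be pickled and sent to a worker process
    return x * 2

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        print(executor.submit(top_level_task, 21).result())  # prints 42
        # submitting a lambda or a nested function would fail, because they can't be pickled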
I am sorry for such a long post. Thank you in advance for any help or encouragement you can provide!

How to properly use dask's upload_file() to pass local code to workers

I have functions in a local_code.py file that I would like to pass to workers through dask. I've seen answers to questions on here saying that this can be done using the upload_file() function, but I can't seem to get it working because I'm still getting a ModuleNotFoundError.
The relevant part of the code is as follows.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from local_code import *

helper_file = '/absolute/path/to/local_code.py'

def main():
    with SLURMCluster(**slurm_params) as cluster:
        cluster.scale(n_workers)
        with Client(cluster) as client:
            client.upload_file(helper_file)
            mapping = client.map(myfunc, data)
            client.gather(mapping)

if __name__ == '__main__':
    main()
Note, myfunc is imported from local_code, and there's no error when importing it or passing it to map. The function myfunc also depends on other functions that are defined in local_code.
With this code, I'm still getting this error:
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95+\x00\x00\x00\x00\x00\x00\x00\x8c\x11local_code\x94\x8c\x$
Traceback (most recent call last):
File "/home/gallagher.r/.local/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 61, in loads
return pickle.loads(x)
ModuleNotFoundError: No module named 'local_code'
Using upload_file() seems so straightforward that I'm not sure what I'm doing wrong. I must have it in the wrong place or not be understanding correctly what is passed to it.
I'd appreciate any help with this. Please let me know if you need any other information or if there's anything else I can supply from the error file.
The upload_file method only uploads the file to the currently available workers. If a worker arrives after you call upload_file, then that worker won't have the provided file.
In your situation, the easiest thing to do is probably to wait until all of the workers have arrived before you call upload_file:
cluster.scale(n)
with Client(cluster) as client:
    client.wait_for_workers(n)
    client.upload_file(...)
Another option, when you have workers going in and out, is to use Client.register_worker_callbacks to hook into whenever a new worker is registered/added. The one caveat is that you will need to serialize your file(s) into the callback partial:
import functools

def _worker_upload(dask_worker, *, data, fname):
    dask_worker.loop.add_callback(
        callback=dask_worker.upload_file,
        comm=None,  # not used
        filename=fname,
        data=data,
        load=True)

fname = ...
with open(fname, 'rb') as f:
    data = f.read()

client.register_worker_callbacks(
    setup=functools.partial(
        _worker_upload, data=data, fname=fname,
    )
)
This will also upload the file the first time the callback is registered so you can avoid calling client.upload_file entirely.
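As a quick sanity check (just a suggestion), you can ask every worker to import the module and report where it found it, using Client.run:

import importlib

def check_import():
    # raises on any worker where the upload did not take effect
    return importlib.import_module("local_code").__file__

print(client.run(check_import))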

Handling IncompleteRead,URLError

It's a piece of a web mining script:
import httplib
import urllib2

def printer(q, missing):
    while 1:
        tmpurl = q.get()
        try:
            image = urllib2.urlopen(tmpurl).read()
        except httplib.HTTPException:
            missing.put(tmpurl)
            continue
        wf = open(tmpurl[-35:] + ".jpg", "wb")
        wf.write(image)
        wf.close()
q is a Queue() filled with URLs, and missing is an empty Queue that gathers the error-raising URLs.
It runs in parallel in 10 threads.
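The threads are started roughly like this (simplified; the start-up code is not part of my excerpt):

from Queue import Queue
from threading import Thread

q = Queue()        # filled with image URLs elsewhere
missing = Queue()  # collects the error-raising URLs

for _ in range(10):
    t = Thread(target=printer, args=(q, missing))
    t.daemon = True
    t.start()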
Every time I run it, I get this:
File "C:\Python27\lib\socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "C:\Python27\lib\httplib.py", line 541, in read
return self._read_chunked(amt)
File "C:\Python27\lib\httplib.py", line 592, in _read_chunked
value.append(self._safe_read(amt))
File "C:\Python27\lib\httplib.py", line 649, in _safe_read
raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(5274 bytes read, 2918 more expected)
But I do use the except clause...
I tried other things, like also catching
httplib.IncompleteRead
urllib2.URLError
and even
image = urllib2.urlopen(tmpurl, timeout=999999).read()
but none of this works.
How can I catch the IncompleteRead and URLError?
I think the correct answer to this question depends on what you consider an "error-raising URL".
Methods of catching multiple exceptions
If you think any URL which raises an exception should be added to the missing queue then you can do:
try:
    image = urllib2.urlopen(tmpurl).read()
except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
    missing.put(tmpurl)
    continue
This will catch any of those three exceptions and add that url to the missing queue. More simply you could do:
try:
    image = urllib2.urlopen(tmpurl).read()
except:
    missing.put(tmpurl)
    continue
to catch any exception, but that is not considered Pythonic and could hide other possible errors in your code.
If by "error-raising URL" you mean any URL that raises an httplib.HTTPException error but you'd still like to keep processing if the other errors are received then you can do:
try:
image=urllib2.urlopen(tmpurl).read()
except httplib.HTTPException:
missing.put(tmpurl)
continue
except (httplib.IncompleteRead, urllib2.URLError):
continue
This will only add the URL to the missing queue if it raises an httplib.HTTPException but will otherwise catch httplib.IncompleteRead and urllib.URLError and keep your script from crashing.
Iterating over a Queue
As an aside, while 1 loops are always a bit concerning to me. You should be able to loop through the Queue contents using the following pattern (it stops once a "STOP" sentinel is pulled from the queue), though you're free to continue doing it your way:

for tmpurl in iter(q.get, "STOP"):
    # rest of your code goes here
    pass
Safely working with files
As another aside, unless it's absolutely necessary to do otherwise, you should use context managers to open and modify files. So your three file-operation lines would become:
with open(tmpurl[-35:] + ".jpg", "wb") as wf:
    wf.write(image)
The context manager takes care of closing the file, and will do so even if an exception occurs while writing to the file.
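Putting those pieces together, the worker function could look something like this (a sketch based on the snippets above, treating every failing URL as "missing"):

import httplib
import urllib2

def printer(q, missing):
    # stop once a "STOP" sentinel is pulled from the queue
    for tmpurl in iter(q.get, "STOP"):
        try:
            image = urllib2.urlopen(tmpurl).read()
        except (httplib.HTTPException, urllib2.URLError):
            # IncompleteRead is a subclass of HTTPException, so it is covered here
            missing.put(tmpurl)
            continue
        with open(tmpurl[-35:] + ".jpg", "wb") as wf:
            wf.write(image)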

Timed out after 30000ms

When I use Selenium RC, I sometimes get an error, but sometimes not. I guess it's related to the timing of wait_for_page_to_load(), but I don't know how long it needs.
The error information:
Exception: Timed out after 30000ms
File "C:\Users\Herta\Desktop\test\newtest.py", line 9, in <module>
sel.open(url)
File "C:\Users\Herta\Desktop\test\selenium.py", line 764, in open
self.do_command("open", [url,])
File "C:\Users\Herta\Desktop\test\selenium.py", line 215, in do_command
raise Exception, data
This is my program:
from selenium import selenium
url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor#'
sel = selenium('localhost', 4444, '*firefox', url)
sel.start()
sel.open(url)
sel.wait_for_page_to_load(1000)
f = sel.get_html_source()
sav = open('test.html','w')
sav.write(f)
sav.close()
sel.stop()
Timing is a big issue when automating UI pages. You want to make sure you use timeouts where needed and allow enough time for certain events. I see that you have
sel.open(url)
sel.wait_for_page_to_load(1000)
The sel.wait_for_page_to_load command after a sel.open call is redundant. All sel.open commands have a built-in wait. This may be the cause of your problem: Selenium already waits as part of the sel.open command and is then told to wait again for a page to load, and since no new page is being loaded, it throws an error.
However, this is unlikely, since the traceback points at the sel.open command. Wawa's response may be your best bet.
The "Timed out after 30000ms" message is coming from the sel.open(url) call which uses the selenium default timeout. Try increasing this time using sel.set_timeout("timeout"). I would suggest 60 seconds as a good starting point, if 60 seconds doesn't work, try increasing the timeout. Also make sure that you can get to the page normally.
from selenium import selenium
url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor#'
sel = selenium('localhost', 4444, '*firefox', url)
sel.set_timeout('60000')
sel.start()
sel.open(url)
sel.wait_for_page_to_load(1000)
f = sel.get_html_source()
sav = open('test.html','w')
sav.write(f)
sav.close()
sel.stop()
I had this problem and it was the Windows firewall blocking the Selenium server. Have you tried adding an exception to your firewall?

Threading in a django app

I'm trying to create what I call a scrobbler. The task is to read a Delicious user from a queue, fetch all of their bookmarks and put them in the bookmarks queue. Then something should go through that queue, do some parsing and then store the data in a database.
This obviously calls for threading, because most of the time is spent waiting for Delicious to respond and then for the bookmarked websites to respond and be passed through some APIs, and it would be silly for everything to wait on that.
However I am having trouble with threading and keep getting strange errors like database tables not being defined. Any help is appreciated :)
Here's the relevant code:
from threading import Thread, Semaphore
from django.db import models

# relevant model #
class Bookmark(models.Model):
    account = models.ForeignKey( Delicious )
    url = models.CharField( max_length=4096 )
    tags = models.TextField()
    hash = models.CharField( max_length=32 )
    meta = models.CharField( max_length=32 )

# bookmark queue reading #
def scrobble_bookmark(account):
    try:
        bookmark = Bookmark.objects.all()[0]
    except Bookmark.DoesNotExist:
        return False
    bookmark.delete()

    tags = bookmark.tags.split(' ')
    user = bookmark.account.user

    for concept in Concepts.extract( bookmark.url ):
        for tag in tags:
            Concepts.relate( user, concept['name'], tag )

    return True

def scrobble_bookmarks(account):
    semaphore = Semaphore(10)
    for i in xrange(Bookmark.objects.count()):
        thread = Bookmark_scrobble(account, semaphore)
        thread.start()

class Bookmark_scrobble(Thread):
    def __init__(self, account, semaphore):
        Thread.__init__(self)
        self.account = account
        self.semaphore = semaphore

    def run(self):
        self.semaphore.acquire()
        try:
            scrobble_bookmark(self.account)
        finally:
            self.semaphore.release()
This is the error I get:
Exception in thread Thread-65:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 525, in __bootstrap_inner
self.run()
File "/home/swizec/Documents/trees/bookmarklet_server/../bookmarklet_server/Scrobbler/Scrobbler.py", line 60, in run
scrobble_bookmark(self.account)
File "/home/swizec/Documents/trees/bookmarklet_server/../bookmarklet_server/Scrobbler/Scrobbler.py", line 28, in scrobble_bookmark
bookmark = Bookmark.objects.all()[0]
File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 152, in __getitem__
return list(qs)[0]
File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 76, in __len__
self._result_cache.extend(list(self._iter))
File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 231, in iterator
for row in self.query.results_iter():
File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.py", line 281, in results_iter
for rows in self.execute_sql(MULTI):
File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.py", line 2373, in execute_sql
cursor.execute(sql, params)
File "/usr/local/lib/python2.6/dist-packages/django/db/backends/sqlite3/base.py", line 193, in execute
return Database.Cursor.execute(self, query, params)
OperationalError: no such table: Scrobbler_bookmark
PS: all other tests depending on the same table pass with flying colours.
You can not use threading with in-memory databases (SQLite3 in this case) in Django; see this bug. It might work with PostgreSQL or MySQL.
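For example, pointing the project at PostgreSQL is just a settings change (assuming a Django version that uses the DATABASES setting; the credentials below are placeholders):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'scrobbler',
        'USER': 'scrobbler',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}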
I'd recommend something like celeryd instead of threads; message queues are MUCH easier to work with than threading.
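A minimal sketch of what that could look like (newer Celery API; the names and broker URL are placeholders, not a drop-in replacement):

# tasks.py
from celery import Celery

app = Celery('scrobbler', broker='amqp://localhost')

@app.task
def scrobble_bookmark_task(account_id):
    # each task runs in a worker process with its own database connection,
    # so the in-memory SQLite problem described above does not apply
    from Scrobbler.Scrobbler import scrobble_bookmark
    scrobble_bookmark(account_id)

# enqueue work instead of starting threads:
# scrobble_bookmark_task.delay(account.id)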
This calls for a task queue, though not necessarily threads. You'll have your server process, one or several scrobbler processes, and a queue that lets them communicate. The queue can be in the database, or something separate like a beanstalkd. All this has nothing to do with your error, which sounds like your database is just misconfigured.
1) Does the error persist if you use a real database, not SQLite?
2) If you're using threads, you might need to create separate SQL cursors for use in threads.
I think the table really doesn't exist; you have to create it first with an SQL command or in some other way. As I have a small database for testing different modules, I just delete the database and recreate it using the syncdb command.
