I have some concurrent code using gevent:
gevent.monkey.patch_all()
....
jobs = [gevent.spawn(self.generate_resource_cache, idx, resource, id_fields_map[resource])
        for idx, resource in enumerate(available_resources)]
gevent.joinall(jobs)
My generate_resource_cache method uses the requests library, which is why I have to monkey-patch with gevent.
With requests, I'm using custom session:
def get_session(self):
    s = Settings.Instance()
    if not self.session:
        retry = Retry(total=s.TOTAL_RETRIES, backoff_factor=s.BACKOFF_FACTOR,
                      status_forcelist=(range(400, 421) + range(500, 505)))
        size = s.CONCURRENT_SIZE
        adapter = requests.adapters.HTTPAdapter(pool_connections=size, pool_maxsize=size,
                                                pool_block=True, max_retries=retry)
        self.session = requests_cache.CachedSession(s.CACHE_NAME, backend='sqlite',
            fast_save=s.FAST_SAVE, allowable_methods=('GET', 'POST')) if s.CACHING else requests.Session()
        if s.PROXIES:
            self.session.proxies = s.PROXIES
        self.session.headers.update({'X_HTTP_METHOD_OVERRIDE': 'get'})
        self.session.mount('http://', adapter)
        self.session.mount('https://', adapter)
    return self.session
After a few hours of running, this script prints to stdout:
WARNING:root:epoll module not found; using select()
And to stderr:
Exception caught: This operation would block forever
I'm wondering what I can do to better understand why I'm getting this exception in the first place, and how I can prevent it from happening.
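One way to narrow this down is to bound each greenlet and each HTTP call with an explicit timeout, so a hung socket shows up as a logged timeout instead of a greenlet that never finishes (which is what ultimately triggers the "would block forever" error). The sketch below only illustrates that idea, reusing the names from the question; the 300-second ceiling, the wrapper function, and the logging call are assumptions, not part of the original code:
import logging

import gevent
from gevent import monkey
monkey.patch_all()

def run_with_timeout(func, *args, **kwargs):
    # Assumed per-job ceiling; tune it to what a single resource should need.
    try:
        with gevent.Timeout(300):
            return func(*args, **kwargs)
    except gevent.Timeout:
        logging.warning("job with args %r timed out", args)
        return None

jobs = [gevent.spawn(run_with_timeout, self.generate_resource_cache,
                     idx, resource, id_fields_map[resource])
        for idx, resource in enumerate(available_resources)]
gevent.joinall(jobs)
Passing an explicit timeout= to every get/post made inside generate_resource_cache (if it does not already pass one) has the same effect at the socket level.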
Related
I'm trying to learn how I can use a timeout within a session while sending requests. The way I've tried it below can fetch the content of a webpage, but I'm not sure this is the right way, as I could not find the usage of timeout in this documentation.
import requests

link = "https://stackoverflow.com/questions/tagged/web-scraping"
with requests.Session() as s:
    r = s.get(link, timeout=5)
    print(r.text)
How can I use timeout within session?
According to the documentation (Quickstart):
You can tell Requests to stop waiting for a response after a given
number of seconds with the timeout parameter. Nearly all production code should use this parameter in nearly all requests.
requests.get('https://github.com/', timeout=0.001)
Or, from the documentation (Advanced Usage), you can set two values (connect and read timeouts):
The timeout value will be applied to both the connect and the read
timeouts. Specify a tuple if you would like to set the values
separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
Making Session Wide Timeout
I searched throughout the documentation and it seems it is not possible to set the timeout parameter session-wide.
But there is an open GitHub issue (Consider making Timeout option required or have a default) which provides a workaround: an HTTPAdapter subclass you can use like this:
import requests
from requests.adapters import HTTPAdapter


class TimeoutHTTPAdapter(HTTPAdapter):
    def __init__(self, *args, **kwargs):
        if "timeout" in kwargs:
            self.timeout = kwargs["timeout"]
            del kwargs["timeout"]
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        timeout = kwargs.get("timeout")
        if timeout is None and hasattr(self, 'timeout'):
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)
And mount it on a requests.Session():
s = requests.Session()
s.mount('http://', TimeoutHTTPAdapter(timeout=5)) # 5 seconds
s.mount('https://', TimeoutHTTPAdapter(timeout=5))
...
r = s.get(link)
print(r.text)
Or, similarly, you can use the EnhancedSession proposed by @GordonAitchJay:
with EnhancedSession(5) as s:  # 5 seconds
    r = s.get(link)
    print(r.text)
I'm not sure this is the right way as I could not find the usage of timeout in this documentation.
Scroll to the bottom. It's definitely there. You can search for it in the page by pressing Ctrl+F and entering timeout.
You're using timeout correctly in your code example.
You can actually specify the timeout in a few different ways, as explained in the documentation:
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.
r = requests.get('https://github.com', timeout=None)
Try using https://httpstat.us/200?sleep=5000 to test your code.
For example, this raises an exception because 0.2 seconds is not long enough to establish a connection with the server:
import requests

link = "https://httpstat.us/200?sleep=5000"
with requests.Session() as s:
    try:
        r = s.get(link, timeout=(0.2, 10))
        print(r.text)
    except requests.exceptions.Timeout as e:
        print(e)
Output:
HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=0.2)
This raises an exception because the server waits for 5 seconds before sending the response, which is longer than the 2-second read timeout that was set:
import requests

link = "https://httpstat.us/200?sleep=5000"
with requests.Session() as s:
    try:
        r = s.get(link, timeout=(3.05, 2))
        print(r.text)
    except requests.exceptions.Timeout as e:
        print(e)
Output:
HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=2)
You specifically mention using a timeout within a session. So maybe you want a session object which has a default timeout. Something like this:
import requests

link = "https://httpstat.us/200?sleep=5000"


class EnhancedSession(requests.Session):
    def __init__(self, timeout=(3.05, 4)):
        self.timeout = timeout
        return super().__init__()

    def request(self, method, url, **kwargs):
        print("EnhancedSession request")
        if "timeout" not in kwargs:
            kwargs["timeout"] = self.timeout
        return super().request(method, url, **kwargs)


session = EnhancedSession()

try:
    response = session.get(link)
    print(response)
except requests.exceptions.Timeout as e:
    print(e)

try:
    response = session.get(link, timeout=1)
    print(response)
except requests.exceptions.Timeout as e:
    print(e)

try:
    response = session.get(link, timeout=10)
    print(response)
except requests.exceptions.Timeout as e:
    print(e)
Output:
EnhancedSession request
HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=4)
EnhancedSession request
HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=1)
EnhancedSession request
<Response [200]>
I want to have fault tolerance for requests.get(url), with a maximum of 3 retries.
Currently, I create a new session and pass that around between methods, which I would like to avoid:
with requests.Session() as rs:
    rs.mount('https://', HTTPAdapter(max_retries=3))
    rs.get(url)
    ...
Is there any way to configure the requests.get(url) call, such that it retries requests to the server in case it fails?
This might be a dirty implementation:
import requests
from time import sleep
while True:
    try:
        # requests here
        break
    except:
        sleep(1)
But with a retry limit of 3:
import requests
from time import sleep
for i in range(3):
    try:
        # requests here
        break
    except:
        sleep(1)
A good practice:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry


def requests_retry_session(
    retries=3,
    backoff_factor=0.3,
    status_forcelist=(500, 502, 504),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


response = requests_retry_session().get('https://www.example.com/')
print(response.status_code)

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

response = requests_retry_session(session=s).get(
    'https://www.example.com'
)
Each retry attempt will create a new Retry object with updated values, so they can be safely reused.
The remaining Retry options are covered in the urllib3 documentation.
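As a quick sanity check of the retry behaviour (a sketch only, assuming the httpstat.us test service used earlier in this thread is reachable), a request to an endpoint that always answers 500 should be retried the configured number of times and then surface as requests.exceptions.RetryError:
from requests.exceptions import RetryError

# Uses requests_retry_session() from above; https://httpstat.us/500 always
# answers with HTTP 500, so urllib3 exhausts the retries and requests
# re-raises the failure as RetryError.
session = requests_retry_session(retries=2, status_forcelist=(500,))
try:
    session.get('https://httpstat.us/500')
except RetryError as exc:
    print("gave up after retries:", exc)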
Note that patching the HTTPAdapter.__init__() defaults will also work, although this approach is very much not recommended: you can override HTTPAdapter.__init__ with a version that sets a default value for the max_retries argument via functools.partialmethod, so that the desired retry behavior is applied globally:
import requests
from functools import partialmethod

requests.adapters.HTTPAdapter.__init__ = partialmethod(
    requests.adapters.HTTPAdapter.__init__, max_retries=3)
...
requests.get(url)
I have a Tornado server listening on port 6789 for POST requests on "/train" and "/predict". The train method might take up to 3 hours to complete, while predict might return in 2 minutes. I want them to be handled concurrently, so that even while "/train" is running, a POST request for "/predict" can be handled concurrently and return its output without waiting for "/train" to complete.
I have tried using ThreadPool but it still doesn't run concurrently.
My present code is as follows. It functions, but if a request to train is made and then a request to predict is made, it waits for train to complete before handling predict. Assume the train and predict functions are present and don't take any parameters.
import logging
import time
import threading
from multiprocessing.pool import ThreadPool

import flask
from tornado import wsgi, httpserver, ioloop
from flask import Flask

from train_script import train
from predict_script import predict

app = Flask(__name__)


@app.route("/train", methods=['POST'])
def train_run():
    payload = flask.request.get_json(silent=True)
    if payload is not None:
        try:
            async_result = pool.apply_async(train)
            response = async_result.get()
            resp = flask.jsonify(response)
            resp.status_code = 200
        except Exception as ex:
            resp = flask.jsonify({"status": "Failure"})
            resp.status_code = 500
    else:
        resp = flask.jsonify({"status": "Failure"})
        resp.status_code = 500
    return resp


@app.route("/predict", methods=['POST'])
def predict_run():
    payload = flask.request.get_json(silent=True)
    if payload is not None:
        try:
            async_result = pool.apply_async(predict)
            response = async_result.get()
            resp = flask.jsonify(response)
            resp.status_code = 200
        except Exception as ex:
            resp = flask.jsonify({"status": "Failure"})
            resp.status_code = 500
    else:
        resp = flask.jsonify({"status": "Failure"})
        resp.status_code = 500
    return resp


if __name__ == "__main__":
    port = 6789
    http_server = httpserver.HTTPServer(wsgi.WSGIContainer(app))
    pool = ThreadPool(processes=10)  # Expects max concurrent requests to be 10
    http_server.listen(port)
    logging.info("Tornado server starting on port {}".format(port))
    ioloop.IOLoop.instance().start()
Tornado's WSGIContainer does not support any kind of concurrency. Either use Tornado's RequestHandler interfaces without Flask or WSGI, or use Flask with gunicorn or uwsgi. You gain almost nothing and lose a lot by combining Tornado with WSGI frameworks, so this is only useful in certain specialized situations.
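For the "native Tornado" route, a minimal sketch might look like the following (assuming Tornado 5+ on Python 3, and that train and predict are the same blocking functions imported in the question and return something RequestHandler.write accepts, such as a dict or string); each handler pushes the blocking call onto a thread pool so a long-running /train does not stall /predict:
from concurrent.futures import ThreadPoolExecutor

from tornado import ioloop, web

from train_script import train
from predict_script import predict

# One shared pool for the blocking work; the size is an assumption.
executor = ThreadPoolExecutor(max_workers=10)


class TrainHandler(web.RequestHandler):
    async def post(self):
        # Runs train() on a worker thread; the IOLoop keeps serving /predict.
        result = await ioloop.IOLoop.current().run_in_executor(executor, train)
        self.write(result)


class PredictHandler(web.RequestHandler):
    async def post(self):
        result = await ioloop.IOLoop.current().run_in_executor(executor, predict)
        self.write(result)


if __name__ == "__main__":
    app = web.Application([(r"/train", TrainHandler), (r"/predict", PredictHandler)])
    app.listen(6789)
    ioloop.IOLoop.current().start()
Alternatively, keeping the Flask code as-is and serving it with something like gunicorn --workers 2 --threads 10 gives similar request-level concurrency without Tornado.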
I'm using the multiprocessing library (I'm not new to Python, but I am new to multiprocessing). It seems that I lack an understanding of how it works.
What I'm trying to do: I send a lot of HTTP requests to a server, and if I receive a connection error it means the remote service is down, so I restart it using paramiko and then resend the request. I use multiprocessing to load all available processors, because there are about 70,000 requests and it takes about 24 hours to process them all on one processor.
My code:
# Send request here
def send_request(server, url, data, timeout):
    try:
        return requests.post(server + url, json=data, timeout=(timeout or 60))
    except Exception:
        return None


# Try to get json from response
def do_gw_requests(request, data):
    timeout = 0
    response = send_request(server, request, data, timeout)
    if response is not None:
        response_json = json.loads(response.text)
    else:
        response_json = None
    return response_json


# Function that recalls itself if service is down
def safe_build(data):
    exception_message = ""
    response = {}
    try:
        response = do_gw_requests("/rgw_find_route", data)
        if response is None:
            # Function that uses paramiko to start service
            # It will not end until service is up
            start_service()
            while response is None:
                safe_build(data)
    --some other work here--
    return response, exception_message


# Multiprocessing lines in main function
pool = Pool(2)
# build_single_route prepares data, calls safe_build once and writes logs
result = pool.map_async(build_single_route, args)
pool.close()
pool.join()
My problem is that if the service is already down at the start of the script (and potentially if it goes down in the middle of the script's work), I can't get a non-empty response for the first two requests. The script starts, sends the first two requests (I send them in a loop, two at a time), finds out that the service is down (response becomes None), restarts the service, resends the requests, and seemingly gets None again and again (in an endless loop). If I remove the loop while response is None: then the first two requests are processed as if they were None, and the other requests are processed as expected. But I need the result of every request, which is why I resend bad requests.
So it recalls the function with the same data again and again, without success. That seems very strange to me. Can anyone please explain what I am doing wrong here?
It seems the problem is not with the behavior of the Pool workers, as I had expected. response is a local variable of the function, so although it becomes non-None after the service is revived in the second call to safe_build, it is still None in the first call. response, _ = safe_build(data) seems to work.
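In other words, assign the return value instead of relying on the recursive call to change the local variable. A sketch of that fix, keeping the original names and replacing the recursion with a plain loop (the parts elided in the original function are left out):
def safe_build(data):
    exception_message = ""
    response = do_gw_requests("/rgw_find_route", data)
    while response is None:
        # Service is down: restart it, then resend the same request and
        # keep the returned value this time.
        start_service()
        response = do_gw_requests("/rgw_find_route", data)
    return response, exception_message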
The code below is an HTTP proxy for content filtering. It uses GET to send the URL of the current site to the server, where it processes it and responds. It runs VERY, VERY, VERY slowly. Any ideas on how to make it faster?
Here is the code:
from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest
from Tkinter import *
#import win32api
import urllib2
import urllib
import os
import sys
import webbrowser

cwd = os.path.abspath(sys.argv[0])[0]
proxies = {}
user = "zachb"


class BlockingProxyRequest(ProxyRequest):
    def process(self):
        params = {}
        params['Location'] = self.uri
        params['User'] = user
        params = urllib.urlencode(params)
        req = urllib.urlopen("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params, proxies=proxies)
        resp = req.read()
        req.close()
        if resp == "allow":
            pass
        else:
            self.transport.write('''BLOCKED BY ADMIN!''')
            self.transport.loseConnection()
        ProxyRequest.process(self)


class BlockingProxy(Proxy):
    requestFactory = BlockingProxyRequest


factory = http.HTTPFactory()
factory.protocol = BlockingProxy
reactor.listenTCP(8000, factory)
reactor.run()
Anyone have any ideas on how to make this run faster? Or even a better way to write it?
The main cause of slowness in this proxy is probably these three lines:
req = urllib.urlopen("http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params, proxies=proxies)
resp = req.read()
req.close()
A normal Twisted-based application is single threaded. You have to go out of your way to get threads involved. That means that whenever a request comes in, you are blocking the one and only processing thread on this HTTP request. No further requests are processed until this HTTP request completes.
Try using one of the APIs in twisted.web.client (e.g. Agent or getPage). These APIs don't block, so your server will handle concurrent requests concurrently. This should translate into much smaller response times.
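A rough sketch of what the non-blocking version could look like with Agent (illustrative only: it keeps the original variable names and the Python 2-era style of the question, and glosses over error handling and exact Twisted version details):
from twisted.internet import reactor
from twisted.web import http
from twisted.web.client import Agent, readBody
from twisted.web.proxy import Proxy, ProxyRequest
import urllib

user = "zachb"
agent = Agent(reactor)


class NonBlockingProxyRequest(ProxyRequest):
    def process(self):
        params = urllib.urlencode({'Location': self.uri, 'User': user})
        d = agent.request(
            'GET',
            "http://weblock.zbrowntechnology.info/ProgFiles/stats.php?%s" % params)
        d.addCallback(readBody)            # fires with the response body
        d.addCallback(self._checkVerdict)
        d.addErrback(self._onError)

    def _checkVerdict(self, body):
        if body.strip() == "allow":
            ProxyRequest.process(self)     # forward the request as usual
        else:
            self.transport.write('''BLOCKED BY ADMIN!''')
            self.transport.loseConnection()

    def _onError(self, failure):
        # If the filter service cannot be reached, fail closed and drop the client.
        self.transport.loseConnection()


class NonBlockingProxy(Proxy):
    requestFactory = NonBlockingProxyRequest


factory = http.HTTPFactory()
factory.protocol = NonBlockingProxy
reactor.listenTCP(8000, factory)
reactor.run()
Because the filtering check now returns a Deferred, the reactor is free to accept and process other proxy connections while each check is in flight.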