Cleanly setting max_retries on Python requests get or post method - python

Related to older version of requests question: Can I set max_retries for requests.request?
I have not seen an example to cleanly incorporate max_retries in a requests.get() or requests.post() call.
Would love a
requests.get(url, max_retries=num_max_retries))
implementation

A quick search of the python-requests docs will reveal exactly how to set max_retries when using a Session.
To pull the code directly from the documentation:
import requests
s = requests.Session()
a = requests.adapters.HTTPAdapter(max_retries=3)
b = requests.adapters.HTTPAdapter(max_retries=3)
s.mount('http://', a)
s.mount('https://', b)
s.get(url)
What you're looking for, however, is not configurable for several reasons:
Requests no longer provides a means for configuration
The number of retries is specific to the adapter being used, not to the session or the particular request.
If one request needs one particular maximum number of requests, that should be sufficient for a different request.
This change was introduced in requests 1.0 over a year ago. We kept it for 2.0 purposefully because it makes the most sense. We also will not be introducing a parameter to configure the maximum number of retries or anything else, in case you were thinking of asking.
Edit Using a similar method you can achieve a much finer control over how retries work. You can read this to get a good feel for it. In short, you'll need to import the Retry class from urllib3 (see below) and tell it how to behave. We pass that on to urllib3 and you will have a better set of options to deal with retries.
from requests.packages.urllib3 import Retry
import requests
# Create a session
s = requests.Session()
# Define your retries for http and https urls
http_retries = Retry(...)
https_retries = Retry(...)
# Create adapters with the retry logic for each
http = requests.adapters.HTTPAdapter(max_retries=http_retries)
https = requests.adapters.HTTPAdapter(max_retries=https_retries)
# Replace the session's original adapters
s.mount('http://', http)
s.mount('https://', https)
# Start using the session
s.get(url)

Related

History of retries using Request Library

I'm building a new retry feature in my Orchestrate script and I want to know how many times, and, if possible, what error my request method got when trying to connect to a specific URL.
For now, I need this for logging purposes, because I'm working on a messaging system and I may need this 'retry' information to understand when and why I'm facing any kind of problem in HTTP requests, once I work in a micro-service environment.
So far, I debugged and certify that retries are working as expected (I have a mocked flask server for all micro services that we use), but I couldn't find a way to got the 'retries history' data.
In other words, for example, I want to see if a specific micro-service may respond only after the third request, and those kind of thing.
Below is the code that I'm using now:
from requests import exceptions, Session
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
def open_request_session():
# Default retries configs
toggle = True #loaded from config file
session = Session()
if toggle:
# loaded from config file as
parameters = {'total': 5,
'backoff_factor': 0.0,
'http_error_to_retry': [400]}well
retries = Retry(total=parameters['total'],
backoff_factor=parameters['backoff_factor'],
status_forcelist=parameters['http_error_to_retry'],
# Do not force an exception when False
raise_on_status=False)
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))
return session
# Request
with open_request_session() as request:
my_response = request.get(url, timeout=10)
I see in the urllib3 documentation that Retry has a history attribute, but when I try to consult the attribute it is empty.
I don't know if I'm doing wrong or forgetting something, once Software Development is not my best skill.
So, I have two questions:
Does anyone know a way to got this history information?
How can I do create tests to verify if the retries behavior is working as expected? (So far I only test in debug mode)
I'm using Python 3.6.8.
I know that I can create a while statement to 'control' this, but I'm trying to avoid complexity. And this is why I'm here, I'm looking for an alternative based on Python and community best practices.
A bit late, but I just figured this out myself so thought I'd share what I found.
Short answer:
response.raw.retries.history
will get you what you are looking for.
Long answer:
You cannot get the history off the original Retry instance created. Under the covers, urllib3 creates a new Retry instance for every attempt.
urllib3 does store the last Retry instance on the response when one is returned. However, the response from the requests library is a wrapper around the urllib3 response. Luckily, requests stores the original urllib3 response on the raw field.

Get HTTP request message from Request object in Scrapy

I need to somehow extract plain HTTP request message from a Request object in Scrapy (so that I could, for example, copy/paste this request and run from Burp).
So given a scrapy.http.Request object, I would like to get the corresponding request message, such as e.g.
POST /test/demo_form.php HTTP/1.1
Host: w3schools.com
name1=value1&name2=value2
Clearly I have all the information I need in the Request object, however trying to reconstruct the message manually is error-prone as I could miss some edge cases. My understanding is that Scrapy first converts this Request into Twisted object, which then writes headers and body into a TCP transport. So maybe there's away to do something similar, but write to a string instead?
UPDATE
I could use the following code to get HTTP 1.0 request message, which is based on http.py. Is there a way to do something similar with HTTP 1.1 requests / http11.py, which is what's actually being sent? I would obviously like to avoid duplicating code from Scrapy/Twisted frameworks as much as possible.
factory = webclient.ScrapyHTTPClientFactory(request)
transport = StringTransport()
protocol = webclient.ScrapyHTTPPageGetter()
protocol.factory = factory protocol.makeConnection(transport)
request_message = transport.value()
print(request_message.decode("utf-8"))
As scrapy is open source and also has plenty of extension points, this should be doable.
The requests are finally assembled and sent out in scrapy/core/downloader/handlers/http11.py in ScrapyAgent.download_request ( https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py#L270 )
If you place your hook there you can dump the request type, request headers, and request body.
To place your code there you can either try monkey patching ScrapyAgent.download_request or to subclass ScrapyAgent to do the request logging, then subclass HTTP11DownloadHandler to use your Scrapy Agent and then set HTTP11DownloadHandler as new DOWNLOAD_HANDLER for http / https requests in your project's settings.py (for details see: https://doc.scrapy.org/en/latest/topics/settings.html#download-handlers)
In my opinion this is the closest you can get to logging the requests going out without using a packet sniffer or a logging proxy (which might be a bit overkill for your scenario).

Python - Using nonces with multithreading

I am using python 2 with requests. This question is more of a curiosity of how I can improve this performance.
The issue now is that I must send a cryptographic signature in the header of the request to a HTTPS server. This signature includes a "nonce" which must be a timestamp, and ALWAYS must increase (on the server side).
Obviously this can wreak havoc on running multiple HTTP sessions on multiple threads. Requests ended up sent out not in order because they get interrupted between generating the headers and sending the HTTPS POST request.
The solution is to lock the thread from before creating the signature till the end of recieving HTTPS data. Ideally, I would like to release the LOCK after the HTTP request was SENT, and not have to wait for the data to be recieved. Is there any way I can release the lock, using requests, after just the HTTP headers are SENT? See code sample:
self.lock is a Threading.Lock. This instance of this class (self) is shared amongst multiple Threads.
def get_nonce(self):
return int(1000*time.time())
def do_post_request(self, endpoint, parameters):
with self.lock:
url = self.base + endpoint
urlpath = endpoint
parameters['nonce'] = self.get_nonce()
postdata = urllib.urlencode(parameters)
message = urlpath + hashlib.sha256(str(parameters['nonce']) + postdata).digest()
signature = hmac.new(base64.b64decode(self.secret_key), message, hashlib.sha512)
headers = {
'API-Key': self.api_key,
'API-Sign': base64.b64encode(signature.digest())
}
data = urllib.urlencode(parameters)
response = requests.post(url, data=data, headers=headers, verify=True).json()
return response
It sounds like the requests library doesn't have any support for sending asynchronously.
With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.
If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python’s asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
I saw in a comment that you hesitate to add more dependencies, so the only suggestions I have are:
Add retry logic when your nonce is rejected. This seems like the most pythonic solution, and should work fine as long as the nonce isn't rejected very often.
Throttle the nonce generator. Hold the timestamp used for the previous nonce, and sleep if it hasn't been long enough when the next nonce is requested.
Batch the messages. If the protocol allows it, you may find that throughput actually goes up when you add a delay to wait for other messages and send them as a batch.
Change the server so the nonce values don't have to increase. If you control the server, making the messages independent of each other will give you a much more flexible protocol.
Use a session pool. I'm guessing that the nonce values only have to increase within a single session. If you create a thread pool and have each thread open its own session, you could still have reasonable throughput without the timing problems you currently have.
Obviously, you'd have to measure the performance results of making these changes.
Even if you do decide to add a dependency that lets you release the lock after sending the headers, you may still find that you occasionally have timing issues. The message packets with the headers could be delayed on their way to the server.

Create url without request execution

I am currently using the python requests package to make JSON requests. Unfortunately, the service which I need to query has a daily maximum request limit. Right know, I cache the executed request urls, so in case I come go beyond this limit, I know where to continue the next day.
r = requests.get('http://someurl.com', params=request_parameters)
log.append(r.url)
However, for using this log the the next day I need to create the request urls in my program before actually doing the requests so I can match them against the strings in the log. Otherwise, it would decrease my daily limit. Does anybody of you have an idea how to do this? I didn't find any appropriate method in the request package.
You can use PreparedRequests.
To build the URL, you can build your own Request object and prepare it:
from requests import Session, Request
s = Session()
p = Request('GET', 'http://someurl.com', params=request_parameters).prepare()
log.append(p.url)
Later, when you're ready to send, you can just do this:
r = s.send(p)
The relevant section of the documentation is here.

how to use two level proxy setting in Python?

I am working on web-crawler [using python].
Situation is, for example, I am behind server-1 and I use proxy setting to connect to the Outside world. So in Python, using proxy-handler I can fetch the urls.
Now thing is, I am building a crawler so I cannot use only one IP [otherwise I will be blocked]. To solve this, I have bunch of Proxies, I want to shuffle through.
My question is: This is two level proxy, one to connect to main server-1, I use proxy and then after to shuffle through proxies, I want to use proxy. How can I achieve this?
Update Sounds like you're looking to connect to proxy A and from there initiate HTTP connections via proxies B, C, D which are outside of A. You might look into the proxychains project which says it can "tunnel any protocol via a user-defined chain of TOR, SOCKS 4/5, and HTTP proxies".
Version 3.1 is available as a package in Ubuntu Lucid. If it doesn't work directly for you, the proxychains source code may provide some insight into how this capability could be implemented for your app.
Orig answer:
Check out the urllib2.ProxyHandler. Here is an example of how you can use several different proxies to open urls:
import random
import urllib2
# put the urls for all of your proxies in a list
proxies = ['http://localhost:8080/']
# construct your list of url openers which each use a different proxy
openers = []
for proxy in proxies:
opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
openers.append(opener)
# select a url opener randomly, round-robin, or with some other scheme
opener = random.choice(openers)
req = urllib2.Request(url)
res = opener.open(req)
I recommend you take a look at CherryProxy. It lets you send a proxy request to an intermediate server (where CherryProxy is running) and then forward your HTTP request to a proxy on a second level machine (e.g. squid proxy on another server) for processing. Viola! A two-level proxy chain.
http://www.decalage.info/python/cherryproxy

Categories