Set DNS timeout for HTTP requests using requests library - python

I have a function that is meant to check if a specific HTTP(S) URL is a redirect and if so return the new location (but not recursively). It uses the requests library. It looks like this:
try:
    response = http_session.head(sent_url, timeout=(1, 1))
    if response.is_redirect:
        return response.headers["location"]
    return sent_url
except requests.exceptions.Timeout:
    return sent_url
Here, the URL I am checking is sent_url. For reference, this is how I create the session:
http_session = requests.Session()
http_adapter = requests.adapters.HTTPAdapter(max_retries=0)
http_session.mount("http://", http_adapter)
http_session.mount("https://", http_adapter)
However, one of the requirements of this program is that it must work for dead links. Based on this, I set a connection timeout (and a read timeout for good measure). After playing around with the values, it still takes about 5-10 seconds for the request to fail with this stack trace no matter what value I choose. (Maybe relevant: in the browser, it gives DNS_PROBE_POSSIBLE.)
Now, my problem is: 5-10 seconds is too long to wait when a link is dead. There are many links that this program needs to check, and I do not want a few dead links to be such a large bottleneck, hence I want to configure this DNS lookup timeout.
I found this post, which seems relevant (the OP wants to increase the timeout, I want to decrease it), but the solution does not seem applicable: I do not know the IP addresses that these URLs point to. In addition, this feature request from years ago seems relevant, but it did not help me further.
So far, the best solution I can see is to spin up a coroutine for each link (or batch of links) and absorb the timeout asynchronously.
I am on Windows 10, however this code will be deployed on an Ubuntu server. Both use Python 3.8.
So, how can I best give my HTTP requests a very low DNS resolution timeout in the case that it is being fed a dead link?

So, how can I best give my HTTP requests a very low DNS resolution timeout in the case that it is being fed a dead link?
Separate things.
Use urllib.parse to extract the hostname from the URL, and then use dnspython to resolve that name, with whatever timeout you want.
Then, and only if the resolution was correct, fire up requests to grab the HTTP data.
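For example, a minimal sketch of that split, assuming dnspython 2.x (older versions use resolver.query instead of resolve); check_redirect and the 0.5-second lifetime are illustrative choices, not values from the question:
from urllib.parse import urlparse

import dns.exception
import dns.resolver
import requests

def check_redirect(http_session, sent_url, dns_timeout=0.5):
    hostname = urlparse(sent_url).hostname
    resolver = dns.resolver.Resolver()
    resolver.lifetime = dns_timeout  # total time allowed for the DNS lookup
    try:
        resolver.resolve(hostname)   # raises on NXDOMAIN, no answer, or timeout
    except dns.exception.DNSException:
        return sent_url              # treat as a dead link and skip the HTTP call
    try:
        response = http_session.head(sent_url, timeout=(1, 1))
        if response.is_redirect:
            return response.headers["location"]
        return sent_url
    except requests.exceptions.Timeout:
        return sent_url
The resolved address is thrown away here; requests will look the name up again, but that second lookup normally hits the resolver cache and is fast.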
@blurfus: in requests you can only pass the timeout parameter on the HTTP call itself; you can't attach it to a session. This is not spelled out explicitly in the documentation, but the code is quite clear on that.
There are many links that this program needs to check,
That is a completely separate problem in fact, and exists even if all links are ok, it is just a problem of volume.
The typical solutions fall into two categories:
use asynchronous libraries (they exist for both DNS and HTTP), where your calls do not block; you get the data later, so you can do something else in the meantime
use multiprocessing or multithreading to parallelize things and have multiple URLs being tested at the same time by separate instances of your code.
They are not completely mutually exclusive; you can find a lot of pros and cons for each. Asynchronous code may be more complicated to write and understand later, so multiprocessing/multithreading is often the first step for a "quick win" (especially if you do not need to share anything between the processes/threads, otherwise it quickly becomes a problem), yet handling everything asynchronously makes the code scale more nicely with the volume.
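For instance, a minimal multithreading sketch, assuming a per-URL helper like the check_redirect shown earlier (the pool size of 20 is an arbitrary illustration):
from concurrent.futures import ThreadPoolExecutor

import requests

def check_all(urls, max_workers=20):
    # sharing one Session across threads is usually fine for simple HEAD/GET
    # requests; if in doubt, create one session per worker instead
    http_session = requests.Session()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: check_redirect(http_session, u), urls))
Each worker blocks on its own DNS/HTTP timeouts without stalling the others, so a handful of dead links no longer holds up the whole batch.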

Related

Is it a bad practice to use sleep() in a web server in production?

I'm working with Django1.8 and Python2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 milliseconds) between each chunk of data that I send:
while True:
    send(data)
    sleep(0.01)
So my question is: is it considered bad practice to simply use sleep() to create that pause? Is there maybe any other more efficient approach?
UPDATED:
The reason why I need to create that pause is because the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesn't return anything after having received, let alone processed, the data. Leaving that brief pause ensures that each chunk of data that I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
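A rough sketch of that pattern (the broker URL, task name, and helper functions are placeholders, not the poster's code; send is the existing socket-send routine from the question):
from time import sleep

from celery import Celery
from django.http import JsonResponse

app = Celery("tasks", broker="amqp://localhost")  # placeholder broker URL

@app.task
def send_chunks(chunks):
    for chunk in chunks:
        send(chunk)   # the existing socket send from the question
        sleep(0.01)   # the pause now ties up a Celery worker, not a web worker

def start_send(request):
    # the Django view hands the work off and returns immediately
    result = send_chunks.delay(build_chunks(request))  # build_chunks is illustrative
    return JsonResponse({"task_id": result.id})        # the client can poll with this id
The view finishes in milliseconds, and the client polls a status endpoint with the task id until the worker reports completion.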
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.

How do I customize this twisted code?

I am new to python, and even newer to twisted. I am trying to use twisted to download a few hundred thousand files but am having trouble trying to add an errback. I'd like to print the bad url if the download fails. I've misspelled one of my urls on purpose in order to throw an error. However, the code I have just hangs and python doesn't finish (it finishes fine if I remove the errback call).
Also, how do I process each file individually? From my understanding, "finish" is called when everything completes. I'd like to gzip each file when it's downloaded so that it's removed from memory.
Here's what I have:
from twisted.internet import reactor, defer
from twisted.web import client

urls = [
    'http://www.python.org',
    'http://stackfsdfsdfdsoverflow.com', # misspelled on purpose to generate an error
    'http://www.twistedmatrix.com',
    'http://www.google.com',
    'http://launchpad.net',
    'http://github.com',
    'http://bitbucket.org',
]

def finish(results):
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
    reactor.stop()

def print_badurls(err):
    print err # how do I just print the bad url????????

waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish).addErrback(print_badurls)
reactor.run()
Welcome to Python and Twisted!
There are a few problems with the code you pasted. I'll go through them one at a time.
First, if you do want to download thousands of urls, and will have thousands of items in the urls list, then this line:
waiting = [client.getPage(url) for url in urls]
is going to cause problems. Do you want to try to download every page in the list simultaneously? By default, in general, things you do in Twisted happen concurrently, so this loop starts downloading every URL in the urls list at once. Most likely, this isn't going to work. Your DNS server is going to drop some of the domain lookup requests, your DNS client is going to drop some of the domain lookup responses. The TCP connection attempts to whatever addresses you do get back will compete for whatever network resources are still available, and some of them will time out. The rest of the connections will all trickle along, sharing available bandwidth between dozens or perhaps hundreds of different downloads.
Instead, you probably want to limit the degree of concurrency to perhaps 10 or 20 downloads at a time. I wrote about one approach to this on my blog a while back.
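One common way to cap the concurrency is a DeferredSemaphore, which ships with Twisted; this is just a sketch, and the blog post mentioned above may describe a different approach:
from twisted.internet import defer
from twisted.web import client

sem = defer.DeferredSemaphore(10)  # at most 10 downloads in flight at once
# "urls" is the list from the question; sem.run() queues each call until a slot frees up
waiting = [sem.run(client.getPage, url) for url in urls]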
Second, gatherResults returns a Deferred that fires as soon as any one Deferred passed to it fires with a failure. So as soon as any one client.getPage(url) fails - perhaps because of one of the problems I mentioned above, or perhaps because the domain has expired, or the web server happens to be down, or just because of an unfortunate transient network condition, the Deferred returned by gatherResults will fail. finish will be skipped and print_badurls will be called with the error describing the single failed getPage call.
To handle failures from individual HTTP requests, add the callbacks and errbacks to the Deferreds returned from the getPage calls. After adding those callbacks and errbacks, you can use defer.gatherResults to wait for all of the downloads and processing of the download results to be complete.
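A sketch of what that can look like, keeping the old getPage API from the question (the handler names are illustrative):
from twisted.internet import reactor, defer
from twisted.web import client

def on_page(html, url):
    print('GOT PAGE %s: %d bytes' % (url, len(html)))

def on_error(err, url):
    print('FAILED %s: %s' % (url, err.getErrorMessage()))
    # returning nothing here "handles" the failure, so gatherResults still succeeds

def fetch(url):
    d = client.getPage(url)
    d.addCallback(on_page, url)   # extra args are passed through to the callback
    d.addErrback(on_error, url)   # the errback gets the url, so it can report it
    return d

waiting = [fetch(url) for url in urls]
defer.gatherResults(waiting).addBoth(lambda _: reactor.stop())
reactor.run()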
Third, you might want to consider using a higher-level tool for this - scrapy is a web crawling framework (based on Twisted) that provides lots of cool useful helpers for this kind of application.

Speeding up socket send behavior (in Python)

I have a script which sends 5-10 requests a second to the server. The most crucial requirement I have is a limit of requests per second. It must always be that specific figure, no more and no less. To do this, I send requests after a given interval of time (minus the time required to send the previous request).
Problem: some requests are sent fast enough, but others take too much time at the sock.sendall() step. I believe this is because the send buffer is full and execution is blocked until the buffer is cleared.
What can I do to flush that buffer quicker?
One of the options I tried is to disable Nagle:
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
but it didn't seem to improve things.
Another option, which sounds too wrong to even try, is to set the send buffer to the length of the request before each sendall() call.
Is there anything I can do to get more predictable requests per second?
One more option I just thought about: have several processes, each doing a small number of requests per second; hopefully that will make the results more predictable.
The OS in question is CentOS.
Update: It seems that my error was in setting the socket options after connect. It looks like the buffer size can only be set prior to the connect() call. Same with TCP_NODELAY. I haven't yet had time to test whether it makes any difference.
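For reference, a small sketch of setting the options before connecting (not the original script; the host, port, and buffer size are placeholders):
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# SO_SNDBUF in particular should be set before connect(), since the buffer size
# feeds into the window scaling negotiated during the handshake
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 16384)  # arbitrary size
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(("example.com", 80))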
The most crucial requirement I have is a limit of requests per second.
It must always be that specific figure, no more and no less.
That requirement is completely unimplementable via TCP. You would also need real-time guarantees of service times at the peer.
(From How can I force a socket to send the data in its buffer?)
You can't force it. Period. TCP makes up its own mind as to when it can send data. Now, normally when you call write() on a TCP socket, TCP will indeed send a segment, but there's no guarantee and no way to force this. There are lots of reasons why TCP will not send a segment: a closed window and the Nagle algorithm are two things to come immediately to mind.
Read the full post; it is quite in-depth and clarified some things for me, e.g. when disabling the Nagle algorithm makes sense and so on.

Better ways to handle AppEngine requests that time out?

Sometimes, with requests that do a lot, Google AppEngine returns an error. I have been handling this by some trickery: memcaching intermediate processed data and just requesting the page again. This often works because the memcached data does not have to be recalculated and the request finishes in time.
However... this hack requires seeing an error, going back, and clicking again. Obviously less than ideal.
Any suggestions?
inb4: "optimize your process better", "split your page into sub-processes", and "use taskqueue".
Thanks for any thoughts.
Edit - To clarify:
Long wait for requests is ok because the function is administrative. I'm basically looking to run a data-mining function. I'm searching over my datastore and modifying a bunch of objects. I think the correct answer is that AppEngine may not be the right tool for this. I should be exporting the data to a computer where I can run functions like this on my own. It seems AppEngine is really intended for serving with lighter processing demands. Maybe the quota/pricing model should offer the option to increase processing timeouts and charge extra.
If interactive user requests are hitting the 30 second deadline, you have bigger problems: your user has almost certainly given up and left anyway.
What you can do depends on what your code is doing. There's a lot to be optimized by batching datastore operations, or reducing them by changing how you model your data; you can offload work to the Task Queue; for URLFetches, you can execute them in parallel. Tell us more about what you're doing and we may be able to provide more concrete suggestions.
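For the URLFetch point, the old Python runtime's asynchronous API lets you start all fetches first and then collect the results; roughly a sketch like this (the 10-second deadline is an arbitrary illustration):
from google.appengine.api import urlfetch

def fetch_all(urls):
    rpcs = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=10)
        urlfetch.make_fetch_call(rpc, url)   # starts the fetch without blocking
        rpcs.append(rpc)
    results = []
    for rpc in rpcs:
        try:
            results.append(rpc.get_result()) # blocks until this fetch finishes
        except urlfetch.Error:
            results.append(None)
    return results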
I have been handling something similar by building a custom automatic retry dispatcher on the client. Whenever an ajax call to the server fails, the client will retry it.
This works very well if your page is ajaxy. If your app serves entire HTML pages, you can use a two-pass process: first send an empty page containing only an ajax request. Then, when AppEngine receives that ajax request, it outputs the same HTML you had before. If the ajax call succeeds, it fills the DOM with the result; if it fails, it retries once.

Alternatives to ApacheBench for profiling my code speed

I've done some experiments using Apache Bench to profile my code response times, and it doesn't quite generate the right kind of data for me. I hope the good people here have ideas.
Specifically, I need a tool that
Does HTTP requests over the network (it doesn't need to do anything very fancy)
Records response times as accurately as possible (at least to a few milliseconds)
Writes the response time data to a file without further processing (or provides it to my code, if a library)
I know about ab -e, which prints data to a file. The problem is that this prints only the quantile data, which is useful, but not what I need. The ab -g option would work, except that it doesn't print sub-second data, meaning I don't have the resolution I need.
I wrote a few lines of Python to do it, but httplib is horribly inefficient and so the results were useless. In general, I need better precision than pure Python is likely to provide. If anyone has suggestions for a library usable from Python, I'm all ears.
I need something that is high performance, repeatable, and reliable.
I know that half my responses are going to be along the lines of "internet latency makes that kind of detailed measurements meaningless." In my particular use case, this is not true. I need high resolution timing details. Something that actually used my HPET hardware would be awesome.
Throwing a bounty on here because of the low number of answers and views.
I have done this in two ways.
With "loadrunner" which is a wonderful but pretty expensive product (from I think HP these days).
With a combination of perl/php and the Curl package. I found the CURL API slightly easier to use from php. It's pretty easy to roll your own GET and PUT requests. I would also recommend manually running through some sample requests with Firefox and the LiveHttpHeaders add-on to capture the exact format of the HTTP requests you need.
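Since the question asks for something usable from Python, one option is pycurl, which exposes libcurl's own timers; a rough sketch (the URL, iteration count, and output file are placeholders):
from io import BytesIO

import pycurl

def time_requests(url, count, out_path):
    with open(out_path, "w") as out:
        for _ in range(count):
            buf = BytesIO()
            c = pycurl.Curl()
            c.setopt(pycurl.URL, url)
            c.setopt(pycurl.WRITEFUNCTION, buf.write)  # discard the body into a buffer
            c.perform()
            # libcurl reports these timings in seconds with sub-millisecond resolution
            out.write("%f %f\n" % (c.getinfo(pycurl.TOTAL_TIME),
                                   c.getinfo(pycurl.STARTTRANSFER_TIME)))
            c.close()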
JMeter is pretty handy. It has a GUI from which you can set up your requests and threadpools and it also can be run from the command line.
If you can code in Java, you can look at the combination of JUnitPerf + HttpUnit.
The downside is that you will have to do more things yourself. But in exchange you get unlimited flexibility and arguably more precision than with GUI tools, not to mention HTML parsing, JavaScript execution, etc.
There's also another project called Grinder which seems to be aimed at a similar task, but I don't have any experience with it.
A good reference for open-source performance testing tools: http://www.opensourcetesting.org/performance.php. You will find descriptions and a "most popular" list there.
httperf is very powerful.
I've used a script to drive 10 boxes on the same switch to generate load by "replaying" requests to 1 server. I had my web app logging response time (server only) to the granularity I needed, but I didn't care about the response time to the client. I'm not sure you care to include the trip to and from the client in your calculations, but if you did it shouldn't be too difficult to code up. I then processed my log with a script which extracted the times per URL and produced scatter plot graphs, and trend graphs based on load.
This satisfied my requirements which were:
Real world distribution of calls to different urls.
Trending performance based on load.
Not influencing the web app by running other intensive ops on the same box.
I did the controller as a shell script that, for each server, started a background process to loop over all the URLs in a file, calling curl on each one. I wrote the log processor in Perl since I was doing more Perl at that time.
