I have some Ring routes which I'm running in one of two ways:
lein ring server, with the lein-ring plugin
using org.httpkit.server, like (hs/run-server app {:port 3000})
It's a web app (being consumed by an Angular.js browser client).
I have some API tests written in Python using the Requests library:
my_r = requests.post(MY_ROUTE,
                     data=MY_DATA,
                     headers={"Content-Type": "application/json"},
                     timeout=10)
When I use lein ring server, this request works fine in the JS client and the Python tests.
When I use httpkit, this works fine in the JS client but the Python client times out with
socket.timeout: timed out
I can't figure out why the Python client is timing out. It happens with httpkit but not with lein-ring, so I can only assume that the cause is related to the difference.
I've looked at the traffic in Wireshark and both look like they give the correct response. Both have the same Content-Length field (15 bytes).
I've raised the number of threads to 10 (shouldn't need to) and no change.
Any ideas what's wrong?
I found how to fix this, but no satisfactory explanation.
I was using wrap-json-response Ring middleware to take a HashMap and convert it to JSON. I switched to doing my own conversion in my handler with json/write-str, and this fixes it.
At a guess it might be something to do with the server handling output buffering, but that's speculation.
I've combed through the Wireshark dumps and I can't see any relevant differences between the two. The sent Content-Length fields are identical. The 'bytes in flight' differ, at 518 and 524.
No clue as to why the web browser was happy with this but Python Requests wasn't, or whether this is a bug in Requests, httpkit, ring-middleware-format or my own code.
OpenStack Swift uses eventlet.green.httplib for its BufferedHTTPConnection.
While benchmarking write performance, I observed that write throughput drops even when only one replica node is overloaded.
As far as I know, the write quorum is 2 out of 3 replicas, so overloading a single replica should not affect throughput.
When I dug deeper, I observed that subsequent requests are blocked until the responses to the previous requests arrive. This is mainly because BufferedHTTPConnection stops issuing new requests until the previous response has been read.
Why does OpenStack Swift use such a method?
Is this the usual behaviour of eventlet.green.httplib.HTTPConnection?
This does not make sense from a write-quorum point of view, because it's like waiting for all the responses rather than a quorum.
Any ideas or workarounds to stop this behaviour using the same library?
It's not a problem with the library but a limitation of the OpenStack Swift configuration: the "workers" setting in all of the Account/Container/Object configs of OpenStack Swift was set to 1.
Regarding the library:
When new connections are made using eventlet.green.httplib.HTTPConnection, it does not block.
But if requests use the same connection, subsequent requests are blocked until the previous response is fully read.
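The same-connection behaviour can be reproduced with the standard library module that eventlet's green httplib is modelled on. The sketch below is an illustration, not Swift's code: it assumes Python 3 (where httplib is named http.client), uses a throwaway local server, and assumes the green version behaves the same way; the names PathEchoHandler, QuietServer and run_demo are made up for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PathEchoHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep the connection open between requests

    def do_GET(self):
        body = self.path.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


class QuietServer(ThreadingHTTPServer):
    def handle_error(self, request, client_address):
        pass  # ignore clients that hang up with data still unread


def run_demo():
    server = QuietServer(("127.0.0.1", 0), PathEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address[:2]
    outcome = {}
    try:
        # Same connection, first response left unread: the next one is refused.
        conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", "/first")
        unread = conn.getresponse()  # headers parsed, body still pending
        conn.request("GET", "/second")
        try:
            conn.getresponse()
            outcome["reuse_unread"] = "allowed"
        except http.client.ResponseNotReady:
            outcome["reuse_unread"] = "refused"
        conn.close()

        # Same connection, each body read before the next request: works.
        conn = http.client.HTTPConnection(host, port, timeout=5)
        bodies = []
        for path in ("/first", "/second"):
            conn.request("GET", path)
            bodies.append(conn.getresponse().read().decode("ascii"))
        conn.close()
        outcome["reuse_after_read"] = bodies

        # Two connections: both requests are sent before either is read.
        a = http.client.HTTPConnection(host, port, timeout=5)
        b = http.client.HTTPConnection(host, port, timeout=5)
        a.request("GET", "/a")
        b.request("GET", "/b")
        outcome["separate"] = [b.getresponse().read(), a.getresponse().read()]
        a.close()
        b.close()
    finally:
        server.shutdown()
        server.server_close()
    return outcome


if __name__ == "__main__":
    print(run_demo())
```

If independent requests must not wait on each other, the practical choice under these assumptions is one connection per in-flight request rather than sharing a single connection.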
I always had the idea that doing a HEAD request instead of a GET request was faster (no matter the size of the resource) and therefore had its advantages in certain solutions.
However, while making a HEAD request in Python (to a 5+ MB dynamically generated resource) I realized that it took the same time as making a GET request (almost 27 seconds instead of the 'less than 2 seconds' I was hoping for).
I used some urllib2 solutions for making a HEAD request found here, and even used pycurl (setting the HEADER and NOBODY options to True). Both of them took the same time.
Am I missing something conceptually? Is it possible, using Python, to do a 'quick' HEAD request?
The server is taking the bulk of the time, not your requester or the network. If it's a dynamic resource, it's likely that the server doesn't know all the header information - in particular, Content-Length - until it's built it. So it has to build the whole thing whether you're doing HEAD or GET.
The response time is dominated by the server, not by your request. The HEAD request returns less data (just the headers) so conceptually it should be faster, but in practice, many static resources are cached so there is almost no measurable difference (just the time for the additional packets to come down the wire).
Chances are, the bulk of that request time is actually whatever process generates the 5+MB response on the server rather than the time to transfer it to you.
In many cases, a web application will still execute the full script when responding to a HEAD request--it just won't send the full body back to the requester.
If you have access to the code that is processing that request, you may be able to add a condition in there to make it handle the request differently depending on the method, which could speed it up dramatically.
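As a rough illustration of such a condition, here is a sketch using Python 3's built-in http.server. Everything in it is assumed for the example: build_report is a hypothetical stand-in for the slow generation step, and the handler simply omits Content-Length on HEAD because it never builds the body.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def build_report():
    """Hypothetical stand-in for a slow, dynamically generated resource."""
    time.sleep(0.3)
    return b"x" * 1024


class ReportHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = build_report()  # the expensive part
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_HEAD(self):
        # Skip the expensive build; the length is unknown, so it is omitted.
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def timed_request(method, port):
    """Return (status, seconds) for one request to the local demo server."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    start = time.monotonic()
    conn.request(method, "/report")
    response = conn.getresponse()
    response.read()
    elapsed = time.monotonic() - start
    conn.close()
    return response.status, elapsed


def compare():
    server = ThreadingHTTPServer(("127.0.0.1", 0), ReportHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return timed_request("HEAD", port), timed_request("GET", port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(compare())
```

The trade-off is that a HEAD reply produced this way cannot report the size of the resource, so it only helps clients that need the status or other cheap headers.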
I have some test code (as a part of a webapp) that uses urllib2 to perform an operation I would usually perform via a browser:
Log in to a remote website
Move to another page
Perform a POST by filling in a form
I've created 4 separate, clean virtualenvs (with --no-site-packages) on 3 different machines, all with different versions of Python but the exact same packages (via a pip requirements file), and the code only works on the two virtualenvs on my local development machine (2.6.1 and 2.7.2). It won't work on either of my production VPSs.
In the failing cases, I can log in successfully, move to the correct page but when I submit the form, the remote server replies telling me that there has been an error - it's an application server error page ('we couldn't complete your request') and not a webserver error.
because I can successfully log in and maneuver to a second page, this doesn't seem to be a session or a cookie problem - it's particular to the final POST
because I can perform the operation on a particular machine with the EXACT same headers and data, this doesn't seem to be a problem with what I am requesting/posting
because I am trying the code on two separate VPS rented from different companies, this doesn't seem to be a problem with the VPS physical environment
because the code works on 2 different Python versions, I can't imagine it being an incompatibility problem
I'm completely lost at this stage as to why this wouldn't work. I've even 'turned-it-off-and-turn-it-on-again' because I just can't see what the problem could be.
I think it has to be something to do with the final POST coming from a VPS that the remote server doesn't like, but I can't figure out what that could be. I feel like there is something going on under the hood of urllib2 that is causing the remote server to dislike the request.
EDIT
I've installed the exact same Python version (2.6.1) on the VPS as is on my working local copy and it doesn't work remotely, so it must be something to do with originating from a VPS. How could this affect the HTTP request? Is it something lower level?
You might try setting the debuglevel=1 for urllib2 and see what it comes up with:
import urllib2
h=urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(h)
...
This is a total shot in the dark, but are your VPSs 64-bit and your home computer 32-bit, or vice versa? Maybe a difference in default sizes or accuracies of something could be freaking out the server.
Barring that, can you try to find out any information on the software stack the web server is using?
I had similar issues with urllib2 (working with Zimbra's REST api), in the end switched to pycurl with success.
PS
For login/navigate/post operations, I usually find Mechanize useful and easier to use. Maybe you can give it a shot.
Well, it looks like I know how to fix the problem, but I'm not 100% sure of the reason for it.
I simply had to make the server wait (time.sleep()) after it sent the 2nd request (Move to another page) before making the 3rd request (Perform a POST by filling in a form).
I don't know whether it's because of a condition on the 3rd-party server or some sort of odd issue with urllib2. The reason it seemed to work on my development machine is presumably that it was slower than the server at running the code.
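The workaround amounts to pausing between consecutive requests. A minimal sketch of that idea follows, with each request passed in as a plain callable; the run_steps name, the 2-second default and the injectable sleep function are assumptions for illustration, not part of the original code.

```python
import time


def run_steps(steps, pause=2.0, sleep=time.sleep):
    """Call each step in order, sleeping `pause` seconds between steps."""
    results = []
    for index, step in enumerate(steps):
        if index:  # no pause before the very first step
            sleep(pause)
        results.append(step())
    return results


if __name__ == "__main__":
    # Stand-ins for the login, navigation and form POST requests.
    print(run_steps([lambda: "login", lambda: "page", lambda: "post"], pause=0.1))
```

A fixed pause only hides whatever timing condition exists on the remote side, so the right value has to be found by trial.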
I want to make an HTTPS request to a real-time stream and keep the connection open so that I can keep reading content from it and processing it.
I want to write the script in Python. I am unsure how to keep the connection open in my script. I have tested the endpoint with curl, which keeps the connection open successfully. But how do I do it in Python? Currently, I have the following code:
import httplib

# req is built earlier in the script (not shown here)
c = httplib.HTTPSConnection('userstream.twitter.com')
c.request("GET", "/2/user.json?" + req.to_postdata())
response = c.getresponse()
Where do I go from here?
Thanks!
It looks like your real-time stream is delivered as one endless HTTP GET response, yes? If so, you could just use Python's built-in urllib2.urlopen(). It returns a file-like object, from which you can read as much as you want until the server hangs up on you.
import urllib2
f = urllib2.urlopen('https://encrypted.google.com/')
while True:
    data = f.read(100)
    if not data:  # empty read: the server closed the connection
        break
    print(data)
Keep in mind that although urllib2 speaks https, it doesn't validate server certificates, so you might want to try an add-on package like pycurl or urlgrabber for better security. (I'm not sure if urlgrabber supports https.)
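For comparison, the same read loop can be sketched on Python 3 with urllib.request, which on current releases verifies server certificates by default. The helper names iter_chunks and stream_url and the chunk size are assumptions for illustration; the loop ends when the server closes the stream.

```python
import io
import urllib.request


def iter_chunks(stream, size=100):
    """Yield chunks from a file-like object until it reports end of stream."""
    while True:
        chunk = stream.read(size)
        if not chunk:  # empty read: the server closed the connection
            return
        yield chunk


def stream_url(url, handle, size=100):
    """Open `url` and pass each chunk to `handle` as it arrives."""
    with urllib.request.urlopen(url) as response:
        for chunk in iter_chunks(response, size):
            handle(chunk)


if __name__ == "__main__":
    # Offline demonstration with an in-memory stream instead of a real URL.
    for piece in iter_chunks(io.BytesIO(b"abcdefg"), size=3):
        print(piece)
```

Note that read(size) may wait until size bytes have arrived, so a small chunk size keeps latency low on a slow stream.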
Connection keep-alive features are not available in any of the Python standard libraries for HTTPS. The most mature option is probably urllib3.
httplib2 supports this. (I'd have thought this the most mature option, didn't know urllib3 yet, so TokenMacGuy may still be right)
EDIT: while httplib2 does support persistent connections, I don't think you can really consume streams with it (ie. one long response vs. multiple requests over the same connection), which I now realise you may need.
I have a large scraping job to do -- most of the script's time is spent blocking due to a lot of network latency. I'm trying to multi-thread the script so I can make multiple requests simultaneously, but about 10% of my threads die with the following error
URLError: <urlopen error [Errno -2] Name or service not known>
The other 90% complete successfully. I am requesting multiple pages from the same domain, so it seems like there may be some DNS issue. I make 25 requests at a time (25 threads). Everything works fine if I limit myself to 5 requests at a time, but once I get to around 10 requests, I start seeing this error sometimes.
I have read "Repeated host lookups failing in urllib2", which describes the same issue I have, and followed the suggestions therein, but to no avail.
I have also tried using the multiprocessing module instead of multi-threading. I get the same behaviour -- about 10% of the processes die with the same error -- which leads me to believe this is not an issue with urllib2 but something else.
Can someone explain what is going on and suggest how to fix it?
UPDATE
If I manually code the IP address of the site into my script, everything works perfectly, so this error happens sometime during the DNS lookup.
Suggestion: Try enabling a DNS cache in your system, such as nscd. This should eliminate DNS lookup problems if your scraper always makes requests to the same domain.
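Where a system-wide cache such as nscd is not an option, a similar effect can be approximated inside the script by memoising lookups. This is a different technique from nscd, sketched here for Python 3: it wraps a resolver function (socket.getaddrinfo by default) in functools.lru_cache. The make_cached_resolver name is hypothetical, and cached entries never expire, so it only suits short-lived jobs.

```python
import functools
import socket


def make_cached_resolver(resolver=socket.getaddrinfo, maxsize=256):
    """Return a memoised version of `resolver` (entries never expire)."""

    @functools.lru_cache(maxsize=maxsize)
    def cached(*args, **kwargs):
        return resolver(*args, **kwargs)

    return cached


if __name__ == "__main__":
    lookup = make_cached_resolver()
    lookup("localhost", 80)
    lookup("localhost", 80)  # served from the cache, no second lookup
    print(lookup.cache_info())
```

One possible way to use it is to assign the result to socket.getaddrinfo before starting the worker threads, so that the standard library's connection code goes through the cache. With multiprocessing, each process keeps its own separate cache.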
Make sure that the file objects returned by urllib2.urlopen are properly closed after being read, in order to free resources. Otherwise, you may reach the limit of max open sockets in your system.
Also, take into account the politeness policy web crawlers should have to avoid overloading a server with multiple requests.
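The last two points can be sketched together on Python 3: cap the number of simultaneous requests, always close each response, and pause briefly after each fetch. The names fetch and fetch_all, the worker count and the delay are assumptions for illustration, and the opener is a parameter only so that the sketch can be tried without network access.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, opener=urllib.request.urlopen, delay=0.0):
    """Fetch one URL, always closing the response, then pause `delay` seconds."""
    with opener(url) as response:  # closed even if read() raises
        body = response.read()
    time.sleep(delay)  # politeness pause before this worker's next request
    return url, body


def fetch_all(urls, max_workers=5, opener=urllib.request.urlopen, delay=0.0):
    """Fetch all URLs with at most `max_workers` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda url: fetch(url, opener, delay), urls))
```

A call such as fetch_all(urls, max_workers=5, delay=0.5) keeps at most five requests open at a time; both numbers are placeholders to tune against the target site's tolerance.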