I'm now familiar with the general cause of this problem from another SO answer and from the uWSGI documentation, which states:
If an HTTP request has a body (like a POST request generated by a form), you have to read (consume) it in your application. If you do not do this, the communication socket with your webserver may be clobbered.
However, I don't understand what exactly is happening at the TCP level for this problem to occur. Not knowing the details of this process, I would assume the server can simply discard what remains in the stream, but that's obviously not the case.
If I consume only part of the request body in my application and ultimately return a 200 response, a web browser will report a connection reset error. Who reset the connection? The webserver or the client? It seems like all the data has been sent by the client already, but the application has just not exhausted the stream. Is there something that happens when the stream is exhausted in the application that triggers the webserver to indicate it has finished reading?
My application is Python/Flask, but I've seen questions about this from several languages and frameworks. For example, this fails if exhaust() is not called on the request stream:
from flask import Flask, request, jsonify
import pandas

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def handle_upload():
    file = request.stream
    pandas.read_csv(file, nrows=100)  # consumes only the first part of the body
    response = {}  # Do stuff
    file.exhaust()  # drain the rest of the request body
    return jsonify(response)
While there is some buffering throughout the chain, a large file transfer cannot complete until the receiver has consumed it. Once the buffers fill up, TCP flow control forces the sender to stop transmitting until they are drained. Eventually, the browser gives up trying to send the file and drops the connection.
I am fairly new to computer networking and want to use the python requests library for downloading large files from an external FTP server. I have a conceptual question as to when the content of a large file is received and how the client tells the server when to send over the content.
My code looks somewhat like
import requests
...
response = requests.get(url_to_very_large_file, stream=True)
...
with open(save_path, "wb") as file:
    for chunk in response.iter_content(chunk_size):
        file.write(chunk)
Now the response arrives back from the server very quickly (less than a second), but the content of the file (say 2 GB, for the sake of argument) surely cannot arrive that fast. I'm also confused that the response already has a content attribute. What happens under the hood?
More precisely:
What is in response.content?
Does the server now bombard my client with the 2 GB of content right away, or is another request sent to the server when I ask for response.iter_content or response.raw.read()? At which point does the server start sending over the 2 GB of content?
Does the server know what chunk_size I am reading/expecting the file in?
Where are the chunks stored in the meantime, if they are received by the client but not read into memory?
The response.content attribute contains the bytes returned by the remote server. This attribute is a property, so if you sent the request with the stream=True option, it won't contain the content upon creation; it is only populated the first time you access it, which is the moment it pulls all the data from the server.
When you send a request to a server, you establish a connection that the server will send data through. This doesn't have to happen all at once, and if your client is not pulling the data into its RAM, the server will wait for you for a while. By using the .iter_content method you slowly pull data from the server, a few bytes at a time.
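For illustration, here is a minimal sketch of that lazy pull (the URL and chunk size are made up):

import requests

r = requests.get("https://example.com/big.bin", stream=True)  # returns once headers arrive
# None of the large body has been read into Python yet.
with open("big.bin", "wb") as f:
    for chunk in r.iter_content(chunk_size=64 * 1024):  # pull ~64 KiB per iteration
        f.write(chunk)
# Accessing r.content instead would have read the whole body into memory at once.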
It doesn't, and given how a TCP connection works, that isn't necessary either.
The server doesn't send us data until we have room for it, so unread chunks aren't sitting on our machine; beyond what fits in the OS receive buffer, they simply haven't been sent yet.
If you have already learned another language like Java, you can think of a property as a getter/setter, but in a more integrated way. Check the post I linked above for better explanations.
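A simplified sketch of the pattern (this is not requests' actual code, just the idea):

class Response:
    def __init__(self, raw):
        self.raw = raw            # the open socket wrapper
        self._content = None

    @property
    def content(self):
        # Looks like plain attribute access, works like a Java getter.
        if self._content is None:
            self._content = self.raw.read()  # data is pulled on first access
        return self._content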
It might be helpful to learn how TCP connections and sockets work, since those are what do all the work under the hood.
I'm looking to start a web project using Flask and its SocketIO plugin, which depends on gevent (something something greenlets), but I don't understand how gevent relates to the webserver. Does using gevent restrict my server choice at all? How does it relate to the different levels of web servers that we have in python (e.g. Nginx/Apache, Gunicorn)?
Thanks for the insight.
First, let's clarify what we are talking about:
gevent is a library that makes it easy to program with event loops. It is a way to return responses immediately, without "blocking" the requester.
socket.io is a JavaScript library for creating clients that maintain permanent connections to servers, which send them events. The library can then react to these events.
greenlet: think of this as a thread; a way to launch multiple workers that do some tasks.
A highly simplified overview of the entire process follows:
Imagine you are creating a chat client.
You need a way to notify the users' screens when anyone types a message. For this to happen, you need some way to tell all the users when a new message is there to be displayed. That's what socket.io does. You can think of it like a radio that is tuned to a particular frequency. Whenever someone transmits on this frequency, the code does something. In the case of the chat program, it adds the message to the chat box window.
Of course, if you have a radio tuned to a frequency (your client), then you need a radio station/dj to transmit on this frequency. Here is where your flask code comes in. It will create "rooms" and then transmit messages. The clients listen for these messages.
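As a concrete sketch of the "radio station" side, here is roughly what that looks like with Flask-SocketIO (the event names and the message payload are made up for illustration):

from flask import Flask
from flask_socketio import SocketIO, emit, join_room

app = Flask(__name__)
socketio = SocketIO(app)

@socketio.on('join')
def on_join(room):
    join_room(room)  # tune this client to a "frequency"

@socketio.on('chat')
def on_chat(data):
    # transmit to everyone tuned to the room
    emit('chat', data['message'], room=data['room'])

if __name__ == '__main__':
    socketio.run(app)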
You can also write the server-side ("radio station") code in socket.io using node, but that is out of scope here.
The problem here is that, traditionally, a web server works like this:
1. A user types an address into a browser, and hits enter (or go).
2. The browser reads the web address and then, using the DNS system, finds the IP address of the server.
3. It creates a connection to the server, and then sends a request.
4. The webserver accepts the request.
5. It does some work, or launches some process (depending on the type of request).
6. It prepares (or receives) a response from the process.
7. It sends the response to the client.
8. It closes the connection.
Between steps 3 and 8, the client (the browser) is waiting for a response; it is blocked from doing anything else. So if there is a problem somewhere, say some server-side script is taking too long to process the request, the browser stays stuck on the white page with the loading icon spinning. It can't do anything until the entire process completes. This is just how the web was designed to work.
This kind of 'blocking' architecture works well for 1-to-1 communication. However, for multiple people to keep updated, this blocking doesn't work.
The event libraries (gevent) help with this because they accept the request without blocking the client: they immediately send a response, and the actual work completes later.
Your application, however, still needs to notify the client when that work is done. But as the connection is closed, you don't have a way to contact the client back.
In order to notify the client and to make sure the client doesn't need to "refresh", a permanent connection should be open - that's what socket.io does. It opens a permanent connection, and is always listening for messages.
1. A work request comes in from one end and is accepted.
2. The work is executed and a response is generated by something else (it could be the same program or another program).
3. Then, a notification is sent: "hey, I'm done with your request - here is the response".
4. The client from step 1 listens for this message and then does something.
Underneath it all is WebSocket, a new full-duplex protocol that enables all this radio/DJ functionality.
Things common between WebSockets and HTTP:
Work on the same port (80)
WebSocket requests start off as HTTP requests for the handshake (an upgrade header), but then shift over to the WebSocket protocol - at which point the connection is handed off to a websocket-compatible server.
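For reference, the opening handshake is just an HTTP request with a few extra headers (this example uses the sample values from RFC 6455):

GET /chat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13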
All your traditional web server has to do is listen for this handshake request, acknowledge it, and then pass the request on to a websocket-compatible server - just like any other normal proxy request.
For Apache, you can use mod_proxy_wstunnel.
nginx versions 1.3+ have websocket support built in.
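A minimal nginx proxy block might look like this (the backend address and path are assumptions; the directives are the ones nginx documents for websocket proxying):

location /ws/ {
    proxy_pass http://127.0.0.1:5000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}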
I have a cherrypy api that is intended to run for a long time on the server.
I have an unreliable client that can die or close connection for various reasons that are out of my control.
During the time my server api runs, I want to periodically check the status of the connection, making sure the client is still listening and abort my operation if the client has gone away.
I could not find any good place describing how to poll the connection status while serving a cherrypy request.
One example of such a long run is computing the md5 of multiple big files (tens of GBs each) in chunks, using a small buffer (limited memory).
I don't need any solutions that shorten the runtime since that is not my goal here. I want to keep this connection open for as long as I can, but abort if it is closed.
Here is the simple sample of my code:
import json
import cherrypy

@cherrypy.expose
def foo(self):
    cherrypy.response.headers['Content-Type'] = 'text/plain'
    def run():
        for result in get_results():  # get_results() is the heavy method mentioned above
            yield json.dumps(result)
    return run()
foo._cp_config = {'response.stream': True}
The only reliable way to know that the client has died is to try to write some data to the socket, which for CherryPy can be done with yield. You must yield non-empty strings, so you'd have to be returning a Content-Type that can handle some filler text, like some extra spaces after the opening <head> tag of an HTML document. If the client closes the connection, the CherryPy server will stop requesting additional yielded data from the handler (and call any close method on the generator so you can clean up).
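A sketch of that approach, building on the question's code (the whitespace filler and the cleanup hook are assumptions, not a CherryPy API):

@cherrypy.expose
def foo(self):
    cherrypy.response.headers['Content-Type'] = 'text/plain'
    def run():
        try:
            for result in get_results():
                yield json.dumps(result)
                yield ' '  # non-empty filler write; fails once the client is gone
        except GeneratorExit:
            # CherryPy close()s the generator when the client disconnects,
            # so abort the heavy work here.
            stop_heavy_work()  # hypothetical cleanup hook
            raise
    return run()
foo._cp_config = {'response.stream': True}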
As far as I know, CherryPy doesn't provide you with any mechanism to detect that a client died. It will only tell you if a response took too long to complete (and therefore be sent out).
You may refer to this SO thread for more information.
I am using a server to send some piece of information to another server every second. The problem is that the other server's response is a few kilobytes, and this consumes bandwidth on the first server (about 2 GB in an hour). I would like to send the request and ignore the response (not even receive it, to save bandwidth).
I use a small python script for this task, using urllib. I don't mind using any other tool, or even any other language, if it will send only the request.
A 5K reply is small stuff and is probably below the standard TCP window size of your OS. This means that even if you close your network connection just after sending the request and checking only the very first bytes of the reply (to be sure the request has really been received), the server has probably already sent you the whole answer, and the packets are already on the wire or on your computer.
If you cannot control (i.e. trim down) the server's reply to your notification, the only alternative I can think of is to add another server on the remote machine that waits for a simple command, does the real request locally, and sends back to you just the result code. This can be done very easily, maybe even just with bash/perl/python, using for example netcat/wget locally.
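A minimal sketch of such a relay (the port, trigger protocol, and local URL are all made up; urllib2 matches the Python of the other answer below): it waits for a one-byte trigger, performs the real request locally, and returns only the status code:

import socket
import urllib2

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('0.0.0.0', 9999))
srv.listen(1)
while True:
    conn, _ = srv.accept()
    conn.recv(1)  # wait for the trigger byte; contents don't matter
    code = urllib2.urlopen('http://localhost/notify').getcode()  # real request stays local
    conn.sendall(str(code))  # send back just the 3-digit status code
    conn.close()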
By the way, there is something strange in your math, as Glenn Maynard correctly wrote in a comment.
For HTTP, you can send a HEAD request instead of GET or POST:
import urllib2
request = urllib2.Request('https://stackoverflow.com/q/5049244/')
request.get_method = lambda: 'HEAD' # override get_method
response = urllib2.urlopen(request) # make request
print response.code, response.url
Output
200 https://stackoverflow.com/questions/5049244/how-can-i-ignore-server-response-to-save-bandwidth
See How do you send a HEAD HTTP request in Python?
Sorry, but this does not make much sense and is likely a violation of the HTTP protocol. I consider such an idea weird and broken by design. Either make the remote server shut up, or configure your application (or whatever is running on the remote server) at a different protocol level, using a smarter protocol with less bandwidth usage. Anything else can hardly be considered anything but nonsense.
The fun part of websockets is sending essentially unsolicited content from the server to the browser right?
Well, I'm using django-websocket by Gregor Müllegger. It's a really wonderful early crack at making websockets work in Django.
I have accomplished "hello world." The way this works is: when a request is a websocket, a websocket object is attached to the request object. Thus, in the view interpreting the websocket, I can do something like:
request.websocket.send('We are the knights who say ni!')
That works fine. I get the message back in the browser like a charm.
But what if I want to do that without issuing a request from the browser at all?
OK, so first I save the websocket in the session dictionary:
request.session['websocket'] = request.websocket
Then, in a shell, I go and grab the session by session key. Sure enough, there's a websocket object in the session dictionary. Happy!
However, when I try to do:
>>> session.get_decoded()['websocket'].send('With a herring!')
I get:
Traceback (most recent call last):
File "<console>", line 1, in <module>
error: [Errno 9] Bad file descriptor
Sad. :-(
OK, so I don't know much of anything about sockets, but I know enough to sniff around in a debugger, and lo and behold, I see that the socket in my debugger (which is tied to the genuine websocket from the request) has fd=6, while the one that I grabbed from the session-saved websocket has fd=-1.
Can a socket-oriented person help me sort this stuff out?
I'm the author of django-websocket. I'm not a real expert in the topic of websockets and networking, however I think I have a decent understanding of what's going on. Sorry for going into great detail. Even if most of the answer isn't specific to your question, it might help you at some other point. :-)
How websockets work
Let me explain shortly what a websocket is. A websocket starts as something that really looks like a plain HTTP request, established from the browser. It indicates through an HTTP header that it wants to "upgrade" the protocol to be a websocket instead of an HTTP request. If the server supports websockets, it agrees on the handshake, and both server and client now know that they will use the established tcp socket, formerly used for the HTTP request, as a connection to interchange websocket messages.
Besides sending and waiting for messages, they of course also have the ability to close the connection at any time.
How django-websocket abuses Python's WSGI request environment to hijack the socket
Now let's get into the details of how django-websocket implements the "upgrading" of the HTTP request in a django request-response cycle.
Django usually uses the WSGI specification to talk to the webserver, like apache or gunicorn etc. This specification was designed just with the very limited communication model of HTTP in mind. It assumes that it gets an HTTP request (only incoming data) and returns the response (only outgoing data). This makes it tricky to force django into the concept of a websocket, where bidirectional communication is allowed.
What I do in django-websocket to achieve this is dig very deeply into the internals of WSGI and django's request object to retrieve the underlying socket. This tcp socket is then used to handle the upgrade of the HTTP request to a websocket instance directly.
Now to your original question ...
I hope the above makes it obvious that when a websocket is established, there is no point in returning an HttpResponse. This is why you usually don't return anything in a view that is handled by django-websocket.
However I wanted to stick close to the concept of a view that holds the logic and returns data based on the input. This is why you should only use the code in your view to handle the websocket.
After you return from the view, the websocket is automatically closed. This is done for a reason: We don't want to keep the socket open for an undefined amount of time and relying on the client (the browser) to close it.
This is why you cannot access a websocket with django-websocket outside of your view. The file descriptor is then of course set to -1, indicating that it's already closed.
Disclaimer
I explained above that I dig into the surrounding environment of django to get, in a very hackish way, access to the underlying socket. This is very fragile and also not supposed to work, since WSGI is not designed for this! I also explained above that the websocket is closed after the view ends. However, after the websocket has closed down (AND closed the tcp socket), django's WSGI implementation tries to send an HTTP response. It doesn't know about websockets and thinks it is in a normal HTTP request-response cycle. But the socket is already closed, and the sending will fail. This usually causes an exception in django.
This didn't affect my testing with the development server. The browser will never notice (you know .. the socket is already closed ;-) - but raising an unhandled error on every request is not a very good concept. It may leak memory, it doesn't handle database connection shutdown correctly, and many other things will break at some point if you use django-websocket for more than experimenting.
This is why I would really advise you not to use websockets with django yet. It doesn't work by design. Django, and especially WSGI, would need a total overhaul to solve these problems (see this discussion for websockets and WSGI). Until then, I would suggest using something like eventlet. Eventlet has a working websocket implementation (I borrowed some code from eventlet for the initial version of django-websocket), and since it's just plain python code you can import your models and everything else from django. The only drawback is that you need a second webserver running just to handle websockets.
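For example, a minimal eventlet websocket echo server looks like this (using eventlet's documented websocket module; you would import and use your django models inside the handler as needed):

import eventlet
from eventlet import wsgi, websocket

@websocket.WebSocketWSGI
def handle(ws):
    while True:
        msg = ws.wait()   # returns None once the client disconnects
        if msg is None:
            break
        ws.send(msg)      # echo the message back

wsgi.server(eventlet.listen(('', 7000)), handle)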
As Gregor Müllegger pointed out, Websockets can't be properly handled by WSGI, because that protocol never was designed to handle such a feature.
uWSGI, since version 1.9.11, can handle Websockets out of the box. Here uWSGI communicates with the application server using raw HTTP rather than the WSGI protocol. A server written that way can therefore handle the protocol internals and keep the connection open over a long period. Having long-living connections handled by a Django view is not a good idea either, because they would then block a worker thread, which is a limited resource.
The main purpose of Websockets is to have the server push messages to the client in an asynchronous way. This can be a Django view triggered by other browsers (e.g. chat clients, multiplayer games), or an event triggered by, say, django-celery (e.g. sports results). It therefore is fundamental for these Django services to use a message queue for pushing messages to the client.
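As a sketch of the message-queue side (the channel name and payload are made up; redis-py's publish is the real call):

import json
import redis

# Somewhere in a Django view or a celery task:
r = redis.StrictRedis()
r.publish('scores', json.dumps({'game': 42, 'result': '1-0'}))
# A separate websocket process subscribes to 'scores' and forwards
# each message to its connected clients.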
To handle this in a scalable way, I wrote django-websocket-redis, a Django module which can keep open all those long living Websocket connections in one single thread/process using Redis as the backend message queue.
You could give stargate a bash: http://boothead.github.com/stargate/ and http://pypi.python.org/pypi/stargate/.
It's built on top of pyramid and eventlet (I also contributed a fair bit of the websocket support and tests to eventlet). The big advantage of pyramid for this sort of thing is that it has the concept of a resource which the url maps to, rather than just the result of a callable. So you end up with a graph of persistent resources that maps to your url structure, and websocket connections are simply routed and connected to those resources.
So you end up only needing to do two things:
class YourView(WebSocketView):
    def handler(self, websocket):
        self.request.context.add_listener(websocket)
        while True:
            msg = websocket.wait()
            # Do something with the message
To receive messages
and
resource.send(some_other_message)
Here resource is an instance of stargate.resource.WebSocketAwareContext (as is self.request.context above), and the send method sends the message to all clients connected with the add_listener method.
To publish a message to all of the connected clients you just call node.send(message)
I'm hopefully going to write up a little example app in the next week or two to demonstrate this a little better.
Feel free to ping me on github if you want some help with it.
request.websocket is probably closed when you return from the request handler (view). The simple solution is to keep the handler alive by not returning from the view. If your server is not multi-threaded, though, you won't be able to accept any other simultaneous requests.
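A sketch of that idea (the blocking wait() read is an assumption about django-websocket's API, modeled on eventlet's; send() is shown in the question):

def my_view(request):
    ws = request.websocket
    while True:
        message = ws.wait()  # assumed blocking read; None would mean the client disconnected
        if message is None:
            break            # client closed; now it's fine to let the view return
        ws.send(message)     # echo back, as in the question's hello world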