Persistent HTTP connections with httplib - python

I'm trying to write an application where I send an initial HTTP POST message to a server and leave the connection open. The application then sits around until the server sends data back. Once the server sends data back I want to read it and write it to a file (easy enough).
The part I'm having trouble with is actually reading the data when it arrives.
Basically I do this:
h = httplib.HTTPConnection(server, port, timeout=timeout)
h.putrequest('POST', selector)
h.putheader(...)
h.endheaders()
h.send(body)
buffering = False
while 1:
    r = h.getresponse(buffering)
    f = open(unique_filename, 'w')
    f.write(r.read())
    f.close()
What I expect is that the app blocks in the loop and, when data arrives, writes it to the file. I suspect I'm using read() the wrong way, but looking at the httplib source didn't help.
Also, the Python documentation site mentions a fileno() method that returns the socket httplib uses. I'm on 2.7.0 and the website docs are for 2.7.2; I can't find the fileno() method. I suspect taking the socket over from httplib and calling recv myself is the best way to go. Is that a good idea?
Any help is appreciated with one exception: please don't tell me to use some other library.
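Regarding the raw-socket idea: in 2.7, HTTPConnection keeps its underlying socket in the sock attribute once connected, so one hedged sketch of that route (bypassing httplib's response parsing entirely, which means you receive raw HTTP bytes, headers included) would be:
import select

# ... after h.endheaders() and h.send(body) as above ...
sock = h.sock  # the socket httplib is using under the hood

while True:
    select.select([sock], [], [])  # block until the server sends something
    data = sock.recv(4096)
    if not data:
        break  # server closed the connection
    f = open(unique_filename, 'w')
    f.write(data)
    f.close()
This is a sketch of the approach, not a drop-in fix: once you go below httplib you are responsible for parsing the HTTP framing yourself.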

Related

When is the content of a GET request received in python?

I am fairly new to computer networking and want to use the python requests library for downloading large files from an external FTP server. I have a conceptual question as to when the content of a large file is received and how the client tells the server when to send over the content.
My code looks somewhat like
import requests
...
response = requests.get(url_to_very_large_file, stream=True)
...
with open(save_path, "wb") as file:
    for chunk in response.iter_content(chunk_size):
        file.write(chunk)
Now response arrives back from the server very quickly (in less than a second), but the content of the file (say 2 GB, for the sake of argument) surely cannot arrive that fast. I'm also confused that response already has a content attribute. What happens under the hood?
More precisely:
What is in response.content?
Does the server now bombard my client with the 2 GB of content right away, or is another request sent to the server when I ask for response.iter_content or response.content? At which point does the server start sending over the 2 GB of content?
Does the server know in which chunk_size I am reading /expecting the files?
Where are the chunks stored in the meantime, if they are received by the client but not read into memory?
The response.content attribute contains the bytes returned by the remote server. The attribute is a property, so if you sent the request with the stream=True option, it won't contain the content upon creation; the data is pulled from the server at the moment you first access it.
When you send a request to a server, you're establishing a connection that the server will send data through. This doesn't have to happen all at once, and if your client is not pulling the data into its RAM, the server will wait for you for a while. By using the .iter_content method you're pulling data from the server gradually, a few bytes at a time.
They don't, and given how a TCP connection works it isn't necessary either: the server just writes a byte stream, and the chunk size only determines how much of that stream your client reads at a time.
The server doesn't send us data until we have room for it (TCP flow control), so unread chunks are either still on the server or sitting in the operating system's socket receive buffer; they're not on our machine unless they're in our memory.
If you have already learnt another language like Java, you could think of a property as a getter/setter, but in a more integrated way. Check the post I linked above for a better explanation.
It might be helpful to learn how TCP connections and sockets work, since those are what does all the work under the hood.
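To make the lazy behaviour concrete, here is a small sketch (the URL and file name are placeholders):
import requests

url = "http://example.com/big.bin"  # placeholder URL

# With stream=True, requests returns as soon as the response headers
# arrive; the body stays on the socket until it is asked for.
response = requests.get(url, stream=True)

# Pull the body a chunk at a time, with bounded memory use.
with open("big.bin", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)

# Accessing response.content instead would read the entire body into
# memory in one go -- exactly what to avoid for a 2 GB file.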

python3 sockets never stop trying to read data

I'm using a TCP socket to read data from a website, HTTP requests to be exact. I want to use sockets, not requests or pycurl, so please do not suggest any higher-level library.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = wrap_socket(s)
response_bytes = b""
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
s.connect((website))
s.send(all of this works good)
# this is where my problems occur
while True:
    response_bytes += s.recv(4096)
    if not response_bytes: break
This solution should work perfectly according to multiple Stack Overflow posts. I want to use the most efficient way without a timeout; if I use try/except and set a socket timeout it works fine, but that's not very good IMO. Instead, this seems to make the code hang forever and try to read infinitely. Is there any reason it is doing this?
s.send(all of this works good)
Let me guess: this is doing an HTTP request with an explicit or implicit Connection: keep-alive. This header is implicit when doing an HTTP/1.1 request. Because of it, the server keeps the TCP connection open, since it is awaiting the next request from the client.
I want to use the most efficient way without a timeout.
The correct way is to properly understand the HTTP protocol: extract the size of the response body from the response headers and read exactly as much data as that size specifies. The easy way is to just do an HTTP/1.0 request without enabling HTTP keep-alive; in that case the server will close the TCP connection immediately after the response has been sent.
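A minimal sketch of the Content-Length approach (the host and request are placeholders, and chunked transfer encoding is deliberately not handled):
import socket
import ssl

host = "example.com"  # placeholder
ctx = ssl.create_default_context()
s = ctx.wrap_socket(socket.create_connection((host, 443)), server_hostname=host)
s.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")

# Read until the end of the response headers.
raw = b""
while b"\r\n\r\n" not in raw:
    raw += s.recv(4096)
headers, _, body = raw.partition(b"\r\n\r\n")

# Find the promised body size in the header block.
length = 0
for line in headers.split(b"\r\n"):
    if line.lower().startswith(b"content-length:"):
        length = int(line.split(b":", 1)[1])

# Read exactly as many body bytes as the server announced.
while len(body) < length:
    chunk = s.recv(4096)
    if not chunk:
        break  # connection closed early
    body += chunk
s.close()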
I want to use sockets, not requests or pycurl, so please do not suggest any higher-level library.
It looks like you want to implement HTTP yourself. There is a standard you should read in this case (RFC 7230 and its companions) which describes the fairly complex behavior of HTTP. Don't try to guess the protocol; read the actual specification.
This solution should work perfectly according to multiple Stack Overflow posts
No, you missed an important detail.
while True:
    response_bytes += s.recv(4096)
    if not response_bytes: break
If response_bytes is ever non-empty then it stays non-empty, so this becomes an infinite loop: once the server closes the connection, recv() returns b"" forever, and the break condition can never trigger again. Instead, test what each individual recv() returned, something like
while True:
    buf = s.recv(2048)
    if not buf:
        break
    response_bytes += buf

Connect to python server using sockets in python 2.7

I created a python server on port 8000 using python -m SimpleHTTPServer.
When I visit this URL from my web browser it shows the directory listing.
Now, I want to get the above content using python. So, for that what I did is
>>> import socket
>>> s = socket.socket(
... socket.AF_INET, socket.SOCK_STREAM)
>>> s.connect(("localhost", 8000))
>>> s.recv(1024)
But after s.recv(1024) nothing happens; it just waits there and prints nothing.
So, my question is: how do I get the above directory listing using Python? Also, can someone suggest a tutorial on socket programming with Python? I didn't like the official tutorial that much.
I also observed a strange thing: while I am trying to receive content with Python and nothing is happening, I cannot access localhost:8000 from my web browser, but as soon as I kill my Python program I can access it again.
Arguably the simplest way to get content over HTTP in Python is to use the urllib2 module. For example:
from urllib2 import urlopen
f = urlopen('http://localhost:8000')
for line in f:
    print line
This will print out the file hosted by SimpleHTTPServer.
But after s.recv(1024) nothing happens; it just waits there and prints nothing.
You simply open a socket and wait for data, but that's not how the HTTP protocol works. You have to send a request first if you want to receive a response (basically, you have to tell the server which directory you want to list or which file to download). If you really want to, you can send the request using raw sockets to train your skills, but a proper library is highly recommended (see Matthew Adams' response and the urllib2 example).
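For the training-your-skills route, a minimal sketch (HTTP/1.0, so the server closes the connection once the response is complete):
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("localhost", 8000))

# HTTP is request/response: the server sends nothing until asked.
s.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")

chunks = []
while True:
    data = s.recv(1024)
    if not data:  # server closed the connection: response is complete
        break
    chunks.append(data)
s.close()
print(b"".join(chunks))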
I also observed a strange thing: while I am trying to receive content with Python and nothing is happening, I cannot access localhost:8000 from my web browser, but as soon as I kill my Python program I can access it again.
This is because SimpleHTTPServer is single-threaded and doesn't support multiple connections simultaneously. If you would like to fix that, take a look at the answers here: BasicHTTPServer, SimpleHTTPServer and concurrency.
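For instance, a minimal sketch of a threaded variant for Python 2.7, so one stalled client no longer blocks the rest:
import SocketServer
import SimpleHTTPServer

class ThreadedHTTPServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer):
    # ThreadingMixIn makes each request run in its own thread
    pass

server = ThreadedHTTPServer(("localhost", 8000),
                            SimpleHTTPServer.SimpleHTTPRequestHandler)
server.serve_forever()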

How can I ignore the server response to save bandwidth?

I am using a server to send a piece of information to another server every second. The problem is that the other server's response is a few kilobytes, and this consumes bandwidth on the first server (about 2 GB in an hour). I would like to send the request and ignore the response (not even receive it, to save bandwidth).
I use a small Python script for this task, using urllib. I don't mind using any other tool, or even any other language, if that makes it send the request only.
A 5K reply is small stuff and is probably below the standard TCP window size of your OS. This means that even if you close your network connection just after sending the request and checking only the very first bytes of the reply (to be sure the request was really received), the server has probably already sent you the whole answer and the packets are already on the wire or on your computer.
If you cannot control (i.e. trim down) what the server sends in reply to your notification, the only alternative I can think of is to add another server on the remote machine that waits for a simple command, does the real request locally, and sends back to you just the result code. This can be done very easily, maybe even just with bash/perl/python, using for example netcat/wget locally.
By the way, there is something strange in your math, as Glenn Maynard correctly noted in a comment.
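A minimal sketch of that relay idea in Python 2 (the port and one-line command protocol are made up; the point is that the bulky reply never leaves the remote machine):
import socket
import urllib2

# Run on the remote machine. It receives a URL as a one-line command,
# performs the real request locally and returns only the status code.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("0.0.0.0", 9999))
listener.listen(1)

while True:
    conn, _ = listener.accept()
    url = conn.recv(1024).strip()  # the "simple command": a URL
    try:
        code = urllib2.urlopen(url).getcode()  # full reply stays local
    except Exception:
        code = 0
    conn.sendall(str(code))  # only a few bytes cross the network
    conn.close()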
For HTTP, you can send a HEAD request instead of GET or POST:
import urllib2
request = urllib2.Request('https://stackoverflow.com/q/5049244/')
request.get_method = lambda: 'HEAD' # override get_method
response = urllib2.urlopen(request) # make request
print response.code, response.url
Output
200 https://stackoverflow.com/questions/5049244/how-can-i-ignore-server-response-to-save-bandwidth
See How do you send a HEAD HTTP request in Python?
Sorry, but this does not make much sense and is likely a violation of the HTTP protocol. I consider such an idea weird and broken by design. Either make the remote server shut up, or configure your application (or whatever is running on the remote server) at a different protocol level, using a smarter protocol with less bandwidth usage. Everything else is hard not to consider nonsense.

When does urllib2 actually download a file from a url?

import urllib2

url = "http://example.com/file.xml"
data = urllib2.urlopen(url)
data.read()
The question is: when exactly is the file downloaded from the internet, when I call urlopen or when I call .read()? I see high traffic on my network interface at both times.
Without looking at the code, I'd expect that the following happens:
urlopen() opens the connection, and sends the query. Then the server starts feeding the reply. At this point, the data accumulates in buffers until they are full and the operating system tells the server to hold on for a while.
Then data.read() empties the buffer, so the operating system tells the server to go on, and the rest of the reply gets downloaded.
Naturally, if the reply is short enough, or if the .read() happens quickly enough, then the buffers do not have time to fill up and the download happens in one go.
I agree with ddaa. However, if you want to understand this sort of thing, you can set up a dummy server using something like nc (in *nix) and then open the URL in the interactive Python interpreter.
In one terminal, run nc -l 1234 which will open a socket and listen for connections on port 1234 of the local machine. nc will accept an incoming connection and display whatever it reads from the socket. Anything you type into nc will be sent over the socket to the remote connection, in this case Python's urlopen().
Run Python in another terminal and enter your code, i.e.
import urllib2
data = urllib2.urlopen('http://127.0.0.1:1234')
data.read()
The call to urlopen() will establish the connection to the server, send the request and then block waiting for a response. You will see nc print the HTTP request in its terminal.
Now type something into the terminal that is running nc. The call to urlopen() will still block until you press ENTER in nc, that is, until it receives a newline character. So urlopen() will not return until it has read at least one newline character. (For those concerned about possible buffering by nc: this is not an issue. urlopen() will block until it sees the first newline character.)
So it should be noted that urlopen() blocks until the first newline character is received, after which data can be read from the connection. In practice, HTTP responses are short multiline affairs, so urlopen() should return quite quickly.
