Using python sockets to receive large http requests

I am using python sockets to receive web style and soap requests. The code I have is
import socket
svrsocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host = socket.gethostname()
svrsocket.bind((host,8091))
svrsocket.listen(1)
clientSocket, clientAddress = svrsocket.accept()
message = clientSocket.recv(4096)
Some of the soap requests I receive, however, are huge: around 650 KB, and this could grow to several MB. Instead of the single recv I tried
message = ''
while True:
    data = clientSocket.recv(4096)
    if len(data) == 0:
        break
    message = message + data
but I never receive a 0 byte data chunk with Firefox or Safari, although the Python Socket HOWTO says I should.
What can I do to get round this?

Unfortunately you can't solve this on the TCP level - HTTP defines its own connection management, see RFC 2616. This basically means you need to parse the stream (at least the headers) to figure out when a connection could be closed.
See related questions here - https://stackoverflow.com/search?q=http+connection

Hiya
Firstly I want to reinforce what the previous answer said
Unfortunately you can't solve this on the TCP level
Which is true, you can't. However you can implement an http parser on top of your tcp sockets. And that's what I want to explore here.
Let's get started
Problem and Desired Outcome
Right now we are struggling to find the end of a data stream. We expected our stream to end with a fixed ending, but now we know that HTTP does not define any message suffix.
And yet, we move forward.
There is one question we can now ask, "Can we ever know the length of the message in advance?" and the answer to that is YES! Sometimes...
You see, HTTP/1.1 defines a header called Content-Length and, as you'd expect, it has exactly what we want: the content length. But there is something else in the shadows: Transfer-Encoding: chunked. Unless you really want to learn about it, we'll stay away from it for now.
Solution
Here is a solution. You're not gonna know what some of these functions are at first, but if you stick with me, I'll explain. Alright... Take a deep breath.
Assuming conn is a socket connection to the desired HTTP server
...
rawheaders = recvheaders(conn)
status, headers = dict_headers(io.StringIO(rawheaders))
# header values are strings, so convert before doing arithmetic
l_content = int(headers['Content-Length'])
#okay. we've got content length by magic
buffersize = 4096
message = b''
while True:
    if l_content <= 0: break
    # never ask for more than what is left of this message
    data = conn.recv(min(buffersize, l_content))
    if not data: break  # the other side closed the connection early
    message += data
    l_content -= len(data)
...
As you can see, we enter the loop already knowing the Content-Length as l_content.
While we iterate, we keep track of the remaining content by subtracting the length of each conn.recv(...) result from l_content.
When we've read as much data as l_content, we are done
if l_content <= 0: break
Frustration
Note: The code in these next bits is simplified, because the real thing can be a bit dense
So now you're asking, what is recvheaders(conn), what is dict_headers(io.StringIO(rawheaders)),
and HOW did we get headers['Content-Length']?!
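For starters, recvheaders. The HTTP/1.1 spec doesn't define a message suffix, but it does define something useful: a suffix for the http headers! The header block ends with an empty line, so that suffix is two CRLFs in a row, aka \r\n\r\n (a single CRLF only ends one header line). That means we know we've received the headers when we read CRLF CRLF. So we can write a function like
def recvheaders(sock, end=b'\r\n\r\n'):
    rawheaders = b''
    # read one byte at a time so we never swallow part of the body
    while not rawheaders.endswith(end):
        data = sock.recv(1)
        if not data:
            break  # connection closed before the headers ended
        rawheaders += data
    return rawheaders.decode('iso-8859-1')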
Next, parsing the headers.
import io
def dict_headers(ioheaders: io.StringIO):
    """
    parses an http response into the status-line and headers
    """
    #here I expect ioheaders to be io.StringIO
    #the status line is always the first line
    status = ioheaders.readline().strip()
    headers = {}
    for line in ioheaders:
        item = line.strip()
        if not item:
            break
        #headers look like this
        #'Header-Name' : 'Value'
        item = item.split(':', 1)
        if len(item) == 2:
            key, value = item
            headers[key] = value
    return status, headers
Here we read the status line then we continue to iterate over every remaining line
and build [key,value] pairs from Header: Value with
item = line.strip()
item = item.split(':', 1)
# We do split(':',1) to avoid cases like
# 'Header' : 'foo:bar' -> ['Header','foo','bar']
# when we want ---------> ['Header','foo:bar']
then we take that list and add it to the headers dict
#unpacking
#key = item[0], value = item[1]
key, value = item
headers[key] = value
BAM, we've created a map of headers
From there headers['Content-Length'] falls right out.
So,
This structure will work as long as you can guarantee that you will always receive Content-Length.
If you've made it this far WOW, thanks for taking the time and I hope this helped you out!
TLDR; if you want to know the length of an http message with sockets, write an http parser
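To tie the pieces above together, here is a minimal Python 3 sketch of reading one whole message by Content-Length. The name read_http_message and its bufsize parameter are made up for illustration. It assumes the sender always provides Content-Length (a missing header is treated as "no body") and it does not handle Transfer-Encoding: chunked. Header names are lowercased because HTTP header names are case-insensitive.

```python
import socket

def read_http_message(sock, bufsize=4096):
    """Read one HTTP message and return (start_line, headers, body)."""
    buf = b''
    # The header block ends at the first empty line, i.e. CRLF CRLF.
    while b'\r\n\r\n' not in buf:
        data = sock.recv(bufsize)
        if not data:
            raise ConnectionError('connection closed before headers ended')
        buf += data
    # Anything received after the blank line already belongs to the body.
    head, _, body = buf.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        key, sep, value = line.partition(':')
        if sep:
            headers[key.strip().lower()] = value.strip()
    # Assumption: no Content-Length header means there is no body.
    length = int(headers.get('content-length', 0))
    while len(body) < length:
        data = sock.recv(min(bufsize, length - len(body)))
        if not data:
            raise ConnectionError('connection closed before body ended')
        body += data
    return start_line, headers, body[:length]
```

On the server from the question this would be called as read_http_message(clientSocket) right after accept().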

Related

Telnet send command and then read response

This shouldn't be that complicated, but it seems that both the Ruby and Python Telnet libs have awkward APIs. Can anyone show me how to write a command to a Telnet host and then read the response into a string for some processing?
In my case "SEND" with a newline retrieves some temperature data on a device.
With Python I tried:
tn.write(b"SEND" + b"\r")
str = tn.read_eager()
which returns nothing.
In Ruby I tried:
tn.puts("SEND")
which should return something as well. The only thing I've gotten to work is:
tn.cmd("SEND") { |c| print c }
but you can't do much with c there.
Am I missing something here? I was expecting something like the Socket library in Ruby with some code like:
s = TCPSocket.new 'localhost', 2000
while line = s.gets # Read lines from socket
  puts line         # and print them
end

I found out that if you don't supply a block to the cmd method, it will give you back the response (assuming the telnet is not asking you for anything else). You can send the commands all at once (but get all of the responses bundled together) or do multiple calls, but you would have to do nested block callbacks (I was not able to do it otherwise).
require 'net/telnet'
class Client
  # Fetch weather forecast for NYC.
  #
  # @return [String]
  def response
    fetch_all_in_one_response
    # fetch_multiple_responses
  ensure
    disconnect
  end
  private
  # Do all the commands at once and return everything on one go.
  #
  # @return [String]
  def fetch_all_in_one_response
    client.cmd("\nNYC\nX\n")
  end
  # Do multiple calls to retrieve the final forecast.
  #
  # @return [String]
  def fetch_multiple_responses
    client.cmd("\r") do
      client.cmd("NYC\r") do
        client.cmd("X\r") do |forecast|
          return forecast
        end
      end
    end
  end
  # Connect to remote server.
  #
  # @return [Net::Telnet]
  def client
    @client ||= Net::Telnet.new(
      'Host' => 'rainmaker.wunderground.com',
      'Timeout' => false,
      'Output_log' => File.open('output.log', 'w')
    )
  end
  # Close connection to the remote server.
  def disconnect
    client.close
  end
end
forecast = Client.new.response
puts forecast
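For the Python side of the question, the same "send a command, then read the reply into a string" idea can be sketched with a plain socket instead of a Telnet library. This is only a sketch under assumptions: the function name send_command is made up, the device is assumed to end its reply with a newline, the command is assumed to need a CR LF line ending, and Telnet option negotiation is not handled at all.

```python
import socket

def send_command(sock, command, eol=b'\r\n', terminator=b'\n', bufsize=1024):
    """Send one command and return the reply, up to the terminator, as text."""
    sock.sendall(command + eol)
    reply = b''
    # Keep reading until the (assumed) end-of-reply marker shows up.
    while terminator not in reply:
        data = sock.recv(bufsize)
        if not data:
            break  # the device closed the connection
        reply += data
    return reply.decode('ascii', errors='replace')
```

A connection could be opened with socket.create_connection((host, 23), timeout=5), where host is the device's address, so that a device that never answers raises a timeout instead of blocking forever.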

Efficiently reading lines from compressed, chunked HTTP stream as they arrive

I've written a HTTP-Server that produces endless HTTP streams consisting of JSON-structured events. Similar to Twitter's streaming API. These events are separated by \n (according to Server-sent events with Content-Type:text/event-stream) and can vary in length.
The response is
chunked (HTTP 1.1 Transfer-Encoding:chunked) due to the endless stream
compressed (Content-Encoding: gzip) to save bandwidth.
I want to consume these lines in Python as soon as they arrive and as resource-efficient as possible, without reinventing the wheel.
As I'm currently using python-requests, do you know how to make it work?
If you think, python-requests cannot help here, I'm totally open for alternative frameworks/libraries.
My current implementation is based on requests and uses iter_lines(...) to receive the lines. But the chunk_size parameter is tricky. If set to 1 it is very CPU-intensive, since some events can be several kilobytes. If set to any value above 1, some events get stuck until the next ones arrive and the whole buffer has filled. And the time between events can be several seconds.
I expected that the chunk_size is some sort of "maximum number of bytes to receive" as in unix's recv(...). The corresponding man-page says:
The receive calls normally return any data available, up to the
requested amount, rather than waiting for receipt of the full amount
requested.
But this is obviously not how it works in the requests-library. They use it more or less as an "exact number of bytes to receive".
While looking at their source code, I couldn't identify which part is responsible for that. Maybe httplib's Response or ssl's SSLSocket.
As a workaround I tried padding my lines on the server to a multiple of the chunk-size.
But the chunk-size in the requests-library is used to fetch bytes from the compressed response stream.
So this won't work until I can pad my lines so that their compressed byte-sequence is a multiple of the chunk-size. But this seems far too hacky.
I've read that Twisted could be used for non-blocking, non-buffered processing of http streams on the client, but I only found code for creating stream responses on the server.

Thanks to Martijn Pieters' answer I stopped working around python-requests' behavior and looked for a completely different approach.
I ended up using pyCurl. You can use it similar to a select+recv loop without inverting the control flow and giving up control to a dedicated IO-loop as in Tornado, etc. This way it is easy to use a generator that yields new lines as soon as they arrive - without further buffering in intermediate layers that could introduce delay or additional threads that run the IO-loop.
At the same time, it is high-level enough, that you don't need to bother about chunked transfer encoding, SSL encryption or gzip compression.
This was my old code, where chunk_size=1 resulted in 45% CPU load and chunk_size>1 introduced additional lag.
import requests
class RequestsHTTPStream(object):
    def __init__(self, url):
        self.url = url
    def iter_lines(self):
        headers = {'Cache-Control':'no-cache',
                   'Accept': 'text/event-stream',
                   'Accept-Encoding': 'gzip'}
        response = requests.get(self.url, stream=True, headers=headers)
        return response.iter_lines(chunk_size=1)
Here is my new code based on pyCurl:
(Unfortunately the curl_easy_* style perform blocks completely, which makes it difficult to yield lines in between without using threads. Thus I'm using the curl_multi_* methods)
import pycurl
import urllib2
import httplib
import StringIO
class CurlHTTPStream(object):
    def __init__(self, url):
        self.url = url
        self.received_buffer = StringIO.StringIO()
        self.curl = pycurl.Curl()
        self.curl.setopt(pycurl.URL, url)
        self.curl.setopt(pycurl.HTTPHEADER, ['Cache-Control: no-cache', 'Accept: text/event-stream'])
        self.curl.setopt(pycurl.ENCODING, 'gzip')
        self.curl.setopt(pycurl.CONNECTTIMEOUT, 5)
        self.curl.setopt(pycurl.WRITEFUNCTION, self.received_buffer.write)
        self.curlmulti = pycurl.CurlMulti()
        self.curlmulti.add_handle(self.curl)
        self.status_code = 0
    SELECT_TIMEOUT = 10
    def _any_data_received(self):
        return self.received_buffer.tell() != 0
    def _get_received_data(self):
        result = self.received_buffer.getvalue()
        self.received_buffer.truncate(0)
        self.received_buffer.seek(0)
        return result
    def _check_status_code(self):
        if self.status_code == 0:
            self.status_code = self.curl.getinfo(pycurl.HTTP_CODE)
        if self.status_code != 0 and self.status_code != httplib.OK:
            raise urllib2.HTTPError(self.url, self.status_code, None, None, None)
    def _perform_on_curl(self):
        while True:
            ret, num_handles = self.curlmulti.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        return num_handles
    def _iter_chunks(self):
        while True:
            remaining = self._perform_on_curl()
            if self._any_data_received():
                self._check_status_code()
                yield self._get_received_data()
            if remaining == 0:
                break
            self.curlmulti.select(self.SELECT_TIMEOUT)
        self._check_status_code()
        self._check_curl_errors()
    def _check_curl_errors(self):
        for f in self.curlmulti.info_read()[2]:
            raise pycurl.error(*f[1:])
    def iter_lines(self):
        chunks = self._iter_chunks()
        return self._split_lines_from_chunks(chunks)
    @staticmethod
    def _split_lines_from_chunks(chunks):
        #same behaviour as requests' Response.iter_lines(...)
        pending = None
        for chunk in chunks:
            if pending is not None:
                chunk = pending + chunk
            lines = chunk.splitlines()
            if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
                pending = lines.pop()
            else:
                pending = None
            for line in lines:
                yield line
        if pending is not None:
            yield pending
This code tries to fetch as many bytes as possible from the incoming stream, without blocking unnecessarily if there are only a few. In comparison, the CPU load is around 0.2%

It is not requests' fault that your iter_lines() calls are blocking.
The Response.iter_lines() method calls Response.iter_content(), which calls urllib3's HTTPResponse.stream(), which calls HTTPResponse.read().
These calls pass along a chunk-size, which is what is passed on to the socket as self._fp.read(amt). This is the problematic call, as self._fp is a file object produced by socket.makefile() (as done by the httplib module); and this .read() call will block until amt (compressed) bytes are read.
This low-level socket file object does support a .readline() call that will work more efficiently, but urllib3 cannot make use of this call when handling compressed data; line terminators are not going to be visible in the compressed stream.
Unfortunately, urllib3 won't call self._fp.readline() when the response isn't compressed either; the way the calls are structured, it'd be hard to pass along that you want to read in line-buffering mode instead of in chunk-buffering mode as it is.
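To illustrate why line splitting has to happen after decompression, here is a small Python 3 sketch that takes compressed chunks of any size as they arrive and yields complete lines. It is not what requests or urllib3 do internally; the function name iter_lines and the idea that you can get at the raw compressed chunks (for example from your own socket loop) are assumptions made for the example.

```python
import zlib

def iter_lines(compressed_chunks):
    """Yield complete lines from an iterable of gzip-compressed byte chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b''
    for chunk in compressed_chunks:
        # Newlines only become visible once the bytes are decompressed.
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the rest.
        *lines, pending = pending.split(b'\n')
        yield from lines
    pending += decompressor.flush()
    if pending:
        yield from pending.splitlines()
```

Because the decompressor is fed whatever has arrived, the chunk size no longer needs to line up with event boundaries. Lines can still only appear as fast as the server flushes its compressor.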
I must say that HTTP is not the best protocol to use for streaming events; I'd use a different protocol for this. Websockets spring to mind, or a custom protocol for your specific use-case.

Am I using ioctl correctly?

I'm writing an http server in python3.3, just to learn how to do this sort of thing. In my function that parses a request, I want to use fcntl.ioctl to get the number of bytes that I can read in the socket, and I only do this when I see a kevent in the result of checking a kqueue that says there is stuff to read on the socket. But whenever I try to call fcntl.ioctl, I get OSError: [Errno 14] Bad address. What am I doing wrong? Also, this seems to be happening on the first call. Here is the relevant code:
def client_thread(kq, client_socket, methods):
    while True:
        events = kq.control([], 2, POLLTIME) #we pass an empty list of changes, because we don't have any changes to make to the events we are interested in.
        #we want a list that is at most two long. We listen for POLLTIME seconds.
        for event in events:
            if event != KILL_KEV: #there are only two events in our kqueue
                handle_client(client_socket, methods)
            else: #KILL_SOCK has a connection
                break
    client_socket.close()
    client_socket.shutdown()
def handle_client(client_socket, methods):
    request = parse_request(client_socket) #parse the request data in the client socket
    handlers = methods[request["request"]["method"]] #retrieve the appropriate list of handlers from the methods dict
    for path_match_pred, handler_func in handlers:
        if path_match_pred(path): #if the path matches whatever path predicate you've created...
            break
    response = handler_func(request) #... then call the appropriate handler function to handle the request
    send_response(client_socket, response) #and finally, send the response.
def parse_request(client_socket):
    """Returns the request data, parsed into a dictionary like this:
    {
        "request": {
            "method": method,
            "path": path,
            "version": HTTP version
        },
        "headers": header dictionary,
        "body": body data as a string
    }
    This should only be called if the client socket is ready for reading!
    """
    client_fd = client_socket.fileno() #get the file descriptor for the socket
    bytes_in_socket = 0
    fcntl.ioctl(client_fd, termios.FIONREAD, bytes_in_socket) #count the bytes in it
    #^^^^^^^^^THIS IS WHERE IT BREAKS
    print(bytes_in_socket, "bytes in socket")
    msg = bytearray() #make empty byte array
    while bytes_in_socket:
        msg.extend(client_socket.recv(bytes_in_socket)) #read the bytes we counted earlier
        fcntl.ioctl(client_fd, termios.FIONREAD, bytes_in_socket) #check for more bytes
        print(bytes_in_socket, "bytes left to read")

Note that in fcntl.ioctl's documentation, they mention the acceptable types for arg (the argument to the ioctl operation). In some cases, (like this), you need to pass a buffer, or an object that supports its interface. In particular, when you want to receive a value back. You're just passing an integer.
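As a sketch of what passing a buffer looks like: FIONREAD writes a C int back into the argument, so the argument needs to be a writable buffer of that size, for example a one-element array.array('i'). The helper name bytes_available is made up for the example, and this is Unix-only because fcntl and termios are.

```python
import array
import fcntl
import socket
import termios

def bytes_available(sock):
    """Return how many bytes are waiting to be read on sock."""
    buf = array.array('i', [0])  # writable C int for the kernel to fill in
    # mutate_flag=True lets ioctl write the result back into buf
    fcntl.ioctl(sock.fileno(), termios.FIONREAD, buf, True)
    return buf[0]

if __name__ == '__main__':
    a, b = socket.socketpair()
    a.sendall(b'hello')
    print(bytes_available(b), 'bytes in socket')
```

In the question's parse_request, bytes_in_socket = bytes_available(client_socket) would replace both ioctl calls, since a plain Python int can never be updated in place.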

Twisted IMAP4 Client QUOTA family of commands

Update: It seems to be the way untagged responses are handled by Twisted. The only examples I have found seem to iterate through the data received and somehow collect the response to their command, though I am not sure how...
I am trying to implement the IMAP4 quota commands as defined in RFC 2087 ( https://www.rfc-editor.org/rfc/rfc2087 ).
Code - ImapClient
class SimpleIMAP4Client(imap4.IMAP4Client):
    """
    A client with callbacks for greeting messages from an IMAP server.
    """
    greetDeferred = None
    def serverGreeting(self, caps):
        self.serverCapabilities = caps
        if self.greetDeferred is not None:
            d, self.greetDeferred = self.greetDeferred, None
            d.callback(self)
    def lineReceived(self, line):
        print "<" + str(line)
        return imap4.IMAP4Client.lineReceived(self, line)
    def sendLine(self, line):
        print ">" + str(line)
        return imap4.IMAP4Client.sendLine(self, line)
Code - QUOTAROOT Implementation
def cbExamineMbox(result, proto):
    """
    Callback invoked when examine command completes.
    Retrieve the subject header of every message in the mailbox.
    """
    print "Fetching storage space"
    cmd = "GETQUOTAROOT"
    args = _prepareMailboxName("INBOX")
    resp = ("QUOTAROOT", "QUOTA")
    d = proto.sendCommand(Command(cmd, args, wantResponse=resp))
    d.addCallback(cbFetch, proto)
    return d
def cbFetch(result, proto):
    """
    Finally, display headers.
    """
    print "Got Quota"
    print result
Output
Fetching storage space
>0005 GETQUOTAROOT INBOX
<* QUOTAROOT "INBOX" ""
<* QUOTA "" (STORAGE 171609 10584342)
<0005 OK Success
Got Quota
([], 'OK Success')
So I am getting the data but the result doesn't contain it, I am thinking it is because they are untagged responses?

Since the IMAP4 protocol mixes together lots of different kinds of information as "untagged responses", you probably also need to update some other parts of the parsing code in the IMAP4 client implementation.
Specifically, take a look at twisted.mail.imap4.Command and its finish method. Also look at twisted.mail.imap4.IMAP4Client._extraInfo, which is what is passed as the unusedCallback to Command.finish.
To start, you can check to see if the untagged responses to the QUOTA command are being sent to _extraInfo (and then dropped (well, logged)).
If so, I suspect you want to teach Command to recognize QUOTA and QUOTAROOT untagged responses to the QUOTA command, so that it collects them and sends them as part of the result it fires its Deferred with.
If not, you may need to dig a bit deeper into the logic of Command.finish to see where the data does end up.
You may also want to actually implement the Command.wantResponse feature, which appears to be only partially formed currently (ie, lots of client code tries to send interesting values into Command to initialize that attribute, but as far as I can tell nothing actually uses the value of that attribute).
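Wherever the untagged lines end up being collected, they still need to be turned into usable values. As a rough, Twisted-independent sketch in Python 3, the function below parses a QUOTA line of the shape shown in the question's output. The name parse_quota and the returned dict layout are made up for illustration, the pattern only covers simple quoted or atom quota roots, and how to hook it into Command or _extraInfo is not shown.

```python
import re

# Matches lines such as: * QUOTA "" (STORAGE 171609 10584342)
QUOTA_RE = re.compile(r'^\* QUOTA (?P<root>"[^"]*"|\S+) \((?P<resources>[^)]*)\)$')

def parse_quota(line):
    """Parse an untagged QUOTA response into a dict, or return None."""
    match = QUOTA_RE.match(line)
    if match is None:
        return None
    root = match.group('root').strip('"')
    fields = match.group('resources').split()
    resources = {}
    # Resources come as triples: name, current usage, limit.
    for name, usage, limit in zip(fields[0::3], fields[1::3], fields[2::3]):
        resources[name] = (int(usage), int(limit))
    return {'root': root, 'resources': resources}
```

Lines that are not QUOTA responses, such as the QUOTAROOT line or the tagged OK, come back as None and can be passed on unchanged.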

Decode HTTP packet content in python as seen in wireshark

Ok, so basically what I want to do is intercept some packets that I know contain some JSON data. But HTTP packets aren't human-readable, so that's my problem: I need to make the entire packet (not just the header, which is already plain text) human-readable. I have no experience with networking at all.
import pcap
from impacket import ImpactDecoder, ImpactPacket
def print_packet(pktlen, data, timestamp):
    if not data:
        return
    decoder = ImpactDecoder.EthDecoder()
    ether = decoder.decode(data)
    iphdr = ether.child()
    tcphdr = iphdr.child()
    if iphdr.get_ip_src() == '*******':
        print tcphdr
p = pcap.pcapObject()
dev = 'wlan0'
p.open_live(dev, 1600, 0, 100)
try:
    p.setfilter('tcp', 0, 0)
    while 1:
        p.loop(1, print_packet)
except KeyboardInterrupt:
    print 'shutting down'
I've found tools like libpcap-python, scapy, Impacket pcapy and so on. They all seem good, but I can't figure out how to decode the packets properly with them.
Wireshark has this thing called "Line-based text data: text/html" which basically displays the information I'm after, so I thought it would be trivial to get the same info with python, it turns out it was not.

Both HTTP and JSON are human-readable. In Wireshark, right-click a packet that belongs to your HTTP transaction and select Follow TCP Stream, which should display the transaction in a human-readable form.
