Efficiently reading lines from compressed, chunked HTTP stream as they arrive - python

I've written an HTTP server that produces endless HTTP streams of JSON-structured events, similar to Twitter's streaming API. The events are separated by \n (as in Server-sent events with Content-Type: text/event-stream) and can vary in length.
The response is
- chunked (HTTP 1.1 Transfer-Encoding: chunked) because the stream is endless, and
- compressed (Content-Encoding: gzip) to save bandwidth.
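For illustration (the payloads below are made up), the decompressed stream the client sees is just newline-separated JSON events:

# Hypothetical example of the decompressed event stream: one JSON event per line,
# with arbitrary pauses between events.
sample_stream = (
    b'{"event": "created", "id": 17}\n'
    b'{"event": "updated", "id": 17, "size": 4096}\n'
)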
I want to consume these lines in Python as soon as they arrive and as resource-efficient as possible, without reinventing the wheel.
As I'm currently using python-requests, do you know how to make it work?
If you think python-requests cannot help here, I'm totally open to alternative frameworks/libraries.
My current implementation is based on requests and uses iter_lines(...) to receive the lines. But the chunk_size parameter is tricky. If set to 1 it is very CPU-intensive, since some events can be several kilobytes. If set to any value above 1, some events get stuck until the next ones arrive and the whole buffer "gets filled". And the time between events can be several seconds.
I expected chunk_size to be some sort of "maximum number of bytes to receive", as with Unix's recv(...). The corresponding man page says:
The receive calls normally return any data available, up to the
requested amount, rather than waiting for receipt of the full amount
requested.
But this is obviously not how it works in the requests library; it is used more or less as an "exact number of bytes to receive".
While looking at their source code, I couldn't identify which part is responsible for that. Maybe httplib's Response or ssl's SSLSocket.
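To see the recv() semantics the man page describes in isolation, a plain blocking socket returns whatever has already arrived, up to the requested amount (a sketch against a placeholder host):

import socket

# recv(4096) returns as soon as *any* data is available, up to 4096 bytes;
# it does not wait until exactly 4096 bytes have arrived.
s = socket.create_connection(('example.com', 80))
s.sendall(b'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n')
chunk = s.recv(4096)
print(len(chunk))  # often less than 4096
s.close()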
As a workaround I tried padding my lines on the server to a multiple of the chunk size. But the chunk size in the requests library is used to fetch bytes from the compressed response stream, so this won't work unless I can pad my lines such that their compressed byte sequence is a multiple of the chunk size. That seems far too hacky.
I've read that Twisted could be used for non-blocking, non-buffered processing of HTTP streams on the client side, but I only found code for creating streaming responses on the server.

Thanks to Martijn Pieters' answer, I stopped trying to work around python-requests' behavior and looked for a completely different approach.
I ended up using pycurl. You can use it much like a select+recv loop, without inverting the control flow and handing control over to a dedicated IO loop as in Tornado, etc. This makes it easy to write a generator that yields new lines as soon as they arrive, without further buffering in intermediate layers that could introduce delay and without extra threads to run the IO loop.
At the same time it is high-level enough that you don't have to deal with chunked transfer encoding, SSL encryption or gzip compression yourself.
This was my old code, where chunk_size=1 resulted in 45% CPU load and chunk_size>1 introduced additional lag.
import requests

class RequestsHTTPStream(object):
    def __init__(self, url):
        self.url = url

    def iter_lines(self):
        headers = {'Cache-Control': 'no-cache',
                   'Accept': 'text/event-stream',
                   'Accept-Encoding': 'gzip'}
        response = requests.get(self.url, stream=True, headers=headers)
        return response.iter_lines(chunk_size=1)
Here is my new code based on pyCurl:
(Unfortunately the curl_easy_*-style perform() blocks completely, which makes it difficult to yield lines in between without using threads. That's why I'm using the curl_multi_* methods.)
import pycurl
import urllib2
import httplib
import StringIO


class CurlHTTPStream(object):
    SELECT_TIMEOUT = 10

    def __init__(self, url):
        self.url = url
        self.received_buffer = StringIO.StringIO()

        self.curl = pycurl.Curl()
        self.curl.setopt(pycurl.URL, url)
        self.curl.setopt(pycurl.HTTPHEADER, ['Cache-Control: no-cache', 'Accept: text/event-stream'])
        self.curl.setopt(pycurl.ENCODING, 'gzip')
        self.curl.setopt(pycurl.CONNECTTIMEOUT, 5)
        self.curl.setopt(pycurl.WRITEFUNCTION, self.received_buffer.write)

        self.curlmulti = pycurl.CurlMulti()
        self.curlmulti.add_handle(self.curl)

        self.status_code = 0

    def _any_data_received(self):
        return self.received_buffer.tell() != 0

    def _get_received_data(self):
        result = self.received_buffer.getvalue()
        self.received_buffer.truncate(0)
        self.received_buffer.seek(0)
        return result

    def _check_status_code(self):
        if self.status_code == 0:
            self.status_code = self.curl.getinfo(pycurl.HTTP_CODE)
        if self.status_code != 0 and self.status_code != httplib.OK:
            raise urllib2.HTTPError(self.url, self.status_code, None, None, None)

    def _perform_on_curl(self):
        while True:
            ret, num_handles = self.curlmulti.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        return num_handles

    def _iter_chunks(self):
        while True:
            remaining = self._perform_on_curl()
            if self._any_data_received():
                self._check_status_code()
                yield self._get_received_data()
            if remaining == 0:
                break
            self.curlmulti.select(self.SELECT_TIMEOUT)

        self._check_status_code()
        self._check_curl_errors()

    def _check_curl_errors(self):
        for f in self.curlmulti.info_read()[2]:
            raise pycurl.error(*f[1:])

    def iter_lines(self):
        chunks = self._iter_chunks()
        return self._split_lines_from_chunks(chunks)

    @staticmethod
    def _split_lines_from_chunks(chunks):
        # same behaviour as requests' Response.iter_lines(...)
        pending = None
        for chunk in chunks:
            if pending is not None:
                chunk = pending + chunk
            lines = chunk.splitlines()
            if lines and lines[-1] and chunk and lines[-1][-1] == chunk[-1]:
                pending = lines.pop()
            else:
                pending = None
            for line in lines:
                yield line
        if pending is not None:
            yield pending
This code tries to fetch as many bytes as possible from the incoming stream without blocking unnecessarily when only a few are available. In comparison, the CPU load is around 0.2%.
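For reference, consuming the stream then looks like this (a minimal usage sketch; the endpoint URL is a placeholder):

# Minimal usage sketch of the class above; the endpoint URL is hypothetical.
stream = CurlHTTPStream('http://example.com/events')
for line in stream.iter_lines():
    if line:  # skip empty keep-alive lines
        print(line)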

It is not requests' fault that your iter_lines() calls are blocking.
The Response.iter_lines() method calls Response.iter_content(), which calls urllib3's HTTPResponse.stream(), which calls HTTPResponse.read().
These calls pass along a chunk-size, which is what is passed on to the socket as self._fp.read(amt). This is the problematic call, as self._fp is a file object produced by socket.makefile() (as done by the httplib module); and this .read() call will block until amt (compressed) bytes are read.
This low-level socket file object does support a .readline() call that will work more efficiently, but urllib3 cannot make use of this call when handling compressed data; line terminators are not going to be visible in the compressed stream.
Unfortunately, urllib3 won't call self._fp.readline() even when the response isn't compressed; the way the calls are structured, it would be hard to pass along that you want to read in line-buffered mode instead of the chunk-buffered mode used now.
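To make the chain concrete, here is a rough sketch of what iter_content() boils down to (this is not the actual requests/urllib3 source, just an illustration of the calls named above):

# Illustration only: a condensed view of the call chain described above.
def iter_content_simplified(raw, chunk_size):
    # raw is urllib3's HTTPResponse; raw.read(amt) ends up calling
    # self._fp.read(amt) on the socket file object, which blocks until
    # `amt` bytes of the *compressed* stream have been received.
    while True:
        data = raw.read(chunk_size, decode_content=True)
        if not data:
            break
        yield data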
I must say that HTTP is not the best protocol to use for streaming events; I'd use a different protocol for this. WebSockets spring to mind, or a custom protocol for your specific use case.
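If you do switch protocols, a WebSocket client gives you message-at-a-time delivery out of the box. A minimal sketch using the third-party websockets package (assumed installed; the endpoint URL is hypothetical):

import asyncio
import websockets  # third-party package, assumed installed

async def consume():
    # Each iteration yields one complete message as soon as it arrives.
    async with websockets.connect('ws://example.com/events') as ws:
        async for message in ws:
            print(message)

asyncio.run(consume())  # Python 3.7+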

Related

Does downloading files and writing in streaming byte mode create two requests in Python?

Say I have the below function. (Sorry if I included too much.)
@backoff.on_exception(
    backoff.expo,
    Exception,
    max_tries=10,
    factor=3600,
)
@sleep_and_retry
@limits(calls=10, period=60)
def download_file(url: str, filename: str, speedlimit: bool) -> requests.Response:
    """Download PDF files."""
    response = requests.get(url, headers=pdf_headers, stream=True, timeout=None)
    response.raise_for_status()
    if response.status_code != 200:
        response.close()
        raise Exception
    else:
        fin = f"{config.Paths.pdf_path}{filename}.pdf"
        with tqdm.wrapattr(
            open(fin, "wb"),
            "write",
            total=int(response.headers.get("content-length", 0)),
        ) as fout:
            try:
                for chunk in response.iter_content(chunk_size=8):
                    fout.write(chunk)
                    if speedlimit:
                        time.sleep(0.000285)  # Limits bandwidth to ~20kB/s
                time.sleep(random.uniform(6, 7.2))
            except Exception:
                fout.flush()
                response.close()
    return response
Does separating the response code check and adding stream, instead of doing it in one single shot, cause two requests instead of a single one?
I would think not, given the connection should already remain active (pdf_headers = {"Connection": "keep-alive"}) from the requests.get(). But I am troubleshooting rate limiting and seeing what I feel is a really low rate limit (less than 10 calls a minute), possibly even less than 500 calls an hour. Unfortunately, it's a government (state) website (IIS/ASP) that doesn't send retry-after headers with a 429 status code. And no rate limit listed in their robots.txt. (TY for making my job harder, government IT people.)
I am not using threading/multi-processing. So I know it's not an issue with multiple calls to a single server.
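One quick way to see how many requests actually go out (a diagnostic sketch, not an answer to the rate-limit question) is to enable urllib3's debug logging before calling download_file; it logs one line per HTTP request and one per newly opened connection:

import logging

# With this enabled, each outgoing request and each newly opened connection shows up
# in the log output, so you can check whether one call produces one request or two.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)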

Google Cloud Functions returning 401 error after thousands of correct responses

I'm writing a genetic algorithm that uses a Cloud Function as its evaluation function and have been running into 401 response codes about 1/3 of the way into the process. I'm not entirely certain what is going on here given the numerous successful invocations, and there are no indications in the Google Cloud logs that anything is amiss (either in the CF logs or the generic cloud-wide logs).
The intention is to use it as a generic evaluation function for more 'rigorous' projects; for this one, however, I am passing a list of strings along with the 'correct' string and returning the ASCII distance between each string and the correct string. This information comes in as a JSON packet. The genetic algorithm basically just needs to discover the correct string to complete. This is essentially a play on the one-max optimization problem.
For reference, this has only really happened since I scaled up the number of invocations and passed strings. The process ran fine with a smaller number of evaluations and fewer strings passed; when I scale it up a bit, it chokes about halfway through. Note that the entire purpose of using Cloud Functions is to scale the evaluation calls up massively; otherwise I'd just run this locally.
The Cloud Function is fairly trivial (evaluating the string optimization problem):
import json

# Evaluate distance from expected to individual
def fitness(bitstring, expected, dna_size):
    f = 0
    for c in range(dna_size):
        f += abs(ord(bitstring[c]) - ord(expected[c]))
    return f

def evaluateBitstrings(request):
    resp = []
    request_json = request.get_json()
    if request_json and 'bitstrings' in request_json and 'expected' in request_json and 'dna_size' in request_json:
        for bitstring in request_json['bitstrings']:
            f = fitness(bitstring, request_json['expected'], int(request_json['dna_size']))
            resp.append((bitstring, f))
        return str(json.dumps(resp))
    else:
        return 'Error: missing JSON information'
The JSON packet I'm sending comprises a list of 1000 strings, so it's really just doing a loop over those and creating a JSON packet of distances to return.
It is configured to use 512 MB of memory, has a 180-second timeout, and uses authentication to protect against anonymous calls. I trigger the call locally via Python, asyncio, and aiohttp with the authorization included in the header (authenticating locally in Windows Subsystem for Linux (Ubuntu) via gcloud).
Here is the relevant bit of Python code (using 3.6). One of the issues was being bound locally by the large number of aiohttp calls, and I came across this post on using semaphores to manage the number of concurrent calls.
import aiohttp
import asyncio
...
base_url = "https://<GCF:CF>"
headers = {'Content-Type': 'application/json'}
...
token = sys.argv[2]  # call to `gcloud auth print-identity-token` as parameter
headers['Authorization'] = 'bearer ' + token

async def fetch(bitstrings, expected, session):
    b = {'bitstrings': bitstrings,
         'expected': expected,
         'dna_size': len(expected)}
    async with session.post(base_url, data=json.dumps(b), headers=headers) as response:
        assert response.status == 200
        data = await response.read()
        try:
            return json.loads(data)
        except:
            print("An error occurred: {0}".format(data))

async def bound_fetch(sem, bitstrings, expected, session):
    async with sem:
        return await fetch(bitstrings, expected, session)

async def run(iterable, expected, token):
    tasks = []
    sem = asyncio.Semaphore(1000)
    async with aiohttp.ClientSession(trust_env=True) as session:
        chunks = [iterable[x:x+1000] for x in range(0, len(iterable), 1000)]
        # build up JSON array
        for chunk in chunks:
            task = asyncio.ensure_future(bound_fetch(sem, chunk, expected, session))
            tasks.append(task)
        ...

# Within the GA code
for generation in range(ga.GENERATIONS):
    ...
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(run(population, ga.OPTIMAL, token))
    responses = []
    results = loop.run_until_complete(future)
    for res in results:  # loop through task results
        for r in res:  # json coming in as a list
            responses.append((r[0], float(r[1])))  # string, fitness
For further reference, I have run this locally via the functions-framework and did not run into this issue. It only happens when I reach out to the cloud.
Edit: I resolved the Forbidden issue (the token needed to be refreshed, and I simply fired off a subprocess call to the relevant gcloud command), however now I am seeing a new issue:
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host us-central1-cloud-function-<CF>:443 ssl:default [Connect call failed ('216.239.36.54', 443)]
This now happens sporadically (10 minutes in, 70 minutes in, etc.). I'm starting to wonder if I'm fighting a losing battle here.
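For completeness, the token-refresh workaround mentioned in the edit amounts to something like this (a sketch; the helper name is mine):

import subprocess

def fresh_identity_token():
    # Re-runs the gcloud command used above and returns a fresh identity token.
    out = subprocess.check_output(['gcloud', 'auth', 'print-identity-token'])
    return out.decode().strip()

headers['Authorization'] = 'bearer ' + fresh_identity_token()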

Twisted Server receive data stream via POST, read request.content.read() byte by byte in a deferred for over an hour

I'm intending to receive a binary stream of data via an HTTP POST call.
I believe the client side is working: it writes chunks of bytes to the server, and I can see the amount of data being sent with tcpdump. Yet Twisted's request.content file-like object only starts producing output once the client disconnects.
This is what the server handler looks like:
def render(self, request):
    if request.path == '/incoming-stream':
        d = deferLater(reactor, 0, lambda: request)
        d.addCallback(self.async_read)
        return NOT_DONE_YET

def async_read(self, request):
    sys.stdout.write('\nasync_read ' + str(request) + '\n')
    sys.stdout.flush()
    while True:
        byte = request.content.read(1)  # <--- read one byte
        if len(byte) > 0:
            sys.stdout.write(repr(byte))
            sys.stdout.flush()
        else:
            break
    sys.stdout.write('\nfinished ' + str(request) + '\n')
    sys.stdout.flush()
    request.write(b"finished")
    request.finish()
If I can't do this with POST, I have no problem switching over to WebSocket, but I'd first like to try to get this done via POST. The POSTs are long-running (one new POST request every hour, with the request staying alive and receiving data for that hour) and carry relatively high-bandwidth sensor data at approximately 1 kbps.
I am aware that there are better methods of transferring the data (WebSocket, MQTT, AMQP), but POST and WebSocket will give me the least amount of trouble when receiving the data through an NGINX SSL endpoint. Currently NGINX is not being used (to rule out any buffering it could be causing).
Twisted Web does not support streaming uploads in its IResource abstraction.
See https://twistedmatrix.com/trac/ticket/288

How do I handle streaming messages with Python gRPC

I'm following this Route_Guide sample.
The sample in question fires off and reads messages without replying to a specific message. The latter is what I'm trying to achieve.
Here's what I have so far:
import grpc
...
channel = grpc.insecure_channel(conn_str)
try:
    grpc.channel_ready_future(channel).result(timeout=5)
except grpc.FutureTimeoutError:
    sys.exit('Error connecting to server')
else:
    stub = MyService_pb2_grpc.MyServiceStub(channel)
    print('Connected to gRPC server.')
    this_is_just_read_maybe(stub)

def this_is_just_read_maybe(stub):
    responses = stub.MyEventStream(stream())
    for response in responses:
        print(f'Received message: {response}')
        if response.something:
            # okay, now what? how do i send a message here?
            pass

def stream():
    yield my_start_stream_msg
    # this is fine, i receive this server-side
    # but i can't check for incoming messages here
I don't seem to have a read() or write() on the stub, everything seems to be implemented with iterators.
How do I send a message from this_is_just_read_maybe(stub)?
Is that even the right approach?
My Proto is a bidirectional stream:
service MyService {
    rpc MyEventStream (stream StreamingMessage) returns (stream StreamingMessage) {}
}
What you're trying to do is perfectly possible and will probably involve writing your own request iterator object that can be given responses as they arrive rather than using a simple generator as your request iterator. Perhaps something like
class MySmarterRequestIterator(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._responses_so_far = []

    def __iter__(self):
        return self

    def _next(self):
        # some logic that depends upon what responses have been seen
        # before returning the next request message
        return <your message value>

    def __next__(self):  # Python 3
        return self._next()

    def next(self):  # Python 2
        return self._next()

    def add_response(self, response):
        with self._lock:
            self._responses_so_far.append(response)
that you then use like
my_smarter_request_iterator = MySmarterRequestIterator()
responses = stub.MyEventStream(my_smarter_request_iterator)
for response in responses:
    my_smarter_request_iterator.add_response(response)
There will probably be locking and blocking in your _next implementation to handle the situation of gRPC Python asking your object for the next request that it wants to send and your responding (in effect) "wait, hold on, I don't know what request I want to send until after I've seen how the next response turned out".
Instead of writing a custom iterator, you can also use a blocking queue to implement send- and receive-like behaviour for the client stub:
import queue
...
send_queue = queue.SimpleQueue()  # or Queue if using Python before 3.7
my_event_stream = stub.MyEventStream(iter(send_queue.get, None))

# send
send_queue.put(StreamingMessage())

# receive
response = next(my_event_stream)  # type: StreamingMessage
This makes use of the sentinel form of iter, which converts a regular function into an iterator that stops when it reaches a sentinel value (in this case None).
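A quick standalone illustration of that iter(callable, sentinel) form:

import queue

q = queue.SimpleQueue()
q.put('a')
q.put('b')
q.put(None)  # the sentinel ends the iteration

print(list(iter(q.get, None)))  # ['a', 'b']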

Using python sockets to receive large http requests

I am using Python sockets to receive web-style and SOAP requests. The code I have is
import socket
svrsocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host = socket.gethostname()
svrsocket.bind((host,8091))
svrsocket.listen(1)
clientSocket, clientAddress = svrsocket.accept()
message = clientSocket.recv(4096)
Some of the SOAP requests I receive, however, are huge: 650k huge, and this could become several MB. Instead of the single recv I tried
message = ''
while True:
    data = clientSocket.recv(4096)
    if len(data) == 0:
        break
    message = message + data
but I never receive a 0-byte data chunk with Firefox or Safari, although the Python socket HOWTO says I should.
What can I do to get around this?
Unfortunately you can't solve this on the TCP level - HTTP defines its own connection management, see RFC 2616. This basically means you need to parse the stream (at least the headers) to figure out when a connection could be closed.
See related questions here - https://stackoverflow.com/search?q=http+connection
Hiya
Firstly I want to reinforce what the previous answer said
Unfortunately you can't solve this on the TCP level
Which is true, you can't. However you can implement an http parser on top of your tcp sockets. And that's what I want to explore here.
Let's get started
Problem and Desired Outcome
Right now we are struggling to find the end of a data stream. We expected our stream to end with a fixed terminator, but now we know that HTTP does not define any message suffix.
And yet, we move forward.
There is one question we can now ask: "Can we ever know the length of the message in advance?" And the answer to that is YES! Sometimes...
You see, HTTP/1.1 defines a header called Content-Length, and as you'd expect it holds exactly what we want: the content length. But there is something else lurking in the shadows: Transfer-Encoding: chunked. Unless you really want to learn about it, we'll stay away from it for now.
Solution
Here is a solution. You're not gonna know what some of these functions are at first, but if you stick with me, I'll explain. Alright... Take a deep breath.
Assuming conn is a socket connection to the desired HTTP server
...
rawheaders = recvheaders(conn, end=CRLF)
status, headers = dict_header(io.StringIO(rawheaders))
l_content = int(headers['Content-Length'])
# okay, we've got the content length by magic
buffersize = 4096
message = b''
while True:
    if l_content <= 0: break
    data = conn.recv(buffersize)
    message += data
    l_content -= len(data)
...
As you can see, we enter the loop already knowing the Content-Length as l_content
While we iterate, we keep track of the remaining content by subtracting the length of each conn.recv(buffersize) result from l_content.
When we've read at least as much data as l_content, we are done
if l_content <= 0: break
Frustration
Note: For some of these next bits I'm gonna give pseudocode, because the real code can be a bit dense
So now you're asking: what is rawheaders = recvheaders(conn, end=CRLF), what is status, headers = dict_header(io.StringIO(rawheaders)),
and HOW did we get headers['Content-Length']?!
For starters, recvheaders. The HTTP/1.1 spec doesn't define a message suffix, but it does define something useful: a terminator for the HTTP headers! The header block ends with an empty line, i.e. a double CRLF (\r\n\r\n). That means we know we have received all the headers once we see that blank line. So we can write a function like
def recvheaders(sock, end='\r\n\r\n'):
    rawheaders = ''
    # read a byte at a time so we stop exactly at the blank line ending the headers
    while not rawheaders.endswith(end):
        byte = sock.recv(1)
        if not byte:
            break  # connection closed before the header block ended
        rawheaders += byte.decode('iso-8859-1')
    return rawheaders
Next, parsing the headers.
def dict_header(ioheaders: io.StringIO):
    """
    parses an http response into the status-line and headers
    """
    # here I expect ioheaders to be io.StringIO
    # the status line is always the first line
    status = ioheaders.readline().strip()
    headers = {}
    for line in ioheaders:
        item = line.strip()
        if not item:
            break
        # headers look like this
        # 'Header-Name' : 'Value'
        item = item.split(':', 1)
        if len(item) == 2:
            key, value = item
            headers[key] = value
    return status, headers
Here we read the status line then we continue to iterate over every remaining line
and build [key,value] pairs from Header: Value with
item = line.strip()
item = item.split(':', 1)
# We do split(':',1) to avoid cases like
# 'Header' : 'foo:bar' -> ['Header','foo','bar']
# when we want ---------> ['Header','foo:bar']
then we take that list and add it to the headers dict
#unpacking
#key = item[0], value = item[1]
key, value = item
headers[key] = value
BAM, we've created a map of headers
From there headers['Content-Length'] falls right out.
So,
This structure will work as long as you can guarantee that you will always receive a Content-Length header
If you've made it this far WOW, thanks for taking the time and I hope this helped you out!
TL;DR: if you want to know the length of an HTTP message with sockets, write an HTTP parser
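Putting the pieces together with the recvheaders and dict_header helpers from above, a minimal end-to-end read might look roughly like this (a sketch that assumes a Content-Length response and ignores chunked transfer encoding; the host is a placeholder):

import io
import socket

CRLF = '\r\n\r\n'  # terminator of the header block

conn = socket.create_connection(('example.com', 80))
conn.sendall(b'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n')

status, headers = dict_header(io.StringIO(recvheaders(conn, end=CRLF)))

body = b''
l_content = int(headers['Content-Length'])
while l_content > 0:
    data = conn.recv(4096)
    if not data:
        break
    body += data
    l_content -= len(data)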
