Compressing JSON sent via CGI in Python - python

The webApp I'm currently developing requires large JSON files to requested by the client, built on the server using Python and sent back to the client. The solution is implemented via CGI, and is working correctly in every way.
At this stage I'm just employing various techniques to minimize the size of the resulting JSON objects sent back to the client which are around 5-10mb ( Without going into detail, this is more or less fixed, and cannot be lazy loaded in any way).
The host we're using doesn't support mod_deflate or mod_gzip, so while we can't configure Apache to automatically create gzipped content on the server with .htaccess, I figure we'll still be able to receive it and decode on the client side as long as the Content-encoding header is set correctly.
What I was wondering, is what is the best way to achieve this. Gzipping something in Python is trivial. I already know how to do that, but the problem is:
How do I compress the data in such a way, that printing it to the output stream to send via CGI will be both compressed, and readable to the client?
The files have to be created on the fly, based upon input data, so storing premade and prezipped files is not an option, and they have to be received via xhr in the webApp.
My initial experiments with compressing the JSON string with gzip and io.stringIO, then printing it to the output stream caused it to be printed in Python's normal byte format eg: b'\n\x91\x8c\xbc\xd4\xc6\xd2\x19\x98\x14x\x0f1q!\xdc|C\xae\xe0 and such, which bloated the request to twice it's normal size...
I was wondering if someone could point me in the right direction here with how I could accomplish this, if it is indeed possible.
I hope I've articulated my problem correctly.
Thank you.

I guess you use print() (which first converts its argument to a string before sending it to stdout) or sys.stdout (which only accepts str objects).
To write directly on stdout, you can use sys.stdout.buffer, a file-like object that supports bytes objects:
import sys
import gzip
s = 'foo'*100
sys.stdout.buffer.write(gzip.compress(s.encode()))
Which gives valid gzip data:
$ python3 foo.py | gunzip
foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo

Thanks for the answers Valentin and Phillip!
I managed to solve the issue, both of you contributed to the final answer. Turns out it was a combination of things.
Here's the final code that works:
response = json.JSONEncoder().encode(loadData)
sys.stdout.write('Content-type: application/octet-stream\n')
sys.stdout.write('Content-Encoding: gzip\n\n')
sys.stdout.flush()
sys.stdout.buffer.write(gzip.compress(response.encode()))
After switching over to sys.stdout instead of using print to print the headers, and flushing the stream it managed to read correctly. Which is pretty curious... Always something more to learn.
Thanks again!

Related

Can I stream a GZIP dataset directly from a server through HTTP using python?

I want to read data from a GZIP dataset file directly the internet without downloading the complete file. Considering the size of the dataset, is it possible in python to stream the data directly from the server through HTTP and read the data? I took a look at zlib and gzip packages. I'm a beginner to python, I want to know if this is possible using python or any other language, if possible any references to such code. Thanks in Advance!

Using Python urllib2, How can I stream between a GET and a POST?

I want to write code to transfer a file from one site to another. This can be a large file, and I'd like to do it without creating a local temporary file.
I saw the trick of using mmap to upload a large file in Python: "HTTP Post a large file with streaming", but what I really need is a way to link up the response from the GET to creating the POST.
Anyone done this before?
You can't, or at least shouldn't.
urllib2 request objects have no way to stream data into them on the fly, period. And in the other direction, response objects are file-like objects, so in theory you can read(8192) out of them instead of read(), but for most protocols—including HTTP—it will either often or always read the whole response into memory and serve your read(8192) calls out of its buffer, making it pointless. So, you have to intercept the request, steal the socket out of it, and deal with it manually, at which point urllib2 is getting in your way more than it's helping.
urllib2 makes some things easy, some things much harder than they should be, and some things next to impossible; when it isn't making things easy, stop using it.
One solution is to use a higher-level third-party library. For example, requests gets you half-way there (it makes it very easy to stream from a response, but can only stream into a response in limited situations), and requests-toolbelt gets you the rest of the way there (it adds various ways to stream-upload).
The other solution is to use a lower-level library. And here, you don't even have to leave the stdlib. httplib forces you to think in terms of sending and receiving things bit by bit, but that's exactly what you want. On the get request, you can just call connect and request, and then call read(8192) repeatedly on the response object. On the post request, you call connect, putrequest, putheader, endheaders, then repeatedly send each buffer from the get request, then getresponse when you're done.
In fact, in Python 3.2+'s http.client (the equivalent of 2.x's httplib), HTTPClient.request doesn't have to be a string, it can be any iterable or any file-like object with read and fileno methods… which includes an response object. So, it's this simple:
import http.client
getconn = httplib.HTTPConnection('www.example.com')
getconn.request('GET', 'http://www.example.com/spam')
getresp = getconn.getresponse()
getconn = httplib.HTTPConnection('www.example.com')
getconn.request('POST', 'http://www.example.com/eggs', body=getresp)
getresp = getconn.getresponse()
… except, of course, that you probably want to craft appropriate headers (you can actually use urllib.request, the 3.x version of urllib2, to build a Request object and not send it…), and pull the host and port out of the URL with urlparse instead of hardcoding them, and you want to exhaust or at least check the response from the POST request, and so on. But this shows the hard part, and it's not hard.
Unfortunately, I don't think this works in 2.x.
Finally, if you're familiar with libcurl, there are at least three wrappers for it (including one that comes with the source distribution). I'm not sure whether to call libcurl higher-level or lower-level than urllib2, it's sort of on its own weird axis of complexity. :)
urllib2 may be too simple for this task. You might want to look into pycurl. I know it supports streaming.

Upload a file with a file.write interface

I've never seen this in Python and I'm interested if there is anything out there that will allow you to send files (for example HTTP PUT or POST) with a write interface? I've only ever seen a read interface where you pass a filename or file object (urllib, requests etc)
Of course I may have never seen this for good reason, which I'd also be interested to know.
While it might look like it would make sense at a high level, let's try to map the file interface to HTTP verbs:
file interface http
------------------------
read GET
HEAD
------------------------
write POST
PUT
PATCH
------------------------
? DELETE
OPTIONS
As you can see, there's no clear mapping between the file interface and the set of HTTP verbs that are required for any RESTful interface. Of course, you could likely hack an implementation together that only uses GET (read) and POST (write), but that will break the second you need to extend it to support any other HTTP verbs.
Edit based on comments:
I haven't tried it myself, but it seems like deep down (http/client.py), if data implements read, it will read it as such:
while 1:
datablock = data.read(blocksize)
if not datablock:
break
if encode:
datablock = datablock.encode("iso-8859-1")
self.sock.sendall(datablock)
Do note the possible performance hit in doing this though:
# If msg and message_body are sent in a single send() call,
# it will avoid performance problems caused by the interaction
# between delayed ack and the Nagle algorithm. However,
# there is no performance gain if the message is larger
# than MSS (and there is a memory penalty for the message
# copy).
So yes, you should be able to pass in a file object as the data param.

Check size of HTTP POST without saving to disk

Is there a way to check the size of the incoming POST in Pyramid, without saving the file to disk and using the os module?
You should be able to check the request.content_length. WSGI does not support streaming the request body so content length must be specified. If you ever access request.body, request.params or request.POST it will read the content and save it to disk.
The best way to handle this, however, is as close to the client as possible. Meaning if you are running behind a proxy of any sort, have that proxy reject requests that are too large. Once it gets to Python, something else may have already stored the request to disk.

Listening to the Output of a Web Application

I have been trying, in vain, to make a program that reads text out loud using the web application found here (http://www.ispeech.org/text.to.speech.demo.php). It is a demo text-to-speech program, that works very well, and is relatively fast. What I am trying to do is make a Python program that would input text to the application, then output the result. The result, in this case, would be sound. Is there any way in Python to do this, like, say, a library? And if not, is it possible to do this through any other means? I have looked into the iSpeech API (found here), but the only problem with it is that there is a limited number of free uses (I believe that it is 200). While this program is only meant to be used a couple of times, I would rather it be able to use the service more then 200 times. Also, if this solution is impractical, could anyone direct me towards another alternative?
# AKX I am currently using eSpeak, and it works well. It just, well, doesn't sound too good, and it is hard to tell at times what is being said.
If using iSpeech is not required, there's a decent (it's surely not as beautifully articulated as many commercial solutions) open-source text-to-speech solution available called eSpeak.
It's usable from the command line (subprocess with Python), or as a shared library. It seems there's also a Python wrapper (python-espeak) for it.
Hope this helps.
OK. I found a way to do it, seems to work fine. Thanks to everyone who helped! Here is the code I'm using:
from urllib import quote_plus
def speak(text):
import pydshow
words = text.split()
temp = []
stuff = []
while words:
temp.append(words.pop(0))
if len(temp) == 24:
stuff.append(' '.join(temp))
temp = []
stuff.append(' '.join(temp))
for i in stuff:
pydshow.PlayFileWait('http://api.ispeech.org/api/rest?apikey=8d1e2e5d3909929860aede288d6b974e&format=mp3&action=convert&voice=ukenglishmale&text='+quote_plus(i))
if __name__ == '__main__':
speak('Hello. This is a text-to speech test.')
I find this ideal because it DOES use the API, but it uses the API key that is used for the demo program. Therefore, it never runs out. The key is 8d1e2e5d3909929860aede288d6b974e.
You can actually test this at work without the program, by typing the following into your address bar:
http://api.ispeech.org/api/rest?apikey=8d1e2e5d3909929860aede288d6b974e&format=mp3&action=convert&voice=ukenglishmale&text=
Followed by the text you want to speak. You can also adjust the language, by changing, in this case, the ukenglishmale to something else that iSpeech offers. For example, ukenglishfemale. This will speak the same text, but in a feminine voice.
NOTE: Pydshow is my wrapper around DirectShow. You can use yours instead.
The flow of your application would be like this:
Client-side: User inputs text into form, and form submits a request to server
Server: may be python or whatever language/framework you want. Receives http request with text.
Server: Runs text-to-speech either with pure python library or by running a subprocess to a utility that can generate speech as a wav/mp3/aiff/etc
Server: Sends HTTP response back by streaming file with a mime type to Client
Client: Receives the http response and plays the content
Specifically about step 3...
I don't have any particular advise on the most articulate open source speech synthesizing software available, but I can say that it does not have to necessarily be pure python, or even python at all for that matter. Most of these packages have some form of a command line utility to take stdin or a file and produce an audio file as output. You would simply launch this utility as a subprocess to generate the file, and then stream the file back in your http response.
If you decide to make use of an existing web service that provides text-to-speech via an API (iSpeech), then step 3 would be replaced with making your own server-side http request out to iSpeech, receiving the response and pretty much forwarding that response back to the original client request, like a proxy. I would say the benefit is not having to maintain your own speech synthesis solution or getting better quality that you could from an open source... but the downside is that you probably will have a bit more latency in your response time since your server has to make its own external http request and download the data first.

Categories