I inherited a Python 3 project where we are trying to parse a 70 MB file with Python 3.5.6. I am using cgi.FieldStorage.
The file (named paketti.ipk) I'm trying to send:
kissakissakissa
kissakissakissa
kissakissakissa
Headers:
X-FILE: /tmp/nginx/0000000001
Host: localhost:8082
Connection: close
Content-Length: 21
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------264635460442698726183359332565
Origin: http://172.16.8.12
Referer: http://172.16.8.12/
DNT: 1
Sec-GPC: 1
Temporary file /tmp/nginx/0000000001:
-----------------------------264635460442698726183359332565
Content-Disposition: form-data; name="file"; filename="paketti.ipk"
Content-Type: application/octet-stream
kissakissakissa
kissakissakissa
kissakissakissa
-----------------------------264635460442698726183359332565--
Code:
class S(BaseHTTPRequestHandler):
    def do_POST(self):
        temp_filename = self.headers['X-FILE']
        temp_file_pointer = open(temp_filename, "rb")
        form = cgi.FieldStorage(
            fp=temp_file_pointer,
            headers=self.headers,
            environ={'REQUEST_METHOD': 'POST',
                     'CONTENT_TYPE': self.headers['Content-Type'],
                     'CONTENT_LENGTH': self.headers['Content-Length']},
        )
        actual_filename = form['file'].filename
        logging.info("ACTUAL FILENAME={}".format(actual_filename))
        open("/tmp/nginx/{}".format(actual_filename), "wb").write(form['file'].file.read())
        logging.info("FORM={}".format(form))
Now the strangest thing. The logs show:
INFO:root:ACTUAL FILENAME=paketti.ipk
INFO:root:FORM=FieldStorage(None, None, [FieldStorage('file', 'paketti.ipk', b'')])
Look at the /tmp/nginx directory:
root@am335x-evm:/tmp# ls -la /tmp/nginx/*
-rw------- 1 www www 286 May 18 20:48 /tmp/nginx/0000000001
-rw-r--r-- 1 root root 0 May 18 20:48 /tmp/nginx/paketti.ipk
So it is partially working, since the filename is parsed. But why does it not parse the file contents? What am I missing?
Is this even doable in Python, or should I just write a C utility? The file is 70 MB, and if I read it into memory, the OOM killer kills the python3 process (rightfully so, I'd say). But still, where do the file contents go?
Instead of the cgi module you need a multipart parser that can stream the data instead of reading all of it into RAM. AFAIK there is nothing suitable in the standard library, but this module could be of use: https://github.com/defnull/multipart
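If you go that route, uploading with that library might look like the sketch below. I'm going from memory on its interface (MultipartParser taking a stream and the boundary, and parts exposing filename and save_as), so treat the exact names as assumptions and check its documentation:

from multipart import MultipartParser

# The boundary comes from the Content-Type header, without the leading "--"
boundary = "---------------------------264635460442698726183359332565"
with open("/tmp/nginx/0000000001", "rb") as fp:
    for part in MultipartParser(fp, boundary):
        if part.filename:
            # save_as streams the part to disk instead of loading it into RAM
            part.save_as("/tmp/nginx/{}".format(part.filename))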
Alternatively, a DIY approach along these lines should work:
boundary = b"-----whatever"
# Begin and end lines (as per your example, I didn't check the RFCs)
begin = b"\r\n%b\r\n" % boundary
end = b"\r\n%b--\r\n" % boundary
# Prefer with blocks to open files so that they are also closed properly
with open(temp_filename, "rb") as f:
buf = bytearray()
# Search for the boundary
while begin not in buf:
block = f.read(4096)
if not block: raise ValueError("EOF without boundary begin")
buf = buf[-1024:] + block # Keep up to 5 KiB buffered
# Delete buffer contents until the end of the boundary
del buf[:buf.find(begin) + len(begin)]
# Copy data to another file (or do what you need to do with it)
with open("output.dat", "wb") as f2:
while end not in buf:
f2.write(buf[:-1024])
del buf[:-1024]
buf += f.read(4096)
if not buf: raise ValueError("EOF without boundary end")
f2.write(buf[:buf.find(end)])
This assumes the boundary lines are at most 1024 bytes long; you could use the actual boundary length instead for a precise implementation.
There were more issues at play than I first thought.
First, /tmp was mounted on tmpfs with a maximum size of 120 MB.
Secondly, my nginx.conf was problematic. I needed to comment out lines like these to clean it up:
#client_body_in_file_only on
#proxy_set_header X-FILE $request_body_file;
#proxy_set_body $request_body_file;
Then I needed to add these:
proxy_redirect off;           # Maybe not that important
proxy_request_buffering off;  # Very important
After this the code
form = cgi.FieldStorage(fp=self.rfile, headers=self.headers, environ={'REQUEST_METHOD': 'POST', 'CONTENT_TYPE': self.headers['Content-Type']})
started to "work". I'm monitoring /tmp usage and it uses first 70MB and then full 120 MB. The uploaded file is truncated to 50 MB.
So, when I am reading and writing parsed cgi.FieldStorage even in a loop of 4096 characters, the system reads it automatically FULLY to somewhere in /tmp once and then tries to write the final file and encounters "No space left on device" error.
To fix this I keep the nginx.conf additions and just read the self.rfile manually myself in a loop, totally reading ['Content-Length'] (anything other makes it go bonkers). This is able to save it cleanly with one pass; there is no more than single time 70MB usage of /tmp .
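For reference, the manual loop amounts to something like this minimal sketch (inside do_POST; the target path is made up, and this writes the raw request body with the multipart framing included, so the boundary stripping discussed above still applies):

length = int(self.headers['Content-Length'])
remaining = length
with open("/tmp/nginx/upload.bin", "wb") as out:  # hypothetical target path
    while remaining > 0:
        chunk = self.rfile.read(min(4096, remaining))
        if not chunk:
            break  # The client closed the connection early
        out.write(chunk)
        remaining -= len(chunk)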
I am currently working on a Python program that filters on some keywords in the "text" field of JSON data. The pipeline for my system is the following: .gz file --> open with gzip in mode 'rb' --> transform the b'' bytes into a str --> json.loads(str).
def gzworker(fullpath, condition):
    """Worker opens one .gz file"""
    print('Opening {}'.format(fullpath))
    buffer = []
    with gzip.open(fullpath, 'rb') as infile:
        for _line in infile:
            result = filter(json.loads(str(_line).split('|', 1)[1][:-5]), condition)
            if result:
                buffer.append(result)
    print('Closing {}'.format(fullpath))
    return buffer
with the filter function taking a decoded JSON object as its argument.
After running this code multiple times I realised that the reason it is not working is that some commas seem to disappear. Does anybody know if it is possible that some information is lost in the process?
Result of what I get using the previous method (invalid JSON); I get the same result if I use decode:
{"created_at":"Thu Apr 17 04:45:03 +0000 2014","id":456654551114735616,"id_str":"456654551114735616","text":"#cam_clay1 come visit us soon plz \\ud83d\\ude18","source":"\\u003ca href=\\"http:\\/\\/twitter.com\\/download\\/iphone\\" rel=\\"nofollow\\"\\u003eTwitter for iPhone\\u003c\\/a\\u003e","truncated":false,"in_reply_to_status_id":456654343781892098,"in_reply_to_status_id_str":"456654343781892098","in_reply_to_user_id":427007607,"in_reply_to_user_id_str":"427007607","in_reply_to_screen_name":"cam_clay1","user":{"id":335107310,"id_str":"335107310","name":"Roger Krick","screen_name":"roger_krick","location":"Atlanta GA","url":null,"description":"I pushed Regina George in front of the bus.","protected":false,"followers_count":772,"friends_count":235,"listed_count":3,"created_at":"Thu Jul 14 04:49:29 +0000 2011","favourites_count":7192,"utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"verified":false,"statuses_count":9518,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\\/\\/pbs.twimg.com\\/profile_background_images\\/378800000021719152\\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_background_images\\/378800000021719152\\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_tile":true,"profile_image_url":"http:\\/\\/pbs.twimg.com\\/profile_images\\/453031044393222144\\/7vIvMWvk_normal.jpeg","profile_image_url_https":"https:\\/\\/pbs.twimg.com\\/profile_images\\/453031044393222144\\/7vIvMWvk_normal.jpeg","profile_banner_url":"https:\\/\\/pbs.twimg.com\\/profile_banners\\/335107310\\/1352964715","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.75781394,-84.38479358]},"coordinates":{"type":"Point","coordinates":[-84.38479358,33.75781394]},"place":{"id":"8173485c72e78ca5","url":"https:\\/\\/api.twitter.com\\/1.1\\/geo\\/id\\/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[-84.5464728,33.647845],[-84.5464728,33.8868859],[-84.289385,33.8868859],[-84.289385,33.647845]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"cam_clay1","name":"Cameron Clay","id":427007607,"id_str":"427007607","indices":[0,10]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
What I should be getting (valid JSON):
{"created_at":"Thu Apr 17 04:45:03 +0000 2014","id":456654551114735616,"id_str":"456654551114735616","text":"#cam_clay1 come visit us soon plz \ud83d\ude18","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":456654343781892098,"in_reply_to_status_id_str":"456654343781892098","in_reply_to_user_id":427007607,"in_reply_to_user_id_str":"427007607","in_reply_to_screen_name":"cam_clay1","user":{"id":335107310,"id_str":"335107310","name":"Roger Krick","screen_name":"roger_krick","location":"Atlanta GA","url":null,"description":"I pushed Regina George in front of the bus.","protected":false,"followers_count":772,"friends_count":235,"listed_count":3,"created_at":"Thu Jul 14 04:49:29 +0000 2011","favourites_count":7192,"utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"verified":false,"statuses_count":9518,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/335107310\/1352964715","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.75781394,-84.38479358]},"coordinates":{"type":"Point","coordinates":[-84.38479358,33.75781394]},"place":{"id":"8173485c72e78ca5","url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[-84.5464728,33.647845],[-84.5464728,33.8868859],[-84.289385,33.8868859],[-84.289385,33.647845]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"cam_clay1","name":"Cameron Clay","id":427007607,"id_str":"427007607","indices":[0,10]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
You are decoding your bytes wrong:
str(_line)
That converts the bytes object to its repr() representation, which is useful for debugging but not for handling the data:
>>> 'Føo'.encode('utf8')
b'F\xc3\xb8o'
>>> str('Føo'.encode('utf8'))
"b'F\\xc3\\xb8o'"
Note the b' prefix, the ' suffix, and the escape sequences!
Decode bytes objects:
_line.decode('utf8')
I'm assuming that since this is JSON data, it is using the UTF-8 encoding (the JSON standard states that that is the default, the only other permitted options being UTF-16 and UTF-32).
Better yet, use an io.TextIOWrapper() object to handle the decoding for you.
Next, you appear to have reversed your condition and data: filter() takes the condition first and the data sequence second.
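For reference, a quick illustration of that argument order (on Python 3, wrap the lazy filter object in list() to materialize it):

>>> words = ['spam', 'ham', 'eggs']
>>> list(filter(lambda w: 'a' in w, words))
['spam', 'ham']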
Corrected code:
import gzip
import io
import json

def gzworker(fullpath, condition):
    """Worker opens one .gz file"""
    print('Opening {}'.format(fullpath))
    buffer = []
    with gzip.open(fullpath, 'rb') as infile:
        decoded = io.TextIOWrapper(infile, encoding='utf8')
        for line in decoded:
            json_data = line.split('|', 1)[1][:-4]
            result = filter(condition, json.loads(json_data))
            if result:
                buffer.append(result)
    print('Closing {}'.format(fullpath))
    return buffer
I adjusted your slicing operation, assuming you previously sliced off the ' character introduced by the str() call.
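One caveat in case this runs on Python 3: there filter() returns a lazy iterator, which is always truthy, so the "if result:" test only behaves as intended on Python 2. On Python 3, materialize the result first:

result = list(filter(condition, json.loads(json_data)))
if result:  # an empty list is falsy, so the check works as intended
    buffer.append(result)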
I use the requests module in Python 2.7 to POST a big chunk of data to a service I can't change. Since the data is mostly text, it is large but would compress quite well. The server would accept gzip or deflate encoding, but I do not know how to instruct requests to do a POST and encode the data automatically.
Is there a minimal example that shows how this is possible?
# Works if backend supports gzip
additional_headers['content-encoding'] = 'gzip'
request_body = zlib.compress(json.dumps(post_data))
r = requests.post('http://post.example.url', data=request_body, headers=additional_headers)
I've tested the solution proposed by Robᵩ with some modifications and it works.
Pseudocode (sorry, I've extrapolated this from my code, so I had to cut out some parts and haven't tested it, but you get the idea):
additional_headers['content-encoding'] = 'gzip'
s = StringIO.StringIO()
g = gzip.GzipFile(fileobj=s, mode='w')
g.write(json_body)
g.close()
gzipped_body = s.getvalue()
request_body = gzipped_body
r = requests.post(endpoint_url, data=request_body, headers=additional_headers)
For Python 3:
import gzip
from io import BytesIO

import requests

def zip_payload(payload: str) -> bytes:
    btsio = BytesIO()
    g = gzip.GzipFile(fileobj=btsio, mode='w')
    g.write(bytes(payload, 'utf8'))
    g.close()
    return btsio.getvalue()

headers = {
    'Content-Encoding': 'gzip'
}
zipped_payload = zip_payload(payload)
requests.post(url, zipped_payload, headers=headers)
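A quick sanity check of the helper above, round-tripping a payload through gzip.decompress:

assert gzip.decompress(zip_payload('{"key": "value"}')) == b'{"key": "value"}'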
I needed my posts to be chunked, since I had several very large files being uploaded in parallel. Here is a solution I came up with.
import requests
import zlib

def chunked_read_and_compress(file_to_send, zlib_obj, chunk_size):
    """Generator that reads a file in chunks and compresses them"""
    compression_incomplete = True
    with open(file_to_send, 'rb') as f:
        # The zlib might not give us any data back, so we have nothing to
        # yield; just run another loop until we get data to yield.
        while compression_incomplete:
            plain_data = f.read(chunk_size)
            if plain_data:
                compressed_data = zlib_obj.compress(plain_data)
            else:
                compressed_data = zlib_obj.flush()
                compression_incomplete = False
            if compressed_data:
                yield compressed_data

def post_file_gzipped(url, file_to_send, chunk_size=5*1024*1024, compress_level=6, headers={}, requests_kwargs={}):
    """Post a file to a url, content-encoded gzipped and chunked (for large files)"""
    headers_to_send = {'Content-Encoding': 'gzip'}
    headers_to_send.update(headers)
    # wbits=31 selects the gzip container format (16 + MAX_WBITS)
    zlib_obj = zlib.compressobj(compress_level, zlib.DEFLATED, 31)
    return requests.post(url, data=chunked_read_and_compress(file_to_send, zlib_obj, chunk_size),
                         headers=headers_to_send, **requests_kwargs)

resp = post_file_gzipped('http://httpbin.org/post', 'somefile')
resp.raise_for_status()
I can't get this to work, but you might be able to insert the gzip data into a prepared request:
# UNPROVEN
import gzip
import StringIO

import requests

r = requests.Request('POST', 'http://httpbin.org/post', data={"hello": "goodbye"})
p = r.prepare()
s = StringIO.StringIO()
g = gzip.GzipFile(fileobj=s, mode='w')
g.write(p.body)
g.close()
p.body = s.getvalue()
p.headers['content-encoding'] = 'gzip'
p.headers['content-length'] = str(len(p.body))  # Not sure about this
r = requests.Session().send(p)
The accepted answer is probably wrong due to incorrect or missing headers:
additional_headers['content-encoding'] = 'gzip'
request_body = zlib.compress(json.dumps(post_data))
Using the zlib module's compressobj function, which provides the wbits argument to specify the container format, should work.
The default value is MAX_WBITS = 15, which selects the zlib header format; this is correct for Content-Encoding: deflate.
For the one-shot compress function this argument is not available, and unfortunately the documentation does not mention which header (if any) is used.
For Content-Encoding: gzip, wbits should be 16 plus a value between 9 and 15, so 16 + zlib.MAX_WBITS is a good choice.
I checked how urllib3 decodes the response in these two cases: it implements a trial-and-error mechanism for deflate (it tries the raw and zlib header formats). That could explain why some people had problems with the accepted answer's solution that others didn't have.
tl;dr
gzip
additional_headers['Content-Encoding'] = 'gzip'
compress = zlib.compressobj(wbits=16+zlib.MAX_WBITS)
body = compress.compress(data) + compress.flush()
deflate
additional_headers['Content-Encoding'] = 'deflate'
compress = zlib.compressobj()
body = compress.compress(data) + compress.flush()
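Putting the gzip variant together with requests, a minimal sketch (the URL is a placeholder):

import json
import zlib

import requests

data = json.dumps({'key': 'value'}).encode('utf8')
compress = zlib.compressobj(wbits=16 + zlib.MAX_WBITS)  # gzip container format
body = compress.compress(data) + compress.flush()
r = requests.post('http://post.example.url', data=body,
                  headers={'Content-Encoding': 'gzip',
                           'Content-Type': 'application/json'})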
I am using the code below to save an HTML file with a timestamp in its name:
import contextlib
import datetime
import urllib2
import lxml.html
import os
import os.path

timestamp = ''
filename = ''
for dirs, subdirs, files in os.walk("/home/test/Desktop/"):
    for f in files:
        if "_timestampedfile.html" in f.lower():
            timestamp = f.split('_')[0]
            filename = f
            break
if timestamp is '':
    timestamp = datetime.datetime.now()
with contextlib.closing(urllib2.urlopen(urllib2.Request(
        "http://www.google.com",
        headers={"If-Modified-Since": timestamp}))) as u:
    if u.getcode() != 304:
        myfile = "/home/test/Desktop/" + str(datetime.datetime.now()) + "_timestampedfile.html"
        file(myfile, "w").write(urllib2.urlopen("http://www.google.com").read())
        if os.path.isfile("/home/test/Desktop/" + filename):
            os.remove("/home/test/Desktop/" + filename)
        html = lxml.html.parse(myfile)
    else:
        html = lxml.html.parse("/home/test/Desktop/" + timestamp + "_timestampedfile.html")
links = html.xpath("//a/@href")
print u.getcode()
Every time I run this code I get status code 200 despite the If-Modified-Since header. Where is my mistake? My goal is to save and reuse an HTML file; if it has been modified since it was last accessed, the HTML file should be overwritten.
The problem is that If-Modified-Since is supposed to be a formatted date string:
If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
but you're passing in a datetime object.
Try something like this:
timestamp = time.time()
...
time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(timestamp))
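For example, here is a sketch of building such a request from a saved file's mtime (os.path.getmtime returns a Unix timestamp suitable for time.gmtime); the path is a placeholder:

import os.path
import time
import urllib2

mtime = os.path.getmtime("/home/test/Desktop/saved.html")  # hypothetical file
req = urllib2.Request(
    "http://www.stackoverflow.com/",
    headers={"If-Modified-Since":
             time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime(mtime))})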
The second reason your code isn't working as you expect:
http://www.google.com/ does not seem to honor If-Modified-Since. That's allowed per the RFC, and they may have various reasons for choosing that behavior.
c) If the variant has not been modified since a valid If-Modified-Since date, the server SHOULD return a 304 (Not Modified) response.
If you try http://www.stackoverflow.com/, for example, you'll see a 304. (I just tried it.)
I'm generating some PDFs using ReportLab in Django. I followed and experimented with the answer given to this question, and realised that the double quotes therein don't make sense:
response['Content-Disposition'] = 'inline; filename=constant_"%s_%s".pdf'\
% ('foo','bar')
gives filename constant_-foo_bar-.pdf
response['Content-Disposition'] = 'inline; filename=constant_%s_%s.pdf' \
% ('foo','bar')
gives filename constant_foo_bar.pdf
Why is this? Is it just to do with slug-esque sanitisation for filesystems?
It seems from the research in this question that it's actually the browser doing the encoding/escaping. I used a command-line HEAD request to confirm that Django itself does not escape these headers. First, I set up a minimal test view:
# views.py
def index(request):
    response = render(request, 'template.html')
    response['Content-Disposition'] = 'inline; filename=constant"a_b".html'
    return response
then ran:
carl#chaffinch:~$ HEAD http://localhost:8003
200 OK
Date: Thu, 16 Aug 2012 19:28:54 GMT
Server: WSGIServer/0.1 Python/2.7.3
Vary: Cookie
Content-Type: text/html; charset=utf-8
Client-Date: Thu, 16 Aug 2012 19:28:54 GMT
Client-Peer: 127.0.0.1:8003
Client-Response-Num: 1
Content-Disposition: inline; filename=constant"a_b".html
Check out the header: filename=constant"a_b".html. The quotes are still there!
Python does not convert double quotes to hyphens in filenames:
>>> with open('constant_"%s_%s".pdf' % ('foo', 'bar'), 'w'): pass
$ ls
...
constant_"foo_bar".pdf
...
Probably it's Django that won't allow you to use such strange names.
Anyway, I'd recommend using only the following characters in filenames, to avoid portability issues:
letters [a-z][A-Z]
digits [0-9]
hyphen (-), underscore (_), plus (+)
Note: I've excluded whitespace from the list, because a lot of scripts don't use proper quoting and break with such filenames.
If you restrict yourself to this set of characters, you probably won't ever have problems with pathnames. Obviously other people or other programs may still not follow this "guideline", so you shouldn't assume the convention is shared by paths you obtain from users or other external sources.
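A minimal sanitizer along those lines (a sketch, not a standard function; dots are kept so extensions survive):

import re

def safe_filename(name):
    # Replace anything outside letters, digits, dot, underscore, plus, hyphen
    return re.sub(r'[^A-Za-z0-9._+-]', '_', name)

print(safe_filename('constant_"foo_bar".pdf'))  # constant__foo_bar_.pdf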
Your usage is slightly incorrect. You would want the quotes around the entire filename in order to account for spaces, etc.
change:
response['Content-Disposition'] = 'inline; filename=constant_"%s_%s".pdf'\
% ('foo','bar')
to:
response['Content-Disposition'] = 'inline; filename="constant_%s_%s.pdf"'\
% ('foo','bar')