Let me start off by saying that I'm using the twisted.web framework. Twisted.web's file uploading didn't work the way I wanted (it only included the file data, not any other information), cgi.parse_multipart doesn't work the way I want either (same problem; twisted.web uses this function), cgi.FieldStorage didn't work (because I'm getting the POST data through Twisted, not a CGI interface - as far as I can tell, FieldStorage tries to read the request from stdin), and twisted.web2 didn't work for me because the use of Deferred confused and infuriated me (too complicated for what I want).
That being said, I decided to try and just parse the HTTP request myself.
Using Chrome, the HTTP request is formed like this:
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="upload_file_nonce"

11b03b61-9252-11df-a357-00266c608adb
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="file"; filename="login.html"
Content-Type: text/html

<!DOCTYPE html>
<html>
<head>
...
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="file"; filename=""

------WebKitFormBoundary7fouZ8mEjlCe92pq--
Is this always how it will be formed? I'm parsing it with regular expressions, like so (pardon the wall of code):
(Note: I snipped out most of the code to show only what I thought was relevant - the regular expressions. This is the __init__ method, the only method so far, of an Uploads class I built. The full code can be seen in the revision history.)
if line == "--{0}--".format(boundary):
    finished = True
if in_header == True and not line:
    in_header = False
    if 'type' not in current_file:
        ignore_current_file = True
if in_header == True:
    m = re.match(
        "Content-Disposition: form-data; name=\"(.*?)\"; filename=\"(.*?)\"$", line)
    if m:
        input_name, current_file['filename'] = m.group(1), m.group(2)
    m = re.match("Content-Type: (.*)$", line)
    if m:
        current_file['type'] = m.group(1)
else:
    if 'data' not in current_file:
        current_file['data'] = line
    else:
        current_file['data'] += line
You can see that I start a new "file" dict whenever a boundary is reached. I set in_header to True to indicate that I'm parsing headers. When I reach a blank line, I switch it to False - but not before checking whether a Content-Type was set for that form value; if not, I set ignore_current_file, since I'm only looking for file uploads.
I know I should be using a library, but I'm sick to death of reading documentation, trying to get different solutions to work in my project, and still having the code look reasonable. I just want to get past this part -- and if parsing an HTTP POST with file uploads is this simple, then I shall stick with that.
Note: this code works perfectly for now; I'm just wondering whether it will choke on requests from certain browsers.
My solution to this problem was to parse the content with cgi.FieldStorage, like this:
class Root(Resource):
    def render_POST(self, request):
        self.headers = request.getAllHeaders()
        # For the parsing part look at [PyMOTW by Doug Hellmann][1]
        img = cgi.FieldStorage(
            fp=request.content,
            headers=self.headers,
            environ={'REQUEST_METHOD': 'POST',
                     'CONTENT_TYPE': self.headers['content-type'],
                     }
        )
        print img["upl_file"].name, img["upl_file"].filename,
        print img["upl_file"].type, img["upl_file"].type
        out = open(img["upl_file"].filename, 'wb')
        out.write(img["upl_file"].value)
        out.close()
        request.redirect('/tests')
        return ''
You're trying to avoid reading documentation, but I think the best advice is to actually read:
RFC 2388, "Returning Values from Forms: multipart/form-data"
RFC 1867, "Form-based File Upload in HTML"
to make sure you don't miss any cases. An easier route might be to use the poster library.
The Content-Disposition header has no defined order for its fields, and it may contain more fields than just the filename. So your match for filename may fail - there may not even be a filename!
See RFC 2183 (edit: that's for mail; see RFC 1806, RFC 2616 and maybe more for HTTP).
Also, in these kinds of regexps I would suggest replacing every space with \s*, and not relying on character case.
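Since a robust parser is the real fix here, a minimal sketch using the standard library's email package (the boundary, field name, and content below are made up for illustration - this is not the asker's data):

```python
from email.parser import BytesParser
from email.policy import default

# A toy multipart/form-data body, in the same shape as the Chrome example.
boundary = "----WebKitFormBoundary7fouZ8mEjlCe92pq"
body = (
    "--{b}\r\n"
    'Content-Disposition: form-data; name="file"; filename="login.html"\r\n'
    "Content-Type: text/html\r\n"
    "\r\n"
    "<!DOCTYPE html>\r\n"
    "--{b}--\r\n"
).format(b=boundary).encode("ascii")

# The email parser wants the Content-Type header in front of the body.
headers = (
    'Content-Type: multipart/form-data; boundary="{}"\r\n\r\n'.format(boundary)
).encode("ascii")

msg = BytesParser(policy=default).parsebytes(headers + body)
for part in msg.iter_parts():
    # get_filename() handles quoting, field order and encoding for us.
    print(part.get_filename(), part.get_content_type())
```

This sidesteps the field-order and quoting pitfalls that a hand-rolled regex has to handle itself.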
import requests
nexmokey = 'mykey'
nexmosec = 'mysecretkey'
nexmoBal = 'https://rest.nexmo.com/account/get-balance?api_key={}&api_secret={}'.format(nexmokey,nexmosec)
rr = requests.get(nexmoBal)
print(rr.url)
I would like to send a request to post at
https://rest.nexmo.com/account/get-balance?api_key=mykey&api_secret=mysecretkey
but why does %0D appear?
https://rest.nexmo.com/account/get-balance?api_key=mykey%0D&api_secret=mysecretkey%0D
requests.get expects parameters like api_secret=my_secret to be provided through the params argument, not as part of the URL; they are then URL-encoded for you.
Use this:
nexmoBal = 'https://rest.nexmo.com/account/get-balance'
rr = requests.get(nexmoBal, params={'api_key': nexmokey, 'api_secret': nexmosec})
The fact that %0D ends up in there indicates you have a character 13 (0D hexadecimal) in there, which is a carriage return (part of the end-of-line sequence on Windows systems) - probably because you are reading the key and secret from some file and didn't include that part in the example code.
Also, note that you mention you want to post, but you're calling .get().
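A quick sketch of what's happening (the key value here is made up): a trailing carriage return survives URL encoding as %0D, and .strip() removes it:

```python
from urllib.parse import urlencode

# Simulate an API key read from a Windows-edited file without stripping the newline.
key_from_file = "mykey\r"

print(urlencode({"api_key": key_from_file}))          # api_key=mykey%0D
print(urlencode({"api_key": key_from_file.strip()}))  # api_key=mykey
```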
I want to get the HTML source code of a given URL. I tried using this:
import urllib2
url = 'http://mp3.zing.vn' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
But the returned data is not in HTML format for some pages. I tried another link, http://phuctrancs.info, and it works (as that page is plain HTML). I have also tried using the BeautifulSoup library, but it didn't work either. Any suggestions?
You're getting the HTML you expect, but it's compressed. I tried this URL by hand and got back a binary mess with this in the headers:
Content-Encoding: gzip
I saved the response body to a file and was able to gunzip it on the command line. You should also be able to decompress it in your program with the functions in the standard library's zlib module.
Update for anyone having trouble with zlib.decompress...
The compressed data you will get (or at least that I got in Python 2.6) apparently has a "gzip header and trailer" like you'd expect in *.gz files, while zlib.decompress expects a "zlib wrapper"... probably. I kept getting an unhelpful zlib.error exception:
Traceback (most recent call last):
  File "./fixme.py", line 32, in <module>
    text = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
The solution is entirely undocumented in the Python standard library, but can be found in Greg Hewgill's answer to a question about gzip streams: You have to feed zlib.decompress a wbits argument, created by adding a magic number to an undocumented module-level constant <grumble, mutter...>:
text = zlib.decompress(data, 16 + zlib.MAX_WBITS)
If you feel this isn't obfuscated enough, note that a 32 here would be every bit as magical as the 16.
The only hint of this is buried in the original zlib's manual, under the deflateInit2 function:
windowBits can also be greater than 15 for optional gzip decoding. Add 16 to windowBits to write a simple gzip header and trailer around the compressed data instead of a zlib wrapper.
...and the inflateInit2 function:
windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format [...]
Note that the zlib.decompress docs explicitly tell you that you can't do this:
The default value is therefore the highest value, 15.
But this is... the opposite of true.
<fume, curse, rant...>
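Putting it together, a self-contained round trip (my own sketch, not part of the original answer) showing the wbits values in action:

```python
import gzip
import zlib

original = b"<html>hello</html>"
gzipped = gzip.compress(original)  # what a Content-Encoding: gzip body looks like

# Plain zlib.decompress() rejects the gzip header and trailer:
try:
    zlib.decompress(gzipped)
except zlib.error as err:
    print("zlib.error:", err)

# 16 + MAX_WBITS expects a gzip wrapper;
# 32 + MAX_WBITS auto-detects either zlib or gzip.
print(zlib.decompress(gzipped, 16 + zlib.MAX_WBITS) == original)  # True
print(zlib.decompress(gzipped, 32 + zlib.MAX_WBITS) == original)  # True
```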
Have you looked into the response code? urllib2 may need you to handle the response, such as a 301 redirect and so on.
You should check the response code, like this:
data = usock.read()
if usock.getcode() != 200:
    print "something unexpected"
Update:
If the response contains non-localized or non-readable text, you might need to specify the request character set in the request header.
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
opener.addheaders = [('Content-Type', 'text/html; charset=UTF-8')]
urllib2.install_opener(opener)
PS: untested.
Use Beautiful Soup from Python:
import requests
from bs4 import BeautifulSoup

url = 'http://www.google.com'
r = requests.get(url)
b = BeautifulSoup(r.text)
b will contain all the HTML tags and also provides iterators to traverse elements/tags. To learn more, see https://pypi.python.org/pypi/beautifulsoup4/4.3.2
python-requests appears to support custom collections as a list/mapping of files to upload.
This in turn should allow me to add custom header fields with each file being uploaded within one multipart upload.
How can I actually do that?
python-requests appears to support custom collections as a list/mapping of files to upload.
This in turn should allow me to add custom header fields with each file being uploaded within one multipart upload.
Nope. Each file in the list/mapping must be a file object (or contents string), a 2-tuple of a filename and file object (or contents), or a 3-tuple of a filename, file object (or contents) and file type. Anything else is illegal.
Unless #1640 gets accepted upstream, in which case all you'll have to do is use a 4-tuple of a filename, file object (or contents), file type, and header dictionary.
The right thing to do at this point is probably to use a different library. For example, if you use urllib3 directly instead of through the requests wrappers, it makes what you want to do reasonably easy. You just have to deal with all the extra verbosity of using urllib3 instead of requests.
And meanwhile, you can file a feature request against requests and an easier way may get added in the future.
But it's a bit frustrating that the functionality you need is all there under the covers, you just can't get to it.
It looks like doing things cleanly would be a nightmare. But it's all pretty simple code, and we can hack our way down to it pretty easily, so let's do that.
Take a look at requests.PreparedRequest.prepare_body. It expects each file in files to be a tuple of filename, contents or file object, and optionally content-type. It basically just reads in any file objects to convert them to contents, and passes everything straight through to urllib3.filepost.encode_multipart_formdata. So, unless we want to replace this code, we're going to need to smuggle the headers in with one of these values. Let's do that by passing (filename, contents, (content_type, headers_dict)). So, requests itself is unchanged.
What about that urllib3.filepost.encode_multipart_formdata that it calls? As you can see, if you pass it tuples for the files, it calls a function called iter_field_objects, which ultimately calls urllib3.fields.RequestField.from_tuples on each one. But if you look at the from_tuples alternate constructor, it says it's there to handle an "old-style" way of constructing RequestField objects, and the normal constructor is there for a "new-style" way, which actually does let you pass in headers.
So, all we need to do is monkeypatch iter_field_objects, replacing its last line with one that uses the new-style way, and we should be done. Let's try:
import requests
import requests.packages.urllib3
from requests.packages.urllib3.fields import RequestField, guess_content_type
import six

old_iter_field_objects = requests.packages.urllib3.filepost.iter_field_objects

def iter_field_objects(fields):
    if isinstance(fields, dict):
        i = six.iteritems(fields)
    else:
        i = iter(fields)
    for field in i:
        if isinstance(field, RequestField):
            yield field
        else:
            name, value = field
            filename = value[0]
            data = value[1]
            content_type = value[2] if len(value) > 2 else guess_content_type(filename)
            headers = None
            if isinstance(content_type, (tuple, list)):
                content_type, headers = content_type
            rf = RequestField(name, data, filename, headers)
            rf.make_multipart(content_type=content_type)
            yield rf

requests.packages.urllib3.filepost.iter_field_objects = iter_field_objects
And now:
>>> files = {'file': ('foo.txt', 'foo\ncontents\n'),
... 'file2': ('bar.txt', 'bar contents', 'text/plain'),
... 'file3': ('baz.txt', 'baz contents', ('text/plain', {'header': 'value'}))}
>>> r = requests.Request('POST', 'http://example.com', files=files)
>>> print r.prepare().body
--1ee28922d26146e7a2ee201e5bf22c44
Content-Disposition: form-data; name="file3"; filename="baz.txt"
Content-Type: text/plain
header: value

baz contents
--1ee28922d26146e7a2ee201e5bf22c44
Content-Disposition: form-data; name="file2"; filename="bar.txt"
Content-Type: text/plain

bar contents
--1ee28922d26146e7a2ee201e5bf22c44
Content-Disposition: form-data; name="file"; filename="foo.txt"
Content-Type: text/plain

foo
Tada!
Note that you need relatively up-to-date requests/urllib3 for this to work. I think requests 2.0.0 is sufficient.
I have the following code for managing file download through django.
def serve_file(request, id):
    file = models.X.objects.get(id=id).file  # FileField
    file.open('rb')
    wrapper = FileWrapper(file)
    mt = mimetypes.guess_type(file.name)[0]
    response = HttpResponse(wrapper, content_type=mt)
    import unicodedata, os.path
    filename = unicodedata.normalize('NFKD', os.path.basename(file.name)).encode("utf8", 'ignore')
    filename = filename.replace(' ', '-')  # Avoid browser ignoring any char after a space
    response['Content-Length'] = file.size
    response['Content-Disposition'] = 'attachment; filename={0}'.format(filename)
    #print response
    return response
Unfortunately, my browser gets an empty file when downloading.
The printed response seems correct:
Content-Length: 3906
Content-Type: text/plain
Content-Disposition: attachment; filename=toto.txt

blah blah ....
I have similar code running ok. I don't see what can be the problem. Any idea?
PS: I have tested the solution proposed here and get the same behavior
Update:
Replacing wrapper = FileWrapper(file) by wrapper = file.read() seems to fix the problem
Update: If I comment out the print response, I get a similar issue: the file is empty. Only difference: FF detects a 20-byte size (the file is bigger than this).
A file object is an iterator: it can be read only once before being exhausted. Then you have to make a new one, or use a method to go back to the beginning of the object (e.g. seek()).
read() returns a string, which can be read multiple times without any problem, which is why it solves your issue.
So just make sure that if you use a file-like object, you don't read it twice in a row. E.g. don't print it and then return it.
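A quick illustration of this exhaustion behaviour, using an in-memory file object:

```python
import io

f = io.BytesIO(b"blah blah")

print(f.read())  # b'blah blah'
print(f.read())  # b'' - the stream is exhausted

f.seek(0)        # rewind to the beginning
print(f.read())  # b'blah blah' again
```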
From the Django documentation:
FieldFile.open(mode='rb') - Behaves like the standard Python open() method and opens the file associated with this instance in the mode specified by mode.
If it works like Python's open(), then it should return a file object, which should be used like this:
f = file.open('rb')
wrapper = FileWrapper(f)
I'm implementing a REST-style interface and would like to be able to create (via upload) files via an HTTP PUT request. I would like to create either a TemporaryUploadedFile or an InMemoryUploadedFile which I can then pass to my existing FileField and .save() on the object that is part of the model, thereby storing the file.
I'm not quite sure about how to handle the file upload part. Specifically, this being a put request, I do not have access to request.FILES since it does not exist in a PUT request.
So, some questions:
Can I leverage existing functionality in the HttpRequest class, specifically the part that handles file uploads? I know a direct PUT is not a multipart MIME request, so I don't think so, but it is worth asking.
How can I deduce the mime type of what is being sent? If I've got it right, a PUT body is simply the file without prelude. Do I therefore require that the user specify the mime type in their headers?
How do I extend this to large amounts of data? I don't want to read it all into memory since that is highly inefficient. Ideally I'd do what TemporaryUploadFile and related code does - write it part at a time?
I've taken a look at this code sample which tricks Django into handling PUT as a POST request. If I've got it right though, it'll only handle form encoded data. This is REST, so the best solution would be to not assume form encoded data will exist. However, I'm happy to hear appropriate advice on using mime (not multipart) somehow (but the upload should only contain a single file).
Django 1.3 is acceptable. So I can either do something with request.raw_post_data or request.read() (or alternatively some other better method of access). Any ideas?
Django 1.3 is acceptable. So I can either do something with request.raw_post_data or request.read() (or alternatively some other better method of access). Any ideas?
You don't want to be touching request.raw_post_data - that implies reading the entire request body into memory, which if you're talking about file uploads might be a very large amount, so request.read() is the way to go. You can do this with Django <= 1.2 as well, but it means digging around in HttpRequest to figure out the right way to use the private interfaces, and it's a real drag to then ensure your code is also compatible with Django >= 1.3.
I'd suggest that what you want to do is replicate the file-upload parts of the existing MultiPartParser class:
Retrieve the upload handlers from request.upload_handlers (which by default will be MemoryFileUploadHandler & TemporaryFileUploadHandler).
Determine the request's content length (search for Content-Length in HttpRequest or MultiPartParser to see the right way to do this).
Determine the uploaded file's filename, either by letting the client specify this using the last path part of the URL, or by letting the client specify it in the "filename=" part of the Content-Disposition header.
For each handler, call handler.new_file with the relevant args (mocking up a field name).
Read the request body in chunks using request.read(), calling handler.receive_data_chunk() for each chunk.
For each handler, call handler.file_complete(), and if it returns a value, that's the uploaded file.
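The chunked-read step above can be sketched with an in-memory stream standing in for the request and a toy handler (the ToyHandler class is my own stand-in, not Django's actual upload-handler API; the chunk size mirrors Django's default 64 KiB):

```python
import io

CHUNK_SIZE = 64 * 1024  # Django's upload handlers default to 64 KiB chunks

class ToyHandler:
    """Accumulates chunks, roughly in the role of receive_data_chunk()."""
    def __init__(self):
        self.buf = b""

    def receive_data_chunk(self, raw_data, start):
        self.buf += raw_data

# An in-memory stream plays the part of request.read() for this sketch.
request_body = io.BytesIO(b"x" * 100000)
handler = ToyHandler()

counter = 0
while True:
    chunk = request_body.read(CHUNK_SIZE)
    if not chunk:
        break
    handler.receive_data_chunk(chunk, counter)
    counter += len(chunk)

print(counter)  # 100000
```

The counter plays the same role as the per-handler counters in the real code: it tracks how many bytes the handler has consumed so far.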
How can I deduce the mime type of what is being sent? If I've got it right, a PUT body is simply the file without prelude. Do I therefore require that the user specify the mime type in their headers?
Either let the client specify it in the Content-Type header, or use Python's mimetypes module to guess the media type.
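For the guessing route, a short example with the standard library's mimetypes module (the filenames are just examples):

```python
import mimetypes

# guess_type() returns (type, encoding); encoding is e.g. 'gzip' for .gz files.
print(mimetypes.guess_type("photo.jpeg"))      # ('image/jpeg', None)
print(mimetypes.guess_type("archive.tar.gz"))  # ('application/x-tar', 'gzip')
```

Note that this guesses from the filename alone; it never inspects the file contents.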
I'd be interested to find out how you get on with this - it's something I've been meaning to look into myself, be great if you could comment to let me know how it goes!
Edit by Ninefingers as requested, this is what I did and is based entirely on the above and the django source.
upload_handlers = request.upload_handlers
content_type = str(request.META.get('CONTENT_TYPE', ""))
content_length = int(request.META.get('CONTENT_LENGTH', 0))

if content_type == "":
    return HttpResponse(status=400)
if content_length == 0:
    # both returned 0
    return HttpResponse(status=400)

# split the charset off before stripping the parameters from content_type
try:
    charset = content_type.split(";")[1].strip()
except IndexError:
    charset = ""
content_type = content_type.split(";")[0].strip()

# we can get the file name via the path, we don't actually
file_name = path.split("/")[-1:][0]
field_name = file_name
Since I'm defining the API here, cross-browser support isn't a concern. As far as my protocol is concerned, not supplying the correct information is a broken request. I'm in two minds as to whether I want, say, image/jpeg; charset=binary or if I'm going to allow non-existent charsets. In any case, I'm making setting Content-Type correctly a client-side responsibility.
Similarly, for my protocol, the file name is passed in. I'm not sure what the field_name parameter is for and the source didn't give many clues.
What happens below is actually much simpler than it looks. You ask each handler if it will handle the raw input. As the author of the above states, you've got MemoryFileUploadHandler & TemporaryFileUploadHandler by default. Well, it turns out that MemoryFileUploadHandler, when asked to create a new_file, decides whether it will handle the file or not (based on various settings). If it decides it's going to, it throws an exception; otherwise it won't create the file and lets another handler take over.
I'm not sure what the purpose of counters was, but I've kept it from the source. The rest should be straightforward.
counters = [0] * len(upload_handlers)

for handler in upload_handlers:
    result = handler.handle_raw_input("", request.META, content_length, "", "")

for handler in upload_handlers:
    try:
        handler.new_file(field_name, file_name,
                         content_type, content_length, charset)
    except StopFutureHandlers:
        break

for i, handler in enumerate(upload_handlers):
    while True:
        chunk = request.read(handler.chunk_size)
        if chunk:
            handler.receive_data_chunk(chunk, counters[i])
            counters[i] += len(chunk)
        else:
            # no chunk
            break

for i, handler in enumerate(upload_handlers):
    file_obj = handler.file_complete(counters[i])
    if not file_obj:
        # some indication this didn't work?
        return HttpResponse(status=500)
    else:
        pass  # handle file obj!
Newer Django versions make handling this a lot easier, thanks to https://gist.github.com/g00fy-/1161423
I modified the given solution like this:
if request.content_type.startswith('multipart'):
    put, files = request.parse_file_upload(request.META, request)
    request.FILES.update(files)
    request.PUT = put.dict()
else:
    request.PUT = QueryDict(request.body).dict()
to be able to access files and other data like in POST. You can remove the calls to .dict() if you want your data to be read-only.
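For the URL-encoded else-branch above, Django's QueryDict behaves much like the standard library's parse_qs; a stdlib-only illustration (the field names and values are made up):

```python
from urllib.parse import parse_qs

# A PUT body in application/x-www-form-urlencoded form.
body = "name=report.pdf&overwrite=true"

# Values come back as lists, since a field may repeat.
print(parse_qs(body))  # {'name': ['report.pdf'], 'overwrite': ['true']}
```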
I hit this problem while working with Django 2.2, and was looking for something that just worked for uploading a file via PUT request.
from django.http import HttpResponse, QueryDict
from django.http.multipartparser import MultiValueDict
from django.core.files.uploadhandler import (
    SkipFile,
    StopFutureHandlers,
    StopUpload,
)


class PutUploadMiddleware(object):
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        method = request.META.get("REQUEST_METHOD", "").upper()
        if method == "PUT":
            self.handle_PUT(request)
        return self.get_response(request)

    def handle_PUT(self, request):
        content_type = str(request.META.get("CONTENT_TYPE", ""))
        content_length = int(request.META.get("CONTENT_LENGTH", 0))
        file_name = request.path.split("/")[-1:][0]
        field_name = file_name
        content_type_extra = None

        if content_type == "":
            return HttpResponse(status=400)
        if content_length == 0:
            # both returned 0
            return HttpResponse(status=400)

        # split the charset off before stripping the parameters from content_type
        try:
            charset = content_type.split(";")[1].strip()
        except IndexError:
            charset = ""
        content_type = content_type.split(";")[0].strip()

        upload_handlers = request.upload_handlers
        for handler in upload_handlers:
            result = handler.handle_raw_input(
                request.body,
                request.META,
                content_length,
                boundary=None,
                encoding=None,
            )

        counters = [0] * len(upload_handlers)
        for handler in upload_handlers:
            try:
                handler.new_file(
                    field_name,
                    file_name,
                    content_type,
                    content_length,
                    charset,
                    content_type_extra,
                )
            except StopFutureHandlers:
                break

        for chunk in request:
            for i, handler in enumerate(upload_handlers):
                chunk_length = len(chunk)
                chunk = handler.receive_data_chunk(chunk, counters[i])
                counters[i] += chunk_length
                if chunk is None:
                    # Don't continue if the chunk received by
                    # the handler is None.
                    break

        for i, handler in enumerate(upload_handlers):
            file_obj = handler.file_complete(counters[i])
            if file_obj:
                # If it returns a file object, then set the files dict.
                request.FILES.appendlist(file_name, file_obj)
                break

        any(handler.upload_complete() for handler in upload_handlers)