Extract Field Headers from Blob Handler in Flask on App Engine - python

I am trying to parse the POST request sent by the App Engine blobstore handler in development to get the Google Cloud Storage file path ('/gs/...') using Flask. webapp2 provides this through self.get_file_infos() if you inherit from blobstore_handlers.BlobstoreUploadHandler, but that method is not available in Flask.
This is a sample of the raw request data in Flask using request.get_data():
--===============0287937837666164318==
Content-Type: message/external-body; blob-key="encoded_gs_file:ZnBscy1kZXYvZmFrZS1nVTFHNFdrc3hobUFoaEtWVEVmNHZnPT0="; access-type="X-AppEngine-BlobKey"
Content-Disposition: form-data; name="file"; filename="Human Code Reviews One.pdf"
Content-Type: application/pdf
Content-Length: 951486
Content-MD5: NzNhOTI0YjdjNTFiMjEyYmY0NDUzZGFmYzBlOTExNTY=
X-AppEngine-Cloud-Storage-Object: /gs/appname/fake-gU1G4WksxhmAhhKVTEf4vg==
content-disposition: form-data; name="file"; filename="Human Code Reviews One.pdf"
X-AppEngine-Upload-Creation: 2018-01-22 12:26:08.095166
--===============0287937837666164318==--
I have tried both msg = email.parser.Parser().parsestr(raw_data) and msg = email.message_from_string(raw_data), but msg.items() returns an empty list.
If I do rd = raw_data.split('\r\n') and parse a line containing a proper header, I get what I want for that line: [('X-AppEngine-Cloud-Storage-Object', '/gs/appname/fake-gU1G4WksxhmAhhKVTEf4vg==')].
The issue is how to do this for the entire string while skipping the blank and boundary lines.
For now, I am using the following code but I can't help but think there's a way to do this without reinventing the wheel:
for line in raw_data.split('\r\n'):
    if line.startswith(blobstore.CLOUD_STORAGE_OBJECT_HEADER):
        gcs_path = line.split(':')[1].strip()
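One approach I have been considering (a hedged sketch, not something I have verified against the dev server) is to hand the whole body back to the email module, but with the request's own Content-Type header prepended so the parser can see the multipart boundary:
import email

from flask import request
from google.appengine.ext import blobstore

# Sketch, inside the Flask upload view: the email parser only recognises the
# multipart structure if the top-level Content-Type header (which carries the
# boundary) is part of the text it parses, so prepend it to the raw body.
raw = request.get_data(as_text=True)
full_message = 'Content-Type: %s\r\n\r\n%s' % (request.headers['Content-Type'], raw)
msg = email.message_from_string(full_message)

gcs_paths = [
    part[blobstore.CLOUD_STORAGE_OBJECT_HEADER]       # the '/gs/...' path
    for part in msg.walk()
    if blobstore.CLOUD_STORAGE_OBJECT_HEADER in part  # skip parts without it
]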
Thank you.
Edit:
This question is not a duplicate of the one here (How to get http headers in flask?), because here I have a raw multipart body string (note the per-field headers and boundary delimiters, which are not present in ordinary HTTP headers) that I would like to parse into a dictionary.

Related

Handling and uploading binary file data to S3 using AWS API gateway and Lambda in Python

I am trying to process an HTTP request whose body is multipart/form-data.
The input request body arrives as:
START RequestId: 77e9936c-6bf5-48e2-91bc-ab6c9a2d15da Version: $LATEST
------WebKitFormBoundarytY6U5v3pmlyEDnhY
Content-Disposition: form-data; name="contactType"

{"value":"Technical","label":"Question for Technical Team"}
------WebKitFormBoundarytY6U5v3pmlyEDnhY
Content-Disposition: form-data; name="message"

"Test 07/06/2022 9:59"
------WebKitFormBoundarytY6U5v3pmlyEDnhY
Content-Disposition: form-data; name="upload"; filename="W15V011.pdf"
Content-Type: application/pdf

%PDF-1.3 %����--- long filemeta data
After reading the form data (with the help of cgi), I am able to extract all the fields and their data separately. The problem is getting the file data back intact. When I print it, I get the response below:
b'\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\x00\xef\xbf\xbd\x00\x03...conti
when I try to upload it to an S3 bucket using s3.put_object:
s3.put_object(Bucket=bucket, Key=filename, Body=upload)
The upload itself succeeds, but the downloaded file is corrupted. I have tried many approaches but cannot fix it. Please help me fix this.
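The b'\xef\xbf\xbd' runs are UTF-8 replacement characters, which suggests the binary payload was pushed through a lossy text decode at some point. One possible direction (a sketch only, assuming an API Gateway proxy integration event; the key names and helper structure here are assumptions, not the asker's code) is to keep the body as bytes end to end:
import base64
import email


def lambda_handler(event, context):
    # Keep the multipart body as bytes: decoding it to text is what turns
    # binary payloads into U+FFFD replacement characters.
    headers = {k.lower(): v for k, v in event['headers'].items()}
    body = event['body']
    body = base64.b64decode(body) if event.get('isBase64Encoded') else body.encode('utf-8')

    # Let the email parser split the parts; it needs the boundary, which lives
    # in the request's Content-Type header.
    msg = email.message_from_bytes(
        b'Content-Type: ' + headers['content-type'].encode() + b'\r\n\r\n' + body)

    upload = None
    for part in msg.walk():
        if part.get_filename():                     # the uploaded file part
            upload = part.get_payload(decode=True)  # raw bytes, safe for put_object

    # s3.put_object(Bucket=bucket, Key=filename, Body=upload) then receives
    # the untouched bytes.
    return {'statusCode': 200}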

Handle file from WSGI request

Question
What is a good way to handle a file that has been uploaded through a WSGI POST request?
More info
So far, I'm able to read the raw POST data from environ['wsgi.input']. At this point the issue I am having is that the information associated with the file and the file itself are jammed together into one string:
'------WebKitFormBoundarymzmB1wyHKjyqZrDm
Content-Disposition: form-data; name="file"; filename="new file.wav"
Content-Type: audio/wav
THIS IS THE CONTENT
THIS IS THE CONTENT
THIS IS THE CONTENT
THIS IS THE CONTENT
THIS IS THE CONTENT
------WebKitFormBoundarymzmB1wyHKjyqZrDm--
'
Is there a library in python I should be using to handle information more cleanly? Ultimately, I'd like to take the file contents and then turn around and upload to Amazon S3.
You can use cgi.FieldStorage.
import cgi
form = cgi.FieldStorage(fp=environ['wsgi.input'], environ=environ)
f = form['file'].file
# You can use `f` as a file object: f.read(...)
Usually you want much more abstraction than raw WSGI. Consider frameworks that run on WSGI.
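If the end goal really is to push that upload straight to Amazon S3, a minimal boto3 sketch on top of the snippet above might look like this (the bucket name is a placeholder and credentials are assumed to come from the environment):
import boto3

s3 = boto3.client('s3')
# `form` and `f` are the objects from the cgi.FieldStorage snippet above.
s3.put_object(
    Bucket='my-upload-bucket',     # placeholder bucket name
    Key=form['file'].filename,     # reuse the original upload filename
    Body=f.read(),                 # raw bytes of the uploaded file
)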

Is Python's Requests module adding data to a file that it is posting?

I'm trying to use the requests module to send a csv file to an API that uploads data into a database. Since the data is going into a database, the API is configured to reject files that have an unrecognized column name. The accepted columns are "id", "artist", "video". I have a test.csv file with just 1 row of data:
id,artist,video
1,The Shins,Phantom Limb
When I send the file to the api with the following curl request, it goes through successfully.
curl -i -u myUser:myPassword -X POST -T .\test.csv "http://destination.com/api/endpoint/create-or-update-records"
Here's the curl response message:
HTTP/1.1 100 Continue
HTTP/1.1 200 OK
Date: Sat, 11 Oct 2014 14:47:51 GMT
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
Connection: close
However, when I try to send the file using the requests post method like this:
url = "http://destination.com/api/endpoint/create-or-update-records"
files = {'file': open("test.csv", "rb")}
r = requests.post(url, files=files, auth=("myUser","myPassword"))
The response I get back is this:
Unknown fields: '--5a6f03307ed74747904844625f76a82e'. Valid fields are: 'id', 'artist', 'video'
If I send the file again, I get the same message, but the "--lotsofcharacters" is now a different set of characters.
I'm guessing I'm missing a setting or something, but I have combed the requests API and can't figure out what it is. What is different between the curl request and the requests request that is causing one to fail, and the other to go through?
You're not posting plain text to your server with requests. The documentation explicitly states that you use the files parameter when you wish to perform a multipart/form-data upload to the server. In this case all you need to do is
with open('test.csv', 'rb') as csv_file:
    r = requests.post(url, data=csv_file,
                      headers={'Content-Type': 'text/plain'},
                      auth=('user', 'password'))
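To see the difference concretely, you can inspect what each variant actually puts on the wire by preparing the request without sending it (a sketch; the URL is the same endpoint as above):
import requests

url = 'http://destination.com/api/endpoint/create-or-update-records'

# files= builds a multipart/form-data body: boundary lines wrap the CSV,
# which is what the server reports as an "unknown field".
multipart = requests.Request(
    'POST', url, files={'file': open('test.csv', 'rb')}).prepare()
print(multipart.headers['Content-Type'])   # multipart/form-data; boundary=...
print(multipart.body[:80])                 # starts with --<boundary>

# data= sends the file contents as the raw request body, like curl -T does.
plain = requests.Request(
    'POST', url,
    data=open('test.csv', 'rb').read(),
    headers={'Content-Type': 'text/plain'}).prepare()
print(plain.headers['Content-Type'])       # text/plain
print(plain.body[:80])                     # just the CSV rows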

POST request with Multipart/form-data. Content-type not correct

We're trying to write a script in Python (using python-requests at the moment) to do a POST request to a site where the content has to be multipart/form-data.
When we make this POST request manually (by filling in the form on the site and submitting), Wireshark captures this (short version):
Content-Type: multipart/form-data;
Content-Disposition: form-data; name="name"
Data (8 Bytes)
John Doe
When we try to use the python-requests library for achieving the same result, this is sent:
Content-Type: application/x-pandoplugin
Content-Disposition: form-data; name="name"; filename="name"\r\n
Media type: application/x-pandoplugin (12 Bytes)
//and then in this piece is what we posted://
John Doe
The weird thing is that the 'general type' of the packet indeed is multipart/form-data, but the individual item sent (key = 'name', value= 'John Doe') has type application/x-pandoplugin (a random application on my pc I guess).
This is the code used:
response = s.post('http://url.com', files={'name': 'John Doe'})
Is there a way to specify the content-type of the individual items instead of using the headers argument (which only changes the type of the 'whole' packet)?
We think the server doesn't respond correctly because it can't understand the content type we send it.
Little update:
I think the different parts of the multipart content are now identical to the ones sent when I do the POST in the browser, so that's good. Still, the server doesn't actually apply the changes I send with the script. The only thing that is still different is the order of the parts.
For example this is what my browser sends:
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part: (text/plain)
Content-Disposition: form-data; name="file"; filename="ex.txt"\r\n
Content-Type: text/plain\r\n\r\n
Line-based text data: text/plain
lore ipsum blabbla
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="seq"\r\n\r\n
Data (2 bytes)
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="name"\r\n\r\n
Data (2 bytes)
And this is what the script (using python-requests) sends:
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="name"\r\n\r\n
Data (2 bytes)
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part: (text/plain)
Content-Disposition: form-data; name="file"; filename="ex.txt"\r\n
Content-Type: text/plain\r\n\r\n
Line-based text data: text/plain
lore ipsum blabbla
Boundary: \r\n------WebKitFormBoundary3eXDYO1lG8Pgxjwj\r\n
Encapsulated multipart part:
Content-Disposition: form-data; name="seq"\r\n\r\n
Data (2 bytes)
Could it be possible that the server counts on the order of the parts? According to Multipart upload form: Is order guaranteed?, it apparently is? And if so, is it possible to explicitly force an order using the requests library?
And to make things worse in that case, there is a mixture of a file and plain text values, so forcing an order seems rather difficult. This is the current way I do it:
s.post('http://www.url.com', files=files, data=form_values)
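One thing I could still try is to route every part through files as an ordered list, since requests preserves the order of a list and a (None, value) tuple is sent as a plain form field rather than a file (a sketch with placeholder values, mirroring the browser's part order):
# Sketch with placeholder values: every part goes through `files`, so the
# order of this list dictates the order of the parts in the body.
ordered_parts = [
    ('file', ('ex.txt', open('ex.txt', 'rb'), 'text/plain')),
    ('seq', (None, '1')),          # plain text field, placeholder value
    ('name', (None, 'John Doe')),  # plain text field
]
s.post('http://www.url.com', files=ordered_parts)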
EDIT2:
I did a modification in the requests plugin to make sure the order of the parts is the same as in the original request. This doesn't fix the problem so I guess there is no straightforward solution for my problem. I'll send a mail to the devs of the site and hope they can help me!
Your code looks correct.
requests.post('http://url.com', files={'name': 'John Doe'})
...and it should send a multipart/form-data POST.
and indeed, I get something like this posted:
Accept-Encoding: gzip, deflate, compress
Connection: close
Accept: */*
Content-Length: 188
Content-Type: multipart/form-data; boundary=032a1ab685934650abbe059cb45d6ff3
User-Agent: python-requests/1.2.3 CPython/2.7.4 Linux/3.8.0-27-generic
--032a1ab685934650abbe059cb45d6ff3
Content-Disposition: form-data; name="name"; filename="name"
Content-Type: application/octet-stream
John Doe
--032a1ab685934650abbe059cb45d6ff3--
I have no idea why you'd get that weird Content-Type header:
Content-Type: application/x-pandoplugin
I would begin by removing Pando Web Plugin from your machine completely, and then try your python-requests code again. (or try from a different machine)
As of today you can do:
response = s.post('http://url.com', files={'name': (filename, contents, content_type)})
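So for the example above, the per-part content type can be spelled out explicitly (filenames and values here are placeholders):
files = {
    # (filename, contents, content type) for each part
    'file': ('ex.txt', open('ex.txt', 'rb'), 'text/plain'),
    # A None filename makes requests send the part as a plain form field.
    'name': (None, 'John Doe'),
}
response = s.post('http://url.com', files=files)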
Python uses a system-wide configuration file to "guess" the mime-type of a file. If those plugins are registering your file extension with their custom mime-type you'll end up putting that in instead.
The safest approach is to make your own mime-type guessing that suits the particular server you're sending to, and only use the native Python mime-type guessing for extensions you didn't think of.
How exactly you specify the content-type manually with python-requests I don't know, but I expect it should be possible.
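A sketch of that idea, using the standard mimetypes module for the guess and falling back to a generic default for anything it does not recognise:
import mimetypes


def guess_type(filename, default='application/octet-stream'):
    # mimetypes consults system-wide mappings, so a stray plugin registration
    # can change the answer for a given extension; fall back when unsure.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default

files = {'file': ('ex.txt', open('ex.txt', 'rb'), guess_type('ex.txt'))}
response = s.post('http://url.com', files=files)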

How to extract JSON data from a response containing a header and body?

This is my first question posed to Stack Overflow, because typically I can find solutions to my problems here, but for this particular situation I cannot. I am writing a Python plugin for my compiler that outputs REST calls in various languages for interaction with an API. I am authenticating with the socket and ssl modules by sending a username and password in the request body in JSON form. Upon successful authentication, the API returns a response in the following format, with the important response data in the body:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Date: Tue, 05 Feb 2013 03:36:18 GMT
Vary: Accept-Charset, Accept-Encoding, Accept-Language, Accept
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST,OPTIONS,GET
Access-Control-Allow-Headers: Content-Type
Server: Restlet-Framework/2.0m5
Content-Type: text/plain;charset=ISO-8859-1
Content-Length: 94
{"authentication-token":"<token>","authentication-secret":"<secret>"}
This is probably a very elementary question for Pythonistas, given the language's powerful tools for string manipulation, but alas, I am a new programmer who started with Java. What would be the best way to parse this entire response to obtain the "<token>" and "<secret>"? Should I search for a "{" and load the substring as a JSON object? My intuition tells me to try the re module, but I cannot figure out how it would apply here, since the values of the token and secret are obviously not predictable. Because I have opted to authenticate with a low-level module set, this response is one big string, obtained by constructing the header, appending the JSON data as the body, then executing the request and reading the response with the following code:
#Socket configuration and connection execution
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
conn = ssl.wrap_socket(sock, ca_certs = pem_file)
conn.connect((host, port))
conn.send(req)
response = conn.recv()
print(response)
The print statement outputs the first code sample. Any help or insight would be greatly appreciated!
HTTP headers are split from the rest of the body by a \r\n\r\n sequence. Do something like:
import json
...
headers, js = response.split("\r\n\r\n", 1)  # split only at the first blank line
data = json.loads(js)
token = data["authentication-token"]
secret = data["authentication-secret"]
You'll probably want to check the response, etc, and various libraries (e.g. requests) can do all of this a whole lot easier for you.
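For example, with requests the whole exchange collapses to something like this (the URL, payload keys, and certificate path are placeholders):
import requests

resp = requests.post(
    'https://example.com/auth',                     # placeholder endpoint
    json={'username': 'user', 'password': 'pass'},  # placeholder credentials
    verify='ca_certs.pem',                          # same CA bundle as pem_file above
)
resp.raise_for_status()      # surface non-2xx responses instead of parsing them
data = resp.json()           # parsed body, no manual header/body splitting
token = data['authentication-token']
secret = data['authentication-secret']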
