Parse Python FormData - python

I'm having difficulty working out how to parse FormData with Python. I have already base64-decoded the request body as follows:
import base64
def lambda_handler(event, context):
    print(base64.b64decode(event['body']))
    # dict_obj = json.loads(base64.b64decode(event['body']))  # raises an error
Now printing out the base64 decoded body prints out:
b'------WebKitFormBoundaryHFIqTxrMuAs5kep2\r\nContent-Disposition: form-data; name="username"\r\n\r\nTESTUSER\r\n------WebKitFormBoundaryHFIqTxrMuAs5kep2\r\nContent-Disposition: form-data; name="regionValue"\r\n\r\nreg\r\n------WebKitFormBoundaryHFIqTxrMuAs5kep2\r\nContent-Disposition: form-data; name="countryValue"\r\n\r\ncountry\r\n------WebKitFormBoundaryHFIqTxrMuAs5kep2\r\nContent-Disposition: form-data; name="gridValue"\r\n\r\ngrid\r\n------WebKitFormBoundaryHFIqTxrMuAs5kep2\r\nContent-Disposition: form-data; name="file1"; filename="Rate.csv"\r\nContent-Type: text/csv\r\n\r\nCluster Code,Pace Product Code...
This string contains the following information: (1) multiple text fields (including a 'username' value), (2) two CSV files.
How can I capture these variables from the string? Ideally I want the username as a variable and each file stored as a pandas DataFrame.
Any tips?
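One possible approach, sketched here under assumptions rather than taken from the thread: the standard library's email package can parse a multipart/form-data body if it is also given the request's Content-Type header value, which carries the boundary. The helper name parse_multipart and the sample body are hypothetical; in a Lambda handler the header would typically come from event['headers'], and the exact key casing depends on the integration.

```python
from email.parser import BytesParser
from email.policy import default


def parse_multipart(body: bytes, content_type: str) -> dict:
    """Return {field name: raw bytes} for a multipart/form-data body.

    Hypothetical helper: content_type must be the request's Content-Type
    header value, including the boundary parameter.
    """
    # Prepend a minimal header block so the email parser sees a MIME message.
    prefix = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = BytesParser(policy=default).parsebytes(prefix + body)
    fields = {}
    for part in message.iter_parts():
        # The field name lives in the Content-Disposition header of each part.
        name = part.get_param("name", header="content-disposition")
        fields[name] = part.get_payload(decode=True)
    return fields


# Made-up sample body with one text field and one CSV file.
sample_body = (
    b"--XyZ\r\n"
    b'Content-Disposition: form-data; name="username"\r\n\r\n'
    b"TESTUSER\r\n"
    b"--XyZ\r\n"
    b'Content-Disposition: form-data; name="file1"; filename="Rate.csv"\r\n'
    b"Content-Type: text/csv\r\n\r\n"
    b"a,b\r\n1,2\r\n"
    b"--XyZ--\r\n"
)
fields = parse_multipart(sample_body, "multipart/form-data; boundary=XyZ")
username = fields["username"].decode("utf-8")
```

The bytes of a CSV part could then be handed to pandas, for example with pandas.read_csv(io.BytesIO(fields['file1'])).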

Related

POST two form inputs with same name but different type

I have a webpage that submits two inputs in a form. My browser's POST can be seen below.
The first input is the path to the current file uploaded to the site (type="hidden"); the second input is the new file to be uploaded (type="file").
-----------------------------334620328614915644833547167693
Content-Disposition: form-data; name="image1"
data/file.jpg
-----------------------------334620328614915644833547167693
Content-Disposition: form-data; name="image1"; filename=""
Content-Type: application/octet-stream
These inputs have the same name but are typed differently, which causes problems with my understanding of the requests library. As far as I know, to submit data, you POST a dictionary with the name of the input and its new value.
data = {}
form = get_all_forms(url)[1]  # function returns list of forms gained using requests.get() - we select the correct form
formDetails = get_form_details(form)  # uses BeautifulSoup to find all input values
for inputTag in formDetails["inputs"]:
    if inputTag["type"] == "file":
        pass  # doesn't send anything
    else:
        data[inputTag["name"]] = inputTag["value"]  # sets to current value
res = requests.post(url, data=data)
The above code will POST a dictionary with 'image1':'data/file.jpg'. Unfortunately, the page now thinks that there is a new file being uploaded called data/file.jpg, and when it can't find that file, it deletes the current image.
How do I POST separate values for type="file" and type="hidden"?
First of all, just a suggestion: consider using different, more significant name attributes for your form fields (emphasis on different!). In your case, something like this would make more sense:
<input type="hidden" name="old-file-path" value="data/file.jpg"/>
<input type="file" name="new-file"/>
Now, coming to the actual question, you can achieve what you want using both the data= and files= parameters of requests.post(), like this:
new_file = open('path/to/new/file.jpg', 'rb')  # open for reading; 'wb' would truncate the file
data = {'image1': 'data/file.jpg'}
files = {'image1': new_file}
requests.post(url, data=data, files=files)
Which should generate a request like this:
...
Content-Type: multipart/form-data; boundary=d64ebc3e14a4909699c6f01dd1473855
--d64ebc3e14a4909699c6f01dd1473855
Content-Disposition: form-data; name="image1"
data/file.jpg
--d64ebc3e14a4909699c6f01dd1473855
Content-Disposition: form-data; name="image1"; filename="file.jpg"
...
The key point here is to use files= for the new file to upload and data= for the old file name.
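To check what would actually be sent without contacting the server, one option is to build the request and inspect its prepared body. This is only a minimal sketch: the URL, the in-memory file content and the helper name build_upload_request are made up for illustration.

```python
from io import BytesIO

import requests


def build_upload_request(url, old_path, new_file_bytes, filename):
    """Prepare (but do not send) a POST with a text field and a file field
    that share the same name. Hypothetical helper for illustration."""
    data = {"image1": old_path}
    files = {"image1": (filename, BytesIO(new_file_bytes), "application/octet-stream")}
    return requests.Request("POST", url, data=data, files=files).prepare()


prepared = build_upload_request(
    "https://example.com/upload",  # placeholder URL, never contacted
    "data/file.jpg",
    b"fake image bytes",
    "file.jpg",
)
content_type = prepared.headers["Content-Type"]
body = prepared.body  # bytes of the multipart payload
```

Printing body shows both parts, which makes it easy to compare against what the browser sends.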
If you also want to explicitly set the new file's name and Content-Type, you can pass a 3-item tuple in the files= dict instead, like this:
new_file = open('path/to/new/file.jpg', 'rb')  # open for reading; 'wb' would truncate the file
data = {'image1': 'data/file.jpg'}
files = {'image1': ('NEW_FILE_NAME', new_file, 'application/octet-stream')}
requests.post(url, data=data, files=files)
Result:
...
Content-Type: multipart/form-data; boundary=d64ebc3e14a4909699c6f01dd1473855
--d64ebc3e14a4909699c6f01dd1473855
Content-Disposition: form-data; name="image1"
data/file.jpg
--d64ebc3e14a4909699c6f01dd1473855
Content-Disposition: form-data; name="image1"; filename="NEW_FILE_NAME"
Content-Type: application/octet-stream
...
Or if you want to send a file from memory and not from disk, or even an empty file, you can use:
from io import BytesIO
new_file = BytesIO(b'...file content here...')
# ... same code as above
Adapting this to your existing code, the whole thing should become:
data = {}
files = {}
new_file = ... # obtain the new file somehow
for inputTag in formDetails["inputs"]:
    if inputTag["type"] == "file":
        files[inputTag["name"]] = ("NEW_FILE_NAME", new_file, "application/octet-stream")
    elif inputTag["type"] == "hidden":
        data[inputTag["name"]] = inputTag["value"]
requests.post(url, data=data, files=files)
Finally, in your bounty comment, I see you say:
[...] and when I just need multiple values for the same input name.
In this case you would have to use a list instead of a dictionary for data=, like this:
data = [('image1', 'A'), ('image1', 'B')]
# ...
requests.post(url, data=data)
And the result would look like this:
--9cc0d413be5d2ed2ba8c80a2e7b54442
Content-Disposition: form-data; name="image1"
A
--9cc0d413be5d2ed2ba8c80a2e7b54442
Content-Disposition: form-data; name="image1"
B
Whether or not this makes sense though is up to you to decide, as it depends on the server-side implementation and the meaning of the fields when using the same name multiple times.

Set value for a attribute in xml while send POST request with xml body in Python

I am writing a Python program that sends a GET request and receives an XML body in response. I have to change a few attribute values in that XML body and then send it back in a POST request.
I have already tried:
a = '14256601101'
xml ="""<?xml version="1.0" encoding="UTF-8"?><nms:provision xmlns:nms="urn:oma:xml:rest:netapi:nms:1"><attributeList><attribute><name>MSISDN</name><value>%str(a)</value></attributeList></nms:provision>"""
print xml
Here I want %str(a) to be replaced by 14256601101.
Use str.format(): replace %str(a) with {} and call .format(a) on the whole string:
a = '14256601101'
xml ="""<?xml version="1.0" encoding="UTF-8"?><nms:provision xmlns:nms="urn:oma:xml:rest:netapi:nms:1"><attributeList><attribute><name>MSISDN</name><value>{}</value></attributeList></nms:provision>""".format(a)
print(xml)
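If the values to change sit inside XML returned by the server, an alternative to string templating is to parse the body, edit the elements, and serialize it again. The sketch below is written for Python 3 and assumes a well-formed version of the XML (with a closing </attribute> tag, which the snippet in the question lacks); set_attribute_value is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NMS_NS = "urn:oma:xml:rest:netapi:nms:1"
# Keep the "nms" prefix when serializing instead of an auto-generated one.
ET.register_namespace("nms", NMS_NS)


def set_attribute_value(xml_text, name, new_value):
    """Set <value> for every <attribute> whose <name> equals `name`."""
    root = ET.fromstring(xml_text)
    for attribute in root.iter("attribute"):
        value_node = attribute.find("value")
        if attribute.findtext("name") == name and value_node is not None:
            value_node.text = new_value
    return ET.tostring(root, encoding="unicode")


# Assumed, well-formed stand-in for the body received from the GET request.
received = (
    '<nms:provision xmlns:nms="urn:oma:xml:rest:netapi:nms:1">'
    "<attributeList><attribute><name>MSISDN</name><value>0</value>"
    "</attribute></attributeList></nms:provision>"
)
updated = set_attribute_value(received, "MSISDN", "14256601101")
```

The string in updated can then be sent as the body of the POST request.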

Gmail API encoding - how to get rid of 3D and &amp

I am trying to extract the body of Gmail emails via the Gmail API, using Python as well.
I am able to extract the messages using the commands below. However, there seems to be an issue with the encoding of the email text (the original email has HTML in it): for some reason, 3D appears before every quote.
Also, within the a href="my_url", random equal signs (=) appear, and at the end of the link there is an &amp character sequence that is not in the original HTML of the email.
Any idea how to fix this?
Code I use to extract the email:
from __future__ import print_function
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools
from apiclient import errors
import base64
msgs = service.users().messages().list(userId='me', q="no-reply#hello.com", maxResults=1).execute()
for msg in msgs['messages']:
    m_id = msg['id']  # ID of the current message
    message = service.users().messages().get(userId='me', id=m_id, format='raw').execute()
The API documentation describes format='raw' as: "Returns the full email message data with body content in the raw field as a base64url encoded string; the payload field is not used."
print(base64.urlsafe_b64decode(message['raw'].encode('ASCII')))
td style=3D"padding:20px; color:#45555f; font-family:Tahoma,He=
lvetica; font-size:12px; line-height:18px; "
JPk79hd =
JFQZEhc6%2BpAiQKF8M85SFbILbNd6IG8%2FEAWwe3VTr2jPzba4BHf%2FEnjMxq66fr228I7OS =
You should check the Content-Transfer-Encoding header to see if it specifies quoted-printable because that looks like quoted-printable encoded text.
Per RFC 1521, Section 5.1:
The Quoted-Printable encoding is intended to represent data that largely consists of octets that correspond to printable characters in the US-ASCII character set. It encodes the data in such a way that the resulting octets are unlikely to be modified by mail transport. If the data being encoded are mostly US-ASCII text, the encoded form of the data remains largely recognizable by humans. A body which is entirely US-ASCII may also be encoded in Quoted-Printable to ensure the integrity of the data should the message pass through a character-translating, and/or line-wrapping gateway.
Python's quopri module can be used to decode emails with this encoding.
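As a rough sketch of that idea, the standard library's email package can apply the Content-Transfer-Encoding for you once it is given the raw message bytes, and html.unescape turns &amp; back into &. The sample message and the helper name decode_html_body are made up for illustration; with the Gmail API, the input would be the result of base64.urlsafe_b64decode(message['raw']).

```python
import email
import html
from email import policy


def decode_html_body(raw_bytes):
    """Return the decoded HTML (or plain text) body of a raw RFC 822 message."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = message.get_body(preferencelist=("html", "plain"))
    if body is None:
        return None
    # get_content() undoes quoted-printable (or base64) and decodes the charset.
    return body.get_content()


# Made-up quoted-printable message: "=3D" is an encoded "=", and a trailing
# "=" at the end of a line is a soft line break.
raw = (
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<a href=3D"https://example.com/report?id=3D1&amp;fmt=3Dcsv">Down=\r\n'
    b"load</a>\r\n"
)
decoded = decode_html_body(raw)
unescaped = html.unescape(decoded)  # "&amp;" becomes "&"
```

For a fragment that has already been cut out of the message, quopri.decodestring() on the bytes does the same quoted-printable step on its own.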
Sadly I wasn't able to figure out the proper way to decode the message.
I ended up using the following workaround, which:
1) Splits the message into a list, with each line as a separate list item.
2) Finds the list index of the start string and the index of the end string.
3) Builds a new list from the range found in #2, cutting the last character (the equals sign) off each item.
4) Joins the new list into a single string.
5) Searches that string for the URL I want.
x = mime_msg.splitlines()  # convert to list
a = [i for i, s in enumerate(x) if 'My unique start string' in s][0]  # index of the first line we want
b = [i for i, s in enumerate(x) if 'my end id' in s][0]  # index of the end line
y = x[a:b]  # list with the info we want
new_list = []
for item in y:
    new_list.append(item[:-1])  # drop the last character, the trailing "=" (quoted-printable soft line break)
url = "".join(new_list)  # convert to string
url = url.replace("3D", "").replace("&amp", "")  # clean up the stray 3Ds and &amps left by the encoding
csv_url = re.search('Whatever message comes before the URL (.*)', url).group(1)
The above uses
from __future__ import print_function
import re
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools
from apiclient import errors
import base64
import email
I sent a mail from my ASP.NET web service to Gmail.
The content is real HTML.
It displayed as intended despite the =3D.
Dim Bericht As MailMessage
Bericht = New MailMessage
The content of my styleText is
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-=1">
<meta content="text/html; charset=us-ascii">
<style>h1{color:blue;}
.EditText{
background:#ff0000;/*rood*/
height:100;
font-size:10px;
color:#0000ff;/*blauw*/
}
</style>
</head>
and the content of my body is
<div class='EditText'>this is just some text</div>
Finally, I combine them in
Bericht.Body = "<html>" & styleText & "<body>" & content& "</body></html>"
If I look at the source of the received message, the 3D is still there.
It shows
<html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859-=
=3D1">
<meta content=3D"text/html; charset=3Dus-ascii">
<style>h1{color:blue;}
.EditText{
background:#ff0000;/*rood*/
height:100;
font-size:10px;
color:#0000ff;/*blauw*/
}
</style>
</head><body><div class=3D'EditText'>MailadresAfzender</div></body></html>
The result showed blue text on a red background. Great.

Take the content of a single tag from an xml file with the requests module

I make a request with the Python requests module to a SOAP service with this code:
response = requests.get(url,data=body,headers=headers)
and the service returns this XML as the response:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:aa="example.com/api"><soap:Body>
<aa:GetStockFileResponse> GetStockFileResponseType
<aa:TestMode> boolean </aa:TestMode>
<aa:Errors> ArrayOfError
<aa:Error> Error
<aa:Code> int </aa:Code>
<aa:Description> string </aa:Description>
</aa:Error>
</aa:Errors>
<aa:Warnings> ArrayOfWarning
<aa:Warning> Warning
<aa:Code> int </aa:Code>
<aa:Description> string </aa:Description>
</aa:Warning>
</aa:Warnings>
<aa:StockFileFormat> StockFileFormat (string) </aa:StockFileFormat>
<aa:FieldDelimiter> StringLength1 (string) </aa:FieldDelimiter>
<aa:File> base64Binary </aa:File>
</aa:GetStockFileResponse>
</soap:Body></soap:Envelope>
I need to write the content of <aa:File> base64Binary </aa:File>, which is a base64-encoded CSV file, to a CSV file.
My code to write the response is:
with open('test.csv', 'wb') as f:
    f.write(response.content)
which obviously writes the whole XML...
How can I take only the <aa:File> base64Binary </aa:File> content?
Would something like this be the solution?
import re
xmlText = '<foo>Foo</foo><aa:File> base64Binary </aa:File><bar>Bar</bar>'
# Target to extract: " base64Binary "
content = re.findall(r'<aa:File>(.+?)</aa:File>', xmlText)
print(content)  # findall returns a list: [' base64Binary ']
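A regex works for a quick test, but a namespace-aware XML parser is usually more robust. The following is a sketch only: the namespace URIs are copied from the response shown in the question, while extract_file_bytes and the sample payload are made up for illustration.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace URIs copied from the response shown in the question.
NAMESPACES = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "aa": "example.com/api",
}


def extract_file_bytes(xml_bytes):
    """Return the base64-decoded content of the first <aa:File> element."""
    root = ET.fromstring(xml_bytes)
    node = root.find(".//aa:File", NAMESPACES)
    if node is None or not (node.text or "").strip():
        raise ValueError("response contains no <aa:File> content")
    return base64.b64decode(node.text.strip())


# Made-up response carrying a tiny base64-encoded CSV.
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" '
    b'xmlns:aa="example.com/api"><soap:Body><aa:GetStockFileResponse>'
    b"<aa:File> " + base64.b64encode(b"sku,qty\nA1,3\n") + b" </aa:File>"
    b"</aa:GetStockFileResponse></soap:Body></soap:Envelope>"
)
csv_bytes = extract_file_bytes(sample)
```

With the real response, the decoded bytes could be written with f.write(extract_file_bytes(response.content)) inside the existing with open('test.csv', 'wb') block.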

How to iteratively parse and save XML responses that come in as one string?

I am making API calls that retrieve IDs. Each call returns 10000 IDs, and I can only retrieve 10000 at a time. My goal is to save each XML response into a list so that I can automatically count how many people are in the platform.
The problem I am running into is twofold:
1. Each call comes back as a response object. When I append it to a list, it is appended as a single string, so I cannot count the total number of IDs.
2. To get the next 10000 IDs, I have to use another API call to get information about each ID, retrieve a piece of information called the website ID, and use that to call the next 10000 from the API in #1.
I also want to prevent any duplicate IDs in the list, but I feel like this is the easiest task.
Here is my code:
1
Call profile IDs (each call brings back 10000)
Append response object 'r' into list 'lst'
import requests
import xml.etree.ElementTree as et
import pandas as pd
from lxml import etree
import time
lst = []
xml = '''
<?xml version="1.0" encoding="utf-8" ?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>*****</ApiKey>
<CallID>009</CallID>
<SaPasscode>*****</SaPasscode>
<Call Method="Sa.People.All.GetIDs">
<Timestamp></Timestamp>
<WebsiteID></WebsiteID>
<Groups>
<Code></Code>
<Name></Name>
</Groups>
</Call>
</YourMembership>
'''
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r = requests.post('https://api.yourmembership.com', data=xml, headers=headers)
lst.append(r.text)
API Call result
<YourMembership_Response>
<Sa.People.All.GetIDs>
<People>
<ID>1234567</ID>
</People>
</Sa.People.All.GetIDs>
</YourMembership_Response>
2
I take the last ID from API call in #1 and manually input the value
into the API call below in the 'ID' tags.
xml_2 = '''
<?xml version="1.0" encoding="utf-8" ?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>****</ApiKey>
<CallID>001</CallID>
<SaPasscode>****</SaPasscode>
<Call Method="Sa.People.Profile.Get">
<ID>1234567</ID>
</Call>
</YourMembership>
'''
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r_2 = requests.post('https://api.yourmembership.com', data=xml_2, headers=headers)
print (r_2.text)
API call result:
<YourMembership_Response>
<ErrCode>0</ErrCode>
<ExtendedErrorInfo></ExtendedErrorInfo>
<Sa.People.Profile.Get>
<ID>1234567</ID>
<WebsiteID>7654321</WebsiteID>
</YourMembership_Response>
I take the website ID and rerun the API call from #1 (example below) with the WebsiteID tag filled in, to get the next 10000, and repeat until no more results come back:
xml = '''
<?xml version="1.0" encoding="utf-8" ?>
<YourMembership>
<Version>2.25</Version>
<ApiKey>*****</ApiKey>
<CallID>009</CallID>
<SaPasscode>*****</SaPasscode>
<Call Method="Sa.People.All.GetIDs">
<Timestamp></Timestamp>
<WebsiteID>7654321</WebsiteID>
<Groups>
<Code></Code>
<Name></Name>
</Groups>
</Call>
</YourMembership>
'''
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
r = requests.post('https://api.yourmembership.com', data=xml, headers=headers)
lst.append(r.text)
Hope my question makes sense, and thank you in advance.
I once started building something to crawl an API, which sounds similar to what you are aiming to achieve. One difference in my case was that the response came as JSON instead of XML, but that shouldn't be a big deal.
I can't see evidence in your question that you are really using the power of the XML parser. Have a look at the docs. For example, you can easily get the ID number out of the items you are appending to the list like this:
xml_sample = """
<YourMembership_Response>
<Sa.People.All.GetIDs>
<People>
<ID>1234567</ID>
</People>
</Sa.People.All.GetIDs>
</YourMembership_Response>
"""
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_sample)
print(root[0][0][0].text)  # prints 1234567
Experiment: apply it in a loop to each element in the list, or maybe you will be lucky and the whole response object will parse without needing to look through things.
You should now be able to enter that number in the next bit of code programmatically instead of manually.
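A minimal sketch of such a loop, assuming every response has the same shape as the sample in the question (the helper names and the sample responses are made up): it pulls every <ID> out of each stored response string and keeps only the first occurrence of each, so the length of the result is the number of distinct people.

```python
import xml.etree.ElementTree as ET


def extract_ids(response_text):
    """Return the text of every <ID> element in one XML response string."""
    root = ET.fromstring(response_text.strip())
    return [node.text for node in root.iter("ID") if node.text]


def collect_unique_ids(responses):
    """Merge IDs from several responses, dropping duplicates but keeping order."""
    seen = set()
    ordered = []
    for text in responses:
        for person_id in extract_ids(text):
            if person_id not in seen:
                seen.add(person_id)
                ordered.append(person_id)
    return ordered


# Two made-up responses with one overlapping ID.
template = (
    "<YourMembership_Response><Sa.People.All.GetIDs><People>{}"
    "</People></Sa.People.All.GetIDs></YourMembership_Response>"
)
lst = [
    template.format("<ID>1</ID><ID>2</ID>"),
    template.format("<ID>2</ID><ID>3</ID>"),
]
unique_ids = collect_unique_ids(lst)
last_id = unique_ids[-1]  # candidate for the next Sa.People.Profile.Get call
```

Here len(unique_ids) gives the running total, and last_id replaces the value that was previously typed in by hand.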
Your XML for the next section, for the website ID, seems to have an invalid line in it: <Sa.People.Profile.Get>. Once I take it out, it can be parsed:
xml_sample2 = """
<YourMembership_Response>
<ErrCode>0</ErrCode>
<ExtendedErrorInfo></ExtendedErrorInfo>
<ID>1234567</ID>
<WebsiteID>7654321</WebsiteID>
</YourMembership_Response>
"""
root2 = ET.fromstring(xml_sample2)
print(root2[3].text)  # prints 7654321
So I'm not sure whether there is always an invalid line there or you forgot to paste something; maybe remove that line with a regex or similar before applying ElementTree.
I would recommend trying SQLite to help with the interactions between 1 and 2. I think it's good up to about half a million rows; beyond that you would need to hook up to a proper database. It saves a file in your directory and has less setup time and fuss than a proper database. Perhaps test the concept with SQLite and, if necessary, migrate to PostgreSQL.
You can store whichever useful elements from the parsed XML you like (user ID, website ID) in a table and pull them out again to use in a different section. It is also not hard to go back and forth between SQLite and pandas DataFrames if you need to, with pandas.read_sql and pandas.DataFrame.to_sql. Hope this helps.
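A small sketch of the SQLite idea, using only the standard library. The table name people, its columns and the helper names are assumptions made for illustration; the primary key lets the database reject duplicate IDs on its own.

```python
import sqlite3


def save_people(conn, rows):
    """Insert (id, website_id) pairs, silently skipping IDs already stored."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS people ("
            "id TEXT PRIMARY KEY, website_id TEXT)"
        )
        conn.executemany(
            "INSERT OR IGNORE INTO people (id, website_id) VALUES (?, ?)", rows
        )


def count_people(conn):
    """Return the number of distinct IDs stored so far."""
    return conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]


# In-memory database for the example; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
save_people(conn, [("1234567", "7654321"), ("1234568", None)])
save_people(conn, [("1234567", "7654321")])  # duplicate, ignored
total = count_people(conn)
```

Calling save_people after each batch keeps the running count available between API calls without holding every response in memory.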
