Gmail API encoding - how to get rid of 3D and &amp - python

I am trying to extract the body of Gmail emails via the Gmail API, using Python.
I am able to extract the messages using the commands below. However, there seems to be an issue with the encoding of the email text (the original email contains HTML): every quote is preceded by 3D.
Also, within the a href="my_url", random equals signs (=) appear, and at the end of the link there is an &amp, which is not in the original HTML of the email.
Any idea how to fix this?
Code I use to extract the email:
from __future__ import print_function
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools
from apiclient import errors
import base64
msgs = service.users().messages().list(userId='me', q="no-reply@hello.com", maxResults=1).execute()
for msg in msgs['messages']:
    m_id = msg['id']  # ID of the message returned by the list call
    message = service.users().messages().get(userId='me', id=m_id, format='raw').execute()
The API documentation describes the "raw" format as follows: "Returns the full email message data with body content in the raw field as a base64url encoded string; the payload field is not used."
print(base64.urlsafe_b64decode(message['raw'].encode('ASCII')))
td style=3D"padding:20px; color:#45555f; font-family:Tahoma,He=
lvetica; font-size:12px; line-height:18px; "
JPk79hd =
JFQZEhc6%2BpAiQKF8M85SFbILbNd6IG8%2FEAWwe3VTr2jPzba4BHf%2FEnjMxq66fr228I7OS =

You should check the Content-Transfer-Encoding header to see if it specifies quoted-printable because that looks like quoted-printable encoded text.
Per RFC 1521, Section 5.1:
The Quoted-Printable encoding is intended to represent data that largely consists of octets that correspond to printable characters in the US-ASCII character set. It encodes the data in such a way that the resulting octets are unlikely to be modified by mail transport. If the data being encoded are mostly US-ASCII text, the encoded form of the data remains largely recognizable by humans. A body which is entirely US-ASCII may also be encoded in Quoted-Printable to ensure the integrity of the data should the message pass through a character-translating, and/or line-wrapping gateway.
Python's quopri module can be used to decode emails with this encoding.
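As a minimal sketch of that suggestion: the standard-library email parser reads the Content-Transfer-Encoding header and undoes quoted-printable itself, so the =3D sequences and the trailing = line breaks should disappear. The function name html_body_from_raw and the sample message are made up for illustration, and raw_b64url stands in for the message['raw'] value from the question. The &amp is a separate matter: it is the HTML entity for &, and html.unescape turns it back into a plain ampersand once the URL has been pulled out of the markup.

```python
import base64
import html
import quopri
from email import message_from_bytes, policy


def html_body_from_raw(raw_b64url):
    """Decode a Gmail 'raw' payload and return its HTML (or plain text) body."""
    raw_bytes = base64.urlsafe_b64decode(raw_b64url.encode("ascii"))
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() picks the preferred text part; get_content() undoes the
    # Content-Transfer-Encoding (quoted-printable here) and the charset.
    body = msg.get_body(preferencelist=("html", "plain"))
    return body.get_content() if body is not None else ""


# Hypothetical message standing in for what the Gmail API returns.
sample = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/html; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b'<td style=3D"font-family:Tahoma,He=\n'
    b'lvetica"><a href=3D"http://example.com/?a=3D1&amp;b=3D2">link</a></td>\n'
)
raw_b64url = base64.urlsafe_b64encode(sample).decode("ascii")

decoded_html = html_body_from_raw(raw_b64url)

# A single fragment can also be decoded directly with quopri.
fragment = quopri.decodestring(b'style=3D"color:#45555f"').decode("ascii")

# Entities such as &amp; are HTML escaping, not part of the transfer encoding.
url = html.unescape("http://example.com/?a=1&amp;b=2")
```

If the message is multipart, get_body() walks the parts for you, which avoids splitting lines and stripping characters by hand.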

Sadly, I wasn't able to figure out the proper way to decode the message.
I ended up using the following workaround, which:
1) Splits the message into a list, with each line as a separate list item
2) Finds the list index of the line containing a known start string, and the index of the line containing the end string
3) Builds a new list from the lines between those indexes, then rebuilds it with the last character (the equals sign) cut from each line
4) Joins the new list into a string
5) Searches for the URL I want
x = mime_msg.splitlines()  # convert to a list of lines
a = [i for i, s in enumerate(x) if 'My unique start string' in s][0]  # index of the first line we want
b = [i for i, s in enumerate(x) if 'my end id' in s][0]  # index of the end line
y = x[a:b]  # lines with the info we want
new_list = []
for item in y:
    new_list.append(item[:-1])  # drop the last character, the "=" that the encoding puts at each line break
url = "".join(new_list)  # convert to a string
url = url.replace("3D", "").replace("&amp", "")  # crude cleanup: the encoding leaves stray 3Ds and &amps
csv_url = re.search('Whatever message comes before the URL (.*)', url).group(1)
The above uses
from __future__ import print_function  # must come before any other import
import re
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client import file, client, tools
from apiclient import errors
import base64
import email

I sent a mail from my ASP.NET web service to Gmail.
The content is proper HTML.
It displayed as intended despite the =3D.
Dim Bericht As MailMessage
Bericht = New MailMessage
the content of my styleText is
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-=1">
<meta content="text/html; charset=us-ascii">
<style>h1{color:blue;}
.EditText{
background:#ff0000;/*rood*/
height:100;
font-size:10px;
color:#0000ff;/*blauw*/
}
</style>
</head>
and the content of my body is
<div class='EditText'>this is just some text</div>
Finally, I combine them in
Bericht.Body = "<html>" & styleText & "<body>" & content & "</body></html>"
If I look at the source of the received message, the 3D is still there.
It shows
<html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859-=
=3D1">
<meta content=3D"text/html; charset=3Dus-ascii">
<style>h1{color:blue;}
.EditText{
background:#ff0000;/*rood*/
height:100;
font-size:10px;
color:#0000ff;/*blauw*/
}
</style>
</head><body><div class=3D'EditText'>MailadresAfzender</div></body></html>
The result showed blue text on a red background. Great.

Related

Format outlook email with Python win32 and allow for a class to be used within the HTMLbody

I am trying to make a basic Outlook email template for one of my tasks at work. The goal is to be able to run a script and provide a number to the program so that I can generate repetitive blocks of code for releasing materials in our warehouse. I have figured out how to draft the email (it needs manual inputs, so I use mail.Save to make a draft), but I am struggling to come up with a way to populate x blocks of code for the name/tracking/item/lot/containers portion of the HTML body (I used // to denote the repetitive block). I think I will probably need to remove the r""" """ bit and use quotes so that a function can be used in the middle of the HTMLBody, but I'm not sure. Here is the current code I have:
import win32com.client as win32
import os
outlook = win32.Dispatch('outlook.application')
mail = outlook.CreateItem(0)
# SEND TO
mail.To = 'Email here'
# SUBJECT
mail.Subject = 'Subject here'
# MESSAGE BODY
mail.HTMLBody = r"""
The following stuff is ready: <br><br>
<b>Name</b><br> //This is the block of code I want to repeat X times.
Tracking #/ WO #: <br>
Item #: <br>
Lot #: <br>
X containers <br><br> //This is the end of the repeatable block
Best regards, <br><br>
<p style="font-family: Arial; font-size: 10pt"><b>My Name</b><br>
Title<br>
Company<br>
D: phone number here<br>
fakeEmail@somedomain.com<br>
www.fakesite.com<br></p>
"""
mail.Save()
I originally tried to do something like this with the body portion, which I imagine is closer to what I will need:
mail.HTMLBody = (
"text here"
class call here for repetitive block
"text here to finish signature"
)
Would this approach work better for my goal?
Sounds like you need to build an HTML string based on data from an external source. The MailItem.HTMLBody property returns or sets a string representing the HTML body of the specified item. The value assigned to HTMLBody should be valid HTML, that is, the string should represent a well-formed HTML document.
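One way to build such a string, as a rough sketch: format one template per release and join the results before assigning them to HTMLBody. The names BLOCK_TEMPLATE and build_release_body and the dictionary keys are hypothetical, chosen to mirror the fields in the question; the sample data is invented.

```python
from html import escape

# One repeatable block, mirroring the fields in the question.
BLOCK_TEMPLATE = (
    "<b>{name}</b><br>\n"
    "Tracking #/ WO #: {tracking}<br>\n"
    "Item #: {item}<br>\n"
    "Lot #: {lot}<br>\n"
    "{containers} containers <br><br>\n"
)


def build_release_body(releases):
    """Return the HTML body with one block per release dictionary."""
    blocks = "".join(
        # escape() keeps characters such as & or < in the data from breaking the HTML
        BLOCK_TEMPLATE.format(**{key: escape(str(value)) for key, value in release.items()})
        for release in releases
    )
    return "The following stuff is ready: <br><br>\n" + blocks + "Best regards, <br><br>\n"


releases = [
    {"name": "Widget A", "tracking": "WO-1", "item": "100", "lot": "L1", "containers": 2},
    {"name": "Widget B", "tracking": "WO-2", "item": "200", "lot": "L2", "containers": 5},
]
body = build_release_body(releases)
```

In the original script this would be used as mail.HTMLBody = build_release_body(releases) followed by the signature markup, with the list built from whatever input the script collects.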

adding string variables to HTML body in Python without breaking content-ID

I'm trying to make an automated task that (1) collects CSV data about wildfires from the internet, (2) reads its contents to see where wildfires in the CSV have occurred in the past 3 days, (3) visualizes them on a map, (4) and sends an email to a specific alias, with text and a map about where they have occurred.
In the email (4), I want to mention the locations of wildfires, which are in the form of a string:
print(provinces_on_fire_str)
Out[0]: "Province1, Province2"
I used Content-ID to add an image to the e-mail body, using the following code:
# set the plaintext body
msg.set_content('This is a plain text body.')
# create content-ID for the image
image_cid = make_msgid()
# alternative HTML body
msg.add_alternative("""\
<html>
<body>
<p>This e-mail contains information about fires reported in the past 3 days.<br>
The VIIRS sensor has reported wildfires in the following provinces: {provinces_on_fire_str}
This e-mail was sent automatically.
</p>
<img src="cid:{image_cid}">
</body>
</html>
""".format(image_cid=image_cid[1:-1]), subtype="html")
# attach image to mail
with open(f"path/toimage/{today}.png", 'rb') as img:
    maintype, subtype = mimetypes.guess_type(img.name)[0].split("/")
    msg.get_payload()[1].add_related(img.read(),
                                     maintype=maintype,
                                     subtype=subtype,
                                     cid=image_cid)
This raises an error (a KeyError for provinces_on_fire_str), implying that no such object as "provinces_on_fire_str" exists for the HTML code. Without the "provinces_on_fire_str" variable in the HTML body, the expected output email is the following (albeit this lacks the text explanation of where they occurred):
Now, the obvious thing that came to my mind is to convert the HTML body part to an f-string, so I can add the "Province1, Province2" values to the e-mail text. But adding f before the e-mail string breaks the image_cid (though the Province1, Province2 values are included in the ultimate e-mail).
.add_alternative with f-string input:
# alternative HTML body
msg.add_alternative(f"""\
<html>
<body>
<p>This e-mail contains information about fires reported in the past 3 days.<br>
The VIIRS sensor has reported wildfires in the following provinces: {provinces_on_fire_str}
This e-mail was sent automatically.
</p>
<img src="cid:{image_cid}">
</body>
</html>
""".format(image_cid=image_cid[1:-1]), subtype="html")
Output email:
How do I pass the string values of provinces_on_fire_str into the HTML code without breaking the image_cid?
Consider a simpler example:
foo = 1
bar = 2
print('{foo} == {bar}'.format(foo=foo+1))
It does not work, because bar is not looked up automatically. The .format method of strings is just a method; it does not do any magic. It does not know about the caller's local variables foo and bar; it must be passed all the information that should be used. Of course, because we are passing it information explicitly, we can make modifications.
We can solve the error by simply including the missing argument:
foo = 1
bar = 2
print('{foo} == {bar}'.format(foo=foo+1, bar=bar))
f-strings are magic, or rather, syntactic sugar. The compiler turns them into code that evaluates each {} expression in the current scope and formats the results, much like an equivalent .format call. The result is not a different kind of string: it is a perfectly ordinary string, and calling .format on it afterwards is a perfectly ordinary method call.
If we do
foo = 1
bar = 2
print(f'{foo+1} == {bar}')
then that is already equivalent to the fixed .format version. We can use expressions in the {} placeholders, not just variable names. Notice that this already does the work; we should not have an explicit .format call on the result.
If we have just
foo = 1
bar = 2
print(f'{foo} == {bar}')
then of course we lose the modification of the foo value. If you want to use a modified foo in the formatted output, then either describe the modification in the f-string, or else modify the variable beforehand.
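The same comparison can be run with a real message ID, as a small sketch. The domain argument and the variable names via_format, via_fstring and broken are assumptions for illustration only. The last line reproduces the combination from the question: the f-string has already inserted the unsliced ID, so the trailing .format call finds no placeholders left and changes nothing.

```python
from email.utils import make_msgid

# make_msgid() returns the ID wrapped in angle brackets, e.g. <...@example.com>
image_cid = make_msgid(domain="example.com")
provinces_on_fire_str = "Province1, Province2"

# Explicit .format call: every placeholder must be passed in.
via_format = "cid:{image_cid} {provinces}".format(
    image_cid=image_cid[1:-1], provinces=provinces_on_fire_str
)

# f-string: the slicing moves into the placeholder itself.
via_fstring = f"cid:{image_cid[1:-1]} {provinces_on_fire_str}"

# f-string followed by .format: the brackets are already baked in.
broken = f"cid:{image_cid}".format(image_cid=image_cid[1:-1])
```

Here via_format and via_fstring are identical, while broken still contains the angle brackets, which would explain why the image reference stops working.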
Translating that to the original code, we can either do:
msg.add_alternative(f"""\
<html>
<body>
<p>This e-mail contains information about fires reported in the past 3 days.<br>
The VIIRS sensor has reported wildfires in the following provinces: {provinces_on_fire_str}
This e-mail was sent automatically.
</p>
<img src="cid:{image_cid[1:-1]}">
</body>
</html>
""", subtype="html")
or:
image_cid = image_cid[1:-1]
msg.add_alternative(f"""\
<html>
<body>
<p>This e-mail contains information about fires reported in the past 3 days.<br>
The VIIRS sensor has reported wildfires in the following provinces: {provinces_on_fire_str}
This e-mail was sent automatically.
</p>
<img src="cid:{image_cid}">
</body>
</html>
""", subtype="html")

How to properly decode Quoted Printable encoding in Django HTML Template

I have a form in a Python Google App Engine app that POSTs text to a server, and the text gets encoded as quoted-printable.
My code for POSTing is this:
<form action={{ upload_url }} method="post" enctype="multipart/form-data">
<div class="sigle-form"><textarea name="body" rows="5"></textarea></div>
<div class="sigle-form"><input name="file" type="file" /></div>
</form>
Then the result of fetching self.request.get('body') is quoted-printable encoded. I store this in a db.TextProperty() and later send the text to an HTML template using Django. When I write out the variable using {{ body }}, the result is written with quoted-printable encoding, and there does not seem to be a way of decoding this in the Django HTML template.
Is there any way to have the text in the body sent with an encoding other than quoted-printable? If not, how can I decode this encoding in the Django HTML template?
Submitting the text "ÅØÆ" results in " xdjG ", so the encoded values seem to be combined somehow as well. This happens when more than one special character is present in the text. A single "ø" is encoded to =F8.
EDIT: I get this problem only in production, and this thread seems to talk about the same problem.
If anyone else here on Stack Overflow is doing form submits with blobs and åæøè characters, please respond to this thread with how you solved it!
OK, after two days of working on this issue I finally resolved it. It's seemingly a bug in Google App Engine that makes the encoding f'ed up in production. In production the text is sometimes quoted-printable encoded and at other times base64 encoded. Weird. Here is my solution:
postBody = self.request.get('body')
postBody = postBody.encode('iso-8859-1')
DEBUG = os.environ['SERVER_SOFTWARE'].startswith('Dev')
if DEBUG:
    r.body = postBody
else:
    # restore any stripped base64 padding before decoding
    postBody += "=" * ((4 - len(postBody) % 4) % 4)
    b64 = base64.urlsafe_b64decode(postBody)
However, the resulting b64 can't be stored in the datastore, because it's not ASCII-encoded:
'ascii' codec can't decode byte 0xe5 in position 5: ordinal not in range(128)
I solved a similar problem by using the Python quopri module to decode the string before passing it to an HTML template.
import quopri
body = quopri.decodestring(body)
This seems to have something to do with the multipart/form-data enctype. Quoted-printable encoding is applied to the textarea input, which is then, in my case, submitted via a blobstore upload link. The blobstore returns the text to my upload handler still in encoded form.
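A small sketch using the two values from the question. The function names are hypothetical, and the iso-8859-1 charset is an assumption taken from the encode('iso-8859-1') call in the earlier answer. Note that "xdjG" is exactly what base64 produces for the three Latin-1 bytes of "ÅØÆ", which fits the observation that production sometimes uses base64 instead of quoted-printable.

```python
import base64
import quopri


def decode_quoted_printable(value, charset="iso-8859-1"):
    """Decode a quoted-printable str such as '=F8' into text."""
    return quopri.decodestring(value.encode("ascii")).decode(charset)


def decode_base64(value, charset="iso-8859-1"):
    """Decode a base64 str (restoring stripped padding) into text."""
    padded = value + "=" * (-len(value) % 4)
    return base64.b64decode(padded).decode(charset)


single = decode_quoted_printable("=F8")   # the single character from the question
several = decode_base64("xdjG")           # the three-character value from the question
```

Decoding the bytes to text with an explicit charset before storing them, rather than storing the raw bytes, should also sidestep the 'ascii' codec error mentioned above.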
Not sure what Quoted Printables are but have you tried safe?
{{ body|safe }}
https://docs.djangoproject.com/en/dev/ref/templates/builtins/?from=olddocs#safe

Python HTML - Get element by attribute

There is a music website I regularly read, and it has a section where users post their own fictional music-related stories. There is a 91-part series (written over a length of time, uploaded part by part) that always follows the convention of:
http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_#.html.
I would like to be able to get just the formatted text from every part and put it into one html file.
Conveniently, there is a link to a print version, correctly formatted for my purposes. All I would have to do is write a script to download all of the parts and then dump them into file. Not hard.
Unfortunately, the url for a print version is as follows:
www.ultimate-guitar.com/print.php?what=article&id=95932
The only way to know what article corresponds to what ID field is to look at the value attribute of a certain input tag in the original article.
What I want to do is this:
Go to each page, incrementing through the part numbers.
Find the <input> tag with attribute 'name="rowid"' and get the number in its 'value=' attribute.
Go to www.ultimate-guitar.com/print.php?what=article&id=<value>.
Append everything (minus <html>, <head> and <body>) to an HTML file.
Rinse and repeat.
Is this possible? And is python the right language? Also, what dom/html/xml library should I use?
Thanks for any help.
With lxml and urllib2:
import lxml.html
import urllib2
# implement the logic to download each page, with HTML strings in a sequence named pages
url = "http://www.ultimate-guitar.com/print.php?what=article&id=%s"
for page in pages:
    html = lxml.html.fromstring(page)
    ID = html.find(".//input[@name='rowid']").value
    article = urllib2.urlopen(url % ID).read()
    article_html = lxml.html.fromstring(article)
    with open(ID + ".html", "w") as html_file:
        html_file.write(article_html.find(".//body").text_content())
edit: Upon running this, it seems there may be some Unicode characters in the page. One way to get around this is to do article = article.encode("ascii", "ignore") or to put the encode method after .read(), to force ASCII and ignore Unicode, though this is a lazy fix.
This is assuming you just want the text content of everything inside the body tag. This will save files with the format of storyID.html (so "95932.html") in the local directory of the Python file. Change the save semantics if you like.
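If lxml is not available, the lookup step alone can be sketched with the standard library's html.parser in Python 3. The class and function names (RowIdFinder, find_rowid, print_url_for) and the sample markup are invented for illustration; the actual page structure is an assumption based on the question's description of the input tag.

```python
from html.parser import HTMLParser

PRINT_URL = "http://www.ultimate-guitar.com/print.php?what=article&id=%s"


class RowIdFinder(HTMLParser):
    """Remember the value attribute of the first <input name="rowid">."""

    def __init__(self):
        super().__init__()
        self.rowid = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == "rowid" and self.rowid is None:
            self.rowid = attr_map.get("value")


def find_rowid(page_html):
    """Return the rowid value found in page_html, or None if there is none."""
    finder = RowIdFinder()
    finder.feed(page_html)
    return finder.rowid


def print_url_for(page_html):
    """Build the print-version URL for an article page."""
    rowid = find_rowid(page_html)
    return PRINT_URL % rowid if rowid is not None else None


# Made-up snippet standing in for a downloaded article page.
sample_page = '<form><input type="hidden" name="rowid" value="95932" /></form>'
print_url = print_url_for(sample_page)
```

Downloading the pages themselves is left out here; in Python 3 that part would use urllib.request instead of urllib2.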
You could actually do this in JavaScript/jQuery without too much trouble. JavaScript-like pseudocode, appending to an empty document:
for (var pageNum = 1; pageNum <= 91; pageNum++) {
    $.ajax({
        url: url + pageNum,
        async: false,
        success: function(page) {
            // look up the print ID inside the page that was just fetched
            var printId = $(page).find('input[name="rowid"]').val();
            $.ajax({
                url: printUrl + printId,
                async: false,
                success: function(data) {
                    $('body').append($(data).find('body').contents());
                }
            });
        }
    });
}
After the loading completes you could save the resultant HTML to a file.

Parsing responses of content-type chunked in python

I'm trying to read and parse a request of content-type: chunked in python. Here is what I see when I load the url in a browser and look at the source:
<!-- ---------------------------------------------------------------- http://github.com/Atmosphere ------------------------------------------------------------------------ -->
<!-- Welcome to the Atmosphere Framework. To work with all the browsers when suspending connection, Atmosphere must output some data to makes WebKit based browser working.-->
<!-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------- -->
<!-- EOD -->[{"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam2","value":"2.505730663333334E9"}, {"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam1","value":"1.5584484E9"},{"__publicationName":"dip\/acc\/LHC\/Beam\/Energy","value":"495"},
I'd like to retrieve and parse the json entries like this one:
{"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam2","value":"2.505730663333334E9"}
How can I do this?
Thanks
"chunked" is not a valid content-type, although it is a valid transfer-encoding. Based on the sample you've posted, that doesn't really look like your problem. This looks like a header of comments prepended to a regular JSON response. In many cases, the SGML comments would be ignored by a browser, but you'll have to strip them manually for your own use. Here's an idea for dealing with that:
>>> import json
>>> corpus = '''<!-- ---------------------------------------------------------------- http://github.com/Atmosphere ------------------------------------------------------------------------ -->
... <!-- Welcome to the Atmosphere Framework. To work with all the browsers when suspending connection, Atmosphere must output some data to makes WebKit based browser working.-->
... <!-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------- -->
... <!-- EOD -->[{"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam2","value":"2.505730663333334E9"}, {"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam1","value":"1.5584484E9"},{"__publicationName":"dip\/acc\/LHC\/Beam\/Energy","value":"495"}]'''
>>> junk, data = corpus.split('<!-- EOD -->', 1)
>>> parsed = json.loads(data)
>>> for item in parsed:
...     print item
...
{u'__publicationName': u'dip/acc/LHC/Beam/Intensity/Beam2', u'value': u'2.505730663333334E9'}
{u'__publicationName': u'dip/acc/LHC/Beam/Intensity/Beam1', u'value': u'1.5584484E9'}
{u'__publicationName': u'dip/acc/LHC/Beam/Energy', u'value': u'495'}
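The session above is Python 2. A Python 3 sketch of the same idea follows; the function name parse_after_marker and the shortened sample response are made up for illustration, and it assumes the JSON array after the marker is complete.

```python
import json

EOD_MARKER = "<!-- EOD -->"


def parse_after_marker(text, marker=EOD_MARKER):
    """Drop everything up to the marker and parse the rest as JSON."""
    _head, found, data = text.partition(marker)
    if not found:
        raise ValueError("marker not found in response")
    return json.loads(data)


# Shortened, hypothetical response; a raw string keeps the \/ escapes intact.
corpus = r'''<!-- Welcome to the Atmosphere Framework. -->
<!-- EOD -->[{"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam2","value":"2.505730663333334E9"}, {"__publicationName":"dip\/acc\/LHC\/Beam\/Energy","value":"495"}]'''

entries = parse_after_marker(corpus)
# The values arrive as strings, so convert them if numbers are needed.
values = {entry["__publicationName"]: float(entry["value"]) for entry in entries}
```

If the server keeps the connection open and the array never closes, json.loads will raise an error, so a streaming response would need the entries extracted piece by piece instead.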
