Python decode_header splits the original string - python

Using Python 3, I'm trying to parse e-mails from an mbox file.
import mailbox
from email.header import decode_header
for message in mailbox.mbox('file'):
    sender = message['From']
    c = decode_header(sender)
The raw e-mail has this From: header:
From: "=?UTF-8?Q?Mark_from_Site?=" <info@site.com>
In this case, c is
[(b'"', None), (b'Mark from Site', 'utf-8'), (b'" <info@site.com>', None)]
The header is unexpectedly split at the quotation marks (") into multiple elements.
Handling this may be cumbersome, because the list can contain any number of elements (not always 3 as above), depending on the number of quotation marks, and there may be other causes of splitting as well.
When there is no encoded string (that is, when the header is pure ASCII), there is no split and c holds "Mark from Site" <info@site.com> as a single element.
Is there a way to avoid this splitting also for non-ascii encodings?
Or, otherwise, how to correctly parse this kind of headers?

What about doing the simplest thing, i.e. converting all parts to Unicode and then gluing them together? (Note that from is a reserved word in Python, so the result needs another name.)
sender_text = ''.join(t[0].decode(t[1] if t[1] else 'UTF-8') for t in decode_header(sender))
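One caveat with joining the parts this way: when the header contains no encoded words, decode_header returns the text as str rather than bytes, so calling .decode() on it fails. Here is a minimal sketch that tolerates both cases. decode_mime_header is a hypothetical helper name, and falling back to UTF-8 for parts without a charset is an assumption carried over from the one-liner above.

```python
from email.header import decode_header

def decode_mime_header(value, fallback="utf-8"):
    """Decode RFC 2047 encoded words and return one plain string.

    Hypothetical helper; `fallback` is used for parts that carry no charset.
    """
    pieces = []
    for part, charset in decode_header(value):
        if isinstance(part, bytes):
            # Headers with encoded words come back as bytes chunks.
            pieces.append(part.decode(charset or fallback, errors="replace"))
        else:
            # Headers without encoded words come back as str already.
            pieces.append(part)
    return "".join(pieces)

print(decode_mime_header('"=?UTF-8?Q?Mark_from_Site?=" <info@site.com>'))
print(decode_mime_header('"Mark from Site" <info@site.com>'))
```

Both calls should print the same readable header, so the caller no longer has to care how many chunks decode_header produced.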

You can have the email.header module handle encoding for you by creating an instance of email.header.Header with your string and the charset it should be encoded in.
import mailbox
from email.header import Header, decode_header
for message in mailbox.mbox('file'):
    sender = Header(message['From'], "utf-8")
    c = decode_header(sender)

str(email.header.make_header(email.header.decode_header(encoded_string)))
Not too obvious, but this should decode and correctly rebuild the header and convert it to a string. I also found this somewhere here on StackOverflow.
Not sure if it's the most elegant way, but seems to work for me.
See https://docs.python.org/3/library/email.header.html for the documentation of these functions.
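For a From: header specifically, it may be simpler to separate the display name from the address before decoding, so the quotation marks never reach decode_header. This is a sketch of that idea rather than the method from the answer above; it relies only on the standard email.utils.parseaddr, and the sample header is the one from the question.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr

raw_from = '"=?UTF-8?Q?Mark_from_Site?=" <info@site.com>'

# Split the header into display name and address first...
encoded_name, address = parseaddr(raw_from)

# ...then decode only the display name.
name = str(make_header(decode_header(encoded_name)))

print(name)
print(address)
```

With the name and address held separately, the number of chunks returned by decode_header stops being a concern for the caller.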

Related

Is there a way to get around unicode issues when using win32api/com modules in python 3?

I've looked around and haven't found anything just yet. I'm going through emails in an inbox and checking for a specific word set. It works on most emails, but some of them don't parse. I checked the broken emails using
print (msg.Body.encode('utf8'))
and my problem messages all start with b', like this:
b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5\xe3\xb9\xa4\xe0\xa8\x8d\xe6\xb4\xbc\xe7\x91\xa5\xe2\x81\xa1\xe7\x91\x
I think this is forcing Python to read the body as bytes, but I'm not sure. Either way, after the b, no matter what encoding I try, I get nothing but garbage text.
I've tried other encodings, and also decoding first, but I just get a ton of attribute errors.
import win32api
import win32com.client
import datetime
import os
import time
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
dater = datetime.date.today() - datetime.timedelta(days = 1)
dater = str(dater.strftime("%m-%d-%Y"))
print (dater)
#for folders in outlook.folders:
# print(folders)
Receipt = outlook.folders[8]
print(Receipt)
Ritems = Receipt.folders["Inbox"]
Rmessage = Ritems.items
for msg in Rmessage:
    if (msg.Class == 46 and msg.CreationTime.strftime("%m-%d-%Y") == dater):
        print (msg.CreationTime)
        print (msg.Subject)
        print (msg.Body.encode('utf8'))
        print ('..............................')
End result is to have the message printed out in the console, or at least give Python a way to read it so I can find the text I'm looking for in the body.
The byte literal posted in the question is valid UTF-8. First two characters are U+683C and U+6D74 from the CJK Unified Ideographs block, U+4E00 - U+9FFF.
Since you don't know the source encoding, there is no way to be completely sure about it, but chances are that the email body is just Han characters encoded in UTF-8 (see Determine the encoding of text in Python). If you cannot see the UTF-8 characters correctly, check the character set of your terminal or display.
That said, you should get the fundamentals of character representation right. Randomly encoding or decoding is hardly going to solve anything. I would suggest you begin by reading Spolsky's introduction to Unicode and then move on to Batchelder on Unicode in Python.
As martineau said, the proper encoding I was searching for was utf16. The other messages were encoded using utf8. So a simple mail scrape turned out to be an excellent lesson in encoding as well as message classes (off topic). Thanks for the help.
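To illustrate the UTF-16 conclusion: the sample bytes from the question turn into readable markup once each character is split back into the two bytes it was built from. This is only a sketch. It assumes the body was single-byte text that got read as little-endian UTF-16, and in the original loop the value of garbled would come from msg.Body.

```python
# First bytes of the output printed in the question.
printed = b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5\xe3\xb9\xa4\xe0\xa8\x8d'

# What Python held as text: characters that look like CJK ideographs.
garbled = printed.decode('utf-8')

# Recover the underlying byte pairs and read them as single-byte text.
recovered = garbled.encode('utf-16-le').decode('ascii')
print(repr(recovered))  # '<html><head>\r\n'
```

For real messages the final decode would likely need a more permissive codec than ascii, such as utf-8 or cp1252, depending on what the sender used.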

Removing non-ascii characters on utf-16 (Python)

I have some code I'm using to decrypt a string. The string is originally encrypted by .NET code, and I'm able to make it all work. However, the string coming into Python has some extra characters in it, and it has to be decoded as UTF-16.
Here is the code for the decryption portion. The original string that I encrypted was "test2", which is what the text variable in the code below holds in encrypted form.
import Crypto.Cipher.AES
import base64, sys
password = base64.b64decode('PSCIQGfoZidjEuWtJAdn1JGYzKDonk9YblI0uv96O8s=')
salt = base64.b64decode('ehjtnMiGhNhoxRuUzfBOXw==')
aes = Crypto.Cipher.AES.new(password, Crypto.Cipher.AES.MODE_CBC, salt)
text = base64.b64decode('TzQaUOYQYM/Nq9f/pY6yaw==')
print(aes.decrypt(text).decode('utf-16'))
text1 = aes.decrypt(text).decode('utf-16')
print(text1)
My issue is that when I decrypt and print the result, it is "test2ЄЄ" instead of the expected "test2".
If I save the same decrypted value into a variable, it gets decoded incorrectly as "틊첃陋ភ滑毾穬ヸ".
My goal is to find a way to:
strip the non-ASCII characters from the end of the test2 value
store the result in a variable holding the correct string/text value
Any help or suggestions appreciated. Thanks.
In python 2, you can use str.decode, like this:
string.decode('ascii', 'ignore')
The codec is ascii, and ignore specifies that anything that cannot be converted is to be dropped.
In Python 3, str objects are already Unicode text and have no decode method, so you need to encode first and then decode:
string.encode('ascii', 'ignore').decode()
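The sketch below runs the Python 3 form on the value from the question and then shows an alternative that works on the bytes before decoding. The alternative is not from the answer above: it assumes the two trailing characters are PKCS#7 padding bytes (0x04 repeated four times), which is an inference from the printed output. strip_pkcs7 is a hypothetical helper, and decrypted is a hand-built stand-in for the cipher output.

```python
text = 'test2\u0404\u0404'  # 'test2ЄЄ', the value printed in the question

# The answer's approach: drop everything that is not ASCII.
cleaned = text.encode('ascii', 'ignore').decode('ascii')
print(cleaned)

def strip_pkcs7(data):
    """Remove PKCS#7 padding: the last byte says how many bytes to drop."""
    pad = data[-1]
    if pad < 1 or pad > 16 or data[-pad:] != bytes([pad]) * pad:
        raise ValueError("not PKCS#7 padded")
    return data[:-pad]

# Stand-in for the decrypted block (an assumption, not real cipher output):
# a UTF-16 byte order mark, the text, then four padding bytes.
decrypted = b'\xff\xfe' + 'test2'.encode('utf-16-le') + b'\x04' * 4
print(strip_pkcs7(decrypted).decode('utf-16'))
```

It may also matter that the question calls aes.decrypt twice on the same CBC cipher object. The second call presumably continues from the state left by the first, which would explain the different value stored in text1; decrypting once and reusing the resulting bytes avoids that.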

Unicode strings to byte strings without the addition of backslashes

I'm learning python by doing the python challenge using python3.3 and I'm on question eight. There's a comment in the markup providing you with two bz2-compressed unicode strings outputting byte strings, one for username and one for password. There's also a link where you need the decompressed credentials to enter.
One way to easily solve this is just to manually copy the strings and assign it to two variables as byte strings and then just use the bz2 library to decompress it:
>>> import bz2
>>> un = b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> print(bz2.decompress(un).decode('utf-8'))
huge
But that's not for me, since I want to get the answer just by running my Python file.
My code looks like this:
>>> import bz2, re, requests
>>> url = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')
>>> un = re.findall(r'un: \'(.*)\'', url.text)[0]
>>> correct = b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> print(un, un is correct, sep='\n')
b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'
False
The problem is that when the Unicode string is converted to a byte string, the escaping backslashes get doubled, so the bz2 module cannot read it. I have tried everything I know and everything that came up when I searched.
How do I convert it from Unicode to bytes without it being changed?
Here is a solution (Python 2):
import urllib
import bz2
import re
def decode(line):
    # Grab the quoted literal, prefix it with b and evaluate it as a bytes literal
    out = re.search(r"\'(.*?)\'", ''.join(line)).group()
    out = eval("b%s" % out)
    return bz2.decompress(out)
# Read the lines that contain the encoded message
page = urllib.urlopen('http://www.pythonchallenge.com/pc/def/integrity.html').readlines()[20:22]
print "Click on the bee and insert: "
User_Name = decode(page[0])
print "User Name is: " + User_Name
Password = decode(page[1])
print "Password is: " + Password
The backslashes are present in the HTML source, so it's not surprising that the requests module preserves them. I don't have requests installed in my Python 3 environment, so I haven't been able to replicate your situation exactly, but it looks to me like, if you also capture the surrounding ' characters, you can use ast.literal_eval to parse the character sequence into a bytes object:
>>> test
"'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'"
>>> import ast
>>> res = ast.literal_eval("b%s" % test)
>>> import bz2
>>> len(bz2.decompress(res))
4
There are probably other ways, but why not use Python's built in knowledge that the byte sequence b'\\xaf' can be parsed into a bytes array?
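Putting the regex capture and ast.literal_eval together might look like the sketch below. To keep it runnable offline, page_text is a stand-in for url.text and holds only the relevant line; the pattern differs from the question's in that it captures the quotes as well.

```python
import ast
import bz2
import re

# Stand-in for url.text: the page source holds the escapes as literal text.
page_text = r"un: 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'"

# Capture the quotes too, so the match is already the body of a literal.
quoted = re.search(r"un: ('.*')", page_text).group(1)

# Prefix b and let Python parse the escapes into a real bytes object.
raw = ast.literal_eval("b" + quoted)
print(bz2.decompress(raw).decode('utf-8'))
```

Unlike eval, ast.literal_eval only accepts literals, so text scraped from a web page cannot execute arbitrary code here.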

How do I get rid of the "u" from a decoded JSON object?

I have a dictionary of dictionaries in Python:
d = {"a11y_firesafety.html":{"lang:hi": {"div1": "http://a11y.in/a11y/idea/a11y_firesafety.html:hi"}, "lang:kn": {"div1": "http://a11y.in/a11ypi/idea/a11y_firesafety.html:kn"}}}
I have this in a JSON file and I encoded it using json.dumps(). Now when I decode it using json.loads() in Python I get a result like this:
temp = {u'a11y_firesafety.html': {u'lang:hi': {u'div1': u'http://a11y.in/a11ypi/idea/a11y_firesafety.html:hi'}, u'lang:kn': {u'div1': u'http://a11y.in/a11ypi/idea/a11y_firesafety.html:kn'}}}
My problem is with the "u" in front of every item in temp (my dictionary of dictionaries), which marks the strings as Unicode. How do I get rid of that "u"?
Why do you care about the 'u' characters? They're just a visual indicator; unless you're actually using the result of str(temp) in your code, they have no effect on your code. For example:
>>> test = u"abcd"
>>> test == "abcd"
True
If they do matter for some reason, and you don't care about consequences like not being able to use this code in an international setting, then you could pass in a custom object_hook (see the json docs) to produce dictionaries with string contents rather than unicode.
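To make the "visual indicator" point concrete: the prefix belongs to Python 2's repr of the dictionary, not to the data. Below is a small sketch in Python 3 syntax (where the prefix no longer appears at all), using a shortened copy of the question's data; printing a single value or re-serializing with json.dumps shows no prefix in either version.

```python
import json

encoded = '{"a11y_firesafety.html": {"lang:hi": {"div1": "http://a11y.in/a11ypi/idea/a11y_firesafety.html:hi"}}}'
temp = json.loads(encoded)

# Lookups and comparisons work on the value, however repr displays it.
print(temp["a11y_firesafety.html"]["lang:hi"]["div1"])

# For display, serialize back to JSON instead of printing the dict's repr.
print(json.dumps(temp, indent=2))
```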
You could also use this:
import fileinput
fout = open("out.txt", 'a')
for line in fileinput.input("in.txt"):
    # Strip the u prefix in front of both quote styles
    cleaned = line.replace("u\"", "\"").replace("u\'", "\'")
    print >> fout, cleaned
fout.close()
The printed representation of a decoded JSON response typically shows the prefix in two forms, u' and u".
This snippet (Python 2) gets rid of both of them. It may not be required, since the prefix doesn't hinder any logical processing, as mentioned in the previous answer.
There is no "unicode" encoding, since unicode is a different data type and I don't really see any reason unicode would be a problem, since you may always convert it to string doing e.g. foo.encode('utf-8').
However, if you really want to have string objects upfront you should probably create your own decoder class and use it while decoding JSON.

Unescape Python Strings From HTTP

I've got a string from an HTTP header, but it's been escaped. What function can I use to unescape it?
myemail%40gmail.com -> myemail@gmail.com
Would urllib.unquote() be the way to go?
I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail@gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used, for example, for query strings in HTTP URLs, where the space character ( ) is traditionally encoded as a plus character (+), and the + itself is percent-encoded as %2B.
In addition to these, there is unquote_to_bytes, which converts the given encoded string to bytes and can be used when the encoding is not known or the encoded data is binary. However, there is no unquote_plus_to_bytes. If you need it, you can write:
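from urllib.parse import unquote_to_bytes
def unquote_plus_to_bytes(s):
    # Turn plus signs into spaces first, then percent-decode to bytes
    if isinstance(s, bytes):
        s = s.replace(b'+', b' ')
    else:
        s = s.replace('+', ' ')
    return unquote_to_bytes(s)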
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.
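A short Python 3 sketch of how the three functions differ; the input strings other than the question's e-mail address are made up for illustration.

```python
from urllib.parse import unquote, unquote_plus, unquote_to_bytes

# Percent-decoding only: %40 becomes @, plus signs are left alone.
print(unquote("myemail%40gmail.com"))  # myemail@gmail.com
print(unquote("a+b%2Bc"))              # a+b+c

# Form/query-string decoding: + becomes a space, %2B becomes +.
print(unquote_plus("a+b%2Bc"))         # a b+c

# Raw bytes, for data that is not text in a known encoding.
print(unquote_to_bytes("%ff%40"))      # b'\xff@'
```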
Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)
