Python gpgme non-ascii text handling

Python gpgme non-ascii text handling - python

I am trying to encrypt-decrypt a text via GPG using pygpgme, while it works for western characters decryption fails on a Russian text. I use GPG suite on Mac to decrypt e-mail.
Here's the code I use to produce encrypted e-mail body, note that I tried to encode message in Unicode but it didn't make any difference. I use Python 2.7.
Please help, I must say I am new to Python.
ctx = gpgme.Context()
ctx.armor = True
key = ctx.get_key('0B26AE38098')
payload = 'Просто тест'
#plain = BytesIO(payload.encode('utf-8'))
plain = BytesIO(payload)
cipher = BytesIO()
ctx.encrypt([key], gpgme.ENCRYPT_ALWAYS_TRUST, plain, cipher)

There are multiple problems here. You really should read the Unicode HOWTO, but I'll try to explain.
payload = 'Просто тест'
Python 2.x source code is, by default, Latin-1. But your source clearly isn't Latin-1, because Latin-1 doesn't even have those characters. What happens if you write Просто тест in one program (like a text editor) as UTF-8, then read it in another program (like Python) as Latin-1? You get ÐÑÐ¾ÑÑÐ¾ ÑÐµÑÑ. So, what you're doing is creating a string full of nonsense. If you're using ISO-8859-5 rather than UTF-8, it'll be different nonsense, but still nonsense
So, first and foremost, you need to find out what encoding you did use in your text editor. It's probably UTF-8, if you're on a Mac, but don't just guess; find out.
Second, you have to tell Python what encoding you used. You do that by using an encoding declaration. For example, if your text editor uses UTF-8, add this line to the top of your code:
# coding=utf-8
One you fix that, payload will be a byte string, encoded in whatever encoding your text editor uses. But you can't encode already-encoded byte strings, only Unicode strings.
Python 2.x will let you call encode on them anyway, but it's not very useful—what it will do is first decode the string to Unicode using sys.getdefaultencoding, so it can then encode that. That's unlikely to be what you want.
The right way to fix this is to make payload a Unicode string in the first place, by using a Unicode literal. Like this:
payload = u'Просто тест'
Now, finally, you can actually encode the payload to UTF-8, which you did perfectly correctly in your first attempt:
plain = BytesIO(payload.encode('utf-8'))
Finally, you're encrypting UTF-8 plain text with GPG. When you decrypt it on the other side, make sure to decode it as UTF-8 there as well, or again you'll probably see nonsense.

Related

Email quoted printable encoding confusion

I'm constructing MIME encoded emails with Python and I'm getting a difference with the same email that is MIME encoded by Amazon's SES.
I'm encoding using utf-8 and quoted-printable.
For the character "å" (that's the letter "a" with a little circle on top), my encoding produces
=E5
and the other encoding produces
=C3=A5
They both look ok in my gmail, but I find it weird that the encoding is different. Is one of these right and the other wrong in any way?
Below is my Python code in case that helps.
====
cs = charset.Charset('utf-8')
cs.header_encoding = charset.QP
cs.body_encoding = charset.QP
# See https://stackoverflow.com/a/16792713/136598
mt = mime.text.MIMEText(None, subtype)
mt.set_charset(cs)
mt.replace_header("content-transfer-encoding", "quoted-printable")
mt.set_payload(mt._charset.body_encode(payload))

Ok, I was able to figure this out, thanks to Artur's comment.
The utf-8 encoding of the character is two bytes and not one so you should expect to see two quoted printable encodings and not one so the AWS SES encoding is correct (not surprisingly).
I was sending unicode text and not utf-8 which causes only one quoted printable character. It turns out that it worked because gmail supports unicode.
For the Python code in my question, I need to manually encode the text as utf-8. I was thinking that MIMEText would do that for me but it does not.

Python UnicodeEncodeError when Outputting Parsed Data from a Webpage

I have a program that parses webpages and then writes the data out somewhere else. When I am writing the data, I get
"UnicodeEncodeError: 'ascii' codec can't encode characters in position
19-21: ordinal not in range(128)"
I am gathering the data using lxml.
name = apiTree.xpath("//boardgames/boardgame/name[#primary='true']")[0].text
worksheet.goog["Name"].append(name)
Upon reading, http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm, it suggests I record all of my variables in unicode. This means I need to know what encoding the site is using.
My final line that actually writes the data out somewhere is:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], (str(worksheet.goog[value][row])).encode('ascii', 'ignore'))
How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?

You error is because of:
str(worksheet.goog[value][row])
Calling str you are trying to encode the ascii, what you should be doing is encoding to utf-8:
worksheet.goog[value][row].encode("utf-8")
As far as How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out? goes, you can't there is no ascii latin ă etc... unless you want to get the the closest ascii equivalent using something like Unidecode.

I think I may have figured my own problem out.
apiTree.xpath("//boardgames/boardgame/name[#primary='true']")[0].text
Actually defaults to unicode. So what I did was change this line to:
name = (apiTree.xpath("//boardgames/boardgame/name[#primary='true']")[0].text).encode('ascii', errors='ignore')
And I just output without changing anything:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], worksheet.goog[value][row])
Due to the nature of the data, ASCII only is mostly fine. Although, I may be able to use UTF-8 and catch some extra characters...but this is not relevant to the question.
:)

Python string encoding issue

I am using the Amazon MWS API to get the sales report for my store and then save that report in a table in the database. Unfortunately I am getting an encoding error when I try to encode the information as Unicode. After looking through the report (exactly as amazon sent it) I saw this string which is the location of the buyer:
'S�o Paulo'
so I tried to encode it like so:
encodeme = 'S�o Paulo'
encodeme.encode('utf-8)
but got the following error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)
The whole reason why I am trying to encode it is because as soon as Django sees the � character it throws a warning and cuts off the string, meaning that the location is saved as S instead of
São Paulo
Any help is appreciated.

It looks like you are having some kind of encoding problem.
First, you should be very certain what encoding Amazon is using in the report body they send you. Is it UTF-8? Is it ISO 8859-1? Something else?
Unfortunately the Amazon MWS Reports API documentation, especially their API Reference, is not very forthcoming about what encoding they use. They only encoding I see them mention is UTF-8, so that should be your first guess. The GetReport API documentation (p.36-37) describes the response element Report as being type xs:string, but I don't see where they define that data type. Maybe they mean XML Schema's string datatype.
So, I suggest you save the byte sequence you are receiving as your report body from Amazon in a file, with zero transformations. Be aware that your code which calls AWS might be modifying the report body string inadvertently. Examine the non-ASCII bytes in that file with a binary editor. Is the "São" of "São" stored as S\xC3\xA3o, indicating UTF-8 encoding? Or is it stored as S\xE3o, indicating ISO 8859-1 encoding?
I'm guessing that you receive your report as a flat file. The Amazon AWS documentation says that you can request reports be delivered to you as XML. This would have the advantage of giving you a reply with an explicit encoding declaration.
Once you know the encoding of the report body, you now need to handle it properly. You imply that you are using the Django framework and Python language code to receive the report from Amazon AWS.
One thing to get very clear (as Skirmantas also explains):
Unicode strings hold characters. Byte strings hold bytes (octets).
Encoding converts a Unicode string into a byte string.
Decoding converts a byte string into a Unicode string.
The string you get from Amazon AWS is a byte string. You need to decode it to get a Unicode string. But your code fragment, encodeme = 'São Paulo', gives you a byte string. encodeme.encode('utf-8) performs an encode() on the byte string, which isn't what you want. (The missing closing quote on 'utf-8 doesn't help.)
Try this example code:
>>> reportbody = 'S\xc3\xa3o Paulo' # UTF-8 encoded byte string
>>> reportbody.decode('utf-8') # returns a Unicode string, u'...'
u'S\xe3o Paulo'
You might find some background reading helpful. I agree with Hoxieboy that you should take the time to read Python's Unicode HOWTO. Also check out the top answers to What do I need to know about Unicode?.

I think you have to decode it using a correct encoding rather than encode it to utf-8. Try
s = s.decode('utf-8')
However you need to know which encoding to use. Input can come in other encodings that utf-8.
The error which you received UnicodeDecodeError means that your object is not unicode, it is a bytestring. When you do bytestring.encode, the string firstly is decoded into unicode object with default encoding (ascii) and only then it is encoded with utf-8.
I'll try to explain the difference of unicode string and utf-8 bytestring in python.
unicode is a python's datatype which represents a unicode string. You use unicode for most of string operations in your program. Python probably uses utf-8 in its internals though it could also be utf-16 and this doesn't matter for you.
bytestring is a binary safe string. It can be of any encoding. When you receive data, for example you open a file, you get a bytestring and in most cases you will want to decode it to unicode. When you write to file you have to encode unicode objects into bytestrings. Sometimes decoding/encoding is done for you by a framework or library. Not always however framework can do this because not always framework can known which encoding to use.
utf-8 is an encoding which can correctly represent any unicode string as a bytestring. However you can't decode any kind of bytestring with utf-8 into unicode. You need to know what encoding is used in the bytestring to decode it.

Official Python unicode documentation
You might try that webpage if you haven't already and see if you can get the answer you're looking for ;)

UnicodeEncodeError when writing to a file

I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).
I can use "print" to display them, but when I use file.write() I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)
How can I parse this?

If I type 'python unicode' into Google, I get about 14 million results; the first is the official doc which describes the whole situation in excruciating detail; and the fourth is a more practical overview that will pretty much spoon-feed you an answer, and also make sure you understand what's going on.
You really do need to read and understand these sorts of overviews, however long they seem. There really isn't any getting around it. Text is hard. There is no such thing as "plain text", there hasn't been a reasonable facsimile for years, and there never really was, although we spent decades pretending there was. But Unicode is at least a standard.
You also should read http://www.joelonsoftware.com/articles/Unicode.html .

This error occurs when you pass a Unicode string containing non-English characters (Unicode characters beyond 128) to something that expects an ASCII bytestring. The default encoding for a Python bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert Unicode characters beyond 128 produces the error.
The unicode()
unicode(string[, encoding, errors])
constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.
The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors
for example
s = u'La Pe\xf1a'
print s.encode('latin-1')
or
write(s.encode('latin-1'))
will encode using latin-1

The answer to your question is "use codecs". The appeded code also shows some gettext magic, FWIW. http://wiki.wxpython.org/Internationalization
import codecs
import gettext
localedir = './locale'
langid = wx.LANGUAGE_DEFAULT # use OS default; or use LANGUAGE_JAPANESE, etc.
domain = "MyApp"
mylocale = wx.Locale(langid)
mylocale.AddCatalogLookupPathPrefix(localedir)
mylocale.AddCatalog(domain)
translater = gettext.translation(domain, localedir,
[mylocale.GetCanonicalName()], fallback = True)
translater.install(unicode = True)
# translater.install() installs the gettext _() translater function into our namespace...
msg = _("A message that gettext will translate, probably putting Unicode in here")
# use codecs.open() to convert Unicode strings to UTF8
Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')
Logfile.write(msg + '\n')
Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather burried).
So ... HTH...
GaJ

Python Encoding issue

Why am I getting this issue? and how do I resolve it?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 24: unexpected code byte
Thank you

Somewhere, perhaps subtly, you are asking Python to turn a stream of bytes into a "string" of characters.
Don't think of a string as "bytes". A string is a list of numbers, each number having an agreed meaning in Unicode. (#65 = Latin Capital A. #19968 = Chinese Character "One"/"First") .
There are many methods of encoding a list of Unicode entities into a stream of bytes. Python is assuming your stream of bytes is the result of a particular such method, called "UTF-8".
However, your stream of bytes has data that does not correspond to that method. Thus the error is raised.
You need to figure out the encoding of the stream of bytes, and tell Python that encoding.
It's important to know if you're using Python 2 or 3, and the code leading up to this exception to see where your bytes came from and what the appropriate way to deal with them is.
If it's from reading a file, you can explicity deal with the bytes read. But you must be sure of the file encoding.
If it's from a string that is part of your source code, then Python is assuming the "wrong thing" about your source files... perhaps $LC_ALL or $LANG needs to be set. This is a good time to firmly understand the concept of encoding, and how text editors choose an encoding to write, and what is standard for your language and operating system.

In addition to what Joe said, chardet is a useful tool to detect encoding of the source data.

Somewhere you have a plain string encoded as "Windows-1252" (or "cp1252") containing a "RIGHT SINGLE QUOTATION MARK" (’) instead of an APOSTROPHE ('). This could come from a file you read, or even in a Python source file of yours; you could be running Python 2.x and have a # -*- coding: utf8 -*- line somewhere near the script's beginning, or you could be running Python 3.x.
You don't give enough data; however, somewhere you have a cp1252-encoded string, which you try (explicitly or implicitly) to decode to unicode as utf-8. This won't work.
Give us more info, and we'll try again to help you.
Joe Koberg's answer reminded me of an older answer of mine, which some people have found helpful: Python UnicodeDecodeError - Am I misunderstanding encode?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.