Python 2: Comparing a unicode and a str

This topic comes up on StackOverflow already, but I didn't find any satisfying solution:
I have some strings in Unicode coming from a server, and some hardcoded strings in the code that I'd like to match against them. I do understand why I can't just use ==, but I haven't succeeded in converting them properly (I don't care whether I go str -> unicode or unicode -> str).
I tried encode and decode, but they didn't give the expected result.
Here is what I receive...
fromServer = {unicode} u'Führerschein nötig'
fromCode = {str} 'Führerschein nötig'
(as you can see, it is German!)
How can I make them compare equal in Python 2?

First make sure you declare the encoding of your Python source file at the top of the file. E.g. if your file is encoded as latin-1:
# -*- coding: latin-1 -*-
And second, always store text as Unicode strings:
fromCode = u'Führerschein nötig'
If you get bytes from somewhere, convert them to Unicode with str.decode before working with the text. For text files, specify the encoding when opening the file, e.g.:

import codecs

# use codecs.open to open a text file with a known encoding
f = codecs.open('unicode.rst', encoding='utf-8')
Code which compares byte strings with Unicode strings will often fail unpredictably, depending on system settings or on whatever encoding happens to be used for a text file. Don't rely on it; always make sure you compare either two unicode strings or two byte strings.
Python 3 changed this behaviour: it will not try to convert between the types. 'a' and b'a' are objects of different types, and comparing them always returns False.
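As a minimal sketch of both the pitfall and the fix (assuming this snippet is saved as UTF-8 and run under Python 2.7; the variable names mirror the question):

# -*- coding: utf-8 -*-
fromServer = u'Führerschein nötig'               # unicode, as received from the server
fromCode = 'Führerschein nötig'                  # byte string: raw UTF-8 bytes
print(fromServer == fromCode)                    # False, plus a UnicodeWarning
print(fromServer == fromCode.decode('utf-8'))    # True: unicode compared to unicode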

Tested on 2.7. For German umlauts, latin-1 works (assuming the source file itself is saved as latin-1):

if 'Führerschein nötig'.decode('latin-1') == u'Führerschein nötig':
    print('yes....')

yes....

Related

Is there a way to specify which Unicode format is used in unicode encoding in python 2.7?

So I'd like to encode some values as Unicode in my Python 2.7 script. I'd like to know if I can specify which encoding to use, e.g. UTF-8 vs. UTF-32. Apart from that, are there any limitations as to which encodings are supported in Python 2.7, and how is the default encoding determined?
So, first things first: you should be using Python 3, not Python 2.
The handling of text and unicode is the major difference between the two versions of the language, the real reason they had to make incompatible changes, and it is much, much more straightforward in Python 3.
This means that to talk about unicode in Python 2 you have to understand certain things. Unicode is used to represent text: characters, regardless of the underlying representation those characters have.
In Python 2 programs, all text typed in the program itself has to be written as "u"-prefixed strings, like u"..." or u'...'; otherwise the strings are considered "byte strings", just like the ones in C code. (Alternatively, one can place from __future__ import unicode_literals in the first or second line of the file, so this is done automatically, as the sketch below shows.)
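A minimal sketch of that __future__ import on Python 2 (the variable name is just for illustration):

from __future__ import unicode_literals

s = "text"       # a unicode object now, with no u prefix needed
print(type(s))   # <type 'unicode'> on Python 2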
Otherwise, all data read into the program, whether from text files, database connections, or inbound HTTP requests, will usually arrive as byte strings in Python 2, and has to be explicitly converted to text strings (that is, "unicode objects" in Python 2 speak) before being processed. This is done by calling the byte string's .decode method, passing as its first parameter the name of the encoding used for those bytes. That is, if you have data you have read from a UTF-8 encoded file, it can be decoded to text by doing:
data = data.decode("utf-8") # and so on for other encodings.
Also, if you type any non-ASCII character in the source code of a Python 2 file, regardless of whether it is inside a string or, for example, inside a comment, you have to declare the file encoding in the first line of the file.
That is done with a Python comment which the language parser treats in a special way; the first line of code should contain:
# encoding: utf-8
(Of course, you should write the encoding actually used by your editor to store the file. Also, some variants of this marker are allowed, such as writing "coding" instead of "encoding", the ":" being optional, and so on.)
So: what I've described in the previous five paragraphs takes place automatically in Python 3. But if you have followed along so far, you now have a program holding text to be handled. As it happens, your question does not mention how you are inputting the text you want to encode in different ways.
So, just as you did explicitly convert the input bytes to in memory unicode strings, now you can use the .encode method to convert the text back to whatever text-encoding you want.
If you have some text that you want to write to a file encoded as UTF-32 little-endian, you do:

# binary mode, since UTF-32 data contains null bytes that text mode could mangle
with open("myfile.txt", "wb") as file_:
    file_.write(data.encode("utf-32-le"))
The valid text codecs are listed, as per Eran's answer at:
https://docs.python.org/2/library/codecs.html#standard-encodings
Now, if you do some tests with this and succeed, you'd better do two things before proceeding any further:
Switch to Python 3. Python 2 is really obsolete at this point. Check whether Python 3 is already installed on your system by typing "python3" instead of just "python"; if it is not, just install it. It can live side by side with Python 2.
Read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" to get a grasp of what really goes on when we talk about unicode and encodings. (The author is a co-founder of Stack Overflow itself, and the article is from 2003.)
In python 2, strings are by default ASCII. You can decode them and re-encode them.
supported encodings can be found here: https://docs.python.org/2/library/codecs.html#standard-encodings
Here's an example:
a = "my string" # a is ASCII encoded bytes
b = u"my string" # b is unicode, not encoded
c = a.decode() # c is unicode, not encoded, by default decoding ASCII, you can specify otherwise as an argument
d = c.encode('utf-32') # d is utf-32 encoded bytes
print type(a) # output: <type 'str'>
print type(b) # output: <type 'unicode'>
print type(c) # output: <type 'unicode'>
print type(d) # output: <type 'str'>
Note 1: In Python 3 things are somewhat different.
Note 2: In order to write non-ASCII literals in your script (that is, if you want to write a = "☂" as part of your code, as opposed to just having a variable that contains data you got from somewhere), you have to declare the encoding at the top of the file; more info here. And in Python 2 only a small subset of unicode characters is accepted in literal code (while in memory, of course, you are not limited).
Note 3: Of course, while the unicode type is, from your point of view, not encoded, internally Python keeps it encoded (as UCS-2 or UCS-4, depending on how the interpreter was built). But that's an internal detail that shouldn't affect your code, generally speaking.
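A quick way to check which internal build you have on Python 2 (a sketch using only the standard library):

import sys
print(sys.maxunicode)   # 65535 on narrow (UCS-2) builds, 1114111 on wide (UCS-4)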

Is Python 2.7 actually converting my string to UTF-8 or is the definition of isalnum() different across different machines?

My sample.txt:
é Roméo et Juliette vécu heureux chaque après
My program:
#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
with open("test4", "r") as f:
s = f.read()
print(s)
print(isinstance(s, unicode))
print(s[0].isalnum())
My output:
é Roméo et Juliette vécu heureux chaque après
False
False
The answers to Python isalpha() and scandics and How do I check if a string is unicode or ascii? led me to believe that both statements should print True.
My hypotheses:
Emacs is using "iso-latin-1" as the file encoding, which is mucking things up
isalnum() depends on something other than encoding
Line 2 isn't working
My biggest worry is #2. I do not really care about the result of isalnum(); I just want the result to be consistent across different machines and people. Worst case, I can just roll my own isalnum(), but I am curious why I am seeing this behaviour in the first place.
Also, I want to be sure my program understand UTF-8 encoded documents across different machines as well.
Any ideas of what is going on?
Strings (type str) in Python 2.7 are bytes. When you read text from a file, you get bytes, with possibly the line endings changed. Therefore, s is not an instance of type unicode.
On a str, tests like isalnum() assume that the string is ASCII text. ASCII is defined only for codes 0 to 127. Python has no idea, and can have no idea, what characters are represented by values outside this range, because the encoding is not known. é is not an ASCII character and therefore is not considered alphanumeric.
What you want to do is decode the byte string you've read to a Unicode string:
u = s.decode("utf8")
(assuming the string is written to the file in UTF8 encoding; if that doesn't work, you can try latin1 or cp437... the latter is what my terminal gives me on Windows 10)
When you do that, u[0].isalnum() is True and isinstance(u, unicode) is also True.
Python 3 works a little differently. You have to tell Python what encoding to use when you open the file. Then it translates the strings to Unicode from that encoding as you read them. All strings in Python 3 are Unicode; there's a separate type, bytes, for byte strings. You probably ought to use Python 3 for a lot of different reasons, but its more coherent handling of text is certainly one of those reasons.
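For completeness, a minimal sketch that gets the same behaviour on Python 2.7 by letting io.open decode at read time (assuming the file really is UTF-8):

import io

with io.open("sample.txt", "r", encoding="utf-8") as f:
    s = f.read()

print(isinstance(s, unicode))  # True: io.open in text mode returns unicode
print(s[0].isalnum())          # True: u'é' counts as alphanumeric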

Running Python 2.7 Code With Unicode Characters in Source

I want to run a Python source file that contains unicode (utf-8) characters in the source. I am aware of the fact that this can be done by adding the comment # -*- coding: utf-8 -*- in the beginning. However, I wish to do it without using this method.
One way I could think of was writing the unicode strings in escaped form. For example,
Edit: Updated Source. Added Unicode comments.
# Printing naïve and 男孩
def fxn():
    print 'naïve'
    print '男孩'

fxn()
becomes
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'

fxn()
I have two questions regarding the above method.
How do I convert the first code snippet, using Python, into the equivalent that follows it? That is, only the unicode sequences should be written in escaped form.
Is the method foolproof, assuming only unicode (UTF-8) characters are used? Is there anything that can go wrong?
Your idea is generally sound, but it will break in Python 3 and will cause headaches when you are manipulating and writing your strings in Python 2.
It's a good idea to use Unicode strings, not regular strings when dealing with non-ASCII.
Instead, you can encode your characters as Unicode (not UTF-8) escape sequences in Unicode strings.
u'na\xefve'
u'\u7537\u5b69'
note the u prefix
Your code is now encoding agnostic.
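One way (an assumption on my part, not part of the original answer) to produce such escapes automatically is the standard unicode_escape codec:

s = u'na\xefve \u7537\u5b69'
print(s.encode('unicode_escape'))   # prints: na\xefve \u7537\u5b69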
If you only use byte strings, and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data. There is no need for the coding statement (although it is REALLY strange that you don't want to use it... it's just a comment). The coding statement lets Python know the encoding of the source file, so it can decode Unicode strings (u'xxxxx') correctly. If you have no Unicode strings, it doesn't matter.
For your questions, no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters in your byte strings.
FYI, that won't work in Python 3, because byte strings cannot contain non-ASCII characters in that version.
That said, here's some code that will convert your example as requested. It reads the source assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters. It passes them through a conversion function to generate the replacement. This should be safe, since non-ASCII characters can only appear in string literals and comments in Python 2. Python 3, however, allows non-ASCII in variable names, so this wouldn't work there.
import io
import re

def escape(m):
    # encode the matched character as UTF-8 and emit a \xNN escape per byte
    char = m.group(0).encode('utf8')
    return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)

with io.open('sample.py', encoding='utf8') as f:
    content = f.read()

new_content = re.sub(r'[^\x00-\x7f]', escape, content)

with io.open('sample_new.py', 'w', encoding='utf8') as f:
    f.write(new_content)
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'

fxn()
Question 1:
Try to use:

print u'naïve'
print u'长者'

Question 2:
If you type the sentences with a keyboard and Chinese input software, everything should be OK. But if you copy and paste sentences from some web pages, you should consider other encodings such as GBK, GB2312 and GB18030.
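A small illustration of why the assumed encoding matters; the example bytes here are the GBK encoding of 你好, not data from the question:

data = '\xc4\xe3\xba\xc3'          # 你好 encoded as GBK
print(data.decode('gbk'))          # u'\u4f60\u597d' -- correct text
print(data.decode('latin-1'))      # u'\xc4\xe3\xba\xc3' -- mojibake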
This snippet of Python 3 code should convert your program correctly to work in Python 2:

def convertchar(char): # converts individual characters
    if 32 <= ord(char) <= 126 or char in "\n\t": return char # keep printable ASCII and whitespace as-is
    h = hex(ord(char))[2:]
    if ord(char) < 256: # unprintable ASCII or Latin-1 range: \xNN, zero-padded to 2 digits
        return "\\x" + h.zfill(2)
    elif ord(char) < 65536: # short unicode escape: \uNNNN, zero-padded to 4 digits
        return "\\u" + h.zfill(4)
    else: # long unicode escape: \UNNNNNNNN, zero-padded to 8 digits
        return "\\U" + h.zfill(8)

def converttext(text): # converts a chunk of text
    newtext = ""
    for char in text:
        newtext += convertchar(char)
    return newtext

def convertfile(oldfilename, newfilename): # converts a file
    oldfile = open(oldfilename, "r")
    oldtext = oldfile.read()
    oldfile.close()
    newtext = converttext(oldtext)
    newfile = open(newfilename, "w")
    newfile.write(newtext)
    newfile.close()

convertfile("FILE_TO_BE_CONVERTED", "FILE_TO_STORE_OUTPUT")
First a simple remark: as you are using byte strings in a Python 2 script, the # -*- coding: utf-8 -*- declaration has simply no effect. It only helps to convert a source byte string to a unicode string if you had written:

# -*- coding: utf-8 -*-
...
utxt = u'naïve'  # source code contains the byte string 'na\xc3\xafve'
                 # but utxt must become the unicode string u'na\xefve'

Beyond that, it might simply be interpreted by clever editors to automatically use a UTF-8 charset.
Now for the actual question. Unfortunately, what you are asking for is not really trivial: identifying what is in a comment and what is in a string in a source file requires a Python parser... and AFAIK, if you use the parser from the ast module you will lose your comments, except for docstrings.
But in Python 2, non-ASCII characters are only allowed in comments and literal strings! So you can safely assume that, if the source file is a correct Python 2 script containing no literal unicode strings(*), any non-ASCII character can safely be transformed into its Python representation.
A possible Python function reading a raw source file from a file object and writing it after encoding in another file object could be:
def src_encode(infile, outfile):
    while True:
        c = infile.read(1)
        if len(c) < 1: break          # stop at end of file
        if ord(c) > 127:              # transform high characters
            c = "\\x{:02x}".format(ord(c))
        outfile.write(c)
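A hypothetical usage sketch (the file names are placeholders, not from the question):

with open('orig.py', 'rb') as inf, open('converted.py', 'wb') as outf:
    src_encode(inf, outf)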
A nice property is that it works whatever encoding you use, provided the source file is acceptable to a Python interpreter and does not contain high characters in unicode literals(*), and the converted file will behave exactly the same as the original one...
(*) A problem will arise if you use unicode literals in an encoding other than Latin-1, because the above function behaves as if the file contained the declaration # -*- coding: Latin1 -*-: u'é' will be translated correctly as u'\xe9' if the original encoding is latin-1, but as u'\xc3\xa9' (not what is expected...) if the original encoding is utf-8, and I cannot imagine a way to process both literal byte strings and literal unicode strings correctly without fully parsing the source file...

How can I print characters like ♟ in python

I am trying to print a clean chess board in Python 2.7 that uses unicode characters such as ♟.
I have tried simply replacing a value in a string ("g".replace("g", "♟")), but it is changed to '\xe2\x80\xa6'. If I put the character into an online ASCII converter, it returns "226 153 159".
♟ is a unicode character. In Python 2, str holds ASCII strings or binary data, while unicode holds unicode strings. When you write "♟" you get a binary-encoded version of the unicode string. What that encoding is depends on the editor/console you used to type it in. It's common (and I think preferred) to use UTF-8 to encode strings, but you may find that Windows editors favor little-endian UTF-16 strings.
Either way, you want to write your strings as unicode as much as possible. You can do some mixing and matching between str and unicode, but make sure anything outside of the ASCII code set is unicode from the beginning.
Python can take an encoding hint at the front of the file. So, assuming you use a UTF-8 editor, you can do
#!/usr/bin/env python
# -*- coding: utf-8 -*-
chess_piece = u"♟"
print u"g".replace(u"g", chess_piece)

What does the value mean when assigning non-ascii character to a python built-in string?

I have recently been studying encodings, and I am confused about the following.
Say I have:
a = "哈" ## whatever non-ascii char is fine
a[0] ## = "\xe5"
a[1] ## = "\x93"
a[2] ## = "\x88"
len(a) would be 3, and the values would be "\xe5", "\x93", and "\x88".
I understand that if I do:
a.decode("utf-8") ## = u"\u54c8"
It will become a unicode string, and the code point would be "\u54c8".
The question is: what encoding method does the built-in python string use?
Why is a[0] not "\x54" and a[1] not "\xc8", so that together they make "54c8"?
I guess the encoding used by the built-in Python str should not be utf-8, because the right utf-8 code point would be "\u54c8". Is that right?
UTF-8 and Unicode are not the same thing. Unicode is an abstract mapping of integer values to characters; UTF-8 is one particular way of representing those integers as a sequence of bytes. \xe5\x93\x88 is the three-byte UTF-8 encoding of the integer 0x54c8, which cannot be represented by a single byte.
The default source encoding in Python 2 is ASCII (very early versions assumed ISO-8859-1), and it was changed to UTF-8 in Python 3.
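A quick check of the related default used for implicit str/unicode conversion, which runs on both versions:

import sys
print(sys.getdefaultencoding())   # 'ascii' on Python 2, 'utf-8' on Python 3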
The result of pasting a non-ASCII character into the interpreter like that depends on your terminal encoding. Judging from your data, your terminal is likely using UTF-8.
a = "哈"
When you evaluate that line of code in Python 2 interactive interpreter, you'll create a bytestring object that is already encoded.
To get a text object from it, you'll have to decode the data using:
a.decode(encoding)
It helps to always think of a str object as bytes and a unicode object as text.
There is no simple relationship between the codepoint and the UTF-8 encoded bytes. The relationship that is simple is:

u'哈' == u'\u54c8' == unichr(21704)

Think of the codepoint as just an index into a big table, which you use to look up the character at that index. The above equality just shows that 哈 is the character at codepoint 21704 (because 0x54c8 in hex is 21704 in decimal).
If you want to know the relationship between a codepoint (21704) and the UTF bytes (the \xe5 and \x93 stuff), I already wrote a long answer about that here. You can read it if you're interested in learning how to encode/decode UTF by hand.
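A minimal round trip between the codepoint and the UTF-8 bytes, on Python 2.7:

u = unichr(21704)              # u'\u54c8'
b = u.encode('utf-8')          # '\xe5\x93\x88' -- the three bytes from the question
print(b.decode('utf-8') == u)  # True
print(hex(ord(u)))             # 0x54c8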
