String has unicode code points embedded, how to convert? Python 3 [duplicate]

I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, with that original string (i.e., both are valid characters in their own right; u'\xc3\xb3' == ó, but they're not what's supposed to be there).
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?

(a) Try putting it through the method below.
(b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
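If you need detection and conversion together, one approach is to attempt the round trip and keep the original string when it fails. A minimal sketch (Python 2; the helper name is mine, not from any library):
def fix_mojibake(text):
    # Try to undo UTF-8 bytes that were mis-decoded into Latin-1 code points.
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text   # not reversible this way; probably not mojibake

print(fix_mojibake(u'Sopet\xc3\xb3n'))   # Sopetón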

You should use:
>>> title.encode('raw_unicode_escape')
Python 2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python 3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
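This works because raw_unicode_escape maps code points U+0000 through U+00FF one-to-one onto bytes 0x00 through 0xFF, so for this kind of string it recovers the same bytes as the latin-1 trick above. A quick sketch (works in Python 2 and 3):
s = u'\xd0\xbf\xd1\x80\xd0\xb8'   # the UTF-8 bytes of 'при', stored as code points
assert s.encode('raw_unicode_escape') == s.encode('latin-1')
print(s.encode('raw_unicode_escape').decode('utf-8'))   # при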

Related

Python unicode strings

I'm a Python newbie and I'm trying to make a script that writes some strings to a file if there's a difference. The problem is that the original string has some characters in \uNNNN Unicode format and I cannot convert the new string to the same Unicode format.
The original string I'm trying to compare: \u00A1 ATENCI\u00D3N! \u25C4
New string is received as: ¡ ATENCIÓN! ◄
And this is the code:
str = u'¡ ATENCIÓN! ◄'
print(str)
str1 = str.encode('unicode_escape')
print (str1)
str2 = str1.decode()
print (str2)
And the result is:
¡ ATENCIÓN! ◄
b'\\xa1 ATENCI\\xd3N! \\u25c4'
\xa1 ATENCI\xd3N! \u25c4
So, how can I get \xa1 ATENCI\xd3N! \u25c4 converted to \u00A1 ATENCI\u00D3N! \u25C4 as this is the only Unicode format I can save?
Note: Cases of characters in strings also need to be the same for comparison.
The issue is, according to the docs (read down a little bit, between the escape sequences tables), the \u, \U, and \N Unicode escape sequences are only recognized in string literals. That means that once the literal is evaluated in memory, such as in a variable assignment:
s = "\u00A1 ATENCI\u00D3N! \u25C4"
any attempt to str.encode() it automatically converts it to a bytes object that uses \x where it can:
b'\\xa1 ATENCI\\xd3N! \\u25c4'
Using
b'\\xa1 ATENCI\\xd3N! \\u25c4'.decode("unicode_escape")
will convert it back to '¡ ATENCIÓN! ◄'. This uses the actual (intended) representation of the characters, and not the \uXXXX escape sequences of the original string s.
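A compact sketch of that round trip (Python 3):
s = "\u00A1 ATENCI\u00D3N! \u25C4"   # escapes are resolved when the literal is evaluated
b = s.encode('unicode_escape')        # b'\\xa1 ATENCI\\xd3N! \\u25c4'
print(b.decode('unicode_escape') == s)   # True: back to '¡ ATENCIÓN! ◄'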
So, what you should do is not mess around with encoding and decoding things. Observe:
print("\u00A1 ATENCI\u00D3N! \u25C4" == '¡ ATENCIÓN! ◄')
True
That's all the comparison you need to do.
For further reading, you may be interested in:
How to work with surrogate pairs in Python?
Encodings and Unicode from the Python docs.

Understanding unicode and encoding in Python

When I enter the following in the Python 2.7 console, I get the output shown:
>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'
u'\xe1\xed\xf3\xfas'
What is the difference between the two? I understand the basics of Unicode and the different kinds of encodings like UTF-8, UTF-16, etc., but I don't understand what is being printed on the console or how to make sense of it.
u'áíóús' is a string of text. What you see echoed in the REPL is the canonical representation of that object:
>>> print u'áíóús'
áíóús
>>> print repr(u'áíóús')
u'\xe1\xed\xf3\xfas'
Escapes like \xe1 show the hexadecimal ordinal of each character:
>>> [hex(ord(c)) for c in u'áíóús']
['0xe1', '0xed', '0xf3', '0xfa', '0x73']
Only the last character is in the ASCII range, i.e. has an ordinal in range(128), so only that last character, "s", appears literally in the repr in Python 2.x:
>>> chr(0x73)
's'
'áíóús' is a string of bytes. What you see printed is an encoding of the same text characters, with your terminal emulator assuming the encoding:
>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'.encode('utf-8')
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
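Decoding the byte string with the encoding your terminal actually used gets you back to the text object. A sketch (Python 2, assuming a UTF-8 terminal):
byte_string = '\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
text = byte_string.decode('utf-8')
print repr(text)          # u'\xe1\xed\xf3\xfas'
print text == u'áíóús'    # True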

Convert UTF-8 to string literals in Python

I have a string in UTF-8 format but I'm not sure how to convert this string to its corresponding character literal. For example, I have the string:
My string is: 'Entre\xc3\xa9'
Example one:
This code:
u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')
returns the result: u'Entre\xe9'
If I then continue by printing this:
print u'Entre\xe9'
I get the result: Entreé
This is great and close to what I need. The problem is, I can't make 'Entre\xc3\xa9' a variable and pass it through the steps as this now breaks. Any tips for getting this working?
Example:
a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b
I would like result of "c" to be:
Entreé
The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.
You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Vice versa, you can encode unicode objects to byte strings with unicode.encode().
Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back in to a Python interpreter and get an object with the same value.
Your a value is defined using a byte string literal, so you only need to decode:
a = 'Entre\xc3\xa9'
b = a.decode('utf8')
Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
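Side by side, the two cases look like this (a sketch, Python 2):
a = 'Entre\xc3\xa9'                         # plain UTF-8 byte string: decode once
print a.decode('utf-8')                     # Entreé
m = u'Entre\xc3\xa9'                        # Mojibake: UTF-8 bytes stored as code points
print m.encode('latin-1').decode('utf-8')   # Entreé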
You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder

Returning the first N characters of a unicode string

I have a string in unicode and I need to return the first N characters.
I am doing this:
result = unistring[:5]
but of course the length of unicode strings != length of characters.
Any ideas? Is using re the only solution?
Edit: More info
unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]
returns-> ?
I think that unicode strings use two bytes per character; that's why this happens. If I do:
result = unistring[:2]
I get
M
which is correct. So, should I always multiply my slice indices by 2, or should I convert to something?
Unfortunately, for historical reasons, Python prior to 3.0 has two string types: byte strings (str) and Unicode strings (unicode).
Prior to the unification in Python 3.0, there are two ways to declare a string literal: unistring = "Μεταλλικα", which is a byte string, and unistring = u"Μεταλλικα", which is a unicode string.
The reason you see ? when you do result = unistring[:1] is because some of the characters in your Unicode text cannot be correctly represented in the non-unicode string. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece for example.
So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO
When you say:
unistring = "Μεταλλικα" #Metallica written in Greek letters
You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:
unistring = "Μεταλλικα".decode('utf-8')
or by using the unicode literal in a source file with the right encoding declaration
# coding: UTF-8
unistring = u"Μεταλλικα"
The unicode string will do what you want when you do unistring[:5].
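To see the difference concretely, here is a sketch (Python 2, source file saved as UTF-8):
# coding: UTF-8
bytestring = "Μεταλλικα"
unistring = bytestring.decode('utf-8')
print len(bytestring)   # 18 -- UTF-8 bytes, two per Greek letter
print len(unistring)    # 9  -- code points
print unistring[:5].encode('utf-8')   # Μεταλ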
There is no correct, straightforward approach with any type of "Unicode string".
Even a Python "Unicode" string (UTF-16 under the hood on narrow builds) has variable-length characters, so you can't just cut with ustring[:5]: some Unicode code points may occupy more than one "character", i.e. surrogate pairs.
So if you want to cut the first 5 code points (note: these are not characters), you have to analyze the text yourself (see the http://en.wikipedia.org/wiki/UTF-8 and http://en.wikipedia.org/wiki/UTF-16 definitions) and use bit masks to figure out the code-point boundaries.
Even then you still do not get characters. For example, the word "שָלוֹם" (peace in Hebrew, "Shalom") consists of 4 characters but 6 code points: the letter "shin", the vowel "a", the letter "lamed", the letter "vav", the vowel "o", and the final letter "mem".
So a character is not a code point.
The same holds for most Western languages, where a letter with diacritics may be represented as two code points; search for "unicode normalization", for example.
So... if you really need the first 5 characters, you have to use a tool like the ICU library. There is an ICU binding for Python that provides a character-boundary iterator.
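To make the Hebrew example concrete (a sketch, Python 2):
shalom = u'\u05e9\u05b8\u05dc\u05d5\u05b9\u05dd'   # שָלוֹם
print len(shalom)   # 6 -- code points, even though only 4 letters are visible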

How to check if a string in Python is in ASCII?

I want to check whether a string is ASCII or not.
I am aware of ord(), but when I try ord('é') I get TypeError: ord() expected a character, but string of length 2 found. I understand this is caused by the way I built Python (as explained in ord()'s documentation).
Is there another way to check?
I think you are not asking the right question.
A string in Python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.
Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" That you can answer by trying:
try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not an ascii-encoded unicode string"
else:
    print "it may have been an ascii-encoded unicode string"
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
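For example, with the one-liner above (Python 2):
print is_ascii('plain old text')   # True
print is_ascii('caf\xc3\xa9')      # False: the UTF-8 bytes for é are > 127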
In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.
def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())
To check, pass the test string:
>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True
New in Python 3.7 (bpo32677):
No more tiresome/inefficient ASCII checks on strings; the new built-in str/bytes/bytearray method .isascii() checks whether the string is ASCII.
print("is this ascii?".isascii())
# True
Vincent Marchetti has the right idea, but str.decode no longer exists in Python 3. In Python 3 you can make the same test with str.encode:
try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii
Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
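Wrapped into a reusable predicate, the same test looks like this (a sketch, Python 3):
def is_ascii(mystring):
    try:
        mystring.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

print(is_ascii("hello"))   # True
print(is_ascii("héllo"))   # False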
Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.
Byte strings (e.g. "foo" or 'bar' in Python syntax) are sequences of octets: numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points: numbers from 0 to 0x10FFFF. But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.
Instead of ord(u'é'), try this:
>>> [ord(x) for x in u'é']
That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 770].
Instead of chr() to reverse this, there is unichr():
>>> unichr(233)
u'\xe9'
This character may actually be represented as either a single code point or as multiple code points, which themselves represent either graphemes or characters. It's either "e with an acute accent" (i.e., code point 233), or "e" (code point 101) followed by "an acute accent on the previous character" (code point 770). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.
Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
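A short demonstration of that normalization (a sketch, Python 3):
import unicodedata
decomposed = u'e\u0301'                   # 'e' plus combining acute accent
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))     # 2 1
print(composed == u'\xe9')                # True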
The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.
Ran into something like this recently - for future reference
import chardet
encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'
which you could use with:
string_ascii = string.decode(encoding['encoding']).encode('ascii')
How about doing this?
import string
def isAscii(s):
    # Note: string.ascii_letters contains letters only, so digits,
    # punctuation, and whitespace make this return False even though
    # they are ASCII characters.
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True
I found this question while trying determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).
My first step should have been to check the type of the string; I didn't realize I could get good data about its format from type(s). This answer was very helpful and got to the real root of my issues.
If you're getting a rude and persistent
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)
particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode; for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe and the Python docs tutorials for a better understanding of how terrible this can be.)
Eventually I determined that what I wanted to do was this:
escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))
Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):
# -*- coding: utf-8 -*-
That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').
>>> specials = 'àéç'
>>> specials.decode('utf-8').encode('ascii', 'xmlcharrefreplace')
'&#224;&#233;&#231;'
To improve on Alexander's solution, from Python 2.6 onward (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, among others: https://docs.python.org/2.6/library/curses.ascii.html
from curses import ascii
def isascii(s):
    return all(ascii.isascii(c) for c in s)
You could use the third-party regex library, which accepts the POSIX-standard [[:ASCII:]] character-class definition (the standard re module does not support POSIX classes).
A string (str type) in Python 2 is a series of bytes. There is no way of telling, just from looking at the string, whether this series of bytes represents an ASCII string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.
However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.
>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True
I imagine a regular expression is well-optimized for this.
import re
def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))
To include an empty string as ASCII, change the + to *.
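A quick usage check (a sketch):
print(is_ascii("Python"))   # True
print(is_ascii("café"))     # False
print(is_ascii(""))         # False with +, True with *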
To prevent your code from crashing, you may want to use a try/except to catch TypeErrors:
>>> ord("¶")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
For example:
def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False
I use the following to determine if the string is ascii or unicode:
>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>>
Then just use a conditional block to define the function:
def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False
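For what it's worth, isinstance is the more idiomatic way to write the same type check (a sketch, Python 2):
def is_ascii(input):
    # True for any byte string, matching the function above. Note that a
    # str object may still contain non-ASCII bytes, so this checks the
    # type only, not the contents.
    return isinstance(input, str)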
