Convert a unicode object to a latin string with entities - python

I have a unicode object like
x = u"a & 日本語: enči hallöle"
and want to convert it into a latin-1 string with html-entities like
"a & 日本語: enči hallöle"
the reason behind this is, that I want my users to be able to enter unicode data, but my legacy database where I need to save my data only accepts latin-1 strings. (the "ö" should not be converted, but the other special characters must be converted)
Any idea which module to use here? I searched through the encoding module, looked up some codecs, tried a bit with methods of unicode objects, but came to no sensible solution.

Use the "xmlcharrefreplace" option of unicode.encode, but note that it won't translate & to & for you:
>>> x = "a & 日本語: enči hallöle".decode("utf-8")
>>> x.replace("&", "&").encode("latin-1", "xmlcharrefreplace")
'a & 日本語: enči hall\xf6le'

Just encode your to UTF-8, that should be save.
>>> x.encode("UTF-8")
'a & \xc3\xa6\xc2\x97\xc2\xa5\xc3\xa6\xc2\x9c\xc2\xac\xc3\xa8\xc2\xaa\xc2\x9e: en\xc3\x84\xc2\x8di hall\xc3\x83\xc2\xb6le'

Related

String has unicode code points embedded, how to convert? Python 3 [duplicate]

I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e, both are valid characters in their own right, u'\xc3\xb3' == ó, but they're not what's supposed to be there)
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
a) Try to put it through the method below.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
You should use:
>>> title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))

How can I make a Python string to include unicode code points?

I want to have an ASCII representation of a string that could contain non-ascii characters such as German umlauts. The way the non-ascii characters should be encoded is as unicode code points, e.g. ß would be \u00df.
The problem is that I have those escape sequences in my database. It gets displayed like I want it, but when the user searches for something, he enters ß and not \u00df. For ß, it works for me to simply make search_query.replace('ß', r'\u00df'), but there are (many) more possible escape sequences.
What I tried
>>> name = 'Ein Spaß'
>>> name.encode('ascii', 'backslashreplace')
b'Ein Spa\\xdf'
>>> name.encode('ascii', 'xmlcharrefreplace')
b'Ein Spaß'
What I want to get:
'Ein Spa\\u00df'
As a dumb workaround, stdlib json encoding will use the 4 digit unicode escapes:
>>> name = 'Ein Spaß'
>>> json.dumps(name)
'"Ein Spa\\u00df"'
>>> ast.literal_eval(json.dumps(name)) == name
True
However, this will not really solve your search problem robustly. You'll need to normalize the query text before searching. And you'll want to normalize unicode data on the way into the database, too - or use a db + ORM which handles such details for you.
See this answer for details about a better tool for the job here: unicodedata.normalize.
encode in ascii if possible
else replace by code point as unicode string
ord : is a function to get character code point as integer base 10
new=[]
for e in name:
try:
new.append(e.encode("ascii").decode())
except:
new.append(u"\\u%04x"%ord(e))
"".join(new)
If the data in your database is stored as escaped unicode, you can use codecs.decode with encoding set to unicode_escape:
>>> name = "Ein Spa\\u00df"
>>> codecs.decode(name, "unicode_escape")
'Ein Spaß'

How to compare unicode and string in Python?

I have two variables (let's say x and y) that have the following values:
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
They are presumable encoding the same name but in different way. The first variable is unicode and the second one is a string.
Is there a way to transform string into unicode (or unicode into string) and check if they are really the same.
I try to use encode
x.encode('utf-8')
It returns something new (the third version):
'Ko\xc5\xa1ick\xc3\xbd'
And using the following:
print x.encode('utf-8')
returns yet another version:
KošickÛ
So, I am totally confused. Is there a way to keep everything in the same format?
You can convert a byte string to Unicode, but if it contains any non-ASCII, characters, you have to specify the encoding.
if y.decode('iso-8859-1') == x:
print(u'{0!r} converted to Unicode == {1}".format(y, x))
With your given example, this is not true; but perhaps y is in a different encoding.
In theory, you could convert either way, but generally, it makes sense to use all-Unicode internally, and convert other encodings to Unicode for use in your code (not the other way around).
You need to know the encoding of the byte string. It looks like windows-1252:
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
print x == y.decode('windows-1252')
print x.encode('windows-1252') == y
Output:
True
True
Best practice is to convert text to Unicode on input to the program, do all the processing in Unicode, and convert back to encoded bytes to persist to storage, transmit on a socket, etc.
Well, utf-8 is now the de facto standard for interchange and in the Linux world, but there are plenty of other encodings.
Common examples are latin1, latin9 (same with € symbol), and cp1252 a windows variant of them.
In your case:
>>> x.encode('cp1252')
'Ko\x9aick\xfd'
So the y strings seems to be cp1252 encoded.

Python - Unicode to ASCII conversion

I am unable to convert the following Unicode to ASCII without losing data:
u'ABRA\xc3O JOS\xc9'
I tried encode and decode and they won’t do it.
Does anyone have a suggestion?
The Unicode characters u'\xce0' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRAÃO JOSÉ
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e
All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.
As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:
>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
I found https://pypi.org/project/Unidecode/ this library very useful
>>> from unidecode import unidecode
>>> unidecode('ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode('30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode('\u5317\u4EB0')
'Bei Jing '
I needed to calculate the MD5 hash of a unicode string received in HTTP request. MD5 was giving UnicodeEncodeError and python built-in encoding methods didn't work because it replaces the characters in the string with corresponding hex values for the characters thus changing the MD5 hash.
So I came up with the following code, which keeps the string intact while converting from unicode.
unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()
This removes the unicode part from the string and keeps all the data intact.

Python UTF-8 conversion problem

In my database, I have stored some UTF-8 characters. E.g. 'α' in the "name" field
Via Django ORM, when I read this out, I get something like
>>> p.name
u'\xce\xb1'
>>> print p.name
α
I was hoping for 'α'.
After some digging, I think if I did
>>> a = 'α'
>>> a
'\xce\xb1'
So when Python is trying to display '\xce\xb1' I get alpha, but when it's trying to display u'\xce\xb1', it's double encoding?
Why did I get u'\xce\xb1' in the first place? Is there a way I can just get back '\xce\xb1'?
Thanks. My UTF-8 and unicode handling knowledge really need some help...
Try to put the unicode signature u before your string, e.g. u'YOUR_ALFA_CHAR' and revise your database encoding, because Django always supports UTF-8 .
What you seem to have is the individual bytes of a UTF-8 encoded string interpreted as unicode codepoints. You can "decode" your string out of this strange form with:
p.name = ''.join(chr(ord(x)) for x in p.name)
or perhaps
p.name = ''.join(chr(ord(x)) for x in p.name).decode('utf8')
One way to get your strings "encoded" into this form is
''.join(unichr(ord(x)) for x in '\xce\xb1')
although I have a feeling your strings actually got in this state by different components of your system disagreeing on the encoding in use.
You will probably have to fix the source of your bad "encoding" rather than just fixing the data currently in your database. And the code above might be okay to convert your bad data once, but I would advise you don't insert this code into your Django app.
The problem is that p.name was not correctly stored and/or read in from the database.
Unicode small alpha is U+03B1 and p.name should have printed as u'\x03b1' or if you were using a Unicode capable terminal the actual alpha symbol itself may have been printed in quotes. Note the difference between u'\xce\xb1' and u'\xceb1'. The former is a two character string and the latter in a single character string. I have no idea how the '03' byte of the UTF-8 got translated into 'CE'.
You can turn any byte sequence into internal unicode representation through the decode function:
print '\xce\xb1'.decode('utf-8')
This allows you to import a byte sequence from any source and then turn it into a Python unicode string.
Reference: http://docs.python.org/library/stdtypes.html#string-methods
Try converting the encoding with p.name.encode('latin-1'). Here's a demonstration:
>>> print u'\xce\xb1'
α
>>> print u'\xce\xb1'.encode('latin-1')
α
>>> print '\xce\xb1'
α
>>> '\xce\xb1' == u'\xce\xb1'.encode('latin1')
True
For more information, see str.encode and Standard Encodings.

Categories