I have a list of strings with various different characters that are similar to latin ones, I get these from a website that I download from using urllib2. The website is encoded in utf-8. However, after trying quite a few variations, I can't figure out how to convert this to simple ASCII equivalent. So for example, one of the strings I have is:
u'Atl\xc3\xa9tico Madrid'
In plain text it's "Atlético Madrid", what I want, is to change it to just "Atletico Madrid".
If I use simple unidecode on this, I get "AtlA(c)tico Madrid". What am I doing wrong?
You have UTF-8 bytes in a Unicode string. That's not a proper Unicode string, that's a Mojibake:
>>> print u'Atl\xc3\xa9tico Madrid'
Atlético Madrid
Repair your string first:
>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid
and Unidecode will give you what you expected:
>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.
Related
I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e, both are valid characters in their own right, u'\xc3\xb3' == ó, but they're not what's supposed to be there)
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
a) Try to put it through the method below.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
You should use:
>>> title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
I am trying to convert texts into URLs, but certain characters are not being converted as I'm expecting. For example:
>>> import urllib
>>> my_text="City of Liège"
>>> my_url=urllib.parse.quote(my_text,safe='')
>>> my_url
'City%20of%20Li%C3%A8ge'
The spaces get converted properly, however, the "è" should get converted into %E8, but it is returned as %C3%A8. What am I missing ?
I am using Python 3.6.
Your string is UTF-8 encoded, and the URL encoded string reflects this.
0xC3A8 is the UTF-8 encoding of the Unicode value U+00E8, which is described as "LATIN SMALL LETTER E WITH GRAVE".
In order to get the string you are after, you need to let Python know which codepage you're using, like this:
my_text=bytes("City of Liège",'cp1252')
Is there a possible normalization path which brings both strings below to same value?
u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
u'Aho\u2013Corasick string matching algorithm'
It looks like you have a Mojibake there, UTF-8 bytes that have been decoded as if they were Windows-1252 data instead. Your 3 'characters', encoded to Windows-1252, produce the exact 3 UTF-8 bytes for the U+2013 EN DASH character in your target string:
>>> u'\u2013'.encode('utf8')
'\xe2\x80\x93'
>>> u'\u2013'.encode('utf8').decode('windows-1252')
u'\xe2\u20ac\u201c'
You can use the ftfy module to repair that data, so you get an emdash for the bytes:
>>> import ftfy
>>> sample = u'Aho\xe2\u20ac\u201cCorasick_string_matching_algorithm'
>>> ftfy.fix_text(sample)
u'Aho\u2013Corasick_string_matching_algorithm'
then simply replace underscores with spaces:
>>> ftfy.fix_text(sample).replace('_', ' ')
u'Aho\u2013Corasick string matching algorithm'
You can also simply encode to Windows-1252 and decode again as UTF-8, but that doesn't always work because there are specific bytes that cannot be decoded legally as Windows-1252, but some systems producing these Mojibakes do so anyway. ftfy includes specialised repair codecs to reverse that process. In addition, it detects the specific Mojibake errors made to automate the process across multiple possible codec errors.
I am unable to convert the following Unicode to ASCII without losing data:
u'ABRA\xc3O JOS\xc9'
I tried encode and decode and they won’t do it.
Does anyone have a suggestion?
The Unicode characters u'\xce0' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRAÃO JOSÉ
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e
All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.
As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:
>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
I found https://pypi.org/project/Unidecode/ this library very useful
>>> from unidecode import unidecode
>>> unidecode('ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode('30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode('\u5317\u4EB0')
'Bei Jing '
I needed to calculate the MD5 hash of a unicode string received in HTTP request. MD5 was giving UnicodeEncodeError and python built-in encoding methods didn't work because it replaces the characters in the string with corresponding hex values for the characters thus changing the MD5 hash.
So I came up with the following code, which keeps the string intact while converting from unicode.
unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()
This removes the unicode part from the string and keeps all the data intact.
I'm working on Arabic light stemmer package in python
I want to convert the result from any operation from Unicode to Arabic letters.
My code:
import tashaphyne
form tashaphyne import *
>>> text = u"الْعَرَبِيّةُ"
>>> strip_tashkeel(text)
I want it to display "العربية" not it's Unicode
You can convert unicode strings to any other encoding using the encode() function like so:
text.encode('utf8')
Here is a list of possible encodings in Python 2.7.
You see u'\u0627\u0644\u0639\u0631\u0628\u064a\u0629' instead of "العربية" because repr() representation for unicode strings is meant to be displayable even on 7-bit terminals.
To see actual scripts instead of unicode, do either print _ after your call to strip_tashkeel(), or do print strip_tashkeel(text) directly.