I am trying to convert texts into URLs, but certain characters are not being converted as I'm expecting. For example:
>>> import urllib
>>> my_text="City of Liège"
>>> my_url=urllib.parse.quote(my_text,safe='')
>>> my_url
'City%20of%20Li%C3%A8ge'
The spaces get converted properly. However, the "è" should get converted into %E8, but instead it is returned as %C3%A8. What am I missing?
I am using Python 3.6.
Your string is UTF-8 encoded, and the URL-encoded string reflects this.
The two bytes 0xC3 0xA8 are the UTF-8 encoding of the code point U+00E8, which is described as "LATIN SMALL LETTER E WITH GRAVE".
In order to get the string you are after, you need to encode the text with the code page you actually want (here cp1252) before quoting it, like this:
my_text = bytes("City of Liège", 'cp1252')
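For instance (a quick sketch of the idea; urllib.parse.quote accepts bytes as well as str, so the cp1252-encoded bytes can be quoted directly):
>>> import urllib.parse
>>> my_bytes = "City of Liège".encode('cp1252')  # b'City of Li\xe8ge'
>>> urllib.parse.quote(my_bytes, safe='')
'City%20of%20Li%E8ge'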
Related
I have JSON file which contains followingly encoded strings:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadková'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understand that after \u there should be four hexadecimal digits specifying the Unicode code point of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this, and how do I correctly parse it in Python 3?
Is this type of JSON file even a valid JSON file according to the specification?
Your text is already encoded; normally you would tell Python that by using a b prefix on the string, but since you're using json and its input needs to be a str, you have to undo that extra encoding manually. Because your input is not bytes, you can open the file with the 'raw_unicode_escape' encoding (which stops open() from applying its own default encoding), re-encode the text you read with 'raw_unicode_escape' to get the underlying bytes back, and then decode those bytes as UTF-8.
Note that since you need to do the encoding and decoding yourself, you have to read the file content and perform the conversion on the loaded string, so you should use json.loads() instead of json.load().
In [167]: import json

In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}
The JSON you are reading was written incorrectly, so the Unicode strings decoded from it will have to be re-encoded with the wrong encoding that was originally used, and then decoded again with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'
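Applied to the whole file, that repair could look roughly like this (a sketch; the file name 'test.json' and the flat, all-string structure are assumptions about your data):
import json

# Load the badly written JSON, then re-encode every mojibake value with the
# wrong codec (latin-1) and decode it again with the right one (utf-8).
with open('test.json', encoding='utf-8') as f:
    d = json.load(f)

d = {key: value.encode('latin-1').decode('utf-8') for key, value in d.items()}
print(d)  # {'sender_name': 'Horníková'}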
Is this type of JSON file even a valid JSON file according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.
I just made my first web scraper, which goes to Wikipedia and downloads the HTML of the whole page. I managed to extract just the content of a list; the values in the list are numbers, either positive or negative.
But instead of printing '-2' it gives me '\xe2\x88\x922'. I tried string.replace("\xe2\x88\x92", "-"), but this doesn't seem to work because of the backslashes.
Do you know how I can convert these UTF sequences into their real symbols?
I used urllib to get the HTML content, in case that matters.
You can use bytes.decode to convert it:
>>> b'\xe2\x88\x922'.decode("utf8")
'-2'
And if your data doesn't start with b (i.e. if it is not a bytes object), you can first convert it to bytes then decode:
>>> s = '\xe2\x88\x922'
>>> byte_object = bytes(ord(c) for c in s)
>>> byte_object.decode("utf8")
'-2'
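If the escaped bytes come straight from the downloaded page, it may be simpler to decode the whole response once when you fetch it (a sketch; the URL is a placeholder, and it assumes the page is served as UTF-8, which Wikipedia's pages are):
from urllib.request import urlopen

# Decode the whole page up front so individual values never show up as raw byte escapes.
html_bytes = urlopen('https://en.wikipedia.org/wiki/Example').read()
html = html_bytes.decode('utf-8')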
That is unfortunately common when reading data from web pages: they contain characters that look like standard ASCII characters but are not.
Here you have a MINUS SIGN character (Unicode U+2212), −, which looks like the normal HYPHEN-MINUS (Unicode U+002D, ASCII 0x2D), -.
In UTF-8, '−2' is encoded as b'\xe2\x88\x922'. This probably means that you read the data as if it were Latin-1 encoded while it is actually UTF-8 encoded.
A trick to recode it correctly is to encode it as Latin-1 and decode it back as UTF-8:
t = '\xe2\x88\x922'
print(t.encode('latin1').decode('utf8'))
−2
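If you ultimately want a plain ASCII '-2' that int() can parse, you could also replace the MINUS SIGN with a regular hyphen after recoding (a small sketch of that idea):
t = '\xe2\x88\x922'
fixed = t.encode('latin1').decode('utf8').replace('\u2212', '-')  # map U+2212 to ASCII '-'
print(int(fixed))  # -2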
My strings look like this: \\xec\\x88\\x98, but if I print them they look like this: \xec\x88\x98, and when I decode them they look like this: \xec\x88\x98.
If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want: 수.
If I use x.decode('unicode-escape'), it removes the double backslashes, but when I decode the value returned by x.decode('unicode-escape'), the value I get is ì.
How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?
In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.
Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. So instead I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.
import unicodedata as ud
src = '\\xec\\x88\\x98'
print repr(src)
s = src.decode('string-escape')
print repr(s)
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')
output
'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
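On Python 3, where the 'string-escape' codec no longer exists, a roughly equivalent repair could look like this (a sketch, assuming the same doubled-backslash input):
src = '\\xec\\x88\\x98'
# Interpret the literal \xNN escapes as Latin-1 code points, then recover the
# underlying UTF-8 bytes and decode them properly.
s = src.encode('latin1').decode('unicode_escape')
print(s.encode('latin1').decode('utf8'))  # 수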
However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
I have a list of strings with various characters that are similar to Latin ones; I get these from a website that I download using urllib2. The website is encoded in UTF-8. However, after trying quite a few variations, I can't figure out how to convert this to a simple ASCII equivalent. For example, one of the strings I have is:
u'Atl\xc3\xa9tico Madrid'
In plain text it's "Atlético Madrid"; what I want is to change it to just "Atletico Madrid".
If I use simple unidecode on this, I get "AtlA(c)tico Madrid". What am I doing wrong?
You have UTF-8 bytes in a Unicode string. That's not a proper Unicode string; that's a Mojibake:
>>> print u'Atl\xc3\xa9tico Madrid'
Atlético Madrid
Repair your string first:
>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid
and Unidecode will give you what you expected:
>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.
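With urllib2 that could look roughly like this (a sketch; the URL is a placeholder, and it assumes the page really is served as UTF-8):
import urllib2

# Read the raw bytes and decode them as UTF-8 to get a proper unicode object.
html_bytes = urllib2.urlopen('http://example.com/page').read()
html = html_bytes.decode('utf8')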
I got this data returned from an API: b'\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e. This data is in Russian, which I know for sure. I am guessing these values are the Unicode representation of Cyrillic letters?
The data returned was a byte array.
How can I convert that into a readable Cyrillic string? Essentially, I need a way to convert this kind of data into readable human text.
EDIT: Yes this is JSON data. Forgot to mention, sorry.
Chances are you have JSON data; JSON uses \uhhhh escape sequences to represent Unicode codepoints. Use the json.loads() function on unicode (decoded) data to produce a Python string:
import json
string = json.loads(data.decode('utf8'))
UTF-8 is the default JSON encoding; check your response headers (if you are using an HTTP-based API) to see if a different encoding was used.
Demo:
>>> import json
>>> json.loads(b'"\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e"'.decode('utf8'))
'Кейтлинпро'
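If you are calling an HTTP API, a simpler route is to let a client library handle the decoding and parsing (a sketch; the URL is a placeholder and it assumes the third-party requests package):
import requests

response = requests.get('https://api.example.com/endpoint')  # placeholder URL
data = response.json()  # decoded using the response headers, falling back to UTF-8
print(data)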