Python UTF-16 WAVY DASH encoding question / issue - python

I was doing some work today and came across an issue where something "looked funny". I had been interpreting some string data as UTF-8 and checking the encoded form. The data was coming from LDAP (specifically, Active Directory) via python-ldap. No surprises there.
So I came upon the byte sequence '\xe3\x80\xb0' a few times, which, when decoded as UTF-8, is Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it via .encode('utf-16'). Unfortunately, it seems Python doesn't like this character:
D:\> python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode("utf-8")
'\xe3\x80\xb0'
>>> u"\u3030".encode("utf-16-le")
'00'
>>> u"\u3030".encode("utf-16-be")
'00'
>>> '\xe3\x80\xb0'.decode('utf-8')
u'\u3030'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
'\xff\xfe00'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
u'00'
It seems IronPython isn't a fan either:
D:\ipy
IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode('utf-8')
u'\xe3\x80\xb0'
>>> u"\u3030".encode('utf-16-le')
'00'
If somebody could tell me what, exactly, is going on here, it'd be much appreciated.

This is in fact the correct behaviour. The UTF-16 encoding of the character u'\u3030' happens to be the same two bytes as the ASCII/UTF-8 encoding of '00'. It looks strange, but it's correct.
The '\xff\xfe' you can see is just a Byte Order Mark.
Are you sure you want a wavy dash, and not some other character? If you were hoping for a different character then it might be because it had already been misencoded before entering your application.

But it decodes okay:
>>> u"\u3030".encode("utf-16-le")
'00'
>>> '00'.decode("utf-16-le")
u'\u3030'
It's just that the UTF-16-LE encoding of that character happens to coincide with two ASCII '0' bytes. You could also represent it with '\x30\x30':
>>> '00' == '\x30\x30'
True

You are being confused by two things here (they threw me off too):
1. The utf-16 and utf-32 codecs emit a BOM unless you specify which byte order to use, via utf-16-le, utf-16-be and so on. That is the '\xff\xfe' in your second-to-last line.
2. '00' is two DIGIT ZERO characters; it is not a pair of null characters. Nulls would print differently anyway:
>>> '\0\0'
'\x00\x00'

There is a basic error in your sample code above. Remember, you encode Unicode to an encoded string, and you decode from an encoded string back to Unicode. So, you do:
'\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
which translates to the following steps:
'\xe3\x80\xb0' # (some string)
.decode('utf-8') # decode above text as UTF-8 encoded text, giving u'\u3030'
.encode('utf-16-le') # encode u'\u3030' as UTF-16-LE, i.e. '00'
.decode('utf-8') # OOPS! decode using the wrong encoding here!
u'\u3030' is indeed encoded as '00' (ASCII zero, twice) in UTF-16-LE, but you seem to be reading that as a null byte ('\0') or something similar.
Remember, you won't get the same character back if you encode with one encoding and decode with another:
>>> import unicodedata as ud
>>> c= unichr(193)
>>> ud.name(c)
'LATIN CAPITAL LETTER A WITH ACUTE'
>>> ud.name(c.encode("cp1252").decode("cp1253"))
'GREEK CAPITAL LETTER ALPHA'
In this code, I encoded to Windows-1252 and decoded from Windows-1253. In your code, you encoded to UTF-16LE and decoded from UTF-8.
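For completeness, here is a sketch of the full round trip in the same Python 2 session style as above, using the same codec on both sides so the wavy dash comes back intact:
>>> '\xe3\x80\xb0'.decode('utf-8')           # UTF-8 bytes -> u'\u3030'
u'\u3030'
>>> u'\u3030'.encode('utf-16-le')            # UTF-16-LE bytes, i.e. two ASCII zeros
'00'
>>> '00'.decode('utf-16-le')                 # decode with the SAME codec
u'\u3030'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-16-le') == u'\u3030'
True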

Related

How can I convert special characters into regular characters (é to e, å to a, etc.)?

I have a list of songs that I am trying to search for on YouTube. However, when certain songs with special characters are used, the following error pops up:
Code:
import urllib.request
import re
search_kw = tracks[3]['Artist'] + '+' + tracks[3]['Track Title']
search_kw = search_kw.replace(' ','+')
html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_kw)
video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
print("https://www.youtube.com/watch?v=" + video_ids[0])
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 43: ordinal not in range(128)
Example of string that causes error:
Tutu Au Mic' – dumbéa
How can I convert the special characters into regular characters to prevent the error from occurring?
Use the Unidecode library for this: https://pypi.org/project/Unidecode/. It guarantees an ASCII string in return.
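A minimal sketch, assuming the package is installed (pip install Unidecode), applied to the string from the question:
>>> from unidecode import unidecode
>>> unidecode("Tutu Au Mic' – dumbéa")
"Tutu Au Mic' - dumbea"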
For a web query you probably need to use urlencode:
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
or, for general character translations, the str.maketrans method:
Python 3.9.5 (default, Nov 18 2021, 16:00:48)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> txt = "Tutu Au Mic' – dumbéa"
>>> mytable = txt.maketrans("é", "e")
>>> print(txt.translate(mytable))
Tutu Au Mic' – dumbea
>>>
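For the urlencode route, a short sketch (Python 3) using the search_query parameter name from the question's URL; the percent-encoding of the non-ASCII characters is handled for you:
>>> from urllib.parse import urlencode
>>> urlencode({"search_query": "Tutu Au Mic' – dumbéa"})
'search_query=Tutu+Au+Mic%27+%E2%80%93+dumb%C3%A9a'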
Instead of doing this, you should encode the non-ASCII characters. YouTube will likely understand an ASCII approximation, but not all characters have one, and it isn't necessary anyway: there are well-defined ways to pass non-ASCII characters as part of a URL's query string.
The standard library offers urllib.parse.quote_plus for escaping text to be used in a query string. Or use the excellent requests library, https://docs.python-requests.org/en/latest/.
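A minimal sketch of the quote_plus route, applied to the failing string from the question (only the standard library is assumed):
>>> from urllib.parse import quote_plus
>>> quote_plus("Tutu Au Mic' – dumbéa")
'Tutu+Au+Mic%27+%E2%80%93+dumb%C3%A9a'
>>> "https://www.youtube.com/results?search_query=" + quote_plus("Tutu Au Mic' – dumbéa")
'https://www.youtube.com/results?search_query=Tutu+Au+Mic%27+%E2%80%93+dumb%C3%A9a'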

Why do we need to encode and decode in python?

What is the use case of encode/decode?
My understanding was that encode is used to convert a string into a byte string in order to pass non-ASCII data around the program, and decode converts that byte string back into a string.
But the following example shows non-ASCII characters being printed successfully even though they were never encoded or decoded. Example:
val1="À È Ì Ò Ù Ỳ Ǹ Ẁ"
val2 = val1
print('val1 is: ',val2)
encoded_val1=val1.encode()
print('encoded_val1 is: ',encoded_val1)
decoded_encoded_val1=encoded_val1.decode()
print('decoded_encoded_val1 is: ',decoded_encoded_val1)
Output:
val1 is:  À È Ì Ò Ù Ỳ Ǹ Ẁ
encoded_val1 is:  b'\xc3\x80 \xc3\x88 \xc3\x8c \xc3\x92 \xc3\x99 \xe1\xbb\xb2 \xc7\xb8 \xe1\xba\x80'
decoded_encoded_val1 is:  À È Ì Ò Ù Ỳ Ǹ Ẁ
So what is the use case of encode and decode in python?
The environment you are working in may support those characters, and your terminal (or whatever you use to view the output) may support displaying them; some terminals, command lines, or text editors do not. Apart from display issues, here are some actual reasons and examples:
1- When you transfer data over the internet or a network (e.g. with a socket), the information is transferred as raw bytes. Non-ASCII characters cannot be represented by a single byte, so we need a multi-byte representation for them (UTF-16, or UTF-8 with more than one byte). This is the most common reason I have encountered; see the sketch at the end of this answer.
2- Some text editors only support UTF-8, so you need to represent your Ẁ character in UTF-8 to work with them. The reason is historical: text used to be mostly ASCII, which fits in one byte, and when systems had to handle non-ASCII characters they moved to UTF-8. Someone with more in-depth knowledge of text editors may give a better explanation of this point.
3- You may have text containing Chinese or Russian letters and, for some reason, need to store it on a remote Linux server that does not support those alphabets natively. You need to convert the text to a well-defined byte format (UTF-8 or UTF-16) before storing it, so you can recover it later.
Here is a little explanation of the UTF-8 format. There are also other articles about the topic if you are interested.
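A minimal sketch of point 1, using socket.socketpair() so it runs without any real network (Python 3):
import socket

# A connected pair of sockets, so the example is self-contained.
sender, receiver = socket.socketpair()

text = "À È Ì Ò Ù Ỳ Ǹ Ẁ"            # str: a sequence of Unicode code points
data = text.encode("utf-8")          # bytes: what actually travels over the wire

sender.sendall(data)                 # sockets accept bytes, not str
received = receiver.recv(1024)

print(received)                      # the raw UTF-8 bytes
print(received.decode("utf-8"))      # decoded back to the original text

sender.close()
receiver.close()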
Use UTF-8 encoding because it's universal.
Set your code editor to UTF-8 encoding and put this at the top of all your Python files: # coding: utf8
When you get input (a file, a string...), it can be in a different encoding, so you have to determine its encoding and decode it. For example, in an HTML file the encoding is declared in a meta tag.
If you change something in the HTML file and want to save it or send it over the network, you have to encode it back into the encoding it had before.
Always use Unicode for your strings in Python. (Automatic in Python 3, but in Python 2.7 use the u prefix, like u'Hi'.)
$ python2.7
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type('this is a string') # bits => encoded
<type 'str'>
>>> type(u'this is a string') # unicode => decoded
<type 'unicode'>
$ python3
Python 3.2.3 (default, Oct 19 2012, 20:10:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type("this is a string") # unicode => decoded
<class 'str'>
>>> type(b"this is a string") # bits => encoded
<class 'bytes'>
1 Use UTF-8. Now. Everywhere.
2 In your code, specify the file encoding and declare your strings as unicode.
3 On input, know the encoding of your data, and decode it with decode().
4 On output, encode into the encoding expected by the system that will receive the data; if you cannot know it, use UTF-8, with encode(). A sketch of this input/output pattern follows.
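A minimal sketch of steps 3 and 4 (the file names are invented for the example; io.open behaves the same way in Python 2.7 and 3):
import io

# 3. On input: know the encoding and decode (io.open decodes for you).
with io.open("input.txt", encoding="utf-8") as f:
    text = f.read()            # unicode in Python 2, str in Python 3

# ... work with decoded text in between ...
text = text.upper()

# 4. On output: encode into the encoding the receiver expects (here UTF-8).
with io.open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)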

Python 3 and b'\x92'.decode('latin1')

I'm getting results I didn't expect from decoding b'\x92' with the latin1 codec. See the session below:
Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
>>> b'\xa3'.decode('latin1').encode('ascii', 'namereplace')
b'\\N{POUND SIGN}'
>>> b'\x92'.decode('latin1').encode('ascii', 'namereplace')
b'\\x92'
>>> ord(b'\x92'.decode('latin1'))
146
The result of decoding b'\xa3' gave me exactly what I was expecting. But the two results for b'\x92' were not what I expected. I was expecting b'\x92'.decode('latin1') to result in U+2018, but it seems to be returning U+0092.
What am I missing?
The error I made was to expect that the character 0x92 decodes to "RIGHT SINGLE QUOTATION MARK" in latin-1; it doesn't. The confusion arose because the character appeared in a file that was specified as being in latin-1 encoding. It now appears that the file was actually encoded in windows-1252. This is apparently a common source of confusion:
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
If the character is decoded with the correct encoding, then the expected result is obtained.
>>> b'\x92'.decode('windows-1252').encode('ascii', 'namereplace')
b'\\N{RIGHT SINGLE QUOTATION MARK}'
I was expecting b'\x92'.decode('latin1') to result in U+2018
latin1 is an alias for ISO-8859-1. In that encoding, byte 0x92 maps to character U+0092, an unprintable control character.
The encoding you might have really meant is windows-1252, the Microsoft Western code page based on it. In that encoding, 0x92 is U+2019 which is close...
(Further befuddlement arises because for historical reasons web browsers are also confused between the two. When a web page is served as charset=iso-8859-1, web browsers actually use windows-1252.)
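If you regularly receive data that is labelled latin-1 but is really windows-1252, one common workaround (a rough sketch; the helper name is made up) is to retry the decode whenever the latin-1 result contains C1 control characters, which almost never occur in real text:
def decode_sloppy_latin1(data):
    """Decode bytes labelled latin-1, retrying as windows-1252 when the
    result contains C1 control characters (U+0080..U+009F)."""
    text = data.decode('latin-1')
    if any('\x80' <= ch <= '\x9f' for ch in text):
        # Note: 0x81, 0x8d, 0x8f, 0x90 and 0x9d are undefined in cp1252
        # and would still raise UnicodeDecodeError here.
        return data.decode('windows-1252')
    return text

>>> import unicodedata
>>> unicodedata.name(decode_sloppy_latin1(b'\x92'))
'RIGHT SINGLE QUOTATION MARK'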
I just want to make clear that you're not really encoding anything here.
'\xa3' has an ordinal value of 163 (0xa3 in hexadecimal). Since that ordinal does not fit in seven bits, it can't be encoded into ASCII. Your error handler just replaces the Unicode character with the character's name. The Unicode character 163 maps to £.
'\x92', on the other hand, has an ordinal value of 146. According to this Wikipedia article, the character isn't printable: it's a privately used control code in the C1 range. This explains why its name is simply the literal '\\x92'.
As an aside, if you need the name of a character, it's much better to do it like this:
import unicodedata
print(unicodedata.name('\xa3'))

Decoding Cyrillic in Python - character maps to <undefined>

I receive a server response, bytes:
\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4
\xd0\x9a\xd0\xa6\xd0\x91
This is for sure Cyrillic, but I'm not sure which encoding. Every attempt to decode it in Python fails:
b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
>>> b.decode('utf-8')
'\u0420\u0443\u0431\u043b\u0438 \u0420\u0424 \u041a\u0426\u0411'
>>> print(b.decode('utf-8'))
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4:
character maps to <undefined>
>>> b.decode('cp1251')
'\u0420\xa0\u0421\u0453\u0420±\u0420»\u0420\u0451 \u0420\xa0\u0420¤
\u0420\u0459\u0420¦\u0420\u2018'
>>> print(b.decode('cp1251'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u0420' in
position 0: character maps to <undefined>
Both results somewhat resemble Unicode-escape, but this does not work either:
>>> codecs.decode('\u0420\u0443\u0431\u043b\u0438 \u0420\u0424 \u041a\u0426\u0411',
'unicode-escape')
'Ð\xa0Ñ\x83бли Ð\xa0Ф Ð\x9aЦÐ\x91'
There's a web service for recovering Cyrillic texts; it is able to decode my bytes using Windows-1251:
Output (source encoding : WINDOWS-1251)
Рубли РФ КЦБ
But I don't have any more ideas as to how to approach it.
I think I'm missing something about how encoding works, so if the problem seems trivial to you, I would greatly appreciate a bit of explanation, a link to a tutorial, or some keywords for further googling.
Solution:
Windows PowerShell uses the Windows-850 code page by default, which cannot handle some Cyrillic characters. One fix is to change the code page to Unicode (UTF-8) each time you start the shell:
chcp 65001
It is explained here how to make it the new default.
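An alternative that avoids changing the console code page is to force UTF-8 output from inside the script; a sketch, assuming Python 3.7+ where TextIOWrapper.reconfigure() exists (on older versions the PYTHONIOENCODING environment variable serves the same purpose):
import sys

# Re-wrap stdout so print() emits UTF-8 regardless of the console code page.
sys.stdout.reconfigure(encoding='utf-8')

b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
print(b.decode('utf-8'))   # Рубли РФ КЦБ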
This is for sure Cyrillic, but I'm not sure which encoding.
This is UTF-8 (100%).
Python 3.4.3 (default, Mar 25 2015, 17:13:50)
Type "copyright", "credits" or "license" for more information.
IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
In [2]: s = b.decode('utf-8')
In [3]: print(s)
Рубли РФ КЦБ
Works fine for me. Maybe you have a problem with your terminal or REPL?
Try this out (the example below is Python 2 syntax).
In [1]: s = "\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91"
In [11]: print s.decode('utf-8')
Рубли РФ КЦБ
To print or display some strings properly, they need to be decoded (into Unicode strings).
There is a lot of information, with examples, in the standard Python library documentation.
Python 3:
>>> import sys
>>> print (sys.version)
3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2]
>>> b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
>>> b.decode('utf-8')
'Рубли РФ КЦБ'

Django \u characters in my UTF8 strings

I am adding UTF-8 data to a database in Django.
As the data goes into the database, everything looks fine: the characters (for example “Hello”) are UTF-8 encoded.
My MySQL database is UTF-8 encoded. When I examine the data from the DB by doing a select, my example string looks like this: ?Hello?. I assume this is showing the characters as UTF-8 encoded.
When I select the data from the database in the terminal, or export it as a web service, however, my string looks like this: \u201cHello World\u201d.
Does anyone know how I can display my characters correctly?
Do I need to perform some additional UTF-8 encoding somewhere?
Thanks,
Nick.
u'\u201cHello World\u201d'
is the correct Python representation of the Unicode text “Hello World”. The smart-quote characters are displayed using \uXXXX hex escapes rather than verbatim because there are often problems with writing Unicode characters to the terminal, particularly on Windows. (It looks like MySQL tried to write them to the terminal but failed, resulting in the ? placeholders.)
On a terminal that does manage to correctly input and output Unicode characters, you can confirm that they're the same thing:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) [GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u201cHello World\u201d'==u'“Hello World”'
True
Just as for byte strings, \x escapes are simply another way of writing the same characters:
>>> '\x61'=='a'
True
Now if you've got \u or \x sequences escaping Python and making their way into an exported file, then you've done something wrong with the export. Perhaps you used repr() somewhere by mistake.
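A small sketch of that difference (Python 2, to match the session above; the file name is invented): repr() produces the escaped form, while an export should write the encoded text itself.
# -*- coding: utf-8 -*-
s = u'\u201cHello World\u201d'

print repr(s)               # u'\u201cHello World\u201d'  <- escapes leak into the output
print s.encode('utf-8')     # “Hello World”               <- what an export should contain

with open('export.txt', 'w') as f:
    f.write(s.encode('utf-8'))   # write the UTF-8 bytes, not the repr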
