Why does Python add \xe3 in the output of:
>>> b'Transa\xc3\xa7\xc3\xa3o'.decode('utf-8')
'Transaç\xe3o'
Expected value is:
'Transação'
Some more information about my environment:
>>> import sys
>>> print (sys.version)
3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)]
>>> sys.stdout.encoding
'cp437'
This was under Console 2 + Powershell.
You need to use a console or terminal that supports all of the characters that you want to print.
When printing in the interactive console, the characters are encoded to the correct codec for your console, with any character that is not supported using the backslashreplace error handler to keep the output readable rather than throw an exception. This is a feature of the default sys.displayhook() function:
If repr(value) is not encodable to sys.stdout.encoding with sys.stdout.errors error handler (which is probably 'strict'), encode it to sys.stdout.encoding with 'backslashreplace' error handler.
Your console can handle ç but not ã. There are several codecs that include the first character but not the last; you are using IBM codepage 437, but it is by no means the only one.
If you are running Python in the standard Windows console (cmd.exe) then be aware that Python, Unicode and that console do not mix very well. You can install the win-unicode-console package to make Python 3 use the Windows APIs to better output Unicode text; you'll need to make sure you have a font capable of displaying your Unicode text still.
I don't know for certain if that package is compatible with other Windows shells; your mileage may vary.
Related
What is the use case of encode/decode?
My understanding was that encode is used to convert string into byte string in order to be able to pass non ascii data across the program. And decode was to convert this byte string back into string.
But foll. examples shows non acsii characters getting successfully printed even if not encoded/decoded. Example:
val1="À È Ì Ò Ù Ỳ Ǹ Ẁ"
val2 = val1
print('val1 is: ',val2)
encoded_val1=val1.encode()
print('encoded_val1 is: ',encoded_val1)
decoded_encoded_val1=encoded_val1.decode()
print('decoded_encoded_val1 is: ',decoded_encoded_val1)
Output:
So what is the use case of encode and decode in python?
The environment you are working on may support those characters, in addition to that your terminal(or whatever you use to see output) may support displaying those characters. Some terminals/command lines or text editors may not support them. Apart from displaying issues, here are some actual reasons and examples:
1- When you transfer data over internet/network (eg with a socket), information is transferred as raw bytes. Non-ascii characters can not be represented by a single byte so we need a special representation for them (utf-16 or utf-8 with more than one byte). This is the most common reason I encountered.
2- Some text editors only supports utf-8. For example you need to represent your Ẁ character in utf-8 format in order to work with them. Reason for that is when dealing with text, people mostly used ASCII characters, which are just one byte. When some systems needed to be integrated with non-ascii characters people converted them to utf-8. Some people with more in-depth knowledge about text editors may give a better explanation about this point.
3- You may have a text written with unicode characters with some Chinese/Russian letters in it, and for some reason store it in your remote Linux server. But your server does not support letters from those languages. You need to convert your text to some strict format (utf-8 or utf-16) and store it in your server so you can recover them later.
Here is a little explanation of UTF-8 format. There are also other articles about the topic if you are interested.
Use utf-8 encoding because it's universal.
Set your code editor to utf-8 encoding and put at the top of all your python file: # coding: utf8
When you get an input (file, string...), it can have a different encoding then you have to get his encode type and decode it. Exemple in HTML file encode type is in meta balise.
If you change something in the HTML file and want to save it or send it by network, then you have to encode it in the encode type it was juste before.
Always use unicode for your string in python. (Automatic for python 3 but for python2.7 use the prefix u like u'Hi')
$ python2.7
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type('this is a string') # bits => encoded
<type 'str'>
>>> type(u'this is a string') # unicode => decoded
<type 'unicode'>
$ python3
Python 3.2.3 (default, Oct 19 2012, 20:10:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type("this is a string") # unicode => decoded
<class 'str'>
>>> type(b"this is a string") # bits => encoded
<class 'bytes'>
1 Use UTF8. Now. All over.
2 In your code, specify the file encoding and declare your strings as "unicode".
3 At the entrance, know the encoding of your data, and decode with decode ().
4 At the output, encode in the expected encoding by the system which will receive the data, or if you can not know it, in UTF8, with encode ().
This question already has answers here:
How should I write a Windows path in a Python string literal?
(5 answers)
Closed last year.
The folder I want to get to is called python and is on my desktop.
I get the following error when I try to get to it
>>> os.chdir('C:\Users\expoperialed\Desktop\Python')
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
You need to use a raw string, double your slashes or use forward slashes instead:
r'C:\Users\expoperialed\Desktop\Python'
'C:\\Users\\expoperialed\\Desktop\\Python'
'C:/Users/expoperialed/Desktop/Python'
In regular Python strings, the \U character combination signals an extended Unicode codepoint escape.
You can hit any number of other issues, for any of the other recognised escape sequences, such as \a, \t, or \x.
Note that as of Python 3.6, unrecognized escape sequences can trigger a DeprecationWarning (you'll have to remove the default filter for those), and in a future version of Python, such unrecognised escape sequences will cause a SyntaxError. No specific version has been set at this time, but Python will first use SyntaxWarning in the version before it'll be an error.
If you want to find issues like these in Python versions 3.6 and up, you can turn the warning into a SyntaxError exception by using the warnings filter error:^invalid escape sequence .*:DeprecationWarning (via a command line switch, environment variable or function call):
Python 3.10.0 (default, Oct 15 2021, 22:25:32) [Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import warnings
>>> '\expoperialed'
'\\expoperialed'
>>> warnings.filterwarnings('default', '^invalid escape sequence .*', DeprecationWarning)
>>> '\expoperialed'
<stdin>:1: DeprecationWarning: invalid escape sequence '\e'
'\\expoperialed'
>>> warnings.filterwarnings('error', '^invalid escape sequence .*', DeprecationWarning)
>>> '\expoperialed'
File "<stdin>", line 1
'\expoperialed'
^^^^^^^^^^^^^^^
SyntaxError: invalid escape sequence '\e'
This usually happens in Python 3. One of the common reasons would be that while specifying your file path you need "\\" instead of "\". As in:
filePath = "C:\\User\\Desktop\\myFile"
For Python 2, just using "\" would work.
f = open('C:\\Users\\Pooja\\Desktop\\trolldata.csv')
Use '\\' for python program in Python version 3 and above..
Error will be resolved..
All the three syntax work very well.
Another way is to first write
path = r'C:\user\...................' (whatever is the path for you)
and then passing it to os.chdir(path)
I had the same error.
Basically, I suspect that the path cannot start either with "U" or "User" after "C:\".
I changed my directory to "c:\file_name.png" by putting the file that I want to access from python right under the 'c:\' path.
In your case, if you have to access the "python" folder, perhaps reinstall the python, and change the installation path to something like "c:\python". Otherwise, just avoid the "...\User..." in your path, and put your project under C:.
I receive a server response, bytes:
\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4
\xd0\x9a\xd0\xa6\xd0\x91
This is for sure Cyrillic, but I'm not sure which encoding. Every attempt to decode it in Python fails:
b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
>>> b.decode('utf-8')
'\u0420\u0443\u0431\u043b\u0438 \u0420\u0424 \u041a\u0426\u0411'
>>> print(b.decode('utf-8'))
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4:
character maps to <undefined>
>>> b.decode('cp1251')
'\u0420\xa0\u0421\u0453\u0420±\u0420»\u0420\u0451 \u0420\xa0\u0420¤
\u0420\u0459\u0420¦\u0420\u2018'
>>> print(b.decode('cp1251'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u0420' in
position 0: character maps to <undefined>
Both results somewhat resemble Unicode-escape, but this does not work either:
>>> codecs.decode('\u0420\u0443\u0431\u043b\u0438 \u0420\u0424 \u041a\u0426\u0411',
'unicode-escape')
'Ð\xa0Ñ\x83бли Ð\xa0Ф Ð\x9aЦÐ\x91'
There's a web service for recovering Cyrillic texts, it is able to decode my bytes using Windows-1251:
Output (source encoding : WINDOWS-1251)
Рубли РФ КЦБ
But I don't have any more ideas as for how to approach it.
I think I'm missing something about how encoding works, so if the problem seems trivial to you, I would greatly appreciate a bit of explanation/a link to a tutorial/ some keywords for further googling.
Solution:
Windows PowerShell uses Windows-850 codepage by default, which is incapable of handling some Cyrillic characters. One fix is to change the codepage to Unicode every time starting the shell:
chcp 65001
Here is explained how to make it the new default
This is for sure Cyrillic, but I'm not sure which encoding.
This is UTF-8 (100%).
Python 3.4.3 (default, Mar 25 2015, 17:13:50)
Type "copyright", "credits" or "license" for more information.
IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
In [2]: s = b.decode('utf-8')
In [3]: print(s)
Рубли РФ КЦБ
Works fine for me. May be you have problem with your terminal or repl?
Try out this.
In [1]: s = "\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91"
In [11]: print s.decode('utf-8')
Рубли РФ КЦБ
To print or display some strings properly, they need to be decoded (Unicode strings).
There is a lot information with examples in standart Python library
Python 3:
>>> import sys
>>> print (sys.version)
3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2]
>>> b = b'\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91'
>>> b.decode('utf-8')
'Рубли РФ КЦБ'
I am adding UTF-8 data to a database in Django.
As the data goes into the database, everything looks fine - the characters (for example): “Hello” are UTF-8 encoded.
My MySQL database is UTF-8 encoded. When I examine the data from the DB by doing a select, my example string looks like this: ?Hello?. I assume this is showing the characters as UTF-8 encoded.
When I select the data from the database in the terminal or for export as a web-service, however - my string looks like this: \u201cHello World\u201d.
Does anyone know how I can display my characters correctly?
Do I need to perform some additional UTF-8 encoding somewhere?
Thanks,
Nick.
u'\u201cHello World\u201d'
Is the correct Python representation of the Unicode text “Hello World”. The smartquote characters are being displayed using a \uXXXX hex escape rather than verbatim because there are often problems with writing Unicode characters to the terminal, particularly on Windows. (It looks like MySQL tried to write them to the terminal but failed, resulting in the ? placeholders.)
On a terminal that does manage to correctly input and output Unicode characters, you can confirm that they're the same thing:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) [GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u201cHello World\u201d'==u'“Hello World”'
True
just as for byte strings, \x sequences are just the same as characters:
>>> '\x61'=='a'
True
Now if you've got \u or \x sequences escaping Python and making their way into an exported file, then you've done something wrong with the export. Perhaps you used repr() somewhere by mistake.
If I try to paste a unicode character such as the middle dot:
·
in my python interpreter it does nothing. I'm using Terminal.app on Mac OS X and when I'm simply in in bash I have no trouble:
:~$ ·
But in the interpreter:
:~$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
^^ I get nothing, it just ignores that I just pasted the character. If I use the escape \xNN\xNN representation of the middle dot '\xc2\xb7', and try to convert to unicode, trying to show the dot causes the interpreter to throw an error:
>>> unicode('\xc2\xb7')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
I have setup 'utf-8' as my default encoding in sitecustomize.py so:
>>> sys.getdefaultencoding()
'utf-8'
What gives? It's not the Terminal. It's not Python, what am I doing wrong?!
This question is not related to this question, as that indivdiual is able to paste unicode into his Terminal.
unicode('\xc2\xb7') means to decode the byte string in question with the default codec, which is ascii -- and that of course fails (trying to set a different default encoding has never worked well, and in particular doesn't apply to "pasted literals" -- that would require a different setting anyway). You could use instead u'\xc2\xb7', and see:
>>> print(u'\xc2\xb7')
·
since those are two unicode characters of course. While:
>>> print(u'\uc2b7')
슷
gives you a single unicode character (of some oriental persuasion -- sorry, I'm ignorant about these things). BTW, neither of these is the "middle dot" you were looking for. Maybe you mean
>>> print('\xc2\xb7'.decode('utf8'))
·
which is the middle dot. BTW, for me (python 2.6.4 from python.org on a Mac Terminal.app):
>>> print('슷')
슷
which kind of surprised me (I expected an error...!-).