Using unicode character u201c - python

I'm new to Python and am having problems understanding Unicode. I'm using
Python 3.4.
I've spent an entire day trying to figure this out by reading about unicode including http://www.fileformat.info/info/unicode/char/201C/index.htm and
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
I need to refer to special quotes because they are used in the text I'm analyzing. I did test that the W7 command window can read and write the 2 special quote characters.
To make things simple, I wrote a one line script:
print ('“') # that's the special quote mark in between normal single quotes
and get this output:
Traceback (most recent call last):
File "C:\Users\David\Documents\Python34\Scripts\wordCount3.py", line 1, in <module>
print ('\u201c')
File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 0: character maps to <undefined>
So how do I write something to refer to these two characters u201C and u201D?
Is this the correct encoding choice in the file open statement?
with open(fileIn, mode='r', encoding='utf-8', errors='replace') as f:

The reason is that in Python 3.x you can't just mix Unicode strings with byte strings. You have probably read manuals dealing with Python 2.x, where such things are possible as long as the byte string contains convertible characters.
print('\u201c', '\u201d')
works fine for me, so the only explanation is that you're using the wrong encoding for the source file or the terminal.
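To tell whether the terminal is the culprit, it can help to ask Python which encoding it uses for standard output and whether that encoding contains the two quote characters. The following is a minimal sketch; the helper name `can_encode` is made up for illustration, and the three code pages are only examples.

```python
import sys

def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

quotes = "\u201c\u201d"  # left and right double quotation marks

# The encoding print() uses for this terminal; it may be None
# when stdout is not a regular text stream.
print("stdout encoding:", getattr(sys.stdout, "encoding", None))

for codec in ("cp437", "cp1252", "utf-8"):
    print(codec, "can encode the quotes:", can_encode(quotes, codec))
```

If the reported stdout encoding is one of the code pages that cannot encode the quotes, the source file is fine and only the console output is failing.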
You can also explicitly tell Python which encoding your source file uses by putting the following line at the top of your source:
# -*- coding: utf-8 -*-
Added: it seems that you're working on a Windows machine. If so, you could change your console code page to UTF-8 by running
chcp 65001
before you start the Python interpreter. That change is temporary; if you want it to be permanent, import the following .reg file:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Console]
"CodePage"=dword:fde9

Related

Unicode file with python and fileinput

I am becoming more and more convinced that the business of file encodings is made as confusing as possible on purpose. I have a problem with reading a file in utf-8 encoding that contains just one line:
“blabla this is some text”
(note that the quotation marks are some fancy version of the standard quotation marks).
Now, I run this piece of Python code on it:
import fileinput
def charinput(paths):
    with open(paths) as fi:
        for line in fi:
            for char in line:
                yield char
i = charinput('path/to/file.txt')
for item in i:
    print(item)
with two results:
If I run my Python code from the command prompt, the result is some strange characters, followed by an error message:
ď
»
ż
â
Traceback (most recent call last):
File "krneki.py", line 11, in <module>
print(item)
File "C:\Python34\lib\encodings\cp852.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position
0: character maps to <undefined>
I get the idea that the problem comes from the fact that Python tries to read a "wrongly" encoded document, but is there a way to order fileinput.input to read utf-8?
EDIT: Some really weird stuff is happening and I have NO idea how any of it works. After saving the same file as before in notepad++, the python code now runs within IDLE and results in the following output (newlines removed):
“blabla this is some text”
while I can get the command prompt to not crash if I first input chcp 65001. Running the file then results in
Ä»żâ€śblabla this is some text ”
Any ideas? This is a horrible mess, if you ask me, but it is vital I understand it...
Encoding
Every file is encoded. The byte 0x4C is interpreted as latin capital letter L according to the ASCII encoding, but as less-than sign ('<') according to the EBCDIC encoding. There Ain't No Such Thing As Plain Text.
There are single byte character sets like ASCII that use a single byte to encode each symbol, there are double byte character sets like KS X 1001 that use two bytes to encode each symbol, and there are encodings like the popular UTF-8 that use a variable number of bytes per symbol.
UTF-8 has become the most popular encoding for new applications, so I'll give some examples: The Latin Capital Letter A is stored as a single byte: 0x41. The Left Double Quotation Mark (“) is stored as three bytes: 0xE2 0x80 0x9C. The emoji Pile of Poo is stored as four bytes: 0xF0 0x9F 0x92 0xA9.
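The byte sequences quoted above can be checked directly in Python 3. This is a small sketch; `utf8_hex` is a hypothetical helper name, and the output is kept ASCII-only so that it prints on any console.

```python
def utf8_hex(char):
    """Return the UTF-8 bytes of `char` as space-separated hex values."""
    return " ".join("0x%02X" % byte for byte in char.encode("utf-8"))

samples = [
    ("LATIN CAPITAL LETTER A", "A"),
    ("LEFT DOUBLE QUOTATION MARK", "\u201c"),
    ("PILE OF POO", "\U0001F4A9"),
]

for name, char in samples:
    print(name, "->", utf8_hex(char))
```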
Any program that reads a file and has to interpret the bytes as symbols has to know (or to guess) which encoding was used.
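A wrong guess is easy to reproduce. The sketch below assumes, based only on the characters in the output shown in the question, that the file started with a UTF-8 byte order mark and was decoded with the Windows code page cp1250; the question does not state which default encoding was actually in effect.

```python
# UTF-8 bytes for the start of the sample line, preceded by a byte order
# mark (the "utf-8-sig" codec writes the BOM bytes EF BB BF first).
raw = "\u201cblabla".encode("utf-8-sig")

right = raw.decode("utf-8-sig")  # correct guess: BOM stripped, quote intact
wrong = raw.decode("cp1250")     # assumed wrong guess: one character per byte

print(ascii(right))
print(ascii(wrong))

# The mis-decoded text contains U+20AC (euro sign), which cp852 cannot
# encode, so printing it on a cp852 console raises UnicodeEncodeError.
try:
    wrong.encode("cp852")
    cp852_ok = True
except UnicodeEncodeError:
    cp852_ok = False
print("cp852 can encode the mis-decoded text:", cp852_ok)
```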
If you are not familiar with Unicode or UTF-8 you might want to read http://www.joelonsoftware.com/articles/unicode.html
Reading Files in Python 3
Python 3's built-in function open() has an optional keyword argument encoding to support different encodings. To open a UTF-8 encoded file you can write open(filename, encoding="utf-8") and Python will take care of the decoding.
Also, the fileinput module supports encodings via the openhook keyword argument: fileinput.input(filename, openhook=fileinput.hook_encoded("utf-8")).
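Both approaches can be sketched as follows. The example uses the "utf-8-sig" codec rather than "utf-8" on the assumption that the file starts with a byte order mark, as files saved by some Windows editors do; "utf-8-sig" also reads files without one. The temporary file and the function names are made up for illustration.

```python
import fileinput
import os
import tempfile

LINE = "\u201cblabla this is some text\u201d\n"

def chars_with_open(path):
    """Yield characters using the built-in open() with an explicit encoding."""
    with open(path, encoding="utf-8-sig") as fi:
        for line in fi:
            yield from line

def chars_with_fileinput(path):
    """Yield characters using fileinput with an encoding-aware open hook."""
    hook = fileinput.hook_encoded("utf-8-sig")
    with fileinput.input(files=[path], openhook=hook) as fi:
        for line in fi:
            yield from line

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write a sample file that begins with a UTF-8 byte order mark.
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(LINE)
    via_open = "".join(chars_with_open(path))
    via_fileinput = "".join(chars_with_fileinput(path))

print(ascii(via_open))
print("both readers agree:", via_open == via_fileinput == LINE)
```

Reading the file correctly is separate from printing it: the decoded text can still fail to print on a console whose code page lacks the characters.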
If you are not familiar with Python and Unicode or UTF-8 you should read http://docs.python.org/3/howto/unicode.html
I also found some nice tricks in http://www.chirayuk.com/snippets/python/unicode
Reading Strings in Python 2
In Python 2 open() does not know about encodings. Instead you can use the codecs module to specify which encoding should be used: codecs.open(filename, encoding="utf-8")
The best source for Python2/Unicode enlightenment is http://docs.python.org/2/howto/unicode.html

urllib.parse.quote won't take utf8

I am trying to use urllib.parse.quote as intended but can't get it to work. I even tried the example given in the documentation
Example: quote('/El Niño/') yields '/El%20Ni%C3%B1o/'.
If I try this, the following happens:
quote('/El Niño/')
File "<stdin>", line 0
^
SyntaxError: 'utf-8' codec can't decode byte 0xf1 in position 13: invalid continuation byte
Anyone got a hint what is wrong? I am using Python 3.2.3
PS: Link to the docs http://docs.python.org/3.2/library/urllib.parse.html
\xf1 is a latin-1 encoded ñ
>>> print(b'\xf1'.decode("latin-1"))
ñ
..not a utf-8 encoded character, like Python 3 assumes by default:
>>> print(b'\xf1'.decode("utf-8"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: unexpected end of data
Meaning, there is an encoding issue either with the .py file you have written, or the terminal in which you are running the Python shell - it is supplying latin-1 encoded data to Python, not utf-8
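The difference between the two encodings, and its effect on quote(), can be seen without typing any non-ASCII character, which sidesteps the terminal problem. A minimal sketch, using the \u00f1 escape for ñ:

```python
from urllib.parse import quote

path = "/El Ni\u00f1o/"  # \u00f1 is the letter n with tilde

# The same character is one byte in Latin-1 and two bytes in UTF-8.
print("\u00f1".encode("latin-1"))
print("\u00f1".encode("utf-8"))

# quote() percent-encodes the UTF-8 bytes unless told otherwise.
quoted_utf8 = quote(path)
quoted_latin1 = quote(path, encoding="latin-1")
print(quoted_utf8)
print(quoted_latin1)
```

If this runs correctly while typing the literal ñ fails, the function is fine and the input encoding of the terminal or source file is the problem.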
Try adding the following lines at the beginning of your source code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os, sys
…
By default, Python 2 assumes that your source code is written in ASCII, so it can't read non-ASCII string literals from your source file unless an encoding is declared. Read PEP 263 on the topic.
Though, if you switch to python3, you don't need to place that coding: utf-8 comment after the shebang line, because utf-8 is the default.
Edit: Just noticed that you are actually trying to do python3, which should be utf-8-safe. Though looking at the error, it looks to me that you're actually executing python2 code whereas you think you are executing python3.
Is the shebang line correctly set?
Are you calling the script with the right interpreter?
Here's the right shebang line:
#!/usr/bin/env python3
or
#!/usr/bin/python3
and not just /usr/bin/python or /usr/bin/env python.
Can you add your full failing script, and the way you're calling it, to your question?

UnicodeEncodeError in Python on Windows Console

I'm having the following error while recursing the files in a directory and printing file names in the console:
Traceback (most recent call last):
File "C:\Program Files\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
53: character maps to <undefined>
According to the error, one of the characters in the file name string is \u2013 which is an EN DASH – character different from the commonly seen - minus character.
I have checked my Windows encoding, which is set to 437. I see two ways to work around this: either change the encoding of the Windows console, or convert the characters I get from the file names to suit the console encoding. How would I do that in Python 3.3?
Windows console is using cp437 encoding and there is a character \u2013 that isn't supported by that encoding. Try adding this to your code:
import io, sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, 'cp437', 'backslashreplace')
or convert the characters I get from the file names to suit the console encoding
Probably the console encoding is already correct (can't tell from the error message though). Code page 437 simply doesn't include that character so you won't be able to print it.
You can reopen stdout with a text encoder that has a fallback encoding, as demonstrated in iamsudip's answer which uses backslashreplace, to at least get readable (if not reliably recoverable) output instead of an error.
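The effect of such a fallback can be previewed on a single string without touching sys.stdout. A minimal sketch, with a made-up file name containing an en dash:

```python
name = "report \u2013 final.txt"  # hypothetical file name with an EN DASH

# Encode for the console code page, substituting an escape sequence or a
# question mark for anything the code page cannot represent, then decode
# again to get a string that is safe to print on that console.
escaped = name.encode("cp437", errors="backslashreplace").decode("cp437")
replaced = name.encode("cp437", errors="replace").decode("cp437")

print(escaped)
print(replaced)
```

The backslashreplace handler keeps the code point visible, while replace loses the information about which character was there.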
changing the encoding of Windows console
You can do this by executing the console command chcp 1252 before running Python, but that will still only give you a different limited repertoire of printable characters - including U+2013, but not many other Unicode characters.
In theory you can chcp to 65001 to get UTF-8 which would allow you to print any character. Unfortunately there are serious bugs in the C runtime's standard IO implementation, which usually make this unusable in practice.
This sorry state of affairs affects all applications that use the MS C runtime's stdio library calls, including Python and most other languages, with the result that Unicode on the Windows console just doesn't work in most cases.
If you really have to get Unicode out to the Windows console you can use the Win32 WriteConsoleW API directly using ctypes, but it's not much fun.

Python 2.7 Unicode Error within a function (using __future__ print_function and unicode_literals)

I've read some threads about unicode now.
I am using Python 2.7.2 but with the future print_function (because the raw print statement is quite confusing for me..)
So here is some code:
# -*- coding: L9 -*-
from __future__ import print_function, unicode_literals
now if I print things like
print("öäüߧ€")
it works perfectly.
However, and yes I am totally new to python, if I declare a function which shall print unicode strings it blows my script
def foo():
    print("öäü߀")
foo()
Traceback (most recent call last):
File "C:\Python27\test1.py", line 7, in <module>
foo()
File "C:\Python27\test1.py", line 5, in foo
print("÷õ³▀Ç")
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x80' in position 4: character maps to <undefined>
What's the best way to handle this error and unicode in general?
And should I stick with the 2.7 print statement instead?
I suspect that print("öäü߀".encode('L9')) will solve your problems.
This may help:
print(type(s1))
s1.encode('ascii',errors='ignore') #this works
s1.decode('ascii',errors='ignore') #this does not work
The reason is that s1.decode can't decode unicode directly so an explicit call to encode is first made, but without the errors='ignore' flag thus an error is raised
Whether you were issuing your commands from a file or from a Python prompt with Unicode support may explain why you get an error in the latter but not the former.
Console code pages use legacy "OEM" code pages for compatibility with old DOS console programs, while the rest of Windows uses updated code pages that support modern characters, but still differ by region. In your case the console uses cp850 and GUI programs use cp1252. cp850 doesn't support the Euro character, so Python raises an exception when trying to print the character on the console. You can run chcp 1252 before running your script if you need the Euro to work. Make sure the console font supports the character, though.
BTW, L9 != cp1252 either.
Are you sure printing from the console worked with a Euro? When I cut-and-paste your print, I get the following if the code page is 850, but it works after chcp 1252.
>>> print("öäüߧ€")
öäüߧ? # Note the ?
Encoding charts:
cp850
cp1252
L9 (aka ISO-8859-15)
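The differences between the three charts can be checked for the Euro sign in a few lines of Python 3. This is only a sketch; `encode_or_none` is a hypothetical helper, and the output is ASCII-only so it prints on any code page.

```python
EURO = "\u20ac"

def encode_or_none(text, codec):
    """Return the encoded bytes, or None if the codec lacks a character."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

for codec in ("cp850", "cp1252", "iso8859-15"):
    print(codec, encode_or_none(EURO, codec))

# With errors="replace" the unsupported character becomes "?".
print(EURO.encode("cp850", errors="replace"))

# The byte that cp1252 uses for the Euro is a control character in L9.
print(ascii(b"\x80".decode("iso8859-15")))
```

The last line may explain the u'\x80' in the traceback, on the assumption that the source file was saved as cp1252 while declaring L9.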

How can I display native accents to languages in console in windows?

print "Español\nPortuguês\nItaliano".encode('utf-8')
Errors:
Traceback (most recent call last):
File "", line 1, in
print "Español\nPortuguês\nItaliano".encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 4: ordinal not in range(128)
I'm trying to make a multilingual console program in Windows. Is this possible?
I've saved the file in utf-8 encoding as well, I get the same error.
EDIT:
I'm just outputting text in this program. I changed to the Lucida fonts, but I keep getting this:
Screenshot: http://img826.imageshack.us/img826/7312/foreignlangwindowsconso.png
I'm just looking for a portable way to correctly display foreign languages in the console in Windows. If it can be done cross-platform, even better. I thought UTF-8 was the answer, but all of you are telling me fonts etc. also play a part. So does anyone have a definitive answer?
Short answer:
# -*- coding: utf-8 -*-
print u"Español\nPortuguês\nItaliano".encode('utf-8')
The first line tells Python that your file is encoded in UTF-8 (your editor must use the same setting), and this line should always be at the beginning of your file.
Another thing is that Python 2 knows two different basestring objects - str and unicode. The u prefix will create such a unicode object instead of the default str object, which you can then encode as UTF-8 (but printing unicode objects directly should also work).
First of all, in Python 2.x you can't encode a str that has non-ASCII characters. You have to write
print u"Español\nPortuguês\nItaliano".encode('utf-8')
Using UTF-8 at the Windows console is difficult.
You have to set the Command Prompt font to a Unicode font (of which the only one available by default is Lucida Console), or else you get IBM437 encoding anyway.
Run chcp 65001.
Modify encodings._aliases to treat "cp65001" as an alias of UTF-8.
And even then, it doesn't seem to work right.
This works for me:
# coding=utf-8
print "Español\nPortuguês\nItaliano"
You might want to try running it using chcp 65001 && your_program.py. Also, try changing the command prompt font to Lucida Console.
