urllib.parse.quote won't take utf8 - python

I am trying to use urllib.parse.quote as intended but can't get it to work. I even tried the example given in the documentation:
Example: quote('/El Niño/') yields '/El%20Ni%C3%B1o/'.
If I try this, the following happens:
quote('/El Niño/')
File "<stdin>", line 0
^
SyntaxError: 'utf-8' codec can't decode byte 0xf1 in position 13: invalid continuation byte
Anyone got a hint what is wrong? I am using Python 3.2.3
PS: Link to the docs http://docs.python.org/3.2/library/urllib.parse.html

\xf1 is a latin-1 encoded ñ
>>> print(b'\xf1'.decode("latin-1"))
ñ
...not a UTF-8 encoded character, which is what Python 3 assumes by default:
>>> print(b'\xf1'.decode("utf-8"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: unexpected end of data
This means there is an encoding issue either with the .py file you have written or with the terminal in which you are running the Python shell: it is supplying latin-1 encoded data to Python, not UTF-8.
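To check both encodings without depending on how your terminal or editor encodes ñ, here is a minimal sketch that writes the character as the escape \u00f1. It assumes Python 3 and relies only on urllib.parse.quote and its documented encoding parameter; the variable names are just for illustration.

```python
from urllib.parse import quote

# '\u00f1' is ñ, written as an escape so the source encoding does not matter.
path = '/El Ni\u00f1o/'

# Default behaviour: the text is encoded as UTF-8 before percent-encoding.
utf8_quoted = quote(path)

# For comparison: the same text percent-encoded from its latin-1 bytes.
latin1_quoted = quote(path, encoding='latin-1')

print(utf8_quoted)    # /El%20Ni%C3%B1o/
print(latin1_quoted)  # /El%20Ni%F1o/
```

If this works but typing ñ directly fails, the terminal or file encoding is the likely culprit rather than quote itself.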

Try adding the following lines at the beginning of your source code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os, sys
…
By default, Python 2 assumes that your source code is written in ASCII, so it can't read non-ASCII string literals from your source file. Read PEP 263 on the topic.
If you switch to Python 3, though, you don't need to place that coding: utf-8 comment after the shebang line, because UTF-8 is the default.
Edit: I just noticed that you are actually using Python 3, which should be UTF-8-safe. Looking at the error, though, it seems to me that you may be executing the code under Python 2 while thinking you are executing it under Python 3.
Is the shebang line correctly set?
Are you calling the script with the right interpreter?
Here's the right shebang line:
#!/usr/bin/env python3
or
#!/usr/bin/python3
and not just /usr/bin/python or /usr/bin/env python.
Can you add your full failing script, and the way you're calling it, to your question?
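One way to answer the "right interpreter" question is to have the script report which interpreter is running it. This is only a sketch; the function name interpreter_summary is made up for illustration, and it uses nothing beyond the standard sys module.

```python
import sys

def interpreter_summary():
    """Return the path and major.minor version of the running interpreter."""
    version = "{}.{}".format(sys.version_info[0], sys.version_info[1])
    return sys.executable, version

executable, version = interpreter_summary()
# Written so that it runs unchanged under both Python 2 and Python 3.
print("%s %s" % (executable, version))
```

Putting these lines at the top of the failing script should show whether the shebang line or the command you typed picked the interpreter you expected.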

Related

Change python 3.7 default encoding from cp1252 to cp65001 aka UTF-8

I need to change Python's encoding from Windows-1252 to UTF-8. I am using Python 3.7.1, Atom, and the Atom script package for terminal.
I have read about PEP 540 (Add a new UTF-8 Mode). Is it a solution to this? I don't know how to implement it or whether it is useful. I cannot find a sound resolution.
Currently it cannot handle '\u2705' or others. When checking the Python file directory I found
...Python\Python37\lib\encodings\cp1252.py
# When I run
import locale
import sys
print(sys.getdefaultencoding())
print(locale.getpreferredencoding())
# I get
utf-8
cp1252
[Finished in 0.385s]
# Error for print('\u2705')
Traceback (most recent call last):
File "C:\Users\en4ijjp\Desktop\junk.py", line 7, in <module>
print('\u2705').decode('utf-8')
File "C:\Users\en4ijjp\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2705' in
position 0: character maps to <undefined>
[Finished in 0.379s]
I expect my terminal to handle the characters and display them when using print().
This is resolved by putting the following at the top of the Python script. With it, I am able to print all characters without error.
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = 'utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.detach(), encoding = 'utf-8')
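On Python 3.7 and later there may be a lighter alternative: io.TextIOWrapper gained a reconfigure() method in 3.7, so the encoding can be changed in place instead of re-wrapping the stream. The sketch below demonstrates this on an in-memory stand-in for a cp1252 console; the helper name use_utf8 is hypothetical, and it assumes the stream is an ordinary TextIOWrapper.

```python
import io

def use_utf8(stream):
    """Switch a text stream to UTF-8 in place (hypothetical helper name)."""
    stream.reconfigure(encoding='utf-8')
    return stream

# Stand-in for a cp1252 console stream, backed by an in-memory buffer.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding='cp1252')
use_utf8(fake_stdout)

fake_stdout.write('\u2705')
fake_stdout.flush()

written = fake_stdout.buffer.getvalue()
print(written)  # b'\xe2\x9c\x85'
```

In a real script the calls would be use_utf8(sys.stdout) and use_utf8(sys.stderr). The UTF-8 mode from PEP 540 is a separate option in 3.7: it is enabled with the -X utf8 command-line option or the PYTHONUTF8=1 environment variable.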

Python Notepad++ encoding error

I am using Python for the first time and am running into an encoding error that I can't seem to get around. Here is the code:
#!/usr/bin/python
#-*- coding: utf -*-
import pandas as pd
a = "C:\Users"
print(a)
When I do this, I get:
File "C:\Users\Public\Documents\Python Scripts\ImportExcel.py", line 5
a = "C:\Users"
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
In Notepad++ I have tried all of the encoding options. Nothing seems to work.
Any suggestions?
Specifically, the problem is that '\' is an escape character, and \U starts an eight-hex-digit Unicode escape (\UXXXXXXXX) that \Users does not complete.
If you want to print the string
"C:\Users"
then you have to do it thus:
a = "C:\\Users"
Hope this helps.
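For comparison, here is a small sketch of three common ways to write the same Windows path in Python 3. The variable names are only for illustration.

```python
escaped = "C:\\Users"   # backslash written as an escape sequence
raw = r"C:\Users"       # raw string: backslashes are taken literally
forward = "C:/Users"    # forward slashes, which Windows generally accepts in paths

print(escaped)  # C:\Users
print(raw)      # C:\Users
print(forward)  # C:/Users
```

The first two produce the identical eight-character string; the raw-string form is usually the least error-prone for long paths.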
The error message suggests you're on a Windows machine, but you're using *nix notation for #!/usr/bin/python. That line should look something like #!C:\Python33\python.exe on a Windows machine, depending on where you've installed Python.
Use this: # -*- coding: utf-8 -*- instead of #-*- coding: utf -*-
You can set the encoding in Notepad++, but you also need to tell Python about it.
In legacy Python (2.7), source code is ASCII unless specified otherwise. In Python 3, source code is UTF-8 unless otherwise specified.
You should use the following as the first or second line of the file to specify the encoding of the source code. The documentation gives:
# -*- coding: <encoding> -*-
This is the format originally from the Emacs editor, but according to PEP 263 you can also use:
# vim: set fileencoding=<encoding>:
or even:
# coding=<encoding>
Where <encoding> can be any encoding that Python supports, but utf-8 is generally a good choice for portable code.
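If you want to see which encoding Python 3 would pick up from a given declaration, the standard tokenize.detect_encoding function can be pointed at the raw bytes of a source file. This is a sketch only; the wrapper name declared_encoding is made up for illustration, and the byte strings stand in for real files.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the source encoding Python 3 detects for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

emacs_style = declared_encoding(b"# -*- coding: utf-8 -*-\nx = 1\n")
vim_style = declared_encoding(b"# vim: set fileencoding=latin-1 :\nx = 1\n")
no_cookie = declared_encoding(b"x = 1\n")

print(emacs_style)  # utf-8
print(vim_style)    # iso-8859-1 (the normalized name for latin-1)
print(no_cookie)    # utf-8 (the Python 3 default)
```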

Using unicode character u201c

I'm new to Python and am having problems understanding Unicode. I'm using
Python 3.4.
I've spent an entire day trying to figure this out by reading about unicode including http://www.fileformat.info/info/unicode/char/201C/index.htm and
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
I need to refer to special quotes because they are used in the text I'm analyzing. I did test that the W7 command window can read and write the 2 special quote characters.
To make things simple, I wrote a one line script:
print ('“') # that's the special quote mark in between normal single quotes
and get this output:
Traceback (most recent call last):
File "C:\Users\David\Documents\Python34\Scripts\wordCount3.py", line 1, in <module>
print ('\u201c')
File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 0: character maps to <undefined>
So how do I write something to refer to these two characters u201C and u201D?
Is this the correct encoding choice in the file open statement?
with open(fileIn, mode='r', encoding='utf-8', errors='replace') as f:
The reason is that in Python 3.x you can't just mix Unicode strings with byte strings. You have probably read manuals dealing with Python 2.x, where such things are possible as long as the byte string contains convertible characters.
print('\u201c', '\u201d')
works fine for me, so the only explanation is that you're using the wrong encoding for the source file or the terminal.
You may also explicitly tell Python which encoding your source file uses by putting the following line at the top of your source:
# -*- coding: utf-8 -*-
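As a sketch of how the two quote characters can be referred to and read from a file without depending on the console, the example below uses the escapes \u201c and \u201d and the same open() arguments as in the question. The temporary file and the variable names are assumptions made for the demonstration; ascii() is used for printing because its output contains only ASCII characters.

```python
import os
import tempfile

LEFT_QUOTE = '\u201c'
RIGHT_QUOTE = '\u201d'
sample = LEFT_QUOTE + 'quoted text' + RIGHT_QUOTE

# Write a small test file, then read it back with an explicit encoding.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write(sample)
with open(path, mode='r', encoding='utf-8', errors='replace') as f:
    text = f.read()

quote_count = text.count(LEFT_QUOTE) + text.count(RIGHT_QUOTE)

# cp437 has no curly quotes; errors='replace' substitutes '?' instead of raising.
fallback = LEFT_QUOTE.encode('cp437', errors='replace')

print(quote_count, ascii(text), fallback)
```

Counting and comparing the characters works regardless of the console; only printing them directly requires a console encoding that contains them.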
Added: it seems that you're working on a Windows machine. If so, you could change your console code page to UTF-8 by running
chcp 65001
before you start your Python interpreter. That change is temporary; if you want it to be permanent, import the following .reg file:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Console]
"CodePage"=dword:fde9

Python 2.7 Unicode Error within a function (using __future__ print_function and unicode_literals)

I've read some threads about unicode now.
I am using Python 2.7.2 but with the future print_function (because the raw print statement is quite confusing for me..)
So here is some code:
# -*- coding: L9 -*-
from __future__ import print_function, unicode_literals
now if I print things like
print("öäüߧ€")
it works perfectly.
However, and yes, I am totally new to Python, if I declare a function which prints unicode strings, it blows up my script:
def foo():
print("öäü߀")
foo()
Traceback (most recent call last):
File "C:\Python27\test1.py", line 7, in <module>
foo()
File "C:\Python27\test1.py", line 5, in foo
print("÷õ³▀Ç")
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x80' in position 4: character maps to <undefined>
What's the best way to handle this error and unicode in general?
And should I stick with the 2.7 print statement instead?
I suspect that print("öäü߀".encode('L9')) will solve your problems.
This may help:
print(type(s1))
s1.encode('ascii',errors='ignore') #this works
s1.decode('ascii',errors='ignore') #this does not work
The reason is that s1.decode can't decode a unicode object directly, so an implicit call to encode is made first, but without the errors='ignore' flag, and thus an error is raised.
Whether you were issuing your commands from a file or from a Python prompt with Unicode support may explain why you get an error in the latter but not the former.
Console code pages use legacy "OEM" code pages for compatibility with old DOS console programs, while the rest of Windows uses updated code pages that support modern characters but still differ by region. In your case the console uses cp850 and GUI programs use cp1252. cp850 doesn't support the Euro character, so Python raises an exception when trying to print the character on the console. You can run chcp 1252 before running your script if you need the Euro to work. Make sure the console font supports the character, though.
BTW, L9 != cp1252 either.
Are you sure printing from the console worked with a Euro? When I cut-and-paste your print, I get the following if the code page is 850, but it works after chcp 1252.
>>> print("öäüߧ€")
öäüߧ? # Note the ?
Encoding charts:
cp850
cp1252
L9 (aka ISO-8859-15)
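The differences between the three charts can also be checked from Python. The sketch below is written for Python 3 (the thread itself is about 2.7) and uses a hypothetical helper, try_encode. The last lines show one possible reading of the u'\x80' in the traceback: if the file was saved as cp1252 but declared as L9, the Euro byte 0x80 would be decoded as the control character U+0080. That is an assumption, not something the question confirms.

```python
EURO = '\u20ac'

def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

for codec in ('cp850', 'cp1252', 'iso8859-15'):
    print(codec, try_encode(EURO, codec))

# Byte 0x80 is the Euro in cp1252 but a control character under ISO-8859-15 (L9).
misread = b'\x80'.decode('iso8859-15')
print(repr(misread), try_encode(misread, 'cp850'))
```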

Python 2.7 decode error using UTF-8 header: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

Traceback:
Traceback (most recent call last):
File "venues.py", line 22, in <module>
main()
File "venues.py", line 19, in main
print_category(category, 0)
File "venues.py", line 13, in print_category
print_category(subcategory, ident+1)
File "venues.py", line 10, in print_category
print u'%s: %s' % (category['name'].encode('utf-8'), category['id'])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Code:
# -*- coding: utf-8 -*-
# Using https://github.com/marcelcaraciolo/foursquare
import foursquare
# Prints categories and subcategories
def print_category(category, ident):
for i in range(0,ident):
print u'\t',
print u'%s: %s' % (category['name'].encode('utf-8'), category['id'])
for subcategory in category.get('categories', []):
print_category(subcategory, ident+1)
def main():
client = foursquare.Foursquare(client_id='id',
client_secret='secret')
for category in client.venues.categories()['categories']:
print_category(category, 0)
if __name__ == '__main__':
main()
The trick is to keep all your string processing in the source completely Unicode. Decode to Unicode when reading input (files/pipes/console) and encode when writing output. If category['name'] is Unicode, keep it that way (remove .encode('utf-8')).
Also, per your comment:
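The decode-on-input, encode-on-output pattern can be sketched as follows. This is written for Python 3, where the same principle applies; the function name shout and its parameters are hypothetical and only illustrate where the two conversions belong.

```python
def shout(raw_bytes, input_encoding='utf-8', output_encoding='utf-8'):
    """Decode at the boundary, work on text, encode at the boundary."""
    text = raw_bytes.decode(input_encoding)   # bytes -> text on the way in
    text = text.upper()                       # all processing happens on text
    return text.encode(output_encoding)       # text -> bytes on the way out

result = shout(b'caf\xc3\xa9')
print(result)  # b'CAF\xc3\x89'
```

Because the uppercase step runs on decoded text, the accented letter is handled as one character rather than as two unrelated bytes.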
However, the error
still occurs when I try to do: python venues.py > categories.txt, but
not when output goes to the terminal: python venues.py
Python can usually determine the terminal encoding and will automatically encode to that encoding, which is why writing to the terminal works. If you use shell redirection to output to a file, you need to tell Python the I/O encoding you want via an environment variable, for example:
set PYTHONIOENCODING=utf8
python venues.py > categories.txt
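The effect of PYTHONIOENCODING on redirected output can be observed by running a child interpreter with its stdout captured. This is a sketch for Python 3; the helper name run_child is made up, and the child simply prints the two characters ü and é.

```python
import os
import subprocess
import sys

def run_child(io_encoding):
    """Run a child Python with PYTHONIOENCODING set and return its raw stdout bytes."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    code = "print('\\u00fc\\u00e9')"
    completed = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return completed.stdout.strip()

as_utf8 = run_child('utf-8')
as_cp437 = run_child('cp437')
print(as_utf8)   # b'\xc3\xbc\xc3\xa9'
print(as_cp437)  # b'\x81\x82'
```

The same text comes out as different bytes depending on the variable, which is what decides the contents of categories.txt when the output is redirected.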
Working example, using my US Windows console, which uses the cp437 encoding. The source code is saved as "UTF-8 without BOM". It's worth pointing out that the source code bytes are UTF-8, but declaring the source encoding and using a Unicode string allows Python to decode the source correctly and to encode the print output automatically to the terminal using its default encoding:
#coding:utf8
import sys
print sys.stdout.encoding
print u'üéâäàåçêëèïîì'
Here Python uses the default terminal encoding, but when redirected, does not know what the encoding is, so defaults to ascii:
C:\>python example.py
cp437
üéâäàåçêëèïîì
C:\>python example.py >out.txt
Traceback (most recent call last):
File "example.py", line 4, in <module>
print u'üéâäàåçêëèïîì'
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-12: ordinal not in range(128)
C:\>type out.txt
None
Since we're using shell redirection, use a shell variable to tell Python what encoding to use:
C:\>set PYTHONIOENCODING=cp437
C:\>python example.py >out.txt
C:\>type out.txt
cp437
üéâäàåçêëèïîì
We can also force Python to use another encoding, but in this case the terminal doesn't know how to display UTF-8. The terminal is still decoding the bytes in the file using cp437:
C:\>set PYTHONIOENCODING=utf8
C:\>python example.py >out.txt
C:\>type out.txt
utf8
├╝├⌐├ó├ñ├á├Ñ├º├¬├½├¿├»├«├¼
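What a cp437 terminal makes of UTF-8 bytes can be reproduced in Python 3 by encoding with one codec and decoding with the other. A minimal sketch, using only the first three characters of the example and ascii() for printing so that it runs on any console:

```python
text = '\u00fc\u00e9\u00e2'              # the characters ü, é, â written as escapes

utf8_bytes = text.encode('utf-8')        # what ends up in out.txt
mojibake = utf8_bytes.decode('cp437')    # what a cp437 console shows for those bytes

# The bytes themselves are intact: reversing the two steps restores the text.
recovered = mojibake.encode('cp437').decode('utf-8')

print(ascii(mojibake))
print(recovered == text)  # True
```

Each two-byte UTF-8 sequence shows up as two box-drawing or accented characters, which is why the file looks garbled in the console even though its contents are valid UTF-8.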
I'm not sure, but I think the culprit is the "u" character at the start of u"%s: %s". This is assuming that what you want to print is a byte string and not a unicode string --- which would be reasonable(*): you output bytes, suitably encoded. Modified like this:
print '%s: %s' % (category['name'].encode('utf-8'), category['id'])
this would turn the unicode string category['name'] to a UTF-8 byte string, and then the rest of the processing is done with byte strings.
(*) It is reasonable in one point of view; another point of view is to print unicode strings and let the environment decide how it should be encoded, but then you're at the mercy of several factors that you don't really control. That's why you see differences between the output going to the terminal or to a file. To avoid all these issues, just print byte strings.
