BeautifulSoup code works in IPython Notebook but not Eclipse - python

The following code works fine when run from Jupyter IPython notebook:
from bs4 import BeautifulSoup
xml_file_path = "<Path to XML file>"
s = BeautifulSoup(open(xml_file_path), "xml")
But it fails when creating the soup when run from Eclipse/PyDev (which uses the same Python interpreter):
Traceback (most recent call last):
File "~/parser/scratch.py", line 3, in <module>
s = BeautifulSoup(open(xml_file), "xml")
File "/anaconda/lib/python3.5/site-packages/bs4/__init__.py", line 175, in __init__
markup = markup.read()
File "/anaconda/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1812: ordinal not in range(128)
Python version: 3.5.2 (Anaconda 4.1.1)
BeautifulSoup: version 4
IPython Notebook version: 4.2.1
Eclipse version: Mars.2 Release (4.5.2)
PyDev version: 5.1.2.20160623256
Mac OS X: El Capitan 10.11.6
UPDATE:
The character in the file that causes the failure in Eclipse is �, but it causes no problems in IPython Notebook! If I remove this character from the XML file, the code works fine in Eclipse as well. Is there some setting in Eclipse I need to change so that the code won't fail on this (and possibly other such) characters?

I think you have to open the file with open(xml_file_path, 'rb') and specify the encoding yourself for things to work the same in both environments. Otherwise there is an implicit conversion from bytes to unicode, and it apparently uses a different default encoding depending on your environment (one in Eclipse, another in IPython).
Try doing:
with open(xml_file_path, 'rb') as stream:
    contents = stream.read()
contents.decode('utf-8')
Just to check if you're really able to decode it as utf-8 (i.e.: to check if that char is a valid utf-8 char).
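As a minimal sketch of that check, the helper below reports where decoding as UTF-8 fails, if it fails at all. The name find_utf8_error and the sample byte strings are made up for illustration; they are not taken from the question's XML file.

```python
def find_utf8_error(data):
    """Return None if data is valid UTF-8, else (offset, byte value) of the first bad byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


# Hypothetical samples: a properly encoded document, and one with a stray 0xEF byte.
valid_bytes = "<a>\u00ef</a>".encode("utf-8")
invalid_bytes = b"<a>\xef</a>"

valid_result = find_utf8_error(valid_bytes)
invalid_result = find_utf8_error(invalid_bytes)

print(valid_result)
print(invalid_result)
```

If the helper returns None for the real file, the bytes are valid UTF-8 and the remaining difference between Eclipse and IPython is only the default encoding that open() picks up from the environment.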

Related

Why doesn't Git Bash run Python code properly? [duplicate]

In test.py I have:
print('Привет мир')
Run from cmd, it does not raise an error, but it prints question marks:
> python test.py
?????? ???
Run from Git Bash, it fails with an error:
$ python test.py
Traceback (most recent call last):
File "test.py", line 2, in <module>
print('\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440')
File "C:\Users\raksa\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
Does anyone know why I get this error when I run Python code from Git Bash?
Python 3.6 directly uses the Windows API to write Unicode to the console, so it is much better at printing non-ASCII characters. But Git Bash isn't the standard Windows console, so Python falls back to the previous behavior: encoding Unicode strings in the terminal encoding (in your case, cp1252). cp1252 doesn't support Cyrillic, so it fails. This is "normal". You'll see the same behavior in Python 3.5 and older.
In the Windows console Python 3.6 should print the actual Cyrillic characters, so what is surprising is your "?????? ???". That is not "normal", but perhaps you don't have a font selected that supports Cyrillic. I have a couple of Python versions installed:
C:\>py -3.6 --version
Python 3.6.2
C:\>py -3.6 test.py
Привет мир
C:\>py -3.3 --version
Python 3.3.5
C:\>py -3.3 test.py
Traceback (most recent call last):
File "test.py", line 1, in <module>
print('\u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440 \u4f60\u597d')
File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
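The encoding step described above can be reproduced without any console at all. This is a small sketch, not what print() literally executes: it just encodes the same string with the two codecs in question.

```python
text = "Привет мир"

# cp1252 has no Cyrillic letters, so a strict encode raises UnicodeEncodeError.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent every Unicode character.
utf8_bytes = text.encode("utf-8")

# With errors="replace" the unsupported characters become "?" instead of raising.
replaced = text.encode("cp1252", errors="replace")

print(cp1252_ok)
print(utf8_bytes)
print(replaced)
```

The last line shows one way a string can end up as question marks: a lossy encode that replaces every character the target code page lacks.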
Had this problem with python 3.9
import sys, locale
print("encoding", sys.stdout.encoding)
print("local preferred", locale.getpreferredencoding())
print("fs encoding", sys.getfilesystemencoding())
If these print "cp1252" rather than "utf-8", then print() fails for any character that cp1252 cannot represent.
I fixed this by changing the Windows system locale:
Region settings > Additional settings > Administrative > Change system locale > Beta: Use Unicode UTF-8 for worldwide language support
Since Python 3.7 you can do
import sys
sys.stdout.reconfigure(encoding='utf-8')
This mostly fixes the git bash problem for me with Chinese characters. They still don't print correctly to standard out on the console, but it doesn't crash, and when redirected to a file the correct unicode characters are present.
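To see what reconfigure() changes without touching the real console, here is a sketch that uses an in-memory stream as a stand-in for sys.stdout. The io.BytesIO buffer and the initial cp1252 encoding are assumptions chosen to mimic the Git Bash situation.

```python
import io

# Stand-in for a stdout that was opened with the cp1252 code page.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="cp1252")

# Same call as sys.stdout.reconfigure(encoding='utf-8'), available since Python 3.7.
stream.reconfigure(encoding="utf-8")

stream.write("你好")
stream.flush()

written = buffer.getvalue()
print(written)
```

The bytes that reach the underlying buffer are UTF-8, which matches the observation that redirected output contains the correct characters even when the terminal itself renders them badly.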
Credit to sth in this answer.
Set the environment variable PYTHONUTF8=1, or
use the -Xutf8 command line option.

Why does "Save as UTF-8" in Eclipse fix the Python UnicodeEncodeError?

I have:
a file file.txt containing just one character: ♠, and UTF-8 encoded.
a CP-1252 encoded Python script test.py containing:
import codecs
text = codecs.open('file.txt', 'r', 'UTF-8').read()
print('text: {0}'.format(text))
When I run it in Eclipse 4.7.2 on Windows 7 SP1 x64 Ultimate and with Python 3.5.2 x64, I get the error message:
Traceback (most recent call last):
File "C:\eclipse-4-7-2-workspace\SEtest\test.py", line 3, in <module>
print('text: {0}'.format(text))
File "C:\programming\python\Python35-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2660' in position 6: character maps to <undefined>
My understanding is that the issue stems from the fact that on Microsoft Windows the Python interpreter uses CP-1252 as its default encoding and therefore has issues with the character ♠.
Also, I would note at that point that I kept Eclipse default encoding, which can be seen in Preferences > General > Workspace:
When I change the Python script test.py to:
import codecs
print(u'♠') # <--- adding this line is the only modification
text = codecs.open('file.txt', 'r', 'UTF-8').read()
print('text: {0}'.format(text))
then try to run it, I get the error message:
(note: Eclipse is configured to save the script whenever I run it).
After selecting the option Save as UTF-8, I get the same error message:
Traceback (most recent call last):
File "C:\Users\Francky\eclipse-4-7-2-workspace\SEtest\test.py", line 2, in <module>
print(u'\u2660')
File "C:\programming\python\Python35-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2660' in position 0: character maps to <undefined>
which I think is expected since the Python interpreter still uses CP-1252.
But if I run the script again in Eclipse without any modification, it works. The output is:
♠
text: ♠
Why does it work?
Python converts the text to be printed to the encoding of the console, which on Windows is the active code page (at least until version 3.6).
To avoid the UnicodeEncodeError you have to change the console encoding to UTF-8. There are several ways to do this, e.g. on the Windows command line by executing cmd /K chcp 65001.
In Eclipse, the encoding of the console can be set to UTF-8 in the run configuration (Run > Run Configurations...), in the Common tab.
The text file encoding settings in Window > Preferences: General > Workspace and in Project > Properties: Resource only tell text editors how to read and display text files.
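The dependence on the output encoding can be reproduced outside Eclipse. The sketch below runs the same one-line script twice in a child interpreter and forces the output encoding through the PYTHONIOENCODING environment variable; using that variable as a stand-in for the console setting is an assumption made for the demonstration, and run_with_encoding is a made-up helper name.

```python
import os
import subprocess
import sys


def run_with_encoding(encoding):
    """Run print('\\u2660') in a child interpreter whose stdout uses the given encoding."""
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    return subprocess.run(
        [sys.executable, "-c", "print('\\u2660')"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


cp1252_run = run_with_encoding("cp1252")
utf8_run = run_with_encoding("utf-8")

print(cp1252_run.returncode)   # non-zero: the child raised UnicodeEncodeError
print(utf8_run.stdout)         # UTF-8 bytes of the spade character plus a newline
```

The script source is identical in both runs; only the encoding of the output stream differs.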

Opening an UTF-8 encoded file by a Python 3.6 script running in PyCharm 2016.3.2

I have quite an odd problem with PyCharm and a Python app that I am working on.
PyCharm is PyCharm Community Edition 2016.3.2
The project interpreter is: 3.6.0
OS is MacOS Sierra
I have been googling for a solution for some time and none of the proposed ideas helps, so I want to ask here.
I want to open a UTF-8 encoded file using the following code:
#!/usr/bin/env python3
import os, platform
def read(file):
    f = open(file, "r")
    content = f.read()
    f.close()
    return content
print(platform.python_version())
print(os.environ["PYTHONIOENCODING"])
content = read("testfile")
print(content)
The code crashes when run in PyCharm. The output is
3.6.0
UTF-8
Traceback (most recent call last):
File "/Users/xxx/Documents/Scripts/pycharmutf8/file.py", line 14, in <module>
content = read("testfile")
File "/Users/xxx/Documents/Scripts/pycharmutf8/file.py", line 7, in read
content = f.read()
File "/usr/local/Cellar/python3/3.6.0_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
When I run the identical code from command line, it works just fine:
./file.py
3.6.0
utf-8:surrogateescape
I am a file with evil unicode characters: äöü
I found that in comparable situations people are advised to set the environment variable PYTHONIOENCODING to utf-8:surrogateescape. I did that system-wide (as you can see in the output above):
export PYTHONIOENCODING=utf-8:surrogateescape
and also in PyCharm itself (Settings -> Build -> Console -> Python Console -> Environment variables).
This does not have any effect. Do you have further suggestions?
If it's hard to change the encoding passed to open(), e.g. because the call happens inside a library, you can set this environment variable in the run configuration: LC_CTYPE=en_US.UTF-8
Source:
PyCharm is changing the default encoding in my Django app
If you want to read a UTF-8 file, specify the encoding:
def read(file):
    with open(file, encoding='utf8') as f:
        content = f.read()
    return content
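A self-contained sketch of the difference, using a temporary file in place of the question's testfile. The ascii case is forced explicitly here to imitate what happens when the environment makes ASCII the default; that is an assumption about the PyCharm setup, not something verified.

```python
import os
import tempfile

sample = "I am a file with evil unicode characters: äöü"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "testfile")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Explicit encoding: the result no longer depends on locale or IDE settings.
    with open(path, encoding="utf-8") as f:
        utf8_result = f.read()

    # Forcing ascii reproduces the kind of failure shown in the traceback.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

print(utf8_result == sample)
print(type(ascii_error).__name__)
```

Note that PYTHONIOENCODING only affects stdin, stdout and stderr, not the default encoding used by open(), which is why setting it does not change the outcome of f.read().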

Pip install raises UnicodeDecodeError on Windows. Fix?

When trying to install mysql-python on my Windows 10 machine I get the following error:
File "<string>", line 1, in <module>
File "C:\Users\LUCAFL~1\AppData\Local\Temp\pip-build-3u7aih0l\mysql-python\setup.py", line 21, in <module>
setuptools.setup(**metadata)
File "c:\program files (x86)\python35-32\lib\distutils\core.py", line 148, in setup
dist.run_commands()
...
File "c:\program files (x86)\python35-32\lib\subprocess.py", line 1055, in communicate
stdout = self.stdout.read()
File "c:\program files (x86)\python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1716: character maps to <undefined>
I tried installing other packages and received the same error on almost every one (one exception being pymysql). All of these packages were big and had dependencies. I guess that the big ones create temporary data in my user directory's AppData folder. As you can see, the ü is not properly decoded (ü being byte 0x81). It's always a German umlaut that produces the error (mainly ü, as it's part of my user folder's name).
I googled for the last 2 hours and found a lot of people having the same problem, but mostly they were opening GitHub tickets or discussing the problem for Ubuntu/Fedora/OSX, etc. A couple of times I read that the standard encoding under Windows is cp1252, which causes the problem. Can I somehow force my Windows console to use UTF-8 for this session and then run pip with that?
Please don't recommend renaming my user folder. It's not easily done under Windows 10 and I don't want to reinstall Windows just because of Python.
My setup: Windows 10, Python 3.5.1, pip 8.0.3
Can you try the following and see if it works? Replace the path below with your actual path.
I am not able to reproduce this on my Windows laptop.
import sys
import subprocess
# Python 2 only: reload() restores sys.setdefaultencoding, which is removed at startup.
reload(sys) # Reload may do the trick!
sys.setdefaultencoding('UTF8')
# Popen (not call) returns a process object that has communicate().
theproc = subprocess.Popen(['C:\\Python27\\Scripts\\pip.exe', 'install', 'mysql-python'])
theproc.communicate()
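The decoding failure in the traceback can be reproduced in isolation under Python 3. This sketch assumes the child process wrote its output in an OEM code page such as cp850, where byte 0x81 is ü; which code page the console actually used is not stated in the question.

```python
raw = b"\x81"  # the byte named in the traceback

# cp1252 leaves 0x81 undefined, so a strict decode fails.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Assumption: the bytes were produced in cp850, where 0x81 is the letter u with umlaut.
oem_text = raw.decode("cp850")

# A lenient decode does not raise, but it loses the character.
lenient_text = raw.decode("cp1252", errors="replace")

print(type(cp1252_error).__name__)
print(ascii(oem_text))
print(ascii(lenient_text))
```

In other words, the error comes from reading the subprocess output with a codec that does not match the one it was written in, independent of the package being installed.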

Python webbrowser platform specific unicode error on osx

I am developing a cross-platform script on a Windows 7, Python 2.7 computer. The script will be also used on a MacOSX computer with Python 2.7 installed.
The following script is working perfectly on my Windows computer, however when I run it on the Mac, I get a unicode error.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import webbrowser
webbrowser.open(u"http://www.google.fr?q=testéè")
Here is the error:
Mac-mini-de-paul:paul paul$ python testUnicode.py
Traceback (most recent call last):
File "testUnicode.py", line 6, in <module>
webbrowser.open(u"http://www.google.fr?q=testéè")
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/webbrowser.py", line 62, in open
if browser.open(url, new, autoraise):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/webbrowser.py", line 637, in open
osapipe.write(script)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 42-43: ordinal not in range(128)
I don't really understand what's the problem here, Python's base functions are supposed to deal properly with unicode filenames, aren't they?
Note:
I saw this question, but it did not help me and the OP is not having any error: IMO not a duplicate
Try to manually encode to utf-8:
webbrowser.open(u"http://www.google.fr?q=testéè".encode('utf-8'))
or pass a plain byte string instead of a unicode literal, since the file declares its encoding:
#!/usr/bin/python
# -*- coding: utf-8 -*-
...
webbrowser.open("http://www.google.fr?q=testéè")
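A different approach from the two above, shown here for Python 3 rather than the Python 2.7 used in the question, is to percent-encode the query so that the URL contains only ASCII characters before it reaches webbrowser. This is a sketch of that alternative, not a fix confirmed for the Mac setup in the question.

```python
from urllib.parse import quote

query = "testéè"

# quote() encodes the text as UTF-8 and percent-escapes every non-ASCII byte.
url = "http://www.google.fr?q=" + quote(query)

print(url)

# The resulting string is pure ASCII, so it can be handed to the browser safely:
# import webbrowser
# webbrowser.open(url)
```

The call to webbrowser.open is left commented out so the example can be run without launching a browser.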
