How to convert html entities into symbols?

How to convert html entities into symbols? - python

I have made some adaptations to the script from this answer. and I am having problems with unicode. Some of the questions end up being written poorly.
Some answers and responses end up looking like:
Yeah.. I know.. I’m a simpleton.. So what’s a Singleton? (2)
How can I make the ’ to be translated to the right character?
Note: If that matters, I'm using python 2.6, on a French windows.
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
EDIT1: Based on Ryan Ginstrom's post, I have been able to correct a part of the output, but I am having problems with python's unicode.
In Idle / python shell:
Yeah.. I know.. Iâ€™m a simpleton.. So
whatâ€™s a Singleton?
In a text file, when redirecting stdout
Yeah.. I know.. I’m a simpleton.. So
what’s a Singleton?
How can I correct that ?
Edit2: I have tried Jarret Hardie's solution but it didn't do anything.
I am on windows, using python 2.6, so my site-packages folder is at:
C:\Python26\Lib\site-packages
There was no siteconfig.py file, so I created one, pasted the code provided by Jarret Hardie, started a python interpreter, but seems like it has not been loaded.
sys.getdefaultencoding()
'ascii'
I noticed there is a site.py file at :
C:\Python26\Lib\site.py
I tried changing the encoding in the function
def setencoding():
"""Set the string encoding used by the Unicode implementation. The
default is 'ascii', but if you're willing to experiment, you can
change this."""
encoding = "ascii" # Default value set by _PyUnicode_Init()
if 0:
# Enable to support locale aware default string encodings.
import locale
loc = locale.getdefaultlocale()
if loc[1]:
encoding = loc[1]
if 0:
# Enable to switch off string to Unicode coercion and implicit
# Unicode to string conversion.
encoding = "undefined"
if encoding != "ascii":
# On Non-Unicode builds this will raise an AttributeError...
sys.setdefaultencoding(encoding) # Needs Python Unicode build !
to set the encoding to utf-8. It worked (after a restart of python of course).
>>> sys.getdefaultencoding()
'utf-8'
The sad thing is that it didn't correct the caracters in my program. :(

You should be able to convert HTMl/XML entities into Unicode characters. Check out this answer in SO:
Decoding HTML Entities With Python
Basically you want something like this:
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
convertEntities=BeautifulStoneSoup.ALL_ENTITIES)

Does changing your default encoding in siteconfig.py work?
In your site-packages file (on my OS X system it's in /Library/Python/2.5/site-packages/) create a file called siteconfig.py. In this file put:
import sys
sys.setdefaultencoding('utf-8')
The setdefaultencoding method is removed from the sys module once siteconfig.py is processed, so you must put it in site-packages so that Python will read it when the interpreter starts up.

Related

pprint: UnicodeEncodeError: 'ascii' codec can't encode character

This is driving me crazy. I'm trying to pprint a dict with an é char, and it throws me out.
I'm using Python 3:
from pprint import pprint
knights = {'gallahad': 'the pure', 'robin': 'the bravé'}
pprint (knights)
Error:
File "/data/prod_envs/pythons/python36/lib/python3.6/pprint.py", line 176, in _format
stream.write(rep)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 43: ordinal not in range(128)
I read up on the Python ASCII doc, but there does not seem a quick way to solve this, other than taking the dict apart, and rewriting the offending value to an ASCII value via .encode, and then re-assembling the dict again
Is there any way I can get this to print without taking the dict apart?

This is unrelated to pprint: the module only formats the string into another string and then passes the formatted string to the underlying stream. So your error occurs when the é character (U+00E9) is written to stdout.
Now it really depends on the underlying OS and the configuration of the Python interpreter. In Linux or other Unix-like systems, you could try to declare a UTF-8 or Latin1 charset in your terminal session by setting the environment variable PYTHONIOENCODING before starting Python:
$ export PYTHONIOENCODING=Latin1
$ python
(or use PYTHONIOENCODING=utf8 depending on the actual encoding of your terminal or terminal window).

Standard input and output are file objects in Python. The Python 3 documentation says that, when these objects are created, if encoding is left unspecified then locale.getpreferredencoding(False) is called to fetch the locale's preferred encoding.
Your system should have been set up with one or more "locales" when GNU/Linux was installed (I'm guessing from your paths that you are using some version of GNU/Linux). On a "sensible" setup, the default locale should allow UTF-8. But if you only did a "minimal" installation (for example as part of setting up a container), or something like that, then it is possible that the system has set locale to "C" (the ultimate fallback locale), which does not support UTF-8.
Just because your terminal can accept UTF-8 (as demonstrated by using echo with a UTF-8 string), does not mean Python knows that UTF-8 is acceptable. If Python sees the locale set to "C" then it will assume only ASCII is allowed unless told otherwise.
You can check the current locale by typing locale at the shell prompt, and change it by setting the LC_ALL environment variable. But before changing it you must check with locale -a to see which locales are available on your system, otherwise your change may not be effective and you may get the "C" locale anyway. If your system has not been set up with the locale you want, you can add it if you have root access: most GNU/Linux distributions provide options to do this when you (re)configure a package called locales, so for example on Debian/Ubuntu-based distros, sudo dpkg-reconfigure locales should show you the options.
But sometimes you will be in the awkward position of having to write a Python script to run on a system that has not been set up with decent locales and there's nothing you can do about it because you don't have root and the sysadmin insists on giving you the absolute minimum. Then what do we do?
Well there are options within Python itself. You could run export PYTHONIOENCODING=utf-8 before running Python, to tell Python to use that encoding no matter what the locale says. Or you could give pprint a stream= parameter, set to a stream that you've opened yourself using open() with an encoding="utf-8" parameter (although this is no good if you want to use sys.stdout or os.popen instead of a file). Or you could upgrade to Python 3.7 and use sys.stdout.reconfigure(encoding='utf-8') (but this won't work in the Python 3.6 mentioned in the original question).
Or, you could import codecs and do w=codecs.getwriter("utf-8")(sys.stdout.buffer) and then pass stream=w to your pprint:
from pprint import pprint
import sys, codecs
w=codecs.getwriter("utf-8")(sys.stdout.buffer)
d = {"testing": "这是个考验"}
pprint (d, stream=w)

Python3 utf-8 decode issue

The following code runs fine with Python3 on my Windows machine and prints the character 'é':
data = b"\xc3\xa9"
print(data.decode('utf-8'))
However, running the same on an Ubuntu based docker container results in :
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
Is there anything that I have to install to enable utf-8 decoding ?

Seems ubuntu - depending on version - uses one encoding or another as default, and it may vary between shell and python as well. Adopted from this posting and also this blog:
Thus the recommended way seems to be to tell your python instance to use utf-8 as default encoding:
Set your default encoding of python source files via environment variable:
export PYTHONIOENCODING=utf8
Also, in your source files you can state the encoding you prefer to be used explicitly, so it should work irrespective of environment setting (see this question + answer, python docs and PEP 263:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....
Concerning the interpretation of encoding of files read by python, you can specify it explicitly in the open command
with open(fname, "rt", encoding="utf-8") as f:
...
and there's a more hackish way with some side effects, but saves you to explicitly specify it each time
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Please read the warnings about this hack in the related answer and comments.

The problem is with the print() expression, not with the decode() method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment.
In an ideal case,
the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which contain all characters).
In your case, the second condition isn't met for the Linux docker you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter.
These are a few options to address this problem:
Set environment variables: on Linux, Python's encoding defaults depend on this (at least partially). In my experience, this is a bit of a trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in start-up script for the shell your terminal runs, eg. .bashrc.
Re-encode STDOUT, like so:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
The encoding used has to match the one of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, eg. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).
There might be other options, but I doubt that there are nicer ones.

python html.unescape() doesn't work in concole [duplicate]

This question already has answers here:
Python, Unicode, and the Windows console
(15 answers)
Closed 5 years ago.
I am writing a Python (Python 3.3) program to send some data to a webpage using POST method. Mostly for debugging process I am getting the page result and displaying it on the screen using print() function.
The code is like this:
conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'));
the HTTPResponse .read() method returns a bytes element encoding the page (which is a well formated UTF-8 document) It seemed okay until I stopped using IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em-dash) which the print function translates well in the Windows GUI (I presume Code Page 1252) but does not in the Windows Console (Code Page 850). Given the strict default behavior I get the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>
I could fix it using this quite ugly code:
print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))
Now it replace the offending character "—" with a ?. Not the ideal case (a hyphen should be a better replacement) but good enough for my purpose.
There are several things I do not like from my solution.
The code is ugly with all that decoding, encoding, and decoding.
It solves the problem for just this case. If I port the program for a system using some other encoding (latin-1, cp437, back to cp1252, etc.) it should recognize the target encoding. It does not. (for instance, when using again the IDLE GUI, the emdash is also lost, which didn't happen before)
It would be nicer if the emdash translated to a hyphen instead of a interrogation bang.
The problem is not the emdash (I can think of several ways to solve that particularly problem) but I need to write robust code. I am feeding the page with data from a database and that data can come back. I can anticipate many other conflicting cases: an 'Á' U+00c1 (which is possible in my database) could translate into CP-850 (DOS/Windows Console encodign for Western European Languages) but not into CP-437 (encoding for US English, which is default in many Windows instalations).
So, the question:
Is there a nicer solution that makes my code agnostic from the output interface encoding?

I see three solutions to this:
Change the output encoding, so it will always output UTF-8. See e.g. Setting the correct encoding when piping stdout in Python, but I could not get these example to work.
Following example code makes the output aware of your target charset.
# -*- coding: utf-8 -*-
import sys
print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
This example properly replaces any non-printable character in my name with a question mark.
If you create a custom print function, e.g. called myprint, using that mechanisms to encode output properly you can simply replace print with myprint whereever necessary without making the whole code look ugly.
Reset the output encoding globally at the begin of the software:
The page http://www.macfreek.nl/memory/Encoding_of_Python_stdout has a good summary what to do to change output encoding. Especially the section "StreamWriter Wrapper around Stdout" is interesting. Essentially it says to change the I/O encoding function like this:
In Python 2:
if sys.stdout.encoding != 'cp850':
sys.stdout = codecs.getwriter('cp850')(sys.stdout, 'strict')
if sys.stderr.encoding != 'cp850':
sys.stderr = codecs.getwriter('cp850')(sys.stderr, 'strict')
In Python 3:
if sys.stdout.encoding != 'cp850':
sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'cp850':
sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')
If used in CGI outputting HTML you can replace 'strict' by 'xmlcharrefreplace' to get HTML encoded tags for non-printable characters.
Feel free to modify the approaches, setting different encodings, .... Note that it still wont work to output non-specified data. So any data, input, texts must be correctly convertable into unicode:
# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
print u"Stöcker" # works
print "Stöcker".decode("utf-8") # works
print "Stöcker" # fails

Based on Dirk Stöcker's answer, here's a neat wrapper function for Python 3's print function. Use it just like you would use print.
As an added bonus, compared to the other answers, this won't print your text as a bytearray ('b"content"'), but as normal strings ('content'), because of the last decode step.
def uprint(*objects, sep=' ', end='\n', file=sys.stdout):
enc = file.encoding
if enc == 'UTF-8':
print(*objects, sep=sep, end=end, file=file)
else:
f = lambda obj: str(obj).encode(enc, errors='backslashreplace').decode(enc)
print(*map(f, objects), sep=sep, end=end, file=file)
uprint('foo')
uprint(u'Antonín Dvořák')
uprint('foo', 'bar', u'Antonín Dvořák')

For debugging purposes, you could use print(repr(data)).
To display text, always print Unicode. Don't hardcode the character encoding of your environment such as Cp850 inside your script. To decode the HTTP response, see A good way to get the charset/encoding of an HTTP response in Python.
To print Unicode to Windows console, you could use win-unicode-console package.

I dug deeper into this and found the best solutions are here.
http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python
In my case I solved "UnicodeEncodeError: 'charmap' codec can't encode character "
original code:
print("Process lines, file_name command_line %s\n"% command_line))
New code:
print("Process lines, file_name command_line %s\n"% command_line.encode('utf-8'))

If you are using Windows command line to print the data, you should use
chcp 65001
This worked for me!

If you use Python 3.6 (possibly 3.5 or later), it doesn't give that error to me anymore. I had a similar issue, because I was using v3.4, but it went away after I uninstalled and reinstalled.

Translate special character ½

I am reading a source that contains the special character ½. How do I convert this to 1/2? The character is part of a sentence and I still need to be able to use this string "normally". I am reading webpage sources, so I'm not sure that I will always know the encoding??
Edit: I have tried looking at other answers, but they don't work for me. They always seem to start with something like:
s= u'£10"
but I get an error already there: "no encoding declared". But do I know what encoding I'm getting in, or does that not matter? Do I just pick one?

This is really two questions.
#1. To interpret ½: Use the unicodedata module. You can ask for the numeric value of the character or you can normalize using a canonical normalization form it and parse it yourself.
>>> import unicodedata
>>> unicodedata.numeric(u'½')
0.5
>>> unicodedata.normalize('NFKC', u'½')
'1⁄2'
#2. Encoding problems: If you're working with the terminal, make sure Python knows the terminal encoding. If you're writing source files, make sure Python knows the file encoding. You can't just "pick" an encoding to set for Python, you must inform Python about the encoding that your terminal / text editor already uses.
Python lets you set the encoding of files with Vim/Emacs style comments. Put a comment at the top of the file like this if you use Vim:
# coding=UTF-8
Or this, if you use Emacs:
# -*- coding: UTF-8 -*-
If you use neither Vim nor Emacs, then it doesn't matter which one. Obviously, if you don't use UTF-8 you should substitute the encoding you actually use. (UTF-8 is the only encoding I can recommend.)

Dietrich beat me to the punch, but here is some more detail about setting the encoding for your source file:
Because you want to search for a literal unicode ½, you need to be able to write it in your source file. Unfortunately, the Python interpreter chokes on any unicode input, unless you specify the encoding of that source file with a comment in the first couple of lines, like so:
# coding=utf8
# ... do stuff here ...
This assumes your editor is saving the file as UTF-8. If it's using a different encoding specify that instead. See PEP-0263 for more details.
Once you've specified the encoding you should be able to write something this in your code:
text = text.replace('½', '1/2')
Encoding of the webpage
Depending on how you are downloading the page, you probably don't need to worry about this at all, most HTTP libraries handle choosing the encoding for you automatically.

Did you try using codecs to read your file? [docs]
import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
You can check the whole guide here.
Also a good ref: http://docs.python.org/howto/unicode

python noob question about codecs and utf-8

Using python to pick it some pieces so definitely a noob ? here but didn't seeing a satisfactory answer.
I have a json utf-8 file with some pieces that have grave's, accute's etc.... I'm using codecs and have (for example):
str=codecs.open('../../publish_scripts/locations.json', 'r','utf-8')
locations=json.load(str)
for location in locations:
print location['name']
For print'ing, does anything special need to be done? It's giving me the following
ascii' codec can't encode character u'\xe9' in position 5
It looks like the correct utf-8 value for e-accute. I suspect I'm doing something wrong with print'ing. Would the iteration cause it to lose it's utf-8'ness?
PHP and Ruby versions handle the utf-8 piece fine; is there some looseness in those languages that python won't do?
thx

codec.open() will decode the contents of the file using the codec you supplied (utf-8). You then have a python unicode object (which behaves similarly to a string object).
Printing a unicode object will cause an implict (behind-the-scenes) encode using the default codec, which is usually ascii. If ascii cannot encode all of the characters present it will fail.
To print it, you should first encode it, thus:
for location in locations:
print location['name'].encode('utf8')
EDIT:
For your info, json.load() actually takes a file-like object (which is what codecs.open() returns). What you have at that point is neither a string nor a unicode object, but an iterable wrapper around the file.
By default json.load() expects the file to be utf8 encoded so your code snippet can be simplified:
locations = json.load(open('../../publish_scripts/locations.json'))
for location in locations:
print location['name'].encode('utf8')

You're probably reading the file correctly. The error occurs when you're printing. Python tries to convert the unicode string to ascii, and fails on the character in position 5.
Try this instead:
print location['name'].encode('utf-8')
If your terminal is set to expect output in utf-8 format, this will print correctly.

It's the same as in PHP. UTF8 strings are good to print.

The standard io streams are broken for non-ascii, character io in python2 and some site.py setups. Basically, you need to sys.setdefaultencoding('utf8') (or whatever the system locale's encoding is) very early in your script. With the site.py shipped in ubuntu, you need to imp.reload(sys) to make sys.setdefaultencoding available. Alternatively, you can wrap sys.stdout (and stdin and stderr) to be unicode-aware readers/writers, which you can get from codecs.getreader / getwriter.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to convert html entities into symbols? - python

Related

pprint: UnicodeEncodeError: 'ascii' codec can't encode character

Python3 utf-8 decode issue

python html.unescape() doesn't work in concole [duplicate]

Translate special character ½

python noob question about codecs and utf-8

Categories

Resources