Encoding issue when trying to scrape a page - python

I'm using BeautifulSoup to scrape a page that has an ISO-8859-1 encoding, but I've run into a little hiccup.
I have a line that reads:
logging.info("Processing [%s]" % (link))
The variable link is one of the values scraped from beautifulsoup. It is a Unicode string and I can print it by typing print link. It shows up on the console exactly the way it was scraped but the line above throws this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
I've just read up on Unicode, but I can't figure out why Python is able to print the string but can't log it.
The string in question is this:
booba-concert-à-bercy
Any ideas on where I'm mucking this up?
Thank you.

logging doesn't like unicode; pass it bytes.
logging.info("Processing [%s]" % (link.encode('utf-8')))
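For comparison, here is a minimal sketch that assumes Python 3, where str is already Unicode and logging accepts it directly. The logger name scrape_demo and the in-memory stream are hypothetical and exist only so the result can be inspected. Passing link as an argument instead of pre-formatting with % is a common logging idiom, not something the answer above prescribes.

```python
import io
import logging

# Hypothetical logger and in-memory stream, used only for this demonstration.
stream = io.StringIO()
logger = logging.getLogger("scrape_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

link = "booba-concert-\u00e0-bercy"  # the scraped value from the question

# Let logging interpolate the argument instead of building the message with %.
logger.info("Processing [%s]", link)

logged = stream.getvalue()
print(logged, end="")
```

Encoding then happens only where the handler writes to its stream, so the message itself stays text all the way through.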

I managed to solve this by adding a file called sitecustomize.py in my Python/Lib/site-packages directory. This file contained two lines: import sys and sys.setdefaultencoding('utf-8').
The default encoding prior to that was ascii, which caused the issues. Now I don't need to specify an explicit encoding for the link variable, because it is converted using the default encoding, i.e. utf-8.
Of course, I'll never see the characters properly until my terminal uses the same encoding, but that won't break my code.

Python opening files with utf-8 file names

In my code I used something like file = open(path +'/'+filename, 'wb') to write the file
but in my attempt to support non-ascii filenames, I encode it as such
naming = path+'/'+filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
write binary data...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path:
    data = open(file.as_posix(), 'rb')
    ...
I keep getting the error 'ascii' codec can't encode characters in position...
I tried converting the string to bytes with data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb'), but then I get 'utf-8' codec can't encode characters in position...
I also tried file.as_posix().encode('utf-8', 'surrogateescape'). Both encoding and printing work fine, but with open() I still get the error 'utf-8' codec can't encode characters in position...
How can I open a file with a utf-8 filename?
I'm using Python 3.9 on ubuntu linux
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to utf, it writes fine.
But when finding the file again by crawling into the directory the str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt so it gives me an error when I try to encode it to any codec.
Currently I'm investigating if the issue's related to my linux locale, it was set to POSIX, I changed it to C.UTF-8 but still no luck atm.
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format
I don't understand why you feel you need to recode filepaths.
Linux (unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There's no need to break astral characters in surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there's no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding for a surrogate codepoint, you're likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
Anyway, there's no need to go to all that trouble. Before running this session, I created a directory called ´ñ´ with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.
>>> from pathlib import Path
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path('ñ2').mkdir()
>>> for path in Path('ñ').iterdir():
...     open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']
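To illustrate the earlier point about surrogates, here is a small sketch assuming Python 3. The variable names and the sample byte string are made up for the demonstration. It shows that a lone surrogate has no UTF-8 encoding, that an astral character encodes without any special handling, and that surrogateescape exists to round-trip bytes that are not valid UTF-8, not to encode ordinary text.

```python
# A lone surrogate code point cannot be encoded as UTF-8.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# An astral character (U+1D510) encodes to a normal 4-byte UTF-8 sequence.
astral = "\U0001D510"
astral_bytes = astral.encode("utf-8")

# surrogateescape smuggles undecodable bytes through str and back again.
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
as_text = raw.decode("utf-8", "surrogateescape")
round_trip = as_text.encode("utf-8", "surrogateescape")

print(strict_ok, len(astral_bytes), ascii(as_text), round_trip)
```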
Note:
In a comment, OP says that they had previously tried:
file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
and received the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)
Without more details, it's hard to know how to respond to that. It's possible that open will raise that error for a filesystem which doesn't allow non-ascii characters, but that wouldn't be normal on Linux.
However, it's worth noting that the string literal
'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal, "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
But that's not recommended.
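The difference between the two literals can be checked directly. This is a minimal sketch assuming Python 3; the variable names are illustrative only. The last step shows one way to repair a string that was built from \x escapes by mistake, which relies on Latin-1 mapping code points 0 to 255 straight onto bytes.

```python
text_literal = "\xd8\xb9"    # str: two code points, U+00D8 and U+00B9
bytes_literal = b"\xd8\xb9"  # bytes: the UTF-8 encoding of U+0639

# Decoding the bytes gives the single Arabic letter ain.
decoded = bytes_literal.decode("utf-8")

# Repairing the mistaken str: turn the code points back into bytes, then decode.
repaired = text_literal.encode("latin-1").decode("utf-8")

print(len(text_literal), len(decoded), decoded == repaired)
```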
So after being in a rabbit hole for the past few days, I figured the issue isn't with python itself but with the locale that my web framework was using. Debugging this, I saw that
import sys
print(sys.getfilesystemencoding())
returned 'ASCII', which was odd given that I had set the Linux locale to C.UTF-8. It turned out that because I was running WSGI on Apache2, I also had to set the locale for the WSGI daemon process, by adding WSGIDaemonProcess my_app locale='C.UTF-8' to the Apache configuration file, thanks to this post.
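When debugging this kind of problem, it can help to dump every encoding the interpreter has picked up from its environment, from inside the process that actually fails (here, the WSGI daemon rather than an interactive shell). The helper below is a hypothetical sketch assuming Python 3; encoding_report is not part of any library.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derived from the process environment."""
    return {
        "filesystem": sys.getfilesystemencoding(),        # used for file names
        "preferred": locale.getpreferredencoding(False),  # default for open() in text mode
        "stdout": getattr(sys.stdout, "encoding", None),  # used by print()
        "default": sys.getdefaultencoding(),              # always 'utf-8' on Python 3
    }


report = encoding_report()
for name in sorted(report):
    print(name, report[name])
```

If "filesystem" reports ascii under the web server but utf-8 in a shell, the server process is running with a different locale than the shell.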

'ascii' codec can't encode error when using Python to read JSON

Yet another person unable to find the correct magic incantation to get Python to print UTF-8 characters.
I have a JSON file. The JSON file contains string values. One of those string values contains the character "à". I have a Python program that reads in the JSON file and prints some of the strings in it. Sometimes when the program tries to print the string containing "à" I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 12: ordinal not in range(128)
This is hard to reproduce. Sometimes a slightly different program is able to print the string "à". A smaller JSON file containing only this string does not exhibit the problem. If I start sprinkling encode('utf-8') and decode('utf-8') around the code it changes what blows up in unpredictable ways. I haven't been able to create a minimal code fragment and input that exhibits this problem.
I load the JSON file like so.
with codecs.open(filename, 'r', 'utf-8') as f:
    j = json.load(f)
I'll pull out the offending string like so.
s = j['key']
Later I do a print that has s as part of it and see the error.
I'm pretty sure the original file is in UTF-8 because in the interactive command line
codecs.open(filename, 'r', 'utf-8').read()
returns a string but
codecs.open(filename, 'r', 'ascii').read()
gives an error about the ascii codec not being able to decode such-and-such a byte. The file size in bytes is identical to the number of characters returned by wc -c, and I don't see anything else that looks like a non-ASCII character, so I suspect the problem lies entirely with this one high-ASCII "à".
I am not making any explicit calls to str() in my code.
I've been through the Python Unicode HOWTO multiple times. I understand that I'm supposed to "sandwich" unicode handling. I think I'm doing this, but obviously there's something I still misunderstand.
Mostly I'm confused because it seems like if I specify 'utf-8' in the codecs.open call, everything should be happening in UTF-8. I don't understand how the ASCII codec still sneaks in.
What am I doing wrong? How do I go about debugging this?
Edit: Used io module in place of codecs. Same result.
Edit: I don't have a minimal example, but at least I have a minimal repro scenario.
I am printing an object derived from the strings in the JSON that is causing the problem. So the following gives an error.
print(myobj)
(Note that I am using from __future__ import print_function though that does not appear to make a difference.)
Putting an encode('utf-8') on the end of my object's __str__ function return value does not fix the bug. However changing the print line to this does.
print("%s" % myobj)
This looks wrong to me. I'd expect these two print calls to be equivalent.
I can make this work by doing the sys.setdefaultencoding hack:
import sys
reload(sys)
sys.setdefaultencoding("UTF-8")
But this is apparently a bad idea that can make Python malfunction in other ways.
What is the correct way to do this? I tried
env PYTHONIOENCODING=UTF-8 ./myscript.py
but that didn't work. (Unsurprisingly, since the issue is the default encoding, not the io encoding.)
In Python 2, when you write directly to a file or redirect stdout to a file or pipe, the default encoding is ASCII, and you have to encode Unicode strings before writing them. With opened file handles you can set an encoding so that this happens automatically, but with print you must call encode() yourself.
print s.encode('utf-8')
It is recommended to use the newer io module in place of codecs because it has an improved implementation and is forward compatible with Py3.x open().
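As a rough illustration of the io-based approach, here is a sketch assuming Python 3, where io.open is the built-in open. The file name sample.json, its contents, and the in-memory sink standing in for a redirected stdout are all made up for the example. The point is that the encoding is stated once at each boundary, on the way in and on the way out, and the string in between stays Unicode.

```python
import io
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")  # hypothetical file
    with io.open(path, "w", encoding="utf-8") as f:
        f.write('{"key": "voil\u00e0"}')
    # Decode on the way in.
    with io.open(path, "r", encoding="utf-8") as f:
        j = json.load(f)

s = j["key"]

# Encode on the way out: a text wrapper with an explicit encoding
# around a byte stream that stands in for a redirected stdout.
sink = io.BytesIO()
out = io.TextIOWrapper(sink, encoding="utf-8", newline="\n")
print(s, file=out)
out.flush()
written = sink.getvalue()
```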

ascii codec can't decode byte 0xe3 in position error in Ubuntu/Python, but not on OS X/Python

I'm now on Ubuntu 13.04 and Python 2.7.4 and tried to run a script including the following lines:
html = unicode(html, 'cp932').encode('utf-8')
html1, html2 = html.split(some_text) # this line spits out the error
When I ran the above script on Ubuntu 13.04, it spat out the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 106: ordinal not in range(128). Yet exactly the same script always runs successfully on OS X 10.8 and Python 2.7.3. So I wonder why the error occurred on only one of the two platforms...
The first thought that came to my mind, especially after reading this post (UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1), was that the difference arose because I'm in a different LANG environment: I use jp_JP.UTF-8 on OS X but en_US.UTF-8 on Ubuntu. So I also tried adding one more line, os.environ['LANG'] = 'jp_JP.UTF-8', to the aforementioned script, but still got the same error.
One more strange phenomenon is that when I attempt to run the script from within IPython shell on Ubuntu and go into debug mode immediately after the error happens, and then run the line which originally triggered the error, I don't get the error any more...
So what's happening here? And what am I missing?
Thanks in advance.
You haven't given us enough information to be sure, but there's a pretty good chance this is your problem:
If some_text is a unicode object, then this line:
html1, html2 = html.split(some_text) # this line spits out the error
… is calling split on a str, and passing a unicode parameter. Whenever you mix str and unicode in the same call, Python 2.x handles that by automatically calling unicode on the str. So, that's equivalent to:
html1, html2 = unicode(html).split(some_text) # this line spits out the error
… which is equivalent to:
html1, html2 = html.decode(sys.getdefaultencoding()).split(some_text) # this line spits out the error
… which will fail if there are any non-ASCII characters in html, exactly as you're seeing.
The easy workaround is to explicitly encode some_text to UTF-8:
html1, html2 = html.split(some_text.encode('utf-8'))
But personally, I wouldn't even try to work with str objects from 3 different charsets all in the same program. Why not just decode/encode at the very edges, and just deal with unicode objects everywhere in between?
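A sketch of the decode-at-the-edges approach, assuming Python 3 (where mixing bytes and text raises a TypeError instead of triggering an implicit ASCII decode). The sample page content and separator are invented for the example.

```python
# Hypothetical input: bytes in cp932, as they would arrive from the network.
raw_html = "前半<hr>後半".encode("cp932")
some_text = "<hr>"

# Python 3 refuses to mix bytes and str rather than guessing an encoding.
try:
    raw_html.split(some_text)
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Edge 1: decode once, right after reading.
html = raw_html.decode("cp932")

# Middle: work with text only.
html1, html2 = html.split(some_text)

# Edge 2: encode once, right before writing.
encoded_out = html2.encode("utf-8")
```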

Python Unicode Encode Error ordinal not in range<128> with Euro Sign

I have to read an XML file in Python and grab various things, and I ran into a frustrating error with Unicode Encode Error that I couldn't figure out even with googling.
Here are snippets of my code:
#!/usr/bin/python
# coding: utf-8
from xml.dom.minidom import parseString
with open('data.txt','w') as fout:
    #do a lot of stuff
    nameObj = data.getElementsByTagName('name')[0]
    name = nameObj.childNodes[0].nodeValue
    #... do more stuff
    fout.write(','.join((name,bunch of other stuff)))
This spectacularly crashes when a name entry I am parsing contains a Euro sign. Here is the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 60: ordinal not in range(128)
I understand why Euro sign will screw it up (because it's at 128, right?), but I thought doing # coding: utf-8 would fix that. I also tried adding .encode(utf-8) so that the name looks instead like
name = nameObj.childNodes[0].nodeValue.encode(utf-8)
But that doesn't work either. What am I doing wrong? (I am using Python 2.7.3 if anyone wants to know)
EDIT: Python crashes out on the fout.write() line -- it will go through fine where the name field is like:
<name>United States, USD</name>
But will crap out on name fields like:
<name>France, € </name>
When you open a file in Python 2 with the built-in open function, it reads and writes byte strings. A unicode object passed to write() is implicitly encoded with the default ASCII codec, which is what fails here. To write in another encoding you have to use codecs:
import codecs
fout = codecs.open('data.txt','w','utf-8')
It looks like you're getting Unicode data from your XML parser, but you're not encoding it before writing it out. You can explicitly encode the result before writing it out to the file:
text = ",".join(stuff) # this will be unicode if any value in stuff is unicode
encoded = text.encode("utf-8") # or use whatever encoding you prefer
fout.write(encoded)
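Both answers come down to choosing the encoding explicitly at the point of writing. Here is a minimal sketch assuming Python 3, with a made-up field list and a temporary file in place of data.txt. It also shows that the Euro sign is code point U+20AC, not 128, and that it encodes to three bytes in UTF-8.

```python
import io
import os
import tempfile

fields = ["France", "\u20ac", "EUR"]  # hypothetical row containing a Euro sign
euro_code_point = ord("\u20ac")

# ASCII cannot represent the Euro sign, which is the error in the question.
try:
    "\u20ac".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    # The file object encodes text to UTF-8 as it is written.
    with io.open(path, "w", encoding="utf-8") as fout:
        fout.write(",".join(fields))
    with io.open(path, "rb") as fin:
        raw = fin.read()
```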

UnicodeEncodeError when fetching url

I'm trying to get all the text nodes in an HTML document using lxml, but I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128). However, when I check the encoding of this page (encoding = chardet.detect(response)['encoding']), it says utf-8. It seems weird that a single page could be both utf-8 and ascii. Actually, this:
fromstring(response).text_content().encode('ascii', 'replace')
solves the problem.
Here is my code:
from lxml.html import fromstring
import urllib2
import chardet
request = urllib2.Request(my_url)
request.add_header('User-Agent',
                   'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
request.add_header("Accept-Language", "en-us")
response = urllib2.urlopen(request).read()
encoding = chardet.detect(response)['encoding']
print encoding
print fromstring(response).text_content()
Output:
utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)
What can I do to solve this issue? Keep in mind that I want to do this with a few other pages, so I don't want to encode on an individual basis.
UPDATE:
Maybe there is something else going on here. When I run this script in the terminal, I get correct output, but when I run it inside SublimeText, I get the UnicodeEncodeError.
UPDATE2:
It's also happening when I create a file with this output. .encode('ascii', 'replace') is working but I'd like to have a more general solution.
Regards
Can you try wrapping your string with repr()?
This article might help.
print repr(fromstring(response).text_content())
As far as writing out to a file as said in your edit, I would recommend opening the file with the codecs module:
import codecs
output_file = codecs.open('filename.txt','w','utf8')
I don't know SublimeText, but it seems to be trying to read your output as ASCII, hence the encoding error.
Based on your first update I would say that the terminal told Python to output utf-8 and SublimeText made clear it expects ascii. So I think the solution will be in finding the right settings in SublimeText.
However, if you cannot change what SublimeText expects, you can wrap the encode call you already use in a separate function.
import sys
def smartprint(text):
    # sys.stdout.encoding is None when output is piped or redirected; fall back to ASCII.
    encoding = sys.stdout.encoding or 'ascii'
    print text.encode(encoding, 'replace')
You can use this function instead of print. Keep in mind that your program's output in SublimeText will differ from its output in the terminal. Because of 'replace', characters that the target encoding cannot represent are replaced with a question mark when this code runs in SublimeText, e.g. é will be shown as ?.
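The effect of the 'replace' error handler can be seen without a console at all. This sketch assumes Python 3, and to_console_bytes is a hypothetical helper, not a standard function. It returns the bytes that would be written for a given target encoding.

```python
import sys


def to_console_bytes(text, encoding=None):
    """Encode text for output, substituting '?' for unrepresentable characters."""
    # Fall back to the stdout encoding, then to ASCII if that is unknown.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace")


ascii_result = to_console_bytes("caf\u00e9", "ascii")
utf8_result = to_console_bytes("caf\u00e9", "utf-8")
print(ascii_result, utf8_result)
```

With an ASCII target the accented letter is lost; with UTF-8 it survives as two bytes.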
