Python BeautifulSoup or CSV encoding issue with &nbsp - python

I was looking for conversion of an HTML table to CSV format, and came across the following, which looked promising (as I am also trying to learn Python)
https://stackoverflow.com/a/16697784/838253
Unfortunately, it doesn't work on my samples, and I encounter the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 753: ordinal not in range(128)
This seems to be the result of BeautifulSoup's stripped_strings converting non-breaking spaces into u'\xa0'.
That looks like perfectly normal Unicode (although converting multiple non-breaking spaces into a single u'\xa0' seems a bit off).
The error seems to come from the csv module.
Why can't this handle standard Unicode, and what is the best way of handling this?

In Python 2.7, the csv module doesn't support Unicode; see the note at the beginning of its documentation.
You can use UnicodeWriter from the examples to write csv data with Unicode.
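For completeness: in Python 3 the csv module is Unicode-aware, so the u'\xa0' character writes cleanly once the file (or buffer) is opened in text mode with an explicit encoding. A minimal sketch with made-up data:

```python
import csv
import io

# Rows containing u'\xa0', the non-breaking space that stripped_strings produces.
rows = [[u"item", u"price"], [u"widget", u"1\xa0000"]]

# io.StringIO stands in for open('out.csv', 'w', encoding='utf-8', newline='').
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()
```

No UnicodeEncodeError is raised, because in Python 3 the csv module hands str objects to the underlying text stream, which does the encoding.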

Related

Encoding string in scrapy and dropping to JSON

I need to scrape text data from sites in languages other than English (mostly Eastern European languages) using Scrapy. When Scrapy finishes, it needs to convert the scraped data to JSON for further use.
The thing is, if I just scrape the text like this:
i['title'] = response.xpath('//home/title//text()').extract_first()
without encoding it, Scrapy throws something like this:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0107' in position 103: character maps to <undefined>
On the other hand, if I do encode it and try to process that with json.dumps(), I get a TypeError, since json can't serialize bytes. I've seen this explanation (How to encode bytes in JSON? json.dumps() throwing a TypeError), but it's of little use, since I need to use utf-8 or utf-16, not ascii.
Any idea how to solve this?
Have you taken a look at the response headers? What encoding do they declare? I can imagine they declare a different encoding than the one actually used.
Python's decode function has an errors parameter ('strict', 'replace', 'ignore') which you can use to debug and find the problem.
Sorry, this is more a comment than an answer, but I can't comment yet (too little rep).
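To illustrate the errors parameter mentioned above (a minimal sketch; the byte string is just a made-up example of UTF-8 input containing the character from the error message):

```python
import json

raw = b'na\xc4\x87i'  # UTF-8 bytes for u'na\u0107i'; \u0107 is the character in the traceback

# errors='strict' (the default) raises UnicodeDecodeError on bad bytes;
# 'replace' substitutes U+FFFD; 'ignore' silently drops them.
text = raw.decode('utf-8', errors='strict')

# json.dumps serializes the decoded str without any manual .encode();
# by default it escapes non-ASCII characters as \uXXXX sequences.
as_json = json.dumps({'title': text})
```

Keeping the data as str (unicode) until the final json.dumps avoids both the UnicodeEncodeError and the "can't serialize bytes" TypeError.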

Python UnicodeEncodeError when Outputting Parsed Data from a Webpage

I have a program that parses webpages and then writes the data out somewhere else. When I am writing the data, I get
"UnicodeEncodeError: 'ascii' codec can't encode characters in position
19-21: ordinal not in range(128)"
I am gathering the data using lxml.
name = apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text
worksheet.goog["Name"].append(name)
Upon reading, http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm, it suggests I record all of my variables in unicode. This means I need to know what encoding the site is using.
My final line that actually writes the data out somewhere is:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], (str(worksheet.goog[value][row])).encode('ascii', 'ignore'))
How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?
Your error is because of:
str(worksheet.goog[value][row])
Calling str() implicitly encodes to ASCII; what you should be doing is encoding to UTF-8:
worksheet.goog[value][row].encode("utf-8")
As far as "How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?" goes: you can't. There is no ASCII latin ă etc., unless you want the closest ASCII equivalent using something like Unidecode.
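Unidecode is a third-party package; a rough stdlib approximation of the same idea, using NFKD decomposition to strip accents (Unidecode itself goes much further, e.g. transliterating whole scripts):

```python
import unicodedata

def to_ascii(text):
    # Decompose accented letters into base letter + combining mark,
    # then drop everything that still isn't ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
```

For example, to_ascii(u'\u0103') (the ă above) gives 'a', and to_ascii(u'caf\xe9') gives 'cafe'.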
I think I may have figured my own problem out.
apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text
actually defaults to Unicode. So what I did was change this line to:
name = (apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text).encode('ascii', errors='ignore')
And I just output without changing anything:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], worksheet.goog[value][row])
Due to the nature of the data, ASCII only is mostly fine. Although, I may be able to use UTF-8 and catch some extra characters...but this is not relevant to the question.
:)

'ascii' codec can't encode character at position * ord not in range(128)

There are a few threads on Stack Overflow, but I couldn't find a valid solution to the problem as a whole.
I have collected huge amounts of textual data with the urllib read function and stored it in pickle files.
Now I want to write this data to a file.
While writing, I'm getting errors similar to:
'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)
and a lot of data is being lost.
I suppose the data off the urllib read is byte data.
I've tried
1. text=text.decode('ascii','ignore')
2. s=filter(lambda x: x in string.printable, s)
3. text=u''+text
text=text.decode().encode('utf-8')
but I'm still ending up with similar errors.
Can somebody point out a proper solution?
Also, would stripping via codecs work?
I have no issues if the conflicting bytes are not written to the file as a string, so the loss is acceptable.
You can do it with smart_str from the Django module. Just try this:
from django.utils.encoding import smart_str, smart_unicode
text = u'\u2019'
print smart_str(text)
You can install Django by starting a command shell with administrator privileges and run this command:
pip install Django
Your data is unicode data. To write that to a file, use .encode():
text = text.encode('ascii', 'ignore')
but that would remove anything that isn't ASCII. Perhaps you wanted to encode to a more suitable encoding, like UTF-8, instead?
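A minimal sketch of the UTF-8 route, using io.open (which takes an encoding argument on both Python 2 and 3; the file name here is arbitrary):

```python
import io
import os
import tempfile

text = u'It\u2019s done'  # contains U+2019, the character from the traceback

path = os.path.join(tempfile.gettempdir(), 'unicode_demo.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)  # no UnicodeEncodeError: the file object encodes for us

with io.open(path, encoding='utf-8') as f:
    round_tripped = f.read()
```

Nothing is lost on the round trip, because UTF-8 can represent every Unicode character.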
You may want to read up on Python and Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode, however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate some various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
dmessage,
])
Does anyone have any advice on removing Unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe
You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you, if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like â€™ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage = contact.message.encode('cp1252', 'ignore')
Alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage = contact.message.encode('ascii', 'ignore')
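For comparison, here is how the two options behave on a made-up string that mixes a cp1252-representable character (’) with one that isn't (ř, U+0159):

```python
msg = u'Dvo\u0159\xe1k\u2019s note'  # ř is not in cp1252; á and ’ are

cp1252_bytes = msg.encode('cp1252', 'ignore')  # keeps á and ’, drops ř
ascii_bytes = msg.encode('ascii', 'ignore')    # drops all three
```

Western users reading the cp1252 file see most of the text intact; the ASCII version degrades everything equally.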
Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.
[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
    0x2019: u"'",
    # etc
}
smashed = input_string.translate(smashcii)

How to handle Unicode (non-ASCII) characters in Python?

I'm programming in Python and obtaining information from a web page through the urllib2 library. The problem is that the page can provide me with non-ASCII characters, like 'ñ', 'á', etc. The moment urllib2 gets such a character, it raises an exception like this:
File "c:\Python25\lib\httplib.py", line 711, in send
self.sock.sendall(str)
File "<string>", line 1, in sendall:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 74: ordinal not in range(128)
I need to handle those characters. I mean, I don't want to handle the exception but to continue the program. Is there any way to, for example (I don't know if this is a stupid idea), use another codec rather than ASCII? I have to work with those characters, insert them in a database, etc.
You just read a set of bytes from the socket. If you want a string you have to decode it:
yourstring = receivedbytes.decode("utf-8")
(substituting whatever encoding you're using for utf-8)
Then you have to do the reverse to send it back out:
outbytes = yourstring.encode("utf-8")
You want to use unicode for all your work if you can.
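A sketch of that decode-then-encode round trip (the byte string stands in for what the socket would deliver):

```python
raw = b'Espa\xc3\xb1a'  # UTF-8 bytes for u'España', as received off the wire

text = raw.decode('utf-8')   # bytes -> unicode: do all internal work on this
out = text.encode('utf-8')   # unicode -> bytes: only when sending back out
```

All comparisons, database inserts, and string manipulation happen on the decoded text; encoding is deferred to the I/O boundary.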
You probably will find this question/answer useful:
urllib2 read to Unicode
You might want to look into using an actual parsing library to find this information. lxml, for instance, already addresses Unicode encode/decode using the declared character set.
