lxml ValueError and UTF strings - Python

I am making a little Python script for mass-editing HTML files (replacing links to images, etc.). Now, the HTML files contain some Cyrillic, which means I have to encode the strings as UTF-8. I replace all the links in the HTML, call tag.set(...), and BOOM, the console displays:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.
How can I fix this? I'm pretty sure there aren't any control characters or NULL bytes. I'm using Python 2.7.11.
value = tag.get('value').encode('utf-8')
#h = HTMLParser.HTMLParser()
#value = h.unescape(value)
urls = regex.finditer(value)
if urls is None: continue
for turl in urls:
    ufile = turl.group().rsplit('/', 1)[-1]
    value = value.replace(turl.group(), '/' + newsrc + '/' + ufile)
#value = cgi.escape(value, True)
value = value.replace('\0', '')
tag.set('value', value)

It's easy: just remove the encode('utf-8') call. lxml doesn't like people messing with the character encodings of strings; its API expects Unicode (or plain ASCII) text, not pre-encoded byte strings, which is exactly what the ValueError is telling you. Leave it to lxml to serialize the text into a suitable encoding when you write the tree out, and everything will be fine. :)
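A minimal sketch of the fixed loop, assuming the same regex, newsrc, and tag variables as in the question (lxml already hands attribute values back as Unicode, so there is nothing to encode):

value = tag.get('value')  # already unicode in lxml; no .encode('utf-8')
for turl in regex.finditer(value):
    ufile = turl.group().rsplit('/', 1)[-1]
    value = value.replace(turl.group(), '/' + newsrc + '/' + ufile)
tag.set('value', value)  # lxml handles the output encoding at serialization time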

Related

Encoding issue with Scrapy (Python)

I have created a CrawlSpider with Scrapy. I need to get a specific part of the page with an XPath:
item = ExplorerItem()
item['article'] = response.xpath("//div[@class='post-content']").extract()
Then I am using this item in pipelines.py.
But item['article'] gives me a result in unicode:
u'<div class="post-content">\n\t\t\t\t\t<h2>D\xe9signation</h2>\n<p>
I need to convert it in UTF-8.
The \xe9 and \xe7 you are seeing are escaped Unicode characters. Those characters are fine; your console just isn't set up to render them. Web pages and source data don't always tell the truth about their encoding, and data is often a jumble of encodings. You may have some luck with the Unidecode module, which I have used before with success; it will do its best to represent each character in ASCII.
from unidecode import unidecode
unidecode(u"\u5317\u4EB0")  # the u prefix marks a unicode literal; returns 'Bei Jing '
Set FEED_EXPORT_ENCODING = 'utf-8' in settings.py.
See the docs: https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_ENCODING
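For example, in your project's settings.py (this is the documented Scrapy setting; without it the JSON feed exporter writes non-ASCII characters as \uXXXX escapes instead of real UTF-8):

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'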

Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters

I use Python 2.7 and I'm receiving a string from a server (not in unicode!).
Inside that string I find text with Unicode escape sequences, for example:
<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>
How do I convert those \uxxxx sequences back to utf-8? The answers I found either dealt with &# references or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.
Edit:
<\a> is a typo, but I want tolerance for such typos as well; only \u sequences should trigger a conversion.
In proper Python syntax, the example text is:
"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
The desired output, again in proper Python syntax, is:
"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"
Try
>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'
And then you can encode to utf8 as usual.
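Continuing the session, encoding the decoded result back to UTF-8 yields exactly the byte string the question asks for:

>>> s.decode("raw_unicode_escape").encode("utf-8")
'<a href = "http://www.mypage.com/\xd1\x81andmoretext">\xc2\xb2<\\a>'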
Python does contain some special string codecs for cases like this.
In this case, if there are no other characters outside the 32-127 range, you can safely decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python (on which your program should be performing all textual operations).
Whenever you output that text again, you convert it to utf-8 as usual:
rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")
If there are other bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences, you have to (see the sketch below):
1. decode the original string using utf-8
2. encode back to latin1
3. decode using "unicode_escape"
4. work on the text
5. encode back to utf-8
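A short sketch of those five steps, assuming a raw byte string that mixes UTF-8 bytes with literal \uXXXX escape sequences:

# '\xc2\xb2' is the utf-8 encoding of the superscript-two character;
# '\\u0441' is a literal backslash-u escape, as in the question
raw = '<a href="http://www.mypage.com/\\u0441andmoretext">\xc2\xb2</a>'
text = raw.decode('utf-8').encode('latin1').decode('unicode_escape')
# ... textual operations on the unicode object go here ...
output = text.encode('utf-8')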

UnicodeEncodeError: handling special characters

I am trying to scrape a web page. To take care of all characters other than ASCII, I have written this code:
mydata = ''.join([i if ord(i) < 128 else ' ' for i in response.text])
and processed it further using the Beautiful Soup Python library. Now this is not handling some special characters that are on the webpage, like [tick] and [star] (I can't show a picture here).
Any clue on how to escape these characters and replace them with a space?
Right now I have this error
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 62: character maps to <undefined>
It's always preferable to process everything in Unicode, and convert to any specific encoding only before storage or transfer. For example,
s = u"Hi, привет, ciao"
> s
u'Hi, \u043f\u0440\u0438\u0432\u0435\u0442, ciao'
> s.encode('ascii', 'ignore')
'Hi, , ciao'
> s.encode('ascii', 'replace')
'Hi, ??????, ciao'
If you need to replace non-ASCII chars specifically with spaces, you can write and register your own conversion error handler; see codecs.register_error().
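For instance, a minimal sketch of such a handler (the handler name 'space_replace' is my own choice):

import codecs

def space_replace(error):
    # one space per unencodable character, then resume after the bad span
    return (u' ' * (error.end - error.start), error.end)

codecs.register_error('space_replace', space_replace)

print u'Hi, \u043f\u0440\u0438\u0432\u0435\u0442, ciao'.encode('ascii', 'space_replace')
# prints 'Hi,       , ciao' (six spaces where the Cyrillic was)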
fp = open("output.txt","w")
gives you a file open for writing text using the default encoding, which in your case is an encoding that doesn't have the character ✓ (probably cp1252), hence the error. Open the file with an encoding that supports it and you'll be fine:
fp = open('output.txt', 'w', encoding='utf-8')
Note also that:
print("result: "+ str(ele))
can fail if your console doesn't support Unicode, which under Windows it likely will not. Use print(ascii(...)) to get an ASCII-safe representation for debugging purposes.
The probable reason your attempt to get rid of non-ASCII characters fails is that you are removing them before parsing the HTML, rather than from the values you get after parsing. So a literal ✓ would be removed, but if a character reference like &#10003; were used, it would be left alone, get parsed by bs4, and end up as ✓.
(I am sad that the default reaction to Unicode errors always seems to be to try to get rid of non-ASCII characters completely, instead of fixing the code to handle them correctly.)
You're also extracting text in a pretty weird way, using str() to get markup and then trying to pick out the tags and remove them. This is unreliable (HTML is not that straightforward to parse, which is why BeautifulSoup exists) and needless, because you already have a perfectly good HTML parser that can give you the pure text in an element (get_text()).
Most of your code is not necessary: requests is already doing the correct decoding for you, BeautifulSoup is doing the text extraction for you, and Python is doing the correct encoding for you when writing to a file:
import requests
from bs4 import BeautifulSoup

#keyterm = input("Enter a keyword to search:")
URL = 'https://www.google.com/search?q=jaguar&num=30'
#NO_OF_LINKS_TO_BE_EXTRACTED = 10
print("Requesting data from %s" % URL)
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
#print(soup.prettify())
metaM = soup.findAll("span", "st")
#metaM = soup.find("div", { "class" : "f slp" })
with open("output.txt", "w", encoding='utf8') as fp:
    for ele in metaM:
        print("result: %r" % ele)
        fp.write(ele.get_text().replace('\n', ' ') + '\n')

Python - writing unicode strings to a file & beautiful soup

I'm using BeautifulSoup to parse some XML files. One of the fields in these files frequently uses Unicode characters. I've tried, unsuccessfully, to write the Unicode to a file using .encode().
The process so far is basically:
Get the name
gamename = items.find('name').string.strip()
Then incorporate the name into a list which is later converted into a string:
stringtoprint = userid, gamename.encode('utf-8') #
newstring = "INSERT INTO collections VALUES " + str(stringtoprint) + ";" +"\n"
Then write that string to a file.
listofgamesowned.write(newstring.encode("UTF-8"))
It seems that I shouldn't have to .encode quite so often. I had tried encoding directly upon parsing out the name, e.g. gamename = items.find('name').string.strip().encode('utf-8'); however, that did not seem to work.
Currently 'Uudet L\xc3\xb6yt\xc3\xb6retket' is being printed and saved, rather than Uudet Löytöretket.
It seems that if this were a string I was generating myself, I'd use something.write(u'Uudet L\xc3\xb6yt\xc3\xb6retket'); however, here it's one element embedded in a larger string.
Unicode is an in-memory representation of a string. When you write out or read in, you need to encode and decode.
'Uudet L\xc3\xb6yt\xc3\xb6retket' is the utf-8 encoded version of Uudet Löytöretket, so it is what you want to write out. When you want to read the string back from a file, you need to decode it.
>>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'
Uudet Löytöretket
>>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'.decode('utf-8')
Uudet Löytöretket
Just remember to encode immediately before you output and decode immediately after you read it back.
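Incidentally, the literal \xc3\xb6 escapes in the question almost certainly come from calling str() on a tuple, which repr()s its elements. A hedged sketch of building the line without that detour (variable names taken from the question's code):

# str((userid, gamename.encode('utf-8'))) repr()s the tuple elements,
# which is what turns the utf-8 bytes into literal backslash escapes
newstring = 'INSERT INTO collections VALUES (%s, "%s");\n' % (userid, gamename.encode('utf-8'))
listofgamesowned.write(newstring)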

Convert html entities to ascii in Python

I need to convert any html entity into its ASCII equivalent using Python. My use case is that I am cleaning up some HTML used to build emails to create plaintext emails from the HTML.
Right now, I only really know how to create unicode from these entities when I need ASCII (I think), so that the plaintext email reads correctly with things like accented characters. A basic example is the HTML entity &aacute;, i.e. á, being encoded into ASCII.
Furthermore, I'm not even 100% sure that ASCII is what I need for a plaintext email. As you can tell, I'm completely lost on this encoding stuff.
Here is a complete implementation that also handles unicode HTML entities; you might find it useful.
It returns a unicode string, which is not ASCII; if you want plain ASCII, you can modify the replace operations so that they replace the entities with the empty string.
import re
import htmlentitydefs

def convert_html_entities(s):
    # decimal numeric entities, e.g. &#225;
    matches = re.findall(r"&#\d+;", s)
    if len(matches) > 0:
        hits = set(matches)
        for hit in hits:
            name = hit[2:-1]
            try:
                entnum = int(name)
                s = s.replace(hit, unichr(entnum))
            except ValueError:
                pass
    # hexadecimal numeric entities, e.g. &#xE1;
    matches = re.findall(r"&#[xX][0-9a-fA-F]+;", s)
    if len(matches) > 0:
        hits = set(matches)
        for hit in hits:
            hexcode = hit[3:-1]
            try:
                entnum = int(hexcode, 16)
                s = s.replace(hit, unichr(entnum))
            except ValueError:
                pass
    # named entities, e.g. &aacute; -- handle &amp; last, so replacing it
    # cannot re-introduce entities into the string
    matches = re.findall(r"&\w+;", s)
    hits = set(matches)
    amp = "&amp;"
    if amp in hits:
        hits.remove(amp)
    for hit in hits:
        name = hit[1:-1]
        if name in htmlentitydefs.name2codepoint:
            s = s.replace(hit, unichr(htmlentitydefs.name2codepoint[name]))
    s = s.replace(amp, "&")
    return s
Edit: added matching for hex codes. I've been using this for a while now, and ran into my first situation with &#39;, which is a single quote/apostrophe.
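A quick interactive check of the function above, exercising all three entity forms plus &amp;:

>>> convert_html_entities("&aacute; &#225; &#xE1; &amp;")
u'\xe1 \xe1 \xe1 &'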
ASCII is the American Standard Code for Information Interchange and does not include any accented letters. Your best bet is to get Unicode (as you say you can) and encode it as UTF-8 (or maybe ISO-8859-1, or some weird codepage if you're dealing with seriously badly coded user agents/clients, sigh); the Content-Type header of that part, together with text/plain, can express what encoding you've chosen to use. I do recommend trying UTF-8 unless you have positively demonstrated it cannot work: it's almost universally supported these days and MUCH more flexible than any ISO-8859 or "codepage" hack!
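A minimal sketch of that advice applied to the plaintext part of an email using the standard library (the body text here is just an illustration):

from email.mime.text import MIMEText

# an encoded utf-8 body plus an explicit charset makes the email module
# label this part 'text/plain; charset="utf-8"'
body = u'Men\xfc \xe0 la carte'
part = MIMEText(body.encode('utf-8'), 'plain', 'utf-8')
print part['Content-Type']   # text/plain; charset="utf-8"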
You can use the htmlentitydefs package:
import htmlentitydefs
print htmlentitydefs.entitydefs['aacute']
Basically, entitydefs is just a dictionary, and you can see this by printing it at the python prompt:
from pprint import pprint
pprint htmlentitydefs.entitydefs
We put up a little module with agazso's function:
http://github.com/ARTFL/util/blob/master/ents.py
We find agazso's function to be faster than the alternatives for entity conversion. Thanks for posting it.
