Unable to convert string to JSON in Python because of unicode characters

I have a string in Python 3.5 from which I'd like to create a JSON object, but it turns out that the string contains things like this:
"saved_search_almost_max_people_i18n":"You are reaching your current limit of saved people searches. \\u003ca href=\\"/mnyfe/subscriptionv2?displayProducts=&family=general&trk=vsrp_ss_upsell\\"\\u003eLearn more >\\u003c/a\\u003e"
These unicode characters make the json.loads function fail; in fact, if I try to format the string as JSON in any online formatter, multiple errors show up.
As you can see, I'm a Python newbie; I've been looking at many sources and haven't found any solution.
By the way, the string comes from a Beautiful Soup operation:
soup = self.loadSoup(URL)
result = soup.find('code', id=TAG_TO_FIND)
rTxt=str(result)
j = json.loads(rTxt)
The first error I see (if I correct this one, there are many more coming):
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 858 (char 857)
Thanks everybody.

If I understand you correctly, you’re trying to parse an HTML document with Beautiful Soup and extract JSON text out of a particular code element in that document.
If so, the following line is wrong:
rTxt=str(result)
Calling str() on a Beautiful Soup Tag returns its HTML representation. Instead, you want the string attribute:
rTxt=result.string
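To make the difference concrete, here's a minimal sketch (the tag id and the embedded JSON are made up for illustration):
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup('<code id="data">{"key": "value"}</code>', 'html.parser')
result = soup.find('code', id='data')

print(str(result))    # <code id="data">{"key": "value"}</code>  -- not valid JSON
print(result.string)  # {"key": "value"}  -- just the text content

j = json.loads(result.string)  # parses cleanly
print(j['key'])                # value
One caveat: if the element ever contains nested tags, .string returns None; result.get_text() is the more forgiving alternative.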

Related

How can I 'translate' all unicode codes in a string to the actual symbols using Python 3?

I'm parsing web content to isolate the body of news articles from a certain site, for which I'm using urllib.request to retrieve the source code for the article webpage and isolate the main text. However, urllib takes characters like "ç" and puts them into the Python string as their UTF-8 byte notation, "c387". It does the same for the '”' and "„" characters, which print as an 'e' followed by a set of numbers. This is very irritating when trying to read the article and thus needs to be resolved. I could loop through the article and change every recognizable UTF-8 code to the actual character using a tedious function, but I was wondering whether there is a way to do that more easily.
For an example, the current output of my program might be:
e2809eThis country doesn't...e2809d
I would like it to be:
„This country doesn't...”
Note: I've already checked the source code of the web page, which just uses these 'special' characters, so it's definitely a urllib issue.
Thanks in advance!
urllib returns bytes:
>>> import urllib.request
>>> url = 'https://stackoverflow.com/questions/62085906'
>>> data = urllib.request.urlopen(url).read()
>>> type(data)
<class 'bytes'>
>>> idx = data.index(b'characters like')
>>> data[idx:idx+20]
b'characters like "\xc3\xa7"'
Now, let's try to interpret this as UTF-8:
>>> data[idx:idx+20].decode('utf-8')
'characters like "ç"'
Et voilà!
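More generally, you can decode the whole response up front so you only ever work with real text. A minimal sketch (the fallback to UTF-8 is an assumption; servers usually declare their charset in the Content-Type header):
import urllib.request

url = 'https://stackoverflow.com/questions/62085906'
with urllib.request.urlopen(url) as resp:
    # Use the charset the server declares; fall back to UTF-8 otherwise
    charset = resp.headers.get_content_charset() or 'utf-8'
    text = resp.read().decode(charset)

print(type(text))  # <class 'str'> -- real characters, no \xc3\xa7 byte pairs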

Python Webscraping character encoding issues

I am a recent graduate who just began self-learning Python web scraping and, just for fun, I am attempting to build a script that allows me to store the names, episodes and episode descriptions of anime shows from a particular website, using Python's requests, re and other relevant modules.
I have managed to get the web scraping aspect of the script working, which is opening the necessary URLs and retrieving the relevant data. However, one major issue I continuously can't overcome is the different encodings and special HTML character decoding contained within the names of some of the shows.
After going through several Stack Overflow posts, I have come up with the following solution for decoding HTML characters and fixing the encoding:
try:
    # Python 2.6-2.7
    from HTMLParser import HTMLParser
except ImportError:
    # Python 3
    from html.parser import HTMLParser

decodeHTMLSpecialChar = HTMLParser()

def whatisthis(s):
    # This function checks to see if a given string is an ordinary string,
    # a unicode-encoded string or not a string at all
    if isinstance(s, str):
        return "ordinary string"
    elif isinstance(s, unicode):
        return "unicode string"
    else:
        return "not a string"

def DecodeHTMLAndFixEncoding(string_data):
    string_data = decodeHTMLSpecialChar.unescape(string_data)
    encoding_check = whatisthis(string_data)
    if encoding_check != "ordinary string":
        string_data = string_data.encode("utf-8")
    return string_data
All of the above code I obtained from various Stack Overflow solutions.
Although this fixed most of the encoding issues I faced, today I found other issues that I just can't seem to figure out how to solve.
Below are the 2 different strings that result in Python string encoding errors or do not appropriately convert special HTML characters.
ISSUE CASE 1:
string1 = "Musekinin Galaxy☆Tylor"
print(DecodeHTMLAndFixEncoding(string1))
# ...Prints "Musekinin Galaxy☆Tylor"; however, because I have the name stored as a key within a dictionary (to help check whether the name has already been stored), when referencing the key I get the following error:
Error Type: <type 'exceptions.KeyError'>
Error Contents: ('Musekinin Galaxy\xe2\x98\x86Tylor',)
The dictionary where I store the data has the following format:
data = {show name (key):
            {
                description (key2): "Overall description for the show",
                show episode name (key): "Description for episode"
            }
        }
ISSUE CASE 2:
string2 = "Knight's &#038; Magic"
print(DecodeHTMLAndFixEncoding(string2))
Results in... "Knight's &#038; Magic"
# Although this kind of works, it should have resulted in "Knight's & Magic".
I have tried my best to explain the issue I face here. My main question essentially is: is there a simple solution that would,
firstly, allow me to remove symbols, emojis, etc. from a string to ensure it can be used as a dictionary key and later be easily referenced without any issues, and
secondly, decode special HTML character entities (such as the one shown in issue case 2) better than HTMLParser does?
My last request is that I would prefer a solution using stock Python default libraries or modules rather than external ones such as Beautiful Soup. However, if you feel there are some helpful external modules, then please feel free to show me those as well.
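For what it's worth, here is a minimal sketch of one possible approach in Python 3, where the stock html module fully handles numeric entities (the safe_key helper name is hypothetical):
import html
import unicodedata

def safe_key(s):
    # Decode HTML entities first, then drop symbol/emoji characters
    # (Unicode categories starting with 'S') so the result is a stable key
    s = html.unescape(s)
    return ''.join(ch for ch in s if not unicodedata.category(ch).startswith('S'))

print(safe_key("Knight's &#038; Magic"))   # Knight's & Magic
print(safe_key("Musekinin Galaxy☆Tylor"))  # Musekinin GalaxyTylor
Both html and unicodedata are stock modules, which matches the preference for avoiding external packages.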

Python Beautiful Soup, saved text not displaying properly in original encoding

I'm having trouble with a saved file not displaying properly when using its original encoding.
I'm downloading a web page, searching it for content I want and then writing that content to a file.
The encoding on the site is 'iso-8859-1', or so Chrome and Beautiful Soup tell me, and the page appears perfectly when viewed using that encoding on the original site.
When I download the page and try to view it, however, I end up with strange characters (HTML entities?) like these:
“ , ’
If I manually set Chrome's encoding to 'UTF-8' when viewing the saved page it appears normally, as does the original page if I set that to 'UTF-8'.
I'm not sure what to do with this; I would change the encoding before writing the text to a file, but I get ASCII errors when I try that.
Here is a sample page (possible adult content):
http://original.adultfanfiction.net/story.php?no=600106516
And the code I am using to get the text from the page:
site = requests.post(url, allow_redirects=False)
html = site.text
soup = BeautifulSoup(html)
rawStory = soup.findAll("td",{"colspan" : '3'})
story = str(rawStory)
return story
I turn the ResultSet into a string so that I can write it to a file. I don't know if that could be part of the problem; if I print the HTML to the console after requesting it, but before doing anything to it, it displays improperly in the console as well.
I'm 90% sure that your problem is just that you're asking BeautifulSoup for a UTF-8 fragment and then trying to use it as ISO-8859-1, which obviously isn't going to work. The documentation explains all of this pretty nicely.
You're calling str. As the Non-pretty printing section of the documentation explains:
If you just want a string, with no fancy formatting, you can call unicode() or str() on a BeautifulSoup object, or a Tag within it… The str() function returns a string encoded in UTF-8.
As Output encoding explains:
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with.
It then follows up with an example of almost exactly what you're doing—parsing a Latin-1 HTML document and writing it back as UTF-8—and then immediately explaining how to fix it:
If you don’t want UTF-8, you can pass an encoding into prettify()… You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string…
So, that's all you have to do.
However, you've got another problem before you get there. When you call findAll, you don't get back a tag, you get back a ResultSet, which is basically a list of tags. Just as calling str on a list of strings gives you brackets, commas, and the repr of each string (with extraneous quotes and backslash escapes for non-printable-ASCII characters) instead of the strings themselves, calling str on a ResultSet gives you something similar. And you obviously can't call Tag methods on a ResultSet.
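Putting both fixes together, here's a minimal sketch in Python 3 terms (the output file name is arbitrary; the URL is the sample page from the question):
import requests
from bs4 import BeautifulSoup

url = 'http://original.adultfanfiction.net/story.php?no=600106516'
site = requests.post(url, allow_redirects=False)
soup = BeautifulSoup(site.text, 'html.parser')

# findAll returns a ResultSet (a list of Tags), so join the tags
# yourself instead of calling str() on the whole list
cells = soup.findAll('td', {'colspan': '3'})
story = '\n'.join(str(tag) for tag in cells)

# Encode explicitly if you really want Latin-1 output instead of the
# UTF-8 default
with open('story.html', 'w', encoding='iso-8859-1', errors='replace') as f:
    f.write(story)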
Finally, I'm not sure what problem you're actually trying to solve. You're creating a HTML fragment. Ignoring the fact that a fragment isn't a valid document, and browsers shouldn't strictly speaking display it in the first place, it doesn't specify an encoding, meaning the browser can only get that information from some out-of-band place, like you selecting a menu item. Changing it to Latin-1 won't actually "fix" things, it'll just mean that now you get the right display when you pick Latin-1 in the menu and wrong when you pick UTF-8 instead of vice-versa. Why not actually create a full HTML document that actually has a meta http-equiv that actually means what you want it to mean, instead of trying to figure out how to trick Chrome into guessing what you want it to guess?
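For instance, continuing the sketch above, you could wrap the fragment in a minimal full document that declares its own encoding (the title text is arbitrary):
page = (
    '<!DOCTYPE html>\n'
    '<html><head>\n'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    '<title>Saved story</title>\n'
    '</head><body>\n'
    + story +
    '\n</body></html>\n'
)
with open('story.html', 'w', encoding='utf-8') as f:
    f.write(page)
Now any browser can discover the encoding from the document itself, with no menu tricks needed.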

Python: Youtube HTML full of BOMs

I'm trying to parse YouTube comments using BeautifulSoup 4 in Python 2.7. When I try it for any YouTube video, I get text full of BOMs, not just at the file start:
<p>\ufeff thank you kind sir :)</p>
One appears in almost every comment. This is not the case for other websites (guardian.co.uk). The code I'm using:
# Source (should be taken from file to allow updating but not during wip):
source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'
# Get html from source:
response = urllib2.urlopen(source_url)
html = response.read()
# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.decode("utf-8-sig")
soup = BeautifulSoup(html)
strings = soup.findAll("div", {"class" : "comment-body"})
print strings
As you can see, I've tried decoding, but as soon as I soup it, the BOM characters are back. Any ideas?
This seems to be invalid on YouTube's part, but since you can't just tell them to fix it, you need a workaround.
So, here's a simple workaround:
# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.replace(b'\xEF\xBB\xBF', b'')
html = html.decode("utf-8")
(The b prefixes are unnecessary but harmless for Python 2.7, but they'll make your code work in Python 3… on the other hand, they'll break it for Python 2.5, so if that's more important to you, get rid of them.)
Alternatively, you can first decode and then replace(u'\uFEFF', u''). This should have the exact same effect (decoding extra BOMs should work harmlessly). But I think it makes more sense to fix the UTF-8 then decode it, rather than trying to decode and then fixing the result.
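In code, that alternative is just (an equivalent sketch, Python 2.7):
# Decode first, then strip the decoded BOM characters
html = html.decode('utf-8').replace(u'\uFEFF', u'')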

Encode unicode chars to HTML entities in Python, excluding tags

As you may know, for an email to be valid in many clients, all unicode chars must be encoded. I would like to automate this encoding in a Python script.
Obviously tags need to be excluded from the conversion, otherwise the HTML won't work; doing this is really the complicated part. To be sure of success, it is necessary to use a parsing package like lxml or Beautiful Soup.
As far as I know, neither of those two packages supports converting to numbered unicode entities such as &#x6F22; (漢).
Any help would be really invaluable; I've been banging my head against this wall all day!
I’ve had a similar problem; however, it was always enough to run the following expression on the raw text, which just converts hex entities to decimal entities, which are then parsed just fine:
>>> import re
>>> from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
>>> hex_entity_pat = re.compile('&#x([^;]+);')
>>> hex_entity_fix = lambda x: hex_entity_pat.sub(lambda m: '&#%d;' % int(m.group(1), 16), x)  # convert hex to dec entities
>>> BeautifulSoup(hex_entity_fix("<b>&#x6F22;</b>"), convertEntities=BeautifulSoup.ALL_ENTITIES)
<b>漢</b>
I’m assuming that your emails are in HTML, not plain text. I think you are looking for this:
some_unicode_string.encode('ascii', errors='xmlcharrefreplace')
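For example, in Python 3 (note that decimal character references are produced):
>>> '漢'.encode('ascii', errors='xmlcharrefreplace')
b'&#28450;'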
But maybe you can do this some other way. How do you generate the HTML in the first place?
