Python Webscraping character encoding issues

I am a recent graduate who just began teaching myself Python web scraping, and, just for fun, I am attempting to build a script that stores the names, episodes, and episode descriptions of anime shows from a particular website, using Python's requests, re, and other relevant modules.
I have managed to get the web scraping aspect of the script working, i.e. opening the necessary URLs and retrieving the relevant data. However, one major issue I continually can't overcome is the different encodings and special HTML character references contained within the names of some of the shows.
After going through several Stack Overflow posts, I have come up with the following solution for decoding HTML characters and also fixing the encoding:
try:
    # Python 2.6-2.7
    from HTMLParser import HTMLParser
except ImportError:
    # Python 3
    from html.parser import HTMLParser

decodeHTMLSpecialChar = HTMLParser()

def whatisthis(s):
    # Checks whether a given value is an ordinary (byte) string, a unicode
    # string, or not a string at all (Python 2 semantics; `unicode` does
    # not exist in Python 3).
    if isinstance(s, str):
        return "ordinary string"
    elif isinstance(s, unicode):
        return "unicode string"
    else:
        return "not a string"

def DecodeHTMLAndFixEncoding(string_data):
    string_data = decodeHTMLSpecialChar.unescape(string_data)
    encoding_check = whatisthis(string_data)
    if encoding_check != "ordinary string":
        string_data = string_data.encode("utf-8")
    return string_data
All of the above code I pieced together from various Stack Overflow answers.
Although this fixed most of the encoding issues I faced, today I found other issues that I just can't seem to figure out how to solve.
Below are the two different strings that are producing Python string encoding errors or are not appropriately converting special HTML characters.
ISSUE CASE 1:
string1 = "Musekinin Galaxy☆Tylor"
print(DecodeHTMLAndFixEncoding(string1))
#...Results in "Musekinin Galaxy☆Tylor". However, because I have the name stored as a key within a dictionary (to check whether the name has already been stored or not), when referencing the key I get the following error:
Error Type: <type 'exceptions.KeyError'>
Error Contents: ('Musekinin Galaxy\xe2\x98\x86Tylor',)
The dictionary where I store the data has the following format (show_name and episode_name are the keys):
data = {show_name:
            {
                "description": "Overall description for the show",
                episode_name: "Description for episode",
            }
        }
ISSUE CASE 2:
string2 = "Knight's &#038; Magic"
print(DecodeHTMLAndFixEncoding(string2))
Results in... "Knight's &#038; Magic"
# Although this kind of works, it should have resulted in "Knight's & Magic".
I have tried my best to explain the issue I face here. My main question, essentially, is whether there is a simple solution that would:
firstly, allow me to remove symbols, emojis, etc. from a string to ensure it can be used as a dictionary key and later be referenced without any issues, and
secondly, decode special HTML character references better than HTMLParser does, as shown in issue case 2.
My last request is that I would prefer a solution using the stock Python standard library rather than external modules such as BeautifulSoup. However, if you feel there are some helpful external modules, then please feel free to show me those as well.
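For reference, a rough Python 3 sketch of a stdlib-only approach along these lines, using html.unescape for character references and unicodedata to strip symbol characters (the function name is made up for illustration, and this is not tested against every case):
import html
import unicodedata

def clean_show_name(raw):
    # Decode HTML character references, including numeric ones like &#038;.
    text = html.unescape(raw)
    # Drop characters in the Unicode "Symbol" categories (Sm, Sc, Sk, So),
    # which covers stars, emoji and the like, so the name is a stable key.
    return ''.join(c for c in text if not unicodedata.category(c).startswith('S'))

print(clean_show_name(u"Musekinin Galaxy\u2606Tylor"))  # Musekinin GalaxyTylor
print(clean_show_name("Knight's &#038; Magic"))         # Knight's & Magic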

Related

How can I 'translate' all unicode codes in a string to the actual symbols using Python 3?

I'm parsing web content to isolate the body of news articles from a certain site, using urllib.request to retrieve the source code for the article webpage and isolate the main text. However, urllib takes characters like "ç" and puts them into a Python string as their UTF-8 byte notation, e.g. "c387". It does the same for the '”' and '„' characters, which print as an 'e' followed by a set of numbers. This is very irritating when trying to read the article, and thus needs to be resolved. I could loop through the article and change every recognizable UTF-8 code to the actual character using a tedious function, but I was wondering whether there is a way to do that more easily.
For an example, the current output of my program might be:
e2809eThis country doesn't...e2809d
I would like it to be:
„This country doesn't...”
Note: I've already checked the source code of the web page, which just uses these 'special' characters, so it's definitely a urllib issue.
Thanks in advance!
urllib returns bytes:
>>> import urllib.request
>>> url = 'https://stackoverflow.com/questions/62085906'
>>> data = urllib.request.urlopen(url).read()
>>> type(data)
<class 'bytes'>
>>> idx = data.index(b'characters like')
>>> data[idx:idx+20]
b'characters like "\xc3\xa7"'
Now, let's try to interpret this as UTF-8:
>>> data[idx:idx+20].decode('utf-8')
'characters like "ç"'
Et voilà!
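More generally, you can decode the whole response with the charset declared in the HTTP headers, falling back to UTF-8 when none is declared (a sketch; the URL is a placeholder):
import urllib.request

with urllib.request.urlopen('https://example.com/article') as resp:
    # get_content_charset() reads the charset from the Content-Type header.
    charset = resp.headers.get_content_charset() or 'utf-8'
    text = resp.read().decode(charset)
print(text[:100])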

Is there a way to get around unicode issues when using win32api/com modules in python 3?

I've looked around and haven't found anything just yet. I'm going through emails in an inbox and checking for a specific word set. It works on most emails, but some of them don't parse. I checked the broken emails using:
print (msg.Body.encode('utf8'))
and my problem messages all start with b', like this:
b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5\xe3\xb9\xa4\xe0\xa8\x8d\xe6\xb4\xbc\xe7\x91\xa5\xe2\x81\xa1\xe7\x91\x
I think this is forcing Python to read the body as bytes, but I'm not sure. Either way, after the b', no matter what encoding I try, I get nothing but garbage text.
I've tried other encoding and decoding methods as well, but I'm just getting a ton of attribute errors.
import win32api
import win32com.client
import datetime
import os
import time

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
dater = datetime.date.today() - datetime.timedelta(days = 1)
dater = str(dater.strftime("%m-%d-%Y"))
print (dater)

#for folders in outlook.folders:
#    print(folders)

Receipt = outlook.folders[8]
print(Receipt)
Ritems = Receipt.folders["Inbox"]
Rmessage = Ritems.items
for msg in Rmessage:
    if (msg.Class == 46 and msg.CreationTime.strftime("%m-%d-%Y") == dater):
        print (msg.CreationTime)
        print (msg.Subject)
        print (msg.Body.encode('utf8'))
        print ('..............................')
The end result is to have the message printed out in the console, or at least to give Python a way to read it so I can find the text I'm looking for in the body.
The byte literal posted in the question is valid UTF-8. First two characters are U+683C and U+6D74 from the CJK Unified Ideographs block, U+4E00 - U+9FFF.
Since you don't know the source encoding there is no way to be completely sure about it, but chances are that the email body is just Han characters encoded in UTF-8 (see: Determine the encoding of text in Python). If you are not able to see the UTF-8 characters correctly, you should check your terminal or display character set.
That said, you should get the fundamentals of character representation right. Randomly encoding or decoding is hardly going to solve anything. I would suggest you begin by reading Spolsky's introduction to Unicode and then move on to Batchelder on Unicode in Python.
As martineau said, the proper encoding I was searching for was UTF-16. The other messages were encoded using UTF-8. So a simple mail scrape turned out to be an excellent lesson in encoding as well as message classes (off topic). Thanks for the help.
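For reference, a small sketch of that fallback (the helper name decode_body is invented for illustration): try UTF-8 first, then UTF-16, for the message bodies that fail:
def decode_body(raw_bytes):
    # Try the likely encodings in order and return the first clean decode.
    for enc in ('utf-8', 'utf-16'):
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark the undecodable bytes.
    return raw_bytes.decode('utf-8', errors='replace')

print(decode_body(b'\xe6\xa0\xbc\xe6\xb5\xb4'))  # valid UTF-8, decodes cleanly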

python requests library giving wrong value for single quotes

Facing some issues calling an API using the requests library. The problem is described as follows.
The code:
r = requests.post(url, data=json.dumps(json_data), headers=headers)
When I look at r.text, the apostrophe in the string comes back like this: Bachelor\u2019s Degree. This should actually give me the response as Bachelor's Degree.
I tried json.loads as well, but the single quote problem remains the same.
How do I get the string value correctly?
What you see here ("Bachelor\u2019s Degree") is the string's inner representation, where "\u2019" is the Unicode code point for RIGHT SINGLE QUOTATION MARK. This is perfectly correct; there's nothing wrong here. If you print() this string, you'll get what you expect:
>>> s = 'Bachelor\u2019s Degree'
>>> print(s)
Bachelor’s Degree
Learning about unicode and encodings might save you quite some time FWIW.
EDIT:
    "When I save it in the DB and then display it in HTML, it will cause an issue, right?"
Have you tried?
Your database connector is supposed to encode it to the proper encoding (according to your fields, tables and client encoding settings).
As for displaying it in HTML, it mostly depends on whether you're using Python 2.7.x or Python 3.x AND on how you build your HTML, but if you're using some decent framework with a decent template engine (if not, you should reconsider your stack), chances are it will work out of the box.
As I already mentioned, learning about unicode and encodings will save you a lot of time.
It's just the escaped representation of a Unicode character; it is not "wrong".
string = 'Bachelor\u2019s Degree'
print(string)
Bachelor’s Degree
You can encode and decode it again, but I can't see any reason why you would want to do that (this might not work in Python 2):
string = 'Bachelor\u2019s Degree'.encode().decode('utf-8')
print(string)
Bachelor’s Degree
From the requests docs:
    When you make a request, Requests makes educated guesses about the
    encoding of the response based on the HTTP headers. The text encoding
    guessed by Requests is used when you access r.text.
On the response object, you may use .content instead of .text to get the raw response bytes and decode them yourself.
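For illustration, a short sketch of both options (the URL is a placeholder):
import requests

r = requests.get('https://example.com/api')
print(r.encoding)                 # requests' guess, taken from the HTTP headers
r.encoding = 'utf-8'              # override the guess if you know better
print(r.text)                     # text decoded with r.encoding
print(r.content.decode('utf-8'))  # or decode the raw bytes yourself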

BeautifulSoup 4 converting HTML entities to unicode, but getting junk characters when using print

I am trying to scrape text from the web using BeautifulSoup 4 to parse it out. I am running into an issue when printing bs4-processed text to the console. Whenever I hit a character that was originally an HTML entity, like ’, I get garbage characters on the console. I believe bs4 is converting these entities to Unicode correctly, because if I try using another encoding to print out the text, it will complain about the appropriate lack of a Unicode mapping for a character (like u'\u2019). I'm not sure why the print function gets confused by these characters. I've tried changing fonts, which changes the garbage characters, and I am on a Windows 7 machine with a US-English locale. Here is my code for reference; any help is appreciated. Thanks in advance!
#!/usr/bin/python
import json
import urllib2
import cookielib
import bs4

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Tiguan\
&page=0&api-key=blah"
response = opener.open(url)
articles = response.read()
decoded = json.loads(articles)
totalpages = decoded['response']['meta']['hits']/10
for page in range(totalpages + 1):
    if page > 0:
        url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?\
q=Tiguan&page=" + str(page) + "&api-key=blah"
        response = opener.open(url)
        articles = response.read()
        decoded = json.loads(articles)
    for url in decoded['response']['docs']:
        print url['web_url']
        urlstring = url['web_url']
        art = opener.open(urlstring)
        soup = bs4.BeautifulSoup(art.read())
        goodstuff = soup.findAll('nyt_text')
        for tag in goodstuff:
            print tag.prettify().encode("UTF")
The problem has nothing to do with bs4, or HTML entities, or anything else. You could reproduce the exact same behavior, on most Windows systems, with a one-liner program to print out the same characters that are appearing as garbage when you try to print them, like this:
print u'\u2019'.encode('UTF-8')
The problem here is that, like the vast majority of Windows systems (and nothing else anyone uses in 2013), your default character set is not UTF-8, but something like CP1252.
So, when you encode your Unicode strings to UTF-8 and print those bytes to the console, the console interprets them as CP1252. Which, in this case, means you get ’ instead of ’.
Changing fonts won't help. The UTF-8 encoding of \u2019 is the three bytes \xe2, \x80, and \x99, and the CP1252 meaning of those three bytes is â, €, and ™.
If you want to encode manually for the console, you need to encode to the right character set, the one your console actually uses. You may be able to get that as sys.stdout.encoding.
Of course you may get an exception trying to encode things for the right character set, because 8-bit character sets like CP1252 can only handle about 240 of the 110K characters in Unicode. The only way to handle that is to use the errors argument to encode to either ignore them or replace them with replacement characters.
Meanwhile, if you haven't read the Unicode HOWTO, you really need to. Especially if you plan to stick with Python 2.x and Windows.
If you're wondering why a few command-line programs seem to be able to get around these problems: Microsoft's solution to the character set problem is to create a whole parallel set of APIs that use 16-bit characters instead of 8-bit, and those APIs always use UTF-16. Unfortunately, many things, like the portable stdio wrappers that Microsoft provides for talking to the console and that Python 2.x relies on, only have the 8-bit API. Which means the problem isn't solved at all. Python 3.x no longer uses those wrappers, and there have been recurring discussions on making some future version talk UTF-16 to the console. But even if that happens in 3.4 (which seems very unlikely), that won't help you as long as you're using 2.x.
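For illustration, a minimal Python 2 sketch of the manual-encoding advice above (the sample string is invented):
# -*- coding: utf-8 -*-
import sys

text = u'Tiguan\u2019s engine'
# Encode to whatever the console actually uses, not hard-coded UTF-8;
# 'replace' substitutes '?' for characters the codepage cannot represent.
console_encoding = sys.stdout.encoding or 'ascii'
print text.encode(console_encoding, 'replace')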
@abarnert's answer contains a good explanation of the issue.
In your particular case, you could just pass the encoding parameter to prettify() instead of the default utf-8.
If you are printing to console, you could try to print Unicode directly:
print soup.prettify(encoding=None, formatter='html') # print Unicode
It may fail. If you pass 'ascii', then BeautifulSoup may use numerical character references instead of non-ASCII characters:
print soup.prettify('ascii', formatter='html')
This assumes that the current Windows codepage is an ASCII-based encoding (most of them are). It should also work if the output is redirected to a file or another program via a pipe.
For portability, you could always print Unicode (encoding=None above) and use the PYTHONIOENCODING environment variable to set the appropriate character encoding, e.g., utf-8 for files and pipes, and ascii:xmlcharrefreplace to avoid garbage in a console.

How to solve UnicodeEncodeError while working with Cyrillic (Russian) letters?

I am trying to read an RSS feed using feedparser.
import feedparser

url = 'http://example.com/news.xml'
d = feedparser.parse(url)
f = open('rss.dat', 'w')
for e in d.entries:
    title = e.title
    print >>f, title
f.close()
It works fine with English RSS feeds, but I get a UnicodeEncodeError if I try to handle a title written in Cyrillic letters. It happens when I:
Try to write the title to a file.
Try to display the title on the screen.
Try to use it in a URL to access a web page.
My question is how to solve this problem easily. I would love to have a solution as simple as this:
new_title = some_function(title)
Maybe there is a way to replace every Cyrillic symbol with its HTML code?
feedparser itself handles encodings fine, except in the case where the encoding is wrongly declared. Refer to http://code.google.com/p/feedparser/issues/detail?id=114 for a possible explanation. It seems Python 2.5 uses ascii as the default encoding, which causes problems.
Can you paste the actual feed URL, so we can see how the encoding is declared there? If it appears that the declared encoding is wrong, you'll have to find a way to instruct feedparser to override the default value.
EDIT: Okay, it seems the error is in the print statement. Use:
f.write(title.encode('utf-8'))
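For completeness, a minimal corrected version of the loop under the same assumptions (Python 2, same placeholder feed URL):
import feedparser

url = 'http://example.com/news.xml'
d = feedparser.parse(url)
f = open('rss.dat', 'w')
for e in d.entries:
    # feedparser returns unicode titles; encode explicitly before writing.
    f.write(e.title.encode('utf-8') + '\n')
f.close()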
