How to solve UnicodeEncodeError while working with Cyrillic (Russian) letters? - python

I am trying to read an RSS feed using feedparser:
import feedparser

url = 'http://example.com/news.xml'
d = feedparser.parse(url)

f = open('rss.dat', 'w')
for e in d.entries:
    title = e.title
    print >>f, title
f.close()
It works fine with English RSS-feeds but I get a UnicodeEncodeError if I try to display a title written in Cyrillic letters. It happens when I:
Try to write a title into a file.
Try to display a title on the screen.
Try to use it in URL to access a web page.
My question is how to solve this problem easily. I would love to have a solution as simple as this:
new_title = some_function(title)
Maybe there is a way to replace every Cyrillic symbol with its HTML code?

FeedParser itself handles encodings fine, except when the encoding is wrongly declared in the feed. Refer to http://code.google.com/p/feedparser/issues/detail?id=114 for a possible explanation. It seems Python 2.5 uses ascii as the default encoding, which causes problems.
Can you paste the actual feed URL, so we can see how the encoding is declared there? If it appears that the declared encoding is wrong, you'll have to find a way to instruct FeedParser to override the default value.
EDIT: Okay, it seems the error is in the print statement.
Use
f.write(title.encode('utf-8'))
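Putting this together, a minimal sketch of the loop with an explicit encode (Python 2, keeping the placeholder URL from the question):
import feedparser

url = 'http://example.com/news.xml'
d = feedparser.parse(url)

f = open('rss.dat', 'w')
for e in d.entries:
    title = e.title                          # feedparser returns a unicode string
    f.write(title.encode('utf-8') + '\n')    # encode explicitly instead of relying on the ascii default
f.close()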

Related

mimic web URL encode for Chinese character in python

I want to mimic URL encoding for Chinese characters. For my use case, I have a search URL for an e-commerce site
'https://search.jd.com/Search?keyword={}'.format('ipad')
When I search for a product in English, this works fine. However, I need the input to be in Chinese, so I tried
'https://search.jd.com/Search?keyword={}'.format('耐克t恤')
, and found the following encoding under the network tab
https://list.tmall.com/search_product.htm?q=%C4%CD%BF%CBt%D0%F4
So basically, I need to encode inputs like '耐克t恤' into '%C4%CD%BF%CBt%D0%F4'. I'm not sure which encoding the website is using. Also, how do I convert Chinese characters to this encoding with Python?
Update: I checked headers and it seems like content encoding is gzip?
Try the urllib.parse module, more specifically the urllib.parse.urlencode() function. You can pass the encoding (in this case it appears to be 'gb2312') and a dict containing the query parameters to get a valid URL suffix which you can use directly.
In this case, your code will look something like:
import urllib.parse
keyword = '耐克t恤'
url = 'https://search.jd.com/Search?{url_suffix}'.format(url_suffix=urllib.parse.urlencode({'keyword': keyword}, encoding='gb2312'))
More info about encoding here
More info about urlencode here
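For reference, a quick sanity check of what that call produces (assuming Python 3 and that GB2312 really is the site's encoding; the expected value is taken from the URL captured in the question):
import urllib.parse

keyword = '耐克t恤'
suffix = urllib.parse.urlencode({'keyword': keyword}, encoding='gb2312')
print(suffix)                                    # expected: keyword=%C4%CD%BF%CBt%D0%F4
url = 'https://search.jd.com/Search?{}'.format(suffix)
print(url)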
The encoding used seems to be GB2312
This could help you:
def encodeGB2312(data):
    hexData = data.encode(encoding='GB2312').hex().upper()
    encoded = '%' + '%'.join(hexData[i:i + 2] for i in range(0, len(hexData), 2))
    return encoded

output = encodeGB2312('耐克t恤')
print(output)

url = f'https://list.tmall.com/search_product.htm?q={output}'
print(url)
Output:
%C4%CD%BF%CB%74%D0%F4
https://list.tmall.com/search_product.htm?q=%C4%CD%BF%CB%74%D0%F4
The only problem with my code is that it doesn't seem to correspond 100% with the link you are trying to achieve: it percent-encodes the t character as well, while your link uses the unencoded t character. Although it still seems to work when opening the URL.
Edit:
Vignesh Bayari R's post handles the URL in the correct (intended) way, but in this case my solution works too.
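As a side note, a brief sketch using urllib.parse.quote reproduces the exact string from the question, since quote leaves unreserved ASCII characters such as t untouched (again assuming GB2312 is the right encoding):
from urllib.parse import quote

output = quote('耐克t恤', encoding='gb2312')     # Chinese characters percent-encoded, 't' left as-is
print(output)                                    # %C4%CD%BF%CBt%D0%F4
print(f'https://list.tmall.com/search_product.htm?q={output}')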

How can I 'translate' all unicode codes in a string to the actual symbols using Python 3?

I'm parsing web content to isolate the body of news articles from a certain site, using urllib.request to retrieve the source code for the article webpage and isolate the main text. However, urllib takes characters like "ç" and puts them into a Python string in their UTF-8 byte notation, "c387". It does the same for the '”' and "„" characters, which print as an 'e' followed by a set of numbers. This is very irritating when trying to read the article and thus needs to be resolved. I could loop through the article and change every recognizable UTF-8 code to the actual character using a tedious function, but I was wondering if there was a way to do that more easily.
For an example, the current output of my program might be:
e2809eThis country doesn't...e2809d
I would like it to be:
„This country doesn't...”
Note: I've already checked the source code of the web page, which just uses these 'special' characters, so it's definitely a urllib issue.
Thanks in advance!
urllib returns bytes:
>>> import urllib.request
>>> url = 'https://stackoverflow.com/questions/62085906'
>>> data = urllib.request.urlopen(url).read()
>>> type(data)
<class 'bytes'>
>>> idx = data.index(b'characters like')
>>> data[idx:idx+20]
b'characters like "\xc3\xa7"'
Now, let's try to interpret this as utf-8:
>>> data[idx:idx+20].decode('utf-8')
'characters like "ç"'
Et voilà!
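If the goal is the whole page rather than a slice, a minimal sketch (same URL as above) is to decode using the charset the server declares, falling back to UTF-8:
import urllib.request

url = 'https://stackoverflow.com/questions/62085906'
with urllib.request.urlopen(url) as resp:
    charset = resp.headers.get_content_charset() or 'utf-8'   # fall back to utf-8 if none is declared
    text = resp.read().decode(charset)

print(type(text))    # <class 'str'> -- real characters such as "ç", not escaped bytes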

BeautifulSoup 4 converting HTML entities to unicode, but getting junk characters when using print

I am trying to scrape text from the web using BeautifulSoup 4 to parse it out. I am running into an issue when printing bs4 processed text out to the console. Whenever I hit a character that was originally an HTML entity, like ’ I get garbage characters on the console. I believe bs4 is converting these entities to unicode correctly because if I try using another encoding to print out the text, it will complain about the appropriate lack of unicode mapping for a character (like u'\u2019.) I'm not sure why the print function gets confused over these characters. I've tried changing around fonts, which changes the garbage characters, and am on a Windows 7 machine with US-English locale. Here is my code for reference, any help is appreciated. Thanks in advance!
#!/usr/bin/python
import json
import urllib2
import cookielib
import bs4

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Tiguan\
&page=0&api-key=blah"
response = opener.open(url)
articles = response.read()
decoded = json.loads(articles)
totalpages = decoded['response']['meta']['hits']/10
for page in range(totalpages + 1):
    if page > 0:
        url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?\
q=Tiguan&page=" + str(page) + "&api-key=blah"
        response = opener.open(url)
        articles = response.read()
        decoded = json.loads(articles)
    for url in decoded['response']['docs']:
        print url['web_url']
        urlstring = url['web_url']
        art = opener.open(urlstring)
        soup = bs4.BeautifulSoup(art.read())
        goodstuff = soup.findAll('nyt_text')
        for tag in goodstuff:
            print tag.prettify().encode("UTF")
The problem has nothing to do with bs4, or HTML entities, or anything else. You could reproduce the exact same behavior, on most Windows systems, with a one-liner program to print out the same characters that are appearing as garbage when you try to print them, like this:
print u'\u2019'.encode('UTF-8')
The problem here is that, like the vast majority of Windows systems (and nothing else anyone uses in 2013), your default character set is not UTF-8, but something like CP1252.
So, when you encode your Unicode strings to UTF-8 and print those bytes to the console, the console interprets them as CP1252. Which, in this case, means you get â€™ instead of ’.
Changing fonts won't help. The UTF-8 encoding of \u2019 is the three bytes \xe2, \x80, and \x99, and the CP1252 meaning of those three bytes is â, €, and ™.
If you want to encode manually for the console, you need to encode to the right character set, the one your console actually uses. You may be able to get that as sys.stdout.encoding.
Of course you may get an exception trying to encode things for the right character set, because 8-bit character sets like CP1252 can only handle about 240 of the 110K characters in Unicode. The only way to handle that is to use the errors argument to encode to either ignore them or replace them with replacement characters.
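A small sketch of that manual approach (Python 2; the cp1252 fallback is only an assumption for when sys.stdout.encoding cannot be detected):
import sys

text = u'\u2019'                                      # RIGHT SINGLE QUOTATION MARK
console_encoding = sys.stdout.encoding or 'cp1252'    # assumed fallback if detection fails
print text.encode(console_encoding, 'replace')        # 'replace' substitutes any unmappable characters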
Meanwhile, if you haven't read the Unicode HOWTO, you really need to. Especially if you plan to stick with Python 2.x and Windows.
If you're wondering why a few command-line programs seem to be able to get around these problems: Microsoft's solution to the character set problem is to create a whole parallel set of APIs that use 16-bit characters instead of 8-bit, and those APIs always use UTF-16. Unfortunately, many things, like the portable stdio wrappers that Microsoft provides for talking to the console and that Python 2.x relies on, only have the 8-bit API. Which means the problem isn't solved at all. Python 3.x no longer uses those wrappers, and there have been recurring discussions on making some future version talk UTF-16 to the console. But even if that happens in 3.4 (which seems very unlikely), that won't help you as long as you're using 2.x.
@abarnert's answer contains a good explanation of the issue.
In your particular case, you could just pass an encoding parameter to prettify() instead of the default utf-8.
If you are printing to console, you could try to print Unicode directly:
print soup.prettify(encoding=None, formatter='html') # print Unicode
It may fail. If you pass 'ascii', then BeautifulSoup may use numerical character references instead of non-ascii characters:
print soup.prettify('ascii', formatter='html')
It assumes that the current Windows codepage is an ascii-based encoding (most of them are). It should also work if the output is redirected to a file or another program via a pipe.
For portability, you could always print Unicode (encoding=None above) and use PYTHONIOENCODING to get appropriate character encoding e.g., utf-8 for files, pipes and ascii:xmlcharrefreplace to avoid garbage in a console.

Python Beautiful Soup, saved text not displaying properly in original encoding

I'm having trouble with a saved file not displaying properly when viewed using its original encoding.
I'm downloading a web page, searching it for content I want and then writing that content to a file.
The encoding on the site is 'iso-8859-1' or so chrome and beautiful soup tell me and it appears perfectly when viewed using that encoding on the original site.
When I download the page and try to view it, however, I end up with strange characters (HTML entities?) like these:
â€œ , â€™
If I manually set Chrome's encoding to UTF-8 when viewing the saved page it appears normally, as does the original page if I set that to UTF-8.
I'm not sure what to do with this, I would change the encoding before writing the text to a file but I get ascii errors when I try that.
Here is a sample page (possible adult content):
http://original.adultfanfiction.net/story.php?no=600106516
And the code I am using to get the text from the page:
site = requests.post(url, allow_redirects=False)
html = site.text
soup = BeautifulSoup(html)
rawStory = soup.findAll("td",{"colspan" : '3'})
story = str(rawStory)
return story
I turn the ResultSet into a string so that I can write it to a file; I don't know if that could be part of the problem. If I print the HTML to the console after requesting it, but before doing anything to it, it displays improperly in the console as well.
I'm 90% sure that your problem is just that you're asking BeautifulSoup for a UTF-8 fragment and then trying to use it as ISO-8859-1, which obviously isn't going to work. The documentation explains all of this pretty nicely.
You're calling str. As Non-pretty printing explains:
If you just want a string, with no fancy formatting, you can call unicode() or str() on a BeautifulSoup object, or a Tag within it… The str() function returns a string encoded in UTF-8.
As Output encoding explains:
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with.
It then follows up with an example of almost exactly what you're doing—parsing a Latin-1 HTML document and writing it back as UTF-8—and then immediately explaining how to fix it:
If you don’t want UTF-8, you can pass an encoding into prettify()… You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string…
So, that's all you have to do.
However, you've got another problem before you get there. When you call findAll, you don't get back a tag, you get back a ResultSet, which is basically a list of tags. Just as calling str on a list of strings gives you brackets, commas, and the repr of each string (with extraneous quotes and backslash escapes for non-printable-ASCII characters) instead of the strings themselves, calling str on a ResultSet gives you something similar. And you obviously can't call Tag methods on a ResultSet.
Finally, I'm not sure what problem you're actually trying to solve. You're creating an HTML fragment. Ignoring the fact that a fragment isn't a valid document, and browsers shouldn't strictly speaking display it in the first place, it doesn't specify an encoding, meaning the browser can only get that information from some out-of-band place, like you selecting a menu item. Changing it to Latin-1 won't actually "fix" things; it'll just mean that you now get the right display when you pick Latin-1 in the menu and the wrong one when you pick UTF-8, instead of vice-versa. Why not create a full HTML document with a meta http-equiv that actually means what you want it to mean, instead of trying to figure out how to trick Chrome into guessing what you want it to guess?
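Pulling those points together, here is one rough sketch under the question's setup (Python 2; the output file name story.html is just an illustration): join the tags from the ResultSet individually and encode explicitly to ISO-8859-1, with xmlcharrefreplace covering any characters Latin-1 can't represent.
import requests
from bs4 import BeautifulSoup

url = 'http://original.adultfanfiction.net/story.php?no=600106516'   # sample page from the question
site = requests.post(url, allow_redirects=False)
soup = BeautifulSoup(site.text)
cells = soup.findAll("td", {"colspan": '3'})

# Join each tag as Unicode instead of calling str() on the whole ResultSet,
# then encode to the encoding you actually intend to view the file in.
story = u''.join(unicode(tag) for tag in cells)
with open('story.html', 'wb') as f:
    f.write(story.encode('iso-8859-1', 'xmlcharrefreplace'))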

Decoding unicode from Javascript in Python & Django

On a website I have the word pluș sent via POST to a Django view.
It is sent as plu%25C8%2599. So I took that string and tried to figure out a way to turn %25C8%2599 back into ș.
I tried decoding the string like this:
from urllib import unquote_plus
s = "plu%25C8%2599"
print unquote_plus(unquote_plus(s).decode('utf-8'))
The result I get is pluÈ, which actually has a length of 5, not 4.
How can I get the original string pluș after it's encoded ?
edit:
I managed to do it like this
def js_unquote(quoted):
    quoted = quoted.encode('utf-8')
    quoted = unquote_plus(unquote_plus(quoted)).decode('utf-8')
    return quoted
It looks weird but works the way I needed it.
URL-decode twice, then decode as UTF-8.
You can't unless you know what the encoding is. Unicode itself is not an encoding. You might try BeautifulSoup or UnicodeDammit, which might help you get the result you were hoping for.
http://www.crummy.com/software/BeautifulSoup/
I hope this helps!
Also take a look at:
http://www.joelonsoftware.com/articles/Unicode.html
unquote_plus(s).encode('your_lang_encoding')
I tried it like that: I sent a JSON POST request from an HTML form directly to a Django URI that included Unicode characters like "şğüöçı+", and it works. I used the iso-8859-9 encoding in the encode() function.
