I have created a CrawlSpider with Scrapy. I need to get a specific part of the page with an XPath:
item = ExplorerItem()
item['article'] = response.xpath("//div[@class='post-content']").extract()
Then I am using this item in pipelines.py.
But item['article'] gives me a result in unicode:
`u'<div class="post-content">\n\t\t\t\t\t<h2>D\xe9signation</h2>\n<p>`
I need to convert it to UTF-8.
What you are seeing when you see \xe9 or \xe7 are Unicode escape sequences for accented characters. You may have some luck with the Unidecode module; I have used it before with success. That said, those characters are fine as they are; I think your console just isn't set to render them. Web pages and source data don't always tell the truth about their encoding, and data is often a jumble of encodings. Unidecode will do its best to represent each character in ASCII.
from unidecode import unidecode
unidecode(u"\u5317\u4EB0")  # the u prefix marks a Unicode literal; returns 'Bei Jing '
Set FEED_EXPORT_ENCODING = 'utf-8' in settings.py
See docs here https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_ENCODING
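For reference, a minimal sketch of where that setting goes; the rest of settings.py stays whatever your project already has:
# settings.py -- Scrapy project settings
# Export feeds (JSON, CSV, XML, ...) as UTF-8 instead of ASCII \uXXXX escapes
FEED_EXPORT_ENCODING = 'utf-8'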
According to this older answer, Python 3 strings are UTF-8 compliant by default. But in my web scraper using BeautifulSoup, when I try to print or display a URL, the Japanese characters show up as '%E3%81%82' or '%E3%81%B3' instead of the actual characters.
This Japanese website is the one I'm collecting information from, more specifically the URLs that correspond to the links in the clickable letter buttons. When you hover over, for example, あ, your browser will show you that the link you're about to click on is https://kokugo.jitenon.jp/cat/gojuon.php?word=あ. However, extracting the ["href"] property of the link using BeautifulSoup, I get https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82.
Both versions link to the same web page, but for the sake of debugging, I'm wondering if it's possible to make sure the displayed string contains the actual Japanese character. If not, how can I convert the string to accommodate this purpose?
It's called Percent-encoding:
Percent-encoding, also known as URL encoding, is a method to encode arbitrary data in a Uniform Resource Identifier (URI)
using only the limited US-ASCII characters legal within a URI.
Apply the unquote method from the urllib.parse module:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
Replace %xx escapes by their single-character equivalent. The
optional encoding and errors parameters specify how to decode
percent-encoded sequences into Unicode characters, as accepted by the
bytes.decode() method.
string must be a str. Changed in version 3.9: string parameter
supports bytes and str objects (previously only str).
encoding defaults to 'utf-8'. errors defaults to 'replace',
meaning invalid sequences are replaced by a placeholder character.
Example:
from urllib.parse import unquote
encodedUrl = 'JapaneseChars%E3%81%82or%E3%81%B3'
decodedUrl = unquote( encodedUrl )
print( decodedUrl )
JapaneseCharsあorび
One can apply the unquote method to almost any string, even if already decoded:
print( unquote(decodedUrl) )
JapaneseCharsあorび
I do not understand why, when I make an HTTP request using the Requests library and then display r.text, special characters (such as accents) appear encoded (&eacute; = é, for example).
Yet when I try r.encoding, I get utf-8.
In addition, the problem occurs only on some websites. Sometimes I have the correct characters, but other times, not at all.
Try as follows:
import requests

r = requests.get("https://gks.gs/login")
print r.text
There are encoded characters in what is displayed; for example, we can see Mot de passe oubli&eacute; ?.
I do not understand why. Do you think it may be because of https? How to fix this please?
These are HTML character entity references; the easiest way to decode them is:
In Python 2.x:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('oubli&eacute;')
u'oubli\xe9'
In Python 3.x:
>>> import html.parser
>>> html.parser.HTMLParser().unescape('oubli&eacute;')
'oublié'
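On Python 3.4 and later, the same decoding is exposed as html.unescape (the unescape method shown above was eventually deprecated and removed), so a minimal equivalent is:
import html
print(html.unescape('Mot de passe oubli&eacute; ?'))  # -> Mot de passe oublié ?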
These are HTML escape codes, defined in the HTML Coded Character Set. Even though a certain document may be encoded in UTF-8, HTML (and its grandparent, SGML) were defined back in the good old days of ASCII. A system accessing an HTML page on the WWW may or may not natively support extended characters, and the developers needed a way to define "advanced" characters for some users, while failing gracefully for other users whose systems could not support them. Since UTF-8 standardization was only a gleam in its founders' eyes at that point, an encoding system was developed to describe characters that weren't part of ASCII. It was up to the browser developers to implement a way of displaying those extended characters, either through glyphs or through extended fonts.
Encoding special characters using &something; is "legal" in any HTML and, despite looking a bit strange, such references are to be considered valid.
The text is supposed to be rendered by an HTML browser, and it will render correctly regardless of whether a character is encoded using such a construct or written directly.
For instructions on how to convert these encoded characters, see HTML Entity Codes to Text
Those are HTML escape codes, often referred to as HTML entities. As you see, HTML uses its own code to replace reserved symbols.
You can use the HTMLParser library:
import HTMLParser

parser = HTMLParser.HTMLParser()
parsed = parser.unescape(r.text)
I am trying to scrape text from the web using BeautifulSoup 4 to parse it out. I am running into an issue when printing bs4-processed text to the console: whenever I hit a character that was originally an HTML entity, like &rsquo;, I get garbage characters on the console. I believe bs4 is converting these entities to Unicode correctly, because if I try using another encoding to print the text, it will complain about the appropriate lack of a Unicode mapping for a character (like u'\u2019). I'm not sure why the print function gets confused over these characters.
I've tried changing fonts, which changes the garbage characters, and I am on a Windows 7 machine with a US-English locale. Here is my code for reference; any help is appreciated. Thanks in advance!
#!/usr/bin/python
import json
import urllib2
import cookielib
import bs4
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Tiguan\
&page=0&api-key=blah"
response = opener.open(url)
articles = response.read()
decoded = json.loads(articles)
totalpages = decoded['response']['meta']['hits']/10
for page in range(totalpages + 1):
    if page>0:
        url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?\
q=Tiguan&page=" + str(page) + "&api-key=blah"
        response = opener.open(url)
        articles = response.read()
        decoded = json.loads(articles)
    for url in decoded['response']['docs']:
        print url['web_url']
        urlstring = url['web_url']
        art = opener.open(urlstring)
        soup = bs4.BeautifulSoup(art.read())
        goodstuff = soup.findAll('nyt_text')
        for tag in goodstuff:
            print tag.prettify().encode("UTF")
The problem has nothing to do with bs4, or HTML entities, or anything else. You could reproduce the exact same behavior, on most Windows systems, with a one-liner program to print out the same characters that are appearing as garbage when you try to print them, like this:
print u'\u2019'.encode('UTF-8')
The problem here is that, like the vast majority of Windows systems (and nothing else anyone uses in 2013), your default character set is not UTF-8, but something like CP1252.
So, when you encode your Unicode strings to UTF-8 and print those bytes to the console, the console interprets them as CP1252. Which, in this case, means you get â€™ instead of ’.
Changing fonts won't help. The UTF-8 encoding of \u2019 is the three bytes \xe2, \x80, and \x99, and the CP1252 meaning of those three bytes is â, €, and ™.
If you want to encode manually for the console, you need to encode to the right character set, the one your console actually uses. You may be able to get that as sys.stdout.encoding.
Of course you may get an exception trying to encode things for the right character set, because 8-bit character sets like CP1252 can only handle about 240 of the 110K characters in Unicode. The only way to handle that is to use the errors argument to encode to either ignore them or replace them with replacement characters.
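For illustration, a minimal Python 2 sketch of that manual approach (the variable names here are placeholders, not from the original code):
import sys

text = u'\u2019'                                    # RIGHT SINGLE QUOTATION MARK
console_encoding = sys.stdout.encoding or 'ascii'   # what the console actually uses, if known
print text.encode(console_encoding, 'replace')      # unmappable characters become '?'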
Meanwhile, if you haven't read the Unicode HOWTO, you really need to. Especially if you plan to stick with Python 2.x and Windows.
If you're wondering why a few command-line programs seem to be able to get around these problems: Microsoft's solution to the character set problem is to create a whole parallel set of APIs that use 16-bit characters instead of 8-bit, and those APIs always use UTF-16. Unfortunately, many things, like the portable stdio wrappers that Microsoft provides for talking to the console and that Python 2.x relies on, only have the 8-bit API. Which means the problem isn't solved at all. Python 3.x no longer uses those wrappers, and there have been recurring discussions on making some future version talk UTF-16 to the console. But even if that happens in 3.4 (which seems very unlikely), that won't help you as long as you're using 2.x.
@abarnert's answer contains a good explanation of the issue.
In your particular case, you could just pass an encoding parameter to prettify() instead of the default utf-8.
If you are printing to console, you could try to print Unicode directly:
print soup.prettify(encoding=None, formatter='html') # print Unicode
It may fail. If you pass 'ascii', then BeautifulSoup may use numerical character references instead of non-ASCII characters:
print soup.prettify('ascii', formatter='html')
It assumes that the current Windows codepage is an ASCII-based encoding (most of them are). It should also work if the output is redirected to a file or to another program via a pipe.
For portability, you could always print Unicode (encoding=None above) and use PYTHONIOENCODING to specify the appropriate character encoding, e.g., utf-8 for files and pipes, and ascii:xmlcharrefreplace to avoid garbage in a console.
Now I'm working on Wikipedia. In many articles, I noticed that some URLs, for example https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99, are very long. The example URL can be replaced with "https://www.google.com/search?q=%26ฉัน" (ฉัน is a Thai word), which is shorter and cleaner. However, when I use the urllib.unquote function to decode the URL, it decodes even %26, and I get "https://www.google.com/search?q=&ฉัน" as the result. As you might have noticed, this URL is useless; it doesn't make a valid link.
Therefore, I want to know how to decode the link while keeping it valid. I think that decoding only the non-ASCII characters would give a valid URL. Is that correct, and how can I do that?
Thanks :)
The easiest way is to replace every percent-encoded sequence below %80 (%00-%7F) with some placeholder, do the URL decode, and then put the original percent-encoded sequences back in place of the placeholders.
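A variation on the same idea, without placeholders, is to unquote only the escapes at or above %80. A minimal Python 2 sketch (urllib.unquote as in your question; the helper name unquote_non_ascii is made up here):
# -*- coding: utf-8 -*-
import re
import urllib

def unquote_non_ascii(url):
    # Decode only runs of %XX escapes whose byte value is >= 0x80;
    # ASCII escapes such as %26 are left exactly as they are.
    return re.sub(r'(?:%[89A-Fa-f][0-9A-Fa-f])+',
                  lambda m: urllib.unquote(m.group(0)),
                  url)

print unquote_non_ascii('https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99')
# -> https://www.google.com/search?q=%26ฉัน (as UTF-8 bytes)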
Another way is to look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. You can see the Wikipedia entry for UTF-8 for how UTF-8 characters are encoded.
So, when encoded in URLs, each valid non-ascii UTF-8 character would follow one of these patterns:
(%C0-%DF)(%80-%BF)
(%E0-%EF)(%80-%BF)(%80-%BF)
(%F0-%F7)(%80-%BF)(%80-%BF)(%80-%BF)
(%F8-%FB)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)
(%FC-%FD)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)
So you can match these patterns in the URL and unquote each character separately.
However, remember that not all URLs are encoded in UTF-8.
Some older websites still use other character sets, such as Windows-874 for the Thai language.
In such cases, "ฉัน" for that particular website is encoded as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode it using urllib.unquote you will get some garbled text like "?ѹ" instead of "ฉัน" and that could break the link.
So you have to be careful and check whether the URL decoding breaks the link or not. Make sure that the URL you're decoding is in UTF-8.
Hello, I was wondering if you know any other way to encode a string to be URL-safe, because urllib.quote seems to be doing it wrong; the output is different than expected:
If I try
urllib.quote('á')
I get
'%C3%A1'
But that's not the correct output; it should be
%E1
As demonstrated by the tool provided on this site.
And this is not me being difficult; the incorrect output of quote is preventing the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
and then try the same with the JavaScript tool I mentioned, I get these strings, respectively:
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how they are almost the same, but the URL produced by quote doesn't work while the other one does.
I tried messing with encode('utf-8') on the string passed to quote, but it does not make a difference.
I tried with other Spanish words with accents and the ñ; they are all represented differently.
Is this a python bug?
Do you know of some module that gets this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
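A small worked illustration of that point (Python 2, as in the question): 'á' is the two bytes C3 A1 in UTF-8, and quote simply percent-encodes those bytes:
# -*- coding: utf-8 -*-
import urllib

print repr(u'á'.encode('utf-8'))          # '\xc3\xa1' -- the UTF-8 bytes
print urllib.quote(u'á'.encode('utf-8'))  # '%C3%A1'  -- those bytes, percent-encoded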
OK, got it, I have to encode to ISO-8859-1, like this:
word = u'á'
word = word.encode('iso-8859-1')
print word
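If the legacy %E1 form really is required, a hedged follow-up to the snippet above is to pass those Latin-1 bytes on to quote (this only works when every character fits in ISO-8859-1):
# -*- coding: utf-8 -*-
import urllib

word = u'á'
print urllib.quote(word.encode('iso-8859-1'))  # -> '%E1'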
Python source is interpreted as ASCII by default, so even though your file may be encoded differently, your UTF-8 character is interpreted as two ASCII chars.
Try putting a comment like this on the first or second line of your code to match the file encoding, and you might need to use u'á' as well.
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems someone wrote a pretty large function to convert to ASCII URLs, which is what I need. But I was hoping there was some encoding tool in the standard library for the job.