According to this older answer, Python 3 strings are UTF-8 compliant by default. But in my web scraper using BeautifulSoup, when I try to print or display a URL, the Japanese characters show up as '%E3%81%82' or '%E3%81%B3' instead of the actual characters.
This Japanese website is the one I'm collecting information from, more specifically the URLs that correspond to the links in the clickable letter buttons. When you hover over, for example, あ, your browser will show you that the link you're about to click on is https://kokugo.jitenon.jp/cat/gojuon.php?word=あ. However, extracting the ["href"] attribute of the link using BeautifulSoup, I get https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82.
Both versions link to the same web page, but for the sake of debugging, I'm wondering if it's possible to make sure the displayed string contains the actual Japanese character. If not, how can I convert the string to accommodate this purpose?
It's called Percent-encoding:
Percent-encoding, also known as URL encoding, is a method to encode arbitrary data in a Uniform Resource Identifier (URI)
using only the limited US-ASCII characters legal within a URI.
Apply the unquote function from the urllib.parse module:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
Replace %xx escapes by their single-character equivalent. The
optional encoding and errors parameters specify how to decode
percent-encoded sequences into Unicode characters, as accepted by the
bytes.decode() method.
string must be a str. Changed in version 3.9: string parameter
supports bytes and str objects (previously only str).
encoding defaults to 'utf-8'. errors defaults to 'replace',
meaning invalid sequences are replaced by a placeholder character.
Example:
from urllib.parse import unquote

encodedUrl = 'JapaneseChars%E3%81%82or%E3%81%B3'
decodedUrl = unquote(encodedUrl)
print(decodedUrl)
JapaneseCharsあorび
One can apply the unquote function to almost any string, even one that has already been decoded:
print(unquote(decodedUrl))
JapaneseCharsあorび
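Applied to the URL from the question, a quick check (the href value is copied verbatim from the question above):
from urllib.parse import unquote

href = 'https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82'
print(unquote(href))
# https://kokugo.jitenon.jp/cat/gojuon.php?word=あ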
I have strings that I need to substitute into a URL for accessing different JSON files. My problem is that some strings have special characters, and I need only these as UTF-8 bytes, so I can properly find the JSON tables.
An example:
# I have this string
a = 'code - Brasilândia'
#in the JSON url it appears as
'code%20-%20Brasil%C3%A2ndia'
I managed to get the spaces converted right using urllib.quote(), but it does not convert the special characters as I need them.
print(urllib.quote('code - Brasilândia'))
'code%20-%20Brasil%83ndia'
When I substitute this in the URL, I cannot reach the JSON table.
I managed to make this work using u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be a user input, and will need to be constantly changed.
I have tried several methods, but I could not get the result I need.
I'm specifically using python 2.7 for this project, and I cannot change it.
Any ideas?
You could try decoding the string as UTF-8, and if it fails, assume that it's Latin-1, or whichever 8-bit encoding you expect.
try:
    # If this succeeds, the string is already valid UTF-8
    yourstring.decode('utf-8')
except UnicodeDecodeError:
    # Otherwise assume Latin-1 (or whichever 8-bit encoding you expect) and re-encode
    yourstring = yourstring.decode('latin-1').encode('utf-8')

print(urllib.quote(yourstring))
... provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83
(disclosure: the linked site is mine).
Demo: https://ideone.com/fjX15c
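If it helps, here is a self-contained Python 2.7 sketch of the same approach wrapped in a helper; safe_quote is a hypothetical name, not part of urllib:
# -*- coding: utf-8 -*-
import urllib

def safe_quote(yourstring):
    # Accept bytes in either UTF-8 or Latin-1 and quote them as UTF-8
    try:
        yourstring.decode('utf-8')
    except UnicodeDecodeError:
        yourstring = yourstring.decode('latin-1').encode('utf-8')
    return urllib.quote(yourstring)

print(safe_quote('code - Brasil\xc3\xa2ndia'))  # UTF-8 bytes for â
print(safe_quote('code - Brasil\xe2ndia'))      # Latin-1 byte for â
# Both print: code%20-%20Brasil%C3%A2ndia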
I have created a crawlspider with Scrapy. I need to get a specific part of the page with an XPath:
item = ExplorerItem()
item['article'] = response.xpath("//div[@class='post-content']").extract()
Then I am using this item in pipelines.py.
But item['article'] gives me a result in unicode:
u'<div class="post-content">\n\t\t\t\t\t<h2>D\xe9signation</h2>\n<p>
I need to convert it in UTF-8.
What you are seeing are Unicode characters: \xe9 and \xe7 are escape sequences for é and ç. Those characters are fine; your console probably just isn't set up to render them. Web pages and source data don't always tell the truth about their encoding, and data is often a jumble of encodings. If you need an ASCII representation, you may have some luck with the Unidecode module, which I have used before with success; it will do its best to represent each character in ASCII.
from unidecode import unidecode

unidecode(u"\u5317\u4EB0")  # the u before the string marks a unicode literal
# returns the ASCII approximation 'Bei Jing '
Set FEED_EXPORT_ENCODING = 'utf-8' in settings.py.
See docs here https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_ENCODING
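A minimal sketch of the relevant line, assuming a standard Scrapy project layout:
# settings.py of your Scrapy project
FEED_EXPORT_ENCODING = 'utf-8'  # write feeds as UTF-8 instead of \uXXXX escapes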
I do not understand why, when I make an HTTP request using the Requests library and then display the response with .text, special characters (such as accents) show up encoded (é = &eacute;, for example).
Yet when I try r.encoding, I get utf-8.
In addition, the problem occurs only on some websites. Sometimes I have the correct characters, but other times, not at all.
Try as follows:
import requests

r = requests.get("https://gks.gs/login")
print r.text
There are encoded characters displayed; we can see Mot de passe oubli&eacute; ?.
I do not understand why. Do you think it may be because of https? How to fix this please?
These are HTML character entity references, the easiest way to decode them is:
In Python 2.x:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('oubli&eacute;')
u'oubli\xe9'
In Python 3.x:
>>> import html.parser
>>> html.parser.HTMLParser().unescape('oubli&eacute;')
'oublié'
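Note that HTMLParser.unescape was deprecated in Python 3.4 and removed in 3.9; in current Python 3 the html module does this directly:
>>> import html
>>> html.unescape('Mot de passe oubli&eacute; ?')
'Mot de passe oublié ?'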
These are HTML escape codes, defined in the HTML Coded Character Set. Even though a certain document may be encoded in UTF-8, HTML (and its grandparent, SGML) were defined back in the good old days of ASCII. A system accessing an HTML page on the WWW may or may not natively support extended characters, and the developers needed a way to define "advanced" characters for some users, while failing gracefully for other users whose systems could not support them. Since UTF-8 standardization was only a gleam in its founders' eyes at that point, an encoding system was developed to describe characters that weren't part of ASCII. It was up to the browser developers to implement a way of displaying those extended characters, either through glyphs or through extended fonts.
Encoding special characters using &something; is "legal" in any HTML and, despite looking a bit strange, they are to be considered valid.
The text is meant to be rendered by an HTML browser, which will produce the correct result regardless of whether a character is encoded using this construct or written directly.
For instructions on how to convert these encoded characters, see HTML Entity Codes to Text.
Those are HTML escape codes, often referred to as HTML entities. As you see, HTML uses its own code to replace reserved symbols.
You can use the HTMLParser library:
import HTMLParser

parser = HTMLParser.HTMLParser()  # note the parentheses: instantiate the class
parsed = parser.unescape(r.text)
So I am working on a web-browser-type application for my client, and I just implemented bookmarking functionality, but it doesn't work as expected. When the user clicks "Bookmark page", a little form pops up, which takes the title of the webpage and puts it in a line edit. The thing is, if the website has foreign or unusual symbols in its title, Python throws an error about not being able to encode the string. How can I get Python to handle all possible strings, whether they contain hieroglyphs or other unusual symbols?
Library used for GUI and embedded browser: PyQt
If you're using QWebView.title to get the title of the current web-page, then it will either return a QString or a python unicode string. Which one you get depends on the PyQt API version in use. For version 1 (which is the default for Python2), it will be a QString; for version 2 (which is the default for Python3), it will be a python unicode string. Whichever it is, in order to display it correctly in the line-edit, just set it directly:
lineEdit.setText(webview.title())
Since you appear to be using Python2, I'll assume that webview.title() is returning a QString. If you want to convert this to a python unicode string (e.g. in order to use it with sqlite), then you can do the following:
title = unicode(webview.title())
Note that you should not pass an encoding (such as "utf-8") as the second argument to unicode, as this is used for decoding byte strings to unicode strings.
If you do need to get a "utf-8" encoded byte string from a QString, then you can either do:
data = unicode(webview.title()).encode('utf-8')
or:
data = webview.title().toUtf8().data()
What are you using to parse the websites? I would recommend Beautiful Soup. It will try to determine the encoding of the web page and give you back unicode. See Beautiful Soup's Parsing HTML section. Edit: Also take a look at the "Beautiful Soup Gives You Unicode, Dammit" section.
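A minimal sketch of the idea, using the modern bs4 package rather than the BeautifulSoup 3 that the linked docs describe (page_bytes stands in for whatever raw HTML you fetched):
from bs4 import BeautifulSoup

# BeautifulSoup sniffs the encoding (via its UnicodeDammit helper)
# and hands back unicode regardless of the page's declared charset.
soup = BeautifulSoup(page_bytes, 'html.parser')
title = soup.title.string  # a unicode string, hieroglyphs and all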
Using Python, I need to transfer non-UTF-8-encoded data (specifically Shift-JIS) to a URL via the query string.
How should I transfer the data? Quote it? Encode in utf-8?
Thanks
Query string parameters are byte-based. Whilst IRI-to-URI encoding and typed non-ASCII characters will typically use UTF-8, there is nothing forcing you to send or receive your own parameters in that encoding.
So for Shift-JIS (actually typically cp932, the Windows extension of that encoding):
import urllib

foo = u'\u65E5\u672C\u8A9E'  # 日本語
url = 'http://www.example.jp/something?foo=' + urllib.quote(foo.encode('cp932'))
In Python 3 you do it in the quote function itself:
import urllib.parse

foo = '\u65E5\u672C\u8A9E'
url = 'http://www.example.jp/something?foo=' + urllib.parse.quote(foo, encoding='cp932')
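For reference, a sketch of the resulting query value and how the receiving side can reverse it (the byte values follow from the cp932 encoding of 日本語):
from urllib.parse import quote, unquote

quote('\u65E5\u672C\u8A9E', encoding='cp932')    # '%93%FA%96%7B%8C%EA'
unquote('%93%FA%96%7B%8C%EA', encoding='cp932')  # '日本語'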
I don't know what unicode has to do with this, since the query string is a string of bytes. You can use the quoting functions in urllib to quote plain strings so that they can be passed within query strings.
By the »query string« you mean HTTP GET, as in http://{URL}?data=XYZ?
As an option, you can encode whatever data you have via base64.b64encode, using -_ as the alternative characters so the result is URL-safe. See here.
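A minimal sketch of that option in Python 3 (urlsafe_b64encode is the convenience wrapper that supplies - and _ for you; the URL is illustrative):
import base64

payload = '\u65E5\u672C\u8A9E'.encode('shift_jis')  # 日本語 as Shift-JIS bytes
token = base64.urlsafe_b64encode(payload).decode('ascii')
url = 'http://www.example.jp/something?data=' + token

# The receiving side reverses it:
original = base64.urlsafe_b64decode(token).decode('shift_jis')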