how to open a URL with non utf-8 arguments - python

Using Python I need to transfer non utf-8 encoded data (specifically shift-jis) to a URL via the query string.
How should I transfer the data? Quote it? Encode in utf-8?
Thanks

Query string parameters are byte-based. Whilst IRI-to-URI and typed non-ASCII characters will typically use UTF-8, there is nothing forcing you to send or receive your own parameters in that encoding.
So for Shift-JIS (actually typically cp932, the Windows extension of that encoding):
foo= u'\u65E5\u672C\u8A9E' # 日本語
url= 'http://www.example.jp/something?foo='+urllib.quote(foo.encode('cp932'))
In Python 3 you do it in the quote function itself:
foo= '\u65E5\u672C\u8A9E'
url= 'http://www.example.jp/something?foo='+urllib.parse.quote(foo, encoding= 'cp932')

I don't know what unicode has to do with this, since the query string is a string of bytes. You can use the quoting functions in urllib to quote plain strings so that they can be passed within query strings.

By the »query string« you mean HTTP GET like in http:/{URL}?data=XYZ?
You have encoding what ever data you have via base64.b64encode using -_ as alternative character to be URL safe as an option. See here.

Related

Python 2.7 convert special characters into utf-8 byes

I have strings that I need to replace into an URL for accessing different JSON files. My problem is that some strings have special characters and I need only these as UTF-8 bytes, so I can properly find the JSON tables.
An example:
# I have this string
a = 'code - Brasilândia'
#in the JSON url it appears as
'code%20-%20Brasil%C3%A2ndia'
I managed to get the spaces converted right using urllib.quote(), but it does not convert the special characters as I need them.
print(urllib.quote('code - Brasilândia))
'code%20-%20Brasil%83ndia'
When I substitute this in the URL, I cannot reach the JSON table.
I managed to make this work using u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be a user input, and will need to be constantly changed.
I have tried several methods, but I could not get the result I need.
I'm specifically using python 2.7 for this project, and I cannot change it.
Any ideas?
You could try decoding the string as UTF-8, and if it fails, assume that it's Latin-1, or whichever 8-bit encoding you expect.
try:
yourstring.decode('utf-8')
except UnicodeDecodeError:
yourstring = yourstring.decode('latin-1').encode('utf-8')
print(urllib.quote(yourstring))
... provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83
(disclosure: the linked site is mine).
Demo: https://ideone.com/fjX15c

Display Japanese characters in Visual Studio Code using Python

According to this older answer, Python 3 strings are UTF-8 compliant by default. But in my web scraper using BeautifulSoup, when I try to print or display a URL, the Japanese characters show up as '%E3%81%82' or '%E3%81%B3' instead of the actual characters.
This Japanese website is the one I'm collecting information from, more specifically the URLs that correspond with the links in the clickable letter buttons. When you hover over for example あa, your browser will show you that the link you're about to click on is https://kokugo.jitenon.jp/cat/gojuon.php?word=あ. However, extracting the ["href"] property of the link using BeautifulSoup, I get https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82.
Both versions link to the same web page, but for the sake of debugging, I'm wondering if it's possible to make sure the displayed string contains the actual Japanese character. If not, how can I convert the string to accommodate this purpose?
It's called Percent-encoding:
Percent-encoding, also known as URL encoding, is a method to encode arbitrary data in a Uniform Resource Identifier (URI)
using only the limited US-ASCII characters legal within a URI.
Apply the unquote method from urllib.parse module:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
Replace %xx escapes by their single-character equivalent. The
optional encoding and errors parameters specify how to decode
percent-encoded sequences into Unicode characters, as accepted by the
bytes.decode() method.
string must be a str. Changed in version 3.9: string parameter
supports bytes and str objects (previously only str).
encoding defaults to 'utf-8'. errors defaults to 'replace',
meaning invalid sequences are replaced by a placeholder character.
Example:
from urllib.parse import unquote
encodedUrl = 'JapaneseChars%E3%81%82or%E3%81%B3'
decodedUrl = unquote( encodedUrl )
print( decodedUrl )
JapaneseCharsあorび
One can apply the unquote method to almost any string, even if already decoded:
print( unquote(decodedUrl) )
JapaneseCharsあorび

lxml.html parsing and utf-8 with requests

i used requests to retrieve a url which contains some unicode characters, and want to do some processing with it , then write it out.
r=requests.get(url)
f=open('unicode_test_1.html','w');f.write(r.content);f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
in unicode_test_1.html, all chars looks fine, but in unicode_test_2.html, some chars changed to gibberish, why is that ?
i then tried
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html,encoding='latin1')
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
it seems it's working now. but i don't know why is this happening, always use latin1 ?
what's the difference between r.text and r.content, and why can't i write html out using encoding='utf-8' ?
You've not specified if you're using python 2 or 3. Encoding is handled quite differently depending on which version you're using. The following advice is more or less universal anyway.
The difference between r.text and r.content is in the Requests docs. Simply put Requests will attempt to figure out the character encoding for you and return Unicode after decoding it. This which is accessible via r.text. To get just the bytes use r.content.
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always use one encoding over another. Make a Unicode sandwich by doing any I/O in bytes and work with Unicode in your application. If you start with bytes (isinstance(mytext, str)) you need to know the encoding to decode to Unicode, if you start with Unicode (isinstance(mytext, unicode)) you should encode to UTF-8 as it will handle all the worlds characters.
Make sure your editor, files, server and database are configured to UTF-8 also otherwise you'll get more 'gibberish'.
If you want further help post the source files and output of your script.

how to url-safe encode a string with python? and urllib.quote is wrong

Hello i was wondering if you know any other way to encode a string to a url-safe, because urllib.quote is doing it wrong, the output is different than expected:
If i try
urllib.quote('á')
i get
'%C3%A1'
But thats not the correct output, it should be
%E1
As demostrated by the tool provided here this site
And this is not me being difficult, the incorrect output of quote is preventing the browser to found resources, if i try
urllib.quote('\images\á\some file.jpg')
And then i try with the javascript tool i mentioned i get this strings respectively
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how is almost the same but the url provided by quote doesn't work and the other one it does.
I tried messing with encode('utf-8) on the string provided to quote but it does not make a difference.
I tried with other spanish words with accents and the ñ they all are differently represented.
Is this a python bug?
Do you know some module that get this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
Ok, got it, i have to encode to iso-8859-1 like this
word = u'á'
word = word.encode('iso-8859-1')
print word
Python is interpreted in ASCII by default, so even though your file may be encoded differently, your UTF-8 char is interpereted as two ASCII chars.
Try putting a comment as the first of second line of your code like this to match the file encoding, and you might need to use u'á' also.
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems some guy wrote a pretty large function to convert to ascii urls, thats what i need. But i was hoping there was some encoding tool in the std lib for the job.

Sending a List through an URL

I have a list that I need to send through a URL to a third party vendor. I don't know what language they are using.
The list prints out like this:
[u'1', u'6', u'5']
I know that the u encodes the string in utf-8 right? So a couple of questions.
Can I send a list through a URL?
Will the u's show up on the other end when going through the URL?
If so, how do I remove them?
I am not sure what keywords to search to help me out, so any resources would be helpful too.
Can I send a list through a URL?
No. A URL is just text. If you want a way to package structured information in it, you'll have to agree that with the provider you're talking to.
One standard encoding for structure in URLs, that might or might not be what you need, is the use of multiple parameters with the same name in a query string. This format comes from HTML form submissions:
http://www.example.com/script?par=1&par=6&par=5
might be considered to represent a parameter par with a three-item list as its value. Or maybe not, it's up to the receiver to decide. For example in a PHP application you would have had to name the parameter par[] to get it to accept the array value.
I know that the u encodes the string in utf-8 right?
No. a u'...' string is a native Unicode string, where each index represents a whole character and not a byte in any particular encoding. If you want UTF-8 bytes, write u'...'.encode('utf-8') before URL-encoding. UTF-8 is a good default choice, but again: what encoding the receiving side wants is up to that application.
Will the u's show up on the other end when going through the URL?
u is part of the literal representation of the string, just the same as the ' quotes themselves. They are not part of the string value and would not be echoed by print or when joined into other strings, unless you deliberately asked for the literal representation by calling repr.
u'' is not utf-8, its python unicode strings for python 2.x
To send it through url, you need to encode them with utf8 like .encode('utf-8'), and also need to urlencode, and list cannot send it through URL, you need to make it as string.
Basically, you need to do it in following steps
python list -> unicode string -> utf8 string -> url encode -> send it through proper urllib api
Incorrect. unicode literals use Python's internal encoding, decided when it was compiled.
You can't send anything "through" URLs. Pick a protocol instead. And encode before sending, probably to UTF-8.

Categories