Decoding Unicode from JavaScript in Python & Django

On a website I have the word pluș sent via POST to a Django view.
It is sent as plu%25C8%2599, so I took that string and tried to figure out how to turn %25C8%2599 back into ș.
I tried decoding the string like this:
from urllib import unquote_plus
s = "plu%25C8%2599"
print unquote_plus(unquote_plus(s).decode('utf-8'))
The result I get is pluÈ, which actually has a length of 5, not 4.
How can I get the original string pluș back after it's encoded?
edit:
I managed to do it like this
def js_unquote(quoted):
    quoted = quoted.encode('utf-8')
    quoted = unquote_plus(unquote_plus(quoted)).decode('utf-8')
    return quoted
It looks weird but works the way I needed it.

URL-decode twice, then decode as UTF-8.
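In Python 3 that recipe is a one-liner, since urllib.parse.unquote decodes percent-escapes as UTF-8 by default. A sketch of the same double decode:

```python
from urllib.parse import unquote_plus

# The payload was percent-encoded twice: '%25' is itself the escape for '%'.
# The first unquote_plus turns 'plu%25C8%2599' into 'plu%C8%99'; the second
# decodes %C8%99 as the UTF-8 bytes for U+0219 (ș).
s = "plu%25C8%2599"
print(unquote_plus(unquote_plus(s)))  # pluș
```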

You can't unless you know what the encoding is. Unicode itself is not an encoding. You might try BeautifulSoup or UnicodeDammit, which might help you get the result you were hoping for.
http://www.crummy.com/software/BeautifulSoup/
I hope this helps!
Also take a look at:
http://www.joelonsoftware.com/articles/Unicode.html

unquote_plus(s).encode('your_lang_encoding')
I tried it like that: I sent a JSON POST request from an HTML form directly to a Django URI, which included Unicode characters like "şğüöçı+", and it worked. I used the iso_8859-9 encoder in the encode() function.

Related

mimic web URL encode for Chinese character in python

I want to mimic URL encoding for Chinese characters. For my use case, I have a search URL for an e-commerce site:
'https://search.jd.com/Search?keyword={}'.format('ipad')
When I search for a product in English, this works fine. However, I need to support input in Chinese. I tried
'https://search.jd.com/Search?keyword={}'.format('耐克t恤')
and found the following encoding under the network tab:
https://list.tmall.com/search_product.htm?q=%C4%CD%BF%CBt%D0%F4
So basically, I need to encode inputs like '耐克t恤' into '%C4%CD%BF%CBt%D0%F4'. I'm not sure which encoding the website is using. Also, how do I convert Chinese characters to these encodings with Python?
Update: I checked the headers and it seems like the content encoding is gzip?
Try the urllib.parse module, more specifically the urllib.parse.urlencode() function. You can pass the encoding (in this case it appears to be 'gb2312') and a dict containing the query parameters to get a valid URL suffix which you can use directly.
In this case, your code will look something like:
import urllib.parse
keyword = '耐克t恤'
url = 'https://search.jd.com/Search?{url_suffix}'.format(url_suffix=urllib.parse.urlencode({'keyword': keyword}, encoding='gb2312'))
More info about encoding here
More info about urlencode here
The encoding used seems to be GB2312
This could help you:
def encodeGB2312(data):
    hexData = data.encode(encoding='GB2312').hex().upper()
    encoded = '%' + '%'.join(hexData[i:i + 2] for i in range(0, len(hexData), 2))
    return encoded
output = encodeGB2312('耐克t恤')
print(output)
url = f'https://list.tmall.com/search_product.htm?q={output}'
print(url)
Output:
%C4%CD%BF%CB%74%D0%F4
https://list.tmall.com/search_product.htm?q=%C4%CD%BF%CB%74%D0%F4
The only problem with my code is that it doesn't seem to 100% correspond with the link you are trying to achieve. It converts the t character into its GB2312 encoding, while your link seems to use the unencoded t character, although the URL still seems to work when opened.
Edit:
Vignesh Bayari R's post handles the URL in the correct (intended) way, but in this case my solution works too.
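For what it's worth, the standard library can produce this percent-encoding directly: in Python 3, urllib.parse.quote() accepts an encoding argument, and unlike the function above it leaves safe ASCII characters such as the t unescaped, which matches the URL in the question. A sketch:

```python
from urllib.parse import quote

# Encode the characters as GB2312 bytes, then percent-escape those bytes.
# ASCII letters are "safe" and stay as-is, so 't' is not escaped.
encoded = quote('耐克t恤', encoding='gb2312')
print(encoded)  # %C4%CD%BF%CBt%D0%F4

url = f'https://list.tmall.com/search_product.htm?q={encoded}'
```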

python request library giving wrong value single quotes

Facing some issues calling an API using the requests library. The problem is described as follows.
The code:
r = requests.post(url, data=json.dumps(json_data), headers=headers)
When I inspect r.text, the apostrophe in the string comes back
like this: Bachelor\u2019s Degree. This should actually give me the response as Bachelor's Degree.
I tried json.loads too, but the single-quote problem remains the same.
How do I get the string value correctly?
What you see here ("Bachelor\u2019s Degree") is the string's inner representation, where "\u2019" is the Unicode code point for RIGHT SINGLE QUOTATION MARK. This is perfectly correct; there's nothing wrong here. If you print() this string you'll get what you expect:
>>> s = 'Bachelor\u2019s Degree'
>>> print(s)
Bachelor’s Degree
Learning about unicode and encodings might save you quite some time FWIW.
EDIT:
When I save in db and then on displaying on HTML it will cause issue
right?
Have you tried ?
Your database connector is supposed to encode it to the proper encoding (according to your fields, tables and client encoding settings).
wrt/ "displaying it on HTML", it mostly depends on whether you're using Python 2.7.x or Python 3.x AND on how you build your HTML, but if you're using some decent framework with a decent template engine (if not you should reconsider your stack) chances are it will work out of the box.
As I already mentioned, learning about Unicode and encodings will save you a lot of time.
It's just the escape sequence for a Unicode code point; it is not "wrong".
string = 'Bachelor\u2019s Degree'
print(string)
Bachelor’s Degree
You can encode and decode it again, but I can't see any reason why you would want to do that (this might not work in Python 2):
string = 'Bachelor\u2019s Degree'.encode().decode('utf-8')
print(string)
Bachelor’s Degree
From requests docs:
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text
On the response object, you may use .content instead of .text to get the raw response bytes, which you can then decode yourself.
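To see why the explicit decode matters, here is a sketch with a bytes literal standing in for r.content: if the guessed encoding is wrong (for example, falling back to Windows-1252), the same bytes come out as mojibake.

```python
payload = 'Bachelor\u2019s Degree'.encode('utf-8')  # stand-in for r.content

print(payload.decode('utf-8'))   # Bachelor’s Degree
print(payload.decode('cp1252'))  # Bachelorâ€™s Degree (wrong guess)
```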

lxml.html parsing and utf-8 with requests

I used requests to retrieve a URL which contains some Unicode characters, and I want to do some processing on it, then write it out.
r=requests.get(url)
f=open('unicode_test_1.html','w');f.write(r.content);f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
In unicode_test_1.html, all chars look fine, but in unicode_test_2.html, some chars changed to gibberish. Why is that?
I then tried
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html,encoding='latin1')
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
It seems to be working now, but I don't know why this is happening. Should I always use latin1?
What's the difference between r.text and r.content, and why can't I write the HTML out using encoding='utf-8'?
You've not specified if you're using python 2 or 3. Encoding is handled quite differently depending on which version you're using. The following advice is more or less universal anyway.
The difference between r.text and r.content is in the Requests docs. Simply put, Requests will attempt to figure out the character encoding for you and return Unicode after decoding it. This is accessible via r.text. To get just the bytes, use r.content.
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always using one encoding over another. Make a Unicode sandwich: do any I/O in bytes and work with Unicode inside your application. If you start with bytes (in Python 2, isinstance(mytext, str)) you need to know the encoding to decode to Unicode; if you start with Unicode (isinstance(mytext, unicode)) you should encode to UTF-8, as it handles all the world's characters.
Make sure your editor, files, server and database are configured to UTF-8 also otherwise you'll get more 'gibberish'.
If you want further help post the source files and output of your script.
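In Python 3 terms (str is Unicode, bytes is the encoded form), the sandwich described above can be sketched like this; the shout function and file names are just illustrative:

```python
def shout(src_path, dst_path):
    # Bottom slice: read bytes and decode once, at the boundary.
    with open(src_path, 'rb') as f:
        text = f.read().decode('utf-8')

    # Filling: work purely with str (Unicode) inside the application.
    text = text.upper()

    # Top slice: encode back to bytes once, on the way out.
    with open(dst_path, 'wb') as f:
        f.write(text.encode('utf-8'))
```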

how to url-safe encode a string with python? and urllib.quote is wrong

Hello, I was wondering if you know any other way to encode a string to be URL-safe, because urllib.quote is doing it wrong; the output is different than expected:
If I try
urllib.quote('á')
I get
'%C3%A1'
But that's not the correct output; it should be
%E1
As demonstrated by the tool provided on this site.
And this is not me being difficult; the incorrect output of quote is preventing the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
And then when I try with the JavaScript tool I mentioned, I get these strings respectively:
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how it's almost the same, but the URL provided by quote doesn't work while the other one does.
I tried messing with encode('utf-8') on the string provided to quote but it does not make a difference.
I tried other Spanish words with accents and the ñ; they are all represented differently.
Is this a Python bug?
Do you know some module that gets this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
OK, got it, I have to encode to iso-8859-1 like this:
word = u'á'
word = word.encode('iso-8859-1')
print word
Python source is interpreted as ASCII by default, so even though your file may be encoded differently, your UTF-8 char is interpreted as two ASCII chars.
Try putting a comment like this as the first or second line of your code to match the file encoding, and you might need to use u'á' also:
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
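In Python 3 you can also get that legacy form from the standard library: urllib.parse.quote() takes an encoding argument, so you can percent-encode the Latin-1 byte instead of the default UTF-8 pair. A sketch (keeping in mind that, as noted above, %C3%A1 is the RFC 3986 form):

```python
from urllib.parse import quote

# 'á' is 0xE1 in Latin-1, so it percent-encodes as a single escape.
print(quote('á', encoding='latin-1'))  # %E1

# The default encodes the UTF-8 bytes 0xC3 0xA1 instead.
print(quote('á'))                      # %C3%A1
```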
In this question it seems someone wrote a pretty large function to convert to ASCII URLs; that's what I need. But I was hoping there was some encoding tool in the std lib for the job.

Turkish character problem in post data

I have two applications running on different servers with different DBs. I need to post some data from one to the other, so I use the POST method. I concatenate the related info into a string, then POST it...
My data is something like:
26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0
For Turkish characters, I try to use
var1 = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
var1.encode('iso8859_9')
but when I receive this data in the second application and decode it, I realize that the Turkish characters cannot be decoded correctly, so my result is:
26AU223/AHMET DEM�O�U/18439586958/0//2011-07-31/2008-06-11/42.00/0
So İ and Ğ cause problems, and the following first letters R and L are mis-decoded too.
I tried different encoding parameters for Turkish, and also tried to POST the data without encode/decode (both applications use UTF-8), but I get a similar encoding error, with a strange � instead of İR and ĞL.
With Python 2.x, this is obviously wrong:
var1 = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
var1.encode('iso8859_9')
Python 2 has a bad design flaw in that it allows you to .encode() byte strings (str type). You must have a Unicode string, and then encode that before POSTing it. And using encodings other than UTF-8 is not reasonable.
var1 = u'26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
buf = var1.encode('utf-8')
# ...send buf over the network...
assert buf.decode('utf-8') == var1
And if you're constructing the POST data yourself, don't forget to do URL escaping.
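A sketch of that escaping step with the Python 3 standard library: urllib.parse.urlencode() UTF-8-encodes each value and percent-escapes the bytes, so the Turkish characters survive the round trip (the field name 'data' is just an example):

```python
from urllib.parse import urlencode, parse_qs

var1 = u'26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'

# urlencode() encodes the value as UTF-8 and percent-escapes it.
body = urlencode({'data': var1})
# ...send body as the POST payload...

# The receiving side recovers the original text unchanged:
assert parse_qs(body)['data'][0] == var1
```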
I solved the problem in the easiest possible way (:
Before quoting my text, I cast it to a string:
quote(str(var1))
And on the other side, unquote it in a similar way:
unquote(str(var1))
That solved the problem.
Are you getting a Unicode string object on the remote side? In that case, your problem is that the code responsible for reading the HTTP message body assumes a wrong character set. Set the HTTP request Content-Type header to 'text/plain;charset=ISO-8859-9'.
