python: about url encode and decode

I have a problem.
I'm trying to use the urllib library in Python,
but I don't understand it.
a = 'http%3A%2F%2Ffile%2Efir%2Enet%2F40d55cecf9a3a47851b1d0ebda3e423993c837d3ca%2F20110909%5F52%5Fblogfile%2Folsscj25%5F1315512137967%5F5tAuGI%5Fzip%2F%255B%25C0%25A9%25B5%25B5%25BF%25ECxp%255D%2B%25C0%25A9%25B5%25B5%25BF%25ECxp%2B%25BD%25C3%25B8%25AE%25BE%25F3%25B3%25D1%25B9%25F6%5F%2Ezip'
aa = unquote(unquote(a))
'http://file.fir.net/40d55cecf9a3a47851b1d0ebda3e423993c837d3ca/20110909_52_blogfile/olsscj25_1315512137967_5tAuGI_zip/[\xc0\xa9\xb5\xb5\xbf\xecxp]+\xc0\xa9\xb5\xb5\xbf\xecxp+\xbd\xc3\xb8\xae\xbe\xf3\xb3\xd1\xb9\xf6_.zip'
a1 = quote(quote(aa))
'http%253A//file.fir.net/40d55cecf9a3a47851b1d0ebda3e423993c837d3ca/20110909_52_blogfile/olsscj25_1315512137967_5tAuGI_zip/%255B%25C0%25A9%25B5%25B5%25BF%25ECxp%255D%252B%25C0%25A9%25B5%25B5%25BF%25ECxp%252B%25BD%25C3%25B8%25AE%25BE%25F3%25B3%25D1%25B9%25F6_.zip'
Why are the two values (a and a1) not equal?
Please let me know.
Thanks.

I think you are conflating multiple problems into one.
First of all, the only reason you are asking this question is that you want to unquote the tail portion of the file name, which seems to be quoted twice.
Second, the file name, even if doubly unquoted, results in non-UTF-8 encoded data, and it's not printable.
Third, you don't seem to understand the URL format.
And finally, you don't understand what quote and unquote are actually doing.
urllib.quote() and urllib.unquote() are intended only for the path_info portion of the URL, which is everything after http://file.fir.net/.
urllib.quote() replaces every character in the string parameter that is not "safe" in a URL with its percent-encoding. That is, every character that would cause problems (e.g. :, ~, [SPACE]) is replaced with a %BYTES_IN_HEX sequence.
Since [:] is not safe in the URL's path portion, quote() will replace it with its percent-encoding.
All this means that you should not pass the entire URL straight into quote() unless you actually want to encode a URL into the path_info portion of another URL.
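To illustrate what happens when a whole URL is passed to quote(), here is a small sketch. It assumes Python 3, where quote lives in urllib.parse (the posts here use Python 2's urllib.quote), and the example.com URL is made up.

```python
from urllib.parse import quote

url = 'http://example.com/some file.zip'  # hypothetical URL

# Quoting the whole URL also escapes the ':' after the scheme.
whole = quote(url)

# Quoting a second time escapes the '%' signs produced by the first pass.
twice = quote(whole)

# Quoting only the path leaves the scheme and host untouched.
path_only = 'http://example.com' + quote('/some file.zip')

print(whole)
print(twice)
print(path_only)
```

The '%253A' in the doubly quoted value is the same pattern that shows up at the start of a1 in the question.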
The steps to solve your problem are something like this:
Fix the file name encoding to use something printable to help you debug.
urllib.unquote() once to get back a normal URL.
When you get the unquoted URL, pass it to urlparse.urlparse() first to break the components into their appropriate portions.
urllib.unquote() the file name portion.
Now that you have the original file name, you can proceed to do whatever you need to do.
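The steps above could be sketched roughly as follows. This assumes Python 3 (urllib.parse instead of Python 2's urllib and urlparse) and uses a shortened, made-up URL with a printable ASCII file name in place of the one from the question.

```python
import posixpath
from urllib.parse import unquote, urlparse

# Hypothetical input: the whole URL is quoted once, the file name twice.
raw = 'http%3A%2F%2Fexample%2Ecom%2Ffiles%2F%255Bdemo%255D%2Bfile%5F1%2Ezip'

# Unquote once to get back a normal URL.
url = unquote(raw)

# Split the URL into its components.
parts = urlparse(url)

# Unquote only the last path segment, i.e. the file name.
file_name = unquote(posixpath.basename(parts.path))

print(url)
print(file_name)
```

Note that unquote leaves '+' alone; unquote_plus would turn it into a space, which may or may not be what the original file name intended.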
References:
http://docs.python.org/library/urlparse.html
http://docs.python.org/library/urllib.html

The answer is in the documentation on quote method:
... Letters, digits, and the characters '_.-' are never quoted. ...
a and a1 differ because a probably wasn't quoted using quote(), and therefore more characters were quoted than required. a1 is still a valid quoted string, but some characters weren't quoted because they don't have to be.
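A quick way to see this, sketched with Python 3's urllib.parse (an assumption, since the question uses Python 2) and a made-up short string:

```python
from urllib.parse import quote, unquote

# Hypothetical input where some other tool percent-encoded '_' and '.'.
original = 'a%5Fb%2Ezip'

decoded = unquote(original)
requoted = quote(decoded)  # '_' and '.' are never quoted by quote()

print(decoded)
print(requoted)
```

The two quoted forms differ as strings but decode to the same value, so comparing unquote(a) with unquote(a1) is more meaningful than comparing a with a1.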

Related

Python 2.7 convert special characters into utf-8 bytes

I have strings that I need to substitute into a URL for accessing different JSON files. My problem is that some strings have special characters, and I need these as UTF-8 bytes so I can properly find the JSON tables.
An example:
# I have this string
a = 'code - Brasilândia'
#in the JSON url it appears as
'code%20-%20Brasil%C3%A2ndia'
I managed to get the spaces converted right using urllib.quote(), but it does not convert the special characters as I need them.
print(urllib.quote('code - Brasilândia'))
'code%20-%20Brasil%83ndia'
When I substitute this in the URL, I cannot reach the JSON table.
I managed to make this work using u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be a user input, and will need to be constantly changed.
I have tried several methods, but I could not get the result I need.
I'm specifically using python 2.7 for this project, and I cannot change it.
Any ideas?
You could try decoding the string as UTF-8, and if it fails, assume that it's Latin-1, or whichever 8-bit encoding you expect.
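import urllib

try:
    # Only checks that the bytes are valid UTF-8; the result is discarded.
    yourstring.decode('utf-8')
except UnicodeDecodeError:
    # Not valid UTF-8: assume Latin-1 and re-encode as UTF-8.
    yourstring = yourstring.decode('latin-1').encode('utf-8')
print(urllib.quote(yourstring))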
... provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83
(disclosure: the linked site is mine).
Demo: https://ideone.com/fjX15c
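For comparison, the same fallback idea can be sketched in Python 3, where bytes and text are separate types. The helper name quote_as_utf8 is made up, and cp850 as the fallback is only an assumption based on the 0x83 byte seen in the question.

```python
from urllib.parse import quote


def quote_as_utf8(raw, fallback='cp850'):
    """Percent-encode raw bytes as UTF-8, guessing a legacy encoding if needed."""
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the (hypothetical) legacy code page.
        text = raw.decode(fallback)
    return quote(text.encode('utf-8'))


name = 'code - Brasil\u00e2ndia'  # \u00e2 is the letter a with circumflex

legacy = quote(name.encode('cp850'))         # what the question observed
fixed = quote_as_utf8(name.encode('cp850'))  # what the JSON URL expects
print(legacy)
print(fixed)
```

The fallback only helps if the guess about the legacy encoding is right; with a wrong guess the result is still valid percent-encoding, just of the wrong characters.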

request.keys() does not include the passed params when the query string contains #

I am sending a GET request to a Python server. My query string is:
"http://192.168.4.106:3333/xx/xx/xx/xx?excelReport**&detail=&#tt**=475&dee=475&empi=&qwer=&start_date=03/01/2014&end_date=03/13/2014&SearchVar=0&report_format=D"
My query string contains a # character, so when I call request.keys() on my server, it does not show any of the params passed. It works with other special characters.
I have been stuck on this problem for quite a long time.
I am using the Zope framework.
Please suggest a solution.
The # character cannot be used like that in a query string.
You should encode it with %23 and decode it when you parse the string.
The reason behind that can be found at the W3 site.
# marks the end of the 'query' part of a URL and the start of the 'fragment'. If you need to have a '#' inside your query (that is, the GET params that you get with request.keys()), you need to encode it (with the standard urllib.urlencode or with whatever your framework provides).
I'm not sure what the purpose of # in that URL is, though. Is it supposed to be a key #tt** in your request.keys()? Is it in fact the start of the fragment?
Nowadays fragments are often used to have some routing in the client side of a webapp, since if you go from #a to #b inside a webpage, you don't need to reload the page. So if that may be the case then you can't encode the #, since it would lose its meaning. You would need then to extract the parameters you want from the fragment part manually.
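The split between query and fragment can be seen with a small sketch. It assumes Python 3's urllib.parse (Python 2 has the same functions in urlparse and urllib), and the URL and parameter names are shortened, made-up stand-ins for the ones in the question.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical URL with a raw '#' in the middle of the parameters.
raw = 'http://example.com/report?detail=1&#tt=475&dee=475'
parts = urlsplit(raw)
seen = parse_qs(parts.query)  # only what comes before the '#'

# Building the query with urlencode escapes the '#' as %23.
params = [('detail', '1'), ('#tt', '475'), ('dee', '475')]
safe_url = 'http://example.com/report?' + urlencode(params)
safe_seen = parse_qs(urlsplit(safe_url).query)

print(parts.query, '|', parts.fragment)
print(seen)
print(safe_url)
print(safe_seen)
```

In practice a browser does not even send the fragment to the server, so the server-side framework never gets a chance to see anything after the raw '#'.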
You can use urllib.quote to solve your problem generally.
>>> import urllib
>>> urllib.quote('#')
'%23'

how to url-safe encode a string with python? and urllib.quote is wrong

Hello, I was wondering if you know any other way to encode a string to be URL-safe, because urllib.quote is doing it wrong; the output is different than expected:
If I try
urllib.quote('á')
I get
'%C3%A1'
But that's not the correct output; it should be
%E1
as demonstrated by the tool provided on this site.
And this is not me being difficult: the incorrect output of quote is preventing the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
and then I try the JavaScript tool I mentioned, I get these strings respectively:
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how they are almost the same, but the URL provided by quote doesn't work and the other one does.
I tried messing with encode('utf-8') on the string provided to quote, but it does not make a difference.
I tried other Spanish words with accents and the ñ; they are all represented differently.
Is this a Python bug?
Do you know of a module that gets this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
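Both forms can be produced on purpose, which makes the difference easier to see. This is a sketch assuming Python 3, where quote and unquote take an encoding argument (Python 2's urllib.quote works on byte strings instead).

```python
from urllib.parse import quote, unquote

accented = '\u00e1'  # the letter a with acute accent

# UTF-8 is the default: two bytes, so two percent-escapes.
utf8_form = quote(accented)

# Latin-1 is what the older tool produces: a single byte.
latin1_form = quote(accented, encoding='latin-1')

print(utf8_form)
print(latin1_form)
print(unquote(latin1_form, encoding='latin-1') == accented)
```

Which form actually works depends on what the server that resolves the URL expects, so the Latin-1 form is only appropriate when that server is known to use it.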
OK, got it: I have to encode to ISO-8859-1 like this:
word = u'á'
word = word.encode('iso-8859-1')
print word
Python is interpreted in ASCII by default, so even though your file may be encoded differently, your UTF-8 char is interpreted as two ASCII chars.
Try putting a comment like this on the first or second line of your code to match the file encoding; you might also need to use u'á'.
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems someone wrote a pretty large function to convert to ASCII URLs, and that's what I need. But I was hoping there was some encoding tool in the standard library for the job.

Turkish character problem in post data

I have two applications running on different servers with different DBs. I need to post some data from one to the other, so I use the POST method. I concatenate the related info into a string, then POST it...
My data is something like:
26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0
For Turkish characters, I try to use
var1 = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
var1.encode('iso8859_9')
but when I receive this data in the second application and decode it, I find that the Turkish characters are not decoded correctly, so my result is:
26AU223/AHMET DEM�O�U/18439586958/0//2011-07-31/2008-06-11/42.00/0
So İ and Ğ cause the problem, and the letters R and L that follow them are mis-decoded too.
I tried different encoding parameters for Turkish, and also tried to POST the data without encode/decode (both applications use UTF-8), but I get a similar encoding error, with a strange � instead of İR and ĞL.
With Python 2.x, this is obviously wrong:
var1 = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
var1.encode('iso8859_9')
Python 2 has a bad design flaw in that it allows you to .encode() byte strings (str type). You must have a Unicode string, and then encode that before POSTing it. And using encodings other than UTF-8 is not reasonable.
var1 = u'26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
buf = var1.encode('utf-8')
# ...send buf over the network...
assert buf.decode('utf-8') == var1
And if you're constructing the POST data yourself, don't forget to do URL escaping.
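As a rough sketch of the URL-escaping step, assuming Python 3's urllib.parse and a made-up form field name data (the question does not say how the POST body is structured):

```python
from urllib.parse import parse_qs, urlencode

record = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'

# Sender: urlencode percent-escapes the UTF-8 bytes, so the body is pure ASCII.
body = urlencode({'data': record}).encode('ascii')

# Receiver: parse_qs undoes the escaping and decodes UTF-8 by default.
received = parse_qs(body.decode('ascii'))['data'][0]

print(body)
print(received == record)
```

Because the body is plain ASCII after escaping, it survives transport regardless of which 8-bit character set either server assumes.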
I solved the problem in the easiest possible way (:
Before quoting my text, I cast it to a string:
quote(str(var1))
And on the other side, I unquote it in a similar way:
unquote(str(var1))
That solved the problem.
Are you getting a Unicode string object on the remote side? In that case, your problem is that the code responsible for reading the HTTP message body assumes a wrong character set. Set the HTTP request Content-Type header to 'text/plain;charset=ISO-8859-9'.

Sending a List through an URL

I have a list that I need to send through a URL to a third party vendor. I don't know what language they are using.
The list prints out like this:
[u'1', u'6', u'5']
I know that the u encodes the string in utf-8 right? So a couple of questions.
Can I send a list through a URL?
Will the u's show up on the other end when going through the URL?
If so, how do I remove them?
I am not sure what keywords to search to help me out, so any resources would be helpful too.
Can I send a list through a URL?
No. A URL is just text. If you want a way to package structured information in it, you'll have to agree that with the provider you're talking to.
One standard encoding for structure in URLs, that might or might not be what you need, is the use of multiple parameters with the same name in a query string. This format comes from HTML form submissions:
http://www.example.com/script?par=1&par=6&par=5
might be considered to represent a parameter par with a three-item list as its value. Or maybe not, it's up to the receiver to decide. For example in a PHP application you would have had to name the parameter par[] to get it to accept the array value.
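The repeated-parameter convention can be sketched like this, assuming Python 3's urllib.parse; the parameter name par and the example.com URL come from the example above, and whether the vendor accepts this format is something only they can confirm.

```python
from urllib.parse import parse_qs, urlencode

values = ['1', '6', '5']

# doseq=True emits one par=... pair per list item.
query = urlencode({'par': values}, doseq=True)
url = 'http://www.example.com/script?' + query

# A Python receiver could read the list back with parse_qs.
roundtrip = parse_qs(query)['par']

print(url)
print(roundtrip)
```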
I know that the u encodes the string in utf-8 right?
No. A u'...' string is a native Unicode string, where each index represents a whole character and not a byte in any particular encoding. If you want UTF-8 bytes, write u'...'.encode('utf-8') before URL-encoding. UTF-8 is a good default choice, but again: what encoding the receiving side wants is up to that application.
Will the u's show up on the other end when going through the URL?
u is part of the literal representation of the string, just the same as the ' quotes themselves. They are not part of the string value and would not be echoed by print or when joined into other strings, unless you deliberately asked for the literal representation by calling repr.
u'' is not UTF-8; it is a Python Unicode string in Python 2.x.
To send it through a URL, you need to encode it as UTF-8 with .encode('utf-8') and then URL-encode it. A list cannot be sent through a URL directly; you need to turn it into a string first.
Basically, you need to do it in the following steps:
python list -> unicode string -> utf8 string -> url encode -> send it through proper urllib api
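That chain might look like the sketch below. It assumes Python 3's urllib.parse, and the comma separator is an arbitrary choice for illustration; the actual format has to be agreed with whoever receives the URL.

```python
from urllib.parse import quote, unquote

items = ['1', '6', '5']

# list -> single string (comma-separated is an assumption, not a standard)
joined = ','.join(items)

# string -> UTF-8 bytes -> percent-encoded text
encoded = quote(joined.encode('utf-8'), safe='')

# What a Python receiver would do to get the list back.
decoded = unquote(encoded).split(',')

print(encoded)
print(decoded)
```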
Incorrect. unicode literals use Python's internal encoding, decided when it was compiled.
You can't send anything "through" URLs. Pick a protocol instead. And encode before sending, probably to UTF-8.
