how to normalize or decode a URL in python? [duplicate] - python

This question already has answers here:
Decode escaped characters in URL
(5 answers)
Url decode UTF-8 in Python
(5 answers)
Closed 9 years ago.
I have a link like below
http%253A%252F.....25252520.doc
How do I convert this to a normal link in Python? The link contains a lot of encoded characters.

Apply urllib.unquote twice (in Python 3 the function is urllib.parse.unquote):
>>> import urllib
>>> strs = urllib.unquote("http%253A%252F.....25252520.doc")
>>> urllib.unquote(strs)
'http:/.....25252520.doc'

Use urllib.unquote():
Replace %xx escapes by their single-character equivalent.
It looks as if you have a double or even triple encoded URL; the http:// part has been encoded to http%253A%252F, which decodes to http%3A%2F, which in turn becomes http:/. The URL itself may contain another stage of encoding, but you didn't share enough of the actual URL with us to determine that.
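When the number of encoding layers is unknown, one option is to keep unquoting until the string stops changing. The following is only a sketch for Python 3: the helper name fully_unquote, the max_rounds cap and the example.com URL are made up for illustration and are not from the question.

```python
from urllib.parse import unquote


def fully_unquote(url, max_rounds=10):
    """Percent-decode repeatedly until the string no longer changes.

    max_rounds is an arbitrary safety cap (an assumption, not a standard value).
    """
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            break  # nothing left to decode
        url = decoded
    return url


# Hypothetical triple-encoded URL, used only to illustrate the idea.
sample = "http%253A%252F%252Fexample.com%252Fa%252520b.doc"
print(fully_unquote(sample))
```

Be aware that this can over-decode a URL that legitimately contains a literal % sequence, so it is probably best applied only when you know the input was encoded more than once.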

Related

How to decode this "%E3%83%9C" string in python? [duplicate]

This question already has answers here:
Url decode UTF-8 in Python
(5 answers)
Closed 6 months ago.
So I have the following string
"%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
It actually means this
ボドカさん
This string seems to be encoded in UTF-8 because when I write this in python
encoded_str = b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
print(encoded_str)
print(encoded_str.decode('utf-8'))
Here is the output I get
b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
ボドカさん
But now I would like a script that will allow me to decode any string in the initial format and here is my code.
import re
import os
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
mystr = mystr.lower()
mystr = re.sub('%', r'\\x', mystr)
encoded_str = bytes(mystr, "utf-8")
print(mystr)
print(encoded_str)
print(encoded_str.decode('utf-8'))
Output:
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
b'\\xe3\\x83\\x9c\\xe3\\x83\\x89\\xe3\\x82\\xab\\xe3\\x81\\x95\\xe3\\x82\\x93'
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
I tried many possibilities, but I couldn't find the right way to encode my string properly, the way the b'STRING' literal does. I always get extra \ characters from the encoding step, which then spoil the decoding step too.
I tried all the encodings that Python accepts for the bytes() function.
I need help, please. Thank you.
Stack overflow banned me for that question lol
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
encoded_str = bytes.fromhex(mystr.replace('%', ''))
print(encoded_str.decode('utf-8'))
Output:
ボドカさん
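Note that bytes.fromhex only works when the string consists solely of %XX escapes. If the input can also contain unescaped characters, urllib.parse.unquote from the standard library is probably the more robust choice. A minimal sketch, assuming Python 3 (the "name=" prefix is a made-up example):

```python
from urllib.parse import unquote

mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"

# unquote() replaces %XX escapes and decodes the resulting bytes as UTF-8 by default.
decoded = unquote(mystr)
print(decoded)

# Characters that are not escaped are passed through unchanged.
mixed = unquote("name=%E3%83%9C")
print(mixed)
```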

Emoji to unicode [duplicate]

This question already has answers here:
Escaped Unicode to Emoji in Python
(1 answer)
How to encode Python 3 string using \u escape code?
(1 answer)
Closed 1 year ago.
I was looking at https://r12a.github.io/app-conversion/ and I see that they have a "JS/Java/C" section. I was wondering if anyone had the code for that in python. I can't seem to find it. Thanks!
Edit: code
b = '😀'
txt = b.encode('utf-8')
From "How to work with surrogate pairs in Python?" (linked from the duplicate "Escaped Unicode to Emoji in Python"):
If you see a Python string such as '\ud83d\ude4f' (2 characters), then there is a bug upstream. Normally, you shouldn't get such a string. If you do get one and you can't fix the upstream code that generates it, you can repair it with the surrogatepass error handler:
>>> "\uD83D\uDE00".encode('utf-16', 'surrogatepass').decode('utf-16')
'😀'
Original Answer
Perhaps you're looking for ord()?
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
>>> hex(ord("😀"))
'0x1f600'
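If what you want is the "JS/Java/C" style output (UTF-16 code units written as \uXXXX), one way to sketch it is to encode to UTF-16 and format each 16-bit unit. The function name to_utf16_escapes is made up for this example, and it escapes every character, including ASCII ones:

```python
def to_utf16_escapes(text):
    """Return text as a sequence of \\uXXXX escapes, one per UTF-16 code unit."""
    data = text.encode("utf-16-be")  # big-endian, no BOM
    units = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return "".join("\\u{:04X}".format(unit) for unit in units)


print(to_utf16_escapes("😀"))  # a surrogate pair: two escapes for one emoji
print(to_utf16_escapes("a"))
```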

How to convert unicode string to bytes Python [duplicate]

This question already has answers here:
getting bytes from unicode string in python
(6 answers)
Closed 2 years ago.
I have a string which I get from a function
>>> example = Some_function()
Some_function returns a very long string that mixes Unicode and ASCII characters, like 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'
My problem is that when I try to convert this Unicode string to bytes, I get an error saying that \ud919 cannot be encoded by utf-8. I tried:
>>> further=bytes(example,encoding='utf-8')
Note: I cannot ignore this \ud919. Is there a way to solve this problem, or to convert 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123' to 'gn1\ud123a\ud123\ud123\ud123\\ud919\ud123\ud123' so that \ud919 is treated as plain text rather than as a Unicode escape?
How to inspect the string depends on the Python version.
Python 2.x: print type(unicode_string), repr(unicode_string)
Python 3.x: print(type(unicode_string), ascii(unicode_string))
\ud919 is a surrogate character, so it cannot be encoded to UTF-8 in the normal way. Use the surrogatepass error handler:
>>> 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'.encode('utf-8', 'surrogatepass')
b'gn1\xed\x84\xa3a\xed\x84\xa3\xed\x84\xa3\xed\x84\xa3\xed\xa4\x99\xed\x84\xa3\xed\x84\xa3'
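Only \ud919 falls in the surrogate range (U+D800 to U+DFFF); \ud123 is an ordinary character and encodes normally. As a rough sketch for Python 3, here are two ways to handle the lone surrogate: surrogatepass keeps it so the bytes can be decoded back, while backslashreplace writes it out as the literal text \ud919, which seems to be what the question asks for. The shortened sample string is made up for the example.

```python
text = 'gn1\ud123a\ud919'  # shortened sample; \ud919 is a lone surrogate

# Option 1: keep the surrogate in the byte stream (not valid strict UTF-8).
raw = text.encode('utf-8', 'surrogatepass')
restored = raw.decode('utf-8', 'surrogatepass')  # the same handler is needed to decode

# Option 2: replace the unencodable surrogate with a literal backslash escape.
escaped = text.encode('utf-8', 'backslashreplace')

print(raw)
print(escaped)
```

Bytes produced with surrogatepass are not valid strict UTF-8, so other tools may reject them; the backslashreplace form is plain UTF-8 but loses the distinction between a real surrogate and the six characters \ud919.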

extract some arabic/persian (unicode) words with regex using python [duplicate]

This question already has answers here:
Python and regular expression with Unicode
(2 answers)
Closed 2 years ago.
I need to extract some specific names in Arabic/Persian (something like proper nouns in English), using python re library.
example (the word "شرکت" means "company" and we want to extract what the company name is):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer] and it would be fine to replace "university" with "شرکت" in that example but I don't understand how to find the keywords by regex with Arabic Unicode when it's not possible to use that in this way:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None
In Python 2, plain string literals are byte strings, not Unicode strings (this matters when you paste non-ASCII letters or use a \u escape in the code). You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic text will be stored as its encoded bytes, which are different from the Unicode code points, and the string on the right will contain literal backslashes, since Python 2 does not recognize \u as an escape in byte-string literals.
Another option is to import from __future__. In Python 3 every string literal is Unicode, which makes the u"..." prefix unnecessary:
from __future__ import unicode_literals
This makes plain string literals Unicode in Python 2, with no u"" prefix needed.
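For the extraction itself, here is a minimal sketch in Python 3, where string literals are already Unicode. The names KEYWORD, COMPANY_RE and extract_company are made up for this example, and the pattern simply assumes the company name is everything after the keyword and some whitespace:

```python
import re

KEYWORD = "\u0634\u0631\u06A9\u062A"  # the word for "company" from the question
COMPANY_RE = re.compile(KEYWORD + r"\s+(.+)")


def extract_company(text):
    """Return whatever follows the keyword, or None if the keyword is absent."""
    match = COMPANY_RE.search(text)
    return match.group(1).strip() if match else None


sample = KEYWORD + " تست گستران خلیج فارس"
print(extract_company(sample))
```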

Python: Replace URLEncoded characters in String with what they represent [duplicate]

This question already has answers here:
Decode HTML entities in Python string?
(6 answers)
Closed 7 years ago.
I've been banging my head against the wall with this for a while. I'm trying to parse an RSS feed with Python's BeautifulSoup, and every now and then I get errors like:
I don&#39;t know what I am talking about
I can't seem to find any python library that will replace those characters with what they should be, so the resulting string looks like this:
I don't know what I am talking about
The closest I've gotten was
urllib.unquote(post_content).decode('utf-8')
But that still does not replace the encoded character with a '. Does anyone know a good way to replace those encoded characters with the ASCII characters they represent? There are also other errors, like ( and ) appearing as &#40; and &#41;.
Those strings are called HTML entities, not URL encoding, which is why urllib.unquote does not touch them. You can decode them as described in: Decode HTML entities in Python string?. It says to use the unescape function from the html module (html.unescape, available since Python 3.4).
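A minimal sketch for Python 3.4 or later. The sample strings assume the feed contains numeric entities such as &#39; for the apostrophe; the exact entities in your feed may differ:

```python
import html

# Assumed sample inputs; html.unescape handles numeric and named entities.
post_content = "I don&#39;t know what I am talking about"
cleaned = html.unescape(post_content)
print(cleaned)

parens = html.unescape("&#40; and &#41; and &amp;")
print(parens)
```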
