Python: Replace URLEncoded characters in String with what they represent [duplicate] - python

This question already has answers here:
Decode HTML entities in Python string?
(6 answers)
Closed 7 years ago.
I've been banging my head against the wall with this for a while. I'm trying to parse an RSS feed with Python's BeautifulSoup, and every now and then I get errors like:
I don&#39;t know what I am talking about
I can't seem to find any python library that will replace those characters with what they should be, so the resulting string looks like this:
I don't know what I am talking about
The closest I've gotten was
urllib.unquote(post_content).decode('utf-8')
But that still does not replace the encoded character with a '. Does anyone know a good way to turn those encoded sequences into the ASCII characters they represent? I also get other errors, such as ( and ) appearing as &#40; and &#41;.

Those strings are called HTML entities, not URL encoding, which is why urllib.unquote does not touch them. You can decode them as described in "Decode HTML entities in Python string?": in Python 3.4+ use the function unescape from the html module (html.unescape).
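As a minimal sketch of that suggestion, assuming Python 3.4 or later: the helper name decode_entities and the sample string are made up for illustration, and only html.unescape comes from the standard library.

```python
import html


def decode_entities(text: str) -> str:
    """Replace HTML character references with the characters they stand for."""
    return html.unescape(text)


# Hypothetical sample input containing numeric character references.
raw = "I don&#39;t know &#40;yet&#41; what I am talking about"
print(decode_entities(raw))
```

Text that BeautifulSoup has already parsed is normally decoded for you, so a step like this is mostly needed when the feed was escaped twice.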

Related

Emoji to unicode [duplicate]

This question already has answers here:
Escaped Unicode to Emoji in Python
(1 answer)
How to encode Python 3 string using \u escape code?
(1 answer)
Closed 1 year ago.
I was looking at https://r12a.github.io/app-conversion/ and I see that they have a "JS/Java/C" section. I was wondering if anyone had the code for that in python. I can't seem to find it. Thanks!
Edit: code
b = '😀'
txt = b.encode('utf-8')
From How to work with surrogate pairs in Python? (linked from duplicate Escaped Unicode to Emoji in Python )
If you see a Python string such as '\ud83d\ude4f' (2 characters), then there is a bug upstream. Normally, you shouldn't get such a string. If you get one and you can't fix the upstream code that generates it, you could repair it with the surrogatepass error handler:
>>> "\uD83D\uDE00".encode('utf-16', 'surrogatepass').decode('utf-16')
'😀'
Original Answer
Perhaps you're looking for ord()?
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
>>> hex(ord("😀"))
'0x1f600'
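If the goal is the "JS/Java/C" style output from that page, one possible sketch is to go through UTF-16 and print each 16-bit code unit as a \uXXXX escape. The function name to_utf16_escapes is hypothetical, and this version escapes every character, including plain ASCII, which may be more than the linked tool does.

```python
def to_utf16_escapes(text: str) -> str:
    """Return text as \\uXXXX escapes, one per UTF-16 code unit."""
    # utf-16-be gives two bytes per code unit and no byte order mark;
    # characters above U+FFFF come out as a surrogate pair.
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(f"\\u{unit:04X}" for unit in units)


print(to_utf16_escapes("😀"))  # \uD83D\uDE00
```

This is the reverse direction of the surrogatepass example above: it produces the pair instead of repairing it.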

extract some arabic/persian (unicode) words with regex using python [duplicate]

This question already has answers here:
Python and regular expression with Unicode
(2 answers)
Closed 2 years ago.
I need to extract some specific names in Arabic/Persian (something like proper nouns in English), using python re library.
example (the word "شرکت" means "company" and we want to extract what the company name is):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer] and it would be fine to replace "university" with "شرکت" in that example but I don't understand how to find the keywords by regex with Arabic Unicode when it's not possible to use that in this way:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None
Python 2 does not default to parsing unicode literals (like when pasting unicode letters, or having a \u in the code). You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic text is stored as its encoded bytes, which are different from the Unicode code points, and the string on the right keeps its literal backslashes, because Python 2 only recognizes \u as an escape inside unicode literals.
Another option is to import from the future. In Python 3 every string literal is parsed as unicode, which makes the u"..." prefix unnecessary there:
from __future__ import unicode_literals
will make unicode literals be parsed correctly with no u"".
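For the extraction itself, here is a rough sketch under Python 3, where no u"" prefix is needed. The pattern and the function name extract_company_name are assumptions made for illustration: it simply captures whatever follows the keyword and whitespace, which may be too greedy for real text.

```python
import re

# The keyword "company", written with escapes so the code points are explicit.
KEYWORD = "\u0634\u0631\u06A9\u062A"

# Capture everything after the keyword and at least one whitespace character.
COMPANY_PATTERN = re.compile(re.escape(KEYWORD) + r"\s+(.+)")


def extract_company_name(text):
    """Return the text following the keyword, or None if it is absent."""
    match = COMPANY_PATTERN.search(text)
    return match.group(1).strip() if match else None


print(extract_company_name(KEYWORD + " تست گستران خلیج فارس"))
```

In Python 2 the same code would need unicode literals (or the __future__ import) for both the pattern and the input.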

Python3 saving JSON with unicode single quote [duplicate]

This question already has answers here:
Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence
(12 answers)
Closed 1 year ago.
I know this has been asked before on Stackoverflow and on other sites but I cannot seem to be able to save a JSON file using escaped Unicode characters (Python3). I have read a lot of tutorials.
What am I missing? I have tried a lot of things but nothing works. I have also tried encoding/decoding in UTF-8 but I am obviously missing something.
Just to be clear, I have managed to get it working for other characters like й (0439), but I am having trouble with a single quote being encoded.
If I have the following dict:
import json
data = {"key": "Test \u0027TEXT\u0027 around"}
I want to save it exactly as it is in a new JSON file, but no matter what I do it always ends up as a single character, which is what is encoded in Unicode.
The following 2 blocks print the exact same thing: {"key": "Test 'TEXT' around"}.
print(json.dumps(data))
print(json.dumps(data, ensure_ascii=False))
Is there any way to keep the Unicode string literal? I want to have that very string as a value: "Test \u0027TEXT\u0027 around"
The behavior you are describing has nothing to do with JSON. This is simply how Python 3 handles strings. Open the shell and write:
>>> "Test \u0027TEXT\u0027 around"
"Test 'TEXT' around"
If you do not want Python to interpret the special characters, you should use raw strings (or maybe even byte sequences):
>>> r"Test \u0027TEXT\u0027 around"
'Test \\u0027TEXT\\u0027 around'
Reference:
https://docs.python.org/2.0/ref/strings.html
https://docs.python.org/3/library/stdtypes.html#binaryseq
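To make the difference concrete, the sketch below compares the two literals once they go through json.dumps. The helper dumps_escaping_apostrophes is hypothetical and only one possible workaround: it post-processes the JSON text so that each apostrophe is written as \u0027, relying on the fact that json.dumps emits apostrophes only inside strings.

```python
import json

interpreted = {"key": "Test \u0027TEXT\u0027 around"}  # escape becomes '
raw = {"key": r"Test \u0027TEXT\u0027 around"}         # backslash is kept

print(json.dumps(interpreted))  # {"key": "Test 'TEXT' around"}
print(json.dumps(raw))          # {"key": "Test \\u0027TEXT\\u0027 around"}


def dumps_escaping_apostrophes(obj):
    """Serialize obj to JSON, writing every apostrophe as \\u0027."""
    return json.dumps(obj).replace("'", "\\u0027")


text = dumps_escaping_apostrophes(interpreted)
print(text)  # {"key": "Test \u0027TEXT\u0027 around"}
```

The raw-string version stores a doubled backslash in the file, so it reads back as different text, while the post-processed version still loads back to the original value with json.loads.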

Why does base64.b64encode return a value of b'somestring' instead of simply 'somestring'? [duplicate]

This question already has answers here:
What is the difference between a string and a byte string?
(9 answers)
Closed 4 years ago.
Consider the following code...
base64EncodedCredentials = base64.b64encode(b"johndoe:mysecret")
print(base64EncodedCredentials)
the response I get back is
b'am9obmRvZTpteXNlY3JldA=='
Why does it have a b before the string? How can I get just the string value of 'am9obmRvZTpteXNlY3JldA==' instead?
Technically speaking - this question can be considered a duplicate of another question IF you know that the problem is about byte string vs. string. In my case, I asked the question because I did not know that there was something called byte string. For new Python programmers, this question may be beneficial because it uses language they see on their program or debugger. If they don't know what a byte string is, perhaps this question can be useful and provide the translation from their problem to the technical terms used by more fluent Python programmers. The question differs in use of vocabulary.
Refer to this post to get rid of the encoding.
print(base64EncodedCredentials.decode("utf-8"))
It is returning a bytes object. Base64 is usually used to make data 7-bit safe, so it is often applied to byte-oriented (rather than character-oriented) data, for example data sent over a socket.
You can decode it to a string, just like any other bytes object:
base64EncodedCredentials.decode('ascii')
bytes and str objects are converted into each other with encode() and decode(). It is safe to use the ascii codec here, since base64 output is guaranteed to contain only 7-bit ASCII characters.
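Putting the round trip together, a small sketch in Python 3: the function name basic_auth_token is made up for illustration, and the credentials are the ones from the question.

```python
import base64


def basic_auth_token(username: str, password: str) -> str:
    """Return the base64 form of 'username:password' as a str, not bytes."""
    credentials = f"{username}:{password}".encode("utf-8")  # str -> bytes
    encoded = base64.b64encode(credentials)                 # bytes -> bytes
    return encoded.decode("ascii")                          # bytes -> str


print(basic_auth_token("johndoe", "mysecret"))  # am9obmRvZTpteXNlY3JldA==
```

The b prefix in the question's output is just how Python prints a bytes object; after decode() the value is an ordinary str.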

how to normalize or decode an URL in python? [duplicate]

This question already has answers here:
Decode escaped characters in URL
(5 answers)
Url decode UTF-8 in Python
(5 answers)
Closed 9 years ago.
I have a link like below
http%253A%252F.....25252520.doc
How do I convert this to a normal link in Python? The link has a lot of encoded characters.
Apply urllib.unquote twice (this is the Python 2 name; in Python 3 the function is urllib.parse.unquote):
>>> import urllib
>>> strs = urllib.unquote("http%253A%252F.....25252520.doc")
>>> urllib.unquote(strs)
'http:/.....25252520.doc'
Use urllib.unquote():
Replace %xx escapes by their single-character equivalent.
It looks as if you have a double- or even triple-encoded URL: the http:// part has been encoded to http%253A%252F, which decodes to http%3A%2F, which in turn becomes http:/. The URL itself may contain another stage of encoding, but you didn't share enough of the actual URL with us to determine that.
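When the number of encoding layers is unknown, one option is to keep unquoting until the string stops changing. This is only a sketch for Python 3: the function name fully_unquote and the max_rounds limit are assumptions, and the approach will over-decode a URL that legitimately contains an escaped percent sign.

```python
from urllib.parse import unquote


def fully_unquote(url: str, max_rounds: int = 5) -> str:
    """Apply unquote repeatedly until the result is stable."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:  # nothing left to decode
            break
        url = decoded
    return url


print(fully_unquote("http%253A%252F.....25252520.doc"))  # http:/.....25252520.doc
```

The digits 25252520 near the end stay as they are because they are not preceded by a % sign in the truncated link from the question.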
