This question already has answers here:
Url decode UTF-8 in Python
(5 answers)
Closed 6 months ago.
So I have the following string
"%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
It actually means this
ボドカさん
This string seems to be encoded in UTF-8 because when I write this in python
encoded_str = b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
print(encoded_str)
print(encoded_str.decode('utf-8'))
Here is the output I get
b'\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93'
ボドカさん
But now I would like a script that will allow me to decode any string in the initial format and here is my code.
import re
import os
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
mystr = mystr.lower()
mystr = re.sub('%', r'\\x', mystr)
encoded_str = bytes(mystr, "utf-8")
print(mystr)
print(encoded_str)
print(encoded_str.decode('utf-8'))
Output:
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
b'\\xe3\\x83\\x9c\\xe3\\x83\\x89\\xe3\\x82\\xab\\xe3\\x81\\x95\\xe3\\x82\\x93'
\xe3\x83\x9c\xe3\x83\x89\xe3\x82\xab\xe3\x81\x95\xe3\x82\x93
I tried many possibilities, but I couldn't find the right way to properly encode my string the way the b'STRING' literal does. I always get extra \ characters from the encoding step, which then spoil the decoding step too.
I tried every encoding that Python's bytes() function accepts.
I need help please. Thank you.
Stack overflow banned me for that question lol
mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"
encoded_str = bytes.fromhex(mystr.replace('%', ''))
print(encoded_str.decode('utf-8'))
Output:
ボドカさん
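For what it's worth, the attempt in the question fails because replacing % with \x only builds the text characters backslash, x, and two hex digits; escape sequences like \xe3 are interpreted only in source-code literals, never in strings built at runtime. As a hedged alternative to bytes.fromhex (which raises ValueError as soon as the input contains any unescaped character such as a letter or a slash), here is a minimal Python 3 sketch using urllib.parse.unquote from the standard library. The variable name decoded is only illustrative.

```python
from urllib.parse import unquote

mystr = "%E3%83%9C%E3%83%89%E3%82%AB%E3%81%95%E3%82%93"

# unquote turns each %XX escape into a byte and decodes the result;
# UTF-8 is its default encoding, passed explicitly here for clarity.
decoded = unquote(mystr, encoding="utf-8")
print(decoded)
```

Unlike the fromhex approach, this also copes with strings that mix plain characters and escapes, such as "name%3D%E3%83%9C".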
This question already has answers here:
Print without b' prefix for bytes in Python 3
(8 answers)
Closed 4 years ago.
I am new to Python programming and I am a bit confused. I am trying to get the bytes from a string so that I can hash and encrypt it, but I get
b'...'
with a b character in front of the string, as in the example below. Is there any way to avoid this? Can anyone suggest a solution? Sorry for this silly question.
import hashlib
text = "my secret data"
pw_bytes = text.encode('utf-8')
print('print',pw_bytes)
m = hashlib.md5()
m.update(pw_bytes)
OUTPUT:
print b'my secret data'
This should do the trick:
pw_bytes.decode("utf-8")
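Here you go:
with open('test.txt', 'rb') as f:
    ch = f.read(1)  # read a single byte from the file
ch = str(ch, 'utf-8')  # decode it; this only works if that byte is a complete UTF-8 character
print(ch)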
Decoding is redundant
You only had this "error" in the first place, because of a misunderstanding of what's happening.
You get the b because you encoded to utf-8 and now it's a bytes object.
>>> type("text".encode("utf-8"))
<class 'bytes'>
Fixes:
You can just print the string first
Redundantly decode it after encoding
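A minimal sketch of the first fix, assuming the goal is simply to display the text and then hash it: print the original str, and pass the encoded bytes only to hashlib. The name digest is illustrative and not part of the original post.

```python
import hashlib

text = "my secret data"
pw_bytes = text.encode("utf-8")  # bytes object; its repr is what shows the b'...' prefix

# Print the str itself, so no b prefix appears.
print("print", text)

# hashlib needs bytes, so the encoded value goes here.
m = hashlib.md5()
m.update(pw_bytes)
digest = m.hexdigest()  # hex string form of the hash
print(digest)
```

The b prefix is only part of how a bytes object is displayed; it is not part of the data that gets hashed.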
This question already has answers here:
Chinese and Japanese character support in python
(3 answers)
Closed 7 years ago.
I have used Python to get some info through urllib2, but the info comes back as a string of Unicode escape sequences.
I've tried something like below:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print unicode(a).encode("gb2312")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.encode("utf-8").decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print u""+a
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).encode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.decode("utf-8").encode("gb2312")
but all results are the same:
\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
And I want to get the following Chinese text:
方法,删除存储在
You need to convert the string to a unicode string.
First of all, \u is not an escape sequence in a plain (byte) string literal in Python 2, so the backslashes in a are kept as literal characters:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a # Prints \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
a # Prints '\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
So playing with the encoding / decoding of this escaped string makes no difference.
You can either use a unicode literal or convert the string into a unicode string.
To use a unicode literal, just add a u in front of the string:
a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
To convert an existing string into a unicode string, you can call unicode with unicode_escape as the encoding parameter:
print unicode(a, encoding='unicode_escape') # Prints 方法,删除存储在
I bet you are getting the string from a JSON response, so the second method is likely to be what you need.
BTW, the unicode_escape encoding is a Python-specific encoding which is used to "produce a string that is suitable as Unicode literal in Python source code".
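For readers on Python 3, where the unicode() builtin no longer exists, a rough equivalent of the second method might look like the sketch below. It assumes the text really contains literal backslash-u sequences and nothing outside ASCII; the raw string literal stands in for data received from elsewhere, and the name decoded is illustrative.

```python
# The raw literal keeps the backslashes as literal characters,
# mimicking text that arrived with \uXXXX sequences spelled out.
a = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# Turn the str into bytes, then let the unicode_escape codec
# interpret the \uXXXX sequences.
decoded = a.encode("ascii").decode("unicode_escape")
print(decoded)
```

The encode("ascii") step raises UnicodeEncodeError if the input already contains non-ASCII characters, which is a useful warning that this trick does not apply to that input.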
Where are you getting this data from? Perhaps you could share the method by which you are downloading and extracting it.
Anyway, it kind of looks like a remnant of some JSON encoded string? Based on that assumption, here is a very hacky (and not entirely serious) way to do it:
>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> a
'\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
>>> s = '"{}"'.format(a)
>>> s
'"\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728"'
>>> import json
>>> json.loads(s)
u'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'
>>> print json.loads(s)
方法,删除存储在
This involves recreating a valid JSON-encoded string by wrapping the string held in a in double quotes, then decoding that JSON string into a Python unicode string.
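The same trick carries over to Python 3 more or less unchanged. This is only a sketch under the same assumption as above (the text is a fragment of a JSON string); the raw literal again stands in for the received data, and decoded is an illustrative name.

```python
import json

# Raw literal: the backslashes are literal characters, as in the received data.
a = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# Wrap the fragment in double quotes so it becomes a valid JSON string,
# then let the JSON parser interpret the \uXXXX escapes.
decoded = json.loads('"{}"'.format(a))
print(decoded)
```

If the fragment itself contains an unescaped double quote, the wrapped text is no longer valid JSON and json.loads raises an error, so parsing the original JSON response as a whole is usually the safer route.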
This question already has answers here:
Decode escaped characters in URL
(5 answers)
Url decode UTF-8 in Python
(5 answers)
Closed 9 years ago.
I have a link like below
http%253A%252F.....25252520.doc
How do I convert this to a normal link in Python? The link has lots of encoded stuff.
Apply urllib.unquote twice:
>>> import urllib
>>> strs = urllib.unquote("http%253A%252F.....25252520.doc")
>>> urllib.unquote(strs)
'http:/.....25252520.doc'
Use urllib.unquote():
Replace %xx escapes by their single-character equivalent.
It looks as if you have a double or even triple encoded URL; the http:// part has been encoded to http%253A%252F, which decodes to http%3A%2F, which in turn becomes http:/. The URL itself may contain another stage of encoding, but you didn't share enough of the actual URL with us to determine that.
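In Python 3 the function lives in urllib.parse. When the number of encoding layers is unknown, one option is to keep unquoting until the string stops changing. The helper below is a hypothetical sketch (unquote_fully and max_rounds are names made up for this example), and the example.com URL is invented because the real one was elided.

```python
from urllib.parse import unquote


def unquote_fully(url, max_rounds=10):
    """Apply unquote repeatedly until the result is stable (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:  # nothing left to decode
            break
        url = decoded
    return url


# Made-up, triple-encoded URL used purely for illustration.
print(unquote_fully("http%253A%252F%252Fexample.com%252Fa%252520b.doc"))
```

Be aware that this over-decodes a URL that is meant to contain a literal %xx sequence, so it only makes sense when you know the input was encoded several times by mistake.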
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Character reading from file in Python
I want to strip all special characters from an input string read from a file, keeping only actual letters (even Cyrillic letters shouldn't be stripped). The solution I found manually declares the string as unicode and compiles the pattern with the re.UNICODE flag so that actual letters from different languages are detected.
# -*- coding: utf-8 -*-
import re
pattern = re.compile("[^\w\d]",re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'
uni = u'ähm whatßs äüöp ×äØü'
words = pattern.split(n_uni) #doesn't work
u_words = pattern.split(uni) #works
So if I write the string directly in the source and manually define it as Unicode it gives me the desired output while the non-Unicode string gives me just garbage:
"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-unicode even with some invalid characters
My question is now how do I define the input from a file as Unicode?
My question is now how do I define the input from a file as unicode?
Straight from the docs.
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
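On Python 3 the built-in open() accepts an encoding argument directly, and str patterns in re are Unicode-aware by default, so the whole task can be sketched as below. The function name split_words and the file path are assumptions made for this example; the pattern is simplified to \W+ because \d is already covered by \w.

```python
import re

# \W matches anything that is not a letter, digit, or underscore;
# for str patterns in Python 3 this is Unicode-aware by default.
pattern = re.compile(r"\W+")


def split_words(path, encoding="utf-8"):
    """Read a text file as Unicode and split it on non-word characters."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    # Drop the empty strings that appear at the edges of the split.
    return [word for word in pattern.split(text) if word]
```

Calling split_words('input.txt') on a UTF-8 file (the file name is hypothetical) returns the words with their umlauts and other letters intact; pass a different encoding if the file was not saved as UTF-8.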