Convert str to unicode str

Convert str to unicode str - python

I need to convert a str to text in Python 2.7
a = u'"\u0274\u1d1c\u0274\u1d04\u1d00 \u1d00\u028f\u1d1c\u1d05\u1d07s \u1d00 \u1d1c\u0274 \u0274\u026a\xf1\u1d0f \u1d0f \u1d1c\u0274\u1d00 \u0274\u026a\xf1\u1d00 \u1d04\u1d0f\u0274 \u1d1c\u0274\u1d00 \u1d1b\u1d00\u0280\u1d07\u1d00 \u1d07\u0274 \u029f\u1d00 \u01eb\u1d1c\u1d07 s\u026a\u1d07\u0274\u1d1b\u1d07 \u01eb\u1d1c\u1d07 \u1d18\u1d1c\u1d07\u1d05\u1d07 \u1d1b\u1d07\u0274\u1d07\u0280 \u1d07x\u026a\u1d1b\u1d0f"'
I try with a.decode('utf8') but the truth is I don't know what kind of code is the str a
The output I need is:
"ɴᴜɴᴄᴀ ᴀʏᴜᴅᴇs ᴀ ᴜɴ ɴɪñᴏ ᴏ ᴜɴᴀ ɴɪñᴀ ᴄᴏɴ ᴜɴᴀ ᴛᴀʀᴇᴀ ᴇɴ ʟᴀ ǫᴜᴇ sɪᴇɴᴛᴇ ǫᴜᴇ ᴘᴜᴇᴅᴇ ᴛᴇɴᴇʀ ᴇxɪᴛᴏ"
ERROR:
>>> print(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "F:\WinPython-64bit-2.7.13.1Zero\python-2.7.13.amd64\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-5: character maps to <undefined>

Since you are on Python2, you have to encode the string contents - which are already text, to your terminal encoding.
So, if you are on windows, print(a.encode("cp-850")), if you are on Linux, Mac-OS, or other O.S.: print(a.encode("utf-8"))
On Python3 the encoding should be done automatically.
Also, it is important to understand that characters codified like \uNNNN in Python correspond to Unicode codepoints - and not to specific character encodings like "utf-8", "latin1" or "utf-16". In Python 3 most readable characters encoding like this will be shown even with the string internal representation, which is displayed by default in a Python interactive session (otherwise use the built-in repr call to see it). By using the built-in "str" or a call to print, you see the rendered string, and all \uXXXX, \UXXXXXXXX, \xNN and \N{unicode character name} tokens are rendered as the actual characters. (In Python2 you need to manually encode this representation to the character encoding used in your device)
In other words, if you are using Python 3, this is as simple as:
In [15]: a = u'"\u0274\u1d1c\u0274\u1d04\u1d00 \u1d00\u028f\u1d1c\u1d05\u1d07s \u1d00 \u1d1c\u0274 \u0274\u026a\xf1\u1d0f \u1d0f \u1d1c\u0274\u1d00 \u0274\u026a\xf1\u1d00 \u1d04\u1d0f\u0274 \u1d1c\u0274\u1d00 \u1d1b\u1d00\u0280\u1d07\u1d00 \u1d07\u0274 \u029f\u1d00 \u01eb\u1d1c\u1d07 s\u026a\u1d07\u0274\u1d1b\u1d07 \u01eb\u1d1c\u1d07 \u1d18\u1d1c\u1d07\u1d05\u1d07 \u1d1b\u1d07\u0274\u1d07\u0280 \u1d07x\u026a\u1d1b\u1d0f"'
...:
In [16]: a
Out[16]: '"ɴᴜɴᴄᴀ ᴀʏᴜᴅᴇs ᴀ ᴜɴ ɴɪñᴏ ᴏ ᴜɴᴀ ɴɪñᴀ ᴄᴏɴ ᴜɴᴀ ᴛᴀʀᴇᴀ ᴇɴ ʟᴀ ǫᴜᴇ sɪᴇɴᴛᴇ ǫᴜᴇ ᴘᴜᴇᴅᴇ ᴛᴇɴᴇʀ ᴇxɪᴛᴏ"'
Or:
In [17]: print(a)
"ɴᴜɴᴄᴀ ᴀʏᴜᴅᴇs ᴀ ᴜɴ ɴɪñᴏ ᴏ ᴜɴᴀ ɴɪñᴀ ᴄᴏɴ ᴜɴᴀ ᴛᴀʀᴇᴀ ᴇɴ ʟᴀ ǫᴜᴇ sɪᴇɴᴛᴇ ǫᴜᴇ ᴘᴜᴇᴅᴇ ᴛᴇɴᴇʀ ᴇxɪᴛᴏ"

Related

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 42: ordinal not in range(128)

def main():
client = ##client_here
db = client.brazil
rio_bus = client.tweets
result_cursor = db.tweets.find()
first = result_cursor[0]
ordered_fieldnames = first.keys()
with open('brazil_tweets.csv','wb') as csvfile:
csvwriter = csv.DictWriter(csvfile,fieldnames = ordered_fieldnames,extrasaction='ignore')
csvwriter.writeheader()
for x in result_cursor:
print x
csvwriter.writerow( {k: str(x[k]).encode('utf-8') for k in x})
#[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]
if __name__ == '__main__':
main()
Basically the issue is that the tweets contain a bunch of characters in Portuguese. I tried to correct for this by encoding everything into unicode values before putting them in the dictionary that was to be added to the row. However this doesn't work. Any other ideas for formatting these values so that csv reader and dictreader can read them?

str(x[k]).encode('utf-8') is the problem.
str(x[k]) will convert a Unicode string to an byte string using the default ascii codec in Python 2:
>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Non-Unicode values, like booleans, will be converted to byte strings, but then Python will implicitly decode the byte string to a Unicode string before calling .encode(), because you can only encode Unicode strings. This usually won't cause an error because most non-Unicode objects have an ASCII representation. Here's an example where a custom object returns a non-ASCII str() representation:
>>> class Test(object):
... def __str__(self):
... return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x=Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Note the above was a decode error instead of an encode error.
If str() is only there to coerce booleans to a string, coerce it to a Unicode string instead:
unicode(x[k]).encode('utf-8')
Non-Unicode values will be converted to Unicode strings, which can then be correctly encoded, but Unicode strings will remain unchanged, so they will also be encoded correctly.
>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'
P.S. Python 3 does not do implicit encode/decode between byte and Unicode strings and makes these errors easier to spot.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

I'm trying to save concrete content of the dictionary to a file but when I try to write it, I get the following error:
Traceback (most recent call last):
File "P4.py", line 83, in <module>
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)
And here is the code:
from collections import Counter
with open("corpus.txt") as inf:
wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)
with open("lexic.txt", "w") as outf:
outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
for word,count in wordtagcount.iteritems():
outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))
"""
2) TAGGING USING THE MODEL
Dados los ficheros de test, para cada palabra, asignarle el tag mas
probable segun el modelo. Guardar el resultado en ficheros que tengan
este formato para cada linea: Palabra Prediccion
"""
file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()
diccionario = {}
"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
aux = linea.decode('latin_1').encode('utf-8')
sintagma = aux.split('\t') # Here we separate the String in a list: [word, tag, ocurrences], word=sintagma[0], tag=sintagma[1], ocurrences=sintagma[2]
if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): #We are not interested in the first line of the file, this is the filter
if (diccionario.has_key(sintagma[0])): #Here we check if the word was included before in the dictionary
aux_list = diccionario.get(sintagma[0]) #We know the name already exists in the dic, so we create a List for every value
aux_list.append([sintagma[1], sintagma[2]]) #We add to the list the tag and th ocurrences for this concrete word
diccionario.update({sintagma[0]:aux_list}) #Update the value with the new list (new list = previous list + new appended element to the list)
else: #If in the dic do not exist the key, que add the values to the empty list (no need to append)
aux_list_else = ([sintagma[1],sintagma[2]])
diccionario.update({sintagma[0]:aux_list_else})
"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])
For retrieve the information from diccionario, we have to keep in mind:
In case we have more than 1 Tag associated to a word (keyword ), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:
diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())#We create a dictionary with the keys from diccionario and we set all the values to None
with open("estimation.txt", "w") as outfile:
for keyword in diccionario:
tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8')) #tagSugerido is the tag with more ocurrences for a concrete keyword
maximo = float(diccionario.get(keyword)[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
if ((len(diccionario.get(keyword))) > 2): #in case we have > 2 tags for a concrete word
suma = float(diccionario.get(keyword)[1])
for i in range (2, len(diccionario.get(keyword))):
suma += float(diccionario.get(keyword)[i][1])
if (diccionario.get(keyword)[i][1] > maximo):
tagSugerido = unicode(diccionario.get(keyword)[i][0]).decode('utf-8'))
maximo = float(diccionario.get(keyword)[i][1])
probabilidad = float(maximo/suma);
diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})
else:
diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))
The desired output will look like this:
keyword(String) tagSugerido(String):
Hello NC
Friend N
Run V
...etc
The conflictive line is:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
Thank you.

Like zmo suggested:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
A note on unicode in Python 2
Your software should only work with unicode strings internally, converting to a particular encoding on output.
Do prevent from making the same error over and over again you should make sure you understood the difference between ascii and utf-8 encodings and also between str and unicode objects in Python.
The difference between ASCII and UTF-8 encoding:
Ascii needs just one byte to represent all possible characters in the ascii charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.
ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)
utf-8 (unicode transformation format)
1 If the code point is <128, it’s represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3 Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
The difference between str and unicode objects:
You can say that str is baiscally a byte string and unicode is a unicode string. Both can have a different encoding like ascii or utf-8.
str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
/\
/ \
str unicode
If you follow some simple rules you should go fine with handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:
Rules
1 encode(): Gets you from Unicode -> bytes
encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2 decode(): Gets you from bytes -> Unicode
decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3 codecs.open(encoding=”utf-8″): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4 u”: Makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])
Warning: Don’t use encode() on bytes or decode() on Unicode objects
And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.

As you're not giving a simple concise code to illustrate your question, I'll just give you a general advice on what should be the error:
If you're getting a decode error, it's that tagSugerido is read as ASCII and not as Unicode. To fix that, you should do:
tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))
to store it as an unicode.
Then you're likely to get an encode error at the write() stage, and you should fix your write the following way:
outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
should be:
outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))
I litterally answered a very similar question moments ago. And when working with unicode strings, switch to python3, it'll make your life easier!
If you cannot switch to python3 just yet, you can make your python2 behave like it is almost python3, using the python-future import statement:
from __future__ import absolute_import, division, print_function, unicode_literals
N.B.: instead of doing:
file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()
which will fail to close properly the file descriptor upon failure during readlines, you should better do:
with open("lexic.txt", "r") as f:
data=f.readlines()
which will take care of always closing the file even upon failure.
N.B.2: Avoid using file as this is a python type you're shadowing, but use f or lexic_file…

Decode an ENCODED unicode string in Python

I need to decode a "UNICODE" encoded string:
>>> id = u'abcdß'
>>> encoded_id = id.encode('utf-8')
>>> encoded_id
'abcd\xc3\x9f'
The problem I have is:
Using Pylons routing, I get the encoded_id variable as a unicode string u'abcd\xc3\x9f' instead of a just a regular string 'abcd\xc3\x9f':
Using python, how can I decode my encoded_id variable which is a unicode string?
>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/test/vng/lib64/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)

You have UTF-8 encoded data (there is no such thing as UNICODE encoded data).
Encode the unicode value to Latin-1, then decode from UTF8:
encoded_id.encode('latin1').decode('utf8')
Latin 1 maps the first 255 unicode points one-on-one to bytes.
Demo:
>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.encode('latin1').decode('utf8')
u'abcd\xdf'
>>> print encoded_id.encode('latin1').decode('utf8')
abcdß

how to show the right word in my code, my code is : os.urandom(64)

My code is:
print os.urandom(64)
which outputs:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
\xd0\xc8=<\xdbD'
\xdf\xf0\xb3>\xfc\xf2\x99\x93
=S\xb2\xcd'\xdbD\x8d\xd0\\xbc{&YkD[\xdd\x8b\xbd\x82\x9e\xad\xd5\x90\x90\xdcD9\xbf9.\xeb\x9b>\xef#n\x84
which isn't readable, so I tried this:
print os.urandom(64).decode("utf-8")
but then I get:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
Traceback (most recent call last):
File "D:\zjm_code\a.py", line 17, in <module>
print os.urandom(64).decode("utf-8")
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-3: invalid data
What should I do to get human-readable output?

No shortage of choices. Here's a couple:
>>> os.urandom(64).encode('hex')
'0bf760072ea10140d57261d2cd16bf7af1747e964c2e117700bd84b7acee331ee39fae5cff6f3f3fc3ee3f9501c9fa38ecda4385d40f10faeb75eb3a8f557909'
>>> os.urandom(64).encode('base64')
'ZuYDN1BiB0ln73+9P8eoQ3qn3Q74QzCXSViu8lqueKAOUYchMXYgmz6WDmgJm1DyTX598zE2lClX\n4iEXXYZfRA==\n'

os.urandom is giving you a 64-bytes string. Encoding it in hex is probably the best way to make it "human readable" to some extent. E.g.:
>>> s = os.urandom(64)
>>> s.encode('hex')
'4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e'
Of course this gives you 128 characters in the result, which may be too long a line to read comfortably; it's easy to split it up, though -- e.g.:
>>> print s[:32].encode('hex')
4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef
>>> print s[32:].encode('hex')
4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e
two chunks of 64 characters each shown on separate lines may be easier on the eye.

Random bytes are not likely to be unicode characters, so I'm not suprised that you get encoding errors. Instead you need to convert them somehow. If all you're trying to do is see what they are, then something like:
print [ord(o) for o in os.urandom(64)]
Or, if you'd prefer to have it as hex 0-9a-f:
print ''.join( [hex(ord(o))[2:] for o in os.urandom(64)] )

purpose of '"sss".decode("base64").decode("zlib")'

ACTIVATE_THIS = """
eJx1UsGOnDAMvecrIlYriDRlKvU20h5aaY+teuilGo1QALO4CwlKAjP8fe1QGGalRoLEefbzs+Mk
Sb7NcvRo3iTcoGqwgyy06As+HWSNVciKaBTFywYoJWc7yit2ndBVwEkHkIzKCV0YdQdmkvShs6YH
E3IhfjFaaSNLoHxQy2sLJrL0ow98JQmEG/rAYn7OobVGogngBgf0P0hjgwgt7HOUaI5DdBVJkggR
3HwSktaqWcCtgiHIH7qHV+esW2CnkRJ+9R5cQGsikkWEV/J7leVGs9TV4TvcO5QOOrTHYI+xeCjY
JR/m9GPDHv2oSZunUokS2A/WBelnvx6tF6LUJO2FjjlH5zU6Q+Kz/9m69LxvSZVSwiOlGnT1rt/A
77j+WDQZ8x9k2mFJetOle88+lc8sJJ/AeerI+fTlQigTfVqJUiXoKaaC3AqmI+KOnivjMLbvBVFU
1JDruuadNGcPmkgiBTnQXUGUDd6IK9JEQ9yPdM96xZP8bieeMRqTuqbxIbbey2DjVUNzRs1rosFS
TsLAdS/0fBGNdTGKhuqD7mUmsFlgGjN2eSj1tM3GnjfXwwCmzjhMbR4rLZXXk+Z/6Hp7Pn2+kJ49
jfgLHgI4Jg==
""".decode("base64").decode("zlib")
my code:
import zlib
print 'dsss'.decode('base64').decode('zlib')#error
Traceback (most recent call last):
File "D:\zjm_code\b.py", line 4, in <module>
print 'dsss'.decode('base64').decode('zlib')
File "D:\Python25\lib\encodings\zlib_codec.py", line 43, in zlib_decode
output = zlib.decompress(input)
zlib.error: Error -3 while decompressing data: unknown compression method
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 7, in <module>
a.decode('base64')
File "D:\Python25\lib\encodings\base64_codec.py", line 42, in base64_decode
output = base64.decodestring(input)
File "D:\Python25\lib\base64.py", line 321, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
a='dsss'
a=a.encode('zlib')
#print a
a=a.decode('zlib')
print a#its ok
i think the 'print a' encode the a with 'uhf-8'.
so:
#encoding:utf-8
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('utf-8')#but error.
a=a.decode('zlib')
print a#
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 5, in <module>
a=a.decode('utf-8')
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 1: unexpected code byte

The data in the strings is encoded and compressed binary data. The .decode("base64").decode("zlib") unencodes and decompresses it.
The error you got was because 'dsss' decoded from base64 is not valid zlib compressed data.

What is the purpose of x.decode(”base64”).decode(”zlib”) for x in ("sss", "dsss", random_garbage)? Excuse me, you should know; you are the one who is doing it!
Edit after OP's addition of various puzzles
Puzzle 1
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
Resolution: all 3 statements of the form
a.XXcode('encoding')
should be
a = a.XXcode('encoding')
Puzzle 2
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
But it does print 'dsss':
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x£K)..♠ ♦F☺¥
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
Puzzle 3
"""i think the 'print a' encode the a with 'uhf-8'."""
Resolution: You think extremely incorrectly. What follows the print is an expression. There are no such side effects. What do you imagine happens when you do this:
print 'start text ' + a + 'end text'
?
What do you imagine happens if you do print a twice? Encoding the already-encoded text again? Why don't you stop imagining and try it out?
In any case, note that the output of str.encode('zlib') is an str object, not a unicode object:
>>> print repr('dsss'.encode('zlib'))
'x\x9cK)..\x06\x00\x04F\x01\xbe'
Getting from that to UTF-8 is going to be somewhat difficult ... it would have to be decoded into unicode first -- with what codec? ascii and utf8 are going to have trouble with the '\x9c' and the '\xbe' ...

It is the reverse of:
original_message.encode('zlib').encode('base64')
zlib is a binary compression algorithm. base64 is a text encoding of binary data, which is useful to send binary message through text protocols like SMTP.
After 'dsss' was decoded from base64 (the three bytes 76h, CBh, 2Ch), the result was not valid zlib compressed data so it couldn't be decoded.
Try printing ACTIVATE_THIS to see the result of the decoding. It turns out to be some Python code.

.decode('base64') can be called only on a string that's encoded as "base-64, in order to retrieve the byte sequence that was there encoded. Presumably that byte sequence, in the example you bring, was zlib-compressed, and so the .decode('zlib') part decompresses it.
Now, for your case:
>>> 'dsss'.decode('base64')
'v\xcb,'
But 'v\xcv,' is not a zlib-compressed string! And so of course you cannot ask zlib to "decompress" it. Fortunately zlib recognizes the fact (that 'v\xcv,' could not possibly have been produced by applying any of the compression algorithms zlib knows about to any input whatsoever) and so gives you a helpful error message (instead of a random-ish string of bytes, which you might well have gotten if you had randomly supplied a different but equally crazy input string!-)
Edit: the error in
a.encode('base64')
print a
a.decode('base64')#error
is obviously due to the fact that strings are immutable: just calling a.encode (or any other method) does not alter a, it produces a new string object (and here you're just printing it).
In the next snippet, the error is only in the OP's mind:
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x?K)..F?
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
that "why can't print" question is truly peculiar, applied to code that does print 'dsss'. Finally,
i think the 'print a' encode the a
with 'uhf-8'.
You think wrongly: there's no such thing as "uhf-8" (you mean "utf-8" maybe?), and anyway print a does not alter a, any more than just calling a.encode does.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Convert str to unicode str - python

Related

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 42: ordinal not in range(128)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

Decode an ENCODED unicode string in Python

how to show the right word in my code, my code is : os.urandom(64)

purpose of '"sss".decode("base64").decode("zlib")'

Categories

Resources