Unicode-Ascii mixed string in python

Unicode-Ascii mixed string in python - python

I have a string that is stored in DB as:
FB (\u30a8\u30a2\u30eb\u30fc)
when I load this row from python code, I am unable to format it correctly.
# x = load that string
print x # returns u'FB (\\u30a8\\u30a2\\u30eb\\u30fc)'
Notice two "\" This messes up the unicode chars on frontend
Instead of showing the foreign chars, html shows it as \u30a8\u30a2\u30eb\u30fc
However, if I load append some characters to convert it into a json format and load the json, I get the expected result.
s = '{"a": "%s"}'%x
json.loads(s)['a']
#prints u'FB (\u30a8\u30a2\u30eb\u30fc)'
Notice the difference between this result (which shows up correctly on frontend) and directly printing x (which has extra ).
So though this hacky solution works, I want a cleaner solution.
I played around a lot with x.encode('utf-8') etc, but none has worked yet.
Thank you!

Since you already have a Unicode string, encode it back to ASCII and decode it with the unicode_escape codec:
>>> s = u'FB (\\u30a8\\u30a2\\u30eb\\u30fc)'
>>> s
u'FB (\\u30a8\\u30a2\\u30eb\\u30fc)'
>>> print s
FB (\u30a8\u30a2\u30eb\u30fc)
>>> s.encode('ascii').decode('unicode_escape')
u'FB (\u30a8\u30a2\u30eb\u30fc)'
>>> print s.encode('ascii').decode('unicode_escape')
FB (エアルー)

raw_string = '\u30a8\u30a2\u30eb\u30fc'
string = ''.join([unichr(int(r, 16)) for r in raw_string.split('\\u') if r])
print(string)
A way to solve this, expecting a better answer.

Related

Converting Byte to String and Back Properly in Python3?

Given a random byte (i.e. not only numbers/characters!), I need to convert it to a string and then back to the inital byte without loosing information. This seems like a basic task, but I ran in to the following problems:
Assuming:
rnd_bytes = b'w\x12\x96\xb8'
len(rnd_bytes)
prints: 4
Now, converting it to a string. Note: I need to set backslashreplace as it otherwise returns a 'UnicodeDecodeError' or would loose information setting it to another flag value.
my_str = rnd_bytes.decode('utf-8' , 'backslashreplace')
Now, I have the string.
I want to convert it back to exactly the original byte (size 4!):
According to python ressources and this answer, there are different possibilities:
conv_bytes = bytes(my_str, 'utf-8')
conv_bytes = my_str.encode('utf-8')
But len(conv_bytes) returns 10.
I tried to analyse the outcome:
>>> repr(rnd_bytes)
"b'w\\x12\\x96\\xb8'"
>>> repr(my_str)
"'w\\x12\\\\x96\\\\xb8'"
>>> repr(conv_bytes)
"b'w\\x12\\\\x96\\\\xb8'"
It would make sense to replace '\\\\'. my_str.replace('\\\\','\\') doesn't change anything. Probably, because four backslashes represent only two. So, my_str.replace('\\','\') would find the '\\\\', but leads to
SyntaxError: EOL while scanning string literal
due to the last argument '\'. This had been discussed here, where the following suggestion came up:
>>> my_str2=my_str.encode('utf_8').decode('unicode_escape')
>>> repr(my_str2)
"'w\\x12\\x96¸'"
This replaces the '\\\\' but seems to add / change some other characters:
>>> conv_bytes2 = my_str2.encode('utf8')
>>> len(conv_bytes2)
6
>>> repr(conv_bytes2)
"b'w\\x12\\xc2\\x96\\xc2\\xb8'"
There must be a prober way to convert a (complex) byte to a string and back. How can I achieve that?

Note: Some codes found on the Internet.
You could try to convert it to hex format. Then it is easy to convert it back to byte format.
Sample code to convert bytes to string:
hex_str = rnd_bytes.hex()
Here is how 'hex_str' looks like:
'771296b8'
And code for converting it back to bytes:
new_rnd_bytes = bytes.fromhex(hex_str)
The result is:
b'w\x12\x96\xb8'
For processing you can use:
readable_str = ''.join(chr(int(hex_str[i:i+2], 16)) for i in range(0, len(hex_str), 2))
But newer try to encode readable string, here is how readable string looks like:
'w\x12\x96¸'
After processing readable string convert it back to hex format before converting it back to bytes string like:
hex_str = ''.join([str(hex(ord(i)))[2:4] for i in readable_str])

Now, converting it to a string. Note: I need to set backslashreplace as it otherwise returns a 'UnicodeDecodeError' or would loose information setting it to another flag value.
The UTF-8 encoding cannot interpret every possible sequence of bytes as a string. Using backslashreplace gives you a string that preserves the information for bytes that couldn't be converted:
>>> rnd_bytes = b'w\x12\x96\xb8'
>>> rnd_bytes.decode('utf-8', 'backslashreplace')
'w\x12\\x96\\xb8'
but that representation is not very useful for converting back.
Instead, use an encoding that does interpret every possible sequence of bytes as a string. The most straightforward of these is ISO-8859-1, which simply maps each byte one at a time to the first 256 Unicode code points respectively.
>>> rnd_bytes.decode('iso-8859-1')
'w\x12\x96¸'
>>> rnd_bytes.decode('iso-8859-1').encode('iso-8859-1') == rnd_bytes
True

Python request .json() method change symbols

i will be very appreciated for any help
I'm getting info from website... And generally everyting works fine... But, just look at code:
r = requests.get(s_url)
print r.text
>>>[{"nameID":"D1","text":"I’ll understand what actually happened here."}]
print r.json()
>>>[{u'nameID':u'D1',u'text':u'I\u2019ll understand what actually happened
here.'"}]
As you see after r.json() ' was changed to \u2019.
Why it did it? How I can force to don't make such changes?
Thanks

It is not an apostrophe (') in your string, but a (unicode) character "right single quotation mark" (’). Which may not be as obvious to a naked eye (depending on a font used easier or more difficult to spot), but it is still a completely different character to the computer.
>>> u"’" == u"'"
False
The difference you see between displaying text attributed and what json() method returns is just a matter of representation of that character.
For instance:
>>> s = u'’'
>>> print s
’
But when you do the same with a dictionary returned by json() individual keys and values thereof are formatted using repr() resulting in:
>>> d = {'k': s}
>>> print d
{'k': u'\u2019'}
Just as in:
>>> print repr(s)
u'\u2019'
Nonetheless, it is still the same thing:
>>> d['k'] == s
True
For that matter:
>>> u'’' == u'\u2019'
True
You may want to have a look at __str__() and __repr__() methods as well as description of print for more details:
https://docs.python.org/2.7/reference/datamodel.html#object.str
https://docs.python.org/2.7/reference/datamodel.html#object.repr
https://docs.python.org/2.7/reference/simple_stmts.html#print

What is happening to the output of this conversion, and why isn't it what I think it should be?

I'm trying to write a short program in Python to convert a string from hex into bytes, and then from bytes into base64. I've got a base string in hexadecimal, and the equivalent of the string in base64. The program looks like this:
import codecs
basecode = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d"
DesiredResult = "SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t"
e = codecs.decode(basecode, 'hex')
print(e)
y = codecs.encode(e, 'base64')
print(y)
z = bytes.decode(y, 'utf-8')
print(z)
print(DesiredResult)
if y == DesiredResult:
print("Success!")
The output I'm getting from the program is as follows:
b"I'm killing your brain like a poisonous mushroom"
b'SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t\n'
SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t
SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t
As I'm sure you can imagine, this is frustrating, because it means that while I've come close, I haven't actually done what I set out to do.
So, how can I modify this program to make sure that the output of my conversion is exactly the same as the variable DesiredResult?
I know it has to do with the \n that appears in the conversion, but I don't know how it gets there, and I don't know how to get rid of it.
Any and all suggestions greatly appreciated.

The base64 codec always adds a \n byte at the end.
Convert operand to MIME base64 (the result always includes a trailing '\n')
To strip the last \n byte, use the Python slicing operator like this:
>>> y
'SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t\n'
>>> DesiredResult
'SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t'
>>> y[:-1]
'SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t'
>>> y[:-1] == DesiredResult
True

Python3 : unescaping non ascii characters

(Python 3.3.2) I have to unescape some non ASCII escaped characters returned by a call to re.escape(). I see here and here methods that doesn't work. I'm working in a 100% UTF-8 environment.
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS : I edited my post thanks to the Michael Foukarakis' remark.

I guess the actual string you need to process is mystring = €\\n?
mystring = "€\n" # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"
I don't really understand what's going wrong within encode() and decode() of python3, but my friend solve this problem when we are writing some tools.
How we did is to bypass the encoder("utf_8") after the escape procedure is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that: though the result of decode("unicode_escape") looks wired, the bytes object actually contain the correct bytes of your strings(with utf-8 encoding), in this case, "\xe2\x82\xac\n"
And we now do not print the str object directly, neither do we use encode("utf_8"), we use ord() to create the bytes object b'\xe2\x82\xac\n'.
And you can get the correct str from this bytes object, just put it into str()
BTW, the tool my friend and me want to make is a wrapper that allow user to input c-like string literal, and convert the escaped sequence automatically.
User input:\n\x61\x62\n\x20\x21 # 20 characters, which present 6 chars semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a powerful tool for user to input some non-printable character in terminal.
Our final tools is:
#!/usr/bin/env python3
import sys
for line in sys.stdin:
sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
sys.stdout.flush()

You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh: # binary mode
fh.write(b'print("' + s.encode('unicode_escape') + b'")')

import string
printable = string.printable
printable = printable + '€'
def cod(c):
return c.encode('unicode_escape').decode('ascii')
def unescape(s):
return ''.join(c if ord(c)>=32 and c in printable else cod(c) for c in s)
mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.

Python: convert a dot separated hex values to string?

In this post: Print a string as hex bytes? I learned how to print as string into an "array" of hex bytes now I need something the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.

you can use the built in binascii module. Do note however that this function will only work on ASCII encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless however, but that is quite easy with a simple
string = string.replace('.','')
another (arguably safer) method would be to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using python 3.2 and you may not necessarily need the b out the front of the string to denote bytes.
Example was found here

Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: works with unicode:
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')

Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Unicode-Ascii mixed string in python - python

raw_string = '\u30a8\u30a2\u30eb\u30fc' string = ''.join([unichr(int(r, 16)) for r in raw_string.split('\\u') if r]) print(string) A way to solve this, expecting a better answer.

Related

Converting Byte to String and Back Properly in Python3?

Python request .json() method change symbols

What is happening to the output of this conversion, and why isn't it what I think it should be?

Python3 : unescaping non ascii characters

Python: convert a dot separated hex values to string?

Categories

Resources