How do you decode an ascii string in python? - python

For example, in your Python shell (IDLE):
>>> a = "\x3cdiv\x3e"
>>> print a
The result you get is:
<div>
but if a is an ascii encoded string:
>>> a = "\\x3cdiv\\x3e" ## it's the actual \x3cdiv\x3e string if you read it from a file
>>> print a
The result you get is:
\x3cdiv\x3e
Now what I really want from a is <div>, so I did this:
>>> b = a.decode("ascii")
>>> print b
BUT surprisingly I did NOT get the result I want, it's still:
\x3cdiv\x3e
So basically what do I do to convert a, which is \x3cdiv\x3e to b, which should be <div>?
Thanks

>>> a = rb"\x3cdiv\x3e"
>>> a.decode('unicode_escape')
'<div>'
Also check out some interesting codecs.
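If the escaped text is already a str in Python 3 (for example, read from a file in text mode), one possible approach is to encode it to bytes first and then decode it with the unicode_escape codec. This is only a sketch and assumes the text is pure ASCII; the variable names are illustrative.

```python
# Text as it would arrive from a file: literal backslashes, not real escapes.
escaped = r"\x3cdiv\x3e"

# Encode to bytes, then let the unicode_escape codec interpret the \x escapes.
# Assumption: the text is pure ASCII; non-ASCII input would need other handling.
decoded = escaped.encode("ascii").decode("unicode_escape")

print(decoded)  # <div>
```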

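With Python 3.x, you would adapt Kabie's answer to
a = b"\x3cdiv\x3e"
a.decode('unicode_escape')
or
a = b"\x3cdiv\x3e"
a.decode('ascii')
Both return '<div>', because in a non-raw bytes literal the \x escapes are already interpreted:
>>> a
b'<div>'
For bytes that contain literal backslashes (as read from a file), only 'unicode_escape' converts the escapes.
What is the b prefix for?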
Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type. They may only
contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.
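To make the difference concrete, here is a small sketch comparing a normal bytes literal with a raw one. The variable names are made up for illustration.

```python
plain = b"\x3cdiv\x3e"   # escapes are interpreted when the literal is parsed
raw = rb"\x3cdiv\x3e"    # raw literal: the backslashes stay in the data

print(plain == b"<div>")             # True
print(len(plain), len(raw))          # 5 11
print(plain.decode("ascii"))         # <div>
print(raw.decode("unicode_escape"))  # <div>
```

Decoding raw with 'ascii' would simply give back the text with its backslashes, which is what the question observed.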

Related

How to convert String(Word) to hexadecimal Binary with leading \x in Python 3

I have the following question about Python 3.
How can I convert a string (word) to hexadecimal with a leading \x in Python 3.x?
Example:
with integer:
>>> x = 319
>>> x_hex = '{0:04x}'.format(x)
Now it looks like this:
>>> print(x_hex)
013f
and to convert it to the right format:
>>> y = bytearray.fromhex(x_hex)
>>> print(y)
bytearray(b'\x01?')
Now my question:
How do I do this with a word or with long numbers?
When I use binascii.hexlify, the resulting string is not what I need:
Example:
>>> import binascii
>>> word = "hello012"
>>> word_2byte = bytes(word, encoding='ascii')
>>> word_hex = binascii.hexlify(word_2byte)
>>> print(word_hex)
b'68656c6c6f303132'
The output from binascii.hexlify is correct, but how do I get this format?
b'\x68\x65\x6c\x6c\x6f\x30\x31\x32'
Thank you for any help :-)
Encoding to bytes is all that is required; there is no difference between b'\x68' and b'h', b'\x65' and b'e', etc.
If you want the string representation to look like that, you will need to format each byte yourself.
>>> ''.join('\\x{:02x}'.format(c) for c in word_2byte)
'\\x68\\x65\\x6c\\x6c\\x6f\\x30\\x31\\x32'
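As a sketch of both points in Python 3: the encoded bytes already equal the \x-escaped literal, and a small helper can produce the escaped text when it is needed for display. The helper name to_hex_escapes is hypothetical.

```python
def to_hex_escapes(data: bytes) -> str:
    """Render every byte as a literal backslash-x escape (hypothetical helper)."""
    return "".join(f"\\x{byte:02x}" for byte in data)


word = "hello012"
word_bytes = word.encode("ascii")

# The bytes are identical no matter how the literal is spelled.
print(word_bytes == b"\x68\x65\x6c\x6c\x6f\x30\x31\x32")  # True

# Only the textual representation differs.
print(to_hex_escapes(word_bytes))  # \x68\x65\x6c\x6c\x6f\x30\x31\x32
```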

string.decode() function in python2

So I am converting some code from Python 2 to Python 3. I don't understand the Python 2 encode/decode functionality well enough to even determine what I should be doing in Python 3.
In python2, I can do the following things:
>>> c = '\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
帐户
>>> c.decode('utf8')
u'\u5e10\u6237'
What did I just do there? Doesn't the 'u' prefix mean unicode? Shouldn't the utf8 be '\xe5\xb8\x90\xe6\x88\xb7' since that is what I input in the first place?
Your variable c was not declared as a unicode string (with prefix 'u'), so it holds raw bytes. If you decode it using the 'latin1' encoding, each byte maps to the code point with the same value:
>>> c.decode('latin1')
u'\xe5\xb8\x90\xe6\x88\xb7'
Note that the result of decode is a unicode string:
>>> type(c)
<type 'str'>
>>> type(c.decode('latin1'))
<type 'unicode'>
If you declare c as a unicode and keep the same input, you will not print the same characters:
>>> c=u'\xe5\xb8\x90\xe6\x88\xb7'
>>> print c
å¸æ·
If you use the input '\u5e10\u6237', you will print the initial characters:
>>> c=u'\u5e10\u6237'
>>> print c
帐户
Encoding and decoding are just a matter of using a correspondence table between values and characters. The catch is that the same value does not render as the same character under every encoding (i.e. table).
The main difficulty is when you don't know the encoding of an input string that you have to handle. Some tools can try to guess it, but it is not always successful (see https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding).
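Since the question is about porting to Python 3, here is a rough equivalent, assuming the data arrives as UTF-8 bytes. In Python 3 the Python 2 str roughly corresponds to bytes and unicode corresponds to str; the variable names are illustrative.

```python
raw = b"\xe5\xb8\x90\xe6\x88\xb7"   # UTF-8 encoded bytes (Python 2 'str')

text = raw.decode("utf-8")          # bytes -> str (Python 2 'unicode')
print(text)                         # 帐户
print(text == "\u5e10\u6237")       # True

# Encoding goes the other way and restores the original bytes.
print(text.encode("utf-8") == raw)  # True

# The same bytes read through a different table give different characters:
# latin-1 maps each of the six bytes to one code point.
wrong = raw.decode("latin-1")
print(len(text), len(wrong))        # 2 6
```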

Remove '\x' from bytes

I'm currently reading bytes from a file, and I want to put two of these bytes into a list and convert them into an integer. Say the two bytes I want to read are \x02 and \x00. I want to join these bytes together before converting them into an integer such as 0x0200, but I am having difficulty doing so because I cannot remove the \x from the bytes.
I've tried using .replace('\\x', ''), but this doesn't work because Python treats each byte as one object rather than a list. I've also considered using struct, although I'm unsure whether it would work in my situation.
It's also not possible to iterate through each byte and remove the first two items, as Python still treats the entire byte as one object.
Here is the list I have after appending it with both bytes:
While it looks like two strings, they do not behave as strings. I then iterated over the list using:
for x in a:
    print a
The two lines below the list are the outputs of 'print a' (a blank space & special character). As you can see they do not print as normal strings.
Below is a code snippet showing how I add the bytes to the array, nothing complicated (test being the array in this case).
for i in openFile.read(512):
    ....
    ....
    elif 10 < count < 13:
        test.insert(0, i[0:])
You could use ord to extract each character's numeric value, then combine them with simple arithmetic.
>>> a = '\x02'
>>> b = '\x00'
>>> c = ord(a)*256 + ord(b)
>>> c == 0x0200
True
>>> print hex(c)
0x200
An alternate way to do this for standard-length types is to use the struct module to convert from strings of bytes to Python types.
For example:
>>> import struct
>>> byte_arr = ['\x02', '\x00']
>>> byte_str = ''.join(byte_arr)
>>> byte_str
'\x02\x00'
>>> num, = struct.unpack('>H', byte_str)
>>> num
512
In this example, the format string '>H' indicates a big-endian unsigned 2-byte integer. Other format strings can be used to specify other sizes, endianness, and signed/unsigned status.
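The answers above use Python 2 strings. In Python 3, where file data read in binary mode is a bytes object, a comparable sketch could use int.from_bytes alongside struct. The sample value stands in for the two bytes read from the file.

```python
import struct

data = b"\x02\x00"   # stand-in for the two bytes read from the file

# Interpret the bytes as one big-endian unsigned integer.
value = int.from_bytes(data, byteorder="big")
print(hex(value))    # 0x200

# struct gives the same result for fixed-size fields.
(unpacked,) = struct.unpack(">H", data)
print(unpacked)      # 512

# With little-endian byte order the same bytes mean something else.
print(int.from_bytes(data, byteorder="little"))  # 2
```

There is no \x to remove in either case: it is only how Python displays non-printable bytes.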
new_str=str(your_byte_like_object).split('\\x')
print("".join(new_str))
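You can convert the bytes object to a str, split it on the separator \x to get a list, and then join the list.
Note that this operates on the printed representation of the bytes, so printable bytes such as o3`mD stay as characters rather than hex digits.
The output looks like this: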
eigioer #text
b'0\x1e\xd7\xe8\xdf\xc1\xd7\x90o3`mD\x92U\xf5\xca\xe7l\xe5"TM' #raw byte
["b'0", '1e', 'd7', 'e8', 'df', 'c1', 'd7', '90o3`mD', '92U', 'f5', 'ca', 'e7l', 'e5"TM\''] #list
b'01ed7e8dfc1d790o3`mD92Uf5cae7le5"TM' #after joining
I had the same problem. I had a "bytes" object and I needed to remove the \xs to be able to upload my file to Cassandra, and all I needed to do was to use this:
my_bytes.hex()
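A brief illustration of bytes.hex() in Python 3, with made-up sample data. bytes.fromhex reverses it.

```python
my_bytes = b"\x02\x00\xff"        # sample data, not from the original post

hex_text = my_bytes.hex()         # two hex digits per byte, no \x prefix
print(hex_text)                   # 0200ff

# The hex text can be turned back into the same bytes.
print(bytes.fromhex(hex_text) == my_bytes)  # True
```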
We know that it always starts with \x, and the 'it' is a string. So we can just do...
>>> num = "\\x02"
>>> num = num[2:]
>>> print num
02
Update:
>>> num = [r"\x02", r"\x20"]
>>> num = [ n[2:] for n in num ]
>>> num
['02', '20']

Decode unicode string in python

I'd like to decode the following string:
t\u028c\u02c8m\u0251\u0279o\u028a\u032f
It should be the IPA of 'tomorrow' as given in a JSON string from http://rhymebrain.com/talk?function=getWordInfo&word=tomorrow
My understanding is that it should be something like:
x = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f'
print x.decode()
I have tried the solutions from several other answers (and several others that more or less apply), and several permutations of their parts, but I can't get it to work.
Thank you
You need a u before your string (in Python 2.x, which you appear to be using) to indicate that this is a unicode string:
>>> x = u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # note the u
>>> print x
tʌˈmɑɹoʊ̯
If you have already stored the string in a variable, you can use the following constructor to convert the string into unicode:
>>> s = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # your string has a unicode-escape encoding but is not unicode
>>> x = unicode(s, encoding='unicode-escape')
>>> print x
tʌˈmɑɹoʊ̯
>>> x
u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # a unicode string
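Because the string comes from a JSON response, another option is to let a JSON parser handle the \u escapes. The sketch below uses Python 3 and a hand-written payload; the key name "ipa" is an assumption for illustration and may not match the real response.

```python
import json

# Hand-written stand-in for the response body; the "ipa" key is hypothetical.
payload = '{"ipa": "t\\u028c\\u02c8m\\u0251\\u0279o\\u028a\\u032f"}'

info = json.loads(payload)   # the parser converts \uXXXX escapes to characters
print(info["ipa"])
```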

Python: convert a dot separated hex values to string?

In the post "Print a string as hex bytes?" I learned how to print a string as an "array" of hex bytes. Now I need to go the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Note, however, that this function only works on ASCII-encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless however, but that is quite easy with a simple
string = string.replace('.','')
Another (arguably safer) method is to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using Python 3.2; in Python 2 you may not need the b prefix in front of the string to denote bytes.
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better (Python 2 only, where str has a 'hex' codec):
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: works with unicode:
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>
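The 'hex' codec used in some answers above exists only in Python 2. For Python 3, a possible equivalent is bytes.fromhex. This sketch assumes the decoded bytes are ASCII text.

```python
dotted = "73.69.67.6e.61.74.75.72.65"

# Drop the dots, then parse the remaining hex digits into bytes.
raw = bytes.fromhex(dotted.replace(".", ""))

# Assumption: the bytes represent ASCII text.
print(raw.decode("ascii"))  # signature
```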
