Convert bytes to readable strings in Python 3 - python

I have a .bin file that holds data, but I am not sure what format or encoding it uses. I want to transform the data into something readable. Formatting is not a problem; I can do that later.
My issue is parsing the file. I've tried struct, binascii and codecs with no luck.
import sys

with open(sys.argv[1], 'rb') as f:
    data = f.read()
lists = list(data)
# Below returns that each item is class 'bytes' and a number that appears to be <255
# However, if I add type(i) == bytes it spits an error
for i in lists:
    print("Type: ", type(data))
    print(i, "\n")
# Below returns that the class is 'bytes' and prints like this: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xdd\x07\x00\x00\x0b\x00\x00\x00\x18\x00\x00\x00\x00\x00\x00\x00\x0e\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08#\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xa0=\xa1D\xc0\x00\x00\x00\x00t\xdfe#
# To my knowledge, this looks like hex notation.
print("Data type: ", type(data))
print(data)
However, there should be some way to convert this into characters I can read, i.e. letters or numbers represented in a string. I seem to be over-complicating things, as I'm sure there's a built-in method that is eluding me.

Use binascii.hexlify:
>>> import binascii
>>> binascii.hexlify(b'\x00t\xdfe#')
b'0074df6523'
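Hex output shows the raw bytes but not what they mean. If the file holds fixed-width integers, the standard struct module can unpack them. The sketch below assumes, purely as a guess from the dump in the question, that the 12 bytes starting at \xdd\x07 are three little-endian unsigned 32-bit integers; the real layout of the file is unknown.

```python
import struct

# Twelve bytes copied from the dump in the question.
chunk = b'\xdd\x07\x00\x00\x0b\x00\x00\x00\x18\x00\x00\x00'

# '<' = little-endian, '3I' = three unsigned 32-bit integers. This format
# string is an assumption about the file, not something the question confirms.
values = struct.unpack('<3I', chunk)
print(values)       # (2013, 11, 24)

# bytes.hex() (Python 3.5+) gives the same digits as binascii.hexlify, as a str.
print(chunk.hex())  # dd0700000b00000018000000
```

If the unpacked numbers look plausible (here they resemble a year, month and day), that is a hint about the layout, not proof of it.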

Related

Save bytes in a .txt and read out as bytes later

I have written a small Python script which encrypts a message with RSA.
Now I want to save the bytes in a .txt file so I can read them later.
But when I use str(...) on them, I don't know how to convert the string back.
For example I encrypted "Test" to b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'
and saved it as a string.
When I apply bytes(...) to it I get the error: TypeError: string argument without an encoding.
How can I do this?
You've saved the Python string representation of a binary byte array (bytestring).
To get the actual bytes back from such a representation, pass it through ast.literal_eval():
>>> import ast
>>> s = r"b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'"
>>> b = ast.literal_eval(s)
>>> b
b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'
Better yet, just save the binary bytes to your file without passing through a string:
encrypted_bytes = my_rsa("Test")
with open("encrypted.bin", "wb") as f:
    f.write(encrypted_bytes)
# ...
with open("encrypted.bin", "rb") as f:
    encrypted_bytes = f.read()
If you really want a "text-safe" format for those bytes, use base64.b64encode() and base64.b64decode().
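A minimal sketch of the base64 round trip mentioned above. The sample bytes are just the first few bytes of the ciphertext from the question, standing in for the full value.

```python
import base64

# Stand-in for the ciphertext; any bytes object works the same way.
encrypted_bytes = b'Y\xf8\xbc\xca\x14\x0f\x80\xd3'

# Encode to ASCII-only text that is safe to store in a .txt file.
text = base64.b64encode(encrypted_bytes).decode('ascii')

# Later: turn the stored text back into the original bytes.
restored = base64.b64decode(text)
assert restored == encrypted_bytes
```

The encoded text is roughly a third larger than the raw bytes, which is the price of keeping the file plain text.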

Python: Bytes not being converted properly?

I'm VERY new to binary stuff, and I'm struggling a little bit.
I'm trying to convert a binary file to text. So far, this is my code:
with open(file_path, 'rb') as f:
    data = f.read()
temp_data = str(data)
if temp_data[-1] == '\\':
    temp_data = temp_data[:-1]
temp_data = bytes(temp_data, 'utf-8')
text = temp_data.decode('utf-8')
It seems to be working... partially. I see some things in the long byte string that I want to see, like a file name and timestamp. However, I'm still seeing a lot of byte values. The value of the text variable is:
b'\x00\x00\x00\x00T\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x004\x01\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00X\x01\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00x\x01\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00TCODEF1001.DAR_MeasLog.2019-03-05+01:10:45.2019-03-05+01:11:21.1.100.0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x95\xcc}\\\xba\xcc}\\LOG\x00\x00\x00\x00\x00\x00\x00\x00\x00OKL\x00\x04\x00\x00\x00\x01\x00\x00\x00VKL\x00\x05\x00\x00\x00\x01\x00\x00\x00YKL\x00\x06\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00h\xcc}\\\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\xa4\xcc}\\\x02\x00\x00\x00\x02\x00\x00\x00\x01\x00\x00\x00M\x00\x00\x00\x95\xcc}\\\xb9\xcc}\\'
I have no idea how to fix this, or what any of this means.
Note: I needed to parse the string for the last character '\' because the decoding was giving me an error " could not decode because last character is '\'", or something along those lines.
Thank you!
EDIT: I changed the code so now it looks like this:
with open(file_path, 'rb') as f:
    data = f.read()
readable_str = data.decode('utf-16')
bytes_again = readable_str.encode('utf-16')
When I print readable_str, I'm getting non-ASCII values which should not happen at all. I get text like this:
TĴŘŸ䍔䑏䙅〱㄰䐮剁䵟慥䱳杯㈮㄰ⴹ㌰〭⬵㄰ㄺ㨰㔴㈮㄰ⴹ㌰〭⬵㄰ㄺ㨱ㄲㄮㄮ〰〮첕屽첺屽佌G䭏L䭖L䭙L챨屽첤屽M첕屽첹屽
The decoding does not work with 'utf-8' or 'utf-32'. Is there a way to tell which encoding to use based on this? Are there other encodings out there that I have not tried? Thanks!
The approach in Python 3 for reading and writing data is much more explicit than it used to be. Almost always assume bytes on input, decode before working with the data in the script, and then encode back to bytes before writing out.
I highly recommend you watch nedbat's talk about Python's unicode and how to correctly work with bytes input/output.
Regardless, what you want to do is
with open('file.txt', 'rb') as fo:
    data = fo.read()  # This is in bytes
# We "decipher" the bytes into something we can work with
readable_str = data.decode('utf-8')
bytes_again = readable_str.encode('utf-8')
with open('other_file.txt', 'wb') as fw:
    fw.write(bytes_again)
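Decoding the whole file only works when the file really is text. The dump in the question looks like a binary record format with ASCII strings embedded in it, so no single text encoding will make all of it readable. A hedged sketch of an alternative: pull out the printable runs, and interpret the remaining bytes as numbers. The excerpt is assembled from the dump in the question, and the idea that the four bytes \x95\xcc}\\ are a little-endian 32-bit Unix timestamp is an assumption, though the decoded value does match the time in the file name.

```python
import re
import struct
from datetime import datetime, timezone

# A short excerpt assembled from the dump in the question.
data = b'\x01\x00\x00\x00TCODEF1001.DAR_MeasLog\x00\x00\x95\xcc}\\'

# Runs of four or more printable ASCII bytes, similar to the Unix `strings` tool.
text_runs = [m.decode('ascii') for m in re.findall(rb'[\x20-\x7e]{4,}', data)]
print(text_runs)  # ['TCODEF1001.DAR_MeasLog']

# Assumption: the last four bytes are a little-endian 32-bit Unix timestamp.
(seconds,) = struct.unpack('<I', data[-4:])
stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(stamp.isoformat())  # 2019-03-05T01:10:45+00:00
```

To decode the rest of the file reliably you would need the format's specification; guessing field widths from a dump only goes so far.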

How to convert bytes data to string without changing data using python3

How can I convert bytes to a string without changing the data?
E.g
Input:
file_data = b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
Output:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
I want to write image data using StringIO along with some additional data. Below is my code snippet:
img_buf = StringIO()
f = open("Sample_image.jpg", "rb")
file_data = f.read()
img_buf.write('\r\n' + file_data + '\r\n')
This works fine with Python 2.7, but I want it to work with Python 3.4.
On the read operation, file_data = f.read() returns a bytes object like this:
b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
When writing data, img_buf accepts only str data, so I am unable to write file_data with the additional characters.
So I want to convert file_data as it is into a str object without changing its data. Something like this:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
so that I can concatenate and write the image data.
I don't want to decode or encode the data. Any suggestions would be helpful. Thanks in advance.
It is not clear what kind of output you desire. If you are interested in aesthetically translating bytes to a string representation without encoding:
s = str(file_data)[1:]
print(s)
# '\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
This is the informal string representation of the original byte string (no conversion).
Details
The official string representation looks like this:
s
# "'\\xb4\\xeb7s\\x14q[\\xc4\\xbb\\x8e\\xd4\\xe0\\x01\\xec+\\x8f\\xf8c\\xff\\x00 \\xeb\\xff'"
String representation handles how a string looks. Double escape characters and double quotes are implicitly interpreted in Python to do the right thing so that the print function outputs a formatted string.
String interpretation handles what a string means. Each block of characters means something different depending on the applied encoding. Here we interpret these blocks of characters (e.g. \\xb4, \\xeb, 7, s) with the UTF-8 encoding. Blocks unrecognized by this encoding are replaced with a default character, �:
file_data.decode("utf-8", "replace")
# '��7s\x14q[Ļ���\x01�+��c�\x00 ��'
Converting from bytes to strings is required for reliably working with strings.
In short, there is a difference in string output between how it looks (representation) and what it means (interpretation). Clarify which you prefer and proceed accordingly.
Addendum
If your question is "how do I concatenate a byte string?", here is one approach:
import io

buffer = io.BytesIO()
with buffer as f:
    f.write(b"\r\n")
    f.write(file_data)
    f.write(b"\r\n")
    # Read the value before the with block closes the buffer
    print(buffer.getvalue())
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
Equivalently:
buffer = b""
buffer += b"\r\n"
buffer += file_data
buffer += b"\r\n"
buffer
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
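If a str really is required (for example because some API only accepts text), one lossless option is the Latin-1 codec, which maps each byte value to the code point with the same number. This is a sketch of that technique, not something the answer above proposes, and the resulting text is not meaningful as characters; it is only a reversible container for the bytes.

```python
file_data = b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'

# Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so decoding never fails and never drops information.
as_text = file_data.decode('latin-1')
combined = '\r\n' + as_text + '\r\n'

# Encoding with the same codec restores the exact original bytes.
assert combined.encode('latin-1') == b'\r\n' + file_data + b'\r\n'
```

Whoever finally writes the data out must encode with 'latin-1' again; encoding the same text as UTF-8 would change the bytes.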

How not to decode escaped sequences when reading from file but keep the string representation

I am reading in a text file that contains lines with binary data dumped in an encoded fashion, but still as a string (at least in emacs):
E.g.:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
This is perfectly fine for me, and when I read in that file I want to keep this string and not decode or change it in any way. However, when I read the file, Python does the decoding. How can I prevent that?
with open("/path/to/file") as file:
    for line in file:
        print line
the output will look like:
'���k���G�r��#�\0320^��\021�C\035\000�\016ׁ��'
but should look like:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
Edit: However, this encoded data is not the only data contained but part of a larger text dump.
You can read the file as binary with the 'rb' option, and it will retain the data as it is.
EX:
with open(PathToFile, 'rb') as file:
    raw_binary_data = file.read()
print(raw_binary_data)
If you really want the octal representation you can define a function that prints it back out.
import string
def octal_print(s):
    print(''.join(map(lambda x: x if x in string.printable else '\\'+oct(ord(x))[2:], s)))
s = '\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207'
octal_print(s)
# prints:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\320^\242\367\21\227C\35\0\207
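Based on James's answer, I adapted the octal_print function to discriminate between actual octals and innocent characters (Python 2, where iterating a str yields one-byte strings):
def octal_print(s):
    charlist = list()
    for character in s:
        try:
            # Plain ASCII characters pass through unchanged
            character.decode('ascii')
            charlist.append(character)
        except UnicodeDecodeError:
            # Anything else becomes a backslash plus its octal value
            charlist.append('\\'+oct(ord(character))[1:])
    return ''.join(charlist)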
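For Python 3, where a file opened with 'rb' yields a bytes object and iterating it gives integers, the same idea can be sketched as follows. The function name octal_escape is hypothetical. Unlike the versions above, it zero-pads every escape to three digits and also escapes the backslash itself, so that an escape followed by a digit (such as \032 then 0) stays unambiguous.

```python
def octal_escape(data: bytes) -> str:
    """Render bytes as text, using \\ooo octal escapes for non-printable bytes."""
    parts = []
    for byte in data:  # iterating bytes yields integers 0-255
        if 0x20 <= byte <= 0x7e and byte != 0x5c:
            # Printable ASCII (except the backslash) is kept as-is.
            parts.append(chr(byte))
        else:
            # Three zero-padded octal digits, e.g. 0x1a -> \032.
            parts.append('\\%03o' % byte)
    return ''.join(parts)


print(octal_escape(b'\xa0\xb3\x85k\x1a0^'))  # \240\263\205k\0320^
```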

How to read binary files as hex in Python?

I want to read a file with data, coded in hex format:
01ff0aa121221aff110120...etc
The files contain more than 100,000 such bytes, some more than 1,000,000 (they come from DNA sequencing).
I tried the following code (and other similar):
filele=1234563
f=open('data.geno','r')
c=[]
for i in range(filele):
    a=f.read(1)
    b=a.encode("hex")
    c.append(b)
f.close()
This gives each byte separate "aa" "01" "f1" etc, that is perfect for me!
This works fine up to (in this case) byte number 905, which happens to be "1a". I also tried the ord() function, which stopped at the same byte.
Is there a simple solution?
Simple solution is binascii:
import binascii
# Open in binary mode (so you don't read two byte line endings on Windows as one byte)
# and use with statement (always do this to avoid leaked file descriptors, unflushed files)
with open('data.geno', 'rb') as f:
    # Slurp the whole file and efficiently convert it to hex all at once
    hexdata = binascii.hexlify(f.read())
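This just gets you a single string of the hex values (a str on Python 2, a bytes object on Python 3), but it does it much faster than what you're trying to do. If you really want a list of length-2 strings, one per byte, you can convert the result easily (the two snippets that follow rely on Python 2 str semantics):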
hexlist = map(''.join, zip(hexdata[::2], hexdata[1::2]))
which will produce the list of len 2 strs corresponding to the hex encoding of each byte. To avoid temporary copies of hexdata, you can use a similar but slightly less intuitive approach that avoids slicing by using the same iterator twice with zip:
hexlist = map(''.join, zip(*[iter(hexdata)]*2))
Update:
For people on Python 3.5 and higher, bytes objects gained a .hex() method, so no module is required to convert from raw binary data to ASCII hex. The block of code at the top can be simplified to just:
with open('data.geno', 'rb') as f:
    hexdata = f.read().hex()
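On Python 3, the per-byte list of two-character hex strings can be built directly from the bytes object. A small sketch, using a few sample bytes in place of the real file contents:

```python
data = b'\x01\xff\x0a\xa1\x1a'  # sample bytes; in practice use f.read()

# One two-character hex string per byte.
hexlist = [format(byte, '02x') for byte in data]
print(hexlist)  # ['01', 'ff', '0a', 'a1', '1a']

# Equivalent: slice the output of bytes.hex() (Python 3.5+).
hexdata = data.hex()
pairs = [hexdata[i:i + 2] for i in range(0, len(hexdata), 2)]
assert pairs == hexlist
```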
One additional note: read() returns an empty result at the end of the file, so a loop that reads in chunks must break at that point or it will keep going forever.
def HexView(path):  # path: the file to dump
    with open(path, 'rb') as in_file:
        while True:
            hexdata = in_file.read(16).hex()  # I like to read 16 bytes in then new line it.
            if len(hexdata) == 0:  # breaks loop once no more binary data is read
                break
            print(hexdata.upper())  # I also like it all in caps.
If the file is encoded in hex format, shouldn't each byte be represented by 2 characters? So
import binascii
c=[]
with open('data.geno','rb') as f:
    b = f.read(2)
    while b:
        # b.decode('hex') only exists on Python 2; unhexlify works on both
        c.append(binascii.unhexlify(b))
        b=f.read(2)
Thanks for all interesting answers!
The simple solution that worked immediately, was to change "r" to "rb",
so:
f=open('data.geno','r') # doesn't work
f=open('data.geno','rb') # works fine
The code in this case is actually only two binary bits, so one byte contains four data values, each of them binary 00, 01, 10 or 11.
Yours!
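For a packing like the one described, with four 2-bit values per byte, the values can be pulled out with shifts and masks. This is only a sketch: the function name unpack_2bit is hypothetical, and the order of the values within the byte (most significant pair first) is an assumption, since the post does not say how the file format orders them.

```python
def unpack_2bit(byte):
    """Split one byte (0-255) into four 2-bit values, most significant first."""
    return [(byte >> shift) & 0b11 for shift in (6, 4, 2, 0)]


print(unpack_2bit(0x1a))  # [0, 1, 2, 2], since 0x1a == 0b00011010
```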
