Checking if a byte is ASCII printable - Python

I am reading in a file using binary settings:
with open(filename, 'rb') as f:
I am then reading the entire file into a variable:
x = f.read()
My problem is that I want to check if the bytes in x are ASCII printable. So I want to compare the bytes to see if they are within the range of, say, 32-128 in decimal notation. What would be the easiest way to go about doing this?
I have toyed around with the ord() function, various hex functions since I have previously converted the bytes into hex elsewhere in my project, but nothing seems to be working.
I'm new to python but have experience in other languages. Can anyone point me in the right direction? Thanks.

You could check each byte against string.printable.
>>> import string
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
printable_chars = bytes(string.printable, 'ascii')
with open(filename, 'rb') as f:
    printable = all(char in printable_chars for char in f.read())
For greater efficiency (O(1) membership lookup in a set versus O(n) in a string or bytes object), use a set:
printable_chars = set(bytes(string.printable, 'ascii'))
with open(filename, 'rb') as f:
    printable = all(char in printable_chars for char in f.read())
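If you want a check closer to the literal numeric range from the question, a minimal sketch that skips string.printable entirely could look like this (note that printable ASCII is really 32-126, and that string.printable additionally counts whitespace such as tabs and newlines as printable):
with open(filename, 'rb') as f:
    printable = all(32 <= byte <= 126 for byte in f.read())
Iterating over a bytes object in Python 3 yields integers, so the comparison works directly without ord().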

Related

encoding with open parameter vs encode string method byte size

I have come across something that I can't get my head around.
So I try to encode my string, which contains one non-ASCII character, using both the string encode method and the open encoding parameter. For some reason, there is a difference in the reported byte size between these two methods.
Here is sample code:
with open("in.txt", "wb") as f:
no = f.write("Wlazł".encode("utf-8"))
print(no) # -> 6
with open("in.txt", "w", encoding="utf-8") as f:
no = f.write("Wlazł")
print(no) # -> 5
Does anyone know why this is so?
When you open a file in binary mode you get a binary stream (an io.BufferedWriter for "wb"), and its write method returns the number of bytes written.
When you open a file in text mode you get an io.TextIOWrapper (a subclass of io.TextIOBase), and TextIOBase.write returns the number of characters written.
So the reason for the difference is that one count is in bytes and the other is in characters.
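To see it concretely: "Wlazł" is five characters but six UTF-8 bytes, because "ł" encodes to two bytes:
>>> len("Wlazł")
5
>>> len("Wlazł".encode("utf-8"))
6
>>> "ł".encode("utf-8")
b'\xc5\x82'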

why is my hex string not behaving like a string?

My purpose here is to turn a small file into a QR code.
So I used binascii.hexlify() to get the hexadecimal representation of the file.
With Pillow I will then build the QR code; this QR code will be read by another script that will turn it back into a file.
import binascii
with open(r"D:\test.png", 'rb') as f:
    content = f.read()
hexstr = str(binascii.hexlify(content))
#print(hexstr)
print(hexstr[:5])
The weird thing here is that hexstr is 64eded8d8d8d6ad3bcefd7a616864b4aea169786434393975eecb73a1b896cae4e80da592d7dcf2... but hexstr[:5] is b'895 (I was expecting 64ede).
Why is that?
Thanks.
PS: I'm using Python 3.6 x64 on a Windows 10 machine.
I'm not sure why you're getting b'895, but when you run hexstr = str(binascii.hexlify(content)) it gives hexstr the value "b'64ede...'". The string representation of the bytes sequence includes the b' prefix. I think what you want is hexstr = binascii.hexlify(content).decode(). This will decode the bytes into the corresponding ASCII string.
import binascii
with open(r"D:\test.png", 'rb') as f:
    content = f.read()
hexstr = binascii.hexlify(content).decode()
#print(hexstr)
print(hexstr[:5])
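To illustrate the difference on a small value (a made-up bytes literal, just for demonstration):
>>> data = b'64ede'
>>> str(data)
"b'64ede'"
>>> data.decode()
'64ede'
str() on a bytes object gives you its printable representation, including the b'' wrapper and quotes, while decode() gives you the actual text.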

How to convert bytes data to string without changing data using python3

How can I convert bytes to a string without changing the data?
E.g.:
Input:
file_data = b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
Output:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
I want to write image data using StringIO along with some additional data. Below is my code snippet:
img_buf = StringIO()
f = open("Sample_image.jpg", "rb")
file_data = f.read()
img_buf.write('\r\n' + file_data + '\r\n')
This works fine with Python 2.7, but I want it to work with Python 3.4.
On the read operation, file_data = f.read() returns a bytes object, something like this:
b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
When writing data using img_buf, it accepts only string data, so I am unable to write file_data together with the additional characters.
So I want to convert file_data to a string object as-is, without changing its data, something like this:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
so that I can concat and write the image data.
I don't want to decode or encode the data. Any suggestions would be helpful. Thanks in advance.
It is not clear what kind of output you desire. If you are interested in aesthetically translating bytes to a string representation without encoding:
s = str(file_data)[1:]
print(s)
# '\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
This is the informal string representation of the original byte string (no conversion).
Details
The official string representation looks like this:
s
# "'\\xb4\\xeb7s\\x14q[\\xc4\\xbb\\x8e\\xd4\\xe0\\x01\\xec+\\x8f\\xf8c\\xff\\x00 \\xeb\\xff'"
String representation handles how a string looks. Double escape characters and double quotes are implicitly interpreted in Python to do the right thing so that the print function outputs a formatted string.
String interpretation handles what a string means. Each block of characters means something different depending on the applied encoding. Here we interpret these blocks of characters (e.g. \\xb4, \\xeb, 7, s) with the UTF-8 encoding. Blocks unrecognized by this encoding are replaced with a default character, �:
file_data.decode("utf-8", "replace")
# '��7s\x14q[Ļ���\x01�+��c�\x00 ��'
Converting from bytes to strings is required for reliably working with strings.
In short, there is a difference in string output between how it looks (representation) and what it means (interpretation). Clarify which you prefer and proceed accordingly.
Addendum
If your question is "how do I concatenate a byte string?", here is one approach:
import io

buffer = io.BytesIO()
with buffer as f:
    f.write(b"\r\n")
    f.write(file_data)
    f.write(b"\r\n")
    # getvalue() must be called before the with block closes the buffer
    print(buffer.getvalue())
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
Equivalently:
buffer = b""
buffer += b"\r\n"
buffer += file_data
buffer += b"\r\n"
buffer
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
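A side note on an alternative not used above: if what you need is a real str whose code points correspond one-to-one to the original byte values, decoding as Latin-1 is a lossless round trip, because Latin-1 maps bytes 0-255 directly onto the first 256 code points:
s = file_data.decode("latin-1")   # every byte becomes exactly one code point
restored = s.encode("latin-1")    # encodes back to the original bytes
assert restored == file_data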

How not to decode escaped sequences when reading from file but keep the string representation

I am reading in a text file that contains lines with binary data dumped in an encoded fashion, but still as a string (at least in Emacs):
E.g.:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
This is perfectly fine for me, and when I read in that file I want to keep this string and not decode or change it in any way. However, when I am reading in the file, Python does the decoding. How can I prevent that?
with open("/path/to/file") as file:
for line in file:
print line
The output will look like:
'���k���G�r��#�\0320^��\021�C\035\000�\016ׁ��'
but should look like:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
Edit: However, this encoded data is not the only content of the file; it is part of a larger text dump.
You can read the file as binary with the 'rb' option and it will retain the data as-is.
Ex:
with open(PathToFile, 'rb') as file:
    raw_binary_data = file.read()
    print(raw_binary_data)
If you really want the octal representation, you can define a function that prints it back out.
import string

def octal_print(s):
    print(''.join(map(lambda x: x if x in string.printable else '\\' + oct(ord(x))[2:], s)))

s = '\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207'
octal_print(s)
# prints:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\320^\242\367\21\227C\35\0\207
Based on the answer of James, I adapted the octal_print function to discriminate between actual octal escapes and plain characters.
def octal_print(s):
    charlist = list()
    for character in s:
        try:
            character.decode('ascii')
            charlist.append(character)
        except:
            charlist.append('\\' + oct(ord(character))[1:])
    return ''.join(charlist)
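For completeness, a rough Python 3 sketch of the same idea (this assumes the data was read with open(path, 'rb'), so iterating yields integer byte values; the function name is just illustrative):
def octal_print_bytes(data):
    parts = []
    for b in data:
        if 32 <= b < 127:
            # printable ASCII passes through unchanged
            parts.append(chr(b))
        else:
            # everything else becomes a backslash-octal escape, e.g. 160 -> \240
            parts.append('\\' + format(b, 'o'))
    return ''.join(parts)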

How to read binary files as hex in Python?

I want to read a file with data, coded in hex format:
01ff0aa121221aff110120...etc
the files contain >100,000 such bytes, some more than 1,000,000 (they come from DNA sequencing)
I tried the following code (and other similar):
filele=1234563
f=open('data.geno','r')
c=[]
for i in range(filele):
    a=f.read(1)
    b=a.encode("hex")
    c.append(b)
f.close()
This gives each byte separately: "aa", "01", "f1", etc. That is perfect for me!
This works fine up to (in this case) byte no. 905, which happens to be "1a". I also tried the ord() function, which also stopped at the same byte.
Perhaps there is a simple solution?
Simple solution is binascii:
import binascii
# Open in binary mode (so you don't read two-byte line endings on Windows as one byte,
# and so a 0x1A byte isn't treated as end-of-file, which is why your reads stopped at "1a")
# and use a with statement (always do this to avoid leaked file descriptors, unflushed files)
with open('data.geno', 'rb') as f:
    # Slurp the whole file and efficiently convert it to hex all at once
    hexdata = binascii.hexlify(f.read())
This just gets you a str of the hex values, but it does it much faster than what you're trying to do. If you really want a list of length-2 strings, one per byte, you can convert the result easily:
hexlist = map(''.join, zip(hexdata[::2], hexdata[1::2]))
which will produce the list of len 2 strs corresponding to the hex encoding of each byte. To avoid temporary copies of hexdata, you can use a similar but slightly less intuitive approach that avoids slicing by using the same iterator twice with zip:
hexlist = map(''.join, zip(*[iter(hexdata)]*2))
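For example, on a short piece of hex text (using the sample data from the question):
>>> hexdata = '01ff0aa1'
>>> list(map(''.join, zip(*[iter(hexdata)]*2)))
['01', 'ff', '0a', 'a1']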
Update:
For people on Python 3.5 and higher, bytes objects spawned a .hex() method, so no module is required to convert from raw binary data to ASCII hex. The block of code at the top can be simplified to just:
with open('data.geno', 'rb') as f:
    hexdata = f.read().hex()
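On Python 3.8 and later, hex() also accepts a separator, which makes getting the per-byte chunks a one-liner (a small variant of the above):
with open('data.geno', 'rb') as f:
    hexlist = f.read().hex(' ').split(' ')
# e.g. ['01', 'ff', '0a', 'a1', ...]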
Just an additional note to these: when you read the file in chunks inside a loop, make sure to break out once read() returns nothing, or the loop will just keep going.
def HexView():
    with open(<yourfilehere>, 'rb') as in_file:
        while True:
            hexdata = in_file.read(16).hex() # I like to read 16 bytes in then new line it.
            if len(hexdata) == 0: # breaks loop once no more binary data is read
                break
            print(hexdata.upper()) # I also like it all in caps.
If the file is encoded in hex format, shouldn't each byte be represented by 2 characters? So
c=[]
with open('data.geno','rb') as f:
    b = f.read(2)
    while b:
        c.append(b.decode('hex'))
        b=f.read(2)
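Note that decode('hex') only exists on Python 2. On Python 3 the equivalent would be bytes.fromhex, for example (a small sketch, assuming b holds two ASCII hex characters read from the file):
c.append(bytes.fromhex(b.decode('ascii')))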
Thanks for all the interesting answers!
The simple solution that worked immediately was to change "r" to "rb", so:
f=open('data.geno','r')  # doesn't work
f=open('data.geno','rb') # works fine
The code in this case is actually only two bits per value, so one byte contains four data values: 00, 01, 10, 11 in binary.
Yours!
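Since each byte packs four 2-bit values here, a minimal sketch of unpacking them with bit shifts (assuming the lowest-order pair comes first, which may not match your actual file format):
with open('data.geno', 'rb') as f:
    data = f.read()
values = []
for byte in data:
    for shift in (0, 2, 4, 6):
        values.append((byte >> shift) & 0b11)  # each entry is 0, 1, 2, or 3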
