ascii txt file to binary bin file - python

I have a txt file with a stream of HEX data, I would like to convert it in binary fomart in order to save space on the disk.
this is my simple script just to test the decoding and the binary storage
hexstr = "12ab"
of = open('outputfile.bin','wb')
for i in hexstr:
#this is how I convert an ASCII char to 7 bit representation
x = '{0:07b}'.format(ord(i))
of.write(x)
of.close()
I exect that outputfile.bin has a size of 28 bit, instead the results is 28 byte.
I guess the problem is that x is a string and not a bit sequence.
How should I do?
Thanks in advance

First of all, you will not get a file size that is not a multiple of 8 bits on any popular platform.
Second, you really have to brush up an what "binary" actually means. You confuse two different concepts: representing a number in the binary number system and writing out data in a "non human readable" form.
Actually, you are confusing two even more fundamental concepts: data and the representation of data. "12ab" is a representation of the four bytes in memory, as is "\x31\x32\x61\x62".
Your problem is that x contains 28 bytes of data that can either be represented as "0110001011001011000011100010" or as "\x30\x31\x31\x30\x30...\x30\x30\x31\x30".
Maybe this will help you:
>>> hexstr = "12ab"
>>> len(hexstr)
4
>>> ['"%s": %x' % (c, ord(c)) for c in hexstr]
['"1": 31', '"2": 32', '"a": 61', '"b": 62']
>>> i = 42
>>> hex(i)
'0x2a'
>>> x = '{0:07b}'.format(i)
>>> x
'0101010'
>>> [hex(ord(c)) for c in x]
['0x30', '0x31', '0x30', '0x31', '0x30', '0x31', '0x30']
>>> hex(ord('0')), hex(ord('1'))
('0x30', '0x31')
>>> import binascii
>>> [hex(ord(c)) for c in binascii.unhexlify(hexstr)]
['0x12', '0xab']
That said, thhe binascii module has a method you can use:
import binascii
data = binascii.unhexlify(hexstr)
with open('outputfile.bin', 'wb') as f:
f.write(data)
This will encode your data in 8bit instead of 7bit, but usually it is not worth the effort to use 7bit for compression reasons anyway.

Is this what you want? "12ab" should be written as \x01\x02\x0a\x0b, right?
import struct
hexstr = "12ab"
of = open('outputfile.bin','w')
for i in hexstr:
of.write(struct.pack('B', int(i, 16)))
of.close()

Related

Best way to output a binary sequence in Python

I have a binary sequence, for example: 10010111010101. I need to output this sequence to a file and then read it later but I want it to be compressed as much as possible, what is the easiest way to do this?
I have tried to take every 8 bits (byte) in the sequence together and output the byte value and then when I read it later, I cut it bit by bit, is there an easier way? or a module that does this readily?
The best textual encoding for binary data is either base64 or ascii85.
ASCII85
import base64
import sys
# Length of the binary string in bytes (32 bytes will let you have a 256 digit binary character stream)
# Keep it as low as possible to save space
length = 32
binary_string = input('Enter binary string : ')
integer = eval('0b'+binary_string)
data = integer.to_bytes(length, sys.byteorder, signed=False)
print(base64.a85encode(data).decode('utf-8'))
Base64
import base64
import sys
# Length of the binary string in bytes (32 bytes will let you have a 256 digit binary character stream)
# Keep it as low as possible to save space
length = 32
binary_string = input('Enter binary string : ')
integer = eval('0b'+binary_string)
data = integer.to_bytes(length, sys.byteorder, signed=False)
print(base64.b64encode(data).decode('utf-8'))
WARNING: Typically sys.byteorder is little-endian, so you might run into problems when you try to load up the file.

How to decode a string representation of a bytes object?

I have a string which includes encoded bytes inside it:
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I want to decode it, but I can't since it has become a string. Therefore I want to ask whether there is any way I can convert it into
str2 = b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
Here str2 is a bytes object which I can decode easily using
str2.decode('utf-8')
to get the final result:
'Output file 문항분석.xlsx Created'
You could use ast.literal_eval:
>>> print(str1)
b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
>>> type(str1)
<class 'str'>
>>> from ast import literal_eval
>>> literal_eval(str1).decode('utf-8')
'Output file 문항분석.xlsx Created'
Based on the SyntaxError mentioned in your comments, you may be having a testing issue when attempting to print due to the fact that stdout is set to ascii in your console (and you may also find that your console does not support some of the characters you may be trying to print). You can try something like the following to set sys.stdout to utf-8 and see what your console will print (just using string slice and encode below to get bytes rather than the ast.literal_eval approach that has already been suggested):
import codecs
import sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
s = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
b = s[2:-1].encode().decode('utf-8')
A simple way is to assume that all the characters of the initial strings are in the [0,256) range and map to the same Unicode value, which means that it is a Latin1 encoded string.
The conversion is then trivial:
str1[2:-1].encode('Latin1').decode('utf8')
Finally I have found an answer where i use a function to cast a string to bytes without encoding.Given string
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
now i take only actual encoded text inside of it
str1[2:-1]
and pass this to the function which convert the string to bytes without encoding its values
import struct
def rawbytes(s):
"""Convert a string to raw bytes without encoding"""
outlist = []
for cp in s:
num = ord(cp)
if num < 255:
outlist.append(struct.pack('B', num))
elif num < 65535:
outlist.append(struct.pack('>H', num))
else:
b = (num & 0xFF0000) >> 16
H = num & 0xFFFF
outlist.append(struct.pack('>bH', b, H))
return b''.join(outlist)
So, calling the function would convert it to bytes which then is decoded
rawbytes(str1[2:-1]).decode('utf-8')
will give the correct output
'Output file 문항분석.xlsx Created'

How to decode longest sub-bytes into str?

Suppose I read a long bytes object from somewhere, knowing it is utf-8 encoded. But the read may not fully consume the available content so that the last character in the stream may be incomplete. Calling bytes.decode() on this object may result in a decode error. But what really fails is only the last few bytes. Is there a function that works in this case, returning the longest decoded string and the remaining bytes?
utf-8 encodes a character into at most 4 bytes, so trying to decode truncated bytes should work, but a vast majority of computation will be wasted, and I don't really like this solution.
To give a simple but concrete example:
>>> b0 = b'\xc3\x84\xc3\x96\xc3'
>>> b1 = b'\x9c\xc3\x84\xc3\x96\xc3\x9c'
>>> (b0 + b1).decode()
>>> 'ÄÖÜÄÖÜ'
(b0 + b1).decode() is fine, but b0.decode() will raise. The solution should be able to decode b0 for as much as possible and return the bytes that cannot be decoded.
You are describing the basic usage of io.TextIOWrapper: a buffered text stream over a binary stream.
>>> import io
>>> txt = 'before\N{PILE OF POO}after'
>>> b = io.BytesIO(txt.encode('utf-8'))
>>> t = io.TextIOWrapper(b)
>>> t.read(5)
'befor'
>>> t.read(1)
'e'
>>> t.read(1)
'💩'
>>> t.read(1)
'a'
Contrast with reading a bytes stream directly, where it would be possible to read halfway through an encoded pile of poo:
>>> b.seek(0)
0
>>> b.read(5)
b'befor'
>>> b.read(1)
b'e'
>>> b.read(1)
b'\xf0'
>>> b.read(1)
b'\x9f'
>>> b.read(1)
b'\x92'
>>> b.read(1)
b'\xa9'
>>> b.read(1)
b'a'
Specify encoding="utf-8" if you want to be explicit. The default encoding, i.e. locale.getpreferredencoding(False), would usually be utf-8 anyway.
As I mentioned in the comments under #wim's answer, I think you could use the codecs.iterdecode() incremental decoder to do this. Since it's a generator function, there's no need to manually save and restore its state between iterative calls to it.
Here's how how it might be used to handle a situation like the one you described:
import codecs
from random import randint
def reader(sequence):
""" Yield random length chunks of sequence until exhausted. """
plural = lambda word, n, ending='s': (word+ending) if n > 1 else word
i = 0
while i < len(sequence):
size = randint(1, 4)
chunk = sequence[i: i+size]
hexrepr = '0x' + ''.join('%02X' % b for b in chunk)
print('read {} {}: {}'.format(size, plural('byte', len(chunk)), hexrepr))
yield chunk
i += size
bytes_obj = b'\xc3\x84\xc3\x96\xc3\x9c\xc3\x84\xc3\x96\xc3\x9c' # 'ÄÖÜÄÖÜ'
for decoded in codecs.iterdecode(reader(bytes_obj), 'utf-8'):
print(decoded)
Sample output:
read 3 bytes: 0xC384C3
Ä
read 1 byte: 0x96
Ö
read 1 byte: 0xC3
read 3 bytes: 0x9CC384
ÜÄ
read 2 bytes: 0xC396
Ö
read 4 bytes: 0xC39C
Ü

Python struct string to bin

I´m learning Python.
Seeing the module struct, I found a doubt:
Is it possible to convert a "string" to "bin" without giving the length.
For the case (with chars length)
F = open("data.bin", "wb")
import struct
data = struct.pack("24s",b"This is an unknown string")
print(data)
F.write(data)
F.close()
I´m trying to do the same but with unknown length.
Thanks a lot!
If you have the string, use len to determine the length of the string.
i.e
data = struct.pack("{0}s".format(len(unknown_string)), unknown_string)
The Bytes type is a binary data type, it just stores a bunch of 8bit characters. Note that the code with struct.pack ends up creating a bytes object:
>>> import struct
>>> data = struct.pack("24s",b"This is an unknown string")
>>> type(data)
<class 'bytes'>
>>> len(data)
24
The length of this is 24 as per your format specifier. If you just want to place the bytes-string directly into the file without doing any length checking you don't even need to use the struct module, you can just write it directly to the file:
F = open("data.bin", "wb")
F.write(b"This will work")
If however you wanted to keep the 24 bytes length you could keep using struct.pack:
>>> data = struct.pack("24s",b"This is an unknown st")
>>> len(data)
24
>>> print(data)
b'This is an unknown st\x00\x00\x00'
>>> data = struct.pack("24s",b"This is an unknown string abcdef")
>>> print(data)
b'This is an unknown strin'
In the case of supplying a bytes that is too short struct.pack pads the remainder with 0's and in the case where it's too long it truncates it.
If you don't mind getting the missing space padded out with zeros you can just pass in the bytes object directly to struct.pack and it will handle it.
Thanks to both...
My new code:
F = open("data.bin", "wb")
strs = b"This is an unkown string"
import struct
data = struct.pack("{0}s".format(len(strs)),strs)
print(data)
F.write(data)
F.close()

How to get bytes of image in StringIO container

I am am getting an image from email attachment and will never touch disk. Image will be placed into a StringIO container and processed by PIL. How do I get the file size in bytes?
image_file = StringIO('image from email')
im = Image.open(image_file)
Use StringIO's .tell() method by seeking to the end of the file:
>>> from StringIO import StringIO
>>> s = StringIO("foobar")
>>> s.tell()
0
>>> s.seek(0, 2)
>>> s.tell()
6
In your case:
image_file = StringIO('image from email')
image_file.seek(0, 2) # Seek to the end
bytes = image_file.tell() # Get no. of bytes
image_file.seek(0) # Seek to the start
im = Image.open(image_file)
Suppose you have:
>>> s = StringIO("cat\u2014jack")
>>> s.seek(0, os.SEEK_END)
>>> print(s.tell())
8
This however is incorrect. \u2014 is an emdash character - a single character but it is 3 bytes long.
>>> len("\u2014")
1
>>> len("\u2014".encode("utf-8"))
3
StringIO may not use utf-8 for storage either. It used to use UCS-2 or UCS-4 but since pep 393 may be using utf-8 or some other encoding depending on the circumstance.
What ultimately matters is the binary representation you ultimately go with. If you're encoding text, and you'll eventually write the file out encoded as utf-8 then you have to encode the value in it's entirety to know how many bytes it will take up. This is because as a variable-length encoding, utf-8 characters may require multiple bytes to encode.
You could do something like:
>>> s = StringIO("cat\u2014jack")
>>> size = len(s.getvalue().encode('utf-8'))
10

Categories