I've encoded an 8x8 block of numbers using RLE and got this string of bits that I need to write to a file. The last coefficient of the sequence is padded so that the total length is divisible by 8.
In the image above, you can see the array of numbers, the RLE encoding (86 bits long), the padded version so that it's divisible by 8 (88 bits long), and the concatenated string that I am to write to a file.
0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011
Would the best way to write this concatenated string be to divide it into 8-bit substrings and write those individually as bytes, or is there a simpler, faster way to do this? When it comes to binary data, the only module I've worked with so far is struct; this is the first time I've tackled something like this. Any and all advice would be appreciated.
I would convert it to a list using regex.
import re
bitstring = "0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011"
bitlist = re.findall('........', bitstring)  # separates bitstring into a list of items, each 8 characters long
# note: this writes the characters '0' and '1' as text, one 8-character group per line
with open(r"insert file path here", "a") as file:
    for bits in bitlist:
        file.write(bits)
        file.write("\n")
Let me know if this is what you mean.
I would also like to mention that pulling the data from a text file will not be the most efficient way to store it. The fastest approach would ideally be to keep it in an array such as [["10110100"], ["00101010"]] and pull the data by doing...
>>> bitarray = [["10110100"], ["00101010"]]
>>> print(bitarray[0][0])
10110100
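For the original goal of writing the bits as real bytes rather than as '0'/'1' characters, here is a minimal sketch. It assumes Python 3 and a bit string whose length is already a multiple of 8, as in the question. The helper name bits_to_bytes is made up for illustration.

```python
def bits_to_bytes(bitstring):
    """Convert a string of '0'/'1' characters into a bytes object.

    Assumes len(bitstring) is a multiple of 8 (the question pads to 88 bits).
    """
    if len(bitstring) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    if not bitstring:
        return b""
    # Parse the whole string as one big integer, then emit it most
    # significant byte first so the byte order matches the string order.
    return int(bitstring, 2).to_bytes(len(bitstring) // 8, "big")


# Small demonstration with the first 16 bits of the question's string.
demo = bits_to_bytes("0000110010001100")
```

Writing the result is then a single call on a file opened in binary mode, for example f.write(bits_to_bytes(bitstring)) inside with open(path, "wb") as f, where path is whatever output file you choose.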
I'd appreciate your help with a problem that I have.
Goal: To read a .h264 file (I extracted the raw bitstream to a file using ffmpeg) using Python, and save it in some data structure (probably a list; I'm open to suggestions).
I want to read the data as hex; for example, here is what the data looks like:
What I want is to feed each byte (2 hex digits) into a list, or some other data structure.
But any step forward will help me.
My Attempts:
First I tried to read the way I know:
with open(path, 'r') as fp:
    data = fp.read()
That didn't work; I got just "" (an empty string).
After a lot of changes, I tried something else, I saw online:
with open(path, 'r') as fp:
    hex_list = ["{:02}".format(ord(c)) for c in fp.read()]
Still got an empty list.
I'd appreciate your help.
Thanks a lot.
EDIT:
Thanks to the comment below, I tried to open using 'rb', but still with no luck.
If you have an h264 mp4 file, you can open it and get a hexadecimal string representation like this using binascii.hexlify():
import binascii
with open('test.mp4', 'rb') as fin:
    hexa = binascii.hexlify(fin.read())
print(hexa[0:1000])
hexa will be a python bytes object, and you can easily get back the binary representation by doing binascii.unhexlify(hexa). This will be much more efficient than storing the hex representation as strings in a list(), both in terms of space and time. You can access the bytes array with indices/slices, so whatever you were intending to do with the list will probably work fine with this (it will just be much faster and use a lot less memory).
One thing to keep in mind, though: to get the first hexadecimal digit from a bytes object, you don't do hexa[0], but rather hexa[0:1]. To get the first pair of hexadecimal digits (one byte), you do hexa[0:2]. The second byte is hexa[2:4], etc. As explained in the Python docs for bytes objects:
Since bytes objects are sequences of integers (akin to a tuple), for a
bytes object b, b[0] will be an integer, while b[0:1] will be a bytes
object of length 1. (This contrasts with text strings, where both
indexing and slicing will produce a string of length 1)
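To make the indexing-versus-slicing point concrete, here is a small sketch assuming Python 3. The sample bytes are made up for illustration and stand in for the contents of a real file read with open(path, 'rb').

```python
import binascii

# Made-up sample standing in for fin.read() on a real file.
data = bytes([0x00, 0x00, 0x01, 0x67, 0x42])

# Indexing a bytes object gives an int; slicing gives bytes.
first_as_int = data[3]        # an integer
first_as_bytes = data[3:4]    # a bytes object of length 1

# A list with one integer per byte, and one with two hex digits per byte.
byte_values = list(data)
hex_pairs = ["{:02x}".format(b) for b in data]

# hexlify returns bytes holding ASCII hex digits, two per input byte.
hexa = binascii.hexlify(data)
first_pair = hexa[0:2]
```

If a list of per-byte values is really what is needed, list(data) gives the integers directly, with no detour through a hex string.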
As part of a bigger project, I want to save a sequence of bits in a file so that the file is as small as possible. I'm not talking about compression; I want to save the sequence as it is, but using the fewest characters. The initial idea was to turn mini-sequences of 8 bits into chars using ASCII encoding and save those chars, but due to some unknown problem with strange characters, the characters retrieved when reading the file are not the same as those originally written. I've tried opening the file with utf-8 and latin-1 encodings, but neither seems to work. I'm wondering if there's any other way, maybe by turning the sequence into a hexadecimal number?
Technically, you cannot write less than a byte, because the OS organizes memory in bytes (write individual bits to a file in python). So this is binary file I/O; see https://docs.python.org/2/library/io.html. There are also modules like struct.
Open the file with the 'b' flag, which indicates a binary read/write operation, then use e.g. the to_bytes() method (Writing bits to a binary file) or struct.pack() (How to write individual bits to a text file in python?).
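>>> import struct
>>> struct.pack("h", 824)  # native byte order; result shown for a little-endian machine
b'8\x03'
>>> bits = "10111111111111111011110"
>>> data = int(bits[::-1], 2).to_bytes(4, 'little')
>>> data
b'\xfd\xff=\x00'
>>> with open('somefile.bin', 'wb') as f:
...     f.write(data)
...
4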
If you want to get around the 8-bit (byte) structure of memory, you can use bit manipulation and techniques like bitmasks and BitArrays;
see https://wiki.python.org/moin/BitManipulation and https://wiki.python.org/moin/BitArrays
However, the problem, as you said, is reading the data back if you use BitArrays of differing lengths: to store a decimal 7 you need 3 bits (0b111), and to store a decimal 2 you need 2 bits (0b10).
How can your program know whether to read a value back as a 3-bit value or as a 2-bit value? In unorganized memory, the sequence of decimals 7, 2 looks like 11110, which splits as 111|10, so how can your program know where the | is?
In normal byte-ordered memory, the decimals 7, 2 are stored as 0000011100000010 -> 00000111|00000010, which has the advantage that it is clear where the | is.
This is why memory at its lowest level is organized in fixed clusters of 8 bits = 1 byte. If you want to access single bits inside a byte, you can use bitmasks in combination with logic operators (http://www.learncpp.com/cpp-tutorial/3-8a-bit-flags-and-bit-masks/). In Python, single-bit manipulation can be done with the built-in bitwise operators on ints, or with the module ctypes.
If you know that your values are all 6 bits wide, maybe it is worth the effort; however, this is also tough...
(How do you set, clear, and toggle a single bit?)
(Why can't you do bitwise operations on pointer in C, and is there a way around this?)
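One possible way around the "where does the data end" problem for a single bit sequence is to store the number of padding bits alongside the data. The sketch below assumes Python 3. The one-byte header layout and the names pack_bits and unpack_bits are choices made for this illustration, not an established format.

```python
def pack_bits(bits):
    """Pack a '0'/'1' string into bytes, prefixed by a 1-byte pad count."""
    pad = (-len(bits)) % 8            # zero bits needed to fill the last byte
    padded = bits + "0" * pad
    body = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
    return bytes([pad]) + body


def unpack_bits(blob):
    """Inverse of pack_bits: recover the original '0'/'1' string."""
    pad = blob[0]
    bits = "".join("{:08b}".format(b) for b in blob[1:])
    return bits[:len(bits) - pad]     # drop the padding added by pack_bits


original = "10111111111111111011110"  # 23 bits, from the answer above
packed = pack_bits(original)
restored = unpack_bits(packed)
```

The packed value can be written with a file opened in 'wb' mode and read back in 'rb' mode, so no text encoding is involved and the bytes come back unchanged.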
I have a "binary" file with variable-size records. Each record is composed of a number of little-endian 2-byte integers. I know the start position of each record and its size.
What's the fastest way to read this into a Python array of integers?
I don't think you can do better than opening the file, seeking to each record, reading it, and then using struct.unpack('<h', buff) for each integer you want to read; file.read(2) will get you the 2 bytes of one integer.
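A sketch of that approach, unpacking a whole record in one struct call instead of one integer at a time. It assumes Python 3, signed 16-bit values ('h'; use 'H' if they are unsigned), and a record size given in bytes. The function name read_record and the in-memory demo file are made up for illustration.

```python
import io
import struct


def read_record(f, start, size_bytes):
    """Read one record of little-endian 2-byte signed integers."""
    f.seek(start)
    buf = f.read(size_bytes)
    count = len(buf) // 2                     # number of complete integers
    # '<' = little-endian, then count copies of 'h' (2-byte signed int)
    return list(struct.unpack("<{}h".format(count), buf[:count * 2]))


# Demo on an in-memory file: 2 bytes of unrelated data, then a 4-integer record.
demo_file = io.BytesIO(b"\xff\xff" + struct.pack("<4h", 1, -2, 300, 4))
record = read_record(demo_file, start=2, size_bytes=8)
```

The same function works on a real file opened with open(path, 'rb').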
So I have several very large files which represent each position in the human genome. Both of these files are binary masks for a certain type of "score" for each position in the genome and I am interested in getting a new mask where both scores are "1" i.e. the intersection of the two masks.
For example:
File 1: 00100010101
File 2: 11111110001
Desired output: 00100010001
In python, it is really fast to read these big files (they contain between 50-250 million characters) into strings. However, I can't just & the strings together. I CAN do something like
bin(int('0001',2) & int('1111', 2))
but is there a more direct way that doesn't require that I pad in the extra 0's and convert back to a string in the end?
I think the conversion to builtin integer types for the binary-and operation is likely to make it much faster than working character by character (because Python's int is written in C rather than Python). I'd suggest working on each line of your input files, rather than the whole multi-million-character strings at once. The binary-and operation doesn't require any carrying, so there's no issue working with each line separately.
To avoid awkward string operations to pad the result out to the right length, you can use the str.format method to convert your integer to a binary string of the right length in one go. Here's an implementation that writes the output to a new file:
# AND the two masks line by line, zero-padding each result to the input line's width
with open(filename1) as in1, open(filename2) as in2, open(filename3, "w") as out:
    for line1, line2 in zip(in1, in2):
        out.write("{0:0{1}b}\n".format(int(line1, 2) & int(line2, 2), len(line1) - 1))
I'm using one of the neat features of the string formatting mini-language to use a second argument to pass a desired length for the converted number. If you can rely upon the lines always having exactly 50 binary digits (including at the end of the files), you could hard code the length with {:050b} rather than computing it from the length of an input line.
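The same formatting trick can be tried on the example masks from the question. This is a minimal sketch; the function name and_masks is made up, and it assumes both inputs have the same length and contain no trailing newline.

```python
def and_masks(mask1, mask2):
    """Bitwise-AND two equal-length '0'/'1' strings, keeping leading zeros."""
    if len(mask1) != len(mask2):
        raise ValueError("masks must have the same length")
    if not mask1:
        return ""
    # Format the integer result as binary, zero-padded to the input width.
    width = len(mask1)
    return format(int(mask1, 2) & int(mask2, 2), "0{}b".format(width))


result = and_masks("00100010101", "11111110001")
```

With the two example files from the question, result is the desired output 00100010001.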
I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii
path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()
print binascii.hexlify(line)
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
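To see why the element size matters, the sketch below swaps the byte order of the first 8 bytes from the question's hex dump under three different assumed widths. It assumes Python 3, and the function name swap_byte_order is made up for illustration. Nothing here tells you which width is correct for the real file.

```python
def swap_byte_order(data, width):
    """Reverse the bytes inside each width-byte element of data."""
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of width")
    return b"".join(data[i:i + width][::-1]
                    for i in range(0, len(data), width))


sample = bytes.fromhex("a0040000dd000000")   # first 8 bytes of the dump

as_16bit = swap_byte_order(sample, 2).hex()  # treated as four 2-byte values
as_32bit = swap_byte_order(sample, 4).hex()  # treated as two 4-byte values
as_64bit = swap_byte_order(sample, 8).hex()  # treated as one 8-byte value
```

The three results differ from one another, so picking the wrong width scrambles the data rather than converting it.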
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.