So I have several very large files which represent each position in the human genome. Both of these files are binary masks for a certain type of "score" for each position in the genome and I am interested in getting a new mask where both scores are "1" i.e. the intersection of the two masks.
For example:
File 1: 00100010101
File 2: 11111110001
Desired output: 00100010001
In python, it is really fast to read these big files (they contain between 50-250 million characters) into strings. However, I can't just & the strings together. I CAN do something like
bin(int('0001',2) & int('1111', 2))
but is there a more direct way that doesn't require that I pad in the extra 0's and convert back to a string in the end?
I think the conversion to builtin integer types for the binary-and operation is likely to make it much faster than working character by character (because Python's int is written in C rather than Python). I'd suggest working on each line of your input files, rather than the whole multi-million-character strings at once. The binary-and operation doesn't require any carrying, so there's no issue working with each line separately.
To avoid awkward string operations to pad the result out the the right length, you can the str.format method to convert your integer to a binary string of the right length in one go. Here's an implementation that writes the output out to a new file:
import itertools
with open(filename1) as in1, open(filename2) as in2, open(filename3, "w") as out:
for line1, line2 in itertools.izip(in1, in2):
out.write("{0:0{1}b}\n".format(long(line1, 2) & long(line2, 2), len(line1) - 1))
I'm using one of the neat features of the string formatting mini-language to use a second argument to pass a desired length for the converted number. If you can rely upon the lines always having exactly 50 binary digits (including at the end of the files), you could hard code the length with {:050b} rather than computing it from the length of an input line.
Related
I encounter weird problem and could not solve it for days. I have created byte array that contains values from 1 to 250 and write it to binary file from C# using WriteAllBytes.
Later i read it from Python using np.fromfile(filename, dtype=np.ubyte). However, i realize this functions was adding arbitrary comma (see the image). Interestingly it is not visible in array property. And if i call numpy.array2string, comma turns '\n'. One solution is to replace comma with none, however i have very long sequences it will take forever on 100gb of data to use replace function. I also recheck the files by reading using .net Core, i'm quite sure comma is not there.
What could i be missing?
Edit:
I was trying to read all byte values to array and cast each member to or entire array to string. I found out that most reliable way to do this is:
list(map(str, (ubyte_array))
Above code returns string list that its elements without any arbitrary comma or blank space.
I've encoded an 8x8 block of numbers using RLE and got this string of bits that I need to write to a file. The last coefficient of the sequence is padded accordingly so that the whole thing is divisible by 8.
In the image above, you can see the array of numbers, the RLE encoding(86 bits long), the padded version so its divisible by 8 (88 bits long) and the concatenated string that I am to write to a file.
0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011
Would the best way to write this concatenated string be to divide the thing into 8 bit long substrings and write those individually as bytes or is there a simpler, faster way to do this? When it comes to binary data, the only module I've worked with so far is struct, this is the first time i've tackled something like this. Any and all advice would be appreciated.
I would convert it to a list using regex.
import re
file = open(r"insert file path here", "a")
bitstring = "0000110010001100110110111100001000111011001100010000111101111000011000110011101000100011"
bitlist = re.findall('........', bitstring) #seperates bitstring into a list of each item being 8 characters long
for i in range(len(bitlist)):
file.write(bitlist[i])
file.write("\n")
Let me know if this is what you mean.
I would also like to mention how pulling data from a text file will not be the most efficient way of store it. The fastest would ideally be kept in an array such as [["10110100"], ["00101010"]] and pull the data by doing...
bitarray = [["10110100"], ["00101010"]]
>>> print(bitarray[0][0])
10110100
I have a string that looks like this
'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
I have been able to match the data I want which is "12102 G1103543" with this.
re.findall('\x10\x42(.*)\x10\x54', data)
Which will output this
'\x00\x0e12102 G1103543'
The problem im having is that \x10\x54 is not always at the end of the data I want. However what I have noticed is that the first two hex digits correspond to how long the data length will be. I.E. \x00\x0e = 14 so the data length is 14char long.
Is there a better way to do this, like matching the first part then cutting the next 14 characters? I should also say that the length will vary as im looking to match several things.
Also is there a way to output the string in all hex so its easier for me to read when working in a python shell I.E. \x10B == \x10\x42
Thank You!
Edit: I managed to come up with this working solution.
newdata = re.findall('\x10\x42(.*)', data)
newdata[0][2:int(newdata[0][0:2].encode('hex'))]
Please, note that you have an structured binary file at your hands, and it is foolish to try to use regular expressions to extract data from it.
First of all the "hex data" you talk about is not "hex data" -it is just bytes
in your stream outside the ASCII range - therefore Python2 will display these characters as a \x10 and so on - but internally it is just a single byte with the value 16 (when viewed as decimal). The \x42you write corresponds to the ASCII letter B and that is why you see B in your representation.
So your best bet there would be to get the file specification, and read the data you want from there using the struct module and byte-string slicing.
If you can't have the file spec, so it is a reverse-engineering work to find out the fields of interest -just like you are already doing. But even then, you should write some code with the struct module to get your values, since field lenghts (and most likely offsets) are encoded in the byte stream itself.
In this example, your marker "\x10\x42" will rarely be a marker per se - it is most likely its position is indicated by other factors in the file (either a fixed place in the file definition, or by an offset earlier on the file.
But - if you are correctly using this as a marker, you could make use of regular expressions just to findout all offsets of the "\x10\x42" marker as you are doing, and them interpreting the following two bytes as the message length:
import struct, re
def get_data(data, sep=b"\x10B"):
results = []
for match in re.finditer(sep, data):
offset = match.start()
msglen = struct.unpack(">H", data[offset + 2: offset + 4])[0]
print(msglen)
results.append(data[offset + 4: offset + 4 + msglen])
return results
I need to display data from two files (with equal sizes) to be able to visually compare them. For this, I made a new Tk widget consisting of four Text widgets. The first widget contains characters representing bytes from the first file, the second one contains hexadecimal values of the bytes in the left widget, and the same goes for the third and four one respective (containing data/hex values for the second file). The input data to be displayed are two bytearrays.
To fill the Text widgets, I have to process the input data (bytearrays), because
I have to get rid of unprintable characters and some characters that caused misalignment of the respective lines in the four widgets,
I have to fill the second/fourth widgets with hex values of the bytes, therefore I have to convert the byte values to hex numbers.
The code I used does the functionality described, and it works quite well for small files (several hundreds of kilobytes). However when I try to load bigger files (several megabytes), the time it takes to process and load the data is unacceptable (tens of seconds).
An example of my widget for displaying the data can be seen here:
To process the input data, I use the following code. _ldata and _rdata are bytearrays with the input data, ldata and rdata are strings to be loaded in the first and third Text widgets, lhexdata and rhexdata are strings with the hexadecimal values to be loaded in the second and fourth Text widget. wrap is an integer determining how many bytes will be represented on one line. The print_chars function replaces all the characters that caused misalignment or couldn't be selected in the Text widgets.
def print_chars(self, byte):
if (byte < 0x20 or
(byte > 0x7E and byte < 0xB1)):
return 0x07
else:
return byte
...
ldata = "\n".join(["".join(map(chr,
map(self.print_chars, _ldata[i:i+wrap])))
for i in range(0, len(_ldata), wrap)])
rdata = "\n".join(["".join(map(chr,
map(self.print_chars, _rdata[i:i+wrap])))
for i in range(0, len(_rdata), wrap)])
lhexdata = "\n".join([" ".join(map("{0:02X}".format, _ldata[i:i+wrap]))
for i in range(0, len(_ldata), wrap)])
rhexdata = "\n".join([" ".join(map("{0:02X}".format, _rdata[i:i+wrap]))
for i in range(0, len(_rdata), wrap)])
I think there is a way to speed things up, but can't figure out any. Before I implemented the list comprehension, I had used for cycles for the data processing, and it was a real pain in the neck even for very short inputs. The list comprehensions vere a big improvement in performance, yet not sufficient. Thanks for any advices.
I think your first two lines can be improved by using bytearray.translate with an appropriate translation table rather than using your own escaping and converting system. Then you can turn it into a string with bytearray.decode. You still need an additional step to split the text into lines and recombine it, but I suspect that it will be faster if you've done the translation work already.
table = bytearray.maketrans(bytes(range(0x20))+bytes(range(0x7f, 0xb1)),
b"\x07"*(0x20+0xb1-0x7f))
ldata_string = _ldata.translate(table).decode("latin-1") # pick some 8-bit encoding
ldata = "\n".join(ldata_string[i:i+wrap] for i in range(0, len(ldata), wrap))
You can do something similar for the hex output, using the b16encode function from the base64 module to convert to hex, then decode to make the bytes output into a string. The splitting and rejoining gets a bit more complicated due to the need for spaces between each pair of hex digits, but I suspect it will still be faster than encoding each byte separately.
import base64
lhexdata_string = base64.b16encode(_ldata).decode("ascii") # hex will always be ASCII
lhexdata = "\n".join(" ".join(hexdata_string[i+j:i+j+2] for i in range(0, 2*wrap, 2))
for j in range(0, len(lhexdata_string), 2*wrap))
Note that the code above assumes that you're using Python 3. If you're using Python 2 you'll need to change a few things (such as working around the lack of maketrans and not needing to decode).
I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii
path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
line = fd.readline()
print binascii.hexlify(line)
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.