Speeding up (simple) text processing - python

I need to display data from two files (of equal size) so that I can compare them visually. For this, I made a new Tk widget consisting of four Text widgets. The first widget contains characters representing bytes from the first file, the second contains the hexadecimal values of those bytes, and the third and fourth do the same for the second file. The input data to be displayed are two bytearrays.
To fill the Text widgets, I have to process the input data (bytearrays), because
I have to get rid of unprintable characters and some characters that caused misalignment of the respective lines in the four widgets,
I have to fill the second/fourth widgets with hex values of the bytes, therefore I have to convert the byte values to hex numbers.
The code I use does what is described above, and it works quite well for small files (several hundred kilobytes). However, when I try to load bigger files (several megabytes), the time it takes to process and load the data is unacceptable (tens of seconds).
An example of my widget for displaying the data can be seen here:
To process the input data, I use the following code. _ldata and _rdata are bytearrays with the input data, ldata and rdata are strings to be loaded in the first and third Text widgets, lhexdata and rhexdata are strings with the hexadecimal values to be loaded in the second and fourth Text widget. wrap is an integer determining how many bytes will be represented on one line. The print_chars function replaces all the characters that caused misalignment or couldn't be selected in the Text widgets.
def print_chars(self, byte):
    if (byte < 0x20 or
            (byte > 0x7E and byte < 0xB1)):
        return 0x07
    else:
        return byte
...
ldata = "\n".join(["".join(map(chr,
                               map(self.print_chars, _ldata[i:i+wrap])))
                   for i in range(0, len(_ldata), wrap)])
rdata = "\n".join(["".join(map(chr,
                               map(self.print_chars, _rdata[i:i+wrap])))
                   for i in range(0, len(_rdata), wrap)])
lhexdata = "\n".join([" ".join(map("{0:02X}".format, _ldata[i:i+wrap]))
                      for i in range(0, len(_ldata), wrap)])
rhexdata = "\n".join([" ".join(map("{0:02X}".format, _rdata[i:i+wrap]))
                      for i in range(0, len(_rdata), wrap)])
I think there is a way to speed things up, but I can't figure one out. Before I switched to list comprehensions, I used for loops for the data processing, and it was painfully slow even for very short inputs. The list comprehensions were a big improvement in performance, yet not a sufficient one. Thanks for any advice.

I think your first two lines can be improved by using bytearray.translate with an appropriate translation table rather than using your own escaping and converting system. Then you can turn it into a string with bytearray.decode. You still need an additional step to split the text into lines and recombine it, but I suspect that it will be faster if you've done the translation work already.
table = bytearray.maketrans(bytes(range(0x20)) + bytes(range(0x7f, 0xb1)),
                            b"\x07" * (0x20 + 0xb1 - 0x7f))
ldata_string = _ldata.translate(table).decode("latin-1")  # pick some 8-bit encoding
ldata = "\n".join(ldata_string[i:i+wrap] for i in range(0, len(ldata_string), wrap))
You can do something similar for the hex output, using the b16encode function from the base64 module to convert to hex, then decode to make the bytes output into a string. The splitting and rejoining gets a bit more complicated due to the need for spaces between each pair of hex digits, but I suspect it will still be faster than encoding each byte separately.
import base64
lhexdata_string = base64.b16encode(_ldata).decode("ascii")  # hex will always be ASCII
# Each byte is two hex digits, so one output line covers 2*wrap characters.
lhexdata = "\n".join(" ".join(lhexdata_string[i:i+2]
                              for i in range(j, min(j + 2*wrap, len(lhexdata_string)), 2))
                     for j in range(0, len(lhexdata_string), 2*wrap))
Note that the code above assumes that you're using Python 3. If you're using Python 2 you'll need to change a few things (such as working around the lack of maketrans and not needing to decode).
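As a further sketch along the same lines (not part of the answer's code): on Python 3.8 or later, the hex method of bytes and bytearray accepts a separator, which removes the inner join over hex pairs. The names text_lines, hex_lines and CONTROL_TABLE are made up for illustration; the replaced byte ranges follow the question's print_chars.

```python
# Translation table mirroring the question's print_chars: bytes below 0x20
# and bytes 0x7F..0xB0 are replaced with 0x07. (All names are illustrative.)
_REPLACED = bytes(range(0x20)) + bytes(range(0x7F, 0xB1))
CONTROL_TABLE = bytes.maketrans(_REPLACED, b"\x07" * len(_REPLACED))


def text_lines(data, wrap):
    """Return the character view: `wrap` characters per line."""
    text = data.translate(CONTROL_TABLE).decode("latin-1")
    return "\n".join(text[i:i + wrap] for i in range(0, len(text), wrap))


def hex_lines(data, wrap):
    """Return the hex view: `wrap` space-separated hex pairs per line."""
    # hex(sep) needs Python 3.8+; upper() matches the {0:02X} format.
    return "\n".join(data[i:i + wrap].hex(" ").upper()
                     for i in range(0, len(data), wrap))


sample = bytearray(b"AB\x00C\x7fD")
print(repr(text_lines(sample, 3)))
print(hex_lines(sample, 3))
```

Whether this is fast enough for multi-megabyte inputs would need to be measured; inserting the resulting strings into the Text widgets may itself take a large share of the time.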

Related

Python np.fromfile() adding arbitrary random comma when reading from binary file

I've encountered a weird problem and have not been able to solve it for days. I created a byte array containing the values from 1 to 250 and wrote it to a binary file from C# using WriteAllBytes.
Later I read it from Python using np.fromfile(filename, dtype=np.ubyte). However, I noticed that this function was adding an arbitrary comma (see the image). Interestingly, it is not visible in the array property. And if I call numpy.array2string, the comma turns into '\n'. One solution is to replace the comma with nothing, but I have very long sequences, and using a replace function on 100 GB of data would take forever. I also rechecked the files by reading them with .NET Core, and I'm quite sure the comma is not there.
What could I be missing?
Edit:
I was trying to read all the byte values into an array and cast each member, or the entire array, to a string. I found that the most reliable way to do this is:
list(map(str, ubyte_array))
The code above returns a list of strings whose elements contain no arbitrary commas or blank spaces.

The size parameter for gzip.open().read()

When working with the gzip library in Python, I very often come across code that uses the .read() function in a pattern that looks like this:
with gzip.open(filename) as bytestream:
    bytestream.read(16)
    buf = bytestream.read(
        IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
    )
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
While I'm familiar with the context manager pattern, I struggle to grasp what the first line of code inside the with block is actually doing.
This is the documentation for the read() function:
Read at most n characters from stream.
Read from underlying buffer until we have n characters or we hit EOF.
If n is negative or omitted, read until EOF.
If that is the case, the functional role of the first line, bytestream.read(16), would have to be reading and thus skipping the first 16 characters, presumably because they act as metadata or a header. However, when I have some images, how would I know to use 16 as the argument for the read call, instead of, say, 32, 8, or 64?
I recall coming across code identical to the above many times, except that the author used bytestream.read(8) instead of bytestream.read(16), or some other value. Digging into the file character by character shows no discernible pattern that determines the length of the header.
In other words, how does one determine the parameter to use in the read call? Or how does one know the length of the header in a gzip-compressed file?
My guess was that it has something to do with the bytes, but after searching through the documentation and online references I can't confirm that.
Reproducible details
My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata. So the first line in that code skips those 16 characters, and the remainder is stored in a variable named buf. However, digging into the data I found no way to determine why or how the value 16 was chosen. I have read the bytes in character by character, and also tried reading and casting them as np.float, but there is no discernible pattern suggesting that the metadata ends at the 16th character and the actual data begins at the 17th.
The following code reads the data from this website and extracts the first 30 characters. Notice that it is not apparent where the header "ends" (at the 16th character, apparently, after the second appearance of \x1c) and where the data begins:
import gzip
import numpy as np
train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
IMAGE_SIZE = 28
NUM_CHANNELS = 1
def extract_data(filename, num_images):
    with gzip.open(filename) as bytestream:
        first30 = bytestream.read(30)
    return first30
first30 = extract_data(train_data_filename, 10)
print(first30)
# returns: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
If we modify the code to cast them as np.float32, such that all characters were now in numeric (float), again there was no apparent pattern to distinguish where the header / meta-data ends and where the data begins.
Any reference or advice would be very appreciated!
From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header contents prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: The number of bytes it prepends is zero.
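A small round trip can illustrate this point. It is only a sketch: the payload and its 4-byte "HEAD" prefix are made up here and have nothing to do with the MNIST files.

```python
import gzip
import io

# Hypothetical payload: a made-up 4-byte header followed by some data.
payload = b"HEAD" + b"some image bytes"
compressed = gzip.compress(payload)

# gzip.open accepts a file object and defaults to binary mode ("rb").
with gzip.open(io.BytesIO(compressed)) as bytestream:
    header = bytestream.read(4)  # first 4 bytes of the decompressed payload
    rest = bytestream.read()     # everything after them

print(header)
print(rest)
```

The bytes returned by read() are exactly the bytes that were compressed, so any header that has to be skipped belongs to the format of the payload, not to gzip.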
Scroll down to the bottom of the page you linked; there's a header titled FILE FORMATS FOR THE MNIST DATABASE.
That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:
offset  type            value             description
0000    32 bit integer  0x00000803(2051)  magic number
0004    32 bit integer  60000             number of images
0008    32 bit integer  28                number of rows
0012    32 bit integer  28                number of columns
Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
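As a sketch of how those four fields could be read rather than skipped, the struct module can unpack them as big-endian 32-bit unsigned integers. The function name parse_idx3_header is made up for illustration; the 16 sample bytes are the first 16 bytes printed in the question.

```python
import struct


def parse_idx3_header(header):
    """Unpack the four big-endian 32-bit integers of an image-file header."""
    # ">IIII" = big-endian, four unsigned 32-bit integers (16 bytes in total).
    magic, num_images, rows, cols = struct.unpack(">IIII", header[:16])
    return magic, num_images, rows, cols


# First 16 bytes of the output shown in the question.
first16 = b"\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c"
print(parse_idx3_header(first16))  # (2051, 60000, 28, 28)
```

Reading the header this way also lets the code take the image count and dimensions from the file itself instead of hard-coding them.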
From the code snippet, bytestream.read(16) reads, and thereby skips, the first 16 bytes of bytestream. The documentation you quoted says that read() reads at most n characters from the stream, but gzip.open() opens the file in binary mode by default, so each "character" here is a single byte and read(16) consumes 16 bytes.
See more on chars and bytes https://pymotw.com/3/gzip/#reading-compressed-data
The code snippet is primarily interested in the contents of buf, skipping the first 16 bytes of the stream. To understand how to determine the parameter that goes into first bytestream.read() AKA determine how many bytes of the compressed image file to skip, we must understand what the rest of the code does. Particularly, what file are we reading and what are we trying to accomplish with numpy(?) library (saving rgb images in a 1D numpy array?).
I am definitely not an expert on image processing, but it seems that bytestream.read(16) is a unique solution for a unique problem of processing some unique compressed image file. Thus, it is hard to tell how to determine how many bytes to skip without seeing more code and understanding more logic behind the snippet.

Parsing IFF-style data using Python

I have an IFF-style file (see below) whose contents I need to inspect in Python.
https://en.wikipedia.org/wiki/Interchange_File_Format
I can iterate through the file using the following code
from chunk import Chunk
def chunks(f):
    while True:
        try:
            c = Chunk(f, align=False, bigendian=False)
            yield c
            c.skip()
        except EOFError:
            break
if __name__ == "__main__":
    # open() works in both Python 2 and 3; file() exists only in Python 2
    for c in chunks(open("sample.iff", 'rb')):
        name, sz, value = c.getname(), c.getsize(), c.read()
        print(name, sz, value)
Now I need to parse the different values. I have had some success using Python's 'struct' module, unpacking different fields as follows
struct.unpack('<I', value)
or
struct.unpack('BBBB', value)
by experimenting with different formatting characters shown in the struct module documentation
https://docs.python.org/2/library/struct.html
This works with some of the simpler fields but not with the more complex ones. It is all very trial-and-error. What I need is some systematic way of unpacking the different values, some way of knowing or inspecting the type of data they represent. I am not a C datatype expert.
Any ideas ? Many thanks.
SVOXVERS BVER BPM }SPEDTGRDGVOL`NAME2017-02-15 16-38MSCLMZOOMXOFMYOFLMSKCURLTIMESELSLGENPATNPATTPATLPDTAa � 1pQ 10 `q !#QP! 0A �`A PCHNPLIN PYSZ PFLGPICO �m�!�a��Q�1:\<<<<:\�1�Q��a�!�mPFGCPBGC���PFFFPXXXPYYYPENDSFFFCSNAM OutputSFINSRELSXXXDSYYYhSZZZSSCLSVPRSCOL���SMICSMIB����SMIP����SLNK����SENDSFFFISNAM FMSTYPFMSFINSRELSXXX�SYYY8SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNKCVAL�CVAL0CVAL�CVALCVALCVALCVALCVALGCVALnCVAL\CVALCVAL&CVALoCVALDCVALCVALCVALCMID������������������SENDSFFFQSNAM EchoSTYPEchoSFINSRELSXXX�SYYY SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNK����CVALCVALCVAL�CVALCVALCVALCMID0������SENDSFFFQSNAM ReverbSTYPReverbSFINSRELSXXX\SYYY�SZZZSSCLSVPRSCOL��SMICSMIB����SMIP����SLNK����CVALCVALCVAL�CVAL�CVALCVALCVALCVALCVALCMIDH���������SENDSENDSENDSENDSEND
If it's really an IFF file, it needs alignment and big-endian turned on, and the file would contain a single FORM chunk that in turn contains the FORM type such as SVOX and the contents chunks. (Or it could contain a LIST or CAT container chunk.)
An IFF chunk has:
A four-character chunk-type code
A four-byte big-endian integer: length
length number of data bytes
A pad byte for alignment if length is odd
This is documented in "EA IFF 85". See the "EA IFF-85" Repository for the original IFF docs. [I wrote them.]
Some file formats like RIFF and PNG are variations on the IFF design, not conforming applications of the IFF standard. They vary the chunk format details, which is why Python's Chunk reader library lets you pick alignment, endian, and when to recurse into chunks.
By looking at your file in a hex/ascii dump and mapping out the chunk spans, you should be able to deduce whether it uses big-endian or little-endian length fields, whether each odd-length chunk is followed by a pad byte for alignment, and whether there are chunks within chunks.
Now to the contents. A chunk's type signals the format and semantics of its contents. Those contents could be a simple C struct or could contain variable-length strings. IFF itself does not provide metadata on that level of structure, unlike JSON and TIFF.
So try to find the documentation for the file format (SVOX?).
Otherwise try to reverse engineer the data. If you put sample data into an application that generates these files, you can try special cases, look for the expected values in the file, change just one parameter, then look for what changed in the file.
Finally, your code should call c.close(). c.close() will call c.skip() for you and also handle chunk closing, which includes safety checks for attempts to read the chunk afterwards.
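The chunk layout listed above (type code, length, data, optional pad byte) can also be walked by hand with struct, which may be useful because the standard-library chunk module was removed in Python 3.13. This is a minimal sketch, not a full IFF reader: the name iter_chunks and its parameters are made up for illustration, and it does not recurse into container chunks.

```python
import io
import struct


def iter_chunks(stream, big_endian=True, align=True):
    """Yield (type_code, length, data) for each top-level chunk in stream."""
    length_format = ">I" if big_endian else "<I"
    while True:
        header = stream.read(8)
        if len(header) < 8:  # end of file (or a truncated header)
            return
        type_code = header[:4]
        (length,) = struct.unpack(length_format, header[4:])
        data = stream.read(length)
        if align and length % 2:
            stream.read(1)  # skip the pad byte after an odd-length chunk
        yield type_code, length, data


# Two hand-built chunks: an odd-length one (with pad byte) and an even one.
sample = (b"NAME" + struct.pack(">I", 3) + b"abc" + b"\x00"
          + b"BPM " + struct.pack(">I", 2) + b"\x00\x7d")
for type_code, length, data in iter_chunks(io.BytesIO(sample)):
    print(type_code, length, data)
```

For a variant format, trying big_endian=False or align=False and checking whether the resulting type codes and lengths look plausible is one way to test the deductions from a hex dump.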

Is there some way to AND two strings in python?

So I have several very large files which represent each position in the human genome. Both of these files are binary masks for a certain type of "score" for each position in the genome and I am interested in getting a new mask where both scores are "1" i.e. the intersection of the two masks.
For example:
File 1: 00100010101
File 2: 11111110001
Desired output: 00100010001
In python, it is really fast to read these big files (they contain between 50-250 million characters) into strings. However, I can't just & the strings together. I CAN do something like
bin(int('0001',2) & int('1111', 2))
but is there a more direct way that doesn't require that I pad in the extra 0's and convert back to a string in the end?
I think the conversion to builtin integer types for the binary-and operation is likely to make it much faster than working character by character (because Python's int is written in C rather than Python). I'd suggest working on each line of your input files, rather than the whole multi-million-character strings at once. The binary-and operation doesn't require any carrying, so there's no issue working with each line separately.
To avoid awkward string operations to pad the result out to the right length, you can use the str.format method to convert your integer to a binary string of the right length in one go. Here's an implementation that writes the output to a new file:
import itertools
with open(filename1) as in1, open(filename2) as in2, open(filename3, "w") as out:
    for line1, line2 in itertools.izip(in1, in2):
        out.write("{0:0{1}b}\n".format(long(line1, 2) & long(line2, 2), len(line1) - 1))
I'm using one of the neat features of the string formatting mini-language to use a second argument to pass a desired length for the converted number. If you can rely upon the lines always having exactly 50 binary digits (including at the end of the files), you could hard code the length with {:050b} rather than computing it from the length of an input line.
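The code above is written for Python 2 (itertools.izip and long). A minimal Python 3 sketch of the same idea, applied to the example masks from the question, might look like this; the function name and_masks is made up for illustration, and it assumes both masks have the same length.

```python
def and_masks(mask1, mask2):
    """Bitwise-AND two equal-length strings of '0' and '1' characters."""
    if len(mask1) != len(mask2):
        raise ValueError("masks must have the same length")
    # int(..., 2) parses the binary digits; the nested {1} field pads the
    # result with leading zeros back to the original width.
    return "{0:0{1}b}".format(int(mask1, 2) & int(mask2, 2), len(mask1))


print(and_masks("00100010101", "11111110001"))  # 00100010001
```

When reading lines from a file, the trailing newline would need to be stripped (for example with rstrip("\n")) before passing them in.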

How do I search for a set amount of hex and non hex data in python

I have a string that looks like this
'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
I have been able to match the data I want which is "12102 G1103543" with this.
re.findall('\x10\x42(.*)\x10\x54', data)
Which will output this
'\x00\x0e12102 G1103543'
The problem I'm having is that \x10\x54 is not always at the end of the data I want. However, I have noticed that the first two bytes give the length of the data, i.e. \x00\x0e = 14, so the data is 14 characters long.
Is there a better way to do this, like matching the first part and then cutting out the next 14 characters? I should also say that the length will vary, as I'm looking to match several things.
Also, is there a way to output the string entirely in hex so it's easier for me to read when working in a Python shell, i.e. \x10B == \x10\x42?
Thank You!
Edit: I managed to come up with this working solution.
newdata = re.findall('\x10\x42(.*)', data)
newdata[0][2:int(newdata[0][0:2].encode('hex'))]
Please note that you have a structured binary file on your hands, and it is foolish to try to use regular expressions to extract data from it.
First of all, the "hex data" you talk about is not "hex data": it is just bytes in your stream outside the printable ASCII range, so Python 2 displays them as \x10 and so on, but internally each one is just a single byte with the value 16 (when viewed as decimal). The \x42 you write corresponds to the ASCII letter B, and that is why you see B in your representation.
So your best bet would be to get the file specification, and read the data you want from there using the struct module and byte-string slicing.
If you can't get the file spec, finding the fields of interest is reverse-engineering work, just like you are already doing. But even then, you should write some code with the struct module to get your values, since field lengths (and most likely offsets) are encoded in the byte stream itself.
In this example, your marker "\x10\x42" will rarely be a marker per se; most likely its position is indicated by other factors in the file (either a fixed place in the file definition, or an offset given earlier in the file).
But if you are correctly using this as a marker, you could use regular expressions just to find all offsets of the "\x10\x42" marker, as you are doing, and then interpret the following two bytes as the message length:
import struct, re
def get_data(data, sep=b"\x10B"):
    results = []
    for match in re.finditer(sep, data):
        offset = match.start()
        msglen = struct.unpack(">H", data[offset + 2: offset + 4])[0]
        print(msglen)
        results.append(data[offset + 4: offset + 4 + msglen])
    return results
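For comparison, here is a sketch of the same marker-plus-length idea for Python 3 without regular expressions, run on the sample string from the question. It also includes a small helper for the "all hex" display the question asks about. The names extract_fields and to_hex are made up for illustration, and the sketch assumes, as above, that the two bytes after the marker are a big-endian length.

```python
import struct


def extract_fields(data, marker=b"\x10B"):
    """Return the payload that follows each marker + 2-byte length."""
    fields = []
    pos = data.find(marker)
    while pos != -1:
        length_start = pos + len(marker)
        if length_start + 2 > len(data):
            break  # marker too close to the end to carry a length
        (length,) = struct.unpack(">H", data[length_start:length_start + 2])
        payload_start = length_start + 2
        fields.append(data[payload_start:payload_start + length])
        # Resume the search after the payload, so that marker-like bytes
        # inside the payload are not mistaken for a new field.
        pos = data.find(marker, payload_start + length)
    return fields


def to_hex(data):
    """Render every byte as two hex digits, e.g. for reading in a shell."""
    return " ".join("{:02x}".format(byte) for byte in data)


sample = b"\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21"
print(extract_fields(sample))  # [b'12102 G1103543']
print(to_hex(sample[:4]))      # 00 03 10 42
```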
