I have a text file containing a range of bits, in ASCII:
cat myFile.txt
0101111011100011001...
I would like to write this to another file in binary mode, so that I can read it in a hex editor. How could I achieve that? I already tried to convert it with code like this:
f2 = open(fileOut, 'wb')
with open(fileIn) as f:
    while True:
        c = f.read(1)
        byte = byte + str(c)
        if not c:
            print "End of file"
            break
        if count % 8 is 0:
            count = 0
            print hex(int(byte, 2))
            try:
                f2.write('\\x' + hex(int(byte, 2))[2:]).zfill(2)
            except:
                pass
            byte = ''
        count += 1
but that didn't achieve what I planned to do. Do you have any hint?
Reading and writing one byte at a time is painfully slow. You can get roughly a 40x speedup simply by reading more data from the file per call to f.read and f.write:
|------------------+--------------------|
| using_loop_20480 | 8.34 msec per loop |
| using_loop_8 | 354 msec per loop |
|------------------+--------------------|
using_loop is the code shown at the bottom of this post. using_loop_20480 is that code with chunksize = 1024*20, which means 20480 bytes (ASCII characters) are read from the file at a time. using_loop_8 is the same code with chunksize = 8.
Regarding count % 8 is 0: don't use is to compare numerical values; use == instead. Here are some examples of why is can give you wrong results (maybe not in the code you posted, but in general, is is not appropriate here):
In [5]: 1L is 1
Out[5]: False
In [6]: 1L == 1
Out[6]: True
In [7]: 0.0 is 0
Out[7]: False
In [8]: 0.0 == 0
Out[8]: True
Instead of
struct.pack('{n}B'.format(n = len(bytes)), *bytes)
you could use
bytearray(bytes)
Not only is it less typing, it is also slightly faster:
|------------------------------+--------------------|
| using_loop_20480 | 8.34 msec per loop |
| using_loop_with_struct_20480 | 8.59 msec per loop |
|------------------------------+--------------------|
bytearrays are a good match for this job because they bridge the gap between treating the data as a string and as a sequence of numbers.
In [16]: bytearray([97,98,99])
Out[16]: bytearray(b'abc')
In [17]: print(bytearray([97,98,99]))
abc
As you can see above, bytearray(bytes) allows you to
define the bytearray by passing it a sequence of ints (in
range(256)), and allows you to write it out as though it were a
string: g.write(bytearray(bytes)).
def using_loop(output, chunksize):
    with open(filename, 'r') as f, open(output, 'wb') as g:
        while True:
            chunk = f.read(chunksize)
            if chunk == '':
                break
            bytes = [int(chunk[i:i+8], 2)
                     for i in range(0, len(chunk), 8)]
            g.write(bytearray(bytes))
Make sure chunksize is a multiple of 8.
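For example, a call along these lines would convert the file from the question (the module-level filename that using_loop reads, as in the code above, is an assumption here):
filename = 'myFile.txt'                # the ASCII file of '0'/'1' characters
using_loop('myFile.bin', 1024 * 20)    # chunksize must be a multiple of 8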
This is the code I used to create the tables. Note that prettytable also does something similar to this, and it may be advisable to use their code rather than my hack: table.py
This is the module I used to time the code: utils_timeit.py. (It uses table.py).
And here is the code I used to time using_loop (and other variants): timeit_bytearray_vs_struct.py
Use struct:
import struct
...
f2.write(struct.pack('b', int(byte,2))) # signed 8 bit int
or
f2.write(struct.pack('B', int(byte,2))) # unsigned 8 bit int
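For instance, a minimal sketch of how that pack call could slot into your loop (using the fileIn/fileOut names from your code; this replaces the manual byte/count bookkeeping):
import struct

with open(fileIn) as f, open(fileOut, 'wb') as f2:
    while True:
        bits = f.read(8)          # one output byte = eight ASCII bits
        if len(bits) < 8:         # stop at EOF (or a trailing partial group)
            break
        f2.write(struct.pack('B', int(bits, 2)))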
Related
So, basically, I have a 256MB file which contains "int32" numbers written as 4-byte sequences, and I have to sort them into another file.
The thing I struggle with the most is how to read the file into an array of 4-byte sequences. At first I thought this method was slow because it reads the elements one by one:
for i in range(numsPerFile):
    buffer = currFile.read(4)
    arr.append(buffer)
Then I made this:
buffer = currFile.read()
arr4 = []
for i in range(numsPerFile):
    arr4.append(bytes(buffer[i*4 : i*4+4]))
And it wasn't any better when I measured the time (both read 128000 numbers in ~0.8 s on my PC). So, is there a faster method to do this?
Read and write tasks are often the slow part of an activity.
I have done some tests and removed the file-reading step from the timings.
Test0 repeats your test, to calibrate it for my old, slow machine.
Test1 uses a list comprehension to build the list of four-byte chunks, the same as test0.
Test2 builds a list of integers. Since you mentioned the data is int32, it turns the four-byte chunks into integers that can be used directly in Python.
Test3 uses the array library to unpack the bytes as int32, as suggested by the original poster. I used frombytes rather than fromfile to keep it consistent with the other tests.
Test3 was the fastest by some margin, followed by test2 and test1; test0 was always the slowest on my machine.
I only created a 128 MB file for the test, but it should give you an idea.
This is the code I used for my testing:
import array
import time
from pathlib import Path
import secrets
import struct

tmp_txt = Path('/tmp/test.txt')


def gen_test():
    data = secrets.token_bytes(128_000_000)
    tmp_txt.write_bytes(data)
    print(f"Generated {len(data):,} bytes")


def test0(data):
    arr4 = []
    for i in range(int(len(data)/4)):
        arr4.append(bytes(data[i*4:i*4+4]))
    return arr4


def test1(data):
    return [data[i:i+4] for i in range(0, len(data), 4)]


def test2(data):
    return [_[0] for _ in struct.iter_unpack('>i', data)]


def test3(data):
    arr4 = array.array('i')
    arr4.frombytes(data)
    arr4.byteswap()
    return arr4


def run_test():
    raw_data = tmp_txt.read_bytes()
    all_test = [
        test0,
        test1,
        test2,
        test3,
    ]
    for idx, test in enumerate(all_test):
        print(f"test{idx}")
        n0 = time.perf_counter()
        x0 = test(raw_data)
        n1 = time.perf_counter()
        print(f"\tTest{idx} took {n1 - n0:.2f}")
        if isinstance(x0, array.array):
            print(f"\tdata[{x0.buffer_info()[1]:,}] = [{x0[0]}, ..., {x0[-1]}]")
        else:
            print(f"\tdata[{len(x0):,}] = [{x0[0]}, ..., {x0[-1]}]")


if __name__ == '__main__':
    gen_test()
    run_test()
This gave me the following transcript:
Generated 128,000,000 bytes
test0
Test0 took 10.50
data[32,000,000] = [b'\x9f\x1cOO', ..., b'\x94\x8e\xd2?']
test1
Test1 took 4.40
data[32,000,000] = [b'\x9f\x1cOO', ..., b'\x94\x8e\xd2?']
test2
Test2 took 3.31
data[32,000,000] = [-1625534641, ..., -1802579393]
test3
Test3 took 0.41
data[32,000,000] = [-1625534641, ..., -1802579393]
def bindata():
    array = []
    count = 0
    with open('file.txt', 'r') as f:
        while count < 64:
            f.seek(count)
            array.append(f.read(8))
            count = count + 8
    print(array)

bindata()
file.txt data: FFFFFFFFAAAAAAAABBBBBBBBCCCCCCCCDDDDDDDDAAAAAAAAEEEEEEEEFFFFFFFF
I think I must have misunderstood your question. I can create a test file hopefully similar to yours like this:
import numpy as np
# Make 256MB array
a = np.random.randint(0, 1000, int(256*1024*1024/4), dtype=np.uint32)
# Write to disk
a.tofile('BigBoy.bin')
And now time how long it takes to read it from disk
%timeit b = np.fromfile('BigBoy.bin', dtype=np.uint32)
That gives 33 milliseconds.
33.2 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Check all the same
print(np.all(a==b))
True
Or, if for some reason you don't like Numpy, you can use the struct module:
import struct

d = open('BigBoy.bin', 'rb').read()
fmt = f'{len(d)//4}I'
data = struct.unpack(fmt, d)
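And since the original task was to sort the numbers into another file, a minimal sketch of that last step (the output file name is just a placeholder):
b = np.fromfile('BigBoy.bin', dtype=np.uint32)
np.sort(b).tofile('BigBoySorted.bin')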
I'm trying to read data from stdin; in practice I'm using Ctrl+C, Ctrl+V to paste the values into cmd, but it stops the process at some point, always the same point. The input file is a .in file; the format is that the first row is one number and the next 3 rows contain sets of numbers separated by spaces. I'm using Python 3.9.9. The problem only occurs with longer files (number of elements in the sets > 10000); with short input everything is fine. It seems like the memory just ran out. I had the following approach:
def readData():
    # Read input
    for line in range(5):
        x = list(map(int, input().rsplit()))
        if(line == 0):
            nodes_num = x[0]
        if(line == 1):
            masses_list = x
        if(line == 2):
            init_seq_list = x
        if(line == 3):
            fin_seq_list = x
    return nodes_num, masses_list, init_seq_list, fin_seq_list
and the data which works:
6
2400 2000 1200 2400 1600 4000
1 4 5 3 6 2
5 3 2 4 6 1
and the long input file:
https://pastebin.com/atAcygkk
It stops at the sequence ... 2421 1139 322], which is partway through the 4th row.
To read input from "standard input", you just need to use the stdin stream. Since your data is all on lines, you can just read until each EOL delimiter, without having to track lines yourself with an index number. This code will work when run as python3.9 sowholeinput.py < atAcygkk.txt, or cat atAcygkk.txt | python3.9 sowholeinput.py.
import sys

def read_data():
    stream = sys.stdin
    num = int(stream.readline())
    masses = [int(t) for t in stream.readline().split()]
    init_seq = [int(t) for t in stream.readline().split()]
    fin_seq = [int(t) for t in stream.readline().split()]
    return num, masses, init_seq, fin_seq
Interestingly, it does not work, as you describe, when pasting the text using terminal cut-and-paste. This implies a limitation of that method, not of Python itself.
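If the paste limit gets in the way, here is a minimal sketch of the same reader taking a file path on the command line instead (the script and file names are just the ones from this thread):
import sys

def read_data_from_path(path):
    with open(path) as stream:
        num = int(stream.readline())
        masses = [int(t) for t in stream.readline().split()]
        init_seq = [int(t) for t in stream.readline().split()]
        fin_seq = [int(t) for t in stream.readline().split()]
    return num, masses, init_seq, fin_seq

# Run as: python3.9 sowholeinput.py atAcygkk.txt
print(read_data_from_path(sys.argv[1]))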
Background
I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions, number of incoming and outgoing particles can vary, so the files consist of variable length records. For analysis I use python and numpy. After switching from python 2 to python 3 I have noticed a dramatic decrease in performance of my scripts and traced it down to numpy.fromfile function.
Simplified code to reproduce the problem
This code, iotest.py:
- generates a file of a similar structure to what I have in my studies,
- reads it using numpy.fromfile,
- reads it using numpy.frombuffer,
- and compares the timing of both.
import numpy as np
import os

def generate_binary_file(filename, nrecords):
    n_records = np.random.poisson(lam = nrecords)
    record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
    x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
    with open(filename, 'wb') as f:
        s = 0
        for i in range(n_records):
            f.write(record_lengths[i].tobytes())
            f.write(x[s:s+record_lengths[i]].tobytes())
            s += record_lengths[i]
        # Trick for testing: make sum of records equal to 0
        f.write(np.array([1], dtype = 'i4').tobytes())
        f.write(np.array([-x.sum()], dtype = 'd').tobytes())
    return os.path.getsize(filename)

def read_binary_npfromfile(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.fromfile(f, 'i4', 1)[0]
                x = np.fromfile(f, 'd', record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)

def read_binary_npfrombuffer(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
                x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)

if __name__ == '__main__':
    from timeit import Timer
    from functools import partial

    fname = 'testfile.tmp'

    print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
    for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
        fsize = generate_binary_file(fname, i)
        t1 = Timer(partial(read_binary_npfromfile, fname))
        t2 = Timer(partial(read_binary_npfrombuffer, fname))
        a1 = np.array(t1.repeat(5, 1))
        a2 = np.array(t2.repeat(5, 1))
        print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
Results
Conclusions
In Python 2, numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. The performance of both scaled linearly with file size.
In Python 3, numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! The performance of both still scales linearly with file size.
In its documentation, numpy.fromfile is described as "A highly efficient way of reading binary data with a known data-type". This is no longer accurate in Python 3. It was in fact noticed earlier by other people already.
Questions
In Python 3, how can I obtain performance comparable to (or better than) Python 2 when reading binary files of variable structure?
What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?
TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.
Analysis
The main issue is that the benchmark measures overheads. Indeed, it performs a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops call np.fromfile and np.frombuffer 601_214 times. The timings on my machine are 10.5 s for read_binary_npfromfile and 1.2 s for read_binary_npfrombuffer. This means 17.4 us and 2.0 us per call, respectively, for the two functions. Such per-call timings are relatively reasonable considering Numpy is not designed to operate efficiently on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another, and unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impacts overheads, and this appears to be the case here (eg. the buffering interface). The point is that it is not really a problem, because there is a different approach that is much faster (as it does not pay these huge overheads).
Faster Numpy code
The main solution for a fast implementation is to read the whole file once into a big byte buffer and then decode it with ndarray.view. That being said, this is a bit tricky because of data alignment and the fact that nearly all Numpy functions need to be kept out of the while loop due to their overhead. Here is an example:
def read_binary_faster_numpy(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    buff_int32 = buff.view(np.int32)
    buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
    buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
    nblocks = buff.size // 4    # Number of 4-byte blocks
    pos = 0                     # Displacement by block of 4 bytes
    lst = []
    while pos < nblocks:
        record_length = buff_int32[pos]
        pos += 1
        if pos + record_length * 2 > nblocks:
            break
        offset = pos // 2
        if pos % 2 == 0:        # Aligned with buff_double_1
            x = buff_double_1[offset:offset+record_length]
        else:                   # Aligned with buff_double_2
            x = buff_double_2[offset:offset+record_length]
        lst.append(x)           # np.sum is too expensive here
        pos += record_length * 2
    checksum = np.sum(np.concatenate(lst))
    assert(np.abs(checksum) < 1e-6)
The above implementation should be faster, but it is a bit tricky to understand, and it is still bound by the latency of Numpy operations. Indeed, the loop still calls Numpy functions through operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overhead of indexing is much smaller than that of the previous functions, it is still quite large for such a critical loop (with ~300_000 iterations)...
Better performance with... a basic pure-Python code
It turns out that the following pure-python implementation is faster, safer and simpler:
from struct import unpack_from

def read_binary_python_struct(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        data = f.read()
        offset = 0
        while offset < len(data):
            record_length = unpack_from('@i', data, offset)[0]
            checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
            offset += 4 + record_length * 8
    assert(np.abs(checksum) < 1e-6)
This is because the overhead of unpack_from is far lower than that of the Numpy functions, but it is still not great.
In fact, the main issue is now the CPython interpreter itself. It is clearly not designed with high performance in mind. The above code pushes it to its limits. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive. It is not reasonable to let CPython do such an operation.
Writing a high-performance code with Numba
We can drastically speed this up using Numba, which can compile Numpy-based Python code to native code with a just-in-time compiler! Here is an example:
import numpy as np
import numba as nb

@nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
    checksum = 0.0
    offset = 0
    while offset + 4 < buff.size:
        record_length = buff[offset:offset+4].view(np.int32)[0]
        start = offset + 4
        end = start + record_length * 8
        if end > buff.size:
            break
        x = buff[start:end].view(np.float64)
        checksum += x.sum()
        offset = end
    return checksum

def read_binary_numba(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    checksum = decode_buffer(buff)
    assert(np.abs(checksum) < 1e-6)
Numba removes nearly all the Numpy overheads thanks to natively compiled code. That being said, note that Numba does not implement all Numpy functions yet. This includes np.fromfile, which needs to be called outside a Numba-compiled function.
Benchmark
Here are the performance results on my machine (i5-9600KF with a high-performance NVMe SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.
read_binary_npfromfile: 10616 ms ( x1)
read_binary_npfrombuffer: 1132 ms ( x9)
read_binary_faster_numpy: 509 ms ( x21)
read_binary_python_struct: 222 ms ( x48)
read_binary_numba: 12 ms ( x885)
Optimal time: 7 ms (x1517)
One can see that the Numba implementation is extremely fast compared to the initial Python implementation and even to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
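For reference, a minimal sketch of how such a comparison might be reproduced (assuming the functions above and the testfile.tmp produced by the generator; one run per function, and note the first Numba call also pays the JIT-compilation cost):
from timeit import timeit

fname = 'testfile.tmp'
for fn in (read_binary_npfromfile, read_binary_npfrombuffer,
           read_binary_faster_numpy, read_binary_python_struct,
           read_binary_numba):
    print(fn.__name__, timeit(lambda: fn(fname), number=1))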
I've got a folder full of very large files that need to be byte-flipped in 4-byte groups. So essentially, I need to read each file as binary, reverse the order of the bytes within every group of four, and then write a new binary file with the bytes adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two characters is a byte, and I need to reverse the order of the bytes within each 4-byte group)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii

with open(myFile, 'rb') as f:
    content = f.read()

hexString = str(binascii.hexlify(content))

flippedBytes = ""
inc = 0
while inc < len(hexString):
    flippedBytes += hexString[inc + 6:inc + 8]
    flippedBytes += hexString[inc + 4:inc + 6]
    flippedBytes += hexString[inc + 2:inc + 4]
    flippedBytes += hexString[inc:inc + 2]
    inc += 8

..... write the flippedBytes to file, etc
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of "hexString.replace()" to remove unnecessary hex characters - but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code on larger files. Some of the files I need to flip are almost 2 GB in size, and the code was going to take almost half a day to complete a single file. I've got dozens of files to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the bytes in a file in 4-byte groups?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time
You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct

with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
    while True:
        d = infile.read(4)
        if not d:
            break
        le = struct.unpack('<I', d)
        be = struct.pack('>I', *le)
        of.write(be)
Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack
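A minimal sketch of that idea, assuming the whole file fits in memory and its length is a multiple of 4 (the file names are placeholders):
import struct

with open('input.bin', 'rb') as f, open('output.bin', 'wb') as g:
    data = f.read()
    out = bytearray(len(data))
    # Walk the input in 4-byte words and repack each one with the opposite byte order.
    for i, (v,) in enumerate(struct.iter_unpack('<I', data)):
        struct.pack_into('>I', out, i * 4, v)
    g.write(out)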
I'm using an old version of Python on an embedded platform (Python 1.5.2+ on a Telit platform). The problem I have is that my function for converting a string to hex is very slow. Here is the function:
def StringToHexString(s):
    strHex = ''
    for c in s:
        strHex = strHex + hexLoookup[ord(c)]
    return strHex
hexLoookup is a lookup table (a Python list) containing the hex representation of each character.
I am willing to try anything (a more compact function, language tricks I don't know about). To be clearer, here are the benchmarks (the timer resolution is 1 second on that platform):
N is the number of input characters to be converted to hex, and the time is in seconds.
N | Time (seconds)
50 | 1
150 | 3
300 | 4
500 | 8
1000 | 15
1500 | 23
2000 | 31
Yes, I know, it is very slow... but if I could gain something like 1 or 2 seconds it would be progress.
So any solution is welcomed, especially from people who know about python performance.
Thanks,
Iulian
PS1 (after testing the suggestions offered - keeping the ord call):
def StringToHexString(s):
    hexList = []
    hexListAppend = hexList.append
    for c in s:
        hexListAppend(hexLoookup[ord(c)])
    return ''.join(hexList)
With this function I obtained the following times: 1/2/3/5/12/19/27 (which is clearly better)
PS2 (I can't explain it, but it's blazingly fast; a BIG thank you to Sven Marnach for the idea!):
def StringToHexString(s):
    return ''.join( map(lambda param: hexLoookup[param], map(ord, s)) )
Times: 1/1/2/3/6/10/12
Any other ideas/explanations are welcome!
Make your hexLoookup a dictionary indexed by the characters themselves, so you don't have to call ord each time.
Also, don't concatenate to build strings – that used to be slow. Use join on a list instead.
from string import join

def StringToHexString(s):
    strHex = []
    for c in s:
        strHex.append(hexLoookup[c])
    return join(strHex, '')
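For completeness, a minimal sketch of building such a character-keyed lookup (the exact hex formatting is whatever your original hexLoookup list used):
hexLoookup = {}
for i in range(256):
    hexLoookup[chr(i)] = '%02x' % i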
Building on Petr Viktorin's answer, you could further improve the performance by avoiding global variable and attribute look-ups in favour of local variable look-ups. Local variables are optimized to avoid a dictionary look-up on each access. (They haven't always been, but I just double-checked that this optimization was already in place in 1.5.2, released in 1999.)
from string import join

def StringToHexString(s):
    strHex = []
    strHexappend = strHex.append
    _hexLoookup = hexLoookup
    for c in s:
        strHexappend(_hexLoookup[c])
    return join(strHex, '')
Constantly reassigning and concatenating strings with the + operator is very slow. I guess Python 1.5.2 wasn't yet optimizing for this, so using string.join() would be preferable.
Try
import string

def StringToHexString(s):
    listhex = []
    for c in s:
        listhex.append(hexLookup[ord(c)])
    return string.join(listhex, '')
and see if that is any faster.
Try:
from string import join

def StringToHexString(s):
    charlist = []
    for c in s:
        charlist.append(hexLoookup[ord(c)])
    return join(charlist, '')
Each string addition takes time proportional to the length of the string built so far, so repeated concatenation is quadratic overall; join also takes time proportional to the length of the entire string, but you only pay that cost once. For example, converting 1000 characters by repeated concatenation copies on the order of a million characters in total, while a single join copies only the final ~2000.
You could also make hexLookup a dict mapping characters to hex values, so you don't have to call ord for every character. It's a micro-optimization, so probably won't be significant.
def StringToHexString(s):
    return ''.join( map(lambda param: hexLoookup[param], map(ord, s)) )
Seems like this is the fastest! Thank you Sven Marnach!