I'm working on a huffman encoder/decoder in Python, and am experiencing some unexpected (at least for me) behavior in my code. Encoding the file is fine, the problem occurs when decoding the file. Below is the associated code:
def decode(cfile):
with open(cfile,"rb") as f:
enc = f.read()
len_dkey = int(bin(ord(enc[0]))[2:].zfill(8) + bin(ord(enc[1]))[2:].zfill(8),2) # length of dictionary
pad = ord(enc[2]) # number of padding zeros at end of message
dkey = { int(k): v for k,v in json.loads(enc[3:len_dkey+3]).items() } # dictionary
enc = enc[len_dkey+3:] # actual message in bytes
com = []
for b in enc:
com.extend([ bit=="1" for bit in bin(ord(b))[2:].zfill(8)]) # actual encoded message in bits (True/False)
cnode = 0 # current node for tree traversal
dec = "" # decoded message
for b in com:
cnode = 2 * cnode + b + 1 # array implementation of tree
if cnode in dkey:
dec += dkey[cnode]
cnode = 0
with codecs.open("uncompressed_"+cfile,"w","ISO-8859-1") as f:
f.write(dec)
The first with open(cfile,"rb") as f call runs very quickly for all file sizes (tested sizes are 1.2MB, 679KB, and 87KB), but the part that slows down the code significantly is the for b in com loop. I've done some timing and I honestly don't know what's going on.
I've timed the whole decode function on each file, as shown below:
87KB 1.5 sec
679KB 6.0 sec
1.2MB 384.7 sec
first of all, I don't even know how to assign this complexity. Next, I've timed a single run through of the problematic loop, and got that the line cnode = 2*cnode + b + 1 takes 2e-6 seconds while the if cnode in dkey line takes 0.0 seconds (according to time.clock() on OSX). So it seems as if the arithmetic is slowing down my program significantly...? Which I feel like doesn't make sense.
I actually have no idea what is going on, and any help at all would be super welcome
I found a solution to my problem, but I am still left with confusion afterwards. I solved the problem by changing the dec from "" to [], and then changing the dec += dkey[cnode] line to dec.append(dkey[cnode]). This resulted in the following times:
87KB 0.11 sec
679KB 0.21 sec
1.2MB 1.01 sec
As you can see, this has immensely cut down the time, so in that aspect, this was a success. However, I am still confused as to why python's string concatenation seems to be the problem here.
Related
Background
I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions, number of incoming and outgoing particles can vary, so the files consist of variable length records. For analysis I use python and numpy. After switching from python 2 to python 3 I have noticed a dramatic decrease in performance of my scripts and traced it down to numpy.fromfile function.
Simplified code to reproduce the problem
This code, iotest.py
Generates a file of a similar structure to what I have in my studies
Reads it using numpy.fromfile
Reads it using numpy.frombuffer
Compares timing of both
import numpy as np
import os
def generate_binary_file(filename, nrecords):
n_records = np.random.poisson(lam = nrecords)
record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
with open(filename, 'wb') as f:
s = 0
for i in range(n_records):
f.write(record_lengths[i].tobytes())
f.write(x[s:s+record_lengths[i]].tobytes())
s += record_lengths[i]
# Trick for testing: make sum of records equal to 0
f.write(np.array([1], dtype = 'i4').tobytes())
f.write(np.array([-x.sum()], dtype = 'd').tobytes())
return os.path.getsize(filename)
def read_binary_npfromfile(filename):
checksum = 0.0
with open(filename, 'rb') as f:
while True:
try:
record_length = np.fromfile(f, 'i4', 1)[0]
x = np.fromfile(f, 'd', record_length)
checksum += x.sum()
except:
break
assert(np.abs(checksum) < 1e-6)
def read_binary_npfrombuffer(filename):
checksum = 0.0
with open(filename, 'rb') as f:
while True:
try:
record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
checksum += x.sum()
except:
break
assert(np.abs(checksum) < 1e-6)
if __name__ == '__main__':
from timeit import Timer
from functools import partial
fname = 'testfile.tmp'
print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
fsize = generate_binary_file(fname, i)
t1 = Timer(partial(read_binary_npfromfile, fname))
t2 = Timer(partial(read_binary_npfrombuffer, fname))
a1 = np.array(t1.repeat(5, 1))
a2 = np.array(t2.repeat(5, 1))
print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
Results
Conclusions
In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.
In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.
In the documentation of numpy.fromfile it is described as "A highly efficient way of reading binary data with a known data-type". It is not correct in Python 3 anymore. This was in fact noticed earlier by other people already.
Questions
In Python 3 how to obtain a comparable (or better) performance to Python 2, when reading binary files of variable structure?
What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?
TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.
Analysis
The main issue is that the benchmark measure overheads. Indeed, it perform a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops calls 601_214 times np.fromfile and np.frombuffer. The timing on my machine are 10.5s for read_binary_npfromfile and 1.2s for read_binary_npfrombuffer. This means respectively 17.4 us and 2.0 us per call for the two function. Such timing per call are relatively reasonable considering Numpy is not designed to efficiently operate on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another and unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impact overheads and this appear to be the case here (eg. buffering interface). The point is that it is not really a problem because there is a way to use a different approach that is much much faster (as it does not pay huge overheads).
Faster Numpy code
The main solution to write a fast implementation is to read the whole file once in a big byte buffer and then decode it using np.view. That being said, this is a bit tricky because of data alignment and the fact that nearly all Numpy function needs to be prohibited in the while loop due to their overhead. Here is an example:
def read_binary_faster_numpy(filename):
buff = np.fromfile(filename, dtype=np.uint8)
buff_int32 = buff.view(np.int32)
buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
nblocks = buff.size // 4 # Number of 4-byte blocks
pos = 0 # Displacement by block of 4 bytes
lst = []
while pos < nblocks:
record_length = buff_int32[pos]
pos += 1
if pos + record_length * 2 > nblocks:
break
offset = pos // 2
if pos % 2 == 0: # Aligned with buff_double_1
x = buff_double_1[offset:offset+record_length]
else: # Aligned with buff_double_2
x = buff_double_2[offset:offset+record_length]
lst.append(x) # np.sum is too expensive here
pos += record_length * 2
checksum = np.sum(np.concatenate(lst))
assert(np.abs(checksum) < 1e-6)
The above implementation should be faster but it is a bit tricky to understand and it is still bounded by the latency of Numpy operations. Indeed, the loop is still calling Numpy functions due to operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overheads of indexing is much smaller than the one of previous functions, it is still quite big for such a critical loop (with ~300_000 iterations)...
Better performance with... a basic pure-Python code
It turns out that the following pure-python implementation is faster, safer and simpler:
from struct import unpack_from
def read_binary_python_struct(filename):
checksum = 0.0
with open(filename, 'rb') as f:
data = f.read()
offset = 0
while offset < len(data):
record_length = unpack_from('#i', data, offset)[0]
checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
offset += 4 + record_length * 8
assert(np.abs(checksum) < 1e-6)
This is because the overhead of unpack_from is far lower than the one of Numpy functions but it is still not great.
In fact, now the main issue is actually the CPython interpreter. It is clearly not designed with high-performance in mind. The above code push it to the limit. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive. This is not reasonable to let CPython do such an operation.
Writing a high-performance code with Numba
We can drastically speed it up using Numba which can compile Numpy-based Python codes to native ones using a just-in-time compiler! Here is an example:
#nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
checksum = 0.0
offset = 0
while offset + 4 < buff.size:
record_length = buff[offset:offset+4].view(np.int32)[0]
start = offset + 4
end = start + record_length * 8
if end > buff.size:
break
x = buff[start:end].view(np.float64)
checksum += x.sum()
offset = end
return checksum
def read_binary_numba(filename):
buff = np.fromfile(filename, dtype=np.uint8)
checksum = decode_buffer(buff)
assert(np.abs(checksum) < 1e-6)
Numba removes nearly all Numpy overheads thanks to a native compiled code. That being said note that Numba does not implement all Numpy functions yet. This include np.fromfile which need to be called outside a Numba-compiled function.
Benchmark
Here are the performance results on my machine (i5-9600KF with a high-performance Nvme SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.
read_binary_npfromfile: 10616 ms ( x1)
read_binary_npfrombuffer: 1132 ms ( x9)
read_binary_faster_numpy: 509 ms ( x21)
read_binary_python_struct: 222 ms ( x48)
read_binary_numba: 12 ms ( x885)
Optimal time: 7 ms (x1517)
One can see that the Numba implementation is extremely fast compared to the initial Python implementation and even to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
I'm a Python noobie trying to find my way around using it for Dynamo. I've had quite a bit of success using simple while loops/nested ifs to neaten my Dynamo scripts; however, I've been stumped by this recent error.
I'm attempting to pull in lists of data (flow rates from pipe fittings) and then output the maximum flow rate of each fitting by comparing the indices of each list (a cross fitting would have 4 flow rates in Revit, I'm comparing each pipe inlet/outlet flow rate and calculating the maximum for sizing purposes). For some reason, adding lists in the while loop and iterating the indices gives me the "unexpected token" error which I presume is related to "i += 1" according to online debuggers.
I've been using this while loop code format for a bit now and it has always worked for non-listed related iterations. Can anyone give me some guidance here?
Thank you in advance!
Error in Dynamo:
Warning: IronPythonEvaluator.EvaluateIronPythonScript
operation failed.
unexpected token 'i'
Code used:
import sys
import clr
clr.AddReference('ProtoGeometry')
from Autodesk.DesignScript.Geometry import *
dataEnteringNode = IN
a = IN[0]
b = IN[1]
c = IN[2]
d = IN[3]
start = 0
end = 3
i = start
y=[]
while i < end:
y.append(max( (a[i], b[i], c[i] ))
i += 1
OUT = y
I've got a folder full of very large files that need to be byte flipped by a power of 4. So essentially, I need to read the files as a binary, adjust the sequence of bits, and then write a new binary file with the bits adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two characters is a byte, and I need to flip the bytes by a power of 4)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii
with open(myFile, 'rb') as f:
content = f.read()
hexString = str(binascii.hexlify(content))
flippedBytes = ""
inc = 0
while inc < len(hexString):
flippedBytes += file[inc + 6:inc + 8]
flippedBytes += file[inc + 4:inc + 6]
flippedBytes += file[inc + 2:inc + 4]
flippedBytes += file[inc:inc + 2]
inc += 8
..... write the flippedBytes to file, etc
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of: "hexString.replace()" to remove unnecessary hex characters - but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code with larger files. Some of my files I need to flip are almost 2gb in size, and the code was going to take almost half a day to complete one single file. I've got dozens of files I need to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the HEX values in a file by a power of 4?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time
You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct
with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
while True:
d = infile.read(4)
if not d:
break
le = struct.unpack('<I', d)
be = struct.pack('>I', *le)
of.write(be)
Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack
I'm profiling node.js vs python in file (48KB) reading synchronously.
Node.js code
var fs = require('fs');
var stime = new Date().getTime() / 1000;
for (var i=0; i<1000; i++){
var content = fs.readFileSync('npm-debug.log');
}
console.log("Total time took is: " + ((new Date().getTime() / 1000) - stime));
Python Code
import time
stime = time.time()
for i in range(1000):
with open('npm-debug.log', mode='r') as infile:
ax = infile.read();
print("Total time is: " + str(time.time() - stime));
Timings are as follows:
$ python test.py
Total time is: 0.5195660591125488
$ node test.js
Total time took is: 0.25799989700317383
Where is the difference?
In File IO or
Python list ds allocation
Or Am I not comparing apples to apples?
EDIT:
Updated python's readlines() to read() for a good comparison
Changed the iterations to 1000 from 500
PURPOSE:
To understand the truth in node.js is slower than python is slower than C kind of things and if so slow at which place in this context.
readlines returns a list of lines in the file, so it has to read the data char by char, constantly comparing the current character to any of the newline characters, and keep composing a list of lines.
This is more complicated than simple file.read(), which would be the equivalent of what Node.js does.
Also, the length calculated by your Python script is the number of lines, while Node.js gets the number of characters.
If you want even more speed, use os.open instead of open:
import os, time
def Test_os(n):
for x in range(n):
f = os.open('Speed test.py', os.O_RDONLY)
data = ""
t = os.read(f, 1048576).decode('utf8')
while t:
data += t
t = os.read(f, 1048576).decode('utf8')
os.close(f)
def Test_open(n):
for x in range(n):
with open('Speed test.py') as f:
data = f.read()
s = time.monotonic()
Test_os(500000)
print(time.monotonic() - s)
s = time.monotonic()
Test_open(500000)
print(time.monotonic() - s)
On my machine os.open is several seconds faster than open. The output is as follows:
53.68909174999999
58.12600833400029
As you can see, open is 4.4 seconds slower than os.open, although as the number of runs decreases, so does this difference.
Also, you should try tweaking the buffer size of the os.read function as different values may give very different timings:
Here 'operation' means a single call to Test_os.
If you get rid of bytes' decoding and use io.BytesIO instead of mere bytes objects, you'll get a considerable speedup:
def Test_os(n, buf):
for x in range(n):
f = os.open('test.txt', os.O_RDONLY)
data = io.BytesIO()
while data.write(os.read(f, buf)):
...
os.close(f)
Thus, the best result is now 0.038 seconds per call instead of 0.052 (~37% speedup).
I've got a 250 MB CSV file I need to read with ~7000 rows and ~9000 columns. Each row represents an image, and each column is a pixel (greyscale value 0-255)
I started with a simple np.loadtxt("data/training_nohead.csv",delimiter=",") but this gave me a memory error. I thought this was strange since I'm running 64-bit Python with 8 gigs of memory installed and it died after using only around 512 MB.
I've since tried SEVERAL other tactics, including:
import fileinput and read one line at a time, appending them to an array
np.fromstring after reading in the entire file
np.genfromtext
Manual parsing of the file (since all data is integers, this was fairly easy to code)
Every method gave me the same result. MemoryError around 512 MB. Wondering if there was something special about 512MB, I created a simple test program which filled up memory until python crashed:
str = " " * 511000000 # Start at 511 MB
while 1:
str = str + " " * 1000 # Add 1 KB at a time
Doing this didn't crash until around 1 gig. I also, just for fun, tried: str = " " * 2048000000 (fill 2 gigs) - this ran without a hitch. Filled the RAM and never complained. So the issue isn't the total amount of RAM I can allocate, but seems to be how many TIMES I can allocate memory...
I google'd around fruitlessly until I found this post: Python out of memory on large CSV file (numpy)
I copied the code from the answer exactly:
def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
def iter_func():
with open(filename, 'r') as infile:
for _ in range(skiprows):
next(infile)
for line in infile:
line = line.rstrip().split(delimiter)
for item in line:
yield dtype(item)
iter_loadtxt.rowlength = len(line)
data = np.fromiter(iter_func(), dtype=dtype)
data = data.reshape((-1, iter_loadtxt.rowlength))
return data
Calling iter_loadtxt("data/training_nohead.csv") gave a slightly different error this time:
MemoryError: cannot allocate array memory
Googling this error I only found one, not so helpful, post: Memory error (MemoryError) when creating a boolean NumPy array (Python)
As I'm running Python 2.7, this was not my issue. Any help would be appreciated.
With some help from #J.F. Sebastian I developed the following answer:
train = np.empty([7049,9246])
row = 0
for line in open("data/training_nohead.csv")
train[row] = np.fromstring(line, sep=",")
row += 1
Of course this answer assumed prior knowledge of the number of rows and columns. Should you not have this information before-hand, the number of rows will always take a while to calculate as you have to read the entire file and count the \n characters. Something like this will suffice:
num_rows = 0
for line in open("data/training_nohead.csv")
num_rows += 1
For number of columns if every row has the same number of columns then you can just count the first row, otherwise you need to keep track of the maximum.
num_rows = 0
max_cols = 0
for line in open("data/training_nohead.csv")
num_rows += 1
tmp = line.split(",")
if len(tmp) > max_cols:
max_cols = len(tmp)
This solution works best for numerical data, as a string containing a comma could really complicate things.
This is an old discussion, but might help people in present.
I think I know why str = str + " " * 1000 fails fester than str = " " * 2048000000
When running the first one, I believe OS needs to allocate in memory the new object which is str + " " * 1000, and only after that it reference the name str to it. Before referencing the name 'str' to the new object, it cannot get rid of the first one.
This means the OS needs to allocate about the 'str' object twice in the same time, making it able to do it just for 1 gig, instead of 2 gigs.
I believe using the next code will get the same maximum memory out of your OS as in single allocation:
str = " " * 511000000
while(1):
l = len(str)
str = " "
str = " " * (len + 1000)
Feel free to roccet me if I am wrong