Split a binary file into 4-byte sequences (python)

So, basically, I have a 256MB file which contains "int32" numbers written as 4-byte sequences, and I have to sort them into another file.
The thing I struggle with most is how to read the file into an array of 4-byte sequences. First I tried this method, which I assumed was slow because it reads elements one by one:
for i in range(numsPerFile):
    buffer = currFile.read(4)
    arr.append(buffer)
Then I made this:
buffer = currFile.read()
arr4 = []
for i in range(numsPerFile):
    arr4.append(bytes(buffer[i*4 : i*4+4]))
And it wasn't any better when I measured the time (both read 128000 numbers in ~0.8 sec on my pc). So, is there a faster method to do this?

Read and write tasks are often the slow part of an activity.
I did some tests and removed the file-reading time from the timings.
Test0 repeats the test you did, to calibrate it for my old, slow machine.
Test1 uses a list comprehension to build the same list of four-byte chunks as test0.
Test2 builds a list of integers. Since you mentioned the data is int32, I turned the four-byte chunks into integers to be used in Python.
Test3 uses the array library to unpack the bytes as int32, as suggested by the original poster. I used frombytes rather than fromfile to keep it consistent with the other tests.
Test3 was the fastest by some margin, followed by test2 and test1; test0 was always the slowest on my machine.
I only created a 128 MB file for the test, but it should give you an idea.
This is the code I used for my testing:
import array
import time
from pathlib import Path
import secrets
import struct
tmp_txt = Path('/tmp/test.txt')
def gen_test():
    data = secrets.token_bytes(128_000_000)
    tmp_txt.write_bytes(data)
    print(f"Generated {len(data):,} bytes")

def test0(data):
    arr4 = []
    for i in range(int(len(data)/4)):
        arr4.append(bytes(data[i*4:i*4+4]))
    return arr4

def test1(data):
    return [data[i:i+4] for i in range(0, len(data), 4)]

def test2(data):
    return [_[0] for _ in struct.iter_unpack('>i', data)]

def test3(data):
    arr4 = array.array('i')
    arr4.frombytes(data)
    arr4.byteswap()
    return arr4

def run_test():
    raw_data = tmp_txt.read_bytes()
    all_test = [
        test0,
        test1,
        test2,
        test3,
    ]
    for idx, test in enumerate(all_test):
        print(f"test{idx}")
        n0 = time.perf_counter()
        x0 = test(raw_data)
        n1 = time.perf_counter()
        print(f"\tTest{idx} took {n1 - n0:.2f}")
        if isinstance(x0, array.array):
            print(f"\tdata[{x0.buffer_info()[1]:,}] = [{x0[0]}, ..., {x0[-1]}]")
        else:
            print(f"\tdata[{len(x0):,}] = [{x0[0]}, ..., {x0[-1]}]")

if __name__ == '__main__':
    gen_test()
    run_test()
This gave me the following transcript:
Generated 128,000,000 bytes
test0
    Test0 took 10.50
    data[32,000,000] = [b'\x9f\x1cOO', ..., b'\x94\x8e\xd2?']
test1
    Test1 took 4.40
    data[32,000,000] = [b'\x9f\x1cOO', ..., b'\x94\x8e\xd2?']
test2
    Test2 took 3.31
    data[32,000,000] = [-1625534641, ..., -1802579393]
test3
    Test3 took 0.41
    data[32,000,000] = [-1625534641, ..., -1802579393]
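Test3 above used frombytes so that every test starts from the same in-memory bytes; when the data is coming straight from disk anyway, array.array.fromfile can read directly from the open file object and skip the intermediate bytes object. A minimal sketch using the same test file and element count as above:
import array

arr4 = array.array('i')
with open('/tmp/test.txt', 'rb') as f:
    arr4.fromfile(f, 32_000_000)  # 32 million 4-byte ints read straight from the file
arr4.byteswap()                   # as in test3, which treats the data as big-endian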

def bindata():
    array = []
    count = 0
    with open('file.txt', 'r') as f:
        while count < 64:
            f.seek(count)
            array.append(f.read(8))
            count = count + 8
    print(array)

bindata()
file.txt data: FFFFFFFFAAAAAAAABBBBBBBBCCCCCCCCDDDDDDDDAAAAAAAAEEEEEEEEFFFFFFFF
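If the goal is the actual 4-byte integer values rather than the hex text, each 8-character chunk can be converted with int(chunk, 16). A small illustrative sketch building on the snippet above (the function name and file layout are assumptions, not part of the original answer):
def bindata_ints(path='file.txt'):
    # Read the whole hex string and convert every 8 hex chars (4 bytes) to an int
    with open(path, 'r') as f:
        text = f.read().strip()
    return [int(text[i:i+8], 16) for i in range(0, len(text), 8)]

print(bindata_ints())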

I think I must have misunderstood your question. I can create a test file hopefully similar to yours like this:
import numpy as np
# Make 256MB array
a = np.random.randint(0, 1000, int(256*1024*1024/4), dtype=np.uint32)
# Write to disk
a.tofile('BigBoy.bin')
And now time how long it takes to read it from disk
%timeit b = np.fromfile('BigBoy.bin', dtype=np.uint32)
That gives 33 milliseconds.
33.2 ms ± 336 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Check all the same
print(np.all(a==b))
True
Or, if for some reason you don't like Numpy, you can use struct:
import struct

d = open('BigBoy.bin', 'rb').read()
fmt = f'{len(d)//4}I'  # one 'I' (unsigned 32-bit int) per 4 bytes
data = struct.unpack(fmt, d)
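Since the original task was to sort the numbers into another file, a minimal sketch of the full round trip with NumPy might look like this (file names follow the example above, the output name is a placeholder, and it assumes the values really are native-endian uint32):
import numpy as np

b = np.fromfile('BigBoy.bin', dtype=np.uint32)  # read all 4-byte values
b.sort()                                        # in-place sort in memory (256 MB fits easily)
b.tofile('BigBoySorted.bin')                    # write back in the same raw 4-byte layout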

Related

Writing Trees, number of baskets and compression (uproot)

I am trying to optimize the way trees are written in PyROOT and came across uproot. In the end my application should write events (consisting of arrays), which are continuously coming in, to a tree.
The first approach is the classic way:
event = [1., 2., 3.]
f = ROOT.TFile("my_tree.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
pt = array.array('f', [0.]*3)
tree.Branch("pt", pt, "pt[3]/F")

# for loop to simulate incoming events
for _ in range(10000):
    for i, element in enumerate(event):
        pt[i] = element
    tree.Fill()

tree.Print()
tree.Write("", ROOT.TObject.kOverwrite)
f.Close()
This gives the following Tree and execution time:
Tree characteristics
Trying to do it with uproot my code looks like this:
np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

with uproot.recreate("testing.root", compression=None) as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})
which gives the following tree:
Tree characteristics
So the uproot method takes much longer, the file size is much bigger, and each event gets a separate basket. I tried out different compression settings but that did not change anything. Any idea how to optimize this? Is this even a sensible use case for uproot, and can the process of writing trees be sped up compared to the first way of doing it?
The extend method is supposed to write a new TBasket with each invocation. (See the documentation, especially the orange warning box. The purpose of that is so that you can control the TBasket sizes.) If you're calling it 10000 times to write 1 value (the value [1, 2, 3]) each, that's a maximally inefficient use.
Fundamentally, you're thinking about this problem in an entry-by-entry way, rather than in terms of columns, the way that scientific processing is normally done in Python. What you want to do instead is to collect a large dataset in memory and write it to the file in one chunk. If the data that you'll eventually be addressing is larger than the memory on your computer, you would do it in "large enough" chunks, which is probably on the order of hundreds of megabytes or gigabytes.
For instance, starting with your example,
import time
import uproot
import numpy as np
import awkward as ak
np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

starttime = time.time()
with uproot.recreate("bad.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})
print("Total time:", time.time() - starttime)
The total time (on my computer) is 1.9 seconds and the TTree characteristics are atrocious:
******************************************************************************
*Tree :tree : *
*Entries : 10000 : Total = 1170660 bytes File Size = 2970640 *
* : : Tree compression factor = 1.00 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 1170323 bytes File Size = 970000 *
*Baskets : 10000 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
Instead, we want the data to be in a single array (or some loop that produces ~GB scale arrays):
np_array = np.array([[1, 2, 3]] * 10000)
(This isn't necessarily how you would get np_array, since * 10000 makes a large, intermediate Python list. Suffice to say, you get the data somehow.)
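For illustration only, one way to build the same array without an intermediate Python list is np.tile, which repeats the row at the NumPy level (a minimal sketch; your real data presumably arrives some other way):
import numpy as np

# Repeat the single row [1, 2, 3] 10000 times -> shape (10000, 3), no Python-level list
np_array = np.tile(np.array([[1, 2, 3]]), (10000, 1))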
Now we do the write with a single call to extend, which makes a single TBasket:
np_array = np.array([[1, 2, 3]] * 10000)
ak_array = ak.from_numpy(np_array)

starttime = time.time()
with uproot.recreate("good.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    fout["tree"].extend({"branch": ak_array})
print("Total time:", time.time() - starttime)
The total time (on my computer) is 0.0020 seconds and the TTree characteristics are much better:
******************************************************************************
*Tree :tree : *
*Entries : 10000 : Total = 240913 bytes File Size = 3069 *
* : : Tree compression factor = 107.70 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 240576 bytes File Size = 2229 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 107.70 *
*............................................................................*
So, the writing is almost 1000× faster and the compression is 100× better. (With one entry per TBasket in the previous example, there was no compression because any compressed data would be bigger than the original!)
By comparison, if we do entry-by-entry writing with PyROOT,
import time
import array
import ROOT
data = [1, 2, 3]
holder = array.array("q", [0]*3)

file = ROOT.TFile("pyroot.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
tree.Branch("branch", holder, "branch[3]/L")

starttime = time.time()
for _ in range(10000):
    for i, x in enumerate(data):
        holder[i] = x
    tree.Fill()
tree.Write("", ROOT.TObject.kOverwrite)
file.Close()
print("Total time:", time.time() - starttime)
The total time (on my computer) is 0.062 seconds and the TTree characteristics are fine:
******************************************************************************
*Tree :tree : An Example Tree *
*Entries : 10000 : Total = 241446 bytes File Size = 3521 *
* : : Tree compression factor = 78.01 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 241087 bytes File Size = 3084 *
*Baskets : 8 : Basket Size= 32000 bytes Compression= 78.01 *
*............................................................................*
So, PyROOT is 30× slower here, but the compression is almost as good. ROOT decided to make 8 TBaskets, which is configurable with AutoFlush parameters.
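For reference, the flush granularity is controlled through TTree::SetAutoFlush; a small, hedged example of what tuning it could look like in the PyROOT script above (the value is purely illustrative):
# Positive value: flush a cluster of TBaskets every N entries;
# a negative value flushes after roughly that many bytes instead.
tree.SetAutoFlush(5000)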
Keep in mind, though, that this is a comparison of techniques, not libraries. If you wrap a NumPy array with RDataFrame and write that, then you can skip all of the overhead involved in the Python for loop and you get the advantages of columnar processing.
But columnar processing only matters if you're working with big data. Much like compression, if you apply it to very small datasets (or a very small dataset many times), then it can hurt, rather than help.
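If the dataset is too large to hold in memory at once, the same columnar idea applies chunk by chunk: accumulate a large batch of entries, then call extend once per batch so that every call writes one big TBasket. A rough sketch of that pattern, with a placeholder generator standing in for however the events actually arrive:
import numpy as np
import awkward as ak
import uproot

def batches():
    # placeholder: yields large (N, 3) chunks of events; replace with your real source
    for _ in range(10):
        yield np.tile(np.array([[1, 2, 3]]), (100_000, 1))

with uproot.recreate("chunked.root") as fout:
    it = batches()
    first = ak.from_numpy(next(it))
    fout.mktree("tree", {"branch": first.type})
    fout["tree"].extend({"branch": first})            # one TBasket per batch
    for chunk in it:
        fout["tree"].extend({"branch": ak.from_numpy(chunk)})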

Dramatic drop in numpy fromfile performance when switching from python 2 to python 3

Background
I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions, number of incoming and outgoing particles can vary, so the files consist of variable length records. For analysis I use python and numpy. After switching from python 2 to python 3 I have noticed a dramatic decrease in performance of my scripts and traced it down to numpy.fromfile function.
Simplified code to reproduce the problem
This code, iotest.py
Generates a file of a similar structure to what I have in my studies
Reads it using numpy.fromfile
Reads it using numpy.frombuffer
Compares timing of both
import numpy as np
import os
def generate_binary_file(filename, nrecords):
    n_records = np.random.poisson(lam = nrecords)
    record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
    x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
    with open(filename, 'wb') as f:
        s = 0
        for i in range(n_records):
            f.write(record_lengths[i].tobytes())
            f.write(x[s:s+record_lengths[i]].tobytes())
            s += record_lengths[i]
        # Trick for testing: make sum of records equal to 0
        f.write(np.array([1], dtype = 'i4').tobytes())
        f.write(np.array([-x.sum()], dtype = 'd').tobytes())
    return os.path.getsize(filename)

def read_binary_npfromfile(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.fromfile(f, 'i4', 1)[0]
                x = np.fromfile(f, 'd', record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)

def read_binary_npfrombuffer(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
                x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)

if __name__ == '__main__':
    from timeit import Timer
    from functools import partial

    fname = 'testfile.tmp'

    print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
    for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
        fsize = generate_binary_file(fname, i)
        t1 = Timer(partial(read_binary_npfromfile, fname))
        t2 = Timer(partial(read_binary_npfrombuffer, fname))
        a1 = np.array(t1.repeat(5, 1))
        a2 = np.array(t2.repeat(5, 1))
        print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
Results
Conclusions
In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.
In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.
In the documentation of numpy.fromfile it is described as "A highly efficient way of reading binary data with a known data-type". It is not correct in Python 3 anymore. This was in fact noticed earlier by other people already.
Questions
In Python 3 how to obtain a comparable (or better) performance to Python 2, when reading binary files of variable structure?
What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?
TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.
Analysis
The main issue is that the benchmark measures overheads. Indeed, it performs a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops call np.fromfile and np.frombuffer 601_214 times. The timings on my machine are 10.5 s for read_binary_npfromfile and 1.2 s for read_binary_npfrombuffer. This means respectively 17.4 µs and 2.0 µs per call for the two functions. Such timings per call are relatively reasonable considering Numpy is not designed to efficiently operate on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another and, unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impacts overheads, and this appears to be the case here (e.g. the buffering interface). The point is that it is not really a problem, because there is a way to use a different approach that is much, much faster (as it does not pay huge overheads).
Faster Numpy code
The main solution for a fast implementation is to read the whole file once into a big byte buffer and then decode it using array views (ndarray.view). That being said, this is a bit tricky because of data alignment and the fact that nearly all Numpy functions need to be kept out of the while loop due to their overhead. Here is an example:
def read_binary_faster_numpy(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    buff_int32 = buff.view(np.int32)
    buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
    buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
    nblocks = buff.size // 4  # Number of 4-byte blocks
    pos = 0                   # Displacement by block of 4 bytes
    lst = []
    while pos < nblocks:
        record_length = buff_int32[pos]
        pos += 1
        if pos + record_length * 2 > nblocks:
            break
        offset = pos // 2
        if pos % 2 == 0:  # Aligned with buff_double_1
            x = buff_double_1[offset:offset+record_length]
        else:             # Aligned with buff_double_2
            x = buff_double_2[offset:offset+record_length]
        lst.append(x)  # np.sum is too expensive here
        pos += record_length * 2
    checksum = np.sum(np.concatenate(lst))
    assert(np.abs(checksum) < 1e-6)
The above implementation should be faster, but it is a bit tricky to understand and it is still bounded by the latency of Numpy operations. Indeed, the loop still calls Numpy functions for operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overhead of indexing is much smaller than that of the previous functions, it is still quite big for such a critical loop (with ~300_000 iterations)...
Better performance with... a basic pure-Python code
It turns out that the following pure-python implementation is faster, safer and simpler:
from struct import unpack_from

def read_binary_python_struct(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        data = f.read()
        offset = 0
        while offset < len(data):
            record_length = unpack_from('i', data, offset)[0]
            checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
            offset += 4 + record_length * 8
    assert(np.abs(checksum) < 1e-6)
This is because the overhead of unpack_from is far lower than that of the Numpy functions, but it is still not great.
In fact, now the main issue is actually the CPython interpreter. It is clearly not designed with high performance in mind. The above code pushes it to the limit. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive. It is not reasonable to let CPython do such an operation.
Writing a high-performance code with Numba
We can drastically speed it up using Numba which can compile Numpy-based Python codes to native ones using a just-in-time compiler! Here is an example:
import numba as nb

@nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
    checksum = 0.0
    offset = 0
    while offset + 4 < buff.size:
        record_length = buff[offset:offset+4].view(np.int32)[0]
        start = offset + 4
        end = start + record_length * 8
        if end > buff.size:
            break
        x = buff[start:end].view(np.float64)
        checksum += x.sum()
        offset = end
    return checksum

def read_binary_numba(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    checksum = decode_buffer(buff)
    assert(np.abs(checksum) < 1e-6)
Numba removes nearly all Numpy overheads thanks to natively compiled code. That being said, note that Numba does not implement all Numpy functions yet. This includes np.fromfile, which needs to be called outside a Numba-compiled function.
Benchmark
Here are the performance results on my machine (i5-9600KF with a high-performance Nvme SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.
read_binary_npfromfile: 10616 ms ( x1)
read_binary_npfrombuffer: 1132 ms ( x9)
read_binary_faster_numpy: 509 ms ( x21)
read_binary_python_struct: 222 ms ( x48)
read_binary_numba: 12 ms ( x885)
Optimal time: 7 ms (x1517)
One can see that the Numba implementation is extremely fast compared to the initial Python implementation and even to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!

Did I/O become slower since Python 2.7?

I'm currently having a small side project in which I want to sort a 20GB file on my machine as fast as possible. The idea is to chunk the file, sort the chunks, merge the chunks. I just used pyenv to time the radixsort code with different Python versions and saw that 2.7.18 is way faster than 3.6.10, 3.7.7, 3.8.3 and 3.9.0a. Can anybody explain why Python 3.x is slower than 2.7.18 in this simple example? Were there new features added?
import os
def chunk_data(filepath, prefixes):
    """
    Pre-sort and chunk the content of filepath according to the prefixes.

    Parameters
    ----------
    filepath : str
        Path to a text file which should get sorted. Each line contains
        a string which has at least 2 characters and the first two
        characters are guaranteed to be in prefixes
    prefixes : List[str]
    """
    prefix2file = {}
    for prefix in prefixes:
        chunk = os.path.abspath("radixsort_tmp/{:}.txt".format(prefix))
        prefix2file[prefix] = open(chunk, "w")

    # This is where most of the execution time is spent:
    with open(filepath) as fp:
        for line in fp:
            prefix2file[line[:2]].write(line)
Execution times (multiple runs):
2.7.18: 192.2s, 220.3s, 225.8s
3.6.10: 302.5s
3.7.7: 308.5s
3.8.3: 279.8s, 279.7s (binary mode), 295.3s (binary mode), 307.7s, 380.6s (wtf?)
3.9.0a: 292.6s
The complete code is on Github, along with a minimal complete version
Unicode
Yes, I know that Python 3 and Python 2 deal differently with strings. I tried opening the files in binary mode (rb / wb); see the "binary mode" comments. They are a tiny bit faster on a couple of runs. Still, Python 2.7 is WAY faster on all runs.
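For reference, a hedged sketch of what the binary-mode variant of the hot loop looks like (the prefixes become bytes, so the dictionary keys and the file modes change accordingly; this mirrors the text-mode code above rather than the exact version on Github):
prefix2file = {}
for prefix in prefixes:
    chunk = os.path.abspath("radixsort_tmp/{:}.txt".format(prefix))
    prefix2file[prefix.encode()] = open(chunk, "wb")  # bytes keys, binary files

with open(filepath, "rb") as fp:
    for line in fp:
        prefix2file[line[:2]].write(line)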
Try 1: Dictionary access
When I phrased this question, I thought that dictionary access might be a reason for this difference. However, I think the total execution time is way less for dictionary access than for I/O. Also, timeit did not show anything important:
import timeit
import numpy as np
durations = timeit.repeat(
    'a["b"]',
    repeat=10 ** 6,
    number=1,
    setup="a = {'b': 3, 'c': 4, 'd': 5}"
)
mul = 10 ** -7
print(
    "mean = {:0.1f} * 10^-7, std={:0.1f} * 10^-7".format(
        np.mean(durations) / mul,
        np.std(durations) / mul
    )
)
print("min = {:0.1f} * 10^-7".format(np.min(durations) / mul))
print("max = {:0.1f} * 10^-7".format(np.max(durations) / mul))
Try 2: Copy time
As a simplified experiment, I tried to copy the 20GB file:
cp via shell: 230s
Python 2.7.18: 237s, 249s
Python 3.8.3: 233s, 267s, 272s
The Python stuff is generated by the following code.
My first thought was that the variance is quite high, so this could be the reason. But then, the variance of the chunk_data execution times is also high, while the mean is noticeably lower for Python 2.7 than for Python 3.x. So it does not seem to be as simple an I/O scenario as the one I tried here.
import time
import sys
import os
version = sys.version_info
version = "{}.{}.{}".format(version.major, version.minor, version.micro)
if os.path.isfile("numbers-tmp.txt"):
    os.remove("numbers-tmp.txt")

t0 = time.time()
with open("numbers-large.txt") as fin, open("numbers-tmp.txt", "w") as fout:
    for line in fin:
        fout.write(line)
t1 = time.time()

print("Python {}: {:0.0f}s".format(version, t1 - t0))
My System
Ubuntu 20.04
Thinkpad T460p
Python through pyenv
This is a combination of multiple effects, mostly the fact that Python 3 needs to perform unicode decoding/encoding when working in text mode and if working in binary mode it will send the data through dedicated buffered IO implementations.
First of all, using time.time to measure execution time uses the wall time and hence includes all sorts of Python unrelated things such as OS-level caching and buffering, as well as buffering of the storage medium. It also reflects any interference with other processes that require the storage medium. That's why you are seeing these wild variations in timing results. Here are the results for my system, from seven consecutive runs for each version:
py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6] # 661.79 +/- 38.58
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4] # 615.84 +/- 15.09
Despite the large variation it seems that these results indeed indicate different timings as can be confirmed for example by a statistical test:
>>> from scipy.stats import ttest_ind
>>> ttest_ind(py2, py3)[1]
0.018729004515179636
i.e. there's only a 2% chance that the timings emerged from the same distribution.
We can get a more precise picture by measuring the process time rather than the wall time. In Python 2 this can be done via time.clock while Python 3.3+ offers time.process_time. These two functions report the following timings:
py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8] # 224.90 +/- 1.09
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4] # 171.16 +/- 0.16
Now there's much less spread in the data since the timings reflect the Python process only.
This data suggests that Python 3 takes about 53.7 seconds longer to execute. Given the large amount of lines in the input file (550_000_000) this amounts to about 97.7 nanoseconds per iteration.
The first effect causing increased execution time are unicode strings in Python 3. The binary data is read from the file, decoded and then encoded again when it is written back. In Python 2 all strings are stored as binary strings right away, so this doesn't introduce any encoding/decoding overhead. You don't see this effect clearly in your tests because it disappears in the large variation introduced by various external resources which are reflected in the wall time difference. For example we can measure the time it takes for a roundtrip from binary to unicode to binary:
In [1]: %timeit b'000000000000000000000000000000000000'.decode().encode()
162 ns ± 2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
This does include two attribute lookups as well as two function calls, so the actual time needed is smaller than the value reported above. To see the effect on execution time, we can change the test script to use binary modes "rb" and "wb" instead of text modes "r" and "w". This reduces the timing results for Python 3 as follows:
py3_binary_mode = [200.6, 203.0, 207.2] # 203.60 +/- 2.73
That reduces the process time by about 21.3 seconds or 38.7 nanoseconds per iteration. This is in agreement with timing results for the roundtrip benchmark minus timing results for name lookups and function calls:
In [2]: class C:
   ...:     def f(self): pass
   ...:
In [3]: x = C()
In [4]: %timeit x.f()
82.2 ns ± 0.882 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: %timeit x
17.8 ns ± 0.0564 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
Here %timeit x measures the additional overhead of resolving the global name x, and hence the attribute lookup and function call make 82.2 - 17.8 == 64.4 nanoseconds. Subtracting this overhead twice from the above roundtrip data gives 162 - 2*64.4 == 33.2 nanoseconds.
Now there's still a difference of 32.4 seconds between Python 3 using binary mode and Python 2. This comes from the fact that all the IO in Python 3 goes through the (quite complex) implementation of io.BufferedWriter.write, while in Python 2 the file.write method proceeds fairly straightforwardly to fwrite.
We can check the types of the file objects in both implementations:
$ python3.8
>>> type(open('/tmp/test', 'wb'))
<class '_io.BufferedWriter'>
$ python2.7
>>> type(open('/tmp/test', 'wb'))
<type 'file'>
Here we also need to note that the above timing results for Python 2 have been obtained by using text mode, not binary mode. Binary mode aims to support all objects implementing the buffer protocol which results in additional work being performed also for strings (see also this question). If we switch to binary mode also for Python 2 then we obtain:
py2_binary_mode = [212.9, 213.9, 214.3] # 213.70 +/- 0.59
which is actually a bit larger than the Python 3 results (18.4 ns / iteration).
The two implementations also differ in other details such as the dict implementation. To measure this effect we can create a corresponding setup:
from __future__ import print_function
import timeit
N = 10**6
R = 7
results = timeit.repeat(
    "d[b'10'].write",
    setup="d = dict.fromkeys((str(i).encode() for i in range(10, 100)), open('test', 'rb'))",  # requires file 'test' to exist
    repeat=R, number=N
)
results = [x/N for x in results]
print(['{:.3e}'.format(x) for x in results])
print(sum(results) / R)
This gives the following results for Python 2 and Python 3:
Python 2: ~ 56.9 nanoseconds
Python 3: ~ 78.1 nanoseconds
This additional difference of about 21.2 nanoseconds amounts to about 12 seconds for the full 550M iterations.
The above timing code checks the dict lookup for only one key, so we also need to verify that there are no hash collisions:
$ python3.8 -c "print(len({str(i).encode() for i in range(10, 100)}))"
90
$ python2.7 -c "print len({str(i).encode() for i in range(10, 100)})"
90

Reading data from CSV into dataframe with multiple delimiters efficiently

I have an awkward CSV file which has multiple delimiters: the delimiter for the non-numeric part is ',', for the numeric part ';'. I want to construct a dataframe only out of the numeric part as efficiently as possible.
I have made 5 attempts: among them, utilising the converters argument of pd.read_csv, using regex with engine='python', using str.replace. They are all more than 2x slower than reading the entire CSV file with no conversions. This is prohibitively slow for my use case.
I understand the comparison isn't like-for-like, but it does demonstrate the overall poor performance is not driven by I/O. Is there a more efficient way to read in the data into a numeric Pandas dataframe? Or the equivalent NumPy array?
The below string can be used for benchmarking purposes.
# Python 3.7.0, Pandas 0.23.4
from io import StringIO
import pandas as pd
import numpy as np
import csv
# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6
def csv_reader_1(x):
    df = pd.read_csv(x, usecols=[3], header=None, delimiter=',',
                     converters={3: lambda x: x.split(';')})
    return df.join(pd.DataFrame(df.pop(3).values.tolist(), dtype=float))

def csv_reader_2(x):
    df = pd.read_csv(x, header=None, delimiter=';',
                     converters={0: lambda x: x.rsplit(',')[-1]}, dtype=float)
    return df.astype(float)

def csv_reader_3(x):
    return pd.read_csv(x, usecols=[3, 4, 5], header=None, sep=',|;', engine='python')

def csv_reader_4(x):
    with x as fin:
        reader = csv.reader(fin, delimiter=',')
        L = [i[-1].split(';') for i in reader]
        return pd.DataFrame(L, dtype=float)

def csv_reader_5(x):
    with x as fin:
        return pd.read_csv(StringIO(fin.getvalue().replace(';', ',')),
                           sep=',', header=None, usecols=[3, 4, 5])
Checks:
res1 = csv_reader_1(StringIO(x))
res2 = csv_reader_2(StringIO(x))
res3 = csv_reader_3(StringIO(x))
res4 = csv_reader_4(StringIO(x))
res5 = csv_reader_5(StringIO(x))
print(res1.head(3))
# 0 1 2
# 0 34.23 562.45 213.5432
# 1 56.23 63.45 625.2340
# 2 34.23 562.45 213.5432
assert all(np.array_equal(res1.values, i.values) for i in (res2, res3, res4, res5))
Benchmarking results:
%timeit csv_reader_1(StringIO(x)) # 5.31 s per loop
%timeit csv_reader_2(StringIO(x)) # 6.69 s per loop
%timeit csv_reader_3(StringIO(x)) # 18.6 s per loop
%timeit csv_reader_4(StringIO(x)) # 5.68 s per loop
%timeit csv_reader_5(StringIO(x)) # 7.01 s per loop
%timeit pd.read_csv(StringIO(x)) # 1.65 s per loop
Update
I'm open to using command-line tools as a last resort. To that extent, I have included such an answer. My hope is there is a pure-Python or Pandas solution with comparable efficiency.
Use a command-line tool
By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.
Essentially, using Cython or a tool written in C / C++ is likely to outperform Python-level manipulations.
For example, using Find And Replace Text:
import os
os.chdir(r'C:\temp') # change directory location
os.system('fart.exe -c file.csv ";" ","') # run FART with character to replace
df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None) # read file into Pandas
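On Linux, a similar effect can be had with standard command-line tools; a hedged sketch using sed (this assumes sed is available and that overwriting the file in place is acceptable):
import subprocess
import pandas as pd

# Replace every ';' with ',' in place, then let Pandas parse a single-delimiter file
subprocess.run(['sed', '-i', 's/;/,/g', 'file.csv'], check=True)
df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None)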
How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?
import io
import pandas as pd
# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6
def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def replacementgenerator(haystack, needle, replace):
    for s in haystack:
        if s == needle:
            yield str.encode(replace)
        else:
            yield str.encode(s)

csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5])
Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.
This approach is functionally identical to the answer by Daniele except for the fact that we replace values "on-the-fly", as they are requested instead of all in one go.
If this is an option, substituting the character ; with , in the string is faster.
I have written the string x to a file test.dat.
def csv_reader_4(x):
    with open(x, 'r') as f:
        a = f.read()
        return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])
The unicode() function was necessary to avoid a TypeError in Python 2.
Benchmarking:
%timeit csv_reader_2('test.dat') # 1.6 s per loop
%timeit csv_reader_4('test.dat') # 1.2 s per loop
A very, very fast one: 3.51 s is the result. Simply make csv_reader_4 the code below; it converts the StringIO to str, then replaces ; with , and reads the dataframe with sep=',':
def csv_reader_4(x):
    with x as fin:
        reader = pd.read_csv(StringIO(fin.getvalue().replace(';', ',')), sep=',', header=None)
    return reader
The benchmark:
%timeit csv_reader_4(StringIO(x)) # 3.51 s per loop
Python has powerful features to manipulate data, but don't expect performance using Python. When performance is needed, C and C++ are your friends.
Any fast library in Python is written in C/C++. It is quite easy to use C/C++ code in Python; have a look at the SWIG utility (http://www.swig.org/tutorial.html). You can write a C++ class containing some fast utilities that you use in your Python code when needed.
In my environment (Ubuntu 16.04, 4GB RAM, Python 3.5.2) the fastest method was (the prototypical1) csv_reader_5 (taken from U9-Forward's answer) which ran only less than 25% slower than reading the entire CSV file with no conversions. I improved that approach by implementing a filter/wrapper that replaces the char in the read() method:
class SingleCharReplacingFilter:

    def __init__(self, reader, oldchar, newchar):
        def proxy(obj, attr):
            a = getattr(obj, attr)
            if attr in ('read'):
                def f(*args):
                    return a(*args).replace(oldchar, newchar)
                return f
            else:
                return a
        for a in dir(reader):
            if not a.startswith("_") or a == '__iter__':
                setattr(self, a, proxy(reader, a))

def csv_reader_6(x):
    with x as fin:
        return pd.read_csv(SingleCharReplacingFilter(fin, ";", ","),
                           sep=',', header=None, usecols=[3, 4, 5])
The result is a little better performance compared to reading the entire CSV file with no conversions:
In [3]: %timeit pd.read_csv(StringIO(x))
605 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit csv_reader_5(StringIO(x))
733 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit csv_reader_6(StringIO(x))
568 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1 I call it prototypical because it assumes that the input stream is of StringIO type (since it calls .getvalue() on it).

python literal binary to hex conversion

I have a textfile containing a range of bits, in ascii:
cat myFile.txt
0101111011100011001...
I would like to write this to another file in binary mode, so that I can read it in a hex editor. How could I achieve that? I already tried to convert it with code like:
f2 = open(fileOut, 'wb')
byte = ''   # bits collected for the current byte
count = 0
with open(fileIn) as f:
    while True:
        c = f.read(1)
        byte = byte + str(c)
        if not c:
            print "End of file"
            break
        if count % 8 is 0:
            count = 0
            print hex(int(byte, 2))
            try:
                f2.write('\\x' + hex(int(byte, 2))[2:]).zfill(2)
            except:
                pass
            byte = ''
        count += 1
but that didn't achieve what I planned to do. Do you have any hint?
Reading and writing one byte at a time is painfully slow. You may get around ~45x speedup of your code simply by reading more data from the file per call to f.read and f.write:
|------------------+--------------------|
| using_loop_20480 | 8.34 msec per loop |
| using_loop_8 | 354 msec per loop |
|------------------+--------------------|
using_loop is the code shown at the bottom of this post. using_loop_20480 is the code with chunksize = 1024*20. This means that 20480 bytes are read from the file at a time. using_loop_8 is the same code with chunksize = 8.
Regarding count % 8 is 0: Don't use is to compare numerical values; use == instead. Here are some examples why is may give you wrong results (maybe not in the code you posted, but in general, is is not appropriate here):
In [5]: 1L is 1
Out[5]: False
In [6]: 1L == 1
Out[6]: True
In [7]: 0.0 is 0
Out[7]: False
In [8]: 0.0 == 0
Out[8]: True
Instead of
struct.pack('{n}B'.format(n = len(bytes)), *bytes)
you could use
bytearray(bytes)
Not only is it less typing, it is a slight bit faster too.
|------------------------------+--------------------|
| using_loop_20480 | 8.34 msec per loop |
| using_loop_with_struct_20480 | 8.59 msec per loop |
|------------------------------+--------------------|
bytearrays are a good match for this job because they bridge the gap between regarding the data as a string and as a sequence of numbers.
In [16]: bytearray([97,98,99])
Out[16]: bytearray(b'abc')
In [17]: print(bytearray([97,98,99]))
abc
As you can see above, bytearray(bytes) allows you to define the bytearray by passing it a sequence of ints (in range(256)), and allows you to write it out as though it were a string: g.write(bytearray(bytes)).
def using_loop(output, chunksize):
    with open(filename, 'r') as f, open(output, 'wb') as g:
        while True:
            chunk = f.read(chunksize)
            if chunk == '':
                break
            bytes = [int(chunk[i:i+8], 2)
                     for i in range(0, len(chunk), 8)]
            g.write(bytearray(bytes))
Make sure chunksize is a multiple of 8.
This is the code I used to create the tables. Note that prettytable also does something similar to this, and it may be advisable to use their code rather than my hack: table.py
This is the module I used to time the code: utils_timeit.py. (It uses table.py).
And here is the code I used to time using_loop (and other variants): timeit_bytearray_vs_struct.py
Use struct:
import struct
...
f2.write(struct.pack('b', int(byte,2))) # signed 8 bit int
or
f2.write(struct.pack('B', int(byte,2))) # unsigned 8 bit int
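For completeness, in Python 3 the same conversion can be done without struct via int.to_bytes; a minimal sketch that converts the whole ASCII bit string in one go (the output file name is a placeholder, and it assumes the number of bits is a multiple of 8):
with open('myFile.txt') as fin, open('out.bin', 'wb') as fout:
    bits = fin.read().strip()
    # Interpret the whole '0101...' string as one big integer, then emit it as bytes
    fout.write(int(bits, 2).to_bytes(len(bits) // 8, byteorder='big'))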
