I have a script that generates two-dimensional numpy arrays with dtype=float and shape on the order of (1e3, 1e6). Right now I'm using np.save and np.load to perform IO operations with the arrays. However, these functions take several seconds for each array. Are there faster methods for saving and loading the entire arrays (i.e., without making assumptions about their contents and reducing them)? I'm open to converting the arrays to another type before saving as long as the data are retained exactly.
For really big arrays, I've heard about several solutions, and they mostly on being lazy on the I/O :
NumPy.memmap, maps big arrays to binary form
Pros :
No dependency other than Numpy
Transparent replacement of ndarray (Any class accepting ndarray accepts memmap )
Cons :
Chunks of your array are limited to 2.5G
Still limited by Numpy throughput
Use Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py
Pros :
Format supports compression, indexing, and other super nice features
Apparently the ultimate PetaByte-large file format
Cons :
Learning curve of having a hierarchical format ?
Have to define what your performance needs are (see later)
Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)
Pros:
It's Pythonic ! (haha)
Supports all sorts of objects
Cons:
Probably slower than others (because aimed at any objects not arrays)
Numpy.memmap
From the docs of NumPy.memmap :
Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory
The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp , isinstance(fp, numpy.ndarray) returns True.
HDF5 arrays
From the h5py doc
Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
The format supports compression of data in various ways (more bits loaded for same I/O read), but this means that the data becomes less easy to query individually, but in your case (purely loading / dumping arrays) it might be efficient
I've compared a few methods using perfplot (one of my projects). Here are the results:
Writing
For large arrays, all methods are about equally fast. The file sizes are also equal which is to be expected since the input array are random doubles and hence hardly compressible.
Code to reproduce the plot:
from sys import version_info
import matplotlib.pyplot as plt
import perfplot
import pickle
import netCDF4
import numpy as np
import h5py
import tables
import zarr
def write_numpy(data):
np.save("out.npy", data)
def write_hdf5(data):
with h5py.File("out.h5", "w") as f:
f.create_dataset("data", data=data)
def write_netcdf(data):
with netCDF4.Dataset("out.nc", "w") as nc:
nc.createDimension("len_data", len(data))
ncdata = nc.createVariable(
"mydata",
"float64",
("len_data",),
)
ncdata[:] = data
def write_pickle(data):
with open("out.pkl", "wb") as f:
pickle.dump(data, f)
def write_pytables(data):
with tables.open_file("out-pytables.h5", mode="w") as f:
gcolumns = f.create_group(f.root, "columns", "data")
f.create_array(gcolumns, "data", data, "data")
def write_zarr_zarr(data):
zarr.save_array("out.zarr", data)
def write_zarr_zip(data):
zarr.save_array("out.zip", data)
def write_zarr_zarr_uncompressed(data):
zarr.save_array("out-uncompressed.zarr", data, compressor=None)
def write_zarr_zip_uncompressed(data):
zarr.save_array("out-uncompressed.zip", data)
def setup(n):
data = np.random.rand(n)
n[...] = data.nbytes
return data
b = perfplot.bench(
setup=setup,
kernels=[
write_numpy,
write_hdf5,
write_netcdf,
write_pickle,
write_pytables,
write_zarr_zarr,
write_zarr_zip,
write_zarr_zarr_uncompressed,
write_zarr_zip_uncompressed,
],
title="write comparison",
n_range=[2**k for k in range(28)],
xlabel="data.nbytes",
equality_check=None,
)
plt.text(
0.0,
-0.3,
", ".join(
[
f"Python {version_info.major}.{version_info.minor}.{version_info.micro}",
f"h5py {h5py.__version__}",
f"netCDF4 {netCDF4.__version__}",
f"NumPy {np.__version__}",
f"PyTables {tables.__version__}",
f"Zarr {zarr.__version__}",
]
),
transform=plt.gca().transAxes,
fontsize="x-small",
verticalalignment="top",
)
b.save("out-write.png")
b.show()
Reading
pickles, pytables and hdf5 are roughly equally fast; pickles and zarr are slower for large arrays.
Code to reproduce the plot:
import perfplot
import pickle
import numpy
import h5py
import tables
import zarr
def setup(n):
data = numpy.random.rand(n)
# write all files
#
numpy.save("out.npy", data)
#
f = h5py.File("out.h5", "w")
f.create_dataset("data", data=data)
f.close()
#
with open("test.pkl", "wb") as f:
pickle.dump(data, f)
#
f = tables.open_file("pytables.h5", mode="w")
gcolumns = f.create_group(f.root, "columns", "data")
f.create_array(gcolumns, "data", data, "data")
f.close()
#
zarr.save("out.zip", data)
def npy_read(data):
return numpy.load("out.npy")
def hdf5_read(data):
f = h5py.File("out.h5", "r")
out = f["data"][()]
f.close()
return out
def pickle_read(data):
with open("test.pkl", "rb") as f:
out = pickle.load(f)
return out
def pytables_read(data):
f = tables.open_file("pytables.h5", mode="r")
out = f.root.columns.data[()]
f.close()
return out
def zarr_read(data):
return zarr.load("out.zip")
b = perfplot.bench(
setup=setup,
kernels=[
npy_read,
hdf5_read,
pickle_read,
pytables_read,
zarr_read,
],
n_range=[2 ** k for k in range(27)],
xlabel="len(data)",
)
b.save("out2.png")
b.show()
Here is a comparison with PyTables.
I cannot get up to (int(1e3), int(1e6) due to memory restrictions.
Therefore, I used a smaller array:
data = np.random.random((int(1e3), int(1e5)))
NumPy save:
%timeit np.save('array.npy', data)
1 loops, best of 3: 4.26 s per loop
NumPy load:
%timeit data2 = np.load('array.npy')
1 loops, best of 3: 3.43 s per loop
PyTables writing:
%%timeit
with tables.open_file('array.tbl', 'w') as h5_file:
h5_file.create_array('/', 'data', data)
1 loops, best of 3: 4.16 s per loop
PyTables reading:
%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
data2 = h5_file.root.data.read()
1 loops, best of 3: 3.51 s per loop
The numbers are very similar. So no real gain wit PyTables here.
But we are pretty close to the maximum writing and reading rate of my SSD.
Writing:
Maximum write speed: 241.6 MB/s
PyTables write speed: 183.4 MB/s
Reading:
Maximum read speed: 250.2
PyTables read speed: 217.4
Compression does not really help due to the randomness of the data:
%%timeit
FILTERS = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
h5_file.create_carray('/', 'data', obj=data)
1 loops, best of 3: 4.08 s per loop
Reading of the compressed data becomes a bit slower:
%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
data2 = h5_file.root.data.read()
1 loops, best of 3: 4.01 s per loop
This is different for regular data:
reg_data = np.ones((int(1e3), int(1e5)))
Writing is significantly faster:
%%timeit
FILTERS = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
h5_file.create_carray('/', 'reg_data', obj=reg_data)
1 loops, best of 3: 849 ms per loop
The same holds true for reading:
%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
reg_data2 = h5_file.root.reg_data.read()
1 loops, best of 3: 1.7 s per loop
Conclusion: The more regular your data the faster it should get using PyTables.
According to my experience, np.save()&np.load() is the fastest solution when trasfering data between hard disk and memory so far.
I've heavily relied my data loading on database and HDFS system before I realized this conclusion.
My tests shows that:
The database data loading(from hard disk to memory) bandwidth could be around 50 MBps(Byets/Second), but the np.load() bandwidth is almost same as my hard disk maximum bandwidth: 2GBps(Byets/Second). Both test environments use the simplest data structure.
And I don't think it's a problem to use several seconds to loading an array with shape: (1e3, 1e6). E.g.
Your array shape is (1000, 1000000), its data type is float128, then the pure data size is (128/8)*1000*1,000,000=16,000,000,000=16GBytes
and if it takes 4 seconds,
Then your data loading bandwidth is 16GBytes/4Seconds = 4GBps.
SATA3 maximum bandwidth is 600MBps=0.6GBps, your data loading bandwidth is already 6 times of it, your data loading performance almost could compete with DDR's maximum bandwidth, what else do you want?
So my final conclusion is:
Don't use python's Pickle, don't use any database, don't use any big data system to store your data into hard disk, if you could use np.save() and np.load(). These two functions are the fastest solution to transfer data between harddisk and memory so far.
I've also tested the HDF5 , and found that it's much slower than np.load() and np.save(), so use np.save()&np.load() if you've enough DDR memory in your platfrom.
I created a benchmarking tool and produced a benchmark of the various loading/saving methods using python 3.10. I ran it on a fast NVMe (with >6GB/s transfer rate so the measurements here are not disk I/O bound). The size of the tested numpy array was varied from tiny to 32GB. The results can be seen here.
The github repo for the tool is here.
The results vary, and are affected by the array size; and some methods perform data compression so there's a tradeoff for those. Here is an idea of I/O rate (more results in via link above):
Legend (for the saves):
np: np.save(), npz: np.savez(), npzc: np.savez_compressed(), hdf5: h5py.File().create_dataset(), pickle: pickle.dump(), zarr_zip: zarr.save_array() w/ .zip extension, zarr_zip: zarr.save_array() w/ .zarr extension, pytables: tables.open_file().create_array(), fast_np: uses this answer.
I was surprised to see torch.load and torch.save were considered as optimal or near-optimal according to the benchmarks here, yet I find it quite slow for what it's supposed to do. So I gave it a try and came up with a much faster alternative: fastnumpyio
Running 100000 save/load iterations of a 3x64x64 float array (common scenario in computer vision) I achieved the following speedup over numpy.save and numpy.load (I suppose numpy.load is so slow because it has to parse text data first ?):
Windows 11, Python 3.9.5, Numpy 1.22.0, Intel Core i7-9750H:
numpy.save: 0:00:01.656569
fast_numpy_save: 0:00:00.398236
numpy.load: 0:00:16.281941
fast_numpy_load: 0:00:00.308100
Ubuntu 20.04, Python 3.9.7, Numpy 1.21.4, Intel Core i7-9750H:
numpy.save: 0:00:01.887152
fast_numpy_save: 0:00:00.745052
numpy.load: 0:00:16.368871
fast_numpy_load: 0:00:00.381135
macOS 12.0.1, Python 3.9.5, Numpy 1.21.2, Apple M1:
numpy.save: 0:00:01.268598
fast_numpy_save: 0:00:00.449448
numpy.load: 0:00:11.303569
fast_numpy_load: 0:00:00.318216
With larger arrays (3x512x512), fastnumpyio is still slightly faster for save and 2 times faster for load.
Related
Background
I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions, number of incoming and outgoing particles can vary, so the files consist of variable length records. For analysis I use python and numpy. After switching from python 2 to python 3 I have noticed a dramatic decrease in performance of my scripts and traced it down to numpy.fromfile function.
Simplified code to reproduce the problem
This code, iotest.py
Generates a file of a similar structure to what I have in my studies
Reads it using numpy.fromfile
Reads it using numpy.frombuffer
Compares timing of both
import numpy as np
import os
def generate_binary_file(filename, nrecords):
n_records = np.random.poisson(lam = nrecords)
record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
with open(filename, 'wb') as f:
s = 0
for i in range(n_records):
f.write(record_lengths[i].tobytes())
f.write(x[s:s+record_lengths[i]].tobytes())
s += record_lengths[i]
# Trick for testing: make sum of records equal to 0
f.write(np.array([1], dtype = 'i4').tobytes())
f.write(np.array([-x.sum()], dtype = 'd').tobytes())
return os.path.getsize(filename)
def read_binary_npfromfile(filename):
checksum = 0.0
with open(filename, 'rb') as f:
while True:
try:
record_length = np.fromfile(f, 'i4', 1)[0]
x = np.fromfile(f, 'd', record_length)
checksum += x.sum()
except:
break
assert(np.abs(checksum) < 1e-6)
def read_binary_npfrombuffer(filename):
checksum = 0.0
with open(filename, 'rb') as f:
while True:
try:
record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
checksum += x.sum()
except:
break
assert(np.abs(checksum) < 1e-6)
if __name__ == '__main__':
from timeit import Timer
from functools import partial
fname = 'testfile.tmp'
print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
fsize = generate_binary_file(fname, i)
t1 = Timer(partial(read_binary_npfromfile, fname))
t2 = Timer(partial(read_binary_npfrombuffer, fname))
a1 = np.array(t1.repeat(5, 1))
a2 = np.array(t2.repeat(5, 1))
print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
Results
Conclusions
In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.
In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.
In the documentation of numpy.fromfile it is described as "A highly efficient way of reading binary data with a known data-type". It is not correct in Python 3 anymore. This was in fact noticed earlier by other people already.
Questions
In Python 3 how to obtain a comparable (or better) performance to Python 2, when reading binary files of variable structure?
What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?
TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.
Analysis
The main issue is that the benchmark measure overheads. Indeed, it perform a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops calls 601_214 times np.fromfile and np.frombuffer. The timing on my machine are 10.5s for read_binary_npfromfile and 1.2s for read_binary_npfrombuffer. This means respectively 17.4 us and 2.0 us per call for the two function. Such timing per call are relatively reasonable considering Numpy is not designed to efficiently operate on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another and unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impact overheads and this appear to be the case here (eg. buffering interface). The point is that it is not really a problem because there is a way to use a different approach that is much much faster (as it does not pay huge overheads).
Faster Numpy code
The main solution to write a fast implementation is to read the whole file once in a big byte buffer and then decode it using np.view. That being said, this is a bit tricky because of data alignment and the fact that nearly all Numpy function needs to be prohibited in the while loop due to their overhead. Here is an example:
def read_binary_faster_numpy(filename):
buff = np.fromfile(filename, dtype=np.uint8)
buff_int32 = buff.view(np.int32)
buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
nblocks = buff.size // 4 # Number of 4-byte blocks
pos = 0 # Displacement by block of 4 bytes
lst = []
while pos < nblocks:
record_length = buff_int32[pos]
pos += 1
if pos + record_length * 2 > nblocks:
break
offset = pos // 2
if pos % 2 == 0: # Aligned with buff_double_1
x = buff_double_1[offset:offset+record_length]
else: # Aligned with buff_double_2
x = buff_double_2[offset:offset+record_length]
lst.append(x) # np.sum is too expensive here
pos += record_length * 2
checksum = np.sum(np.concatenate(lst))
assert(np.abs(checksum) < 1e-6)
The above implementation should be faster but it is a bit tricky to understand and it is still bounded by the latency of Numpy operations. Indeed, the loop is still calling Numpy functions due to operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overheads of indexing is much smaller than the one of previous functions, it is still quite big for such a critical loop (with ~300_000 iterations)...
Better performance with... a basic pure-Python code
It turns out that the following pure-python implementation is faster, safer and simpler:
from struct import unpack_from
def read_binary_python_struct(filename):
checksum = 0.0
with open(filename, 'rb') as f:
data = f.read()
offset = 0
while offset < len(data):
record_length = unpack_from('#i', data, offset)[0]
checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
offset += 4 + record_length * 8
assert(np.abs(checksum) < 1e-6)
This is because the overhead of unpack_from is far lower than the one of Numpy functions but it is still not great.
In fact, now the main issue is actually the CPython interpreter. It is clearly not designed with high-performance in mind. The above code push it to the limit. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive. This is not reasonable to let CPython do such an operation.
Writing a high-performance code with Numba
We can drastically speed it up using Numba which can compile Numpy-based Python codes to native ones using a just-in-time compiler! Here is an example:
#nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
checksum = 0.0
offset = 0
while offset + 4 < buff.size:
record_length = buff[offset:offset+4].view(np.int32)[0]
start = offset + 4
end = start + record_length * 8
if end > buff.size:
break
x = buff[start:end].view(np.float64)
checksum += x.sum()
offset = end
return checksum
def read_binary_numba(filename):
buff = np.fromfile(filename, dtype=np.uint8)
checksum = decode_buffer(buff)
assert(np.abs(checksum) < 1e-6)
Numba removes nearly all Numpy overheads thanks to a native compiled code. That being said note that Numba does not implement all Numpy functions yet. This include np.fromfile which need to be called outside a Numba-compiled function.
Benchmark
Here are the performance results on my machine (i5-9600KF with a high-performance Nvme SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.
read_binary_npfromfile: 10616 ms ( x1)
read_binary_npfrombuffer: 1132 ms ( x9)
read_binary_faster_numpy: 509 ms ( x21)
read_binary_python_struct: 222 ms ( x48)
read_binary_numba: 12 ms ( x885)
Optimal time: 7 ms (x1517)
One can see that the Numba implementation is extremely fast compared to the initial Python implementation and even to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
Calling MATLAB from Python is bound to give some performance reduction that I could avoid by rewriting (a lot of) code in Python. However, this isn't a realistic option for me, but it annoys me that a huge loss of efficiency lies in the simple conversion from a numpy array to a MATLAB double.
I'm talking about the following conversion from data1 to data1m, where
data1 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
data1m = matlab.double(list(data1))
Here matlab.double comes from Mathworks own MATLAB package / engine. The second line of code takes 20 s on my system, which just seems like too much for a conversion that doesn't really do anything other than making the numbers 'edible' for MATLAB.
So basically I'm looking for a trick opposite to the one given here that works for converting MATLAB output back to Python.
Passing numpy arrays efficiently
Take a look at the file mlarray_sequence.py in the folder PYTHONPATH\Lib\site-packages\matlab\_internal. There you will find the construction of the MATLAB array object. The performance problem comes from copying data with loops within the generic_flattening function.
To avoid this behavior we will edit the file a bit. This fix should work on complex and non-complex datatypes.
Make a backup of the original file in case something goes wrong.
Add import numpy as np to the other imports at the beginning of the file
In line 38 you should find:
init_dims = _get_size(initializer)
replace this with:
try:
init_dims=initializer.shape
except:
init_dims = _get_size(initializer)
In line 48 you should find:
if is_complex:
complex_array = flat(self, initializer,
init_dims, typecode)
self._real = complex_array['real']
self._imag = complex_array['imag']
else:
self._data = flat(self, initializer, init_dims, typecode)
Replace this with:
if is_complex:
try:
self._real = array.array(typecode,np.ravel(initializer, order='F').real)
self._imag = array.array(typecode,np.ravel(initializer, order='F').imag)
except:
complex_array = flat(self, initializer,init_dims, typecode)
self._real = complex_array['real']
self._imag = complex_array['imag']
else:
try:
self._data = array.array(typecode,np.ravel(initializer, order='F'))
except:
self._data = flat(self, initializer, init_dims, typecode)
Now you can pass a numpy array directly to the MATLAB array creation method.
data1 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
#faster
data1m = matlab.double(data1)
#or slower method
data1m = matlab.double(data1.tolist())
data2 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,)).astype(np.complex128)
#faster
data1m = matlab.double(data2,is_complex=True)
#or slower method
data1m = matlab.double(data2.tolist(),is_complex=True)
The performance in MATLAB array creation increases by a factor of 15 and the interface is easier to use now.
While awaiting better suggestions, I'll post the best trick I've come up with so far. It comes down to saving the file with `scipy.io.savemat´ and then loading this file in MATLAB.
This is not the prettiest hack and it requires some care to ensure different processes relying on the same script don't end up writing and loading each other's .mat files, but the performance gain is worth it for me.
As a test case I wrote two simple, almost identical MATLAB functions that require 2 numpy arrays (I tested with length 1000000) and one int as input.
function d = test(x, y, fs_signal)
d = sum((x + y))./double(fs_signal);
function d = test2(path)
load(path)
d = sum((x + y))./double(fs_signal);
The function test requires conversion, while test2 requires saving.
Testing test: Converting the two numpy arrays takes cirka 40 s on my system. The total time to prepare for and run test comes down to 170 s
Testing test2: Saving the arrays and int takes cirka 0.35 s on my system. Suprisingly, loading the .mat file in MATLAB is extremely efficient (or more suprisingly, it is extremely ineffcient at dealing with its doubles)... The total time to prepare for and run test2 comes down to 0.38 s
That's a performance gain of almost 450x...
My situation was a bit different (python script called from matlab) but for me converting the ndarray into an array.array massively speed up the process. Basically it is very similar to Alexandre Chabot solution but without the need to alter any files:
#untested i.e. only deducted from my "matlab calls python" situation
import numpy
import array
data1 = numpy.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
ar = array.array('d',data1.flatten('F').tolist())
p = matlab.double(ar)
C = matlab.reshape(p,data1.shape) #this part I am definitely not sure about if it will work like that
At least if done from Matlab the combination of "array.array" and "double" is relative fast. Tested with Matlab 2016b + python 3.5.4 64bit.
I'm currently having a small side project in which I want to sort a 20GB file on my machine as fast as possible. The idea is to chunk the file, sort the chunks, merge the chunks. I just used pyenv to time the radixsort code with different Python versions and saw that 2.7.18 is way faster than 3.6.10, 3.7.7, 3.8.3 and 3.9.0a. Can anybody explain why Python 3.x is slower than 2.7.18 in this simple example? Were there new features added?
import os
def chunk_data(filepath, prefixes):
"""
Pre-sort and chunk the content of filepath according to the prefixes.
Parameters
----------
filepath : str
Path to a text file which should get sorted. Each line contains
a string which has at least 2 characters and the first two
characters are guaranteed to be in prefixes
prefixes : List[str]
"""
prefix2file = {}
for prefix in prefixes:
chunk = os.path.abspath("radixsort_tmp/{:}.txt".format(prefix))
prefix2file[prefix] = open(chunk, "w")
# This is where most of the execution time is spent:
with open(filepath) as fp:
for line in fp:
prefix2file[line[:2]].write(line)
Execution times (multiple runs):
2.7.18: 192.2s, 220.3s, 225.8s
3.6.10: 302.5s
3.7.7: 308.5s
3.8.3: 279.8s, 279.7s (binary mode), 295.3s (binary mode), 307.7s, 380.6s (wtf?)
3.9.0a: 292.6s
The complete code is on Github, along with a minimal complete version
Unicode
Yes, I know that Python 3 and Python 2 deal different with strings. I tried opening the files in binary mode (rb / wb), see the "binary mode" comments. They are a tiny bit faster on a couple of runs. Still, Python 2.7 is WAY faster on all runs.
Try 1: Dictionary access
When I phrased this question, I thought that dictionary access might be a reason for this difference. However, I think the total execution time is way less for dictionary access than for I/O. Also, timeit did not show anything important:
import timeit
import numpy as np
durations = timeit.repeat(
'a["b"]',
repeat=10 ** 6,
number=1,
setup="a = {'b': 3, 'c': 4, 'd': 5}"
)
mul = 10 ** -7
print(
"mean = {:0.1f} * 10^-7, std={:0.1f} * 10^-7".format(
np.mean(durations) / mul,
np.std(durations) / mul
)
)
print("min = {:0.1f} * 10^-7".format(np.min(durations) / mul))
print("max = {:0.1f} * 10^-7".format(np.max(durations) / mul))
Try 2: Copy time
As a simplified experiment, I tried to copy the 20GB file:
cp via shell: 230s
Python 2.7.18: 237s, 249s
Python 3.8.3: 233s, 267s, 272s
The Python stuff is generated by the following code.
My first thought was that the variance is quite high. So this could be the reason. But then, the variance of chunk_data execution time is also high, but the mean is noticeably lower for Python 2.7 than for Python 3.x. So it seems not to be an I/O scenario as simple as I tried here.
import time
import sys
import os
version = sys.version_info
version = "{}.{}.{}".format(version.major, version.minor, version.micro)
if os.path.isfile("numbers-tmp.txt"):
os.remove("numers-tmp.txt")
t0 = time.time()
with open("numbers-large.txt") as fin, open("numers-tmp.txt", "w") as fout:
for line in fin:
fout.write(line)
t1 = time.time()
print("Python {}: {:0.0f}s".format(version, t1 - t0))
My System
Ubuntu 20.04
Thinkpad T460p
Python through pyenv
This is a combination of multiple effects, mostly the fact that Python 3 needs to perform unicode decoding/encoding when working in text mode and if working in binary mode it will send the data through dedicated buffered IO implementations.
First of all, using time.time to measure execution time uses the wall time and hence includes all sorts of Python unrelated things such as OS-level caching and buffering, as well as buffering of the storage medium. It also reflects any interference with other processes that require the storage medium. That's why you are seeing these wild variations in timing results. Here are the results for my system, from seven consecutive runs for each version:
py3 = [660.9, 659.9, 644.5, 639.5, 752.4, 648.7, 626.6] # 661.79 +/- 38.58
py2 = [635.3, 623.4, 612.4, 589.6, 633.1, 613.7, 603.4] # 615.84 +/- 15.09
Despite the large variation it seems that these results indeed indicate different timings as can be confirmed for example by a statistical test:
>>> from scipy.stats import ttest_ind
>>> ttest_ind(p2, p3)[1]
0.018729004515179636
i.e. there's only a 2% chance that the timings emerged from the same distribution.
We can get a more precise picture by measuring the process time rather than the wall time. In Python 2 this can be done via time.clock while Python 3.3+ offers time.process_time. These two functions report the following timings:
py3_process_time = [224.4, 226.2, 224.0, 226.0, 226.2, 223.7, 223.8] # 224.90 +/- 1.09
py2_process_time = [171.0, 171.1, 171.2, 171.3, 170.9, 171.2, 171.4] # 171.16 +/- 0.16
Now there's much less spread in the data since the timings reflect the Python process only.
This data suggests that Python 3 takes about 53.7 seconds longer to execute. Given the large amount of lines in the input file (550_000_000) this amounts to about 97.7 nanoseconds per iteration.
The first effect causing increased execution time are unicode strings in Python 3. The binary data is read from the file, decoded and then encoded again when it is written back. In Python 2 all strings are stored as binary strings right away, so this doesn't introduce any encoding/decoding overhead. You don't see this effect clearly in your tests because it disappears in the large variation introduced by various external resources which are reflected in the wall time difference. For example we can measure the time it takes for a roundtrip from binary to unicode to binary:
In [1]: %timeit b'000000000000000000000000000000000000'.decode().encode()
162 ns ± 2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
This does include two attribute lookups as well as two function calls, so the actual time needed is smaller than the value reported above. To see the effect on execution time, we can change the test script to use binary modes "rb" and "wb" instead of text modes "r" and "w". This reduces the timing results for Python 3 as follows:
py3_binary_mode = [200.6, 203.0, 207.2] # 203.60 +/- 2.73
That reduces the process time by about 21.3 seconds or 38.7 nanoseconds per iteration. This is in agreement with timing results for the roundtrip benchmark minus timing results for name lookups and function calls:
In [2]: class C:
...: def f(self): pass
...:
In [3]: x = C()
In [4]: %timeit x.f()
82.2 ns ± 0.882 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [5]: %timeit x
17.8 ns ± 0.0564 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
Here %timeit x measures the additional overhead of resolving the global name x and hence the attribute lookup and function call make 82.2 - 17.8 == 64.4 seconds. Subtracting this overhead twice from the above roundtrip data gives 162 - 2*64.4 == 33.2 seconds.
Now there's still a difference of 32.4 seconds between Python 3 using binary mode and Python 2. This comes from the fact that all the IO in Python 3 goes through the (quite complex) implementation of io.BufferedWriter .write while in Python 2 the file.write method proceeds fairly straightforward to fwrite.
We can check the types of the file objects in both implementations:
$ python3.8
>>> type(open('/tmp/test', 'wb'))
<class '_io.BufferedWriter'>
$ python2.7
>>> type(open('/tmp/test', 'wb'))
<type 'file'>
Here we also need to note that the above timing results for Python 2 have been obtained by using text mode, not binary mode. Binary mode aims to support all objects implementing the buffer protocol which results in additional work being performed also for strings (see also this question). If we switch to binary mode also for Python 2 then we obtain:
py2_binary_mode = [212.9, 213.9, 214.3] # 213.70 +/- 0.59
which is actually a bit larger than the Python 3 results (18.4 ns / iteration).
The two implementations also differ in other details such as the dict implementation. To measure this effect we can create a corresponding setup:
from __future__ import print_function
import timeit
N = 10**6
R = 7
results = timeit.repeat(
"d[b'10'].write",
setup="d = dict.fromkeys((str(i).encode() for i in range(10, 100)), open('test', 'rb'))", # requires file 'test' to exist
repeat=R, number=N
)
results = [x/N for x in results]
print(['{:.3e}'.format(x) for x in results])
print(sum(results) / R)
This gives the following results for Python 2 and Python 3:
Python 2: ~ 56.9 nanoseconds
Python 3: ~ 78.1 nanoseconds
This additional difference of about 21.2 nanoseconds amounts to about 12 seconds for the full 550M iterations.
The above timing code checks the dict lookup for only one key, so we also need to verify that there are no hash collisions:
$ python3.8 -c "print(len({str(i).encode() for i in range(10, 100)}))"
90
$ python2.7 -c "print len({str(i).encode() for i in range(10, 100)})"
90
Calling MATLAB from Python is bound to give some performance reduction that I could avoid by rewriting (a lot of) code in Python. However, this isn't a realistic option for me, but it annoys me that a huge loss of efficiency lies in the simple conversion from a numpy array to a MATLAB double.
I'm talking about the following conversion from data1 to data1m, where
data1 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
data1m = matlab.double(list(data1))
Here matlab.double comes from Mathworks own MATLAB package / engine. The second line of code takes 20 s on my system, which just seems like too much for a conversion that doesn't really do anything other than making the numbers 'edible' for MATLAB.
So basically I'm looking for a trick opposite to the one given here that works for converting MATLAB output back to Python.
Passing numpy arrays efficiently
Take a look at the file mlarray_sequence.py in the folder PYTHONPATH\Lib\site-packages\matlab\_internal. There you will find the construction of the MATLAB array object. The performance problem comes from copying data with loops within the generic_flattening function.
To avoid this behavior we will edit the file a bit. This fix should work on complex and non-complex datatypes.
Make a backup of the original file in case something goes wrong.
Add import numpy as np to the other imports at the beginning of the file
In line 38 you should find:
init_dims = _get_size(initializer)
replace this with:
try:
init_dims=initializer.shape
except:
init_dims = _get_size(initializer)
In line 48 you should find:
if is_complex:
complex_array = flat(self, initializer,
init_dims, typecode)
self._real = complex_array['real']
self._imag = complex_array['imag']
else:
self._data = flat(self, initializer, init_dims, typecode)
Replace this with:
if is_complex:
try:
self._real = array.array(typecode,np.ravel(initializer, order='F').real)
self._imag = array.array(typecode,np.ravel(initializer, order='F').imag)
except:
complex_array = flat(self, initializer,init_dims, typecode)
self._real = complex_array['real']
self._imag = complex_array['imag']
else:
try:
self._data = array.array(typecode,np.ravel(initializer, order='F'))
except:
self._data = flat(self, initializer, init_dims, typecode)
Now you can pass a numpy array directly to the MATLAB array creation method.
data1 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
#faster
data1m = matlab.double(data1)
#or slower method
data1m = matlab.double(data1.tolist())
data2 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,)).astype(np.complex128)
#faster
data1m = matlab.double(data2,is_complex=True)
#or slower method
data1m = matlab.double(data2.tolist(),is_complex=True)
The performance in MATLAB array creation increases by a factor of 15 and the interface is easier to use now.
While awaiting better suggestions, I'll post the best trick I've come up with so far. It comes down to saving the file with `scipy.io.savemat´ and then loading this file in MATLAB.
This is not the prettiest hack and it requires some care to ensure different processes relying on the same script don't end up writing and loading each other's .mat files, but the performance gain is worth it for me.
As a test case I wrote two simple, almost identical MATLAB functions that require 2 numpy arrays (I tested with length 1000000) and one int as input.
function d = test(x, y, fs_signal)
d = sum((x + y))./double(fs_signal);
function d = test2(path)
load(path)
d = sum((x + y))./double(fs_signal);
The function test requires conversion, while test2 requires saving.
Testing test: Converting the two numpy arrays takes cirka 40 s on my system. The total time to prepare for and run test comes down to 170 s
Testing test2: Saving the arrays and int takes cirka 0.35 s on my system. Suprisingly, loading the .mat file in MATLAB is extremely efficient (or more suprisingly, it is extremely ineffcient at dealing with its doubles)... The total time to prepare for and run test2 comes down to 0.38 s
That's a performance gain of almost 450x...
My situation was a bit different (python script called from matlab) but for me converting the ndarray into an array.array massively speed up the process. Basically it is very similar to Alexandre Chabot solution but without the need to alter any files:
#untested i.e. only deducted from my "matlab calls python" situation
import numpy
import array
data1 = numpy.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
ar = array.array('d',data1.flatten('F').tolist())
p = matlab.double(ar)
C = matlab.reshape(p,data1.shape) #this part I am definitely not sure about if it will work like that
At least if done from Matlab the combination of "array.array" and "double" is relative fast. Tested with Matlab 2016b + python 3.5.4 64bit.
I have python code where I have a continuous loop with a pickle load inside. I have 200 pickle files in the loop each about 80 MB each on an SSD drive.
When I ran the code I experienced that the performance of the pickle load fluctuates continuously: mostly at about 0,2s but at times it "pauses" for 4-6s debasing the overall benchmark of the process.
What could be the problem?
def unpickle(filename):
fo = open(filename, 'r')
contents = cPickle.load(fo)
fo.close()
return contents
for xd in self.X:
tt = time()
xdf = unpickle(xd)
tt = time() - tt
print tt
OUT:
1.87527704239
4.30886101723
0.259668111801
0.234542131424
0.228765964508
0.214528799057
0.213661909103
0.215914011002
0.217473983765
0.225739002228
The way I created pickle files:
I have a pandas DataFrame with the column: 'name','source','level','image','path','is_train'.
The main data regarding the size is the 'image'.
I pickle it with:
def pickle(filename, data):
with open(filename, 'w') as fo:
cPickle.dump(data, fo, protocol=cPickle.HIGHEST_PROTOCOL)
Your question is terribly unclear (in particular, you should be giving us enough information to reproduce your test case ourselves), but it feels like GC pauses or memory defragmentation.
Pickle is a terribly inefficient format, and processing 16 gigabytes' worth of it is bound to cause some serious memory thrashing.