Time cost differs severely when copying the same data in Python.
I have two numpy arrays of the same size, stored in two different objects. I copy the same data into both objects slice by slice. However, the first copy takes about 0.3 s while the second takes only about 0.01 s. See the following code and comments:
import numpy as np, time
data1 = np.zeros([32, 1024, 1024, 3], np.uint8)
data2 = np.zeros([32, 1024, 1024, 3], np.uint8)
src = np.load('C:/ffhq/npy/000000.npy', 'r') # <== src is of size [40000, 1024, 1024, 3]
indices = np.random.permutation(src.shape[0]) # <== randomly permutate the indices
s = 0
for t in range(400):
    e = s + 32
    tic1 = time.time()
    for k in range(s, e):
        data1[k - s] = src[indices[k]]
    toc1 = time.time()
    tic2 = time.time()
    for k in range(s, e):
        data2[k - s] = src[indices[k]]
    toc2 = time.time()
    diff = data2 - data1
    s += 32
    print(toc1 - tic1) # <== Here, time cost is ~= 0.3s
    print(toc2 - tic2) # <== Here, time cost is ~= 0.01s
    print(np.min(diff), np.max(diff))
    print('')
The only difference between the two copying lies in the order in code. So why?
The first copy loads everything into memory and perhaps into the CPU cache. The second copy then incurs much less latency because it doesn't need to read from disk.
Note you are using mmap_mode='r' in np.load() which means the "load" does not actually load any data beyond the NumPy header. If you want to load everything into memory at the start, just remove the 'r' argument entirely, and you'll have a more fair comparison.
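To see the difference between a memory-mapped load and a full load in isolation, here is a minimal sketch. It uses a small temporary file as a stand-in for the large dataset; the file name, the array shape and the helper timed_copy are made up for illustration. Because the file has just been written, its pages are probably already in the OS cache, so the gap between the two passes will likely be much smaller than in the question.

```python
import os
import tempfile
import time

import numpy as np

# Hypothetical small stand-in for the large .npy file in the question.
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.random.randint(0, 255, size=(64, 256, 256, 3), dtype=np.uint8))

lazy = np.load(path, mmap_mode="r")   # only the header is parsed here
eager = np.load(path)                 # the whole array is read into memory

def timed_copy(src, dst):
    """Copy src into dst row by row and return the elapsed time in seconds."""
    start = time.perf_counter()
    for k in range(src.shape[0]):
        dst[k] = src[k]
    return time.perf_counter() - start

buf = np.empty(lazy.shape, dtype=lazy.dtype)
first = timed_copy(lazy, buf)    # may include page faults and disk reads
second = timed_copy(lazy, buf)   # pages are now likely in the OS page cache
print(f"first pass: {first:.4f}s, second pass: {second:.4f}s")
```

Timing the copy from eager instead of lazy gives the "more fair" comparison, since all disk reads then happen inside np.load.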
I am trying to optimize the way trees are written in PyROOT and came across uproot. In the end, my application should write events (consisting of arrays), which are continuously coming in, to a tree.
The first approach is the classic way:
import array
import ROOT
event = [1., 2., 3.]
f = ROOT.TFile("my_tree.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
pt = array.array('f', [0.]*3)
tree.Branch("pt", pt, "pt[3]/F")
# for loop to simulate incoming events
for _ in range(10000):
    for i, element in enumerate(event):
        pt[i] = element
    tree.Fill()
tree.Print()
tree.Write("", ROOT.TObject.kOverwrite)
f.Close()
This gives the following Tree and execution time:
[Image: tree characteristics]
Trying to do it with uproot my code looks like this:
np_array = np.array([[1,2,3]])
ak_array = ak.from_numpy(np_array)
with uproot.recreate("testing.root", compression=None) as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})
which gives the following tree:
[Image: tree characteristics]
So the uproot method takes much longer, the file size is much bigger, and each event gets a separate basket. I tried different compression settings, but that did not change anything. Any idea how to optimize this? Is this even a sensible use case for uproot, and can the process of writing trees be sped up in comparison to the first way of doing it?
The extend method is supposed to write a new TBasket with each invocation. (See the documentation, especially the orange warning box. The purpose of that is so that you can control the TBasket sizes.) If you're calling it 10000 times to write 1 value (the value [1, 2, 3]) each, that's a maximally inefficient use.
Fundamentally, you're thinking about this problem in an entry-by-entry way, rather than in terms of columns, the way that scientific processing is normally done in Python. What you want to do instead is to collect a large dataset in memory and write it to the file in one chunk. If the data that you'll eventually be addressing is larger than the memory on your computer, you would do it in "large enough" chunks, which is probably on the order of hundreds of megabytes or gigabytes.
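For events that arrive one at a time, one way to get from entry-by-entry to chunked writing is to buffer them and only hand over large blocks. This is a rough sketch, not part of uproot: EventBuffer and the write_chunk callback are hypothetical names, and in a real application the callback could wrap a single extend call on the whole block.

```python
import numpy as np

class EventBuffer:
    """Collect fixed-length events and hand them on in large blocks.

    `write_chunk` is a hypothetical callback that receives a 2D array
    with one row per buffered event.
    """

    def __init__(self, event_len, chunk_size, write_chunk):
        self._block = np.empty((chunk_size, event_len), dtype=np.float64)
        self._count = 0
        self._write_chunk = write_chunk

    def add(self, event):
        self._block[self._count] = event
        self._count += 1
        if self._count == len(self._block):
            self.flush()

    def flush(self):
        if self._count:
            # Copy so the callback owns its data while the buffer is reused.
            self._write_chunk(self._block[:self._count].copy())
            self._count = 0

written = []  # stands in for the output file
buf = EventBuffer(event_len=3, chunk_size=4000, write_chunk=written.append)
for _ in range(10000):
    buf.add([1.0, 2.0, 3.0])
buf.flush()  # write the final, partially filled block
print([len(chunk) for chunk in written])  # [4000, 4000, 2000]
```

The chunk size here is tiny so the example runs instantly; for real data it would be chosen so that each block is as large as memory comfortably allows.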
For instance, starting with your example,
import time
import uproot
import numpy as np
import awkward as ak
np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)
starttime = time.time()
with uproot.recreate("bad.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})
print("Total time:", time.time() - starttime)
The total time (on my computer) is 1.9 seconds and the TTree characteristics are atrocious:
******************************************************************************
*Tree :tree : *
*Entries : 10000 : Total = 1170660 bytes File Size = 2970640 *
* : : Tree compression factor = 1.00 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 1170323 bytes File Size = 970000 *
*Baskets : 10000 : Basket Size= 32000 bytes Compression= 1.00 *
*............................................................................*
Instead, we want the data to be in a single array (or some loop that produces ~GB scale arrays):
np_array = np.array([[1, 2, 3]] * 10000)
(This isn't necessarily how you would get np_array, since * 10000 makes a large, intermediate Python list. Suffice to say, you get the data somehow.)
Now we do the write with a single call to extend, which makes a single TBasket:
np_array = np.array([[1, 2, 3]] * 10000)
ak_array = ak.from_numpy(np_array)
starttime = time.time()
with uproot.recreate("good.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    fout["tree"].extend({"branch": ak_array})
print("Total time:", time.time() - starttime)
The total time (on my computer) is 0.0020 seconds and the TTree characteristics are much better:
******************************************************************************
*Tree :tree : *
*Entries : 10000 : Total = 240913 bytes File Size = 3069 *
* : : Tree compression factor = 107.70 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 240576 bytes File Size = 2229 *
*Baskets : 1 : Basket Size= 32000 bytes Compression= 107.70 *
*............................................................................*
So, the writing is almost 1000× faster and the compression is 100× better. (With one entry per TBasket in the previous example, there was no compression because any compressed data would be bigger than the original!)
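The effect of block size on compression can be illustrated outside of ROOT. The following is only a sketch: it uses zlib from the Python standard library as a stand-in, which is an assumption made for illustration and not necessarily the codec or settings used for the files above. The helper name ratio is made up.

```python
import array
import zlib

# One entry ([1, 2, 3] as 64-bit integers) versus 10000 such entries.
one_entry = array.array("q", [1, 2, 3]).tobytes()
many_entries = array.array("q", [1, 2, 3] * 10000).tobytes()

def ratio(raw):
    """Uncompressed size divided by compressed size."""
    return len(raw) / len(zlib.compress(raw))

ratio_small = ratio(one_entry)
ratio_large = ratio(many_entries)
print(f"1 entry: {ratio_small:.2f}x, 10000 entries: {ratio_large:.2f}x")
```

A compressor needs some fixed header bytes and enough repetition to exploit, so a 24-byte block has almost nothing to gain, while the large block compresses very well.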
By comparison, if we do entry-by-entry writing with PyROOT,
import time
import array
import ROOT
data = [1, 2, 3]
holder = array.array("q", [0]*3)
file = ROOT.TFile("pyroot.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
tree.Branch("branch", holder, "branch[3]/L")
starttime = time.time()
for _ in range(10000):
    for i, x in enumerate(data):
        holder[i] = x
    tree.Fill()
tree.Write("", ROOT.TObject.kOverwrite)
file.Close()
print("Total time:", time.time() - starttime)
The total time (on my computer) is 0.062 seconds and the TTree characteristics are fine:
******************************************************************************
*Tree :tree : An Example Tree *
*Entries : 10000 : Total = 241446 bytes File Size = 3521 *
* : : Tree compression factor = 78.01 *
******************************************************************************
*Br 0 :branch : branch[3]/L *
*Entries : 10000 : Total Size= 241087 bytes File Size = 3084 *
*Baskets : 8 : Basket Size= 32000 bytes Compression= 78.01 *
*............................................................................*
So, PyROOT is 30× slower here, but the compression is almost as good. ROOT decided to make 8 TBaskets, which is configurable with AutoFlush parameters.
Keep in mind, though, that this is a comparison of techniques, not libraries. If you wrap a NumPy array with RDataFrame and write that, then you can skip all of the overhead involved in the Python for loop and you get the advantages of columnar processing.
But columnar processing only matters if you're working with big data. Much like compression, if you apply it to very small datasets (or a very small dataset many times), then it can hurt, rather than help.
Background
I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions, number of incoming and outgoing particles can vary, so the files consist of variable length records. For analysis I use python and numpy. After switching from python 2 to python 3 I have noticed a dramatic decrease in performance of my scripts and traced it down to numpy.fromfile function.
Simplified code to reproduce the problem
This code, iotest.py
Generates a file of a similar structure to what I have in my studies
Reads it using numpy.fromfile
Reads it using numpy.frombuffer
Compares timing of both
import numpy as np
import os
def generate_binary_file(filename, nrecords):
    n_records = np.random.poisson(lam = nrecords)
    record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
    x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
    with open(filename, 'wb') as f:
        s = 0
        for i in range(n_records):
            f.write(record_lengths[i].tobytes())
            f.write(x[s:s+record_lengths[i]].tobytes())
            s += record_lengths[i]
        # Trick for testing: make sum of records equal to 0
        f.write(np.array([1], dtype = 'i4').tobytes())
        f.write(np.array([-x.sum()], dtype = 'd').tobytes())
    return os.path.getsize(filename)
def read_binary_npfromfile(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.fromfile(f, 'i4', 1)[0]
                x = np.fromfile(f, 'd', record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)
def read_binary_npfrombuffer(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        while True:
            try:
                record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
                x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
                checksum += x.sum()
            except:
                break
    assert(np.abs(checksum) < 1e-6)
if __name__ == '__main__':
    from timeit import Timer
    from functools import partial
    fname = 'testfile.tmp'
    print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
    for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
        fsize = generate_binary_file(fname, i)
        t1 = Timer(partial(read_binary_npfromfile, fname))
        t2 = Timer(partial(read_binary_npfrombuffer, fname))
        a1 = np.array(t1.repeat(5, 1))
        a2 = np.array(t2.repeat(5, 1))
        print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
Results
Conclusions
In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.
In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.
In the documentation of numpy.fromfile it is described as "A highly efficient way of reading binary data with a known data-type". It is not correct in Python 3 anymore. This was in fact noticed earlier by other people already.
Questions
In Python 3, how can I obtain performance comparable to (or better than) Python 2 when reading binary files of variable structure?
What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?
TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.
Analysis
The main issue is that the benchmark measures overheads. Indeed, it performs a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops call np.fromfile and np.frombuffer 601_214 times. The timings on my machine are 10.5 s for read_binary_npfromfile and 1.2 s for read_binary_npfrombuffer. This means respectively 17.4 us and 2.0 us per call for the two functions. Such per-call timings are relatively reasonable considering Numpy is not designed to operate efficiently on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another and, unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impacts overheads, and this appears to be the case here (e.g. the buffering interface). The point is that it is not really a problem, because a different approach is much, much faster (as it does not pay huge overheads).
Faster Numpy code
The main way to write a fast implementation is to read the whole file once into a big byte buffer and then decode it using ndarray.view. That being said, this is a bit tricky because of data alignment and the fact that nearly all Numpy functions need to be avoided in the while loop due to their overhead. Here is an example:
def read_binary_faster_numpy(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    buff_int32 = buff.view(np.int32)
    buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
    buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
    nblocks = buff.size // 4 # Number of 4-byte blocks
    pos = 0 # Displacement by block of 4 bytes
    lst = []
    while pos < nblocks:
        record_length = buff_int32[pos]
        pos += 1
        if pos + record_length * 2 > nblocks:
            break
        offset = pos // 2
        if pos % 2 == 0: # Aligned with buff_double_1
            x = buff_double_1[offset:offset+record_length]
        else: # Aligned with buff_double_2
            x = buff_double_2[offset:offset+record_length]
        lst.append(x) # np.sum is too expensive here
        pos += record_length * 2
    checksum = np.sum(np.concatenate(lst))
    assert(np.abs(checksum) < 1e-6)
The above implementation should be faster, but it is a bit tricky to understand and it is still bounded by the latency of Numpy operations. Indeed, the loop is still calling Numpy functions due to operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overhead of indexing is much smaller than that of the previous functions, it is still quite big for such a critical loop (with ~300_000 iterations)...
Better performance with... a basic pure-Python code
It turns out that the following pure-python implementation is faster, safer and simpler:
from struct import unpack_from
def read_binary_python_struct(filename):
    checksum = 0.0
    with open(filename, 'rb') as f:
        data = f.read()
        offset = 0
        while offset < len(data):
            record_length = unpack_from('@i', data, offset)[0]
            checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
            offset += 4 + record_length * 8
    assert(np.abs(checksum) < 1e-6)
This is because the overhead of unpack_from is far lower than that of Numpy functions, but it is still not great.
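To make the record layout concrete, here is a small self-contained round trip using only the struct module. It is a sketch under one assumption: the data is little-endian without padding (the '<' prefix), whereas the benchmark file is written in the machine's native byte order. The names encode_records and decode_records are made up for this illustration.

```python
import struct

def encode_records(records):
    """Serialize each record as an int32 length followed by that many float64s."""
    out = bytearray()
    for values in records:
        out += struct.pack("<i", len(values))
        out += struct.pack(f"<{len(values)}d", *values)
    return bytes(out)

def decode_records(data):
    """Walk the buffer once, yielding each record as a tuple of floats."""
    offset = 0
    while offset < len(data):
        (n,) = struct.unpack_from("<i", data, offset)
        yield struct.unpack_from(f"<{n}d", data, offset + 4)
        offset += 4 + 8 * n

blob = encode_records([[1.0, 2.0], [], [3.5]])
print(list(decode_records(blob)))  # [(1.0, 2.0), (), (3.5,)]
```

Each record costs two unpack_from calls and one tuple allocation, which is where the remaining per-record overhead comes from.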
In fact, the main issue now is the CPython interpreter. It is clearly not designed with high performance in mind. The above code pushes it to the limit. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive. It is not reasonable to let CPython do such an operation.
Writing a high-performance code with Numba
We can drastically speed it up using Numba which can compile Numpy-based Python codes to native ones using a just-in-time compiler! Here is an example:
import numba as nb
@nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
    checksum = 0.0
    offset = 0
    while offset + 4 < buff.size:
        record_length = buff[offset:offset+4].view(np.int32)[0]
        start = offset + 4
        end = start + record_length * 8
        if end > buff.size:
            break
        x = buff[start:end].view(np.float64)
        checksum += x.sum()
        offset = end
    return checksum
def read_binary_numba(filename):
    buff = np.fromfile(filename, dtype=np.uint8)
    checksum = decode_buffer(buff)
    assert(np.abs(checksum) < 1e-6)
Numba removes nearly all Numpy overheads thanks to natively compiled code. That being said, note that Numba does not implement all Numpy functions yet. This includes np.fromfile, which needs to be called outside a Numba-compiled function.
Benchmark
Here are the performance results on my machine (i5-9600KF with a high-performance Nvme SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.
read_binary_npfromfile: 10616 ms ( x1)
read_binary_npfrombuffer: 1132 ms ( x9)
read_binary_faster_numpy: 509 ms ( x21)
read_binary_python_struct: 222 ms ( x48)
read_binary_numba: 12 ms ( x885)
Optimal time: 7 ms (x1517)
One can see that the Numba implementation is extremely fast compared to the initial Python implementation and even to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
I have a function compiled with Numba that splits an array based on an index; it returns an irregular (variable-length) list of numpy arrays. This list then gets padded to form a 2D array.
Problem
The compiled function 'nb_array2mat' should be much faster than the pure Python 'array2mat', but it is not.
Additionally, is this possible using numpy?
length of the array and index
1456391 95007
times:
numba: 1.3438396453857422
python: 1.1407015323638916
I think I am not using the Numba compilation properly. Any help would be great.
EDIT
Using dummy data, as edited in the code section, I now get a speed-up. Why does it not work with the actual data?
length of the array and index
1456391 95007
times:
numba: 0.012002706527709961
python: 0.13403034210205078
Code
idx_split: https://drive.google.com/file/d/1hSduTs1_s3seEFAiyk_n5yk36ZBl0AXW/view?usp=sharing
dist_min_orto: https://drive.google.com/file/d/1fwarVmBa0NGbWPifBEezTzjEZSrHncSN/view?usp=sharing
import time
import numba
import numpy as np
from numba.pycc import CC
cc = CC('compile_func')
cc.verbose = True
@numba.njit(parallel=True, fastmath=True)
@cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in numba.prange(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in numba.prange(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat
if __name__ == "__main__":
    cc.compile()
# PYTHON FUNC
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in range(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in range(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat
import compile_func
#ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(float)
idx = np.load('idx_split.npy').astype(int)
# DUMMY DATA
arr = np.random.randint(50, size=1456391).astype(float)
idx = np.cumsum(np.random.randint(5, size=95007).astype(int))
print(len(arr), len(idx))
#NUMBA FUNC
t0 = time.time()
print(compile_func.nb_array2mat(arr, idx))
print(time.time() - t0)
# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)
You cannot use nb.prange on the first loop since out is shared between threads and is also read and written by them. This causes a race condition. Numba assumes that there are no dependencies between iterations, and it is your responsibility to guarantee this. The simplest solution is not to use a parallel loop here.
Additionally, the second loop is mainly memory-bound, so I do not expect a big speed-up from using multiple threads, since the RAM is a shared resource with limited throughput (a few threads are often enough to saturate it, especially on PCs, where sometimes one thread is enough).
Fortunately, you do not need to create the temporary out list, just the end offsets, which are then used to compute len_col in the parallel loop. The maximum cols can be computed on the fly in the first loop. The first loop should execute very quickly compared to the second loop. Filling a big, newly allocated matrix is often faster in parallel on Linux since page faults can be handled in parallel. AFAIK, on Windows this is less true (certainly since page faults scale worse). This is also better here since the range 0:len_col is variable, and thus the time to fill this part of the matrix is variable, causing some threads to finish after others (the slowest thread bounds the execution). Furthermore, this is generally much faster on NUMA machines since each NUMA node can write to its own memory.
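Regarding the question of whether this is possible with Numpy alone: the same idea of working only from the end offsets can be expressed without a Python loop by using a boolean mask. This is a sketch, not a tuned solution. It assumes idx holds non-decreasing end offsets starting from 0, the function name array2mat_numpy is made up, and the mask is as large as the output matrix, so it costs extra memory on big inputs.

```python
import numpy as np

def array2mat_numpy(arr, idx, fill_value=1000000.0):
    """Pad the segments arr[idx[i-1]:idx[i]] into a 2D array without a Python loop."""
    idx = np.asarray(idx)
    lengths = np.diff(idx, prepend=0)          # segment lengths from end offsets
    mat = np.full((len(idx), lengths.max()), fill_value)
    # mask[r, c] is True where row r holds a real value in column c
    mask = np.arange(mat.shape[1]) < lengths[:, None]
    mat[mask] = arr[:idx[-1]]                  # row-major fill matches segment order
    return mat

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
idx = np.array([2, 3, 6])
print(array2mat_numpy(arr, idx))
```

With the small input above, the rows are [1, 2], [3] and [4, 5, 6], each padded to three columns with the fill value.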
Note that AOT compilation does not support automatic parallel execution. To quote a Numba developer:
From discussion in today's triage meeting, related to #7696: this is not likely to be supported as AOT code doesn't require Numba to be installed - this would mean a great deal of work and issues to overcome for packaging the code for the threading layers.
The same applies to fastmath, although it is likely to be added in an upcoming release given the current work.
Note that JIT compilation and AOT compilation are two separate processes. Thus the parameters of njit are not shared with cc.export, and the signature is not shared with njit. This means that the function will be compiled during its first execution due to lazy compilation. That being said, the function is redefined, so the njit is simply useless here (overwritten).
Here is the resulting code (using only the JIT implementation with an eager compilation instead of the AOT one):
import time
import numba
import numpy as np
@numba.njit('f8[:,:](f8[:], i4[:])', fastmath=True)
def nb_array2mat(arr, idx):
    # split arr by idx indexes
    s = 0
    ends = np.empty(len(idx), dtype=np.int_)
    cols = 0
    for n in range(len(idx)):
        e = idx[n]
        ends[n] = e
        len_col = e - s
        cols = max(cols, len_col)
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    rows = len(idx)
    mat = np.empty(shape=(rows, cols))
    # Note: prange only runs in parallel if the decorator also sets parallel=True
    for row in numba.prange(rows):
        s = ends[row-1] if row >= 1 else 0
        e = ends[row]
        len_col = e - s
        mat[row, 0:len_col] = arr[s:e]
        mat[row, len_col:cols] = 1000000.0
    return mat
# PYTHON FUNC
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in range(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2d array with arr values, padding empty values with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in range(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat
#ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(np.float64)
idx = np.load('idx_split.npy').astype(np.int32)
#NUMBA FUNC
t0 = time.time()
print(nb_array2mat(arr, idx))
print(time.time() - t0)
# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)
On my machine, the new Numba code is slightly faster: it takes 0.358 seconds for the Numba implementation and 0.418 seconds for the Python implementation. In fact, sequential Numba code is even slightly faster on my machine, as it takes 0.344 seconds.
Note that the shape of the output matrix is (95007, 5469). Thus, the matrix takes 3.87 GiB in memory. You should check that you have enough memory to store it. In fact, the Python implementation takes about 7.5 GiB on my machine (possibly because the GC/default allocator does not release the memory directly). If you do not have enough memory, then the system can use the very slow swap memory (which uses your storage device). Moreover, x86-64 processors use a write-allocate cache policy, causing written cache lines to actually be read by default. Non-temporal writes can be used to avoid this on a big matrix. Unfortunately, neither Numpy nor Numba uses them on my machine. This means half the RAM throughput is wasted. Not to mention that page faults are pretty expensive: in sequential, 60% of the time of the Numpy implementation is spent in page faults. The Numba code spends almost all its time writing to memory and handling page faults. Here is a related open issue.
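The memory figure follows directly from the shape, assuming 8-byte float64 values (the default dtype of np.empty and np.full with a float fill value):

```python
rows, cols = 95007, 5469
bytes_per_value = 8  # float64

size_bytes = rows * cols * bytes_per_value
size_gib = size_bytes / 2**30
print(f"{size_bytes} bytes = {size_gib:.2f} GiB")  # 4156746264 bytes = 3.87 GiB
```

Running the same calculation on your own shape before allocating is a cheap way to check whether the padded matrix fits in RAM at all.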
Based on @Jérôme Richard's answer, I wrote the same function. The improvement was in the way the mat numpy array is created: as the previous answer stated, np.full takes much longer because of the size of the array in memory, so the solution was to initialize it with np.empty.
The improvement between Python and Numba is not large, but the size of the mat array has a big impact on processing time.
1456391 95007
python: 0.29506611824035645
numba: 0.1800403594970703
Code
@cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')
def nb_array2mat(arr, idx):
    s = 0
    _len = np.empty(len(idx), dtype=np.int_)
    _len[0] = idx[0]
    _len[1:] = idx[1:] - idx[:-1]
    # create a 2d array
    cols = int(np.max(_len))
    rows = len(idx)
    # np.empty does not initialize memory: cells beyond len_col keep arbitrary values
    mat = np.empty(shape=(rows, cols), dtype=np.float64)
    for row in range(len(idx)):
        e = idx[row]
        len_col = _len[row]
        mat[row, :len_col] = arr[s:e]
        s = e
    return mat
I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda
lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True
@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
    output = (float)(0.0)
    multconst = (float)(.99)
    output = np.sum(np.multiply(testarray, multconst))
    return output
@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(lengthX):
            output += multconst*testarray[y,x]
    return output
@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(int(lengthX/4)):
            xn = x*4
            output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
    return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)
# ==== MAC_numpy ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_numpy(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
    start = timer()
    output = MAC_01(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_04 ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_04(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
Q : How can I get better results?
1st : Learn how to avoid doing useless work - you can straight eliminate HALF of the FLOP-s not speaking about also the half of all the RAM-I/O-s avoided, each one being at a cost of +100~350 [ns] per writeback
Due to the distributive nature of MUL and ADD ( a.C + b.C ) == ( a + b ).C, it is better to first np.sum( A ) and only after that MUL the sum by the (float) constant.
# output = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum( testarray)*multconst #######################
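A quick check that the two orderings agree, as a sketch with the same array shape as in the question (the seed and variable names are arbitrary). The results are equal only up to floating-point rounding, since the operations are performed in a different order.

```python
import numpy as np

rng = np.random.default_rng(0)
testarray = rng.random((16, 2**16))
multconst = 0.99

# One multiply per element plus a temporary array of the same size.
scaled_then_summed = np.sum(np.multiply(testarray, multconst))
# A single multiply on the final scalar, no temporary array.
summed_then_scaled = np.sum(testarray) * multconst

print(np.isclose(scaled_then_summed, summed_then_scaled))  # True
```

Note that after this rewrite the kernel performs roughly half the floating-point operations, so a GFLOPS figure computed with totalops = lengthY * lengthX * 2 would no longer describe the work actually done.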
2nd : Learn how to best align data along the order of processing ( cache-line reuse gets you ~100x faster re-use of pre-fetched data ). Not aligning vectorised code along these already pre-fetched data just lets your code pay the RAM-access latencies many times over, instead of smartly re-using the already paid-for data blocks. Designing work-units aligned according to this principle means a few SLOCs more, but the rewards are worth it - who gets ~100x faster CPUs+RAMs for free and right now, or a ~100x speedup for free, just from not writing badly or naively designed looping iterators?
3rd : Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code-blocks and avoid pressing numba to spend time on auto-analysing the call-signatures ( you pay an extra time for this auto-analyses per call, while you have designed the code and knew exactly what data-types are going to go there, so why to pay an extra time for auto-analysis each time a numba-block gets called???)
4th : Learn where the extended Amdahl's Law, having all the relevant add-on costs and processing atomicity put into the game, supports your wish to get speedups, not to ever pay way more than you will get back (to at least justify the add-on costs... ) - paying extra costs for not getting any reward is possible, yet has no beneficial impact on your code's performance ( rather the opposite )
5th : Learn when and how manually created inline(s) may save your code, once steps 1-4 are well learnt and routinely exercised with proper craftsmanship ( Using popular COTS frameworks is fine, yet these may deliver results after a few days of work, while a hand-crafted, single-purpose, smartly designed assembly code was able to get the same results in about 12 minutes(!), not several days, without any GPU/CPU tricks etc - yes, that much faster - just by not doing a single step more than what was needed for the numerical processing of the large matrix data )
Did I mention float32 may surprise at being processed slower on small scales than float64, while on larger data-scales ~ n [GB] the RAM I/O-times grow slower for more efficient float32 pre-fetches? This never happens here, as a float64 array gets processed here. Sure, unless one explicitly instructs the constructor(s) to downconvert the default data type, like this: np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )
>>> np.random.rand( 10, 2 ).dtype
dtype('float64')
Avoiding extensive memory allocations is another performance trick, supported in numpy call-signatures. Using this option for large arrays will save you a lot of extra time wasted on mem-allocs for large interim arrays. Reusing already pre-allocated memory-zones and wisely controlled gc-policing are other signs of a professional, focused on low-latency & design-for-performance
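One way the call signatures support this is the out= argument that Numpy ufuncs such as np.multiply accept. The following is a small sketch, assuming a float32 array of the same shape as in the question; the names scratch and scaled_sum are made up for illustration.

```python
import numpy as np

testarray = np.random.rand(16, 2**16).astype(np.float32)
scratch = np.empty_like(testarray)   # allocated once, reused on every call

def scaled_sum(a, c, out):
    """Multiply a by c into a pre-allocated buffer, then sum the buffer."""
    np.multiply(a, c, out=out)       # writes into `out`, no new array is created
    return out.sum()

result = scaled_sum(testarray, np.float32(0.99), scratch)
print(result, scratch.dtype)
```

The buffer only pays off when the function is called repeatedly with arrays of the same shape and dtype; for a single call the allocation is simply moved elsewhere.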
I have binary data files in the multi-GB range that I am memory mapping with numpy. The start of each data packet contains a BCD timestamp, in which each hex digit encodes one decimal digit of the time format 0DDD:HH:MM:SS.ssssss. I need this timestamp turned into total seconds of the current year.
Example:
The first time stamp 0x0261 1511 2604 6002 would be: 261:15:11:26.046002 or
261*86400 + 15*3600 + 11*60 + 26.046002 = 22605086.046002
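The same conversion for a single timestamp, written in plain Python so each step is visible. This is a sketch assuming the 8-byte layout 0DDD HHMM SSss ssss from the example; the function name is made up.

```python
def bcd_timestamp_to_seconds(stamp: bytes) -> float:
    """Convert 8 BCD bytes laid out as 0DDD HHMM SSss ssss to seconds of the year."""
    # Each byte holds two decimal digits: high nibble = tens, low nibble = ones.
    pairs = [(b >> 4) * 10 + (b & 0x0F) for b in stamp]
    days = pairs[0] * 100 + pairs[1]
    hours, minutes, seconds = pairs[2], pairs[3], pairs[4]
    fraction = pairs[5] / 1e2 + pairs[6] / 1e4 + pairs[7] / 1e6
    return days * 86400 + hours * 3600 + minutes * 60 + seconds + fraction

example = bcd_timestamp_to_seconds(bytes.fromhex("0261151126046002"))
print(example)
```

This scalar version is only useful as a reference for checking a vectorised implementation against a handful of frames; it is far too slow for millions of packets.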
Currently I am doing this to compute the timestamps:
import numpy as np
rawData = np.memmap('dataFile.bin',dtype='u1',mode='r')
#findFrameStart returns the index to the start of each data packet [0,384,768,...]
fidx = findFrameStart(rawData)
# Do lots of bit shifting and multiplying and type casting....
day1 = ((rawData[fidx ]>>4)*10 + (rawData[fidx ]&0x0F)).astype('f8')
day2 = ((rawData[fidx+1]>>4)*10 + (rawData[fidx+1]&0x0F)).astype('f8')
hour = ((rawData[fidx+2]>>4)*10 + (rawData[fidx+2]&0x0F)).astype('f8')
mins = ((rawData[fidx+3]>>4)*10 + (rawData[fidx+3]&0x0F)).astype('f8')
sec1 = ((rawData[fidx+4]>>4)*10 + (rawData[fidx+4]&0x0F)).astype('f8')
sec2 = ((rawData[fidx+5]>>4)*10 + (rawData[fidx+5]&0x0F)).astype('f8')
sec3 = ((rawData[fidx+6]>>4)*10 + (rawData[fidx+6]&0x0F)).astype('f8')
sec4 = ((rawData[fidx+7]>>4)*10 + (rawData[fidx+7]&0x0F)).astype('f8')
time = (day1*100+day2)*86400 + hour*3600 + mins*60 + sec1 + sec2/100 + sec3/10000 + sec4/1000000
Note I had to cast each of the intermediate vars (day1, day2, etc.) to double to get the time to compute correctly.
Given that there are lots of frames, fidx can get kind of large (~10e6 elements or more). This results in lots of math operations, bit shifts, casting, etc. in my current method. So far it is working OK on a smaller test file (~180ms on a 150MB data file). However, I am worried about when I hit some larger data(4-5GB) there might be memory issues with all of the intermediate arrays.
So if possible I was looking for a different method that might shortcut some of the overhead. The BCD to decimal operations are similar for each byte so it seems I should maybe be able to iterate over something and maybe convert an array in place ... at least reducing the memory footprint.
Any help would be appreciated. FYI, I am using Python 3.7
I made the following adjustments to my code. This modifies the time array in place and removes the need for all of the intermediate arrays. I haven't timed the result, but it should require less memory.
time = np.zeros(fidx.shape,dtype='f8')
scale = np.array([8640000, 86400, 3600, 60, 1, .01, .0001, .000001],dtype='f8')
for ii,sf in enumerate(scale):
    time += ((rawData[fidx+ii]>>4)*10 + (rawData[fidx+ii]&0x0F))*sf
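A further variation, offered as an untimed sketch: since a byte has only 256 possible values, the BCD-to-decimal step can be replaced by a lookup table, which removes the shift and mask temporaries in each iteration. The names BCD_LUT, SCALE and bcd_times, as well as the two 16-byte test frames, are made up for illustration.

```python
import numpy as np

# 256-entry table: BCD byte value -> decimal value (e.g. 0x26 -> 26).
byte_values = np.arange(256)
BCD_LUT = ((byte_values >> 4) * 10 + (byte_values & 0x0F)).astype("f8")
SCALE = np.array([8640000, 86400, 3600, 60, 1, 1e-2, 1e-4, 1e-6], dtype="f8")

def bcd_times(raw, fidx):
    """Seconds-of-year for the 8-byte BCD timestamp starting at each index in fidx."""
    out = np.zeros(fidx.shape, dtype="f8")
    for ii, sf in enumerate(SCALE):
        out += BCD_LUT[raw[fidx + ii]] * sf   # accumulate in place
    return out

# Two made-up 16-byte frames that both start with the example timestamp.
frame = bytes.fromhex("0261151126046002") + bytes(8)
raw = np.frombuffer(frame * 2, dtype="u1")
times = bcd_times(raw, np.array([0, 16]))
print(times)
```

Each iteration still creates one temporary array of len(fidx) elements; if that is too much for the largest files, fidx can be processed in slices of a few million indices at a time.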