I've been working on a project for a while now that requires calculating some very large datasets, and very quickly have moved beyond anything that my meager Excel knowledge could handle. In the last few days I've started learning Python, which has helped with handling the size of data I'm dealing with, but the estimated processing time for these datasets is looking to be incredibly long (possibly a couple hundred years on my laptop).
The bottleneck here is an equation that could produce trillions or quadrillions of results, since it is calculating every combination of 6 different lists and running it through an equation that you'll see in the code. The code works just fine, as is, but is isn't feasible for larger datasets than the example I included. A real dataset would be something more like Set1S, 2S, and 3S being 50 items each, and Sets12A...being about 2500 items each (50x50 in this case. These sets always have a length equal to the square of the first 3 lists, but I'm keeping things short and simple here.).
I'm well aware that the amount of results is absolutely huge, but want to start with as large a dataset as I can, so I can see how much I can reduce the input sizes without greatly impacting the results when I plot a cumulative% histogram.
'Calculator'
import numpy as np
Set1S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set2S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set3S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set12A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set23A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set13A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
'Define an empty array to add results'
BlockVol = []
from itertools import product
'itertools iterates through all combinations of lists'
for i,j,k,a,b,c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
'This is the bottleneck equation, with large input datasets'
BlockVol.append((abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c))))
arr = np.array(BlockVol)
'manipulate the result list a couple ways'
BlockVol = np.cbrt(BlockVol)
BlockVol = BlockVol*12
'quick check to size of results list'
len(BlockVol)
This took me about 3 minutes or so for 11.3M results, just from eyeballing the clock.
I've learned about #njit, prange in the last day or so, but am a bit stuck in trying to translate my work into this format. I do have a desktop PC with a pretty good GPU, so I think I could speed things up by a lot. I'm well aware that the code below is a big garbage fire that doesn't do anything, but I'm hoping that I'm at least getting the point across on what I'm trying to do.
It seems that the way to go is to define a function with my 6 input lists, but i'm just not sure how to fuse the itertools product and the njit together.
import numpy as np
from itertools import product
from numba import njit, prange
#njit(parallel = True)
def BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
numRows =Len(Set12A)
BlockVol = np.zeros(numRows)
for i,j,k,a,b,c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
BlockVol.append((abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c))))
arr = np.array(BlockVol)
BlockVol = np.cbrt(BlockVol)
BlockVol = BlockVol*12
len(BlockVol)
Any help is much appreciated, as this is all very new and overwhelming.
Thank you!
I solved your task just by NumPy code, it is always nicer to use just NumPy instead of heavy Numba if possible. Next NumPy-only code will be as fast as same solution using Numba.
My code is 2800 times faster than your reference code, time is measured at the end of code.
In next code BlockValCalcRef(...) function is just your reference code organized as function. And BlockVolCalc(...) is my NumPy based function that should give a lot of speedup. At the end I do assert np.allclose(...) in order to check that both solutions give same results.
Also I simplified a bit sets creation to use one N param to generate sets, in your real world you just provide necessary sets.
In order to solve task I did several things:
Instead of computing np.sin(...) many times for same values I precomputed them just once for Set12A, Set23A, Set13A. Also precomputed np.abs(...) for all sets.
In order to compute cross-product I used special way of numpy arrays indexing like [None, None, :, None, None, None] this allows us to use so-called popular numpy arrays broadcasting.
I have also idea how to improve code even more, to make it around 6 times even faster, but I think even with current huge speed you'll fill whole RAM of your machine in matter of seconds. The idea how to improve is next, currently cross product computes on each step product of 6 numbers, instead of this one can compute product of K - 1 sets and then multiply this array by K-th set in order to get K sets product. This will give 6 time more speedup (because there are 6 sets) because you'll need just one multiplication instead of 6.
Update: I've implemented second improved version of function BlockVolCalc2(...) according to paragraph above. It has 2800x speedup, for larger N it will be probably even more faster.
Try it online!
import numpy as np, time
N = 7
Set1S = np.arange(1, N + 1)
Set2S = np.arange(1, N + 1)
Set3S = np.arange(1, N + 1)
Set12A = np.arange(1, N + 1)
Set23A = np.arange(1, N + 1)
Set13A = np.arange(1, N + 1)
def BlockValCalcRef(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
BlockVol = []
from itertools import product
for i,j,k,a,b,c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
BlockVol.append((abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c))))
return np.array(BlockVol)
def BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
return (
Set1S[:, None, None, None, None, None] *
Set2S[None, :, None, None, None, None] *
Set3S[None, None, :, None, None, None] *
Set12A[None, None, None, :, None, None] *
Set23A[None, None, None, None, :, None] *
Set13A[None, None, None, None, None, :]
).ravel()
def BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
prod = np.ones((1,), dtype = np.float32)
for s in reversed([Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]):
prod = (s[:, None] * prod[None, :]).ravel()
return prod
# -------- Testing Correctness and Time Measuring --------
tb = time.time()
a0 = BlockValCalcRef(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A),
t0 = time.time() - tb
print(f'base time {round(t0, 4)} sec')
tb = time.time()
a1 = BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t1 = time.time() - tb
print(f'improved time {round(t1, 4)} sec, speedup {round(t0 / t1, 2)}x')
tb = time.time()
a2 = BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t2 = time.time() - tb
print(f'improved2 time {round(t2, 4)} sec, speedup {round(t0 / t2, 2)}x')
assert np.allclose(a0, a1)
assert np.allclose(a0, a2)
Output:
base time 2.7569 sec
improved time 0.0015 sec, speedup 1834.83x
improved2 time 0.001 sec, speedup 2755.09x
My function embedded into your initial first code will look like here in this code.
Also I created TensorFlow-based variant of code, which will use all of your CPU cores and GPU, this code needs installing tensorflow one time by python -m pip install --upgrade numpy tensorflow:
import numpy as np
N = 18
Set1S, Set2S, Set3S, Set12A, Set23A, Set13A = [np.arange(1 + i, N + 1 + i) for i in range(6)]
dtype = np.float32
def Prepare(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
import numpy as np
Set12A, Set23A, Set13A = np.sin(Set12A), np.sin(Set23A), np.sin(Set13A)
return [np.abs(s).astype(dtype) for s in [Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]]
sets = Prepare(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
def ProcessNP(sets):
import numpy as np
res = np.ones((1,), dtype = dtype)
for s in reversed(sets):
res = (s[:, None] * res[None, :]).ravel()
res = np.cbrt(res) * 12
return res
def ProcessTF(sets, *, state = {}):
if 'graph' not in state:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import numpy as np, tensorflow as tf
tf.compat.v1.disable_eager_execution()
cpus = tf.config.list_logical_devices('CPU')
#print(f"CPUs: {[e.name for e in cpus]}")
gpus = tf.config.list_logical_devices('GPU')
#print(f"GPUs: {[e.name for e in gpus]}")
print(f"GPU: {len(gpus) > 0}")
state['graph'] = tf.Graph()
state['sess'] = tf.compat.v1.Session(graph = state['graph'])
#tf.device(cpus[0].name if len(gpus) == 0 else gpus[0].name)
with state['sess'].as_default(), state['graph'].as_default():
res = tf.ones((1,), dtype = dtype)
state['inp'] = []
for s in reversed(sets):
sph = tf.compat.v1.placeholder(dtype, s.shape)
state['inp'].insert(0, sph)
res = sph[:, None] * res[None, :]
res = tf.reshape(res, (tf.size(res),))
res = tf.math.pow(res, 1 / 3) * 12
state['out'] = res
def Run(sets):
with state['sess'].as_default(), state['graph'].as_default():
return tf.compat.v1.get_default_session().run(
state['out'], {ph: s for ph, s in zip(state['inp'], sets)}
)
state['run'] = Run
return state['run'](sets)
# ------------ Testing ------------
npa, tfa = ProcessNP(sets), ProcessTF(sets)
assert np.allclose(npa, tfa)
from timeit import timeit
print('Nums:', round(npa.size / 10 ** 6, 3), 'M')
timeit_num = 2
print('NP:', round(timeit(lambda: ProcessNP(sets), number = timeit_num) / timeit_num, 3), 'sec')
print('TF:', round(timeit(lambda: ProcessTF(sets), number = timeit_num) / timeit_num, 3), 'sec')
On my 2-cores CPU it prints:
GPU: False
Nums: 34.012 M
NP: 3.487 sec
TF: 1.185 sec
I have a pandas dataframe with indices to a numpy array. The value of the array has to be set to 1 for those indices. I need to do this millions of times on a big numpy array. Is there a more efficient way than the approach shown below?
from numpy import float32, uint
from numpy.random import choice
from pandas import DataFrame
from timeit import timeit
xy = 2000,300000
sz = 10000000
ind = DataFrame({"i":choice(range(xy[0]),sz),"j":choice(range(xy[1]),sz)}).drop_duplicates()
dtype = uint
repeats = 10
#original (~21s)
stmt = '''\
from numpy import zeros
a = zeros(xy, dtype=dtype)
a[ind.values[:,0],ind.values[:,1]] = 1'''
print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))
#suggested by #piRSquared (~13s)
stmt = '''\
from numpy import ones
from scipy.sparse import coo_matrix
i,j = ind.i.values,ind.j.values
a = coo_matrix((ones(i.size, dtype=dtype), (i, j)), dtype=dtype).toarray()
'''
print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))
I have edited the above post to show the approach(es) suggested by #piRSquared and re-wrote it to allow an apples-to-apples comparison. Irrespective of the data type (tried uint and float32), the suggested approach has a 40% reduction in time.
OP time
56.56 s
I can only marginally improve with
i, j = ind.i.values, ind.j.values
a[i, j] = 1
New Time
52.19 s
However, you can considerably speed this up by using scipy.sparse.coo_matrix to instantiate a sparse matrix and then convert it to a numpy.array.
import timeit
stmt = '''\
import numpy, pandas
from scipy.sparse import coo_matrix
xy = 2000,300000
sz = 10000000
ind = pandas.DataFrame({"i":numpy.random.choice(range(xy[0]),sz),"j":numpy.random.choice(range(xy[1]),sz)}).drop_duplicates()
################################################
i, j = ind.i.values, ind.j.values
dtype = numpy.uint8
a = coo_matrix((numpy.ones(i.size, dtype=dtype), (i, j)), dtype=dtype).toarray()'''
timeit.timeit(stmt, number=10)
33.06471237000369