I've been working on a project for a while now that requires calculating some very large datasets, and I've quickly moved beyond anything my meager Excel knowledge could handle. In the last few days I've started learning Python, which has helped with the size of the data I'm dealing with, but the estimated processing time for these datasets is looking to be incredibly long (possibly a couple hundred years on my laptop).
The bottleneck is an equation that could produce trillions or quadrillions of results, since it calculates every combination of 6 different lists and runs each one through an equation you'll see in the code. The code works just fine as is, but it isn't feasible for datasets larger than the example I included. A real dataset would be more like Set1S, 2S, and 3S being 50 items each, and Set12A, 23A, and 13A being about 2500 items each (50x50 in this case; these sets always have a length equal to the square of the first 3 lists, but I'm keeping things short and simple here). That works out to 50^3 x 2500^3, or roughly 2x10^15 combinations.
I'm well aware that the number of results is absolutely huge, but I want to start with as large a dataset as I can, so I can see how much I can reduce the input sizes without greatly impacting the results when I plot a cumulative-% histogram.
# Calculator
import numpy as np
from itertools import product

Set1S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set2S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set3S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set12A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set23A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set13A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])

# Define an empty list to collect results
BlockVol = []

# itertools.product iterates through all combinations of the lists
for i, j, k, a, b, c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    # This is the bottleneck equation, with large input datasets
    BlockVol.append(abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c)))

BlockVol = np.array(BlockVol)

# Manipulate the result array a couple of ways
BlockVol = np.cbrt(BlockVol)
BlockVol = BlockVol*12

# Quick check of the size of the results list
len(BlockVol)
This took about 3 minutes for 11.3M results, just from eyeballing the clock.
I've learned about @njit and prange in the last day or so, but am a bit stuck trying to translate my work into that format. I do have a desktop PC with a pretty good GPU, so I think I could speed things up a lot. I'm well aware that the code below is a big garbage fire that doesn't do anything, but I hope it at least gets the point across about what I'm trying to do.
It seems that the way to go is to define a function taking my 6 input lists, but I'm just not sure how to fuse the itertools product and njit together.
import numpy as np
from itertools import product
from numba import njit, prange

@njit(parallel = True)
def BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    numRows = len(Set12A)
    BlockVol = np.zeros(numRows)
    for i, j, k, a, b, c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
        BlockVol.append(abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c)))
    arr = np.array(BlockVol)
    BlockVol = np.cbrt(BlockVol)
    BlockVol = BlockVol*12
    len(BlockVol)
Any help is much appreciated, as this is all very new and overwhelming.
Thank you!
I solved your task with NumPy alone; it is always nicer to use plain NumPy than the heavier Numba when possible, and the NumPy-only code below should be about as fast as the same solution written with Numba.
My code is 2800x faster than your reference code; the timing is measured at the end of the code.
In the code below, BlockValCalcRef(...) is just your reference code organized as a function, and BlockVolCalc(...) is my NumPy-based function, which should give a large speedup. At the end I assert np.allclose(...) to check that both solutions give the same results.
I also simplified set creation a bit, using a single N parameter to generate the sets; in your real-world code you would just provide the necessary sets.
To solve the task I did several things:
Instead of computing np.sin(...) many times for the same values, I precomputed the sines just once for Set12A, Set23A, and Set13A. I also precomputed np.abs(...) for all sets.
To compute the cross product I used a special way of indexing NumPy arrays, like [None, None, :, None, None, None], which lets us use NumPy's well-known array broadcasting.
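For example, with two tiny toy arrays (an illustration only, not from the question), inserting None axes makes the shapes broadcast into a full outer product with no Python loop:

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # shape (3,)
b = np.array([10.0, 20.0])      # shape (2,)
# a[:, None] has shape (3, 1) and b[None, :] has shape (1, 2);
# broadcasting the product gives the full (3, 2) outer product,
# i.e. every combination of elements of a and b.
outer = a[:, None] * b[None, :]
print(outer)   # [[10. 20.] [20. 40.] [30. 60.]]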
I also have an idea for improving the code even further, to make it around 6x faster still, although even at the current speed you'll fill the whole RAM of your machine in a matter of seconds. The idea is as follows: currently the cross product computes a product of 6 numbers at every step; instead, one can compute the product of the first K-1 sets and then multiply that whole array by the K-th set to get the product of K sets. This gives about a 6x speedup (because there are 6 sets), since each output element then needs just one multiplication instead of 6.
Update: I've implemented a second, improved version of the function, BlockVolCalc2(...), following the paragraph above. It reaches a 2800x speedup; for larger N it will probably be even faster.
import numpy as np, time

N = 7
Set1S = np.arange(1, N + 1)
Set2S = np.arange(1, N + 1)
Set3S = np.arange(1, N + 1)
Set12A = np.arange(1, N + 1)
Set23A = np.arange(1, N + 1)
Set13A = np.arange(1, N + 1)

def BlockValCalcRef(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    BlockVol = []
    from itertools import product
    for i, j, k, a, b, c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
        BlockVol.append(abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c)))
    return np.array(BlockVol)

def BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
    Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
    return (
        Set1S[:, None, None, None, None, None] *
        Set2S[None, :, None, None, None, None] *
        Set3S[None, None, :, None, None, None] *
        Set12A[None, None, None, :, None, None] *
        Set23A[None, None, None, None, :, None] *
        Set13A[None, None, None, None, None, :]
    ).ravel()

def BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
    Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
    prod = np.ones((1,), dtype = np.float32)
    for s in reversed([Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]):
        prod = (s[:, None] * prod[None, :]).ravel()
    return prod

# -------- Testing Correctness and Time Measuring --------

tb = time.time()
a0 = BlockValCalcRef(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t0 = time.time() - tb
print(f'base time {round(t0, 4)} sec')

tb = time.time()
a1 = BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t1 = time.time() - tb
print(f'improved time {round(t1, 4)} sec, speedup {round(t0 / t1, 2)}x')

tb = time.time()
a2 = BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t2 = time.time() - tb
print(f'improved2 time {round(t2, 4)} sec, speedup {round(t0 / t2, 2)}x')

assert np.allclose(a0, a1)
assert np.allclose(a0, a2)
Output:
base time 2.7569 sec
improved time 0.0015 sec, speedup 1834.83x
improved2 time 0.001 sec, speedup 2755.09x
Embedded into your initial code, my function would look roughly like this (a minimal sketch: BlockVolCalc2 from above plus your cbrt and *12 post-processing):
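import numpy as np

base = np.arange(1, 16)   # the 15-element example sets from the question
Set1S, Set2S, Set3S = base, base, base
Set12A, Set23A, Set13A = base, base, base

def BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
    Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
    prod = np.ones((1,), dtype = np.float32)
    for s in reversed([Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]):
        prod = (s[:, None] * prod[None, :]).ravel()
    return prod

BlockVol = BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
BlockVol = np.cbrt(BlockVol) * 12   # same post-processing as in the question
print(len(BlockVol))                # 15**6 = 11,390,625 results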
I also created a TensorFlow-based variant of the code, which will use all of your CPU cores plus your GPU. It needs TensorFlow installed once via python -m pip install --upgrade numpy tensorflow:
import numpy as np

N = 18
Set1S, Set2S, Set3S, Set12A, Set23A, Set13A = [np.arange(1 + i, N + 1 + i) for i in range(6)]
dtype = np.float32

def Prepare(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    import numpy as np
    Set12A, Set23A, Set13A = np.sin(Set12A), np.sin(Set23A), np.sin(Set13A)
    return [np.abs(s).astype(dtype) for s in [Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]]

sets = Prepare(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)

def ProcessNP(sets):
    import numpy as np
    res = np.ones((1,), dtype = dtype)
    for s in reversed(sets):
        res = (s[:, None] * res[None, :]).ravel()
    res = np.cbrt(res) * 12
    return res

def ProcessTF(sets, *, state = {}):
    if 'graph' not in state:
        import os
        os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
        import numpy as np, tensorflow as tf
        tf.compat.v1.disable_eager_execution()
        cpus = tf.config.list_logical_devices('CPU')
        #print(f"CPUs: {[e.name for e in cpus]}")
        gpus = tf.config.list_logical_devices('GPU')
        #print(f"GPUs: {[e.name for e in gpus]}")
        print(f"GPU: {len(gpus) > 0}")
        state['graph'] = tf.Graph()
        state['sess'] = tf.compat.v1.Session(graph = state['graph'])
        #tf.device(cpus[0].name if len(gpus) == 0 else gpus[0].name)
        with state['sess'].as_default(), state['graph'].as_default():
            res = tf.ones((1,), dtype = dtype)
            state['inp'] = []
            for s in reversed(sets):
                sph = tf.compat.v1.placeholder(dtype, s.shape)
                state['inp'].insert(0, sph)
                res = sph[:, None] * res[None, :]
                res = tf.reshape(res, (tf.size(res),))
            res = tf.math.pow(res, 1 / 3) * 12
            state['out'] = res
            def Run(sets):
                with state['sess'].as_default(), state['graph'].as_default():
                    return tf.compat.v1.get_default_session().run(
                        state['out'], {ph: s for ph, s in zip(state['inp'], sets)}
                    )
            state['run'] = Run
    return state['run'](sets)

# ------------ Testing ------------

npa, tfa = ProcessNP(sets), ProcessTF(sets)
assert np.allclose(npa, tfa)

from timeit import timeit
print('Nums:', round(npa.size / 10 ** 6, 3), 'M')
timeit_num = 2
print('NP:', round(timeit(lambda: ProcessNP(sets), number = timeit_num) / timeit_num, 3), 'sec')
print('TF:', round(timeit(lambda: ProcessTF(sets), number = timeit_num) / timeit_num, 3), 'sec')
On my 2-core CPU it prints:
GPU: False
Nums: 34.012 M
NP: 3.487 sec
TF: 1.185 sec
I am trying to come up with a method that implements a FIFO queue in TensorFlow. On every iteration, the purpose is to assign a placeholder a certain number, then store it in a Variable named buffer. After each assignment, I increment an index. The buffer size is [5], so the index should range from 0 to 4. Finally, once the buffer is full, I would set buffer[0:4] to buffer[1:5], and then put the new value in buffer[4].
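To make the intended shift-and-append step concrete, this is the behavior I want once the buffer is full, as a plain-NumPy toy sketch (new_value is just a stand-in name for the incoming number):

import numpy as np

buf = np.array([3, 1, 4, 1, 5], np.int32)   # a full buffer of size 5
new_value = 9                               # stand-in for the incoming number
buf[0:4] = buf[1:5]                         # shift every element one slot left
buf[4] = new_value                          # put the new value at the end
print(buf)                                  # [1 4 1 5 9]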
So here is my code:
import tensorflow as tf
import numpy as np
import random

dim = 30
lst = []
for i in range(dim):
    lst.append(random.randint(1, 10))
data = np.reshape(lst, [dim, 1])
print(lst)

# create a buffer:
buffer_input = tf.placeholder(tf.int32, shape=[1])
buffer = tf.Variable(tf.zeros([5], tf.int32))
index = tf.Variable(tf.constant(0))

def fillBufferBeforeFilled():
    update_op1 = tf.scatter_update(buffer, indices=[index], updates=buffer_input)
    index_assign_add = tf.assign_add(index, 1)
    return update_op1, index_assign_add

def fillBufferAfterFilled():
    tmp = tf.slice(buffer, begin=[0], size=[4])
    update_op2 = tf.scatter_update(buffer, indices=[0, 1, 2, 3], updates=tmp)
    update_op3 = tf.scatter_update(buffer, indices=[index], updates=buffer_input)
    return update_op2, update_op3

cond = tf.cond(tf.equal(index, 4), lambda: fillBufferBeforeFilled(), lambda: fillBufferAfterFilled())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(dim):
        cond_ = sess.run(cond, feed_dict={buffer_input: data[i]})
        buf = sess.run(buffer, feed_dict={buffer_input: data[i]})
        print('buf: ', buf)
Problem: the index Variable is not incremented after each call, while the first element of the buffer keeps being assigned the value passed to the placeholder.
I would like to know why I'm getting this behavior and what the solution is.
Any help is much appreciated!
You've mixed up the order of the conditions in tf.cond; it should be
cond = tf.cond(tf.equal(index, 4), lambda: fillBufferAfterFilled(), lambda: fillBufferBeforeFilled())
I can get your code running and it mostly works, but the updates aren't quite right; I suspect you'll need to add some tf.control_dependencies calls to force things to happen in the right order.
Here is the solution:
import tensorflow as tf
import numpy as np
import random

dim = 30
lst = []
for i in range(dim):
    lst.append(random.randint(1, 10))
data = np.reshape(lst, [dim, 1])
print(lst)

# create a buffer:
buffer_input = tf.placeholder(tf.int32, shape=[1])
buffer = tf.Variable(tf.zeros([5], tf.int32))
index = tf.Variable(-1, tf.int32)

def fillBufferBeforeFilled():
    index_assign_add = tf.assign_add(index, 1)
    with tf.control_dependencies([index_assign_add]):
        update_op1 = tf.scatter_update(buffer, indices=[index], updates=buffer_input)
    return update_op1, index_assign_add

def fillBufferAfterFilled():
    tmp = tf.slice(buffer, begin=[1], size=[4])
    update_op2 = tf.scatter_update(buffer, indices=[0, 1, 2, 3], updates=tmp)
    with tf.control_dependencies([update_op2]):
        update_op3 = tf.scatter_update(buffer, indices=[index], updates=buffer_input)
    return update_op2, update_op3

cond = tf.cond(tf.equal(index, 4), lambda: fillBufferAfterFilled(), lambda: fillBufferBeforeFilled())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(dim):
        cond_ = sess.run(cond, feed_dict={buffer_input: data[i]})
        buf = sess.run(buffer, feed_dict={buffer_input: data[i]})
        print('buf: ', buf)
In the following code, which is an example of my main code, I have tried to use pathos.multiprocessing to speed up the iterations of a loop. The output of each multiprocessed iteration is a 2-D array. I used pathos.multiprocessing instead of multiprocessing since I wanted to use it in a class method. I used the apipe method of pathos.multiprocessing to collect the output in a list, but it returns an empty list. I have no idea why it fails.
import numpy as np
import random
import pathos.multiprocessing as mp

class Testsystematics(object):
    def __init__(self, x, y, NTH = None, THMIN = None, THMAX = None, NRESAMPLE = None):
        self.x = x
        self.y = y
        self.nbins = NTH
        self.bmin = THMIN
        self.bmax = THMAX
        self.nresample = NRESAMPLE
        self.bins = np.linspace(self.bmin, self.bmax, self.nbins+1, True).astype(np.float)
        self.sample = np.array([[random.choice(range(len(self.y))) for _ in xrange(len(self.y))] for i in range(self.nresample)])
        self.result_list = []
    def log_result(self, result):
        self.result_list.append(result)
    def bootstrapping(self, k):
        xi_p = np.zeros(self.nbins, float)
        xi_m = np.zeros(self.nbins, float)
        nind = np.zeros(self.nbins, float)
        for i in range(len(self.x)):
            for j in range(len(self.x)):
                if (i != j):
                    sep = np.sqrt(self.x[i]**2 + self.x[j]**2)
                    index = np.searchsorted(self.bins, sep, side='right') - 1
                    sind = np.sin(sep)
                    if ((sep < self.bins[-1]) and (sep >= self.bins[0])):
                        xi_p[index] += sind*(np.mean(y) - np.median(y))
                        xi_m[index] += sind*np.std(y)
                        nind[index] += 1.0
        for i in range(self.nbins):
            xi_p[i] = xi_p[i]/nind[i]
            xi_m[i] = xi_m[i]/nind[i]
        return np.vstack((xi_p, xi_m))
    def twopcf(self):
        if (self.sys_type == 1):
            pool = mp.ProcessingPool(16)
            for n in range(self.nresample):
                pool.apipe(self.bootstrapping, args=(n,), callback=self.log_result)

shape, scale = 0.5, 0.6
x = np.random.gamma(shape, scale, 10000)
mu1, sigma1 = 0, 0.5 # mean and standard deviation
mu2, sigma2 = 0.1, 0.7 # mean and standard deviation
y = np.random.normal(mu1, sigma1, 1000) + np.random.normal(mu2, sigma2, 1000)
sysTest = Testsystematics(x, y, NTH = 10, THMIN = 0, THMAX = 5, NRESAMPLE = 100)
Any suggestions?
I'm the pathos author. I tried your code; it runs, produces no error, and produces no result in result_list. I believe that is because you are using apipe incorrectly. The correct use of apipe is as follows:
>>> import pathos
>>> def squared(x):
... return x**2
...
>>> pool = pathos.multiprocessing.ProcessingPool()
>>> res = pool.apipe(squared, 5)
>>> res.get()
25
self.bootstrapping takes self and k, so you have to provide a k in the apipe call when calling it as an instance method. There is no callback keyword -- if you want a callback, you'd need to add one to your function.
Note that the return value is retrieved by (1) getting a result object back from apipe, and (2) calling get on that object.
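For instance, your twopcf could submit the jobs and then collect the results by calling get on each result object (a minimal sketch based on the code in your question; no callback involved):

def twopcf(self):
    pool = mp.ProcessingPool(16)
    # submit every resample index; apipe returns an async result object
    pipes = [pool.apipe(self.bootstrapping, n) for n in range(self.nresample)]
    # get() blocks until each job finishes and returns its 2-D array
    self.result_list = [p.get() for p in pipes]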
Since you use apipe within a for loop, I'd suggest you use pool.amap (or pool.imap) instead -- then the for loop itself runs in parallel.
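With amap the loop collapses to a single call (again a sketch, assuming the same pool setup):

def twopcf(self):
    pool = mp.ProcessingPool(16)
    # map bootstrapping over all resample indices in parallel,
    # then gather the list of 2-D arrays with a single get()
    self.result_list = pool.amap(self.bootstrapping, range(self.nresample)).get()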