I've been working on a project for a while now that requires calculations over some very large datasets, and I've very quickly moved beyond anything my meager Excel knowledge could handle. In the last few days I've started learning Python, which has helped with the size of the data I'm dealing with, but the estimated processing time for these datasets is looking to be incredibly long (possibly a couple hundred years on my laptop).
The bottleneck is an equation that is evaluated for every combination of 6 different lists, so it could produce trillions or quadrillions of results; you'll see it in the code. The code works just fine as is, but it isn't feasible for datasets larger than the example I included. A real dataset would be something more like Set1S, 2S, and 3S being 50 items each, and Set12A... being about 2500 items each (50x50 in this case; these sets always have a length equal to the square of the first 3 lists, but I'm keeping things short and simple here).
I'm well aware that the amount of results is absolutely huge, but want to start with as large a dataset as I can, so I can see how much I can reduce the input sizes without greatly impacting the results when I plot a cumulative% histogram.
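Just to put rough numbers on that (a back-of-the-envelope sketch using the assumed full sizes above, not my real data), the full-size case works out to roughly two quadrillion results:

n_small = 50                               # Set1S, Set2S, Set3S sizes
n_big = 50 * 50                            # Set12A, Set23A, Set13A sizes
n_results = n_small**3 * n_big**3          # every combination of the 6 lists
print(f"{n_results:.3e} results")          # ~1.95e+15, i.e. ~2 quadrillion
print(f"{n_results * 8 / 1e15:.1f} PB")    # ~15.6 PB just to store them as float64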
# Calculator
import numpy as np
from itertools import product

Set1S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set2S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set3S = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set12A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set23A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Set13A = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])

# Empty list to collect results
BlockVol = []

# itertools.product iterates through all combinations of the lists
for i, j, k, a, b, c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    # This is the bottleneck equation, with large input datasets
    BlockVol.append(abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c)))

# Convert to an array and manipulate the results a couple of ways
BlockVol = np.array(BlockVol)
BlockVol = np.cbrt(BlockVol)
BlockVol = BlockVol*12

# Quick check of the size of the result array
len(BlockVol)
This took me about 3 minutes or so for 11.3M results, just from eyeballing the clock.
I've learned about @njit and prange in the last day or so, but am a bit stuck trying to translate my work into that format. I do have a desktop PC with a pretty good GPU, so I think I could speed things up by a lot. I'm well aware that the code below is a big garbage fire that doesn't do anything, but I'm hoping it at least gets the point across on what I'm trying to do.
It seems that the way to go is to define a function with my 6 input lists, but I'm just not sure how to fuse the itertools product and the njit together.
import numpy as np
from itertools import product
from numba import njit, prange

@njit(parallel = True)
def BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    numRows = len(Set12A)
    BlockVol = np.zeros(numRows)
    for i, j, k, a, b, c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
        BlockVol.append(abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c)))
    arr = np.array(BlockVol)
    BlockVol = np.cbrt(BlockVol)
    BlockVol = BlockVol*12
    len(BlockVol)
Any help is much appreciated, as this is all very new and overwhelming.
Thank you!
I solved your task with plain NumPy code; it is usually nicer to use NumPy alone rather than the heavier Numba when possible, and the NumPy-only code below should be about as fast as the same solution written with Numba.
My code is about 2800 times faster than your reference code; the timing is measured at the end of the code.
In the code below, the BlockValCalcRef(...) function is just your reference code organized as a function, and BlockVolCalc(...) is my NumPy-based function that should give a large speedup. At the end I assert np.allclose(...) to check that both solutions give the same results.
I also simplified the set creation a bit, using a single N parameter to generate the sets; in your real-world code you would just provide the actual sets.
To solve the task I did several things:
Instead of computing np.sin(...) many times for the same values, I precompute it just once for Set12A, Set23A, Set13A. I also precompute np.abs(...) for all sets.
To compute the cross product I use NumPy indexing of the form [None, None, :, None, None, None], which lets the well-known NumPy array broadcasting do the work; a small demo follows.
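To make that broadcasting trick concrete, here is a tiny two-array sketch (not part of the timed code below): indexing with None turns the element-wise product into a full outer product, which enumerates exactly the same combinations as itertools.product.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0])

# a[:, None] has shape (3, 1) and b[None, :] has shape (1, 2);
# broadcasting multiplies every pair, giving shape (3, 2).
outer = a[:, None] * b[None, :]
print(outer.ravel())  # [10. 20. 20. 40. 30. 60.], same order as product(a, b)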
I also have an idea for making the code roughly 6 times faster still, although even at the current speed you will fill your machine's whole RAM in a matter of seconds. The idea is this: currently the cross product multiplies 6 numbers for every output element; instead, one can compute the product of K - 1 sets and then multiply that array by the K-th set to obtain the product of K sets. That needs just one multiplication per element instead of 6 (because there are 6 sets), hence roughly a 6x speedup.
Update: I've implemented this second, improved version as the function BlockVolCalc2(...), following the paragraph above. It reaches a 2800x speedup, and for larger N the speedup will probably be even greater.
import numpy as np, time

N = 7

Set1S = np.arange(1, N + 1)
Set2S = np.arange(1, N + 1)
Set3S = np.arange(1, N + 1)
Set12A = np.arange(1, N + 1)
Set23A = np.arange(1, N + 1)
Set13A = np.arange(1, N + 1)

def BlockValCalcRef(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    BlockVol = []
    from itertools import product
    for i, j, k, a, b, c in product(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
        BlockVol.append(abs(i*j*k*np.sin(a)*np.sin(b)*np.sin(c)))
    return np.array(BlockVol)

def BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
    Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
    return (
        Set1S[:, None, None, None, None, None] *
        Set2S[None, :, None, None, None, None] *
        Set3S[None, None, :, None, None, None] *
        Set12A[None, None, None, :, None, None] *
        Set23A[None, None, None, None, :, None] *
        Set13A[None, None, None, None, None, :]
    ).ravel()

def BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
    Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
    prod = np.ones((1,), dtype = np.float32)
    for s in reversed([Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]):
        prod = (s[:, None] * prod[None, :]).ravel()
    return prod

# -------- Testing Correctness and Time Measuring --------

tb = time.time()
a0 = BlockValCalcRef(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t0 = time.time() - tb
print(f'base time {round(t0, 4)} sec')

tb = time.time()
a1 = BlockVolCalc(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t1 = time.time() - tb
print(f'improved time {round(t1, 4)} sec, speedup {round(t0 / t1, 2)}x')

tb = time.time()
a2 = BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
t2 = time.time() - tb
print(f'improved2 time {round(t2, 4)} sec, speedup {round(t0 / t2, 2)}x')

assert np.allclose(a0, a1)
assert np.allclose(a0, a2)
Output:
base time 2.7569 sec
improved time 0.0015 sec, speedup 1834.83x
improved2 time 0.001 sec, speedup 2755.09x
My function, embedded into your initial code, looks roughly like this:
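A minimal sketch of that embedding (reusing BlockVolCalc2 from above with your original 15-element sets, and keeping your np.cbrt and *12 post-processing):

import numpy as np

def BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    # incremental outer product of the 6 sets, as explained above
    Set1S, Set2S, Set3S = np.abs(Set1S), np.abs(Set2S), np.abs(Set3S)
    Set12A, Set23A, Set13A = np.abs(np.sin(Set12A)), np.abs(np.sin(Set23A)), np.abs(np.sin(Set13A))
    prod = np.ones((1,), dtype=np.float32)
    for s in reversed([Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]):
        prod = (s[:, None] * prod[None, :]).ravel()
    return prod

Set1S = np.arange(1, 16)   # your original 1..15 sets
Set2S = np.arange(1, 16)
Set3S = np.arange(1, 16)
Set12A = np.arange(1, 16)
Set23A = np.arange(1, 16)
Set13A = np.arange(1, 16)

BlockVol = BlockVolCalc2(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)
BlockVol = np.cbrt(BlockVol) * 12   # same post-processing as in your script
print(len(BlockVol))                # 15**6 = 11,390,625 results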
I also created a TensorFlow-based variant of the code, which will use all of your CPU cores and your GPU. It needs TensorFlow installed once via python -m pip install --upgrade numpy tensorflow:
import numpy as np

N = 18

Set1S, Set2S, Set3S, Set12A, Set23A, Set13A = [np.arange(1 + i, N + 1 + i) for i in range(6)]

dtype = np.float32

def Prepare(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A):
    import numpy as np
    Set12A, Set23A, Set13A = np.sin(Set12A), np.sin(Set23A), np.sin(Set13A)
    return [np.abs(s).astype(dtype) for s in [Set1S, Set2S, Set3S, Set12A, Set23A, Set13A]]

sets = Prepare(Set1S, Set2S, Set3S, Set12A, Set23A, Set13A)

def ProcessNP(sets):
    import numpy as np
    res = np.ones((1,), dtype = dtype)
    for s in reversed(sets):
        res = (s[:, None] * res[None, :]).ravel()
    res = np.cbrt(res) * 12
    return res

def ProcessTF(sets, *, state = {}):
    if 'graph' not in state:
        import os
        os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
        import numpy as np, tensorflow as tf
        tf.compat.v1.disable_eager_execution()
        cpus = tf.config.list_logical_devices('CPU')
        #print(f"CPUs: {[e.name for e in cpus]}")
        gpus = tf.config.list_logical_devices('GPU')
        #print(f"GPUs: {[e.name for e in gpus]}")
        print(f"GPU: {len(gpus) > 0}")
        state['graph'] = tf.Graph()
        state['sess'] = tf.compat.v1.Session(graph = state['graph'])
        #tf.device(cpus[0].name if len(gpus) == 0 else gpus[0].name)
        with state['sess'].as_default(), state['graph'].as_default():
            res = tf.ones((1,), dtype = dtype)
            state['inp'] = []
            for s in reversed(sets):
                sph = tf.compat.v1.placeholder(dtype, s.shape)
                state['inp'].insert(0, sph)
                res = sph[:, None] * res[None, :]
                res = tf.reshape(res, (tf.size(res),))
            res = tf.math.pow(res, 1 / 3) * 12
            state['out'] = res
        def Run(sets):
            with state['sess'].as_default(), state['graph'].as_default():
                return tf.compat.v1.get_default_session().run(
                    state['out'], {ph: s for ph, s in zip(state['inp'], sets)}
                )
        state['run'] = Run
    return state['run'](sets)

# ------------ Testing ------------

npa, tfa = ProcessNP(sets), ProcessTF(sets)
assert np.allclose(npa, tfa)

from timeit import timeit

print('Nums:', round(npa.size / 10 ** 6, 3), 'M')
timeit_num = 2
print('NP:', round(timeit(lambda: ProcessNP(sets), number = timeit_num) / timeit_num, 3), 'sec')
print('TF:', round(timeit(lambda: ProcessTF(sets), number = timeit_num) / timeit_num, 3), 'sec')
On my 2-core CPU it prints:
GPU: False
Nums: 34.012 M
NP: 3.487 sec
TF: 1.185 sec
I modified the code of stuff_patches_3D to recover a 3D rib image from overlapping patches, but I find that the result is not completely correct.
First, I use the following code to extract patches:
import numbers
import numpy as np
from numpy.lib.stride_tricks import as_strided

def extract_patches(arr, patch_shape=24, extraction_step=8):
    # From: scikit-learn/sklearn/feature_extraction/image.py
    """Extracts patches of any n-dimensional array in place using strides.

    Parameters
    ----------
    arr : ndarray
        n-dimensional array of which patches are to be extracted
    patch_shape : integer or tuple of length arr.ndim
    extraction_step : integer or tuple of length arr.ndim

    Returns
    -------
    patches : strided ndarray
    """
    print('Extract Patch...')
    arr_ndim = arr.ndim
    if isinstance(patch_shape, numbers.Number):
        patch_shape = tuple([patch_shape] * arr_ndim)
    if isinstance(extraction_step, numbers.Number):
        extraction_step = tuple([extraction_step] * arr_ndim)
    patch_strides = arr.strides
    slices = tuple(slice(None, None, st) for st in extraction_step)
    indexing_strides = arr[slices].strides
    patch_indices_shape = ((np.array(arr.shape) - np.array(patch_shape)) //
                           np.array(extraction_step)) + 1
    shape = tuple(list(patch_indices_shape) + list(patch_shape))
    strides = tuple(list(indexing_strides) + list(patch_strides))
    patches = as_strided(arr, shape=shape, strides=strides)
    return patches
Then I use code modified from a post I found via Google to recover the image from the patches:
def stuff_patches_3D(img_org, patches, xstep=12, ystep=12, zstep=12):
    out_shape = img_org.shape
    print('Recover image...')
    out = np.zeros(out_shape, patches.dtype)
    denom = np.zeros(out_shape, patches.dtype)
    patch_shape = patches.shape[-3:]
    patches_shape = ((out.shape[0]-patch_shape[0])//xstep+1, (out.shape[1]-patch_shape[1])//ystep+1,
                     (out.shape[2]-patch_shape[2])//zstep+1, patch_shape[0], patch_shape[1], patch_shape[2])
    patches_strides = (out.strides[0]*xstep, out.strides[1]*ystep, out.strides[2]*zstep,
                       out.strides[0], out.strides[1], out.strides[2])
    patches_6D = np.lib.stride_tricks.as_strided(out, patches_shape, patches_strides)
    denom_6D = np.lib.stride_tricks.as_strided(denom, patches_shape, patches_strides)
    grid_inds = tuple(x.ravel() for x in np.indices(patches_6D.shape))
    np.add.at(patches_6D, grid_inds, patches.ravel())
    np.add.at(denom_6D, grid_inds, 1)
    # in case there are 0 elements in denom
    inds = denom != 0
    img_recover = np.zeros(out.shape, dtype=out.dtype)
    img_recover[inds] = out[inds]/denom[inds]
    return img_recover
I plt.imshow() the recovered image (img_recover), which looks like the original image. But when I test the recovered image with inds = img_recover != img_org and num = np.sum(inds), I find that num is far larger than 0. In fact, the shape of img_org is 399*196*299, i.e. 23,382,996 voxels, and num = 1,317,528.
I cannot find the reason. Any help? Thanks in advance!
I am trying to use the argmax result of tf.nn.max_pool_with_argmax() to index another tensor. For simplicity, let's say I am trying to implement the following:
output, argmax = tf.nn.max_pool_with_argmax(input, ksize, strides, padding)
tf.assert_equal(input[argmax],output)
Now my question is how do I implement the necessary indexing operation input[argmax] to achieve the desired result? I am guessing this involves some usage of tf.gather_nd() and related calls, but I cannot figure it out. If necessary, we could assume that input has [BatchSize, Height, Width, Channel] dimensions.
Thx for your help!
Mat
I found a solution using tf.gather_nd and it works, although it seems not so elegant. I used the unravel_argmax function that was posted here.
def unravel_argmax(argmax, shape):
    output_list = []
    output_list.append(argmax // (shape[2] * shape[3]))
    output_list.append(argmax % (shape[2] * shape[3]) // shape[3])
    return tf.stack(output_list)

def max_pool(input, ksize, strides, padding):
    output, arg_max = tf.nn.max_pool_with_argmax(input=input, ksize=ksize, strides=strides, padding=padding)
    shape = input.get_shape()
    arg_max = tf.cast(arg_max, tf.int32)
    unraveld = unravel_argmax(arg_max, shape)
    indices = tf.transpose(unraveld, (1, 2, 3, 4, 0))
    channels = shape[-1]
    bs = tf.shape(input)[0]  # batch size ("iv.m" in the original looked like a typo for "input")
    t1 = tf.range(channels, dtype=arg_max.dtype)[None, None, None, :, None]
    t2 = tf.tile(t1, multiples=(bs,) + tuple(indices.get_shape()[1:-2]) + (1, 1))
    t3 = tf.concat((indices, t2), axis=-1)
    t4 = tf.range(tf.cast(bs, dtype=arg_max.dtype))
    t5 = tf.tile(t4[:, None, None, None, None], (1,) + tuple(indices.get_shape()[1:-2].as_list()) + (channels, 1))
    t6 = tf.concat((t5, t3), -1)
    return tf.gather_nd(input, t6)
In case anyone has a more elegant solution, I'd still be curious to know.
Mat
This small snippet works:
def get_results(data, other_tensor):
    # ksize and stride are assumed to be defined in the enclosing scope
    pooled_data, indices = tf.nn.max_pool_with_argmax(data, ksize=[1, ksize, ksize, 1],
                                                      strides=[1, stride, stride, 1],
                                                      padding='VALID',
                                                      include_batch_in_index=True)
    b, w, h, c = other_tensor.get_shape().as_list()
    other_tensor_pooled = tf.gather(tf.reshape(other_tensor, shape=[b*w*h*c,]), indices)
    return other_tensor_pooled
The above indices can be used to index another tensor. max_pool_with_argmax actually returns flattened indices, and to use it with anything where batch_size > 1 you need to pass include_batch_in_index=True in order to get correct results. I am assuming here that other_tensor has the same batch size as data.
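For example, a minimal eager-mode sketch (hypothetical shapes, not from the original post) showing the flattened indices picking the matching elements out of another tensor of the same shape:

import tensorflow as tf

# Hypothetical example: batch of 2, 4x4 spatial, 1 channel, 2x2 pooling.
data = tf.random.uniform((2, 4, 4, 1))
other = data * 10.0  # any tensor with the same shape as `data`

pooled, indices = tf.nn.max_pool_with_argmax(
    data, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
    padding='VALID', include_batch_in_index=True)

# With include_batch_in_index=True the indices address the fully flattened
# tensor, so a plain reshape + gather recovers the corresponding values.
other_pooled = tf.gather(tf.reshape(other, [-1]), indices)

# The gathered values from `other` sit exactly where the maxima were.
tf.debugging.assert_near(other_pooled, pooled * 10.0)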
I am doing it in this way:
def max_pool(input, ksize, strides, padding):
    output, arg_max = tf.nn.max_pool_with_argmax(input=input, ksize=ksize, strides=strides, padding=padding)
    shape = tf.shape(output)
    output1 = tf.reshape(tf.gather(tf.reshape(input, [-1]), arg_max), shape)
    err = tf.reduce_sum(tf.square(tf.subtract(output, output1)))
    return output1, err
Main Problem
What is a better/more pythonic way of retrieving the elements of one array that are not found in a different array? This is what I have:
idata = [np.column_stack(data[k]) for k in range(len(data)) if data[k] not in final]
idata = np.vstack(idata)
My interest is in performance. My data is an (X,Y,Z) array of size (7000 x 3) and my gdata is an (X,Y) array of (11000 x 2)
Preamble
I am working on an octant search to find the n points (e.g. 8) marked (+) that are closest to my circular point (o) in each octant. This would mean that my points (+) are reduced to only 64 (8 per octant). Then, for each gdata point, I would save the elements of data that are not among those nearest points (final).
import tkinter as tk
from tkinter import filedialog
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
from collections import defaultdict

root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
data = pd.read_excel(file_path)
data = np.array(data, dtype=float)
nrow, cols = data.shape

file_path1 = filedialog.askopenfilename()
gdata = pd.read_excel(file_path1)
gdata = np.array(gdata, dtype=float)
gnrow, gcols = gdata.shape

npoint_per_octant = 8

bins = np.linspace(-np.pi, np.pi, 9)
bins[-1] = np.inf  # handle edge case

for j in range(gnrow):
    delta = gdata[j, :] - data[:, :2]
    angles = np.arctan2(delta[:, 1], delta[:, 0])
    octantsort = []
    for i in range(8):
        data_i = data[(bins[i] <= angles) & (angles < bins[i+1])]
        if data_i.size > 0:
            dist_order = np.argsort(cdist(data_i[:, :2], gdata[j, :][np.newaxis]), axis=0)
            if dist_order.size < npoint_per_octant + 1:
                [octantsort.append(data_i[dist_order[:npoint_per_octant][m]]) for m in range(dist_order.size)]
            else:
                [octantsort.append(data_i[dist_order[:npoint_per_octant][m]]) for m in range(npoint_per_octant)]
            final = np.vstack(octantsort)
            idata = [np.column_stack(data[k]) for k in range(len(data)) if data[k] not in final]
            idata = np.vstack(idata)
Is there an efficient and pythonic way of doing this to increase the performance of the last two lines of the code?
If I understand your code correctly, I see the following potential savings:
dedent the final = ... line
don't use arctan; it's expensive. Since you only want octants, compare the coordinates to zero and to each other instead
don't do a full argsort; use argpartition instead (see the small demo after this list)
make your octantsort an "octantargsort", i.e. store the indices into data, not the data points themselves; this would save you the search in the last-but-one line and allow you to use np.delete for the removal
don't use append inside a list comprehension. This will produce a list of Nones that is immediately discarded. You can use list.extend outside the comprehension instead
besides, these list comprehensions look like a convoluted way of converting data_i[dist_order[:npoint_per_octant]] into a list; why not simply cast it, or even keep it as an array, since you want to vstack in the end?
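As a quick illustration of the argpartition point (a small sketch, not part of the rewrite below): argpartition only guarantees that the k smallest values land in the first k slots, in no particular order, which is all that is needed here and is cheaper than a full sort.

import numpy as np

d = np.array([7.0, 1.0, 5.0, 3.0, 9.0, 2.0])
k = 3

full = np.argsort(d)[:k]          # indices of the 3 smallest, fully sorted
part = np.argpartition(d, k)[:k]  # indices of the 3 smallest, unordered

print(sorted(d[full]) == sorted(d[part]))  # True: same 3 values either way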
Here is some sample code illustrating these ideas:
import numpy as np

def discard_nearest_in_each_octant(eater, eaten, n_eaten_p_eater):
    # build octants
    # start with quadrants ...
    top, left = (eaten < eater).T
    quadrants = [np.where(v & h)[0] for v in (top, ~top) for h in (left, ~left)]
    dcoord2 = (eaten - eater)**2
    dc2quadrant = [dcoord2[q] for q in quadrants]
    # ... and split them
    oct4158 = [q[:, 0] < q[:, 1] for q in dc2quadrant]
    # main loop
    dc2octants = [[q[o], q[~o]] for q, o in zip(dc2quadrant, oct4158)]
    reloap = [[
        np.argpartition(o.sum(-1), n_eaten_p_eater)[:n_eaten_p_eater]
        if o.shape[0] > n_eaten_p_eater else None
        for o in opair] for opair in dc2octants]
    # translate indices
    octantargpartition = [q[so] if oap is None else q[np.where(so)[0][oap]]
                          for q, o, oaps in zip(quadrants, oct4158, reloap)
                          for so, oap in zip([o, ~o], oaps)]
    octantargpartition = np.concatenate(octantargpartition)
    return np.delete(eaten, octantargpartition, axis=0)
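A minimal usage sketch (random stand-in arrays, since the real data comes from Excel files): for a single grid point it returns everything except the up-to-8 nearest points per octant.

rng = np.random.default_rng(0)
data = rng.random((7000, 3))     # stand-in for the (X, Y, Z) data
gdata = rng.random((11000, 2))   # stand-in for the (X, Y) gdata

j = 0  # one grid point, as in the question's outer loop
remaining_xy = discard_nearest_in_each_octant(gdata[j], data[:, :2], 8)
print(data.shape[0] - remaining_xy.shape[0])  # up to 64 points removed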