I am using SymPy's sympify function to evaluate a dynamic formula against data present in a DataFrame.
import sympy as sy

def evaluate_function(formula, dataframe):
    gfg_exp = sy.sympify(formula)
    dataframe_dict = dataframe.to_dict()
    gfg_exp = gfg_exp.subs(dataframe_dict)
    return gfg_exp

df['result'] = df.apply(lambda row: evaluate_function(formula=condition_to_check, dataframe=row), axis=1)
Sample data looks like:
A B
200 400
320 100
formula: A/B > 1
This works for small datasets (around 20k records finish quickly), but when the dataset is large, around 1 million records, the computation takes a long time to finish.
Is there another way to do this?
Thanks in advance.
You might try using lambdify to convert your expression into a Python function, rather than using subs. See the documentation https://docs.sympy.org/latest/modules/utilities/lambdify.html#sympy.utilities.lambdify.lambdify
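For example, a minimal sketch (assuming, as in your A/B > 1 example, that every symbol in the formula matches a DataFrame column name), which evaluates whole columns at once instead of calling subs() per row:

import sympy as sy
import pandas as pd

df = pd.DataFrame({'A': [200, 320], 'B': [400, 100]})
condition_to_check = 'A/B > 1'

expr = sy.sympify(condition_to_check)
args = sorted(expr.free_symbols, key=lambda s: s.name)
# lambdify compiles the expression into a NumPy-aware function once,
# so the formula is parsed a single time rather than once per row.
f = sy.lambdify(args, expr, modules='numpy')
df['result'] = f(*[df[str(s)] for s in args])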
My goal is to create an array where each element is a normal(size=n) sample for the corresponding element n of it.
I am trying to optimize:
it = 2 ** arange(6, 25)
M = zeros(len(it))
for x in range(len(it)):
    M[x] = normal(size=it[x])
So far I have tried these, which do not work:
N = zeros(len(it))
it = 2 ** arange(6, 25)
N = (normal(size=it))
Further I tried:
N = (normal(size=it[:]))
Given my data, I believe such manual work, or a for loop, is really inefficient, so I am trying to come up with vectorized operations.
I receive:
File "mtrand.pyx", line 1335, in numpy.random.mtrand.RandomState.normal
File "common.pyx", line 557, in numpy.random.common.cont
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
You've not been very precise about where these functions are coming from, but I'm guessing that by normal(size=it[:]) you mean:
import numpy as np
it = 2 ** np.arange(6, 25)
np.random.normal(size=it)
which would be telling numpy to create a 19 dimensional array (i.e. len(it)) that contains ~6 × 10^85 elements (i.e. np.prod(it.astype(float)), computed as floats because the number overflows an int64). numpy is saying that it can't do that, which seems like a reasonable thing to do.
Numpy doesn't like the "ragged arrays" you're trying to create, and neither do most matrix/numeric libraries, hence support is limited!
I'm unsure why you consider the "loop really inefficient". You're creating ~33 million floats over 19 iterations of a simple Python loop. The vast majority of the time will be spent in highly optimised Numpy library code, and some tiny (basically unmeasurable) amount of time will be spent evaluating your Python bytecode.
If you really want a one-liner then you can do:
X = [np.random.normal(size=2**i) for i in range(6, 25)]
which makes the split between Numpy and Python worlds more obvious.
Note that on my laptop, the Python code executes in ~5µs while the Numpy code runs for ~800ms. So you're trying to optimise the 0.0006% part!
Note that it's not always a win to use Numpy's vectorization; it only helps with larger arrays. For example, the above loop is "faster" than:
X = [np.random.normal(size=i) for i in 2**np.arange(6, 25)]
4.8 vs 5.1 µs for the Python code, because of the time spent marshalling objects into/out of the Numpy world. Again, none of this matters; just use whichever solution makes your code easier to understand. A few microseconds is nothing compared to seconds.
I would like to find the fastest way to generate ~10^9 Poisson random numbers in Python/NumPy. For instance, say I have a mean Poisson parameter (calculated elsewhere) of shape (1000, 2000), and I need 500 independent samples. This is a bottleneck in my code, taking several minutes to complete. I have tried three methods, but am looking for something faster:
import numpy as np
# example parameters
nsamples = 500
nmeas = 2000
ninputs = 1000
lambdax = np.ones([ninputs, nmeas]) * 20
# numpy, one big array
sample0 = np.random.poisson(lam=lambdax, size=(nsamples, ninputs, nmeas))
# numpy, current version where other code happens in the loop
sample1 = np.zeros([nsamples, ninputs, nmeas])
for i in range(nsamples):
    sample1[i, :, :] = np.random.poisson(lam=lambdax)
# scipy
from scipy.stats import poisson
sample2 = poisson.rvs(lambdax, size=(nsamples, ninputs, nmeas))
Results:
sample0: 1 m 16 s
sample1: 1 m 20 s
sample2: 1 m 50 s
Not shown here, I am also parallelizing the independent samples via multiprocessing, but the calculations are still pretty expensive for such large parameters. Is there a better way?
I have been in your shoes and here are my suggestions:
For large mean values, Poisson sampling behaves much like normal sampling, since for large λ the Poisson distribution is approximately Gaussian; check out this post (and probably more if you search). A sketch of this is shown below.
A ~1 m runtime seems reasonable for generating such a large number of random numbers; I don't think you can beat the sample0 method by much through coding alone. Now, depending on what you want to do with the random numbers:
if your issue is rerunning the program multiple times, try saving sample0 to a file and reloading it on subsequent runs.
if not, I suggest creating a smaller number of randoms and reusing them. Many of the random numbers in sample0 will be repeated in your sample, depending on your mean value. You could create a smaller sample and randomly choose from it; for example, I would choose a random number from sample0 and reuse it, say, 100 times (since that number would appear in sample0 over 100 times anyway).
If you provide more information on what you intend to do with the random numbers, we might be able to help more. Otherwise, coding-wise I am not sure you can do much further.
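A hedged sketch of the large-λ normal approximation together with the save-and-reload idea (sizes are reduced here, since the full (500, 1000, 2000) array is ~8 GB; whether the approximation's accuracy is acceptable depends on your application):

import numpy as np

rng = np.random.default_rng()
nsamples, ninputs, nmeas = 5, 1000, 2000   # nsamples reduced from 500 for the demo
lambdax = np.ones((ninputs, nmeas)) * 20

# For large lambda, Poisson(lam) is approximately Normal(lam, sqrt(lam)),
# and Gaussian sampling is cheaper than exact Poisson sampling.
approx = np.rint(rng.normal(loc=lambdax, scale=np.sqrt(lambdax),
                            size=(nsamples, ninputs, nmeas)))
approx = np.clip(approx, 0, None).astype(np.int64)   # counts cannot be negative

# If the bottleneck is regenerating samples on every run, cache them instead.
np.save('sample0.npy', approx)
approx = np.load('sample0.npy')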
I need to calculate the distance between pairs of xyz points in massive data (100 GB, about 20 trillion points). I am trying to speed up this loop. I created a KDTree, added parallel calculations, and split my array into smaller parts, so I guess all that's left to speed up is this loop. My pure-Python calculation took about 10 hours 42 minutes. Adding numpy reduced the time to 5 hours 34 minutes. Adding numba sped it up to 4 h 15 minutes, but it is still not fast enough. I have heard that Cython is the fastest way to do calculations in Python, but I don't have any experience with C and I don't know how to translate my function to Cython code. How can I get this loop to run faster, using Cython or any other way?
import math
import numpy as np
from scipy import spatial

def controller(point_array, las_point_array):
    empty = []
    tree = spatial.cKDTree(point_array, leafsize=1000, copy_data=True)
    empty = __pure_calc(las_point_array, point_array, empty, tree)
    return ptList

#############################################################################################
#autojit
def __pure_calc(las_point_array, point_array, empty, tree):
    for i in las_point_array:
        p = tree.query(i)
        euc_dist = math.sqrt(np.sum((point_array[p[1]] - i) ** 2))
        # add one row at a time to the empty list
        empty.append([i[0], i[1], i[2], euc_dist,
                      point_array[p[1]][0], point_array[p[1]][1], point_array[p[1]][2]])
    return empty
I attach sample data for testing:
Sample
Your function builds a list (empty in your code) that ends up looking like this:
[
    [i0[0], i0[1], i0[2], distM0],
    [i1[0], i1[1], i1[2], distM1],
    ...
]
The first thing you should do is preallocate the entire result as a NumPy array (np.empty()) and write into it one row at a time. This avoids a ton of memory allocations. You will then notice that you can defer the sqrt() to the very end and run it on the distance column after your loops are all done.
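A minimal sketch of that idea against the question's code (names follow the question; this is an illustration under those assumptions, not a drop-in replacement):

import numpy as np

def pure_calc(las_point_array, point_array, tree):
    # Preallocate the full result: one row per query point, with
    # columns x, y, z, distance, nearest-x, nearest-y, nearest-z.
    out = np.empty((len(las_point_array), 7))
    for row, i in enumerate(las_point_array):
        _, idx = tree.query(i)
        nearest = point_array[idx]
        out[row, :3] = i
        out[row, 3] = np.sum((nearest - i) ** 2)   # store the squared distance for now
        out[row, 4:] = nearest
    # Defer the sqrt: one vectorized call over the whole column at the end.
    out[:, 3] = np.sqrt(out[:, 3])
    return out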
There may be more optimization opportunities if you post a full working test harness with random/sample input data.
The key is to utilize vectorized functions as much as possible, since any call to a pure Python function inside the loop will more or less make the autojit pointless (the bottleneck will be the pure-function call).
I noticed that the query function is vectorizable, and so is the Euclidean distance calculation.
I'm not sure what your ptList variable in the controller function is (the example is a bit faulty), but assuming it is the output of your jit function, or close enough to it, you should be able to do something like this:
def controller(point_array, las_point_array):
    tree = spatial.cKDTree(point_array, leafsize=1000, copy_data=True)
    distances, pt_idx = tree.query(las_point_array)
    nearest_pts = point_array[pt_idx]
    # square the differences first, then sum, then take the root (note that the
    # `distances` returned by query are already these Euclidean distances)
    euc_distances = np.sqrt(((nearest_pts - las_point_array) ** 2).sum(axis=1))
    result = np.vstack((las_point_array.T, euc_distances.T, nearest_pts.T)).T
    return result
I'm looking for a Python-based Kolmogorov-Zurbenko filter which receives a time-series input and filters it based on a window size and a number of iterations, but I haven't found anything that seems to work. Has anyone had better luck than I have?
Thanks!
I have just been looking into the same issue. The actual KZ filter is very easy in pandas:
import pandas as pd

def kz(series, window, iterations):
    """KZ filter implementation
    series is a pandas series
    window is the filter window m in the units of the data (m = 2q+1)
    iterations is the number of times the moving average is evaluated
    """
    z = series.copy()
    for i in range(iterations):
        # pd.rolling_mean() was removed from pandas; use Series.rolling()
        z = z.rolling(window=window, min_periods=1, center=True).mean()
    return z
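For example, a quick usage sketch (synthetic data; the window and iteration counts are arbitrary):

import numpy as np
series = pd.Series(np.random.randn(1000).cumsum())
smoothed = kz(series, window=31, iterations=3)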
What cannot easily be realized, to my knowledge, is the adaptive version of the Kolmogorov-Zurbenko filter (KZA). It would at least require a rolling-mean method that allows different window lengths to the left and right of the center; a naive version is sketched below. The C code at https://cran.r-project.org/web/packages/kza/index.html looks fairly simple and straightforward, but it requires loops and would therefore be quite slow if implemented directly in Python.
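For reference, a naive sketch of such an asymmetric-window moving average (a pure-Python loop, hence slow, as noted; the fixed left/right widths here stand in for KZA's adaptive ones):

import numpy as np

def asym_rolling_mean(x, left, right):
    # Mean over x[i-left : i+right+1] for each i, truncated at the series edges.
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for i in range(len(x)):
        lo = max(0, i - left)
        hi = min(len(x), i + right + 1)
        out[i] = x[lo:hi].mean()
    return out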
I have a fairly large matrix (around 50K rows), and I want to print the correlation coefficient between each pair of rows. I have written Python code like this:
for i in xrange(rows):  # rows is the number of rows in the matrix
    for j in xrange(i, rows):
        r = scipy.stats.pearsonr(data[i, :], data[j, :])
        print r
Please note that I am making use of the pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).
My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?
Thanks!
New Solution
After looking at Joe Kington's answer, I decided to look into the corrcoef() code and was inspired by it to do the following implementation.
ms = data.mean(axis=1)[:, np.newaxis]
datam = data - ms
datass = np.sqrt(scipy.stats.ss(datam, axis=1))
for i in xrange(rows):
    temp = np.dot(datam[i:], datam[i].T)
    rs = temp / (datass[i:] * datass[i])
Each loop iteration generates the Pearson coefficients between row i and rows i through the last row. It is very fast: at least 1.5x as fast as using corrcoef() alone, because it doesn't redundantly calculate the coefficients, among other things. It will also be faster than corrcoef() and won't give you the memory problems a 50,000-row matrix would cause there, because you can choose either to store each set of r's or to process them before generating another set. Without storing any of the r's long term, I was able to run the above code on a 50,000 × 10 set of randomly generated data in under a minute on my fairly new laptop.
Old Solution
First, I wouldn't recommend printing the r's to the screen. For 100 rows (10 columns), the difference is 19.79 seconds with printing vs. 0.301 seconds without, using your code. Just store the r's and use them later if you like, or do some processing on them as you go, like looking for some of the largest r's.
Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using quantities that you can precalculate rather than recomputing every time a row is used. Also, you aren't using the p-value (which is also returned by pearsonr()), so let's scratch that too. Using the code below:
r = np.zeros((rows, rows))
ms = data.mean(axis=1)

datam = np.zeros_like(data)
for i in xrange(rows):
    datam[i] = data[i] - ms[i]
datass = scipy.stats.ss(datam, axis=1)

for i in xrange(rows):
    for j in xrange(i, rows):
        r_num = np.add.reduce(datam[i] * datam[j])
        r_den = np.sqrt(datass[i] * datass[j])
        r[i, j] = min((r_num / r_den), 1.0)
I get a speed-up of about 4.8x over the straight scipy code when I've removed the p-value stuff - 8.8x if I leave the p-value stuff in there (I used 10 columns with hundreds of rows). I also checked that it does give the same results. This isn't a really huge improvement, but it might help.
Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 Pearson coefficients. With the above code, I expect that it would take about 4 1/4 hours to do your computation if you have 10 columns to your data based on my results on smaller datasets.
You can get some improvement by taking the above code into Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight Scipy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.
Have you tried just using numpy.corrcoef? Seeing as how you're not using the p-values, it should do exactly what you want, with as little fuss as possible. (Unless I'm mis-remembering exactly what Pearson's R is, which is quite possible.)
Just quickly checking the results on random data, it returns exactly the same thing as @Justin Peel's code above and runs ~100x faster.
For example, testing things with 1000 rows and 10 columns of random data...:
import numpy as np
import scipy as sp
import scipy.stats

def main():
    data = np.random.random((1000, 10))
    x = corrcoef_test(data)
    y = justin_peel_test(data)
    print 'Maximum difference between the two results:', np.abs((x-y)).max()
    return data

def corrcoef_test(data):
    """Just using numpy's built-in function"""
    return np.corrcoef(data)

def justin_peel_test(data):
    """Justin Peel's suggestion above"""
    rows = data.shape[0]
    r = np.zeros((rows, rows))
    ms = data.mean(axis=1)
    datam = np.zeros_like(data)
    for i in xrange(rows):
        datam[i] = data[i] - ms[i]
    datass = sp.stats.ss(datam, axis=1)
    for i in xrange(rows):
        for j in xrange(i, rows):
            r_num = np.add.reduce(datam[i] * datam[j])
            r_den = np.sqrt(datass[i] * datass[j])
            r[i, j] = min((r_num / r_den), 1.0)
            r[j, i] = r[i, j]
    return r

data = main()
This yields a maximum absolute difference of ~3.3e-16 between the two results.
And timings:
In [44]: %timeit corrcoef_test(data)
10 loops, best of 3: 71.7 ms per loop
In [45]: %timeit justin_peel_test(data)
1 loops, best of 3: 6.5 s per loop
numpy.corrcoef should do just what you want, and it's a lot faster.
You can use the Python multiprocessing module: chunk your rows into 10 sets, buffer your results, and then print them out (this would only speed things up on a multicore machine, though).
http://docs.python.org/library/multiprocessing.html
By the way, you'd also have to turn your snippet into a function and consider how to reassemble the data. Having each subprocess work with a list like [startcoord, stopcoord, buff] might work nicely:
def myfunc(thelist):
    for i in xrange(thelist[0], thelist[1]):
        ....
    thelist[2] = result
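A minimal sketch of that idea, reusing the precomputed quantities from Justin Peel's answer (the chunk count and the reassembly step are placeholders; this relies on forked workers inheriting the read-only arrays, so on Windows they would need to be passed explicitly):

import numpy as np
from multiprocessing import Pool

# Read-only inputs, inherited by forked worker processes.
data = np.random.random((1000, 10))
datam = data - data.mean(axis=1)[:, None]
datass = (datam ** 2).sum(axis=1)

def corr_chunk(bounds):
    # Pearson coefficients of each row in [start, stop) against all later rows.
    start, stop = bounds
    return [np.dot(datam[i:], datam[i]) / np.sqrt(datass[i:] * datass[i])
            for i in range(start, stop)]

if __name__ == '__main__':
    rows = data.shape[0]
    step = rows // 10                      # chunk the rows into 10 sets
    bounds = [(k, min(k + step, rows)) for k in range(0, rows, step)]
    with Pool() as pool:
        buffered = pool.map(corr_chunk, bounds)   # reassemble/print afterwards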