Fastest way to eliminate outliers in a 2D array in Python without pandas - python

I need to iterate over a lot of itemsets and remove the outliers from each. As a threshold, I simply use the standard rule: a value is an outlier if it lies more than 3σ from the mean. Currently, I have the following solution:
def remove_outlier(data):
    data_t = np.array(data).T.tolist()
    for i, ele in enumerate(data_t):
        temp = []
        for val in ele:
            if (val < (3 * np.std(ele)) + np.mean(ele)) and (val > (np.mean(ele) - 3 * np.std(ele))):
                temp.append(val)
        data_t[i] = np.asarray(temp)
    data = np.asarray(data_t).T
    return data
I'm looking for a faster solution, because it takes up to 7 seconds per dataset (foreseeable for a double for-loop).
I've come across scipy's z-score method and since it also supports the axis=1 argument, it seems more valuable and faster than my solution. Is there a shortcut of how I can remove the corresponding z-scores from my dataset?
I played around with numpy.where(), but it returns only certain values if compared above a threshold.
The shape of the data is usually around 1000x8, but can also be transposed without any problem.
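One possible vectorized take on this, offered as a sketch rather than a definitive answer: compute the per-column z-scores with plain NumPy broadcasting (the same quantity scipy.stats.zscore returns with its defaults) and keep only the rows in which every column stays within the threshold. Dropping whole rows is an assumption made here so that the result stays a rectangular array; remove_outlier_rows and its threshold parameter are illustrative names, not part of the original code.

```python
import numpy as np

def remove_outlier_rows(data, threshold=3.0):
    """Drop every row holding a value more than `threshold` sigma from its column mean."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Assumption: a constant column (std == 0) contains no outliers.
    safe_std = np.where(std == 0, 1.0, std)
    z = np.abs(data - mean) / safe_std
    keep = (z < threshold).all(axis=1)
    return data[keep]

rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 8))
sample[10, 3] = 50.0  # plant one obvious outlier
cleaned = remove_outlier_rows(sample)
print(sample.shape, "->", cleaned.shape)
```

The mean and standard deviation are computed once per column instead of once per value, which is where most of the time in the double loop goes.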

Related

Python: how to speed up this function and make it more scalable?

I have the following function which accepts an indicator matrix of shape (20,000 x 20,000). And I have to run the function 20,000 x 20,000 = 400,000,000 times. Note that the indicator_Matrix has to be in the form of a pandas dataframe when passed as parameter into the function, as my actual problem's dataframe has timeIndex and integer columns but I have simplified this a bit for the sake of understanding the problem.
Pandas Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0,2,[20000,20000]))
def operations(indicator_Matrix):
    s = indicator_Matrix.sum(axis=1)
    d = indicator_Matrix.div(s,axis=0)
    res = d[d>0].mean(axis=0)
    return res.iloc[-1]
I tried to improve it by using NumPy, but it still takes ages to run. I also tried concurrent.futures.ThreadPoolExecutor, but it still takes a long time to run, with not much improvement over the list comprehension.
Numpy Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0,2,[20000,20000]))
def operations(indicator_Matrix):
    s = indicator_Matrix.to_numpy().sum(axis=1)
    d = (indicator_Matrix.to_numpy().T / s).T
    d = pd.DataFrame(d, index = indicator_Matrix.index, columns = indicator_Matrix.columns)
    res = d[d>0].mean(axis=0)
    return res.iloc[-1]
output = [operations(indicator_Matrix) for i in range(0,20000**2)]
Note that I convert d to a dataframe again because I need to obtain the column means and retain only the last column mean using .iloc[-1]. d[d>0].mean(axis=0) returns the column means, i.e.
2478 1.0
0 1.0
Update: I am still stuck on this problem. I wonder if using GPU packages like cuDF and CuPy on my local desktop would make any difference.
Assuming the answer of @CrazyChucky is correct, one can implement a faster parallel Numba implementation. The idea is to use plain loops and take care to read the data contiguously. Reading data contiguously is important to make the computation cache-friendly and memory-efficient. Here is an implementation:
import numba as nb
import numpy as np

@nb.njit(['(int_[:,:],)', '(int_[:,::1],)', '(int_[::1,:],)'], parallel=True)
def compute_fastest(matrix):
    n, m = matrix.shape
    sum_by_row = np.zeros(n, matrix.dtype)
    is_row_major = matrix.strides[0] >= matrix.strides[1]
    if is_row_major:
        # Row-major: each thread sums whole rows, reading memory contiguously
        for i in nb.prange(n):
            s = 0
            for j in range(m):
                s += matrix[i, j]
            sum_by_row[i] = s
    else:
        # Column-major: work on chunks of 64 rows and walk down each column
        for chunk_id in nb.prange(0, (n+63)//64):
            start = chunk_id * 64
            end = min(start+64, n)
            for j in range(m):
                for i2 in range(start, end):
                    sum_by_row[i2] += matrix[i2, j]
    # Mean of the positive entries of the last column divided by the row sums
    count = 0
    s = 0.0
    for i in range(n):
        value = matrix[i, -1] / sum_by_row[i]
        if value > 0:
            s += value
            count += 1
    return s / count
# output = [compute_fastest(indicator_Matrix.to_numpy()) for i in range(0,20000**2)]
Pandas dataframes can contain both row-major and column-major arrays. Depending on the memory layout, it is better to iterate over the rows or over the columns. This is why there are two implementations of the sum, selected by is_row_major. There are also 3 Numba signatures: one for row-major contiguous arrays, one for column-major contiguous arrays and one for non-contiguous arrays. Numba will compile the 3 function variants and automatically pick the best one at runtime. The JIT compiler of Numba can generate a faster implementation (e.g. using SIMD instructions) when the input 2D array is known to be contiguous.
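To see which of these layouts a given array has, one can inspect its flags and strides. The following is a small sketch using only NumPy; describe_layout is an illustrative helper name, and the same check can be applied to the result of DataFrame.to_numpy().

```python
import numpy as np

def describe_layout(arr):
    """Classify a 2D array the same way the three signatures above distinguish inputs."""
    if arr.flags['C_CONTIGUOUS']:
        return "row-major contiguous"
    if arr.flags['F_CONTIGUOUS']:
        return "column-major contiguous"
    return "non-contiguous"

c_arr = np.zeros((4, 3), dtype=np.int32)   # default C (row-major) order
f_arr = np.asfortranarray(c_arr)           # same values, column-major order
view = c_arr[:, ::2]                       # strided view, no longer contiguous

for name, arr in [("c_arr", c_arr), ("f_arr", f_arr), ("view", view)]:
    # strides[0] >= strides[1] is the is_row_major test used in compute_fastest
    print(name, describe_layout(arr), arr.strides, arr.strides[0] >= arr.strides[1])
```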
Experimental Results
This computation is about 14.5 times faster than operations_simpler on my i5-9600KF processor (6 cores). It still takes a lot of time but the computation is memory-bound and nearly optimal on my machine: it is bounded by the main-memory which has to be read:
On a 2000x2000 dataframe with 32-bit integers:
- operations: 86.310 ms/iter
- operations_simpler: 5.450 ms/iter
- compute_fastest: 0.375 ms/iter
- optimal: 0.345-0.370 ms/iter
If you want faster code, then you need to use more compact data types. For example, a uint8 data type is large enough to contain the values 0 and 1, and it is 4 times smaller in memory on Windows. This means the code can be up to 4 times faster in this case. The smaller the data type, the faster the program. One could even try to pack 8 columns into 1 using bit tricks, though this is generally significantly slower with Numba unless you have a lot of available cores.
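The size reduction can be checked directly. This is a sketch on a random 0/1 matrix, assuming 32-bit integers as in the benchmark above; np.packbits is used here only to illustrate the "8 columns in 1" idea, not as a drop-in replacement for the Numba kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
m32 = rng.integers(0, 2, size=(2000, 2000), dtype=np.int32)

m8 = m32.astype(np.uint8)            # 1 byte per value instead of 4
packed = np.packbits(m8, axis=1)     # 8 columns stored in each byte

print(m32.nbytes, m8.nbytes, packed.nbytes)

# The packed form is lossless: unpacking gives back the uint8 matrix.
restored = np.unpackbits(packed, axis=1)
```

Since the computation is memory-bound, the number of bytes read is a reasonable first estimate of the achievable speed-up.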
Notes & Discussion
The above code works only with uniformly-typed columns. If this is not the case, you can split the dataframe into multiple groups and convert each column group to a NumPy array, then call the Numba function (modified to support groups). Note that the @CrazyChucky code has a similar issue: a dataframe with mixed-datatype columns converted to a NumPy array results in an object-based NumPy array, which is very inefficient (especially a row-major NumPy array).
Note that using a GPU will not make the computation faster unless the input dataframe is already stored in GPU memory. Indeed, CPU-GPU data transfers are more expensive than just reading the RAM (due to the overhead of the interconnect, which is generally a fairly slow PCIe link). Note also that GPU memory is quite limited compared to main memory. If the target dataframe(s) do not need to be transferred, then using cuDF is relatively simple and should give a small speed-up. For faster code, one needs to write fast CUDA code, but this is far from easy for dataframes with mixed datatypes. In the end, the resulting speed-up should be about gpu_ram_throughput / main_ram_throughput, assuming there is no data transfer. This factor is generally 5-12. Note also that CUDA and cuDF require an NVIDIA GPU.
Finally, reducing the input data size or just the amount of computation is certainly the best solution (as indicated in the comment by @zvone), since the task is very computationally intensive.
You're doing some extra math you don't have to. In plain English, what you're doing is:
1. Summing each row
2. Turning the list of sums "sideways" and dividing each column by it
3. Taking the mean of each column, ignoring values ≤ 0
4. Returning only the rightmost mean
After step one, you no longer need anything but the rightmost column; you can ignore the other columns, only dividing and averaging the one whose result you care about. Changing your code accordingly:
def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()
...yields the same result, and takes about a hundredth of the time. Extrapolating from shorter test runs, this cuts the time for 400,000,000 runs on my machine from about 114 years down to... about 324 days. Still not great. So far I've not managed to get it to run any faster by converting to NumPy, compiling with Numba, or employing multiprocessing, but I'll go ahead and post this for now in case it's helpful.
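The claim that only the rightmost column matters can be checked on a small input. The sketch below restates both computations with plain NumPy arrays instead of dataframes (an assumption made to keep it self-contained); full_last_mean and last_only_mean are illustrative names.

```python
import numpy as np

def full_last_mean(matrix):
    """Divide every column by the row sums, average the positive entries, keep the last mean."""
    sums = matrix.sum(axis=1)
    divided = matrix / sums[:, None]
    masked = np.where(divided > 0, divided, np.nan)
    return np.nanmean(masked, axis=0)[-1]

def last_only_mean(matrix):
    """Same result, but only the last column is divided and averaged."""
    sums = matrix.sum(axis=1)
    divided = matrix[:, -1] / sums
    return divided[divided > 0].mean()

rng = np.random.default_rng(0)
small = rng.integers(0, 2, size=(200, 200))
print(full_last_mean(small), last_only_mean(small))
```

The row sums are still needed in full, which is why the remaining cost is dominated by reading the whole matrix once.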
Note: You're unlikely to see any improvements with compute-heavy work like this from threading; if anything, you'd want to use multiprocessing. concurrent.futures offers executors for both. Threads are mostly useful to avoid waiting around for I/O.
As per the previous answer, you can use Numba, or you can use two other alternatives. One is Dask, a distributed computing package that can parallelize your function's execution: it divides your data into smaller chunks and distributes the computation across many CPU cores or even multiple machines.
import numpy as np
import dask.array as da

def operations(indicator_matrix):
    # indicator_matrix is a Dask array here, so array operations replace the DataFrame methods
    s = indicator_matrix.sum(axis=1)
    d = indicator_matrix / s[:, None]
    res = da.nanmean(da.where(d > 0, d, np.nan), axis=0)
    return res[-1]

indicator_matrix_dask = da.from_array(indicator_matrix.to_numpy(), chunks=(1000, 1000))
output_dask = operations(indicator_matrix_dask)
output = output_dask.compute()
Or you can use CuPy, which uses the GPU to speed up your function's execution:
import cupy as cp

def operations(indicator_matrix):
    # indicator_matrix is a CuPy array already resident in GPU memory
    s = cp.sum(indicator_matrix, axis=1)
    d = cp.divide(indicator_matrix.T, s).T
    res = cp.nanmean(cp.where(d > 0, d, cp.nan), axis=0)
    return res[-1]

indicator_matrix_cupy = cp.asarray(indicator_matrix.to_numpy())
output_cupy = operations(indicator_matrix_cupy)
output = cp.asnumpy(output_cupy)

Python similarity on sets of strings via Pandas crashes memory. How can I make it work?

I'm struggling to get my python code to run, as I always run out of memory. So, I have the following data frame:
I have a column with a key and a column with features. This is a set containing a maximum of 10 strings that all have no spaces. And in this example I have about 70k rows.
key features
0 String A {'Thisisastring', 'Thisisanothersentence', ... 'Maximumof10Strings'}
1 String B {'Hellothere', 'Woop', ... 'Maxiningoutat10Strings'}
2 String C {'Yessir', 'Stackovervlowisawesome', ... 'Maximumof10Strings'}
...
70000 String XY {'Aintnostring', 'Maybeitis', ... 'pleasehelpme'}
...
Now what I want to do is to compare each of the feature-sets with all the other feature-sets and get their similarity. The similarity score is fairly easy in its code, as I only want, if there are 5 of the 10 the same, to give me 0.5 score of similarity, etc.:
def similarity_score(a, b):
    c = a.intersection(b)
    return 2 * float(len(c)) / (len(a) + len(b))
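As a quick worked example of the score described above (it is the Sørensen-Dice coefficient), two sets of 10 strings that share 5 give 2 * 5 / (10 + 10) = 0.5. The set contents below are made up for illustration.

```python
def similarity_score(a, b):
    c = a.intersection(b)
    return 2 * float(len(c)) / (len(a) + len(b))

# Hypothetical feature sets: 5 shared strings plus 5 unique strings each.
shared = {f"shared{i}" for i in range(5)}
a = shared | {f"onlyA{i}" for i in range(5)}
b = shared | {f"onlyB{i}" for i in range(5)}

print(similarity_score(a, b))  # 2 * 5 / (10 + 10) = 0.5
```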
This is the current code, at the end I want to have a matrix, so that I can easily cluster them together based upon a similarity score threshold:
base_pd = original_pd['features']
i = base_pd.apply(frozenset).to_frame()
j = i.assign(foo=1)
k = j.merge(j, on='foo').drop(columns='foo')
k.columns = ['A', 'B']
fnc = np.vectorize(similarity_score)
y = fnc(k['A'], k['B']).reshape(len(base_pd), len(base_pd))
keys = original_pd['key'].to_list()
df = pd.DataFrame(data=y, index=keys, columns=keys)
The issue is, though, that this wipes out my memory and uses > 25 GB quite early on. Obviously, it's also due to the huge amount of data with 70K rows, but there will be even possibilities of using even more rows, so I need to find a solution.
I've already tried with NumPy, to get around it a bit, but I'm not getting anywhere.
How could I make this more efficient? I need to use strings originally, obviously could change them to hashes or so, but even then I am a bit lost.
Best and thanks in advance,
Lukas
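For scale: a dense 70,000 x 70,000 matrix of float64 scores alone needs 70,000² x 8 bytes ≈ 39 GB, before counting the cross-joined frame of set pairs. One way around this, sketched here under the assumption that only pairs above the clustering threshold are needed, is to build an inverted index from each string to the rows containing it and count shared strings only for rows that actually overlap. similar_pairs is an illustrative name, and the code is plain Python rather than pandas.

```python
from collections import defaultdict

def similar_pairs(feature_sets, threshold):
    """Yield (row, other, score) for every pair with other > row and score >= threshold."""
    # Inverted index: feature string -> list of rows that contain it.
    index = defaultdict(list)
    for row, feats in enumerate(feature_sets):
        for feat in feats:
            index[feat].append(row)

    sizes = [len(feats) for feats in feature_sets]
    for row, feats in enumerate(feature_sets):
        # Count shared features, visiting only rows that share at least one.
        shared = defaultdict(int)
        for feat in feats:
            for other in index[feat]:
                if other > row:
                    shared[other] += 1
        for other, count in shared.items():
            score = 2.0 * count / (sizes[row] + sizes[other])
            if score >= threshold:
                yield row, other, score

example = [{"a", "b"}, {"b", "c"}, {"x"}]
print(list(similar_pairs(example, 0.5)))  # [(0, 1, 0.5)]
```

Memory then grows with the number of overlapping pairs rather than with the square of the row count; how much that helps depends on how often strings repeat across rows.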

How to efficiently update np array depending on index and value?

I have an image of the sun. I found its center and radius, and now I want to process pixels differently depending on whether they are inside or outside the disk. The ideal solution would be to interpolate the parameters of the processing function, in order to transition smoothly from disk to background.
Here is what I'm doing now:
for index,value in np.ndenumerate(sun_img):
    if distance.euclidean(index,center) > radius:
        sun_img[index] = processing_function(index,value)
Like this it works but it takes forever to compute the image. I'm sure there is a more efficient way to do that. How would you solve this?
Image shape is around (1000, 1000)
Processing_function is basically not doing anything right now: value += 1
The function should be something like a non-linear "step function" with 0.0 value till radius and 1.0 5px after. something like: _______/''''''''''''''''''''' multiplied by the value of the pixel. The slope should be on the value of the radius. I wanna do this in order to enhance the protuberances
Here's a vectorized way leveraging NumPy broadcasting -
m,n = sun_img.shape
I,J = np.ogrid[:m,:n]
sq_dist = (I - center[0])**2 + (J - center[1])**2
valid_mask = sq_dist > radius**2
Now, for a processing_function that just adds 1 to the valid places, defined by the IF-conditional, do -
sun_img[valid_mask] += 1
If you need to implement a custom operation with processing_function that needs those row, column indices, use np.where to get those indices and then iterate through the valid elements, like so -
r,c = np.where(valid_mask)
for index in zip(r,c):
    sun_img[index] = processing_function(index,sun_img[index])
If you have a lot of such valid places, then computing r,c might make things slow. In that case, directly use the mask, like so -
for index,value in np.ndenumerate(sun_img):
    if valid_mask[index]:
        sun_img[index] = processing_function(index,value)
Compared to the original code, the benefit is that we have the conditional values pre-computed before going into the loop. The best way again would be to vectorize processing_function itself so that it works on a bigger chunk of data, but that would depend on its implementation.
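For the transition described in the question (0.0 up to the radius, 1.0 five pixels further out), one possible fully vectorized sketch is a clipped linear ramp over the distance to the center. The linear shape, the radial_ramp name and the width parameter are assumptions; a non-linear curve could be applied to the same distance array.

```python
import numpy as np

def radial_ramp(shape, center, radius, width=5.0):
    """Weight that is 0 inside `radius` and rises linearly to 1 at `radius + width`."""
    I, J = np.ogrid[:shape[0], :shape[1]]
    dist = np.sqrt((I - center[0])**2 + (J - center[1])**2)
    return np.clip((dist - radius) / width, 0.0, 1.0)

weights = radial_ramp((100, 100), center=(50, 50), radius=20)

# Example use: scale a stand-in image by the ramp in one vectorized step.
sun_img = np.full((100, 100), 10.0)
enhanced = sun_img * weights
```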

Speeding up Numpy Masking

I'm still an amateur when it comes to thinking about how to optimize. I have this section of code that takes in a list of found peaks and finds where these peaks, +/- some value, are located in a multidimensional array. It then adds 1 at the matching indices of a zeros array. The code works well, but it takes a long time to execute. For instance, it takes close to 45 min to run if ind has 270 values and refVals has a shape of (3050,3130,80). I understand that it's a lot of data to churn through, but is there a more efficient way of going about this?
maskData = np.zeros_like(refVals).astype(np.int16)
for peak in ind:
    tmpArr = np.ma.masked_outside(refVals,x[peak]-2,x[peak]+2).astype(np.int16)
    maskData[tmpArr.mask == False ] += 1
    tmpArr = None
maskData = np.sum(maskData,axis=2)
Approach #1 : Memory permitting, here's a vectorized approach using broadcasting -
# Create +/-2 limits using ind
r = x[ind[:,None]] + [-2,2]
# Use limits to get inside matches and sum over the iterative and last dim
mask = (refVals >= r[:,None,None,None,0]) & (refVals <= r[:,None,None,None,1])
out = mask.sum(axis=(0,3))
Approach #2 : If running out of memory with the previous one, we could use a loop and use NumPy boolean arrays and that could be more efficient than masked arrays. Also, we would perform one more level of sum-reduction, so that we would be dragging less data with us when moving across iterations. Thus, the alternative implementation would look something like this -
out = np.zeros(refVals.shape[:2]).astype(np.int16)
x_ind = x[ind]
for i in x_ind:
    out += ((refVals >= i-2) & (refVals <= i+2)).sum(-1)
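Approach #3 : Alternatively, we could replace that limit-based comparison with np.isclose in approach #2, passing rtol=0 so that only the absolute tolerance of 2 applies (by default np.isclose also adds a small relative tolerance). Thus, the only step inside the loop would become -
    out += np.isclose(refVals,i,rtol=0,atol=2).sum(-1)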
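A small consistency check can confirm that the approaches count the same elements as the original masked-array loop. This is a sketch on made-up data (the shapes, x and ind below are assumptions chosen to keep it fast), with the loop bodies wrapped in illustrative functions.

```python
import numpy as np

rng = np.random.default_rng(0)
refVals = rng.uniform(0, 100, size=(6, 7, 5))   # stand-in for the real 3D data
x = np.linspace(0, 100, 50)                     # stand-in for the peak positions
ind = np.array([3, 10, 25])

def count_masked(refVals, x, ind):
    """Original approach: masked arrays, then sum over the last axis."""
    maskData = np.zeros(refVals.shape, dtype=np.int16)
    for peak in ind:
        tmp = np.ma.masked_outside(refVals, x[peak] - 2, x[peak] + 2)
        maskData[~np.ma.getmaskarray(tmp)] += 1
    return maskData.sum(axis=2)

def count_boolean(refVals, x, ind):
    """Approach #2: boolean comparisons, reduced inside the loop."""
    out = np.zeros(refVals.shape[:2], dtype=np.int64)
    for v in x[ind]:
        out += ((refVals >= v - 2) & (refVals <= v + 2)).sum(-1)
    return out

def count_broadcast(refVals, x, ind):
    """Approach #1: one broadcasted comparison over all peaks at once."""
    r = x[ind[:, None]] + [-2, 2]
    mask = (refVals >= r[:, None, None, None, 0]) & (refVals <= r[:, None, None, None, 1])
    return mask.sum(axis=(0, 3))

a = count_masked(refVals, x, ind)
b = count_boolean(refVals, x, ind)
c = count_broadcast(refVals, x, ind)
print(np.array_equal(a, b), np.array_equal(b, c))
```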

Methods for quickly calculating standard deviation of large number set in Numpy

What's the best(fastest) way to do this?
This generates what I believe is the correct answer, but obviously at N = 10e6 it is painfully slow. I think I need to keep the Xi values so I can correctly calculate the standard deviation, but are there any techniques to make this run faster?
import random
import numpy as np

def randomInterval(a,b):
    r = ((b-a)*random.random() + a)
    return r

N = 10e6
Sum = 0
x = []
for sample in range(0,int(N)):
    n = randomInterval(-5.,5.)
    while n == 5.0:
        n = randomInterval(-5.,5.) # since X is [-5,5)
    Sum += n
    x = np.append(x, n)
A = Sum/N
summation = 0.0
for sample in range(0,int(N)):
    summation += (x[sample] - A)**2.0  # accumulate the squared deviations
standard_deviation = np.sqrt((1./N)*summation)
You made a decent attempt, but should make sure you understand this and don't copy explicitly since this is HW
import numpy as np
N = int(1e6)
a = np.random.uniform(-5,5,size=(N,))
standard_deviation = np.std(a)
This assumes you can use a package like numpy (you tagged it as such). If you can, there are a whole host of methods that allow you to create and do operations on arrays of data, thus avoiding explicit looping (it's done under the hood in an efficient manner). It would be good to take a look at the documentation to see what features are available and how to use them:
http://docs.scipy.org/doc/numpy/reference/index.html
Using the formulas found on this wiki page for Variance, you could compute it in one loop without storing a list of the random numbers (assuming you didn't need them elsewhere).
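One such single-loop formula is Welford's online update, which keeps only a running count, mean and sum of squared deviations. The sketch below is one way to write it in plain Python; streaming_std is an illustrative name, and it returns the population standard deviation (dividing by N), matching np.std's default.

```python
import math
import random

def streaming_std(samples):
    """Population standard deviation in one pass, without storing the samples."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for value in samples:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    if count == 0:
        raise ValueError("at least one sample is required")
    return math.sqrt(m2 / count)

print(streaming_std([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0

# Uniform samples on [-5, 5), generated lazily so nothing is kept in memory.
random.seed(0)
estimate = streaming_std(-5.0 + 10.0 * random.random() for _ in range(100_000))
print(estimate)
```

For a uniform distribution on [-5, 5) the exact value is 10 / sqrt(12) ≈ 2.887, so the estimate should land close to that.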
