`numpy.nanpercentile` is extremely slow

numpy.nanpercentile is extremely slow.
So I wanted to use cupy.nanpercentile, but cupy.nanpercentile is not implemented yet.
Does someone have a solution for it?

I also had a problem with np.nanpercentile being very slow for my datasets. I found a workaround that lets you use the standard np.percentile, and it can also be applied to many other libs.
This one should solve your problem, and it also works a lot faster than np.nanpercentile:
import numpy as np

arr = np.array([[np.nan, 2, 3, 1, 2, 3],
                [np.nan, np.nan, 1, 3, 2, 1],
                [4, 5, 6, 7, np.nan, 9]])

mask = (arr >= np.nanmin(arr)).astype(int)   # 1 where valid, 0 where NaN
count = mask.sum(axis=1)                     # number of valid values per row
groups = np.unique(count)
groups = groups[groups > 0]

p90 = np.zeros(arr.shape[0])
for g in range(len(groups)):
    pos = np.where(count == groups[g])
    values = arr[pos]
    values = np.nan_to_num(values, nan=(np.nanmin(arr) - 1))
    values = np.sort(values, axis=1)
    values = values[:, -groups[g]:]
    p90[pos] = np.percentile(values, 90, axis=1)
So instead of taking the percentile with the NaNs, it groups the rows by their amount of valid data and takes the percentile of each group separately, then puts everything back together. This also works for 3D arrays; just use y_pos and x_pos instead of pos, and watch out for which axis you are calculating over.
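A generalized version of the same idea, wrapped as a reusable function (a sketch; the helper name nan_percentile_rows and the handling of all-NaN rows are my additions, not part of the original snippet):

import numpy as np

def nan_percentile_rows(arr, q):
    """Row-wise percentile ignoring NaNs; q is a single percentile, e.g. 90."""
    count = (~np.isnan(arr)).sum(axis=1)       # valid values per row
    out = np.full(arr.shape[0], np.nan)
    fill = np.nanmin(arr) - 1                  # sentinel below every real value
    for c in np.unique(count):
        if c == 0:
            continue                           # all-NaN rows stay NaN
        rows = np.where(count == c)[0]
        vals = np.nan_to_num(arr[rows], nan=fill)
        vals = np.sort(vals, axis=1)[:, -c:]   # keep only the c valid values
        out[rows] = np.percentile(vals, q, axis=1)
    return out

# Should match np.nanpercentile(arr, 90, axis=1)
arr = np.array([[np.nan, 2, 3, 1, 2, 3],
                [np.nan, np.nan, 1, 3, 2, 1],
                [4, 5, 6, 7, np.nan, 9]])
print(nan_percentile_rows(arr, 90))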

import random
import numpy as np
import cupy as cp

def testset_gen(num):
    init = []
    for i in range(num):
        a = random.randint(65, 122)  # Dummy name
        b = random.randint(1, 100)   # Dummy value: 11~100 and 10% of NaN
        if b < 11:
            b = np.nan               # 10% = NaN
        init.append([a, b])
    return np.array(init)

np_testset = testset_gen(30000000)   # 468,751 KB
cp_testset = cp.asarray(np_testset)  # GPU copy of the test set

def f1_np(arr, num):
    return np.percentile(arr, num)

# 55.0, 0.523902416229248 sec
print(f1_np(np_testset[:, 1], 50))

def cupy_nanpercentile(arr, num):
    return len(cp.where(arr > num)[0]) / (len(arr) - cp.sum(cp.isnan(arr))) * 100

# 55.548758317136446, 0.3640251159667969 sec
# 43% faster
# If you need exactly the same result, use int(), but you lose the time saved.
print(cupy_nanpercentile(cp_testset[:, 1], 50))
I can't imagine how the test could take a few days. On my machine that would mean something like a trillion rows of data or more, so I can't reproduce the problem due to lack of resources.
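If you need the exact np.nanpercentile value on the GPU for a 1D array, a minimal sketch (my addition, assuming cupy is imported as cp as above) is to drop the NaNs with a boolean mask and call cp.percentile, which CuPy does provide:

def cp_nanpercentile_1d(arr, num):
    # Boolean masking removes the NaNs; cp.percentile then matches np.percentile
    return cp.percentile(arr[~cp.isnan(arr)], num)

print(cp_nanpercentile_1d(cp_testset[:, 1], 50))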

Here's an implementation with numba. After it's been compiled it is more than 7x faster than the numpy version.
Right now it is set up to take the percentile along the first axis, however it could be changed easily.
import numba
import numpy as np

@numba.jit(nopython=True, cache=True)
def nan_percentile_axis0(arr, percentiles):
    """Faster implementation of np.nanpercentile.

    This implementation always takes the percentile along axis 0.
    Uses numba to speed up the calculation by more than 7x.
    Function is equivalent to np.nanpercentile(arr, <percentiles>, axis=0)

    Params:
        arr (np.array): Array to calculate percentiles for
        percentiles (np.array): 1D array of percentiles to calculate

    Returns:
        (np.array): Array with first dimension corresponding to
            values as passed in percentiles
    """
    shape = arr.shape
    arr = arr.reshape((arr.shape[0], -1))
    out = np.empty((len(percentiles), arr.shape[1]))
    for i in range(arr.shape[1]):
        out[:, i] = np.nanpercentile(arr[:, i], percentiles)
    shape = (out.shape[0], *shape[1:])
    return out.reshape(shape)
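For reference, a minimal usage sketch (my example, with made-up shapes and percentiles; assumes the function above compiled successfully):

data = np.random.random((1000, 50, 50))
data[data < 0.1] = np.nan                  # sprinkle in some NaNs
pcts = np.array([10.0, 50.0, 90.0])
result = nan_percentile_axis0(data, pcts)  # shape (3, 50, 50)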

Related

Recreate List based on statistics

I am given the following statistics of an array:
length
Minimum
Maximum
Average
Median
Quartiles
I am supposed to recreate a list with more or less the same statistics. I know that the list for which the statistics were calculated is not normally distributed.
My first idea was to just brute-force it by creating a list of random numbers in the given range and hope that one would fit. The benefit of this method is that it works; the obvious downside is its efficiency.
So I'm looking for a more efficient way to solve this problem. Hope that someone can help...
P.S. Currently I only use numpy but I'm not limited to it.
Edit 1:
Since an example input and output were requested:
An input could look as follows:
statistics = {
    'length': 200,
    'minimum_value': 5,
    'maximum_value': 132,
    'mean': 30,
    'median': 22,
    'Q1': 13,
    'Q3': 68
}
The desired output would then look like this:
similar_list = function_to_create_similar_list(statistics)
len(similar_list) # should be roughly 200
min(similar_list) # should be roughly 5
max(similar_list) # should be roughly 132
np.mean(similar_list) # should be roughly 30
np.median(similar_list) # should be roughly 22
np.quantile(similar_list, 0.25) # should be roughly 13
np.quantile(similar_list, 0.75) # should be roughly 68
function_to_create_similar_list is the function I want to define
Edit 2:
My first edit was not enough, I'm sorry for that. Here is my current code:
import random
import numpy as np

def get_statistics(input_list):
    output = {}
    output['length'] = len(input_list)
    output['minimum_value'] = min(input_list)
    output['maximum_value'] = max(input_list)
    output['mean'] = np.mean(input_list)
    output['median'] = np.median(input_list)
    output['q1'] = np.quantile(input_list, 0.25)
    output['q3'] = np.quantile(input_list, 0.75)
    return output

def recreate_similar_list(statistics, maximum_deviation=0.1):
    sufficient_list_was_found = False
    while True:
        candidate_list = [random.uniform(statistics['minimum_value'], statistics['maximum_value'])
                          for _ in range(statistics['length'])]
        candidate_statistics = get_statistics(candidate_list)
        sufficient_list_was_found = True
        for key in statistics.keys():
            if np.abs(statistics[key] - candidate_statistics[key]) / statistics[key] > maximum_deviation:
                sufficient_list_was_found = False
                break
        if sufficient_list_was_found:
            return candidate_list

example_input_list_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 10]
recreated_list_1 = recreate_similar_list(get_statistics(example_input_list_1), 0.3)
print(recreated_list_1)
print(get_statistics(recreated_list_1))

example_input_list_2 = [1, 1, 1, 1, 3, 3, 4, 4, 4, 4, 4, 5, 18, 19, 32, 35, 35, 42, 49, 68]
recreated_list_2 = recreate_similar_list(get_statistics(example_input_list_2), 0.3)
print(recreated_list_2)
print(get_statistics(recreated_list_2))
The first example can find a solution; that was no surprise to me. The second one does not (or not in sufficient time). That also did not surprise me, as the lists generated in the recreate_similar_list function are uniformly distributed. Still, both examples represent the real task. (Keep in mind that I only get the statistics, not the list.)
I hope this is now a sufficient example.
Your existing solution is interesting, but effectively a bogo-solution. There are direct solutions possible that do not need to rely on random-and-check.
The easy-ish part is to create the array of a correct length, and place all five min/max/quartiles in their appropriate positions (this only works for a somewhat simple interpretation of the problem and has limitations).
The trickier part is to choose "fill values" between the quartiles. These fill values can be identical within one interquartile section, because the only things that matter are the sum and bounds. One fairly straightforward way is linear programming, via Scipy's scipy.optimize.linprog. It's typically used for bounded linear algebra problems and this is one. For parameters we use:
Zeros for c, the minimization coefficients, because we don't care about minimization
For A_eq, the equality constraint matrix, we pass a matrix of element counts. This is a length-4 matrix because there are four interquartile sections, each potentially with a slightly different element count. In your example these will each be close to 50.
For b_eq, the equality constraint right-hand side vector, we calculate the desired sum of all interquartile sections based on the desired mean.
For bounds we pass the bounds of each interquartile section.
One tricky aspect is that this assumes easily-divided sections, and a quantile calculation using the lower method. But there are at least thirteen methods! Some will be more difficult to target with an algorithm than others. Also, lower introduces statistical bias. I leave solving these edge cases as an exercise to the reader. But the example works:
import numpy as np
from scipy.optimize import linprog


def solve(length: int, mean: float,
          minimum_value: float, q1: float, median: float, q3: float,
          maximum_value: float) -> np.ndarray:
    sections = (np.arange(5)*(length - 1))//4
    sizes = np.diff(sections) - 1
    quartiles = np.array((minimum_value, q1, median, q3, maximum_value))

    # (quartiles + sizes@x)/length = mean
    # sizes@x = mean*length - quartiles
    result = linprog(c=np.zeros_like(sizes),
                     A_eq=sizes[np.newaxis, :],
                     b_eq=np.array((mean*length - quartiles.sum(),)),
                     bounds=np.stack((quartiles[:-1], quartiles[1:]), axis=1),
                     method='highs')
    if not result.success:
        raise ValueError(result.message)

    x = np.empty(length)
    x[sections] = quartiles
    for i, inner in enumerate(result.x):
        i0, i1 = sections[i: i+2]
        x[i0+1: i1] = inner
    return x


def summarise(x: np.ndarray) -> dict[str, float]:
    q0, q1, q2, q3, q4 = np.quantile(
        a=x, q=np.linspace(0, 1, num=5), method='lower')
    return {'length': len(x), 'mean': x.mean(),
            'minimum_value': q0, 'q1': q1, 'median': q2, 'q3': q3, 'maximum_value': q4}


def test() -> None:
    statistics = {'length': 200, 'mean': 30,  # 27.7 - 58.7 are solvable
                  'minimum_value': 5, 'q1': 13, 'median': 22, 'q3': 68, 'maximum_value': 132}
    x = solve(**statistics)
    for k, v in summarise(x).items():
        assert np.isclose(v, statistics[k])


if __name__ == '__main__':
    test()

Heuristic to choose five column arrays that maximise the dot product

I have a sparse 60000x10000 matrix M where each element is either a 1 or 0. Each column in the matrix is a different combination of signals (ie. 1s and 0s). I want to choose five column vectors from M and take the Hadamard (ie. element-wise) product of them; I call the resulting vector the strategy vector. After this step, I compute the dot product of this strategy vector with a target vector (that does not change). The target vector is filled with 1s and -1s such that having a 1 in a specific row of the strategy vector is either rewarded or penalised.
Is there some heuristic or linear algebra method that I could use to help me pick the five vectors from the matrix M that result in a high dot product? I don't have any experience with Google's OR tools nor Scipy's optimization methods so I am not too sure if they can be applied to my problem. Advice on this would be much appreciated! :)
Note: the five column vectors given as the solution do not need to be optimal; I'd rather have something that does not take months/years to run.
First of all, thanks for a good question. I don't get to practice numpy that often. Also, I don't have much experience in posting to SE, so any feedback, code critique, and opinions relating to the answer are welcome.
This was an attempt at finding an optimal solution at first, but I didn't manage to deal with the complexity. The algorithm should, however, give you a greedy solution that might prove to be adequate.
Colab Notebook (Python code + Octave validation)
Core Idea
Note: During runtime, I've transposed the matrix. So, the column vectors in the question correspond to row vectors in the algorithm.
Notice that you can multiply the target with one vector at a time, effectively getting a new target, but with some 0s in it. These will never change, so you can filter out some computations by removing those rows (columns, in the algorithm) entirely from further computations, both from the target and the matrix. You're then left with a valid target again (only 1s and -1s in it).
That's the basic idea of the algorithm. Given:
n: number of vectors you need to pick
b: number of best vectors to check
m: complexity of matrix operations to check one vector
Do an exponentially complex depth-first search (roughly O(m * b^n)), but decrease the cost of the calculations in deeper layers by reducing the target/matrix size, while cutting down a few search paths with some heuristics.
Heuristics used
The best score achieved so far is known in every recursion step. Compute an optimistic vector (turn -1 to 0) and check what scores can still be achieved. Do not search in levels where the score cannot be surpassed.
This is useless if the best vectors in the matrix have 1s and 0s equally distributed. The optimistic scores are just too high. However, it gets better with more sparsity.
Ignore duplicates. Basically, do not check duplicate vectors in the same layer. Because we reduce the matrix size, the chance for ending up with duplicates increases in deeper recursion levels.
Further Thoughts on Heuristics
The most valuable ones are those that eliminate vector choices at the start. There's probably a way to find vectors that are worse-or-equal than others with respect to their effects on the target. Say, if v1 only differs from v2 by an extra 1, and the target has a -1 in that row, then v1 is worse-or-equal than v2.
The problem is that we need to find more than 1 vector, and can't readily discard the rest. If we have 10 vectors, each worse-or-equal than the one before, we still have to keep 5 at the start (in case they're still the best option), then 4 in the next recursion level, 3 in the following, etc.
Maybe it's possible to produce a tree and pass it on into the recursion? Still, that doesn't help trim down the search space at the start... Maybe it would help to only consider 1 or 2 of the vectors in the worse-or-equal chain? That would explore more diverse solutions, but doesn't guarantee finding something more optimal.
Warning: Note that the MATRIX and TARGET in the example are in int8. If you use these for the dot product, it will overflow. Though I think all operations in the algorithm are creating new variables, so are not affected.
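As a small illustration of that point (my addition, using the TARGET and MATRIX defined in the code below): upcast to a wider integer type before taking a dot product, since an int8 dot product can wrap around.

strategy = MATRIX[0]                                          # any picked strategy vector
score = strategy.astype(np.int64) @ TARGET.astype(np.int64)   # overflow-safe score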
Code
import numpy as np

# Given:
TARGET = np.random.choice([1, -1], size=60000).astype(np.int8)
MATRIX = np.random.randint(0, 2, size=(10000, 60000), dtype=np.int8)

# Tunable - increase to search more vectors, at the cost of time.
# Performs better if the best vectors in the matrix are sparse
MAX_BRANCHES = 3  # can give more for sparser matrices

# Function
def pick_vectors(init_target, init_matrix, vectors_left_to_pick: int, best_prev_result=float("-inf")):
    assert vectors_left_to_pick >= 1
    if init_target.shape == (0, ) or len(init_matrix.shape) <= 1 or init_matrix.shape[0] == 0 or init_matrix.shape[1] == 0:
        return float("inf"), None

    target = init_target.copy()
    matrix = init_matrix.copy()

    neg_matrix = np.multiply(target, matrix)
    neg_matrix_sum = neg_matrix.sum(axis=1)

    if vectors_left_to_pick == 1:
        picked_id = np.argmax(neg_matrix_sum)
        score = neg_matrix[picked_id].sum()
        return score, [picked_id]
    else:
        sort_order = np.argsort(neg_matrix_sum)[::-1]
        sorted_sums = neg_matrix_sum[sort_order]
        sorted_neg_matrix = neg_matrix[sort_order]
        sorted_matrix = matrix[sort_order]

        best_score = best_prev_result
        best_picked_vector_idx = None

        # Heuristic 1 (H1) - optimistic target.
        # Set a maximum score that can still be achieved
        optimistic_target = target.copy()
        optimistic_target[target == -1] = 0
        if optimistic_target.sum() <= best_score:
            # This check can be removed - the scores are too high at this point
            return float("-inf"), None

        # Heuristic 2 (H2) - ignore duplicates
        vecs_tried = set()

        # MAIN GOAL: for picked_id, picked_vector in enumerate(sorted_matrix):
        for picked_id, picked_vector in enumerate(sorted_matrix[:MAX_BRANCHES]):
            # H2
            picked_tuple = tuple(picked_vector)
            if picked_tuple in vecs_tried:
                continue
            else:
                vecs_tried.add(picked_tuple)

            # Discard picked vector
            new_matrix = np.delete(sorted_matrix, picked_id, axis=0)

            # Discard matrix and target rows where vector is 0
            ones = np.argwhere(picked_vector == 1).squeeze()
            new_matrix = new_matrix[:, ones]
            new_target = target[ones]
            if len(new_matrix.shape) <= 1 or new_matrix.shape[0] == 0:
                return float("-inf"), None

            # H1: Do not compute if best score cannot be improved
            new_optimistic_target = optimistic_target[ones]
            optimistic_matrix = np.multiply(new_matrix, new_optimistic_target)
            optimistic_sums = optimistic_matrix.sum(axis=1)
            optimistic_viable_vector_idx = optimistic_sums > best_score
            if optimistic_sums.max() <= best_score:
                continue
            new_matrix = new_matrix[optimistic_viable_vector_idx]

            score, next_picked_vector_idx = pick_vectors(new_target, new_matrix, vectors_left_to_pick - 1, best_prev_result=best_score)
            if score <= best_score:
                continue

            # Convert idx of trimmed-down matrix into sorted matrix IDs
            for i, returned_id in enumerate(next_picked_vector_idx):
                # H1: Loop until you hit the required number of 'True'
                values_passed = 0
                j = 0
                while True:
                    value_picked: bool = optimistic_viable_vector_idx[j]
                    if value_picked:
                        values_passed += 1
                        if values_passed - 1 == returned_id:
                            next_picked_vector_idx[i] = j
                            break
                    j += 1
                # picked_vector index
                if returned_id >= picked_id:
                    next_picked_vector_idx[i] += 1

            best_score = score

            # Convert from sorted matrix to input matrix IDs before returning
            matrix_id = sort_order[picked_id]
            next_picked_vector_idx = [sort_order[x] for x in next_picked_vector_idx]
            best_picked_vector_idx = [matrix_id] + next_picked_vector_idx

        return best_score, best_picked_vector_idx

# Usage
score, picked_vectors_idx = pick_vectors(TARGET, MATRIX, 5)
Maybe it's too naive, but the first thing that occurs to me is to choose the 5 columns with the shortest distance to the target:
import numpy as np
import scipy.sparse
from sklearn.metrics.pairwise import pairwise_distances

def sparse_prod_axis0(A):
    """Sparse equivalent of np.prod(arr, axis=0)
    From https://stackoverflow.com/a/44321026/3381305
    """
    valid_mask = A.getnnz(axis=0) == A.shape[0]
    out = np.zeros(A.shape[1], dtype=A.dtype)
    out[valid_mask] = np.prod(A[:, valid_mask].A, axis=0)
    return np.matrix(out)

def get_strategy(M, target, n=5):
    """Guess n best vectors."""
    dists = np.squeeze(pairwise_distances(X=M, Y=target))
    idx = np.argsort(dists)[:n]
    return sparse_prod_axis0(M[idx])

# Example data.
M = scipy.sparse.rand(m=6000, n=1000, density=0.5, format='csr').astype('bool')
target = np.atleast_2d(np.random.choice([-1, 1], size=1000))

# Try it.
strategy = get_strategy(M, target, n=5)
result = strategy @ target.T
It strikes me that you could add another step of taking the top few percent from the M–target distances and check their mutual distances — but this could be quite expensive.
I have not checked how this compares to an exhaustive search.
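One possible reading of that extra step, as a rough sketch (my interpretation, not part of the answer; the shortlist_fraction parameter and the "most spread out" selection criterion are assumptions):

def get_strategy_with_shortlist(M, target, n=5, shortlist_fraction=0.02):
    # Shortlist the columns closest to the target ...
    dists = np.squeeze(pairwise_distances(X=M, Y=target))
    k = max(n, int(len(dists) * shortlist_fraction))
    shortlist = np.argsort(dists)[:k]
    # ... then look at their mutual distances (this k x k step can be expensive)
    mutual = pairwise_distances(M[shortlist])
    # e.g. keep the n shortlisted rows that are most spread out from each other
    idx = shortlist[np.argsort(mutual.sum(axis=1))[-n:]]
    return sparse_prod_axis0(M[idx])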

amplitude spectrum in Python

I have a given array with a length of over 1'000'000 and values between 0 and 255 (included) as integers. Now I would like to plot on the x-axis the integers from 0 to 255 and on the y-axis the quantity of the corresponding x value in the given array (called Arr in my current code).
I thought about this code:
list = []
for i in range(0, 256):
    icounter = 0
    for x in range(len(Arr)):
        if Arr[x] == i:
            icounter += 1
    list.append(icounter)
But is there any way I can do this a little bit faster (it takes me several minutes at the moment)? I thought about an import ..., but wasn't able to find a good package for this.
Use numpy.bincount for this task (see the numpy documentation for more details):
import numpy as np
list = np.bincount(Arr)
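To guarantee all 256 bins even when some values never occur in Arr, and to plot the counts, a small sketch (the minlength argument and the matplotlib calls are my additions):

import numpy as np
import matplotlib.pyplot as plt

counts = np.bincount(Arr, minlength=256)  # one bin per value 0..255
plt.bar(np.arange(256), counts)
plt.xlabel("value")
plt.ylabel("count")
plt.show()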
While I completely agree with the previous answers that you should use a standard histogram algorithm, it's quite easy to greatly speed up your own implementation. Its problem is that you pass through the entire input for each bin, over and over again. It would be much faster to only process the input once, and then write only to the relevant bin:
def hist(arr):
    nbins = 256
    result = [0] * nbins  # or np.zeros(nbins)
    for y in arr:
        if y >= 0 and y < nbins:
            result[y] += 1
    return result

Calculate fraction of ones along columns/rows divided by the minimum fraction of ones

I'm trying to create a function calc_frac(a, axis=0) that takes a 2D array and returns the proportion of ones in each column or row, divided by the smallest such proportion among the columns or rows.
So for example
a = np.array([[1,0,1],[1,1,0],[0,1,0]])
print(calc_frac(a))
should return [ 2. 2. 1.], because column 3 has the smallest proportion of ones (1/3), so I divide all proportions by 1/3; the other columns' proportions are 2/3, so their ratio is (2/3)/(1/3) = 2.
From reading the numpy docs, I understand I can go about this two ways: np.sum() or np.count_nonzero()... I understand that I need to find the mean, so possibly also np.mean(), but then how would I find the minimum proportion of ones? I'd say I'm a little stuck with what method to use here.
You stated that you were stuck with an approach to solve this. One possibility is:
import numpy as np
a = np.array([[1,0,1],[1,1,0],[0,1,0]])
axis = 1
# Create a mask where ones are True and zeros False
ones = a == 1
# Sum the number of ones along the axis, using the fact that booleans act like integers
# True = 1, False = 0
onesaxis = np.sum(ones, axis=axis)
# Minimum of the ones along that axis
minaxis = np.min(onesaxis)
# Divide the amount of ones in each axis by the minimum number
result = onesaxis / minaxis
If you want it shorter put multiple statements in each line (approach is the same):
onesaxis = np.sum(a == 1, axis=axis)
result = onesaxis / np.min(onesaxis)
If your array only contains 1 and 0 you might not need the a == 1 step, simply use the array itself:
onesaxis = np.sum(a, axis=axis)
result = onesaxis / np.min(onesaxis)
One warning though: You probably need to special case the case that one row contains zero 1s. Otherwise you'll get division by zero, which is almost never correct:
onesaxis = np.sum(a, axis=axis)
minaxis = np.min(onesaxis)
if minaxis == 0:
    raise ValueError()  # or something else
result = onesaxis / minaxis
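Putting it together as the calc_frac(a, axis=0) function asked for in the question (a sketch based on the steps above, with the zero-minimum guard included):

import numpy as np

def calc_frac(a, axis=0):
    # Counting ones is enough: every column/row has the same length, so the
    # ratio of proportions equals the ratio of counts.
    ones_along_axis = np.sum(a == 1, axis=axis)
    min_along_axis = np.min(ones_along_axis)
    if min_along_axis == 0:
        raise ValueError("a column/row contains no ones")
    return ones_along_axis / min_along_axis

a = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0]])
print(calc_frac(a))  # [2. 2. 1.]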

Numpy optimization

I have a function that assigns values depending on a condition. My dataset size is usually in the range of 30-50k. I am not sure if this is the correct way to use numpy, but when there are more than 5k numbers it gets really slow. Is there a better way to make it faster?
import random
import numpy as np

N = 5000  # dataset size
L = N/2
d = 0.1
constant = 5
x = constant + d*np.random.random(N)
matrix = np.zeros([L, N])

print "Assigning matrix"
for k in xrange(L):
    for i in xrange(k+1):
        matrix[k, i] = random.random()
    for i in xrange(k+1, N-k-1):
        if (x[i] > x[i-k-1]) and (x[i] > x[i+k+1]):
            matrix[k, i] = 0
        else:
            matrix[k, i] = random.random()
    for i in xrange(N-k-1, N):
        matrix[k, i] = random.random()
If you are using for loops, you are going to lose the speed of numpy. The way to get speed is to use numpy's functions and vectorized operations. Is there a way you can create a random matrix:
matrix = np.random.randn(L,k+1)
Then do something to this matrix to get the 0's positioned where you want? Can you elaborate on the condition for setting an entry to 0? For example, you can make the matrix and then do:
matrix[matrix > value]
To retain all values above a threshold. If the condition can be expressed as some boolean indexer or arithmetic operation, you can speed it up, as in the sketch below. If it has to stay inside the for loop (i.e. it depends on the values surrounding it as the loop cycles), it may not be possible to vectorize.
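In this particular case the condition for row k only depends on the fixed array x, so it can be expressed with boolean indexing. A sketch of a vectorized version of the inner loops (my rewrite of the question's logic, not benchmarked):

import numpy as np

N = 5000
L = N // 2
d = 0.1
constant = 5
x = constant + d * np.random.random(N)

matrix = np.random.random((L, N))      # start with random values everywhere
for k in range(L):
    i = np.arange(k + 1, N - k - 1)    # middle section of row k
    cond = (x[i] > x[i - k - 1]) & (x[i] > x[i + k + 1])
    matrix[k, i[cond]] = 0             # zero out where the condition holds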
