Comparing another array with a list full of arrays - python

So I've essentially split an array #1 (full of float values) into 100 arrays contained in a list, and what I want to do is compare it to an array #2 (also full of floats) and have the program give me the number of values in array #2 that fall within the range of each of the 100 arrays in the list.
I may not have explained it well enough, but I've done this successfully for the first two arrays; however, I can't find a way to do it elegantly through a 'for' loop so that I don't have to type it out 100 times.
Here's the code:
manual_bins_threshim = np.array_split(threshim_by_size, 100)

def count(rand, l, r):
    return len(list(i for i in rand if l <= i <= r))

a = np.array(manual_bins_threshim[0:1])
l = a[:][0][0]
r = a[:][0][len(a[:][0]) - 1]
a_1 = count(array2, l, r)

b = np.array(manual_bins_threshim[1:2])
l = b[:][0][0]
r = b[:][0][len(b[:][0]) - 1]
b_1 = count(array2, l, r)

print(a_1, b_1)
I'm also open to a function that can do this in a different way if I've made it way more complicated than it needs to be.

Just iterate over the elements of manual_bins_threshim:
for a in manual_bins_threshim:
    l = a[0]
    r = a[-1]
    print(count(array2, l, r))
A few words about my modifications:
l = a[:][0][0] → l = a[0] - when iterating over manual_bins_threshim directly, each a is already a 1-D chunk, so neither the [:] (which only creates a new array referring to the same data) nor the extra [0] is needed.
r = a[:][0][len(a[:][0]) -1] → r = a[-1] - the last element of an array (or a list) can be accessed with -1 (by the way, the n-th element from the end can be accessed with -n).
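A quick sanity check (my addition, assuming the chunks returned by np.array_split are 1-D, as in the question): the simplified indexing picks out the same endpoints as the original expressions.
a_chunk = manual_bins_threshim[0]                  # one 1-D chunk
a_wrapped = np.array(manual_bins_threshim[0:1])    # the question's 2-D wrapper around the same chunk
assert a_chunk[0] == a_wrapped[:][0][0]
assert a_chunk[-1] == a_wrapped[:][0][len(a_wrapped[:][0]) - 1]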

This question calls for some NumPy higher-dimensional array operations:
import numpy as np

threshim_by_size = np.random.rand(300)
manual_bins_threshim = np.array_split(threshim_by_size, 100)
array2 = np.random.rand(20)

def count(rand, ll, rr):
    return len(list(i for i in rand if ll <= i <= rr))

a = np.array(manual_bins_threshim[0:1])
l = a[:][0][0]
r = a[:][0][len(a[:][0]) - 1]
a_1 = count(array2, l, r)

b = np.array(manual_bins_threshim[1:2])
l = b[:][0][0]
r = b[:][0][len(b[:][0]) - 1]
b_1 = count(array2, l, r)

print(a_1, b_1)

def array_op():
    # reshape into 100 rows so each row is one bin of consecutive values
    reshaped_threshim_by_size = np.reshape(threshim_by_size, [100, -1])
    ll = reshaped_threshim_by_size[:, 0:1]          # first value of each bin, shape (100, 1)
    rr = reshaped_threshim_by_size[:, -1:]          # last value of each bin, shape (100, 1)
    reshape_array2 = np.reshape(array2, [1, -1])    # shape (1, 20), broadcasts against (100, 1)
    mask = np.logical_and(ll <= reshape_array2, reshape_array2 <= rr)
    return np.sum(mask, axis=1)                     # per-bin count of array2 values inside [first, last]

res = array_op()
assert res[0] == a_1 and res[1] == b_1
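One caveat (my note, not part of the answer above): np.reshape(threshim_by_size, [100, -1]) only works when the array length divides evenly into 100 chunks, whereas np.array_split also handles uneven splits. A sketch that takes the bounds from the chunk list itself and therefore tolerates uneven chunk sizes:
# Hypothetical variant: build the per-chunk bounds from the chunks directly.
ll = np.array([chunk[0] for chunk in manual_bins_threshim])[:, None]   # shape (100, 1)
rr = np.array([chunk[-1] for chunk in manual_bins_threshim])[:, None]  # shape (100, 1)
mask = (ll <= array2[None, :]) & (array2[None, :] <= rr)
counts = mask.sum(axis=1)   # number of array2 values inside each chunk's [first, last] range
print(counts[:2])           # should match (a_1, b_1) above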

Related

For Loop Append to List (odd probably explicable behavior)

I was attempting to implement the Runge-Kutta method detailed here. However, I wanted to return the values, so I modified the VDP1() equation as you will see below.
The odd behavior is that if I both calculate x and append that value to a list, each step seems to replace all entries in the list with the value being appended. You can see the first code bit and output for the rub of the problem; the rest is just me showing what didn't work.
.extend() works as expected so this is perhaps just another opportunity to explain to stupid people like me why the two function so differently. I base my understanding of the two on the answer given here:
append adds its argument as a single element to the end of a list.
The length of the list itself will increase by one.
extend iterates over its argument adding each element to the list, extending the list. The length of the list will increase by however
many elements were in the iterable argument.
Which would not lead me to believe that each item would somehow be replaced by the item being appended. But anyway, here's the code with unexpected output.
def rKN(x, fx, n, hs):
    k1 = []
    k2 = []
    k3 = []
    k4 = []
    xk = []
    for i in range(n):
        k1.append(fx[i](x)*hs)
    for i in range(n):
        xk.append(x[i] + k1[i]*0.5)
    for i in range(n):
        k2.append(fx[i](xk)*hs)
    for i in range(n):
        xk[i] = x[i] + k2[i]*0.5
    for i in range(n):
        k3.append(fx[i](xk)*hs)
    for i in range(n):
        xk[i] = x[i] + k3[i]
    for i in range(n):
        k4.append(fx[i](xk)*hs)
    for i in range(n):
        x[i] = x[i] + (k1[i] + 2*(k2[i] + k3[i]) + k4[i])/6
    return x

def fa1(x):
    return 0.9*(1 - x[1]*x[1])*x[0] - x[1] + np.sin(x[2])

def fb1(x):
    return x[0]

def fc1(x):
    return 0.5

def VDP1():
    f = [fa1, fb1, fc1]
    x = [1, 1, 0]
    X_v = []
    hs = 0.05
    for i in range(3):
        x = rKN(x, f, 3, hs)
        # x = [1,i,3]
        print(x)
        X_v.append(x)
    print(X_v)

VDP1()
Output:
calc x value is [0.9472269022674134, 1.0487033185947015, 0.024999999999999998]
calc x value is [0.8893603370715508, 1.0946376878598068, 0.049999999999999996]
calc x value is [0.8271667883479003, 1.1375671528881417, 0.075]
X_v array is
[[0.8271667883479003, 1.1375671528881417, 0.075], [0.8271667883479003, 1.1375671528881417, 0.075], [0.8271667883479003, 1.1375671528881417, 0.075]]
As you can see the X_v array is just a three-peat of the final item.
If we switch from append to .extend(x) it does extend the list, but X_v ends up as a flat list when we would prefer a list of lists.
Output using .extend(x)
calc x value is [0.9472269022674134, 1.0487033185947015, 0.024999999999999998]
calc x value is [0.8893603370715508, 1.0946376878598068, 0.049999999999999996]
calc x value is [0.8271667883479003, 1.1375671528881417, 0.075]
X_v array is
[0.9472269022674134, 1.0487033185947015, 0.024999999999999998, 0.8893603370715508, 1.0946376878598068, 0.049999999999999996, 0.8271667883479003, 1.1375671528881417, 0.075]
If we try to use .extend([x]) to get that list of lists the results are the same as using append. If I try X_v = X_v + [x] I similarly get the same odd result.
If you want to get really confused, run the code below: despite redefining x before appending, the earlier appended entries do not keep the values they were appended with.
def VDP1():
    f = [fa1, fb1, fc1]
    x = [1, 1, 0]
    X_v = []
    hs = 0.05
    for i in range(3):
        x = rKN(x, f, 3, hs)
        # print(type(x))
        # print(len(x))
        x = [i-1, i, i+1]
        # print(type(x))
        # print(len(x))
        print('calc x value is {} '.format(x))
        X_v.append(x)
    print('X_v array is \n{}'.format(X_v))

VDP1()
Output:
calc x value is [-1, 0, 1]
calc x value is [0, 1, 2]
calc x value is [1, 2, 3]
X_v array is
[[-1.0013470352949683, -0.0500470769424533, 1.025], [-0.00479801080448152, 0.9998822513537077, 2.025], [1, 2, 3]]
If I instead add to an np.array using i as an index location, I get what I would have expected:
def VDP1():
    f = [fa1, fb1, fc1]
    x = [1, 1, 0]
    n = 4
    X_v = np.zeros([n, 3])
    hs = 0.05
    for i in range(n):
        x = rKN(x, f, 3, hs)
        print('calc x value is {} '.format(x))
        X_v[i] = x
    print('X_v array is \n{}'.format(X_v))

VDP1()
Output:
X_v array is
[[0.9472269 1.04870332 0.025 ]
[0.88936034 1.09463769 0.05 ]
[0.82716679 1.13756715 0.075 ]
[0.7615015 1.17729645 0.1 ]]
So it seems clear there is some kind of append/extend or list behavior going on that I don't understand.
Thanks for any assistance!
def rKN(x, fx, n, hs):
    k1 = []
    k2 = []
    k3 = []
    k4 = []
    xk = []
    result = []
    for i in range(n):
        k1.append(fx[i](x)*hs)
    for i in range(n):
        xk.append(x[i] + k1[i]*0.5)
    for i in range(n):
        k2.append(fx[i](xk)*hs)
    for i in range(n):
        xk[i] = x[i] + k2[i]*0.5
    for i in range(n):
        k3.append(fx[i](xk)*hs)
    for i in range(n):
        xk[i] = x[i] + k3[i]
    for i in range(n):
        k4.append(fx[i](xk)*hs)
    for i in range(n):
        # changed: the original line below mutated the caller's list x in place
        # x[i] = x[i] + (k1[i] + 2*(k2[i] + k3[i]) + k4[i])/6
        result.append(x[i] + (k1[i] + 2*(k2[i] + k3[i]) + k4[i])/6)
    return result  # a new list each call, so X_v.append() no longer stores references to one mutated object

def fa1(x):
    return 0.9*(1 - x[1]*x[1])*x[0] - x[1] + np.sin(x[2])

def fb1(x):
    return x[0]

def fc1(x):
    return 0.5

def VDP1():
    f = [fa1, fb1, fc1]
    x = [1, 1, 0]
    X_v = []
    hs = 0.05
    for i in range(3):
        x = rKN(x, f, 3, hs)
        # x = [1,i,3]
        print(x)
        X_v.append(x)
    print(X_v)

VDP1()
print:
[0.9472269022674134, 1.0487033185947015, 0.024999999999999998]
[0.8893603370715508, 1.0946376878598068, 0.049999999999999996]
[0.8271667883479003, 1.1375671528881417, 0.075]
[[0.9472269022674134, 1.0487033185947015, 0.024999999999999998], [0.8893603370715508, 1.0946376878598068, 0.049999999999999996], [0.8271667883479003, 1.1375671528881417, 0.075]]
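For reference, the underlying cause that the change above works around: the original rKN mutated the list x in place and returned that same object, so every X_v.append(x) stored another reference to one and the same list. A minimal sketch (my addition) of that effect:
x = [1, 1, 0]
X_v = []
for i in range(3):
    x[0] = i         # mutate the list in place, like the original rKN did
    X_v.append(x)    # appends a reference to the *same* list object
print(X_v)           # [[2, 1, 0], [2, 1, 0], [2, 1, 0]] - three views of one list
Appending a copy instead (X_v.append(list(x)) or X_v.append(x.copy())) would also avoid the problem without changing rKN.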

Numpy: fill conditional subarray with increasing numbers

I often come across an idiom like the following: say I have data like
N = 20 # or some other number
a = np.random.randint(0, 10, N) # or any other 1D np.array
predicate = lambda x: x%2 == 0 # or any other predicate
The idiom I encounter is along the lines of:
b = np.full_like(a, -1)
i1 = 0
for i, x in enumerate(a):
    if predicate(x):
        b[i] = i1
        i1 += 1
How do I translate this to numpy? The following:
b = np.full_like(a, -1)
m = some_predicate(a)
b[m] = np.arange(np.count_nonzero(m))
looks a bit odd to me: this is three lines for such a simple task. In particular, it disturbs me that I need to store m, which I do since I need to reference it twice (because I have no way to say "arange with as many values as necessary").
Walrus operator to the rescue (starting with Python 3.8):
i = -1
b = np.array([ -1 if not predicate(val) else (i := i+1) for val in a ])
or (presumably significantly faster for large arrays)
b = np.full_like(a, -1)
b[sel] = np.arange(np.count_nonzero(sel := predicate(a)))
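A note on why the second snippet works: in an assignment statement the right-hand side is evaluated before the subscript target, so sel is already bound by the time b[sel] is resolved. For completeness, a one-expression variant (my addition, not from the answer above) that builds the result directly with np.where and a running count from np.cumsum:
# The walrus binds m while evaluating np.where's first argument,
# so the running count in the second argument can reuse it.
b = np.where(m := predicate(a), np.cumsum(m) - 1, -1)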

Finding first pair of numbers in array that sum to value

I'm trying to solve the following Codewars problem: https://www.codewars.com/kata/sum-of-pairs/train/python
Here is my current implementation in Python:
def sum_pairs(ints, s):
    right = float("inf")
    n = len(ints)
    m = {}
    dup = {}
    for i, x in enumerate(ints):
        if x not in m.keys():
            m[x] = i  # Track first index of x using hash map.
        elif x in m.keys() and x not in dup.keys():
            dup[x] = i
    for x in m.keys():
        if s - x in m.keys():
            if x == s-x and x in dup.keys():
                j = m[x]
                k = dup[x]
            else:
                j = m[x]
                k = m[s-x]
            comp = max(j, k)
            if comp < right and j != k:
                right = comp
    if right > n:
        return None
    return [s - ints[right], ints[right]]
The code seems to produce correct results; however, the input can consist of arrays with up to 10,000,000 elements, so the execution times out for large inputs. I need help with optimizing/modifying the code so that it can handle sufficiently large arrays.
Your code is inefficient for the large-list test cases, so it gives a timeout error. Instead you can do:
def sum_pairs(lst, s):
    seen = set()
    for item in lst:
        if s - item in seen:
            return [s - item, item]
        seen.add(item)
We add values to seen until we find a value that, together with one of the already-seen values, produces the specified sum.
For more information see: Reference link
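A quick usage check of the set-based version (my addition), using the sample input from the kata description:
print(sum_pairs([10, 5, 2, 3, 7, 5], 10))   # [3, 7]
Because the function returns as soon as a complement of the current item has been seen, it picks the pair whose second element has the lowest index, which is the tie-breaking rule the kata asks for.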
Maybe this code:
def sum_pairs(lst, s):
    c = 0
    while c < len(lst) - 1:
        if c != len(lst) - 1:
            x = lst[c]
            spam = c + 1
            while spam < len(lst):
                nxt = lst[spam]
                if nxt + x == s:
                    return [x, nxt]
                spam += 1
        else:
            return None
        c += 1

lst = [5, 6, 5, 8]
s = 14
print(sum_pairs(lst, s))
Output:
[6, 8]
This answer unfortunately still times out, even though it's supposed to run in O(n log n) (it is dominated by the sort, with the rest of the algorithm running in O(n)). I'm not sure how you can obtain better than this complexity, but I thought I might put this idea out there.
def sum_pairs(ints, s):
    ints_with_idx = enumerate(ints)
    # Sort the array of ints by value (Python 3: lambdas cannot unpack tuples)
    ints_with_idx = sorted(ints_with_idx, key=lambda pair: pair[1])
    diff = 1000000
    l = 0
    r = len(ints) - 1
    # Indexes of the sum operands in sorted array
    lSum = 0
    rSum = 0
    while l < r:
        # Compute the absolute difference between the current sum and the desired sum
        sum = ints_with_idx[l][1] + ints_with_idx[r][1]
        absDiff = abs(sum - s)
        if absDiff < diff:
            # Update the best difference
            lSum = l
            rSum = r
            diff = absDiff
        elif sum > s:
            # Decrease the large value
            r -= 1
        else:
            # Test to see if the indexes are better (more to the left) for the same difference
            if absDiff == diff:
                rightmostIdx = max(ints_with_idx[l][0], ints_with_idx[r][0])
                if rightmostIdx < max(ints_with_idx[lSum][0], ints_with_idx[rSum][0]):
                    lSum = l
                    rSum = r
            # Increase the small value
            l += 1
    # Retrieve indexes of sum operands
    aSumIdx = ints_with_idx[lSum][0]
    bSumIdx = ints_with_idx[rSum][0]
    # Retrieve values of operands for sum in correct order
    aSum = ints[min(aSumIdx, bSumIdx)]
    bSum = ints[max(aSumIdx, bSumIdx)]
    if aSum + bSum == s:
        return [aSum, bSum]
    else:
        return None

Optimize iteration through numpy array when averaging adjacent values

I have a definition in Python that:
Iterates over a sorted, distinct array of floats
Gets the previous and next item
Finds out if they are within a certain range of each other
Averages them out, and replaces the original values with the averaged value
Reruns through that loop until there are no more changes
Returns a distinct array
The issue is that it is extremely slow. The array "a" could be 100k+ values and it takes 7-10 minutes to complete.
I found that I needed to iterate over the array after the initial iteration because, after averaging, sometimes the averaged values could be within range to be averaged again.
I thought about breaking it into chunks and using multiprocessing, but my concern is that the end of one chunk and the beginning of the next chunk would also need to be averaged.
def reshape_arr(a, close):
    """Iterates through 'a' to find values +- 'close', and averages them, then returns a distinct array of values"""
    flag = True
    while flag:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    # find matching values in a, and replace with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)
            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a.unique()
a is a pandas.Series from a DataFrame with anything between 0 and 200k rows, and close is an int (100, for example).
It works, just very slow.
First, if the length of the input array a is large and close is relatively small, your proposed algorithm may be numerically unstable.
That being said, here are some ideas that reduce the time complexity from O(N^3) to O(N) (for an approximate implementation) or O(N^2) (for an equivalent implementation). For N=100, this gives a speedup up to a factor of 6000 for some choices of arr and close.
Consider an input array arr = [a,b,c,d], and suppose that close > d - a. In this case, the algorithm proceeds as follows:
[a,b,c,d]
[(a+b)/2,(b+c)/2,(c+d)/2]
[(a+2b+c)/4,(b+2c+d)/4]
[(a+3b+3c+d)/8]
One can recognize that if [x_1, x_2, ..., x_n] is a maximal contiguous subarray of arr s.t. x_i - x_{i-1} < close, then [x_1, x_2, ..., x_n] eventually evaluates to (sum_{k=1}^{n} x_k * c_{n-1,k-1}) / 2^(n-1), where c_{m,k} is the binomial coefficient m choose k.
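A quick numeric check of that identity (my addition), for arr = [1, 2, 3, 4] with close large enough to merge everything into one chunk:
import numpy as np
from scipy.special import comb

arr = np.array([1.0, 2.0, 3.0, 4.0])
n = len(arr)
# weights C(3, k) for k = 0..3, i.e. (a + 3b + 3c + d) / 8
weighted = np.average(arr, weights=comb(n - 1, np.arange(n)))
print(weighted)   # 2.5, the same value the repeated pairwise averaging converges to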
This gives an O(N) implementation as follows:
import numpy as np
from scipy.stats import binom
from scipy.special import comb

def binom_mean(arr, scipy_cutoff=64):
    """
    Given an array arr, returns an average of arr
    weighted by binomial coefficients.
    """
    n = arr.shape[0]
    if arr.shape[0] == 1:
        return arr[0]
    # initializing a scipy binomial random variable can be slow
    # so, if short runs are likely, we can speed things up
    # by doing explicit computations
    elif n < scipy_cutoff:
        return np.average(arr, weights=comb(n-1, np.arange(n), exact=False))
    else:
        f = binom(n-1, 0.5).pmf
        return np.average(arr, weights=f(np.arange(n)))

def reshape_arr_binom(arr, close):
    d = np.ediff1d(arr, to_begin=0) < close
    close_chunks = np.split(arr, np.where(~d)[0])
    return np.fromiter(
        (binom_mean(c) for c in close_chunks),
        dtype=float
    )
The result is within 10e-15 of your implementation for np.random.seed(0); N=1000; close=1/N; arr=np.random.rand(N). However, for large N, this may not be meaningful unless close is small. For the above parameter values, this is 270 times faster than the original code on my machine.
However, if we choose a modest value of N = 100 and set close to a large value like 1, the speedup is by a factor of 6000. This is because for large values of close, the original implementation is O(N^3); specifically, a.replace is potentially called O(N^2) times and has a cost O(N). So, maximal speedup is achieved when contiguous elements are likely to be close.
For reference, here is an O(N^2) implementation that is equivalent to your code (I do not recommend using this in practice).
import pandas as pd
import numpy as np
np.random.seed(0)

def custom_avg(arr, indices, close):
    new_indices = list()
    last = indices[-1]
    for i in indices:
        if arr[i] - arr[i-1] < close:
            new_indices.append(i)
            avg = (arr[i-1] + arr[i]) / 2
            arr[i-1] = avg
            if i != last and arr[i+1] - arr[i] >= close:
                arr[i] = avg
    return new_indices

def filter_indices(indices):
    new_indices = list()
    second_dups = list()
    # handle empty index case
    if not indices:
        return new_indices, second_dups
    for i, j in zip(indices, indices[1:]):
        if i + 1 == j:
            # arr[i] is guaranteed to be different from arr[i-1]
            new_indices.append(i)
        else:
            # arr[i+1] is guaranteed to be a duplicate of arr[i]
            second_dups.append(i)
    second_dups.append(indices[-1])
    return new_indices, second_dups

def reshape_arr_(arr, close):
    indices = range(1, len(arr))
    dup_mask = np.zeros(arr.shape, bool)
    while indices:
        indices, second_dups = filter_indices(custom_avg(arr, indices, close))
        # print(f"n_inds = {len(indices)};\tn_dups = {len(second_dups)}")
        dup_mask[second_dups] = True
    return np.unique(arr[~dup_mask])
The basic ideas are the following:
First, consider two adjacent elements (i,j) with j = i + 1. If arr[j] - arr[i] >= close in current iteration, arr[j] - arr[i] >= close also holds after the current iteration. This is because arr[i] can only decrease and arr[j] can only increase. So, if (i,j) pair is not averaged in the current iteration, it will not be averaged in any of the subsequent iterations. So, we can avoid looking at (i,j) in the future.
Second, if (i,j) is averaged and (i+1,j+1) is not, we know that arr[i] is a duplicate of arr[j]. Also, the last modified element in each iteration is always a duplicate.
Based on these observations, we need to process fewer and fewer indices in each iteration. The worst case is still O(N^2), which can be witnessed by setting close = arr.max() - arr.min() + 1.
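A small usage example of the quadratic version (my addition), with values chosen so that the last cluster needs a second merging pass; the input is assumed to be sorted and unique, as in the benchmarks below:
arr = np.array([1.0, 1.05, 2.0, 3.0, 3.04, 3.07])
print(reshape_arr_(arr.copy(), close=0.1))   # [1.025 2. 3.0375]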
Some benchmarks:
from timeit import timeit

make_setup = """
from __main__ import np, pd, reshape_arr, reshape_arr_, reshape_arr_binom
np.random.seed(0)
arr = np.sort(np.unique(np.random.rand({N})))
close = {close}""".format

def benchmark(N, close):
    np.random.seed(0)
    setup = make_setup(N=N, close=close)
    print('Original:')
    print(timeit(
        stmt='reshape_arr(pd.Series(arr.copy()), close)',
        # setup='from __main__ import reshape_arr; import pandas as pd',
        setup=setup,
        number=1,
    ))
    print('Quadratic:')
    print(timeit(
        stmt='reshape_arr_(arr.copy(), close)',
        setup=setup,
        number=10,
    ))
    print('Binomial:')
    print(timeit(
        stmt='reshape_arr_binom(arr.copy(), close)',
        setup=setup,
        number=10,
    ))

if __name__ == '__main__':
    print('N=10_000, close=1/N')
    benchmark(10_000, 1/10_000)
    print('N=100, close=1')
    benchmark(100, 1)
# N=10_000, close=1/N
# Original:
# 14.855983458999999
# Quadratic:
# 0.35902471400000024
# Binomial:
# 0.7207887170000014
# N=100, close=1
# Original:
# 4.132993569
# Quadratic:
# 0.11140068399999947
# Binomial:
# 0.007650813999998007
The following table shows how the number of pairs we need to look at in the quadratic algorithm goes down each iteration.
n_inds = 39967; n_dups = 23273
n_inds = 25304; n_dups = 14663
n_inds = 16032; n_dups = 9272
n_inds = 10204; n_dups = 5828
n_inds = 6503; n_dups = 3701
n_inds = 4156; n_dups = 2347
n_inds = 2675; n_dups = 1481
n_inds = 1747; n_dups = 928
n_inds = 1135; n_dups = 612
n_inds = 741; n_dups = 394
n_inds = 495; n_dups = 246
n_inds = 327; n_dups = 168
n_inds = 219; n_dups = 108
n_inds = 145; n_dups = 74
n_inds = 95; n_dups = 50
n_inds = 66; n_dups = 29
n_inds = 48; n_dups = 18
n_inds = 36; n_dups = 12
n_inds = 26; n_dups = 10
n_inds = 20; n_dups = 6
n_inds = 15; n_dups = 5
n_inds = 10; n_dups = 5
n_inds = 6; n_dups = 4
n_inds = 3; n_dups = 3
n_inds = 1; n_dups = 2
n_inds = 0; n_dups = 1
You can use the following function to produce similar output to yours (with the difference that the result from your function is unsorted, since a is never sorted outside the loop and pd.Series.unique returns values in order of appearance; if this is actually desired, check the second function). Sorting the array on every loop iteration is not required, since replacing with the average of two subsequent (unique) items in a sorted array will not invalidate the sorting. Since on every iteration the comparison with next_item will be the comparison with prev_item during the next iteration, you can just compare subsequent elements in a pairwise manner once.
def solve_sorted(a, close):
    """Returns the reduced unique values as a sorted array."""
    a = a.sort_values().values.astype(float)
    while True:
        a = np.unique(a)
        comp = a[1:] - a[:-1] < close
        if not comp.sum():
            break
        indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
        indices[:, 1] += 1
        avg = a[indices].mean(axis=1)
        a[indices.ravel()] = np.repeat(avg, 2)
    return np.unique(a)
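A short usage example (my addition); like the original function, solve_sorted expects a pandas Series:
s = pd.Series([7050.0, 7010.0, 7000.0, 8000.0])
print(solve_sorted(s, close=100))   # [7017.5 8000.] - the three close values collapse over two passes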
If it is important to preserve the original order of elements then you can store the reverse sorting indices once at the beginning to restore the original order at the end:
def solve_preserve_order(a, close):
    """Returns the reduced unique values in their original order."""
    reverse_indices = np.argsort(np.argsort(a.values))
    a = a.sort_values()
    while True:
        b = a.unique()
        comp = b[1:] - b[:-1] < close
        if not comp.sum():
            break
        indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
        indices[:, 1] += 1
        avg = b[indices].mean(axis=1)
        a.replace(b[indices.ravel()], np.repeat(avg, 2), inplace=True)
    return a.iloc[reverse_indices].unique()
Performance comparison
Testing the performance of the different presented algorithms for sorted, unique-valued input arrays (code attached below). Functions:
reshape_arr_binom, reshape_arr_
solve_sorted
Performance scaling with the size of the input array (plot omitted here), using close = 1 / arr.size.
Scaling with the interval length (plot omitted here), using arr.size == 1_000; close is the interval length.
Source code
"""Performance plots.
Assuming a sorted, unique-valued array as an input.
Function names have format `a<id>_*` where <id> is the answer's id."""
from timeit import timeit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import perfplot
from scipy.stats import binom
from scipy.special import comb
def OP_reshape_arr(a, close):
flag = True
while flag:
array = a.sort_values().unique()
l = len(array)
flag = False
for i in range(l):
previous_item = next_item = None
if i > 0:
previous_item = array[i - 1]
if i < (l - 1):
next_item = array[i + 1]
if previous_item is not None:
if abs(array[i] - previous_item) < close:
average = (array[i] + previous_item) / 2
flag = True
a.replace(previous_item, value=average, inplace=True)
a.replace(array[i], value=average, inplace=True)
if next_item is not None:
if abs(next_item - array[i]) < close:
flag = True
average = (array[i] + next_item) / 2
a.replace(array[i], value=average, inplace=True)
a.replace(next_item, value=average, inplace=True)
return a.unique()
def _binom_mean(arr, scipy_cutoff=64):
n = arr.shape[0]
if arr.shape[0] == 1:
return arr[0]
elif n < scipy_cutoff:
return np.average(arr, weights=comb(n-1, np.arange(n), exact=False))
else:
f = binom(n-1, 0.5).pmf
return np.average(arr, weights=f(np.arange(n)))
def a57438948_reshape_arr_binom(arr, close):
d = np.ediff1d(arr, to_begin=0) < close
close_chunks = np.split(arr, np.where(~d)[0])
return np.fromiter(
(_binom_mean(c) for c in close_chunks),
dtype=np.float
)
def _custom_avg(arr, indices, close):
new_indices = list()
last = indices[-1]
for i in indices:
if arr[i] - arr[i-1] < close:
new_indices.append(i)
avg = (arr[i-1] + arr[i]) / 2
arr[i-1] = avg
if i != last and arr[i+1] - arr[i] >= close:
arr[i] = avg
return new_indices
def _filter_indices(indices):
new_indices = list()
second_dups = list()
if not indices:
return new_indices, second_dups
for i, j in zip(indices, indices[1:]):
if i + 1 == j:
new_indices.append(i)
else:
second_dups.append(i)
second_dups.append(indices[-1])
return new_indices, second_dups
def a57438948_reshape_arr_(arr, close):
indices = range(1, len(arr))
dup_mask = np.zeros(arr.shape, bool)
while indices:
indices, second_dups = _filter_indices(_custom_avg(arr, indices, close))
dup_mask[second_dups] = True
return np.unique(arr[~dup_mask])
def a57438149_solve_sorted(a, close):
while True:
comp = a[1:] - a[:-1] < close
if not comp.sum():
break
indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
indices[:, 1] += 1
avg = a[indices].mean(axis=1)
a[indices.ravel()] = np.repeat(avg, 2)
a = np.unique(a)
return a
np.random.seed(0)
a = np.unique(np.random.rand(10_000))
c = 1/a.size
ref = OP_reshape_arr(pd.Series(a.copy()), c)
test = [
a57438948_reshape_arr_binom(a.copy(), c),
a57438948_reshape_arr_(a.copy(), c),
a57438149_solve_sorted(a, c),
]
assert all(x.shape == ref.shape and np.allclose(x, ref) for x in test)
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
perfplot.bench(
setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(n)), 1/n),
kernels=[
lambda x: OP_reshape_arr(pd.Series(x[0].copy()), x[1]),
lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
lambda x: a57438149_solve_sorted(x[0], x[1]),
],
labels=['OP_reshape_arr', 'reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
n_range=np.logspace(2, 4, 8).astype(int),
xlabel='size of initial array (before np.unique; using interval length of 1/n)',
logx=True,
logy=True,
colors=colors,
automatic_order=False,
).plot()
plt.gca().set_xlim([1e2, 1e4])
plt.gca().set_ylim([1e-4, 20])
plt.savefig('scaling_with_array_size.png')
plt.close()
np.random.seed(0)
a = np.unique(np.random.rand(1_000_000))
c = 1/a.size
test = [
a57438948_reshape_arr_binom(a.copy(), c),
a57438948_reshape_arr_(a.copy(), c),
a57438149_solve_sorted(a, c),
]
assert all(x.shape == test[0].shape and np.allclose(x, test[0]) for x in test)
perfplot.bench(
setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(n)), 1/n),
kernels=[
lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
lambda x: a57438149_solve_sorted(x[0], x[1]),
],
labels=['reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
n_range=np.logspace(4, 6, 5).astype(int),
xlabel='size of initial array (before np.unique; using interval length of 1/n)',
logx=True,
logy=True,
colors=colors[1:],
automatic_order=False,
).plot()
plt.gca().set_xlim([1e4, 1e6])
plt.gca().set_ylim([5e-4, 10])
plt.savefig('scaling_with_array_size_2.png')
plt.close()
perfplot.bench(
setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(1_000)), n),
kernels=[
lambda x: OP_reshape_arr(pd.Series(x[0].copy()), x[1]),
lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
lambda x: a57438149_solve_sorted(x[0], x[1]),
],
labels=['OP_reshape_arr', 'reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
n_range=np.logspace(-6, -2, 16),
xlabel='length of interval (using array of size 1,000)',
logx=True,
logy=True,
colors=colors,
automatic_order=False,
).plot()
plt.gca().set_xlim([1e-6, 1e-2])
plt.gca().set_ylim([2e-5, 1e3])
plt.savefig('scaling_with_interval_length.png')
plt.close()

numpy shuffle with constraint

I would like to shuffle a 1-d numpy array, with the constraint that no elements match the corresponding elements (ie., same index) from another array of the same shape. It can be assumed that all elements of each array are unique.
For example,
a = np.arange(10)
b = a.copy()
np.random.shuffle(b)
np.where(a==b) # This should be empty
What's the best way? Any ideas?
Adapted from georg's answer here
def random_derangement(n):
    while True:
        v = np.arange(n)
        for j in np.arange(n - 1, -1, -1):
            p = np.random.randint(0, j+1)
            if v[p] == j:
                break
            else:
                v[j], v[p] = v[p], v[j]
        else:
            if v[0] != 0:
                return v
def random_derangement(N):
    original = np.arange(N)
    new = np.random.permutation(N)
    same = np.where(original == new)[0]
    while len(same) != 0:
        swap = same[np.random.permutation(len(same))]
        new[same] = new[swap]
        same = np.where(original == new)[0]
        if len(same) == 1:
            swap = np.random.randint(0, N)
            new[[same[0], swap]] = new[[swap, same[0]]]
    return new
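A quick usage sketch (my addition) checking the constraint from the question; it works with either version of random_derangement:
import numpy as np

a = np.arange(10)
b = a[random_derangement(len(a))]   # reorder a using the derangement indices
print(np.where(a == b)[0])          # empty: no element stayed at its original index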
