Python pandas: faster way of iterating over rows with a complicated calculation

I've implemented a sort of object stability calculator in pandas, but the running time is horrible. Can someone help me, please?
def calculate_stability(ind, df, sidx, max_k):
    indexes = sidx[:, ind]
    indexes = np.delete(indexes, np.where(indexes == ind))
    d = 0
    last_crtit_obj_count = 0
    for j in range(max_k):
        if df.at[ind, "Class"] == df.at[indexes[j], "Class"]:
            d = d + 1
        if d / (j + 1) > 1 / 2:
            last_crtit_obj_count = (j + 1)
    print(f'\t Object {ind} = {last_crtit_obj_count / max_k}')
    return last_crtit_obj_count / max_k
df.iloc was very slow; that's why I changed to df.at. The code is shown above.
I need a vectorized version of the loop.

Here is the version without the loop:
def calculate_stability(ind, df, sidx, max_k):
    indexes = sidx[:, ind]
    indexes = indexes[indexes != ind][:max_k]
    # `d` contains all cumulative counts from the first condition of the original loop:
    d = (df["Class"][ind] == df["Class"][indexes]).cumsum()
    # `j` contains all values from the original `range` + 1:
    j = np.arange(1, len(d) + 1)
    # select the `last_crtit_obj_count` candidates:
    crtit_objs = j[d / j > 1 / 2]
    # calculate `last_crtit_obj_count / max_k`
    result = crtit_objs[-1] / max_k if len(crtit_objs) else 0
    print(f"\t Object {ind} = {result}")
    return result
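For completeness, here is a small usage sketch of my own (not from the question). It assumes sidx is a matrix of neighbour indices sorted by distance, e.g. the column-wise argsort of a pairwise distance matrix; the question does not spell this out, so treat it as an assumption:

import numpy as np
import pandas as pd

# toy data: 6 objects with one feature and a class label (my own example)
df = pd.DataFrame({
    "Feature": [0.1, 0.2, 0.25, 0.8, 0.85, 0.9],
    "Class":   ["a", "a", "a", "b", "b", "b"],
})

# pairwise distances; column `ind` of sidx lists all object indices
# sorted by distance to object `ind` (closest first, self included)
dist = np.abs(df["Feature"].values[:, None] - df["Feature"].values[None, :])
sidx = np.argsort(dist, axis=0)

max_k = 3
stabilities = [calculate_stability(ind, df, sidx, max_k) for ind in range(len(df))]
print(stabilities)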

Related

merge, heap, and quick sort counts are not coming out properly

import random, timeit

# Quick sort
def quick_sort(A, first, last):
    global Qs, Qc
    if first >= last: return
    left, right = first + 1, last
    pivot = A[first]
    while left <= right:
        while left <= last and A[left] < pivot:
            Qc = Qc + 1
            left = left + 1
        while right > first and A[right] >= pivot:
            Qc = Qc + 1
            right = right - 1
        if left <= right:
            A[left], A[right] = A[right], A[left]
            Qs = Qs + 1
            left = left + 1
            right = right - 1
    A[first], A[right] = A[right], A[first]
    Qs = Qs + 1
    quick_sort(A, first, right - 1)
    quick_sort(A, right + 1, last)

# Merge sort
def merge_sort(A, first, last):  # merge sort A[first] ~ A[last]
    global Ms, Mc
    if first >= last: return
    middle = (first + last) // 2
    merge_sort(A, first, middle)
    merge_sort(A, middle + 1, last)
    B = []
    i = first
    j = middle + 1
    while i <= middle and j <= last:
        Mc = Mc + 1
        if A[i] <= A[j]:
            B.append(A[i])
            i += 1
        else:
            B.append(A[j])
            j += 1
    for i in range(i, middle + 1):
        B.append(A[i])
        Ms = Ms + 1
    for j in range(j, last + 1):
        B.append(A[j])
    for k in range(first, last + 1): A[k] = B[k - first]

# Heap sort
def heap_sort(A):
    global Hs, Hc
    n = len(A)
    for i in range(n - 1, -1, -1):
        while 2 * i + 1 < n:
            left, right = 2 * i + 1, 2 * i + 2
            if left < n and A[left] > A[i]:
                m = left
                Hc += 1
            else:
                m = i
                Hc += 1
            if right < n and A[right] > A[m]:
                m = right
                Hc += 1
            if m != i:
                A[i], A[m] = A[m], A[i]
                i = m
                Hs += 1
            else:
                break
    for i in range(n - 1, -1, -1):
        A[0], A[i] = A[i], A[0]
        n -= 1
        k = 0
        while 2 * k + 1 < n:
            left, right = 2 * k + 1, 2 * k + 2
            if left < n and A[left] > A[k]:
                m = left
                Hc += 1
            else:
                m = k
                Hc += 1
            if right < n and A[right] > A[m]:
                m = right
                Hc += 1
            if m != k:
                A[k], A[m] = A[m], A[k]
                k = m
                Hs += 1
            else:
                break

#
def check_sorted(A):
    for i in range(n - 1):
        if A[i] > A[i + 1]: return False
    return True
#
#
Qc, Qs, Mc, Ms, Hc, Hs = 0, 0, 0, 0, 0, 0
n = int(input())
random.seed()
A = []
for i in range(n):
    A.append(random.randint(-1000, 1000))
B = A[:]
C = A[:]

print("")
print("Quick sort:")
print("time =", timeit.timeit("quick_sort(A, 0, n-1)", globals=globals(), number=1))
print(" comparisons = {:10d}, swaps = {:10d}\n".format(Qc, Qs))
print("Merge sort:")
print("time =", timeit.timeit("merge_sort(B, 0, n-1)", globals=globals(), number=1))
print(" comparisons = {:10d}, swaps = {:10d}\n".format(Mc, Ms))
print("Heap sort:")
print("time =", timeit.timeit("heap_sort(C)", globals=globals(), number=1))
print(" comparisons = {:10d}, swaps = {:10d}\n".format(Hc, Hs))
assert(check_sorted(A))
assert(check_sorted(B))
assert(check_sorted(C))
I wrote this code to measure how long it takes to sort a list of size n (read from input) with three different sorting algorithms. However, the results are not what I expected.
Quick sort:
time = 0.0001289689971599728
comparisons = 474, swaps = 168
Merge sort:
time = 0.00027709499408956617
comparisons = 541, swaps = 80
Heap sort:
time = 0.0002578190033091232
comparisons = 744, swaps = 478
Quick sort:
time = 1.1767549149953993
comparisons = 3489112, swaps = 352047
Merge sort:
time = 0.9040642600011779
comparisons = 1536584, swaps = 77011
Heap sort:
time = 1.665754442990874
comparisons = 2227949, swaps = 1474542
Quick sort:
time = 4.749891302999458
comparisons = 11884246, swaps = 709221
Merge sort:
time = 3.1966246420051903
comparisons = 3272492, swaps = 154723
Heap sort:
time = 6.2041203819972
comparisons = 4754829, swaps = 3148479
As you can see, my results are very different from what I learned. Can you please tell me why quicksort is not the fastest in my code, and why merge sort is the fastest one?
I can see that you are choosing the first element of the array as the pivot in quicksort. Now, consider the order of the elements of the unsorted array. Is it random? How do you generate the input array?
You see, if the pivot is either the min or max value of the array, or somewhere close to the min/max value, the running time of quicksort in that case (the worst case) will be on the order of O(n^2). That is because on each iteration you partition the array by breaking off only one element.
For optimal quicksort performance of O(n log n), your pivot should be as close to the median value as possible. To increase the likelihood of that being the case, consider initially picking 3 values at random from the array and using their median as the pivot. Obviously, the more values you choose the median from, the better the probability that your pivot is efficient, but you add extra work by choosing those values to begin with, so it's a trade-off. One could presumably even work out how many elements should be sampled, in relation to the size of the array, for optimal performance.
Merge sort, on the other hand, always has complexity on the order of O(n log n) irrespective of input, which is why you got consistent results with it over different samples.
TL;DR: my guess is that the input array's first element is very close to being the smallest or largest value of that array, and it ends up being the pivot of your quicksort algorithm.
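To make the median-of-three idea concrete, here is a minimal sketch of my own, adapted to the partition scheme from the question; the names quick_sort_m3 and median_of_three are hypothetical helpers, not part of the original code, and the comparison/swap counters are omitted:

import random

def median_of_three(A, first, last):
    # pick three random positions in A[first:last+1] and return the index
    # whose value is the median of the three sampled values
    i, j, k = (random.randint(first, last) for _ in range(3))
    return sorted((i, j, k), key=lambda idx: A[idx])[1]

def quick_sort_m3(A, first, last):
    if first >= last:
        return
    # swap a median-of-three candidate into the first position,
    # then partition exactly as before with A[first] as the pivot
    p = median_of_three(A, first, last)
    A[first], A[p] = A[p], A[first]
    pivot = A[first]
    left, right = first + 1, last
    while left <= right:
        while left <= last and A[left] < pivot:
            left += 1
        while right > first and A[right] >= pivot:
            right -= 1
        if left <= right:
            A[left], A[right] = A[right], A[left]
            left += 1
            right -= 1
    A[first], A[right] = A[right], A[first]
    quick_sort_m3(A, first, right - 1)
    quick_sort_m3(A, right + 1, last)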

How to find the max of the sums of the absolute values of each column in a matrix

I am trying to write a function that finds the maximum of the sums of the absolute values in each column of a matrix, without using a numpy function.
For example, given the following array, I want the answer 2.7657527806024733.
A = np.array([[0.94369777, 0.34434054, 0.80366952, 0.665736],
[0.82367659, 0.13791176, 0.6993436, 0.44473609],
[0.82337673, 0.56936686, 0.46648214, 0.50403736]])
This is the code I have so far:
def L1Norm(M):
    x = 0
    S = np.shape(M)
    N = S[0]
    P = S[1]
    answer = np.zeros((1, P))
    for j in range(P):
        t = 0
        for i in M:
            t += np.abs(i[j])
        answer = np.append(answer, t)
    s = np.shape(answer)
    n = s[0]
    p = s[1]
    for j in range(p):
        if answer[0][j] > x:
            x = answer[0][j]
    return x
But I keep getting the following error:
IndexError Traceback (most recent call last)
<ipython-input-113-e06e08ab836c> in <module>
----> 1 L1Norm(A)
<ipython-input-112-624908415c12> in L1Norm(M)
12 s = np.shape(answer)
13 n = s[0]
---> 14 p = s[1]
15 for j in range(p):
16 if answer[0][j] > x:
IndexError: tuple index out of range
Any ideas about how I could fix this?
Here's my solution. I loop over the columns and push each sum into a list, then I loop over that list to find the largest value. It's very verbose, but it doesn't use numpy for anything but creating the matrix.
import numpy as np

matrix = np.array([[0.94369777, 0.34434054, 0.80366952, 0.665736],
                   [0.82367659, 0.13791176, 0.6993436, 0.44473609],
                   [0.82337673, 0.56936686, 0.46648214, 0.50403736]])

matrixShape = np.shape(matrix)

i = 0
j = 0
sumsOfColumns = []

while j < matrixShape[1]:
    sumOfElems = 0
    i = 0
    while i < matrixShape[0]:
        sumOfElems += matrix[i, j]
        i += 1
    sumsOfColumns.append(sumOfElems)
    j += 1

print(sumsOfColumns)

maxValue = 0
for value in sumsOfColumns:
    if value > maxValue:
        maxValue = value

print(maxValue)
repl: https://repl.it/#ShroomCode/FrequentFunnyDisplaymanager
If you're looking to get a max sum of columns, here is a super simple approach using a pandas.DataFrame:
import numpy as np
import pandas as pd

vals = np.array([[0.94369777, 0.34434054, 0.80366952, 0.665736],
                 [0.82367659, 0.13791176, 0.6993436, 0.44473609],
                 [0.82337673, 0.56936686, 0.46648214, 0.50403736]])

# Store values to a DataFrame.
df = pd.DataFrame(vals)
# Get the max of column sums.
max_sum = df.sum(axis=0).max()
As a Function:
def max_col_sum(vals):
    max_sum = pd.DataFrame(vals).sum(axis=0).max()
    return max_sum
Output:
2.59075109
With numpy you can get each column as an array by using my_np_array[:, column_number].
So using this you can do a for loop over the columns:
sums = []
for i in range(np.shape(my_np_array)[1]):
    sums.append(sum(my_np_array[:, i]))
max_sum = max(sums)
To solve without numpy, we can go through each row, adding each value to its corresponding column tally:
import numpy as np

answer = np.array([[0.94369777, 0.34434054, 0.80366952, 0.665736],
                   [0.82367659, 0.13791176, 0.6993436, 0.44473609],
                   [0.82337673, 0.56936686, 0.46648214, 0.50403736]])

# Convert our numpy array to a normal list of lists
a = answer.tolist()

# list comprehension to initialise one running total per column
sums = [0 for x in range(len(a[0]))]

for i in range(0, len(a)):
    for j in range(0, len(a[i])):
        sums[j] += a[i][j]

# Get the max sum
max_sum = max(sums)
print(max_sum)
Simple answer using zip, np.sum
Code
def L1Norm(M):
    return max([np.sum(column) for column in zip(*M)])
Result
2.59075109
Explanation
List comprehension to loop over data in each column with:
[... for column in zip(*M)]
Sum column values with
np.sum(column)
Compute Max of list comprehension with:
max([...])
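For instance (my own illustration, not part of the answer), zip(*M) yields the columns of the matrix as tuples, which is what makes the one-liner work:

M = [[1, 2], [3, 4], [5, 6]]
print(list(zip(*M)))   # [(1, 3, 5), (2, 4, 6)]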
Please try the following:
A.sum(0).max()
or
max(sum(A))
Both should give you the desired answer.

Finding first pair of numbers in array that sum to value

I'm trying to solve the following Codewars problem: https://www.codewars.com/kata/sum-of-pairs/train/python
Here is my current implementation in Python:
def sum_pairs(ints, s):
    right = float("inf")
    n = len(ints)
    m = {}
    dup = {}
    for i, x in enumerate(ints):
        if x not in m.keys():
            m[x] = i  # Track first index of x using hash map.
        elif x in m.keys() and x not in dup.keys():
            dup[x] = i
    for x in m.keys():
        if s - x in m.keys():
            if x == s - x and x in dup.keys():
                j = m[x]
                k = dup[x]
            else:
                j = m[x]
                k = m[s - x]
            comp = max(j, k)
            if comp < right and j != k:
                right = comp
    if right > n:
        return None
    return [s - ints[right], ints[right]]
The code seems to produce correct results; however, the input can be an array with up to 10,000,000 elements, so execution times out for large inputs. I need help optimizing/modifying the code so that it can handle sufficiently large arrays.
Your code is inefficient for the large-list test cases, so it gives a timeout error. Instead you can do:
def sum_pairs(lst, s):
    seen = set()
    for item in lst:
        if s - item in seen:
            return [s - item, item]
        seen.add(item)
We add values to seen until we find a value that, together with one of the already-seen values, produces the specified sum.
For more information, see: Reference link
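A quick usage check of the set-based version (the test values here are my own choice; the expected outputs follow from the definition above):

print(sum_pairs([11, 3, 7, 5], 10))    # [3, 7]
print(sum_pairs([4, 3, 2, 3, 4], 6))   # [4, 2]
print(sum_pairs([0, 0, -2, 3], 2))     # None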
Maybe this code:
def sum_pairs(lst, s):
    c = 0
    while c < len(lst) - 1:
        if c != len(lst) - 1:
            x = lst[c]
            spam = c + 1
            while spam < len(lst):
                nxt = lst[spam]
                if nxt + x == s:
                    return [x, nxt]
                spam += 1
        else:
            return None
        c += 1

lst = [5, 6, 5, 8]
s = 14
print(sum_pairs(lst, s))
Output:
[6, 8]
This answer unfortunately still times out, even though it's supposed to run in O(n log n) (it is dominated by the sort, the rest of the algorithm running in O(n)). I'm not sure how you can obtain better than this complexity, but I thought I might put this idea out there.
def sum_pairs(ints, s):
    ints_with_idx = enumerate(ints)
    # Sort the array of ints
    ints_with_idx = sorted(ints_with_idx, key=lambda pair: pair[1])
    diff = 1000000
    l = 0
    r = len(ints) - 1
    # Indexes of the sum operands in sorted array
    lSum = 0
    rSum = 0
    while l < r:
        # Compute the absolute difference between the current sum and the desired sum
        sum = ints_with_idx[l][1] + ints_with_idx[r][1]
        absDiff = abs(sum - s)
        if absDiff < diff:
            # Update the best difference
            lSum = l
            rSum = r
            diff = absDiff
        elif sum > s:
            # Decrease the large value
            r -= 1
        else:
            # Test to see if the indexes are better (more to the left) for the same difference
            if absDiff == diff:
                rightmostIdx = max(ints_with_idx[l][0], ints_with_idx[r][0])
                if rightmostIdx < max(ints_with_idx[lSum][0], ints_with_idx[rSum][0]):
                    lSum = l
                    rSum = r
            # Increase the small value
            l += 1
    # Retrieve indexes of sum operands
    aSumIdx = ints_with_idx[lSum][0]
    bSumIdx = ints_with_idx[rSum][0]
    # Retrieve values of operands for sum in correct order
    aSum = ints[min(aSumIdx, bSumIdx)]
    bSum = ints[max(aSumIdx, bSumIdx)]
    if aSum + bSum == s:
        return [aSum, bSum]
    else:
        return None

Optimize iteration through numpy array when averaging adjacent values

I have a function in Python that:
Iterates over a sorted, distinct array of floats
Gets the previous and next item
Finds out whether they are within a certain range of each other
Averages them out, and replaces the original values with the averaged value
Reruns through that loop until there are no more changes
Returns a distinct array
The issue is that it is extremely slow. The array "a" could have 100k+ elements, and it takes 7-10 minutes to complete.
I found that I needed to iterate over the array again after the initial iteration, because after averaging, the averaged values could sometimes be within range to be averaged again.
I thought about breaking it into chunks and using multiprocessing, but my concern is that the end of one chunk and the beginning of the next chunk would also need to be averaged.
def reshape_arr(a, close):
    """Iterates through 'a' to find values +- 'close', and averages them, then returns a distinct array of values"""
    flag = True
    while flag:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    # find matching values in a, and replace with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)
            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a.unique()
a is a pandas.Series taken from a DataFrame with anything between 0 and 200k rows, and close is an int (100, for example).
It works, just very slowly.
First, if the length of the input array a is large and close is relatively small, your proposed algorithm may be numerically unstable.
That being said, here are some ideas that reduce the time complexity from O(N^3) to O(N) (for an approximate implementation) or O(N^2) (for an equivalent implementation). For N=100, this gives a speedup up to a factor of 6000 for some choices of arr and close.
Consider an input array arr = [a,b,c,d], and suppose that close > d - a. In this case, the algorithm proceeds as follows:
[a,b,c,d]
[(a+b)/2,(b+c)/2,(c+d)/2]
[(a+2b+c)/4,(b+2c+d)/4]
[(a+3b+3c+d)/8]
One can recognize that if [x_1, x_2, ..., x_n] is a maximal contiguous subarray of arr s.t. x_i - x_{i-1} < close, then [x_1, x_2, ..., x_n] eventually evaluates to (sum_{k=0}^{n-1} x_{k+1} * C(n-1, k)) / 2^(n-1), where C(n-1, k) is the binomial coefficient "n-1 choose k".
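A quick numerical check of this identity (my own illustration, using an arbitrary chunk whose neighbouring differences are all assumed to be below close):

import numpy as np
from scipy.special import comb

x = np.array([1.0, 2.0, 4.0, 8.0])

# repeated pairwise averaging, as the original algorithm effectively does
chunk = x.copy()
while len(chunk) > 1:
    chunk = (chunk[:-1] + chunk[1:]) / 2

n = len(x)
weights = comb(n - 1, np.arange(n))          # binomial coefficients 1, 3, 3, 1
binom_avg = (x * weights).sum() / 2 ** (n - 1)

print(chunk[0], binom_avg)                   # both print 3.375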
This gives an O(N) implementation as follows:
import numpy as np
from scipy.stats import binom
from scipy.special import comb

def binom_mean(arr, scipy_cutoff=64):
    """
    Given an array arr, returns an average of arr
    weighted by binomial coefficients.
    """
    n = arr.shape[0]
    if arr.shape[0] == 1:
        return arr[0]
    # initializing a scipy binomial random variable can be slow
    # so, if short runs are likely, we can speed things up
    # by doing explicit computations
    elif n < scipy_cutoff:
        return np.average(arr, weights=comb(n - 1, np.arange(n), exact=False))
    else:
        f = binom(n - 1, 0.5).pmf
        return np.average(arr, weights=f(np.arange(n)))

def reshape_arr_binom(arr, close):
    d = np.ediff1d(arr, to_begin=0) < close
    close_chunks = np.split(arr, np.where(~d)[0])
    return np.fromiter(
        (binom_mean(c) for c in close_chunks),
        dtype=float
    )
The result is within 10e-15 of your implementation for np.random.seed(0);N=1000;cost=1/N;arr=np.random.rand(N). However, for large N, this may not be meaningful unless cost is small. For the above parameter values, this is 270 times faster than the original code on my machine.
However, if we choose a modest value of N = 100 and set close to a large value like 1, the speedup is by a factor of 6000. This is because for large values of close, the original implementation is O(N^3); specifically, a.replace is potentially called O(N^2) times and has a cost O(N). So, maximal speedup is achieved when contiguous elements are likely to be close.
For reference, here is an O(N^2) implementation that is equivalent to your code (I do not recommend using this in practice).
import pandas as pd
import numpy as np
np.random.seed(0)

def custom_avg(arr, indices, close):
    new_indices = list()
    last = indices[-1]
    for i in indices:
        if arr[i] - arr[i-1] < close:
            new_indices.append(i)
            avg = (arr[i-1] + arr[i]) / 2
            arr[i-1] = avg
            if i != last and arr[i+1] - arr[i] >= close:
                arr[i] = avg
    return new_indices

def filter_indices(indices):
    new_indices = list()
    second_dups = list()
    # handle empty index case
    if not indices:
        return new_indices, second_dups
    for i, j in zip(indices, indices[1:]):
        if i + 1 == j:
            # arr[i] is guaranteed to be different from arr[i-1]
            new_indices.append(i)
        else:
            # arr[i+1] is guaranteed to be a duplicate of arr[i]
            second_dups.append(i)
    second_dups.append(indices[-1])
    return new_indices, second_dups

def reshape_arr_(arr, close):
    indices = range(1, len(arr))
    dup_mask = np.zeros(arr.shape, bool)
    while indices:
        indices, second_dups = filter_indices(custom_avg(arr, indices, close))
        # print(f"n_inds = {len(indices)};\tn_dups = {len(second_dups)}")
        dup_mask[second_dups] = True
    return np.unique(arr[~dup_mask])
The basic ideas are the following:
First, consider two adjacent elements (i,j) with j = i + 1. If arr[j] - arr[i] >= close in current iteration, arr[j] - arr[i] >= close also holds after the current iteration. This is because arr[i] can only decrease and arr[j] can only increase. So, if (i,j) pair is not averaged in the current iteration, it will not be averaged in any of the subsequent iterations. So, we can avoid looking at (i,j) in the future.
Second, if (i,j) is averaged and (i+1,j+1) is not, we know that arr[i] is a duplicate of arr[j]. Also, the last modified element in each iteration is always a duplicate.
Based on these observations, we need to process fewer and fewer indices in each iteration. The worst case is still O(N^2), which can be witnessed by setting close = arr.max() - arr.min() + 1.
Some benchmarks:
from timeit import timeit

make_setup = """
from __main__ import np, pd, reshape_arr, reshape_arr_, reshape_arr_binom
np.random.seed(0)
arr = np.sort(np.unique(np.random.rand({N})))
close = {close}""".format

def benchmark(N, close):
    np.random.seed(0)
    setup = make_setup(N=N, close=close)
    print('Original:')
    print(timeit(
        stmt='reshape_arr(pd.Series(arr.copy()), close)',
        # setup='from __main__ import reshape_arr; import pandas as pd',
        setup=setup,
        number=1,
    ))
    print('Quadratic:')
    print(timeit(
        stmt='reshape_arr_(arr.copy(), close)',
        setup=setup,
        number=10,
    ))
    print('Binomial:')
    print(timeit(
        stmt='reshape_arr_binom(arr.copy(), close)',
        setup=setup,
        number=10,
    ))

if __name__ == '__main__':
    print('N=10_000, close=1/N')
    benchmark(10_000, 1/10_000)
    print('N=100, close=1')
    benchmark(100, 1)
# N=10_000, close=1/N
# Original:
# 14.855983458999999
# Quadratic:
# 0.35902471400000024
# Binomial:
# 0.7207887170000014
# N=100, close=1
# Original:
# 4.132993569
# Quadratic:
# 0.11140068399999947
# Binomial:
# 0.007650813999998007
The following table shows how the number of pairs we need to look at in the quadratic algorithm goes down each iteration.
n_inds = 39967; n_dups = 23273
n_inds = 25304; n_dups = 14663
n_inds = 16032; n_dups = 9272
n_inds = 10204; n_dups = 5828
n_inds = 6503; n_dups = 3701
n_inds = 4156; n_dups = 2347
n_inds = 2675; n_dups = 1481
n_inds = 1747; n_dups = 928
n_inds = 1135; n_dups = 612
n_inds = 741; n_dups = 394
n_inds = 495; n_dups = 246
n_inds = 327; n_dups = 168
n_inds = 219; n_dups = 108
n_inds = 145; n_dups = 74
n_inds = 95; n_dups = 50
n_inds = 66; n_dups = 29
n_inds = 48; n_dups = 18
n_inds = 36; n_dups = 12
n_inds = 26; n_dups = 10
n_inds = 20; n_dups = 6
n_inds = 15; n_dups = 5
n_inds = 10; n_dups = 5
n_inds = 6; n_dups = 4
n_inds = 3; n_dups = 3
n_inds = 1; n_dups = 2
n_inds = 0; n_dups = 1
You can use the following function to produce output similar to yours (with the difference that the result from your function is unsorted, since a is never sorted outside the loop and pd.Series.unique returns values in order of appearance; if that ordering is actually desired, check the second function). Sorting the array on every loop iteration is not required, since replacing two subsequent (unique) items in a sorted array with their average will not invalidate the sorting. And since on every iteration the comparison with next_item will be the comparison with prev_item during the next iteration, you can just compare subsequent elements in a pairwise manner once.
def solve_sorted(a, close):
    """Returns the reduced unique values as a sorted array."""
    a = a.sort_values().values.astype(float)
    while True:
        a = np.unique(a)
        comp = a[1:] - a[:-1] < close
        if not comp.sum():
            break
        indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
        indices[:, 1] += 1
        avg = a[indices].mean(axis=1)
        a[indices.ravel()] = np.repeat(avg, 2)
    return np.unique(a)
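For example (a toy input of my own; close = 0.5 is an arbitrary choice): the two values within 0.5 of each other are merged into their average, the rest are left alone.

import numpy as np
import pandas as pd

a = pd.Series([1.0, 1.4, 2.0, 10.0])
print(solve_sorted(a, 0.5))   # -> [1.2, 2.0, 10.0]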
If it is important to preserve the original order of elements then you can store the reverse sorting indices once at the beginning to restore the original order at the end:
def solve_preserve_order(a, close):
    """Returns the reduced unique values in their original order."""
    reverse_indices = np.argsort(np.argsort(a.values))
    a = a.sort_values()
    while True:
        b = a.unique()
        comp = b[1:] - b[:-1] < close
        if not comp.sum():
            break
        indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
        indices[:, 1] += 1
        avg = b[indices].mean(axis=1)
        a.replace(b[indices.ravel()], np.repeat(avg, 2), inplace=True)
    return a.iloc[reverse_indices].unique()
Performance comparison
Testing the performance of the different presented algorithms for sorted, unique-valued input arrays (code attached below). Functions:
reshape_arr_binom, reshape_arr_
solve_sorted
Performance scaling with the size of the input array
Using close = 1 / arr.size.
Scaling with the interval length
Using arr.size == 1_000; close is the interval length.
Source code
"""Performance plots.
Assuming a sorted, unique-valued array as an input.
Function names have format `a<id>_*` where <id> is the answer's id."""
from timeit import timeit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import perfplot
from scipy.stats import binom
from scipy.special import comb
def OP_reshape_arr(a, close):
flag = True
while flag:
array = a.sort_values().unique()
l = len(array)
flag = False
for i in range(l):
previous_item = next_item = None
if i > 0:
previous_item = array[i - 1]
if i < (l - 1):
next_item = array[i + 1]
if previous_item is not None:
if abs(array[i] - previous_item) < close:
average = (array[i] + previous_item) / 2
flag = True
a.replace(previous_item, value=average, inplace=True)
a.replace(array[i], value=average, inplace=True)
if next_item is not None:
if abs(next_item - array[i]) < close:
flag = True
average = (array[i] + next_item) / 2
a.replace(array[i], value=average, inplace=True)
a.replace(next_item, value=average, inplace=True)
return a.unique()
def _binom_mean(arr, scipy_cutoff=64):
n = arr.shape[0]
if arr.shape[0] == 1:
return arr[0]
elif n < scipy_cutoff:
return np.average(arr, weights=comb(n-1, np.arange(n), exact=False))
else:
f = binom(n-1, 0.5).pmf
return np.average(arr, weights=f(np.arange(n)))
def a57438948_reshape_arr_binom(arr, close):
d = np.ediff1d(arr, to_begin=0) < close
close_chunks = np.split(arr, np.where(~d)[0])
return np.fromiter(
(_binom_mean(c) for c in close_chunks),
dtype=np.float
)
def _custom_avg(arr, indices, close):
new_indices = list()
last = indices[-1]
for i in indices:
if arr[i] - arr[i-1] < close:
new_indices.append(i)
avg = (arr[i-1] + arr[i]) / 2
arr[i-1] = avg
if i != last and arr[i+1] - arr[i] >= close:
arr[i] = avg
return new_indices
def _filter_indices(indices):
new_indices = list()
second_dups = list()
if not indices:
return new_indices, second_dups
for i, j in zip(indices, indices[1:]):
if i + 1 == j:
new_indices.append(i)
else:
second_dups.append(i)
second_dups.append(indices[-1])
return new_indices, second_dups
def a57438948_reshape_arr_(arr, close):
indices = range(1, len(arr))
dup_mask = np.zeros(arr.shape, bool)
while indices:
indices, second_dups = _filter_indices(_custom_avg(arr, indices, close))
dup_mask[second_dups] = True
return np.unique(arr[~dup_mask])
def a57438149_solve_sorted(a, close):
while True:
comp = a[1:] - a[:-1] < close
if not comp.sum():
break
indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
indices[:, 1] += 1
avg = a[indices].mean(axis=1)
a[indices.ravel()] = np.repeat(avg, 2)
a = np.unique(a)
return a
np.random.seed(0)
a = np.unique(np.random.rand(10_000))
c = 1/a.size
ref = OP_reshape_arr(pd.Series(a.copy()), c)
test = [
a57438948_reshape_arr_binom(a.copy(), c),
a57438948_reshape_arr_(a.copy(), c),
a57438149_solve_sorted(a, c),
]
assert all(x.shape == ref.shape and np.allclose(x, ref) for x in test)
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
perfplot.bench(
setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(n)), 1/n),
kernels=[
lambda x: OP_reshape_arr(pd.Series(x[0].copy()), x[1]),
lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
lambda x: a57438149_solve_sorted(x[0], x[1]),
],
labels=['OP_reshape_arr', 'reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
n_range=np.logspace(2, 4, 8).astype(int),
xlabel='size of initial array (before np.unique; using interval length of 1/n)',
logx=True,
logy=True,
colors=colors,
automatic_order=False,
).plot()
plt.gca().set_xlim([1e2, 1e4])
plt.gca().set_ylim([1e-4, 20])
plt.savefig('scaling_with_array_size.png')
plt.close()
np.random.seed(0)
a = np.unique(np.random.rand(1_000_000))
c = 1/a.size
test = [
a57438948_reshape_arr_binom(a.copy(), c),
a57438948_reshape_arr_(a.copy(), c),
a57438149_solve_sorted(a, c),
]
assert all(x.shape == test[0].shape and np.allclose(x, test[0]) for x in test)
perfplot.bench(
setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(n)), 1/n),
kernels=[
lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
lambda x: a57438149_solve_sorted(x[0], x[1]),
],
labels=['reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
n_range=np.logspace(4, 6, 5).astype(int),
xlabel='size of initial array (before np.unique; using interval length of 1/n)',
logx=True,
logy=True,
colors=colors[1:],
automatic_order=False,
).plot()
plt.gca().set_xlim([1e4, 1e6])
plt.gca().set_ylim([5e-4, 10])
plt.savefig('scaling_with_array_size_2.png')
plt.close()
perfplot.bench(
setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(1_000)), n),
kernels=[
lambda x: OP_reshape_arr(pd.Series(x[0].copy()), x[1]),
lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
lambda x: a57438149_solve_sorted(x[0], x[1]),
],
labels=['OP_reshape_arr', 'reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
n_range=np.logspace(-6, -2, 16),
xlabel='length of interval (using array of size 1,000)',
logx=True,
logy=True,
colors=colors,
automatic_order=False,
).plot()
plt.gca().set_xlim([1e-6, 1e-2])
plt.gca().set_ylim([2e-5, 1e3])
plt.savefig('scaling_with_interval_length.png')
plt.close()

How would quickselect act differently if the pivot wasn't the middle term

Alright, so I have developed a generic quickselect function, and it is used to find the median of a list.
k = len(aList)//2 and the list is aList = [1,2,3,4,5]
How would the program act differently if the pivot started at the first item of the list each time? Do I have to make it the center element? Also, where should I start the time.clock() call in order to find the elapsed time of the function? Here is the code:
def quickSelect(aList, k):
    if len(aList) != 0:
        pivot = aList[(len(aList) // 2)]
        smallerList = []
        for i in aList:
            if i < pivot:
                smallerList.append(i)
        largerList = []
        for i in aList:
            if i > pivot:
                largerList.append(i)
        m = len(smallerList)
        count = len(aList) - len(smallerList) - len(largerList)
        if k >= m and k < m + count:
            return pivot
        elif m > k:
            return quickSelect(smallerList, k)
        else:
            return quickSelect(largerList, k - m - count)
I don't see any issue with placing the pivot at the beginning, but that is just a way to initialize the pivot; the whole idea of choosing a pivot is normally to land near the middle element.
Please try this for your time calculation:
import time

start_time = 0
aList = [1, 2, 3, 4, 5]
k = len(aList) // 2

def quickSelect(aList, k):
    start_time = time.time()
    # print("%10.6f" % start_time)
    # pivot = aList[0]
    if len(aList) != 0:
        pivot = aList[(len(aList) // 2)]
        smallerList = []
        for i in aList:
            if i < pivot:
                smallerList.append(i)
        largerList = []
        for i in aList:
            if i > pivot:
                largerList.append(i)
        m = len(smallerList)
        count = len(aList) - len(smallerList) - len(largerList)
        if k >= m and k < m + count:
            print("Pivot", pivot)
            # print("%10.6f" % time.time())
            print("time: ", time.time() - start_time)
            return pivot
        elif m > k:
            return quickSelect(smallerList, k)
        else:
            return quickSelect(largerList, k - m - count)

quickSelect(aList, k)
In this case the time comes out as zero because your list is very small.
Please let me know if I misinterpreted your question.
OUTPUT:
Pivot 3
time: 0.0
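A further note on where to start the timer (my own sketch, not from the answer above): because start_time is reassigned on every recursive call, the printed time only covers the last call. Timing the whole top-level call from the outside, using the print-free quickSelect from the question, avoids that:

import time
import timeit

aList = [1, 2, 3, 4, 5]
k = len(aList) // 2

# single measurement around the top-level call
t0 = time.perf_counter()
result = quickSelect(aList, k)
print("median =", result, "elapsed =", time.perf_counter() - t0)

# or average over many repetitions for a more stable estimate
print(timeit.timeit(lambda: quickSelect(aList, k), number=10_000))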
