Consider two arrays of different length:
A = np.array([58, 22, 86, 37, 64])
B = np.array([105, 212, 5, 311, 253, 419, 123, 461, 256, 464])
For each value in A, I want to find the smallest absolute difference between that value and the values in B. I use Pandas because my actual arrays are subsets of Pandas dataframes, but also because the apply method is a convenient (albeit slow) approach to taking the difference between two different-sized arrays:
In [22]: pd.Series(A).apply(lambda x: np.min(np.abs(x-B)))
Out[22]:
0 47
1 17
2 19
3 32
4 41
dtype: int64
BUT I also want to keep the sign, so the desired output is:
0 -47
1 17
2 -19
3 32
4 -41
dtype: int64
[update] My actual arrays A and B are approximately 5e4 and 1e6 in length, so a low-memory solution would be ideal. Also, I wish to avoid using Pandas because it is very slow on the actual arrays.
Let's use broadcasted subtraction here. We then use argmin to find the position of the minimum absolute difference, and extract the signed values in a subsequent step.
u = A[:,None] - B
idx = np.abs(u).argmin(axis=1)
u[np.arange(len(u)), idx]
# array([-47, 17, -19, 32, -41])
This uses pure NumPy broadcasting, so it should be quite fast.
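A note on memory, since the update mentions arrays of roughly 5e4 and 1e6 elements: the full broadcasted difference matrix would hold about 5e10 values, which likely will not fit in RAM. A chunked variant of the same idea (a sketch; the helper name is mine) keeps memory bounded while staying vectorised per block:
import numpy as np

def signed_min_diff_chunked(A, B, chunk_size=1024):
    # Process A in blocks so at most chunk_size * len(B) differences exist at once.
    out = np.empty(len(A), dtype=np.result_type(A, B))
    for start in range(0, len(A), chunk_size):
        chunk = A[start:start + chunk_size]
        u = chunk[:, None] - B                        # (len(chunk), len(B)) differences
        idx = np.abs(u).argmin(axis=1)                # position of smallest |diff| per row
        out[start:start + chunk_size] = u[np.arange(len(chunk)), idx]
    return out

# signed_min_diff_chunked(A, B) -> array([-47,  17, -19,  32, -41])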
Comprehension
I couldn't help myself. This is not what you should do! But, it is cute.
[min(x - B, key=abs) for x in A]
[-47, 17, -19, 32, -41]
Reduced Big-O Numpy solution
If N = len(A) and M = len(B), then this solution should be O((N + M) log M): sorting B costs O(M log M) and the searchsorted lookups cost O(N log M).
If B is already sorted, the sorting step is unnecessary and this becomes O(N log M).
C = np.sort(B)
a = C.searchsorted(A)
# It is possible that `a` has a value equal to the length of `C`,
# in which case the searched value exceeds all those found in `C`.
# In that case, we want to clip the index value to the rightmost index,
# which is `len(C) - 1`.
right = np.minimum(a, len(C) - 1)
# For those searched values that are less than the first value in `C`
# the index value will be `0`. When I subtract `1`, we'll end up with
# `-1` which makes no sense. So we clip it to `0`.
left = np.maximum(a - 1, 0)
For clipped values, we end up comparing a value against the same element of C twice, so the result is still correct.
right_diff = A - C[right]
left_diff = A - C[left]
np.where(np.abs(right_diff) <= np.abs(left_diff), right_diff, left_diff)
array([-47, 17, -19, 32, -41])
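For convenience, here is the same searchsorted logic gathered into one helper (the function name is mine); unlike the broadcasting approach, memory use stays at O(N + M):
import numpy as np

def signed_nearest_diff(A, B):
    # Same logic as above; memory stays at O(N + M) instead of O(N * M).
    C = np.sort(B)
    a = C.searchsorted(A)
    right = np.minimum(a, len(C) - 1)   # clip indices that fell past the end of C
    left = np.maximum(a - 1, 0)         # clip indices that fell before the start of C
    right_diff = A - C[right]
    left_diff = A - C[left]
    return np.where(np.abs(right_diff) <= np.abs(left_diff), right_diff, left_diff)

A = np.array([58, 22, 86, 37, 64])
B = np.array([105, 212, 5, 311, 253, 419, 123, 461, 256, 464])
print(signed_nearest_diff(A, B))   # [-47  17 -19  32 -41]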
Since you tagged pandas:
# compute the diff by broadcasting
diff = pd.DataFrame(A[None,:] - B[:,None])
# minimum absolute value per column (one per element of A)
min_val = diff.abs().min()
# mask with where and stack to drop na
diff.where(diff.abs().eq(min_val)).stack()
Output:
0 0 -47.0
2 -19.0
4 -41.0
2 1 17.0
3 32.0
dtype: float64
np.argmin can find the position of the minimum value. Therefore you can simply do this:
pd.Series(A).apply(lambda x: x-B[np.argmin(np.abs(x-B))])
Related
I have an array of subarrays as follows:
[
[...]
[...]
⋮
[...]
]
The lengths of each subarray are the same.
I need to bin each subarray and calculate the mean, standard deviation, median and other percentiles for each bin. I need separate results for binning by fixed width and by fixed frequency.
The method should be vectorized, i.e. no for loops (or at least as few as possible, and none that are too costly; of course, separate methodologies for each binning technique are required). I don't know if this is even possible in a reasonably understandable manner (understandable for me, as I am quite the noob, but if it works I'll do my best). For the fixed-width binning method you may assume, for ease, that we are binning by the data range of the first subarray.
How should I proceed?
Possibilities:
For fixed-frequency binning, the steps I had in mind were: do a single np.array_split by specifying the right axis argument, then fill the bins that are one element shorter with NaN using np.pad; now that the subarrays are no longer ragged, we can hopefully apply np.nanmedian with whatever axis designation worked for np.array_split. However, I don't know whether a suitable axis can be specified for the splitting and median operations, and I have seen that there is no way to avoid iterating through each of the bins (not just each of the rows) to pad the shorter ragged sequences with the extra NaN. Even if these iterations don't prove too costly and everything else works fine, I wouldn't know how to actually implement any step of this process (a rough sketch of this idea appears after the example output below). Nor do I know where to even begin for fixed-width binning.
Here is a vectorized solution that accomplishes what I want for only the mean for only a single array; I would certainly like to avoid iterating over each one of my subarrays and also do not understand the method enough to extend it to calculating the standard deviation, medians or any other percentiles.
If your suggested approach is through the pandas library e.g. using cut or qcut, is there a way this could be done without using for loops?
This is all very much related to my earlier question.
As I am new to this platform I'm not sure what the best practices are, I would ideally not like to delete that post since it serves to cast a wider net to solve my problem, whilst this post pursues a slightly more specific avenue described in that. I also would not want someone who has worked on an answer to that post to find it deleted. But, if it is quite clear that I should delete the earlier post do let me know.
EDIT: example with expected output, assume all objects are numpy arrays not lists
Example array:
[
[0, 1, 2, 3, 4, 5, 6],
[90, 45, 9, 88, 21, 59, 32],
⋮
]
Example binned with a fixed frequency of 3 objects per bin:
[
[[0, 1, 2], [3, 4], [5, 6]],
[[90, 45, 9], [88, 21], [59, 32]],
⋮
]
The above intermediate step need not be explicitly returned at any point but illustrates what will be occurring behind the scenes.
Medians of the fixed-frequency binned example:
[
[1, 3.5, 5.5],
[45, 54.5, 45.5],
⋮
]
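For reference, a rough sketch of the np.array_split / NaN-padding idea mentioned under Possibilities; it reproduces the medians above (variable names are illustrative, and one short loop over the bins remains):
import numpy as np

data = np.array([[0, 1, 2, 3, 4, 5, 6],
                 [90, 45, 9, 88, 21, 59, 32]])
n_bins = 3

# Split the column indices into n_bins nearly equal groups, then pad the
# shorter groups with NaN so the result is a rectangular (rows, bins, width) block.
splits = np.array_split(np.arange(data.shape[1]), n_bins)
width = max(len(s) for s in splits)
binned = np.full((data.shape[0], n_bins, width), np.nan)
for b, cols in enumerate(splits):          # loops over bins, not rows
    binned[:, b, :len(cols)] = data[:, cols]

print(np.nanmedian(binned, axis=2))
# [[ 1.   3.5  5.5]
#  [45.  54.5 45.5]]
np.nanmean, np.nanstd and np.nanpercentile can be applied to the same padded block along axis=2.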
Edit 2: extended question, using #hilberts_drinking_problem's answer as the accepted solution to the original problem
If x = [0, 1, 2, 3, 4, 5, 6] and y = [90, 45, 9, 88, 21, 59, 32], then you have already calculated everything I want (except the percentiles) for the data sorted by x. Suppose I also want the same statistics for the data sorted by y, with a multi-index such that df_2's row indices print as follows:
# x_srtd x
# y
# y_srtd x
# y
How would I get this (including re-sorting x and y by y) without for loops? (In case it matters, note I plan on transposing the entire df_2 using .T at the end, for readability, so that 'x_srtd', 'y_srtd', 'x' and 'y' become column headers.)
Also which of the methods in Pass percentiles to pandas agg function would you recommend?
Almost forgot: any ideas on how I would approach fixed-width binning, keeping in mind that x-sorted binning will differ from y-sorted binning? As examples, take bin_width_x = 1.5 for binning by x and, similarly, bin_width_y = 25.
You could split the columns of your DataFrame into a MultiIndex so that the zeroth level of the multiindex represents a group of columns you wish to aggregate. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame([
    [0, 1, 2, 3, 4, 5, 6],
    [90, 45, 9, 88, 21, 59, 32],
])
df.columns = pd.MultiIndex.from_tuples(
    [(i, c) for i, gp in enumerate(np.array_split(df.columns, 3)) for c in gp]
)
# print(df)
# 0 1 2
# 0 1 2 3 4 5 6
# 0 0 1 2 3 4 5 6
# 1 90 45 9 88 21 59 32
print(df.groupby(axis=1, level=0).agg("mean"))
# 0 1 2
# 0 1.0 3.5 5.5
# 1 48.0 54.5 45.5
# the following raises not implemented error on Pandas version 1.1.5
# print(df.groupby(axis=1, level=0).agg(["mean", "std"]))
# as a workaround:
operations = ["mean", "std", "median"]
df2 = pd.concat((
    df.groupby(axis=1, level=0).agg(operation)
    for operation in operations
), axis=1)
df2.columns = pd.MultiIndex.from_product([
    operations, np.unique(df.columns.get_level_values(0))])
print(df2)
# mean std median
# 0 1 2 0 1 2 0 1 2
# 0 1.0 3.5 5.5 1.000000 0.707107 0.707107 1.0 3.5 5.5
# 1 48.0 54.5 45.5 40.583248 47.376154 19.091883 45.0 54.5 45.5
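The question also asks about percentiles, which the string-based aggregations above do not cover. One possible workaround (a sketch; it assumes the default linear interpolation of quantile) is to transpose so the groups sit on the row index and use GroupBy.quantile:
# 75th percentile of each group of columns, per original row
q75 = df.T.groupby(level=0).quantile(0.75).T
print(q75)
# roughly:
#       0      1      2
# 0   1.5   3.75   5.75
# 1  67.5  71.25  52.25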
I have a dataframe of 10,000 rows, and I am trying to sum all possible combinations of those rows. According to my math, that's about 50 million combinations. I'll give a small example to simplify what my data looks like:
df = Ratio Count Score
1 6 11
2 7 12
3 8 13
4 9 14
5 10 15
And here's the desired result:
results = Min Ratio Max Ratio Total Count Total Score
1 2 13 23
1 3 21 36
1 4 30 50
1 5 40 65
2 3 15 25
2 4 24 39
2 5 34 54
3 4 17 27
3 5 27 42
4 5 19 29
This is the code that I came up with to complete the calculation:
results = pd.DataFrame(columns=['Min Ratio', 'Max Ratio', 'Total Count', 'Total Score'])
for i in range(len(df)):
    j = i + 1
    while j <= len(df):
        range_to_calc = df.iloc[i:j]
        total_count = range_to_calc['Count'].sum()
        total_score = range_to_calc['Score'].sum()
        new_row = {'Min Ratio': range_to_calc.at[range_to_calc.first_valid_index(), 'Ratio'],
                   'Max Ratio': range_to_calc.at[range_to_calc.last_valid_index(), 'Ratio'],
                   'Total Count': total_count,
                   'Total Score': total_score}
        results = results.append(new_row, ignore_index=True)
        j = j + 1
This code works, but according to my estimates after running it for a few minutes, it would take 200 hours to complete. I understand that using numpy would be a lot faster, but I can't wrap my head around how to build multiple arrays to add together. (I think it would be easy if I was doing just 1+2, 2+3, 3+4, etc., but it's a lot harder because I need 1+2, 1+2+3, 1+2+3+4, etc.) Is there a more efficient way to complete this calculation so it can run in a reasonable amount of time? Thank you!
P.S.: If you're wondering what I want to do with a 50 million-row dataframe, I don't actually need that in my final results. I'm ultimately looking to divide the Total Score of each row in the results by its Total Count to get a Total Score Per Total Count value, and then display the 1,000 highest Total Scores Per Total Count, along with each associated Min Ratio, Max Ratio, Total Count, and Total Score.
After these improvements it takes ~2 minutes to run for 10k rows.
For the sum computation, you can pre-compute the cumulative sum (cumsum) and save it. sum(i to j) is equal to sum(0 to j) - sum(0 to i-1).
Now sum(0 to j) is cumsum[j] and sum(0 to i - 1) is cumsum[i-1].
So sum(i to j) = cumsum[j] - cumsum[i - 1].
This gives significant improvement over computing sum each time for different combination.
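As a tiny worked check of that identity, using the Count column from the question:
import numpy as np
count = np.array([6, 7, 8, 9, 10])
count_cumsum = np.cumsum(count)            # [ 6 13 21 30 40]
# sum of rows 1..3 (7 + 8 + 9 = 24), recovered from the cumulative sums:
print(count_cumsum[3] - count_cumsum[0])   # 24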
Operations on NumPy arrays are faster than operations on Pandas Series, so convert every column to a NumPy array and do the computation on that.
(From other answers): instead of appending to a list, initialise an empty NumPy array of shape ((n*(n+1)//2) - n, 4) and use it to store the results.
Use:
count_cumsum = np.cumsum(df.Count.values)
score_cumsum = np.cumsum(df.Score.values)
ratios = df.Ratio.values
n = len(df)
rowInCombination = (n * (n + 1) // 2) - n
arr = np.empty(shape = (rowInCombination, 4), dtype = int)
k = 0
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        arr[k, :] = [
            ratios[i],
            ratios[j],
            count_cumsum[j] - count_cumsum[i - 1] if i > 0 else count_cumsum[j],
            score_cumsum[j] - score_cumsum[i - 1] if i > 0 else score_cumsum[j],
        ]
        k = k + 1

out = pd.DataFrame(arr, columns=['Min_Ratio', 'Max_Ratio',
                                 'Total_Count', 'Total_Score'])
Input:
df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
'Count': [6, 7, 8, 9, 10],
'Score': [11, 12, 13, 14, 15]})
Output:
>>>out
Min_Ratio Max_Ratio Total_Count Total_Score
0 1 2 13 23
1 1 3 21 36
2 1 4 30 50
3 1 5 40 65
4 2 3 15 25
5 2 4 24 39
6 2 5 34 54
7 3 4 17 27
8 3 5 27 42
9 4 5 19 29
First of all, you can improve the algorithm. Then, you can speed up the computation using Numpy vectorization/broadcasting.
Here are the interesting points for improving the performance of the algorithm:
Pandas' append is slow because it recreates a new dataframe. You should never use it in a costly loop. Instead, you can append the lines to a Python list or even write the items directly into a pre-allocated NumPy array.
Computing the partial sums takes O(n) time, whereas you can pre-compute the cumulative sums and then obtain any partial sum in constant time.
CPython loops are very slow, but the inner loop can be vectorized using Numpy thanks to broadcasting.
Here is the resulting code:
import numpy as np
import pandas as pd
def fastImpl(df):
    n = len(df)
    resRowCount = (n * (n+1)) // 2
    k = 0
    cumCounts = np.concatenate(([0], df['Count'].astype(int).cumsum()))
    cumScores = np.concatenate(([0], df['Score'].astype(int).cumsum()))
    ratios = df['Ratio'].astype(int)
    minRatio = np.empty(resRowCount, dtype=int)
    maxRatio = np.empty(resRowCount, dtype=int)
    count = np.empty(resRowCount, dtype=int)
    score = np.empty(resRowCount, dtype=int)

    for i in range(n):
        kStart, kEnd = k, k+(n-i)
        jStart, jEnd = i+1, n+1
        minRatio[kStart:kEnd] = ratios[i]
        maxRatio[kStart:kEnd] = ratios[i:n]
        count[kStart:kEnd] = cumCounts[jStart:jEnd] - cumCounts[i]
        score[kStart:kEnd] = cumScores[jStart:jEnd] - cumScores[i]
        k = kEnd

    assert k == resRowCount

    return pd.DataFrame({
        'Min Ratio': minRatio,
        'Max Ratio': maxRatio,
        'Total Count': count,
        'Total Score': score
    })
Note that this code gives the same results as the code in your question, although the original code does not give the expected results stated in the question. Note also that since the inputs are integers, I forced NumPy to use integers for the sake of performance (although the algorithm should work with floats too).
This code is hundreds of thousands of times faster than the original code on big dataframes, and it computes the result for a 10,000-row dataframe in 0.7 seconds.
Others have explained why your algorithm was so slow, so I won't go over that again.
Let's take a different approach to your problem. In particular, look at how the Total Count and Total Score columns are calculated:
Calculate the cumulative sum for every row from 1 to n
Calculate the cumulative sum for every row from 2 to n
...
Calculate the cumulative sum for every row from n to n
Since cumulative sums are accumulative, we only need to calculate it once for row 1 to row n:
The cumsum of (2 to n) is the cumsum of (1 to n) - (row 1)
The cumsum of (3 to n) is the cumsum of (2 to n) - (row 2)
And so on...
In other words, the current cumsum is the previous cumsum minus its first row, then dropping the first row.
As you have theorized, pandas is a lot slower than numpy, so we will convert everything into numpy for speed:
arr = df[['Ratio', 'Count', 'Score']].to_numpy()   # convert to a numpy array
tmp = np.cumsum(arr[:, 1:3], axis=0)               # cumsum of Count and Score for rows 1 to n
tmp = np.insert(tmp, 0, arr[0, 0], axis=1)         # create the Min Ratio column
tmp = np.insert(tmp, 1, arr[:, 0], axis=1)         # create the Max Ratio column

results2 = [tmp]
for i in range(1, len(arr)):
    prev = results2[-1]
    # current cumsum = previous cumsum without its first row, minus that first row;
    # the subtraction creates a copy, so the block already stored in results2 is untouched
    tmp = prev[1:] - prev[0]
    tmp[:, 0] = arr[i, 0]    # new Min Ratio
    tmp[:, 1] = arr[i:, 0]   # new Max Ratio
    results2.append(tmp)

# Assemble the result
results2 = np.concatenate(results2)
results2 = pd.DataFrame(results2, columns=['Min Ratio', 'Max Ratio', 'Total Count', 'Total Score'])
During my test, this produces the results for a 10k row data frame in about 2 seconds.
Sorry to write so late on this topic, but I was just looking for a solution to a similar problem. The solution for this issue is simple because the combinations are only in pairs. It can be solved by uploading the dataframe to any database and executing the following query, which runs in under 10 seconds:
SELECT f1.*, f2.*, f1.score + f2.score
FROM table_with_data_source f1, table_with_data_source f2
WHERE f1.ratio <> f2.ratio;
The database will do it very fast even if there are 100,000 records or more.
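For reference, a rough pandas equivalent of that pairwise cross join (a sketch; merge(how='cross') requires pandas 1.2+, and the column names follow the example df above):
import pandas as pd

df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})

# Cross join the frame with itself, drop self-pairs, and sum the scores.
pairs = df.merge(df, how='cross', suffixes=('_1', '_2'))
pairs = pairs[pairs['Ratio_1'] != pairs['Ratio_2']].copy()
pairs['pair_score'] = pairs['Score_1'] + pairs['Score_2']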
However, none of the algorithms that I saw in the answers actually performs a true combinatorial of values; they only do it in pairs. The problem really gets complicated when it's a true combinatorial, for example:
Given: a, b, c, d and e as records:
a
b
c
d
e
The real combination would be:
a+b
a+c
a+d
a+e
a+b+c
a+b+d
a+b+e
a+c+d
a+c+e
a+d+e
a+b+c+d
a+b+c+e
a+b+d+e
a+c+d+e
a+b+c+d+e
b+c
b+d
b+e
b+c+d
b+c+e
b+d+e
b+c+d+e
c+d
c+e
c+d+e
d+e
This is a true combinatorial, which covers all possible combinations of two or more records. For this case I have not been able to find a suitable solution, since it really strains any hardware. Does anyone have an idea how to perform a true combinatorial using Python? At the database level it affects the general performance of the database.
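For what it's worth, enumerating such a true combinatorial in Python is straightforward with itertools.combinations; the catch is that the number of subsets of size two or more grows as 2**n - n - 1, so it is only feasible for small n. A sketch with hypothetical record values:
from itertools import combinations

records = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}   # hypothetical values

sums = {}
for size in range(2, len(records) + 1):              # subsets of size 2 .. n
    for combo in combinations(records, size):
        sums['+'.join(combo)] = sum(records[k] for k in combo)

print(len(sums))        # 26 subsets of size >= 2 for 5 records (2**5 - 5 - 1)
print(sums['a+b+c'])    # 6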
After some help in the forum I managed to do what I was looking for, and now I need to get to the next level (the long explanation is here: Python Data Frame: cumulative sum of column until condition is reached and return the index).
I have a data frame:
In [3]: df
Out[3]:
index Num_Albums Num_authors
0 0 10 4
1 1 1 5
2 2 4 4
3 3 7 1000
4 4 1 44
5 5 3 8
I add a column with the cumulative sum of another column.
In [4]: df['cumsum'] = df['Num_Albums'].cumsum()
In [5]: df
Out[5]:
index Num_Albums Num_authors cumsum
0 0 10 4 10
1 1 1 5 11
2 2 4 4 15
3 3 7 1000 22
4 4 1 44 23
5 5 3 8 26
Then I apply a condition to the cumsum column and extract the values of the row where the condition is met, with a given tolerance:
In [18]: tol = 2
In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()
In [20]: cond
Out[20]:
index Num_Albums Num_authors cumsum
2 2.0 4.0 4.0 15.0
Now, what I want to do is substitute the condition 15 in the example with conditions stored in an array, check when each condition is met, and retrieve not the entire row but only the value of the column Num_Albums. Finally, all the retrieved values (one per condition) are stored in an array or list.
Coming from matlab, I would do something like this (I apologize for this mixed matlab/python syntax):
conditions = np.array([10, 15, 23])
for i=0:len(conditions)
retrieved_values(i) = df.where((df['cumsum']>=conditions(i)-tol)&(df['cumsum']<=conditions(i)+tol)).dropna()
So for the data frame above I would get (for tol=0):
retrieved_values = [10, 4, 1]
I would like a solution that lets me keep the .where function, if possible.
A quick way to do this would be to leverage NumPy's broadcasting, as an extension of this answer from the same linked post, although an answer related to the use of DF.where was actually requested.
Broadcasting eliminates the need to iterate through every element of the array and it's highly efficient at the same time.
The only addition to this post is the use of np.argmax to grab the indices of the first True instance along each column (traversing ↓ direction).
conditions = np.array([10, 15, 23])
tol = 0
num_albums = df.Num_Albums.values
num_albums_cumsum = df.Num_Albums.cumsum().values
slices = np.argmax(np.isclose(num_albums_cumsum[:, None], conditions, atol=tol), axis=0)
Retrieved slices:
slices
Out[692]:
array([0, 2, 4], dtype=int64)
Corresponding array produced:
num_albums[slices]
Out[693]:
array([10, 4, 1], dtype=int64)
If you still prefer using DF.where, here is another solution using list-comprehension -
[df.where((df['cumsum'] >= cond - tol) & (df['cumsum'] <= cond + tol), -1)['Num_Albums']
.max() for cond in conditions]
Out[695]:
[10, 4, 1]
The conditions not fulfilling the given criteria would be replaced by -1. Doing this way preserves the dtype at the end.
Well, the output won't always be one number, right?
In case the output is exactly one number, you can write this code:
tol = 0
# conditions
c = [5, 15, 25]
value = []
for i in c:
    if len(df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a']) > 0:
        value = value + [df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a'].values[0]]
    else:
        value = value + [[]]
print(value)
The output should look like:
[1, 2, 3]
In case the output can be multiple numbers and you want it like this:
[[1.0, 5.0], [12.0, 15.0], [25.0]]
you can use this code:
tol = 5
c = [5, 15, 25]
value = []
for i in c:
    getdatas = df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a'].values
    value.append([x for x in getdatas])
print(value)
(update: added desired data frame)
Let me start by saying that I'm reasonably confident that I found a solution to this problem several years ago, but I have not been able to re-find that solution.
Questions that address similar problems, but don't solve my particular problem include:
Efficiently select rows that match one of several values in Pandas DataFrame
Efficiently adding calculated rows based on index values to a pandas DataFrame
Compare Python Pandas DataFrames for matching rows
The Question
Let's say I have a data frame with many columns that I am working on:
big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
big
match_1
0 11
1 12
2 51
3 52
I also have smaller data frame that, in theory, maps some conditional statement to a desired value:
# A smaller dataframe that we use to map values into the larger dataframe
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})
small
is_even score
0 True 10
1 False 200
The goal here would be to use a conditional statement to match each row in big to a single row in small. Assume that small is constructed such that there will always be one, and only one, match for each row in big. (If there happen to be multiple rows in small that match, just pick the first one.)
The desired output would be something like:
desired = pd.DataFrame({'match_1': [11, 12, 51, 52], 'metric': [200, 10, 200, 10]})
desired
match_1 metric
0 11 200
1 12 10
2 51 200
3 52 10
I'm pretty sure that the syntax would look similar to:
big['score'] = small.loc[small['is_even'] == ((big['match_1'] % 2) == 0), 'score']
This won't work because small['is_even'] is a Series of length 2, while ((big['match_1'] % 2) == 0) is a Series of length 4. What I'm looking to do is, for each row in big, find the one row in small that matches based on a conditional.
If I can get a sequence that contains the correct row in small that matches each row in big, then I could do something like:
big['score'] = small.loc[matching_rows, 'score']
The question I have is: how do I generate the sequence matching_rows?
Things that (I think) aren't quite what I want:
If the columns in big and small were to match simply on constant values, this would be a straightforward use of either big.merge() or big.groupby(); however, in my case, the mapping can be an arbitrarily complex boolean conditional, for example:
(big['val1'] > small['threshold']) & (big['val2'] == small['val2']) & (big['val3'] > small['min_val']) & (big['val3'] < small['max_val'])
Solutions that rely on isin(), any(), etc, don't work, because the conditional check can be arbitrarily complex.
I could certainly create a function to apply() to the bigger DataFrame, but again, I'm pretty sure there was a simpler solution.
The answer may come down to 'calculate some intermediate columns until you can do a simple merge' or 'just use apply()', but I could swear there was a way to do what I've described above.
One approach is to use a merge in which the left_on argument is not a column, but a vector of keys. It's made simpler by setting the index of small to be is_even:
>>> small.set_index('is_even', inplace=True)
>>> condition = big['match_1'] % 2 == 0
>>> pd.merge(big, small, left_on=condition, right_index=True, how='left')
match_1 score
0 11 200
1 12 10
2 51 200
3 52 10
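For the more general case raised in the question (an arbitrarily complex conditional rather than a simple even/odd key), one brute-force pattern is a cross join followed by filtering, keeping the first matching small row per big row. This is only a sketch with hypothetical columns; merge(how='cross') needs pandas 1.2+ and the intermediate frame has len(big) * len(small) rows:
import pandas as pd

# Hypothetical frames; the conditional below is purely illustrative.
big = pd.DataFrame({'val1': [5, 1, 9], 'val2': ['x', 'y', 'x']})
small = pd.DataFrame({'threshold': [3, 0], 'val2': ['x', 'y'], 'score': [10, 200]})

# Cross join every big row with every small row, keep the rows that satisfy
# the conditional, then take the first matching small row for each big row.
cross = big.reset_index().merge(small, how='cross', suffixes=('', '_small'))
mask = (cross['val1'] > cross['threshold']) & (cross['val2'] == cross['val2_small'])
big['score'] = cross[mask].groupby('index').first()['score']
#    val1 val2  score
# 0     5    x     10
# 1     1    y    200
# 2     9    x     10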
You can index small with True and False and just do a straight .ix lookup on it (note that .ix has since been removed in newer pandas versions). Not sure it's all that much tidier than the intermediate column/merge:
In [127]: big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
In [128]: small = pd.DataFrame({'score': [10, 200]}, index=[True, False])
In [129]: big['score'] = small.ix[pd.Index(list(big.match_1 % 2 == 0))].score.values
In [130]: big
Out[130]:
match_1 score
0 11 200
1 12 10
2 51 200
3 52 10
Problem:
What I'd like to do is reduce each value in a Series, step by step, by a continuously decreasing base figure.
I'm not sure of the terminology for this - I did think I could do something with cumsum and diff but I think I'm leading myself on a wild goose chase there...
Starting code:
import pandas as pd
ALLOWANCE = 100
values = pd.Series([85, 10, 25, 30])
Desired output:
desired = pd.Series([0, 0, 20, 30])
Rationale:
Starting with a base of ALLOWANCE - each value in the Series is reduced by the amount remaining, as is the allowance itself, so the following steps occur:
Start with 100, we can completely remove 85 so it becomes 0, we now have 15 left as ALLOWANCE
The next value is 10 and we still have 15 available, so this becomes 0 again and we have 5 left.
The next value is 25 - we only have 5 left, so this becomes 20 and now we have no further allowance.
The next value is 30, and since there's no allowance, the value remains as 30.
Following your initial idea of cumsum and diff, you could write:
>>> (values.cumsum() - ALLOWANCE).clip_lower(0).diff().fillna(0)
0 0
1 0
2 20
3 30
dtype: float64
This is the cumulative sum of values minus the allowance. Negative values are clipped to zeros (since we don't care about numbers until we have overdrawn our allowance). From there, you can calculate the difference.
However, if the first value might be greater than the allowance, the following two-line variation is preferred:
s = (values.cumsum() - ALLOWANCE).clip_lower(0)
desired = s.diff().fillna(s)
This fills the first NaN value with the "first value - allowance" value. So in the case where ALLOWANCE is lowered to 75, it returns desired as Series([10, 10, 25, 30]).
Your idea with cumsum and diff works. It doesn't look too complicated; not sure if there's an even shorter solution. First, we compute the cumulative sum, operate on that, and then go back (diff is kinda sorta the inverse function of cumsum).
c = values.cumsum() - ALLOWANCE
# now we've got [-15, -5, 20, 50]
c[c < 0] = 0  # negative values don't make sense here
# (c - c.shift(1))  # <-- what I had first: re-implementing diff by accident
# It is important that we don't fill with 0, in case the first
# value is greater than ALLOWANCE:
c.diff().fillna(max(0, values[0] - ALLOWANCE))
This is probably not that performant, but here is a Pandas way of doing it using rolling_apply:
In [53]:
ALLOWANCE = 100
def reduce(x):
    global ALLOWANCE
    # short circuit if we've already reached 0
    if ALLOWANCE == 0:
        return x
    val = max(0, x - ALLOWANCE)
    ALLOWANCE = max(0, ALLOWANCE - x)
    return val
pd.rolling_apply(values, window=1, func=reduce)
Out[53]:
0 0
1 0
2 20
3 30
dtype: float64
Or more simply:
In [58]:
values.apply(reduce)
Out[58]:
0 0
1 0
2 20
3 30
dtype: int64
It should work with a while loop:
ii = 0
while ALLOWANCE > 0 and ii < len(values):
    if ALLOWANCE > values[ii]:
        ALLOWANCE -= values[ii]
        values[ii] = 0
    else:
        values[ii] -= ALLOWANCE
        ALLOWANCE = 0
    ii += 1