Problem:
What I'd like to do is step-by-step reduce a value in a Series by a continuously decreasing base figure.
I'm not sure of the terminology for this - I did think I could do something with cumsum and diff but I think I'm leading myself on a wild goose chase there...
Starting code:
import pandas as pd
ALLOWANCE = 100
values = pd.Series([85, 10, 25, 30])
Desired output:
desired = pd.Series([0, 0, 20, 30])
Rationale:
Starting with a base of ALLOWANCE - each value in the Series is reduced by the amount remaining, as is the allowance itself, so the following steps occur:
Start with 100, we can completely remove 85 so it becomes 0, we now have 15 left as ALLOWANCE
The next value is 10 and we still have 15 available, so this becomes 0 again and we have 5 left.
The next value is 25 - we only have 5 left, so this becomes 20 and now we have no further allowance.
The next value is 30, and since there's no allowance, the value remains as 30.
Following your initial idea of cumsum and diff, you could write:
>>> (values.cumsum() - ALLOWANCE).clip(lower=0).diff().fillna(0)
0 0
1 0
2 20
3 30
dtype: float64
This is the cumulative sum of values minus the allowance. Negative values are clipped to zeros (since we don't care about numbers until we have overdrawn our allowance). From there, you can calculate the difference.
However, if the first value might be greater than the allowance, the following two-line variation is preferred:
s = (values.cumsum() - ALLOWANCE).clip(lower=0)
desired = s.diff().fillna(s)
This fills the first NaN value with the "first value - allowance" value. So in the case where ALLOWANCE is lowered to 75, it returns desired as Series([10, 10, 25, 30]).
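For example, repeating this with the lower allowance mentioned above:
>>> ALLOWANCE = 75
>>> s = (values.cumsum() - ALLOWANCE).clip(lower=0)
>>> s.diff().fillna(s)
0    10.0
1    10.0
2    25.0
3    30.0
dtype: float64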
Your idea with cumsum and diff works. It doesn't look too complicated; not sure if there's an even shorter solution. First, we compute the cumulative sum, operate on that, and then go back (diff is kinda sorta the inverse function of cumsum).
c = values.cumsum() - ALLOWANCE
# now we've got [-15, -5, 20, 50]
c[c < 0] = 0  # negative values don't make sense here
# (c - c.shift(1))  # <-- what I had first: diff, by accident
# it is important that we don't fill with 0, in case the first
# value is greater than ALLOWANCE
c.diff().fillna(max(0, values[0] - ALLOWANCE))
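On the sample data this fills the leading NaN with max(0, 85 - 100) = 0, giving [0, 0, 20, 30] as desired.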
This is probably not very performant, but at the moment this is a pandas way of doing it, using rolling_apply:
In [53]:
ALLOWANCE = 100
def reduce(x):
    global ALLOWANCE
    # short circuit if we've already reached 0
    if ALLOWANCE == 0:
        return x
    val = max(0, x - ALLOWANCE)
    ALLOWANCE = max(0, ALLOWANCE - x)
    return val
pd.rolling_apply(values, window=1, func=reduce)
Out[53]:
0 0
1 0
2 20
3 30
dtype: float64
Or more simply:
In [58]:
values.apply(reduce)
Out[58]:
0 0
1 0
2 20
3 30
dtype: int64
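Note that pd.rolling_apply has since been removed from pandas. On a recent version, a sketch of the same rolling idea via the Series.rolling API (the global must be reset first, since reduce mutates it):
ALLOWANCE = 100  # reset, since reduce() consumed it above
values.rolling(window=1).apply(lambda w: reduce(w[0]), raw=True)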
It should work with a while loop:
ii = 0
while ALLOWANCE > 0 and ii < len(values):
    if ALLOWANCE > values[ii]:
        ALLOWANCE -= values[ii]
        values[ii] = 0
    else:
        values[ii] -= ALLOWANCE
        ALLOWANCE = 0
    ii += 1
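Note that this variant mutates values in place (and consumes ALLOWANCE) rather than producing a new Series, so keep a copy if the original values are still needed.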
Our problem statement looks like:
np.random.seed(0)
df = pd.DataFrame({'Close': np.random.uniform(0, 100, size=10)})
This is sample data; the actual data is a company's stock prices.
Close change
0 54.881350 NaN
1 71.518937 16.637586
2 60.276338 -11.242599
3 54.488318 -5.788019
4 42.365480 -12.122838
We have assigned a threshold with a range (0-1).
First, the difference in change between index 1 and index 2 is compared with the threshold value:
if the result is positive and greater than the threshold, then assign 1
if the result is negative and less than the threshold, then assign -1
if the result is within the range of the threshold, then assign 0
The same is done for index 2 and index 3, and then for index 3 and index 4.
Now, say the results are as follows; the final result is decided by majority voting:
index 1&2 index 2&3 index 3&4 Majority of voting
1 0 1 1
Exception
if results are 1, 0, -1 then the result would be 0
Now, the final result by majority of voting will be assigned to a new column at index 0, and so on.
EXPECTED RESULT (example)
Close change Result
0 54.881350 NaN 0
1 71.518937 16.637586 1
2 60.276338 -11.242599 -1
3 54.488318 -5.788019 1
4 42.365480 -12.122838 -1
I tried a few times, but couldn't figure out how to implement this.
np.select is what you are looking for:
lbound, ubound = 0, 1
change = df["Close"].diff()
df["Change"] = change
df["Result"] = np.select(
[
# The exceptions. Floats are not exact. If they are "close enough" to
# 1, we consider them to be equal to 1, etc.
np.isclose(change, 1) | np.isclose(change, 0) | np.isclose(change, -1),
# The other conditions
(change > 0) & (change > ubound),
(change < 0) & (change < lbound),
change.between(lbound, ubound),
],
[0, 1, -1, 0],
)
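Note that np.select falls back to its default value of 0 when no condition matches; since every comparison against NaN is False, this is also what puts a 0 in the first row, where change is NaN.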
I have a dataframe of 10,000 rows, and I am trying to sum all possible combinations of those rows. According to my math, that's about 50 million combinations. I'll give a small example to simplify what my data looks like:
df = Ratio Count Score
1 6 11
2 7 12
3 8 13
4 9 14
5 10 15
And here's the desired result:
results = Min Ratio Max Ratio Total Count Total Score
1 2 13 23
1 3 21 36
1 4 30 50
1 5 40 65
2 3 15 25
2 4 24 39
2 5 34 54
3 4 17 27
3 5 27 42
4 5 19 29
This is the code that I came up with to complete the calculation:
for i in range(len(df)):
    j = i + 1
    while j <= len(df):
        range_to_calc = df.iloc[i:j]
        total_count = range_to_calc['Count'].sum()
        total_score = range_to_calc['Score'].sum()
        new_row = {'Min Ratio': range_to_calc.at[range_to_calc.first_valid_index(), 'Ratio'],
                   'Max Ratio': range_to_calc.at[range_to_calc.last_valid_index(), 'Ratio'],
                   'Total Count': total_count,
                   'Total Score': total_score}
        results = results.append(new_row, ignore_index=True)
        j = j + 1
This code works, but according to my estimates after running it for a few minutes, it would take 200 hours to complete. I understand that using numpy would be a lot faster, but I can't wrap my head around how to build multiple arrays to add together. (I think it would be easy if I was doing just 1+2, 2+3, 3+4, etc., but it's a lot harder because I need 1+2, 1+2+3, 1+2+3+4, etc.) Is there a more efficient way to complete this calculation so it can run in a reasonable amount of time? Thank you!
P.S.: If you're wondering what I want to do with a 50 million-row dataframe, I don't actually need that in my final results. I'm ultimately looking to divide the Total Score of each row in the results by its Total Count to get a Total Score Per Total Count value, and then display the 1,000 highest Total Scores Per Total Count, along with each associated Min Ratio, Max Ratio, Total Count, and Total Score.
After these improvements it takes ~2 minutes to run for 10k rows.
For the sum computation, you can pre-compute the cumulative sum (cumsum) and save it. sum(i to j) is equal to sum(0 to j) - sum(0 to i-1).
Now sum(0 to j) is cumsum[j] and sum(0 to i - 1) is cumsum[i-1].
So sum(i to j) = cumsum[j] - cumsum[i - 1].
This gives significant improvement over computing sum each time for different combination.
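As a quick sanity check with the sample Count column:
>>> import numpy as np
>>> count_cumsum = np.cumsum([6, 7, 8, 9, 10])  # array([ 6, 13, 21, 30, 40])
>>> int(count_cumsum[4] - count_cumsum[1])      # sum of rows 2..4 = 8 + 9 + 10
27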
Operations over numpy arrays are faster than operations on pandas Series, hence convert every column to a numpy array and then do the computation over it.
(From other answers): Instead of appending to a list, initialise an empty numpy array of size ((n*(n+1)//2) - n, 4) and use it to save the results.
Use:
count_cumsum = np.cumsum(df.Count.values)
score_cumsum = np.cumsum(df.Score.values)
ratios = df.Ratio.values
n = len(df)
rowInCombination = (n * (n + 1) // 2) - n
arr = np.empty(shape=(rowInCombination, 4), dtype=int)
k = 0
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        arr[k, :] = ([
            ratios[i],
            ratios[j],
            count_cumsum[j] - count_cumsum[i - 1] if i > 0 else count_cumsum[j],
            score_cumsum[j] - score_cumsum[i - 1] if i > 0 else score_cumsum[j]])
        k = k + 1
out = pd.DataFrame(arr, columns=['Min_Ratio', 'Max_Ratio',
                                 'Total_Count', 'Total_Score'])
Input:
df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
'Count': [6, 7, 8, 9, 10],
'Score': [11, 12, 13, 14, 15]})
Output:
>>> out
Min_Ratio Max_Ratio Total_Count Total_Score
0 1 2 13 23
1 1 3 21 36
2 1 4 30 50
3 1 5 40 65
4 2 3 15 25
5 2 4 24 39
6 2 5 34 54
7 3 4 17 27
8 3 5 27 42
9 4 5 19 29
First of all, you can improve the algorithm. Then, you can speed up the computation using Numpy vectorization/broadcasting.
Here are the interesting points for improving the performance of the algorithm:
append of Pandas is slow because it recreates a new dataframe. You should never use it in a costly loop. Instead, you can append the lines to a Python list or even directly write the items into a pre-allocated Numpy vector.
computing partial sums takes O(n) time, while you can pre-compute the cumulative sums and then find any partial sum in constant time.
CPython loops are very slow, but the inner loop can be vectorized using Numpy thanks to broadcasting.
Here is the resulting code:
import numpy as np
import pandas as pd
def fastImpl(df):
    n = len(df)
    resRowCount = (n * (n+1)) // 2
    k = 0
    cumCounts = np.concatenate(([0], df['Count'].astype(int).cumsum()))
    cumScores = np.concatenate(([0], df['Score'].astype(int).cumsum()))
    ratios = df['Ratio'].astype(int)
    minRatio = np.empty(resRowCount, dtype=int)
    maxRatio = np.empty(resRowCount, dtype=int)
    count = np.empty(resRowCount, dtype=int)
    score = np.empty(resRowCount, dtype=int)

    for i in range(n):
        kStart, kEnd = k, k + (n - i)
        jStart, jEnd = i + 1, n + 1
        minRatio[kStart:kEnd] = ratios[i]
        maxRatio[kStart:kEnd] = ratios[i:n]
        count[kStart:kEnd] = cumCounts[jStart:jEnd] - cumCounts[i]
        score[kStart:kEnd] = cumScores[jStart:jEnd] - cumScores[i]
        k = kEnd

    assert k == resRowCount

    return pd.DataFrame({
        'Min Ratio': minRatio,
        'Max Ratio': maxRatio,
        'Total Count': count,
        'Total Score': score
    })
Note that this code gives the same results as the code in your question, but the original code does not give the expected results stated in the question. Note also that since the inputs are integers, I forced Numpy to use integers for the sake of performance (although the algorithm should work with floats too).
This code is hundreds of thousands of times faster than the original code on big dataframes, and it computes a dataframe of 10,000 rows in 0.7 seconds.
Others have explained why your algorithm was so slow, so I won't dive into that.
Let's take a different approach to your problem. In particular, look at how the Total Count and Total Score columns are calculated:
Calculate the cumulative sum for every row from 1 to n
Calculate the cumulative sum for every row from 2 to n
...
Calculate the cumulative sum for every row from n to n
Since cumulative sums build on each other, we only need to calculate the full one once, for row 1 to row n:
The cumsum of (2 to n) is the cumsum of (1 to n) - (row 1)
The cumsum of (3 to n) is the cumsum of (2 to n) - (row 2)
And so on...
In other words, the current cumsum is the previous cumsum minus its first row, then dropping the first row.
As you have theorized, pandas is a lot slower than numpy, so we will convert everything into numpy for speed:
arr = df[['Ratio', 'Count', 'Score']].to_numpy()  # Convert to numpy array

tmp = np.cumsum(arr[:, 1:3], axis=0)        # calculate cumsum for rows 1 to n
tmp = np.insert(tmp, 0, arr[0, 0], axis=1)  # create the Min Ratio column
tmp = np.insert(tmp, 1, arr[:, 0], axis=1)  # create the Max Ratio column

results2 = [tmp]
for i in range(1, len(arr)):
    prev = results2[-1]
    # current cumsum is the previous cumsum without the first row; subtract
    # the previous cumsum's first row. Use `-` rather than `-=` so the
    # stored previous block is not mutated through a view.
    tmp = prev[1:] - prev[0]
    tmp[:, 0] = arr[i, 0]   # new Min Ratio
    tmp[:, 1] = arr[i:, 0]  # new Max Ratio
    results2.append(tmp)

# Assemble the result
results2 = np.concatenate(results2).reshape(-1, 4)
results2 = pd.DataFrame(results2, columns=['Min Ratio', 'Max Ratio', 'Total Count', 'Total Score'])
During my test, this produces the results for a 10k row data frame in about 2 seconds.
Sorry to write late on this topic, but I was just looking for a solution to a similar problem. The solution for this issue is simple because the combination is only in pairs. This is solved by uploading the dataframe to any DB and executing the following query, whose duration is less than 10 seconds:
SELECT f1.*, f2.*, f1.score + f2.score
FROM table_with_data_source f1, table_with_data_source f2
WHERE f1.ratio < f2.ratio;
The database will do it very fast even if there are 100,000 records or more.
However, none of the algorithms that I saw in the answers actually performs a true combinatorial of values; they only do it in pairs. The problem really gets complicated when it's a true combinatorial, for example:
Given: a, b, c, d and e as records:
a
b
c
d
e
The real combination would be:
a+b
a+c
a+d
a+e
a+b+c
a+b+d
a+b+e
a+c+d
a+c+e
a+d+e
a+b+c+d
a+b+c+e
a+b+d+e
a+c+d+e
a+b+c+d+e
b+c
b+d
b+e
b+c+d
b+c+e
b+d+e
b+c+d+e
c+d
c+e
c+d+e
d+e
This is a true combinatorial, which covers all possible combinations. For this case I have not been able to find a suitable solution, since it really taxes the performance of any hardware. Does anyone have any idea how to perform a true combinatorial using Python? At the database level it hurts the general performance of the database.
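One way to enumerate such a true combinatorial in Python is itertools.combinations over every subset size. A minimal sketch (using the sample dataframe from the earlier question; note that for n rows there are 2^n - n - 1 subsets of two or more rows, so this is only feasible for small n):
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})

rows = []
for size in range(2, len(df) + 1):             # subset sizes 2..n
    for idx in combinations(df.index, size):   # every subset of that size
        subset = df.loc[list(idx)]
        rows.append({'Ratios': tuple(subset['Ratio']),
                     'Total Count': subset['Count'].sum(),
                     'Total Score': subset['Score'].sum()})
results = pd.DataFrame(rows)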
I'm trying to avoid for loops applying a function on a per-row basis of a pandas df. I have looked at many vectorization examples but have not come across anything that will work completely. Ultimately I am trying to add an additional df column with the summation of successful conditions, with a specified value per condition, by row.
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see it working for the 25 conditions that I am checking.
A B C ... R S T
0 0.279610 0.307119 0.553411 ... 0.897890 0.757151 0.735718
1 0.718537 0.974766 0.040607 ... 0.470836 0.103732 0.322093
2 0.222187 0.130348 0.894208 ... 0.480049 0.348090 0.844101
3 0.834743 0.473529 0.031600 ... 0.049258 0.594022 0.562006
4 0.087919 0.044066 0.936441 ... 0.259909 0.979909 0.403292
[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points
points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)
df['points'] = points_list
This is obviously not efficient but I am not sure how I can vectorize my code since it requires the values per row for each column in the df to get a custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed by replacing the df.iterrows section with df.apply.
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
UPDATE2:
I updated the function as follows and substantially decreased the run time, a 10x speed increase over using df.apply with the initial function.
def point_calc(row):
    a1 = np.where(row[:, 2] >= row[:, 13], 1, 0)
    a2 = np.where(row[:, 2] < 0, -3, 0)
    a3 = np.where(row[:, 4] >= row[:, 8], 2, 0)
    # etc.
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
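For what it's worth, np.vectorize is documented as being essentially a Python for loop under the hood, so it is unlikely to improve on the column-wise np.where version above.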
You can try it the following way:
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))
It looks like this:
A B C D
0 0.724198 0.444924 0.554168 0.368286
1 0.512431 0.633557 0.571369 0.812635
2 0.680520 0.666035 0.946170 0.652588
3 0.467660 0.277428 0.964336 0.751566
4 0.762783 0.685524 0.294148 0.515455
5 0.588832 0.276401 0.336392 0.997571
6 0.652105 0.072181 0.426501 0.755760
7 0.238815 0.620558 0.309208 0.427332
8 0.740555 0.566231 0.114300 0.353880
9 0.664978 0.711948 0.929396 0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like this:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line if you want. The condition within the brackets selects the rows where the condition is true, so -= and += are only applied in those rows.
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
At the end you can extract the values of the series as a numpy array if you want (optional):
point_list = points.values
Does this solve your problem?
Consider two arrays of different length:
A = np.array([58, 22, 86, 37, 64])
B = np.array([105, 212, 5, 311, 253, 419, 123, 461, 256, 464])
For each value in A, I want to find the smallest absolute difference between values in A and B. I use Pandas because my actual arrays are subsets of Pandas dataframes but also because the apply method is a convenient (albeit slow) approach to taking the difference between two different-sized arrays:
In [22]: pd.Series(A).apply(lambda x: np.min(np.abs(x-B)))
Out[22]:
0 47
1 17
2 19
3 32
4 41
dtype: int64
BUT I also want to keep the sign, so the desired output is:
0 -47
1 17
2 -19
3 32
4 -41
dtype: int64
[update] My actual arrays A and B are approximately 5e4 and 1e6 in length, so a low-memory solution would be ideal. Also, I wish to avoid using Pandas because it is very slow on the actual arrays.
Let's use broadcasted subtraction here. We then use argmin to find the absolute minimum, then extract the values in a subsequent step.
u = A[:,None] - B
idx = np.abs(u).argmin(axis=1)
u[np.arange(len(u)), idx]
# array([-47, 17, -19, 32, -41])
This uses pure NumPy broadcasting, so it should be quite fast.
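One caveat for the array sizes mentioned in the update (~5e4 and ~1e6): the full broadcast matrix would have ~5e10 entries, far too large for memory. A minimal sketch of the same idea applied in chunks (the chunk size of 1000 is an arbitrary assumption):
import numpy as np

def signed_min_diff(A, B, chunk=1000):
    # Only ever materializes a (chunk, len(B)) difference matrix.
    out = np.empty(len(A), dtype=np.result_type(A, B))
    for start in range(0, len(A), chunk):
        u = A[start:start + chunk, None] - B   # broadcasted differences
        idx = np.abs(u).argmin(axis=1)         # column of smallest |diff| per row
        out[start:start + chunk] = u[np.arange(len(u)), idx]
    return out

# signed_min_diff(A, B) -> array([-47, 17, -19, 32, -41]) on the sample arrays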
Comprehension
I couldn't help myself. This is not what you should do! But, it is cute.
[min(x - B, key=abs) for x in A]
[-47, 17, -19, 32, -41]
Reduced Big-O Numpy solution
If N = len(A) and M = len(B), then this solution should be O((N + M) log(M)): sorting B costs O(M log M), and the N binary searches cost O(N log M).
If B is already sorted, then the sorting step is unnecessary and this becomes O(N log(M)).
C = np.sort(B)
a = C.searchsorted(A)

# It is possible that `a` has a value equal to the length of `C`,
# in which case the searched value exceeds all those found in `C`.
# In that case, we want to clip the index value to the rightmost index,
# which is `len(C) - 1`.
right = np.minimum(a, len(C) - 1)

# For those searched values that are less than the first value in `C`,
# the index value will be `0`. When I subtract `1`, we'd end up with
# `-1`, which makes no sense. So we clip it to `0`.
left = np.maximum(a - 1, 0)
For clipped values, we'll end up comparing a value to itself and therefore it is safe.
right_diff = A - C[right]
left_diff = A - C[left]

np.where(np.abs(right_diff) <= np.abs(left_diff), right_diff, left_diff)
array([-47, 17, -19, 32, -41])
Since you tagged pandas:
# compute the diff by broadcasting
diff = pd.DataFrame(A[None,:] - B[:,None])
# mininum value
min_val = diff.abs().min()
# mask with where and stack to drop na
diff.where(diff.abs().eq(min_val)).stack()
Output:
0 0 -47.0
2 -19.0
4 -41.0
2 1 17.0
3 32.0
dtype: float64
np.argmin can find the position of the minimum value. Therefore you can simply do this:
pd.Series(A).apply(lambda x: x-B[np.argmin(np.abs(x-B))])
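Like the comprehension above, this avoids materializing the full N x M matrix (each call only builds one length-M difference array), but it still pays Python-level overhead for every element of A.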
I am trying to understand how Pandas DataFrames work to copy information downward, and then reset until the next variable changes... Specifically below, how do I make Share_Amt_To_Buy reset to 0 once my Signal or Signal_Diff switches from 1 to 0?
Using .cumsum() on Share_Amt_To_Buy ends up bringing down the values and accumulating which is not exactly what I would like to do.
My goal is that when Signal changes from 0 to 1, the Share_Amt_To_Buy is calculated and copied until Signal switches back to 0. Then if Signal turns to 1 again, I want Share_Amt_To_Buy to be recalculated based on that point in time.
Hopefully this makes sense - please let me know.
Signal Signal_Diff Share_Amt_To_Buy (Correctly) Share_Amt_To_Buy (Currently)
0 0 0 0
0 0 0 0
0 0 0 0
1 1 100 100
1 0 100 100
1 0 100 100
0 -1 0 100
0 0 0 100
1 1 180 280
1 0 180 280
As you can see, my signals alternate from 0 to 1, and this means the following:
0 = no trade (or position)
1 = trade (with a position)
Signal_Diff is calculated as follows
portfolio['Signal_Diff'] = portfolio['Signal'].diff().fillna(0.0)
The column 'Share_Amt_To_Buy' is calculated when Signal changes from 0 to 1. I have used the following as an example to calculate this:
initial_cap = 100000.0
portfolio['close'] = my stock's closing prices as a float
portfolio['Share_Amt'] = np.where(portfolio['Signal'] == 1.0, np.round(initial_cap / portfolio['close'] * 0.25 * portfolio['Signal']), 0.0).cumsum()
portfolio['Share_Amt_To_Buy'] = (portfolio['Share_Amt']*portfolio['Signal'])
From what I understand, there is no built-in formula module for pandas. You can perform formulas on columns, cells, arrays and generate different arrays or values from them (df[column].count() is an example), and do plenty of work like that, but there is no method for dynamically updating the array itself based on another value in the array (like an Excel formula).
You could always do the procedure iteratively and say:
>>> for index in df.index:
>>>     if df.loc[index, 'Signal_Diff'] == 0:
>>>         df.loc[index, 'Signal_Diff'] = some_value
>>>     elif df.loc[index, 'Signal_Diff'] == 1:
>>>         df.loc[index, 'Signal_Diff'] = some_other_value
Or you could create a custom function via the map tool:
https://stackoverflow.com/a/19226745/4131059
EDIT:
Another solution would be to query for all indexes with a value of 1 in the old array and the new array upon some change to the array:
>>> df_old_list = df[df.Signal_Diff == 1].index.tolist()
>>> ...
>>> df_new_list = df[df.Signal_Diff == 1].index.tolist()
>>>
>>> for x in df_old_list:
>>>     if x in df_new_list:
>>>         df_new_list.remove(x)
Then recalculate for only the indexes in df_new_list.
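For the reset behaviour described in the question, a vectorized alternative is possible with groupby. This is a sketch under the assumption that portfolio is the dataframe above with its Signal and close columns, and it drops the cumsum that caused the accumulation: give each run of equal Signal values its own group id, compute the amount once at the start of each run, and copy it downward.
import numpy as np

# group id increments every time Signal flips, so each run of 0s or 1s
# is its own group
group_id = (portfolio['Signal'].diff().fillna(0) != 0).cumsum()

# amount as of each row, following the question's formula without cumsum
amt = np.round(initial_cap / portfolio['close'] * 0.25)

# copy the amount at the start of each run downward; zero it where Signal is 0
portfolio['Share_Amt_To_Buy'] = amt.groupby(group_id).transform('first') * portfolio['Signal']
This recalculates the amount from close at each 0-to-1 transition and holds it constant until Signal drops back to 0, matching the "Correctly" column in the question.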