I need to write code that creates a new column B holding the cumulative sum of column A.
Whenever the cumulative sum drops below 0, the value in B should be 0.
The cumulative sum then restarts from there until the next time it goes below 0.
I searched for similar answers but was not able to find one fitting my case. Thanks for your help.
  A    B
  1    1
  3    4
  5    9
  7   16
 -6   10
 -8    2
-10    0
  6    6
-15    0
 11   11
Set up a loop over A and keep a running total. If the total drops below 0, reset it to 0. Then append the total to B.
You have A = [1, 3, 5, 7, -6, -8, -10, 6, -15, 11] (column A from the question), total = 0, B = []:
total = 0
B = []
for i in range(len(A)):
    # process the running sum, clamping it at zero
    total += A[i]
    if total < 0:
        total = 0
    B.append(total)
Here is a non-pandas answer that iterates through the values in column A and creates column B by never letting the running sum go below 0.
result = []
cur_res = 0
for i in df.A:
    # clamp the running total at zero
    cur_res = max(cur_res + i, 0)
    result.append(cur_res)
df['B'] = result
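The same clamp-at-zero recurrence can also be written with itertools.accumulate (a sketch of mine, not from the answers above; note that accumulate emits the first element of A unmodified, so this assumes the first value is not negative):
from itertools import accumulate
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 7, -6, -8, -10, 6, -15, 11]})
# thread the running total through max(total + x, 0)
df['B'] = list(accumulate(df['A'], lambda total, x: max(total + x, 0)))
print(df['B'].tolist())   # [1, 4, 9, 16, 10, 2, 0, 6, 0, 11]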
I have a pandas DataFrame (100,000 rows) with 11 columns.
I'm trying to assign df['trade_sign'] values based on df['diff'] (a pd.Series of integer values):
If diff is positive, then trade_sign = 1
if diff is negative, then trade_sign = -1
if diff is 0, then trade_sign = 0
What I've tried so far:
pos['trade_sign'] = (pos['trade_sign'] > 0)
pos['trade_sign'].replace({False: -1, True: 1}, inplace=True)
But this obviously doesn't take into account 0 values.
I also tried for loops with if conditions but that didn't work.
Essentially, how do I fix my .replace function to take into account diff values of 0?
Ideally, I'd prefer a solution that uses numpy over for loops with if conditions.
There's a sign function in numpy:
df["trade_sign"] = np.sign(df["diff"])
If you want integers:
df["trade_sign"] = np.sign(df["diff"]).astype(int)
a = [0 if v == 0 else (1 if v > 0 else -1) for v in df['diff'].values]
df['trade_sign'] = a
You could do it this way:
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
The boolean results of the element-wise > and < comparisons are automatically converted to int, which allows multiplication by 1 and -1, respectively.
This sample input and test code:
import pandas as pd
pos = pd.DataFrame({'diff': [-9, 0, 9, -8, 0, 8, -18, 4, 3, 2, 0]})
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
print(pos)
... gives this output:
diff trade_sign
0 -9 -1
1 0 0
2 9 1
3 -8 -1
4 0 0
5 8 1
6 -18 -1
7 4 1
8 3 1
9 2 1
10 0 0
UPDATE: In addition to the solution above, as well as some of the other excellent ideas in other answers, you can use numpy where:
pos['trade_sign'] = np.where(pos['diff'] > 0, 1, np.where(pos['diff'] < 0, -1, 0))
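If you ever have more than two conditions, np.select generalizes this (a variant of mine, not from the answer above):
conditions = [pos['diff'] > 0, pos['diff'] < 0]
pos['trade_sign'] = np.select(conditions, [1, -1], default=0)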
Assuming a dataframe like this
In [5]: data = pd.DataFrame([[9,4],[5,4],[1,3],[26,7]])
In [6]: data
Out[6]:
0 1
0 9 4
1 5 4
2 1 3
3 26 7
I want to count how many times the values in a forward-looking window of 2 on column 0 are greater than the value in column 1.
For the first 4 in column 1, the next two values in column 0 are 5 and 1, so the output is 1, since only 5 is greater than 4. For the second 4, the next two values in column 0 are 1 and 26, so the output is again 1, because only 26 is greater than 4 but not 1. I can't use a rolling window, since iterating through rolling window values is not implemented.
I need something like a slice of the next n rows that I can iterate over, comparing and counting how many of the values in that slice are above the current row's value.
I have done this using lists instead of doing it in the data frame. Check the code below:
list1, list2 = df[0].values.tolist(), df[1].values.tolist()
outList = []
for ix in range(len(list1)):
    if ix < len(list1) - 2:
        # both of the next two values in column 0 exceed column 1
        if list2[ix] < list1[ix + 1] and list2[ix] < list1[ix + 2]:
            outList.append(2)
        # exactly one of them does
        elif list2[ix] < list1[ix + 1] or list2[ix] < list1[ix + 2]:
            outList.append(1)
        else:
            outList.append(0)
    else:
        # fewer than 2 forward rows left
        outList.append(0)
df['2_rows_forward_moving_tag'] = pd.Series(outList)
Output:
0 1 2_rows_forward_moving_tag
0 9 4 1
1 5 4 1
2 1 3 0
3 26 7 0
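For larger frames, the same logic can be vectorized with shift (a sketch of mine, assuming the fixed window of 2 used above; comparisons against the NaNs produced by shift evaluate to False):
nxt1 = (df[0].shift(-1) > df[1]).astype(int)   # next row's col 0 beats col 1?
nxt2 = (df[0].shift(-2) > df[1]).astype(int)   # the row after that beats col 1?
cnt = nxt1 + nxt2
cnt.iloc[-2:] = 0   # mirror the loop: fewer than 2 forward rows -> 0
df['2_rows_forward_moving_tag'] = cnt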
I am running this code on a large csv file (1.5 million rows). Is there a way to optimise it?
df is a pandas dataframe.
I take a row and want to know what happens first in the 1000 following rows:
do I find my value + 0.0004 first, or my value - 0.0004?
result = []
for row in range(len(df) - 1000):
    start = df.at[row, 'A']   # .at replaces the deprecated df.get_value
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):
        ref = df.at[row + n, 'B']
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n == 999:
            result.append(0)
the dataframe is like :
timestamp A B
0 20190401 00:00:00.127 1.12230 1.12236
1 20190401 00:00:00.395 1.12230 1.12237
2 20190401 00:00:00.533 1.12229 1.12234
3 20190401 00:00:00.631 1.12228 1.12233
4 20190401 00:00:01.019 1.12230 1.12234
5 20190401 00:00:01.169 1.12231 1.12236
the result is: result = [0, 0, 1, 0, 0, 1, -1, 1, …]
This works, but it takes a long time to process such large files.
To generate values for the "first outlier", define the following function:
def firstOutlier(row, dltRow=4, dltVal=0.1):
    ''' Find the value for the first "outlier". Parameters:
        row - the current row
        dltRow - number of rows to check, starting from the current
        dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                         # index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]   # "dltRow" rows from the current
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:              # no outliers within the range of rows
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))
Then apply it to each row:
df.apply(firstOutlier, axis=1)
This function relies on the fact that the DataFrame has an index consisting
of consecutive numbers starting from 0, so that given ind, the index of
any row, we can access the row with df.iloc[ind] and a slice of n rows,
starting from this row, with df.iloc[ind : ind + n].
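For example (illustration only), with a default RangeIndex:
df.iloc[2]           # the row whose index is 2
df.iloc[2 : 2 + 4]   # a slice of 4 rows, starting from that row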
For my test, I set the default values of parameters to:
dltRow = 4 - look at 4 rows, starting from the current one,
dltVal = 0.1 - look for rows with B column "distant by" 0.1
or more from A in the current row.
My test DataFrame was:
A B
0 1.00 1.00
1 0.99 1.00
2 1.00 0.80
3 1.00 1.05
4 1.00 1.20
5 1.00 1.00
6 1.00 0.80
7 1.00 1.00
8 1.00 1.00
The result (for my data and default values of parameters) was:
0 -1
1 -1
2 -1
3 1
4 1
5 -1
6 -1
7 0
8 0
dtype: int64
For your needs, change the default parameter values to 1000 and 0.0004, respectively.
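A usage sketch (my example; it assumes df holds the A and B columns from the question):
df['result'] = df.apply(lambda row: firstOutlier(row, dltRow=1000, dltVal=0.0004), axis=1)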
The idea is to loop through A and B while maintaining a sorted list of A values. Then, for each B, find the highest A that loses and the lowest A that wins. Since the list is sorted, each search is O(log n). Only those A's whose index falls within the last 1000 are used for setting the result vector. After that, the A's that are no longer waiting for a B are removed from the sorted list to keep it small.
import numpy as np
import bisect
import time

N = 10
M = 3
#N = int(1e6)
#M = int(1e3)
thresh = 0.4
A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)
l = []
t_start = time.time()
for i in range(N):
    a = (A[i], i)
    bisect.insort(l, a)
    b = B[i]
    # boundaries in the sorted list: first A that loses to this B, last A that wins
    firstLoseInd = bisect.bisect_left(l, (b + thresh, -1))
    lastWinInd = bisect.bisect_right(l, (b - thresh, -1))
    for j in range(lastWinInd):
        curInd = l[j][1]
        if curInd > i - M:
            result[curInd] = 1
    for j in range(firstLoseInd, len(l)):
        curInd = l[j][1]
        if curInd > i - M:
            result[curInd] = -1
    del l[firstLoseInd:]
    del l[:lastWinInd]
t_done = time.time()
print(A)
print(B)
print(result)
print(t_done - t_start)
This is a sample output:
[ 0.22643589 0.96092354 0.30098532 0.15569044 0.88474775 0.25458535
0.78248271 0.07530432 0.3460113 0.0785128 ]
[ 0.83610433 0.33384085 0.51055061 0.54209458 0.13556121 0.61257179
0.51273686 0.54850825 0.24302884 0.68037965]
[ 1. -1. 0. 1. -1. 0. -1. 1. 0. 1.]
For N = int(1e6) and M = int(1e3) it took about 3.4 seconds on my computer.
I am iterating over a pandas dataframe and finding it to be extremely slow. I understand that in pandas you try to vectorize everything, but in this case I specifically need to iterate (or, if it is possible to vectorize, I'm unclear how to do it).
The logic is simple: you have two columns "A" and "B" and a result column "signal". If A equals 1, then you set signal to 1. If B equals 1, then you set signal to 0. Otherwise, signal stays whatever it was previously. In other words, column A is an "on" signal, column B is an "off" signal, and "signal" represents the state.
Here is my code:
def signals(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    data['signal'] = 0
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            data['signal'].iloc[i] = 1
        elif data['B'].iloc[i] == 1:
            data['signal'].iloc[i] = 0
        else:
            data['signal'].iloc[i] = data['signal'].iloc[i-1]
    return data
Example input/output:
indata = pd.DataFrame(index = range(0,10))
indata['A'] = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
indata['B'] = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
signals(indata)
Output:
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
This simple logic takes my computer 46 seconds to run on a dataframe of 2000 rows with randomly generated data.
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('head', 1)
df
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
The logic here is to divide the series into groups based on inequality between A and B; each group's value is then determined by the first value of A in that group.
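To see the grouping key on the sample data (a quick check):
(df.A != df.B).cumsum()
# 0    1
# 1    2
# 2    2
# 3    2
# 4    3
# 5    3
# 6    4
# 7    4
# 8    5
# 9    6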
You don't need to iterate at all; you can do it with some Boolean indexing:
#set condition for A
indata.loc[indata.A == 1,'signal'] = 1
#set condition for B
indata.loc[indata.B == 1,'signal'] = 0
#forward fill NaN values
indata.signal.fillna(method='ffill',inplace=True)
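Note that if neither condition has fired yet, the leading rows remain NaN after the forward fill; a final fill covers that (my addition, assuming 0 is the desired initial state):
#fill any leading NaNs and restore an integer dtype
indata['signal'] = indata['signal'].fillna(0).astype(int)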
The simplest answer to my problem was to not write to the dataframe while iterating through it. I created an array of zeros in numpy, then did my iterative logic in the array. Then I wrote the array to the column in my dataframe.
def signals3(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    out_signal = np.zeros(numrows)
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            out_signal[i] = 1
        elif data['B'].iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i-1]
    data['signal'] = out_signal
    return data
On a dataframe of 2000 rows of random data, this takes only 43 milliseconds as opposed to 46 seconds (~1,000x faster).
I also tried a variant where I assigned the dataframe columns A and B to series, and then iterated through the series. This was a bit faster (27 milliseconds). But it appears most of the slowness is in writing to a dataframe.
Both coldspeed's and djk's answers were faster than my solution (about 4.5 ms), but in practice I'll probably just iterate through series, even though that is not optimal.
I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I am not able to express myself properly.
Say I have a simple data frame with two columns:
index Num_Albums Num_authors
0 10 4
1 1 5
2 4 4
3 7 1000
4 1 44
5 3 8
Num_Albums_tot = sum(Num_Albums) = 30
I need to do a cumulative sum of the data in Num_Albums until a certain condition is reached. Register the index at which the condition is achieved and get the correspondent value from Num_authors.
Example:
cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):
10 within 15±2? No, then continue;
10+1 = 11 within 15±2? No, then continue;
10+1+4 = 15 within 15±2? Yes, stop.
Condition reached at index 2. Then get Num_authors at that index: Num_authors(2) = 4
I would like to know if there's a function already implemented in pandas before I start writing a while/for loop....
[I would like to specify the column from which to retrieve the value at the relevant index (this comes in handy when I have e.g. 4 columns and I want to sum elements in column 1; once the condition is reached, get the corresponding value in column 2; then do the same with columns 3 and 4).]
Opt - 1:
You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in tolerance parameter to check whether the values in this series lie within the specified threshold of 15 +/- 2. This returns a boolean array.
Through np.flatnonzero, get the ordinal indices for which the True condition holds, and select the first instance of a True value.
Finally, use .iloc to retrieve the value of the column you require at the index computed earlier.
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val] # for faster access, use .iat
4
When performing np.isclose on the series (converted to an array):
np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False, True, False, False, False], dtype=bool)
Opt - 2:
Use pd.Index.get_loc on the cumsum calculated series which also supports a tolerance parameter on the nearest method.
val = pd.Index(df.Num_Albums.cumsum()).get_loc(15, 'nearest', tolerance=2)
df.at[val, 'Num_authors']   # .at replaces the deprecated df.get_value
4
Opt - 3:
Use idxmax to find the first index of a True value for the boolean mask created after sub and abs operations on the cumsum series:
df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']   # .at replaces the deprecated df.get_value
4
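Broken into steps (same logic, just unrolled for readability):
mask = df.Num_Albums.cumsum().sub(15).abs().le(2)   # True where the cumsum is within 15±2
idx = mask.idxmax()                                 # label of the first True
df.at[idx, 'Num_authors']                           # -> 4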
I think you can directly add a column with the cumulative sum as:
In [3]: df
Out[3]:
index Num_Albums Num_authors
0 0 10 4
1 1 1 5
2 2 4 4
3 3 7 1000
4 4 1 44
5 5 3 8
In [4]: df['cumsum'] = df['Num_Albums'].cumsum()
In [5]: df
Out[5]:
index Num_Albums Num_authors cumsum
0 0 10 4 10
1 1 1 5 11
2 2 4 4 15
3 3 7 1000 22
4 4 1 44 23
5 5 3 8 26
And then apply the condition you want on the cumsum column. For instance, you can use where to get the full row matching the filter. Setting the tolerance tol:
In [18]: tol = 2
In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()
In [20]: cond
Out[20]:
index Num_Albums Num_authors cumsum
2 2.0 4.0 4.0 15.0
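From there, the Num_authors value can be pulled out of the filtered row, for instance:
In [21]: cond['Num_authors'].iloc[0]
Out[21]: 4.0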
This could even be done with the following code:
def your_function(df):
    total = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        total += i
        index += 1
        if total == ( " your_condition " ):
            return (index, df.loc[df.Num_Albums == i, 'Num_authors'])
This would actually return a tuple of the index and the corresponding value of Num_authors as soon as "your condition" is reached.
or it could even be returned as an array by:
def your_function(df):
    total = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        total += i
        index += 1
        if total == ( " your_condition " ):
            return df.loc[df.Num_Albums == i, 'Num_authors'].index.values
I was not able to figure out the condition you mentioned for the cumulative sum (when to stop summing), so I left it as " your_condition " in the code!
I am also new, so I hope this helps!