I'm a new Python user (making the shift from VBA) and am having trouble figuring out Python's for loops. I have a dataframe df, and I want to create a column of values based on a condition being met in other columns, via a loop. Something like the below:
cycle = 5
dummy = 1
for i in range(1, cycle + 1):
    if (df["high"].iloc[i] >= df["exit"].iloc[i]
            and df["low"].iloc[i] <= df["exit"].iloc[i]):
        df["signal"] = dummy
        break
    elif i == cycle:
        df["signal"] = cycle + 1
        break
    else:
        dummy = dummy + 1
Basically I'm trying to find, for each row, in which of the next rows (up to the cycle variable) the conditions in the if statement are met, and if they're never met, assign cycle + 1. So df["signal"] will be a column of numbers ranging from 1 to (cycle + 1). Also, there are some NaN values in df["exit"]; I'm not sure how that affects the loop.
I've found fairly extensive documentation on row iteration on the site, and I feel like this is close to where I need to get, but I can't figure out how to adapt it. Thanks for any advice!
EDIT: INCLUDED DATA SAMPLE FROM EXCEL CELLS BELOW:
high  low  exit  test  signal (output column)
4     3    4     1     1
2     2    2     1     1
2     3    5     0     6
4     3    1     0     5
2     5    2     0     4
5     5    1     0     3
3     1    5     0     2
5     1    5     1     1
1     1    4     0     0
EDIT 2: FURTHER CLARIFICATION AROUND SCRIPT
Once the condition
df["high"].iloc[i] >= df["exit"].iloc[i] and
df["low"].iloc[i] <= df["exit"].iloc[i]
is met in the loop, it should terminate for that particular instance/row.
EDIT 3: EXPECTED OUTPUT
The expected output is the df["signal"] column - it is the first instance in the loop where the condition
df["high"].iloc[i] >= df["exit"].iloc[i] and
df["low"].iloc[i] <= df["exit"].iloc[i]
is met for any given row. The value in df["signal"] is effectively i from the loop, i.e. the iteration at which the condition is first met.
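For reference, the spec above can be written as a literal (if slow) double loop; this is my reading of the expected output, so treat the details as assumptions. Note that rows with NaN in df["exit"] can never satisfy the condition, since comparisons against NaN are always False:
cycle = 5
signal = []
for j in range(len(df)):
    value = cycle + 1  # default: condition never met within the window
    for i in range(1, cycle + 1):  # i = 1 means "this row itself"
        k = j + i - 1
        if k >= len(df):  # ran off the end of the frame
            break
        if df["high"].iloc[k] >= df["exit"].iloc[k] and df["low"].iloc[k] <= df["exit"].iloc[k]:
            value = i  # first row where both conditions hold
            break
    signal.append(value)
df["signal"] = signal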
Here is how I would solve the problem; the column 'gr' must not exist before doing this:
# first check all the rows meeting the conditions and add 1 in a temporary column gr
df.loc[(df["high"] >= df["exit"]) & (df["low"] <= df["exit"]), 'gr'] = 1
# manipulate column gr to use groupby after
df['gr'] = df['gr'].cumsum().bfill()
# use cumcount after groupby to recalculate signal
df.loc[:,'signal'] = df.groupby('gr').cumcount(ascending=False).add(1)
# cut the value in signal to the value cycle + 1
df.loc[df['signal'] > cycle, 'signal'] = cycle + 1
# drop column gr
df = df.drop(columns='gr')
and you get
high low exit signal
0 4 3 4 1
1 2 2 2 1
2 2 3 5 6
3 4 3 1 5
4 2 5 2 4
5 5 5 1 3
6 3 1 5 2
7 5 1 5 1
8 1 1 4 1
Note: The last row does not work properly, as no row meeting the condition ever comes after it; I'm not sure how this will look in the full data or how to handle it. You may consider adding df = df.dropna(subset=['gr']) after the line starting with df['gr'] = ... to drop these last rows; up to you.
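Alternatively, if those trailing rows should follow the question's "never met" rule, one option (my assumption, not part of the answer above) is to fill them before dropping gr:
# rows after the last condition-met row keep NaN in 'gr' even after bfill,
# so give them cycle + 1 per the question's "never met" rule
df.loc[df['gr'].isna(), 'signal'] = cycle + 1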
In pandas, how can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
A   B
5   1  (the first B_i)
12  1
14  0
22  1
20  0
33  1
Use diff with a le comparison against your threshold, then convert from boolean to int using astype:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB: using a le(N) comparison with an inversion gives 1 for the first value, since diff produces NaN there and NaN compares False under le.
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
I was a little confused by your row numbering, because we should have a missing value on the last row instead of the first if we calculate B_i based on the condition A_(i+1)-A_(i) (the first row would have both A_(i) and A_(i+1), and the last row would be missing A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"], data=[5, 12, 14, 22, 20, 33])
df['shifted_A'] = df['A'].shift(1)  # this column can be removed - it is only here to show how shift works on the final dataframe
df['B'] = ''
df.loc[((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10), 'B'] = 1  # update rows that fulfill one of the conditions with 1
df.loc[(df['A'] - df['A'].shift(1)) <= 5, 'B'] = 0  # update rows that fulfill the condition with 0
df.loc[df.index == 0, 'B'] = 1  # update the first row in column B
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I guess it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue lets us update values in the given column where the condition (mask) is fulfilled
((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10)
Each condition here returns True or False. If we add them, the result is True if either one is True (which is simply OR). In case we need AND, we can multiply the conditions.
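To make that concrete, here is a quick demonstration; the idiomatic pandas spellings | (OR) and & (AND) give the same results on Boolean Series:
import pandas as pd
a = pd.Series([True, False, True])
b = pd.Series([False, False, True])
print(a + b)  # addition on bool dtype acts as elementwise OR: True, False, True
print(a * b)  # multiplication acts as elementwise AND: False, False, True
print(a | b)  # idiomatic OR - same result as a + b
print(a & b)  # idiomatic AND - same result as a * b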
Use Series.diff and fill the first missing value with N, so that the greater-or-equal comparison with Series.ge yields True for it:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
I created a dataframe with pandas and calculated the percentage earning or losing, and I hope to design two columns, entered market and trail, for a trailing-stop backtest (for example a 5% stop), like:
earning/losing entered market trail
0 0 0
1 1 1
2 1 2
3 1 3
7 1 7
4 1 7
5 1 7
8 1 8
2 0 0
5 0 0
4 0 0
I tried using numpy conditions to create it, but I can't complete the rest of the conditions:
import numpy as np

condition = [(df['earning/losing'] > 0) & (df['earning/losing'] > df['earning/losing'].shift(-1)) & (df['earning/losing'] - df['earning/losing'].shift(-1) < 5),]
value = [df['earning/losing'],]
df['trail'] = np.select(condition, value, default=0)
I think if I could create a column like trail, then I could judge the trailing condition, but I don't know how to create the trail column in pandas.
Can anyone help me out? Thanks a lot!
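One way to get columns like the sample is a plain loop with a little state. The exact rules are my guess from the sample (enter when earning/losing first turns positive, trail tracks the running peak while in the market, exit once the value falls 5 or more below the peak, and stay out after the stop is hit), so treat this as a sketch:
import pandas as pd
df = pd.DataFrame({'earning/losing': [0, 1, 2, 3, 7, 4, 5, 8, 2, 5, 4]})
entered, trail = [], []
in_market, stopped, peak = False, False, 0
for x in df['earning/losing']:
    if not in_market and not stopped and x > 0:
        in_market, peak = True, x  # enter the market
    elif in_market:
        if peak - x >= 5:  # trailing stop hit: exit and stay out
            in_market, stopped, peak = False, True, 0
        else:
            peak = max(peak, x)  # ratchet the peak upwards
    entered.append(int(in_market))
    trail.append(peak if in_market else 0)
df['entered market'] = entered
df['trail'] = trail
print(df)
This reproduces the table above, including the rows after the stop where both columns return to 0.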
Given the following exemplary dataframe/series,
I have - for some given reason - identified row number 6 as the relevant base row and I now want to find the row where the unbroken series of ones started (in this case that is row 3).
I explicitly do not want to find the first row containing a one (which would be row 0); instead I want the row for which the following holds: starting from our base row (row 6), go up until you no longer find a one, then return the index of the last row that still contained a one.
A
0 1
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 0
9 0
I hope this is somewhat clear. Thanks for any suggestions!
Also, I am grateful for approaches that can be generically adapted to cases where, for example,
the base row does not contain a one itself (in that case it is about finding the start of some previous series of ones within this column), or
I am not interested in the previous one-series but rather in the one that followed.
Here's a solution that should be performant:
import numpy as np

n = 6
# a new group id is formed every time the value changes in A
df["group_id"] = np.cumsum(df.A != df.A.shift())
# get the group id for row n, then return the first row of that group
group = df.group_id.iloc[n]
df[df.group_id == group].head(1)
df with group_id; this can be dropped as needed:
A group_id
0 1 1
1 0 2
2 0 2
3 1 3
4 1 3
5 1 3
6 1 3
7 1 3
8 0 4
9 0 4
result:
A group_id
3 1 3
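The same group_id trick adapts to the generalizations asked for above. Runs of equal values alternate between ones and zeros, so the previous run of ones is one group back when row n sits in a zero-run and two groups back when it sits in a one-run (a sketch, assuming such a neighbouring group exists; use + instead of - for the following run):
offset = 1 if df.A.iloc[n] == 0 else 2
prev_ones = df[df.group_id == group - offset]
prev_ones.head(1)  # first row of the previous run of ones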
I am not sure if there is any built-in method for that, but you can most certainly use a loop.
start = 6  # by any method you have identified
while start > 0:
    if df['A'][start] == 1:
        start -= 1
    else:
        break
print(start + 1)  # start of the unbroken run (assumes the run does not reach row 0)
Now, to access that row you can do something like
df.iloc[start+1, :]
If you want to go downwards:
while start < df.shape[0]:
    # same if
    start += 1
I have a dataframe df like this:
trial id run rt acc
0 1 1 1 0.941836 1
1 2 1 1 0.913791 1
2 3 1 1 0.128986 1
3 4 1 1 0.155720 0
4 1 1 2 0.414175 0
5 2 1 2 0.699326 1
6 3 1 2 0.781877 1
7 4 1 2 0.554666 1
There are 2 runs per id, and 70+ trials per run. Each row contains one trial. So the hierarchy is id - run - trial.
I want to retain only runs where mean acc is above 0.5, so I used temp = df.groupby(['id', 'run']).agg(np.average) and keep = temp[temp['acc'] > 0.5] .
Now I want to remove all trials from runs that are not in keep.
I tried to use df[df['id'].isin(keep['id'])&df['run'].isin(keep['run'])], but this doesn't seem to work correctly. df.query doesn't seem to work either as the indices and columns differ between the dataframes.
Is there another way of doing this?
I want to retain only runs where mean acc is above 0.5
Using groupby + transform, you can use a single Boolean series for indexing:
df = df[df.groupby(['id', 'run'])['acc'].transform('mean') > 0.5]
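To see why this works, run the transform on its own before filtering: it returns a Series aligned with the original rows (each row carries the mean acc of its own (id, run) group), so the comparison yields a Boolean mask that indexes df directly:
means = df.groupby(['id', 'run'])['acc'].transform('mean')
print(means)  # one value per original row, repeated within each group
print(df[means > 0.5])  # rows from runs whose mean acc exceeds 0.5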
I am trying to add a running count to a pandas df.
For the values in Column A, I want to add '5' and for values in Column B I want to add '1'.
So for the df below I'm hoping to produce:
A B Total
0 0 0 0
1 0 0 0
2 1 0 5
3 1 1 6
4 1 1 6
5 2 1 11
6 2 2 12
So every incremental integer in Column A adds 5 to the total, while every increment in Column B adds 1.
I tried:
df['Total'] = df['A'].cumsum(axis = 0)
But this doesn't include Column B
df['Total'] = df['A'] * 5 + df['B']
As far as I can tell, you are simply doing row-wise operations, not a cumulative sum. This snippet multiplies each row's value of A by 5 and adds the row's value of B. Please don't make it any more complicated than it really is.
What is a cumulative sum (also called running total)?
From Wikipedia:
Consider the sequence < 5 8 3 2 >. What is the total of this sequence?
Answer: 5 + 8 + 3 + 2 = 18. This is arrived at by simple summation of the sequence.
Now we insert the number 6 at the end of the sequence to get < 5 8 3 2 6 >. What is the total of that sequence?
Answer: 5 + 8 + 3 + 2 + 6 = 24. This is arrived at by simple summation of the sequence. But if we regarded 18 as the running total, we need only add 6 to 18 to get 24. So, 18 was, and 24 now is, the running total. In fact, we would not even need to know the sequence at all, but simply add 6 to 18 to get the new running total; as each new number is added, we get a new running total.