I am trying to simulate a trailing stop, used in trading.
Some data:
(input) (output)
price peg
1000 0995 set a price - 5
1001 0996 following price up
1002 0997 following price up
1001 0997 not following price down
1010 1005 following price up
1012 1007 following price up
1010 1007 not following price down
1006 STOP the price went below the last peg
The logic is the following:
I start by setting a peg at price - 5, which gives 995.
Each time the price goes up, the peg follows it up, always keeping a -5 gap.
If the price goes down, the peg does NOT go down.
If the price goes below, or equal to, the peg, I need to know the index, and the process RESTARTS.
Is there a Pandas idiomatic way to do this? I've implemented it as a loop, but it is very slow.
This is the code I've written for the loop:
# i is the index at which we enter a trade,
# and I want to go through the rest of the dataframe to see if it would
# hit a trailing stop
if direction == +1:  # only long trades in this example
    peg_price = entry_price - 5
    for j in range(i + 1, len(df)):
        low = df['low'][j]
        if low <= peg_price:            # the trailing stop is hit
            trade_date.append(df['date'][i])
            trade_exit_date.append(df['date'][j])
            trade_price.append(entry_price)
            trade_exit.append(peg_price)
            trade_profit.append(peg_price - entry_price)
            skip_to = j + 1             # the process restarts from here
            break
        else:
            high = df['high'][j]        # update the peg from the high, never letting it fall
            peg_price = max(high - 5, peg_price)
My real case is a bit more complex, because I need to compare the peg against the 'low' price but update it with the 'high' price; but the idea is there.
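For what it's worth, for a single entry the inner scan can usually be vectorised with a cumulative maximum instead of a Python loop. A rough sketch under the column names used above ('high', 'low') and a positional RangeIndex; the helper name and the gap parameter are just illustrative:
import numpy as np

def first_stop(df, i, entry_price, gap=5):
    """Return (exit position, peg at exit) for the first trailing-stop hit after row i,
    or (None, None) if the stop is never hit. Sketch only."""
    highs = df['high'].to_numpy()[i + 1:]
    lows = df['low'].to_numpy()[i + 1:]
    # peg in force *before* each bar: entry_price - gap, ratcheted up by the highs of earlier bars
    peg = np.maximum.accumulate(np.concatenate(([entry_price], highs[:-1])) - gap)
    hits = np.flatnonzero(lows <= peg)
    if hits.size == 0:
        return None, None
    return i + 1 + hits[0], peg[hits[0]]
This mirrors the loop above: the peg compared against bar j's low is built only from the entry price and the highs strictly before j.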
IIUC:
data = {"price":[1000,1001,1002,1001,1010,1012,1010,1006]}
df = pd.DataFrame(data)
# first make a column of price-5
df["peg"] = df["price"]-5
# use np.where to check whether price dropped or increased
df["peg"] = np.where(df["price"].shift()>df["price"],df["peg"].shift(),df["peg"])
print (df)
price peg
0 1000 995.0
1 1001 996.0
2 1002 997.0
3 1001 997.0
4 1010 1005.0
5 1012 1007.0
6 1010 1007.0
7 1006 1005.0
# Get the index of STOP
print (df[df["peg"].shift()>df["peg"]])
price peg
7 1006 1005.0
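If you need that index as a plain integer (e.g. to restart the process from there), something like this should work on the frame built above:
stop_idx = df[df["peg"].shift() > df["peg"]].index[0]
print(stop_idx)   # 7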
Here is one way.
The idea is to express all your logical conditions as True/False booleans; we can then step through the assignments and apply them one by one. Once we have done that, we find the row where the peg is greater than the price and assign STOP. If you have data that you need to set to NA after this point, you can easily do a logical .loc and assign any values after the STOP to NA (see the sketch after the output below).
For this example I've kept your peg column and used a column named counter, so we can compare the two.
import pandas as pd

df = pd.DataFrame({"price": [1000, 1001, 1002, 1001, 1010, 1012, 1010, 1006],
                   "peg": ["0995", "0996", "0997", "0997", "1005", "1007", "1007", "STOP"]})

# booleans describing how the price moved relative to the previous row
peg1 = df['price'].sub(df['price'].shift(1)) == 1    # rolling cumcount
peg2 = df['price'].sub(df['price'].shift(1)) > 1     # -5 these vals
peg3 = df['price'].sub(df['price'].shift(1)) <= -1   # keep as row above

# assignments (seed the first row so ffill has a starting value to propagate)
df.loc[0, 'counter'] = df.loc[0, 'price'] - 5
df.loc[peg1, 'counter'] = df['counter'].ffill() + peg1.cumsum()
df.loc[peg2, 'counter'] = df['price'] - 5
df.loc[peg3, 'counter'] = df['counter'].ffill()
df.loc[df['counter'] > df['price'], 'counter'] = 'STOP'
print(df)
price peg counter
0 1000 0995 995
1 1001 0996 996
2 1002 0997 997
3 1001 0997 997
4 1010 1005 1005
5 1012 1007 1007
6 1010 1007 1007
7 1006 STOP STOP
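And, as mentioned above, if you then need to blank out everything after the stop, a short sketch against the frame from this answer (blanking the counter column here; list any other columns you need the same way):
import numpy as np

# position of the first STOP, then NA out every later row's counter
stop_pos = df[df['counter'] == 'STOP'].index[0]
df.loc[df.index > stop_pos, 'counter'] = np.nan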
Related
I have a table that is chronologically sorted, with a state and an amount for each date. The table looks as follows:
Date        State  Amount
01/01/2022  1      1233.11
02/01/2022  1      16.11
03/01/2022  2      144.58
04/01/2022  1      298.22
05/01/2022  2      152.34
06/01/2022  2      552.01
07/01/2022  3      897.25
To generate the dataset:
import pandas as pd

df = pd.DataFrame({'date': ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022", "05/08/2022", "06/08/2022",
                            "07/08/2022", "08/08/2022", "09/08/2022", "10/08/2022", "11/08/2022"],
                   'state': [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
                   'amount': [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166]})
I want to add a column called "Rank" that is incremented each time a state reappears after a different state; it is effectively a counter of how many separate consecutive runs of that state have occurred so far. That is, if State is 1 for two days in a row, Rank is 1 for both days. Then another state appears. When State 1 appears again, its Rank increments to 2. An example would be as follows:
Date        State  Amount   Rank
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58   1
04/01/2022  1      298.22   2
05/01/2022  2      152.34   2
06/01/2022  2      552.01   2
07/01/2022  3      897.25   1
This could also be understood as follows:
Date        State  Amount   Rank_State1  Rank_State2  Rank_State3
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58                1
04/01/2022  1      298.22   2
05/01/2022  2      152.34                2
06/01/2022  2      552.01                2
07/01/2022  3      897.25                             1
Does anyone know how to build that Rank column starting from the previous table?
Your problem is in the general category of state change accumulation, which suggests an approach using cumulative sums and booleans.
Here's one way you can do it - maybe not the most elegant, but I think it does what you need
import pandas as pd

someDF = pd.DataFrame({'date': ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022", "05/08/2022", "06/08/2022",
                                "07/08/2022", "08/08/2022", "09/08/2022", "10/08/2022", "11/08/2022"],
                       'state': [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
                       'amount': [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166]})

someDF["StateAccumulator"] = someDF["state"].apply(str).cumsum()

def groupOccurrence(someRow):
    # count how many separate runs of this row's state appear in the accumulated state string
    sa = someRow["StateAccumulator"]
    s = str(someRow["state"])
    stateRank = len("".join([i if i != '' else " " for i in sa.split(s)]).split()) \
        + int((sa.split(s)[0] == '') or (int(sa.split(s)[-1] == '')) and sa[-1] != s)
    return stateRank

someDF["Rank"] = someDF.apply(lambda x: groupOccurrence(x), axis=1)
If I understand correctly, this is the result you want - "Rank" represents the number of contiguous runs of a given state that have appeared so far:
date state amount StateAccumulator Rank
0 01/08/2022 1 144 1 1
1 02/08/2022 1 142 11 1
2 03/08/2022 2 166 112 1
3 04/08/2022 2 144 1122 1
4 05/08/2022 3 142 11223 1
5 06/08/2022 1 166 112231 2
6 07/08/2022 1 144 1122311 2
7 08/08/2022 2 142 11223112 2
8 09/08/2022 2 166 112231122 2
9 10/08/2022 2 142 1122311222 2
10 11/08/2022 1 166 11223112221 3
Notes:
instead of the somewhat hacky string cumsum method I'm using here, you could probably use a list accumulation function and then use a pandas split-apply-combine method to do the counting in the lambda function
you would then apply a state change boolean, and do a cumsum on the state change boolean, filtered/grouped on the state value (so, how many state changes do we have for any given state)
state change boolean is done like this:
someDF["StateChange"] = someDF["state"] != someDF["state"].shift()
so for a given state at a given row, you'd count how many state changes had occurred in the previous rows (see the sketch below).
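A minimal sketch of that alternative (my reading of the notes above, not code taken from the answer), which reproduces the Rank column shown in the output:
import pandas as pd

someDF = pd.DataFrame({'state': [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1]})

# True wherever the state differs from the previous row (the first row counts as a change)
someDF["StateChange"] = someDF["state"] != someDF["state"].shift()

# for each state, count how many runs of that state have started up to and including this row
someDF["Rank"] = someDF.groupby("state")["StateChange"].cumsum().astype(int)

print(someDF["Rank"].tolist())   # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3]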
So I want to cluster the records in this table to find which records are 'similar' (i.e. have enough in common). An example of the table is as follows:
author beginpage endpage volume publication year id_old id_new
0 NaN 495 497 NaN 1975 1 1
1 NaN 306 317 14 1997 2 2
2 lowry 265 275 193 1951 3 3
3 smith p k 76 85 150 1985 4 4
4 NaN 248 254 NaN 1976 5 5
5 hamill p 85 100 391 1981 6 6
6 NaN 1513 1523 7 1979 7 7
7 b oregan 737 740 353 1991 8 8
8 NaN 503 517 98 1975 9 9
9 de wijs 503 517 98 1975 10 10
In this small table, the last row should get 'id_new' equal to 9, to show that these two records are similar.
To make this happen I wrote the code below, which works fine for a small number of records. However, I want to use my code for a table with 15000 records. And of course, if you do the maths, with this code this is going to take way too long.
Anyone who could help me make this code more efficient? Thanks in advance!
My code, where 'dfhead' is the table with the records:
# c is the list of columns to compare, e.g.
# c = ['author', 'beginpage', 'endpage', 'volume', 'publication year']
for r in range(0, len(dfhead)):
    for o_r in range(r + 1, len(dfhead)):
        # rows are considered similar when they agree on at least 3 of the compared columns
        if (dfhead.loc[r, c] == dfhead.loc[o_r, c]).sum() >= 3:
            if (dfhead.loc[o_r, ['id_new']] > dfhead.loc[r, ['id_new']]).sum() == 1:
                dfhead.loc[o_r, ['id_new']] = dfhead.loc[r, ['id_new']]
If you are only trying to detect whole equalities between "beginpage", "endpage", "volume", "publication" and "year", you should try to work on duplicates. I'm not sure about this, as your code is still a mystery to me.
Something like this might work (your column "id" needs to be named "id_old" at first in the dataframe though):
cols = ["beginpage", "endpage","volume", "publication", "year"]
#isolate duplicated rows
duplicated = df[df.duplicated(cols, keep=False)]
#find the minimum key to keep
temp = duplicated.groupby(cols, as_index=False)['index'].min()
temp.rename({'id_old':'id_new'}, inplace=True, axis=1)
#import the "minimum key" to duplicated by merging the dataframes
duplicated = duplicated.merge(temp, on=cols, how="left")
#gather the "un-duplicated" rows
unduplicated = df[~df.duplicated(cols, keep=False)]
#concatenate both datasets and reset the index
new_df = unduplicated.append(duplicated)
new_df.reset_index(drop=True, inplace=True)
#where "id_new" is empty, then the data comes from "unduplicated"
#and you could fill the datas from id_old
ix = new_df[new_df.id_new.isnull()].index
new_df.loc[ix, 'id_new'] = new_df.loc[ix, 'id_old']
I have the following dataframe:
print(dd)
dt_op quantity product_code
20/01/18 1 613
21/01/18 8 611
21/01/18 1 613
...
I am trying to get, for each date, the sales over the next "n" days, but the following code does not also break them down by product_code:
dd["Final_Quantity"] = [dd.loc[dd['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
for d in dd['dt_op']]
I would like to define dd["Final_Quantity"] as the sum of dd["quantity"] sold in the next "n" days, for each different product in stock; ultimately, for each i in dt_op and product_code.
print(final_dd)
n = 7
dt_op quantity product_code Final_Quantity
20/01/18 1 613 2
21/01/18 8 611 8
25/01/18 1 613 1
...
Regardless of how you want to present the output, you can try the following code to get the total sales for every product over every n days, say every 7 days:
dd.groupby([pd.Grouper(key='dt_op', freq='7D'), 'product_code']).sum()['quantity']
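Note that pd.Grouper needs dt_op to be an actual datetime column, so parse it first. A small sketch, assuming the dd frame from the question with day/month/year dates:
import pandas as pd

dd = pd.DataFrame({'dt_op': ['20/01/18', '21/01/18', '21/01/18'],
                   'quantity': [1, 8, 1],
                   'product_code': [613, 611, 613]})

# parse the day/month/year strings so pd.Grouper can bin on real dates
dd['dt_op'] = pd.to_datetime(dd['dt_op'], format='%d/%m/%y')

# total quantity per product within each 7-day window starting at the first date
out = dd.groupby([pd.Grouper(key='dt_op', freq='7D'), 'product_code'])['quantity'].sum()
print(out)
Keep in mind this buckets rows into fixed 7-day windows; it is not a per-row "next 7 days" sum like the list comprehension in the question.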
I need some help iterating over a groupby object in Python. I have people nested under a single ID variable, and each of them has balances for anywhere from 3 to 6 months. So, printing the groupby object looks, for example, like this:
(1, Primary BP Product Rpt Month Closing Balance
0 1 CHECK 201708 10.04
1 1 CHECK 201709 11.1
2 1 CHECK 201710 11.16
3 1 CHECK 201711 11.22
4 1 CHECK 201712 11.28
5 1 CHECK 201801 11.34)
(2, Primary BP Product Rpt Month Closing Balance
79 2 CHECK 201711 52.42
85 2 CHECK 201712 31.56
136 2 CHECK 201801 99.91)
I want to create another column that standardizes the closing balance based on their first amount. So the ideal output would then look like this:
(1, Primary BP Product Rpt Month Closing Balance standardized
0 1 CHECK 201708 10.04 0
1 1 CHECK 201709 11.1 1.1
2 1 CHECK 201710 11.16 1.16
3 1 CHECK 201711 11.22 1.22
4 1 CHECK 201712 11.28 1.28
5 1 CHECK 201801 11.34 1.34)
(2, Primary BP Product Rpt Month Closing Balance standardized
79 2 CHECK 201711 52.42 0
85 2 CHECK 201712 31.56 -20.86
136 2 CHECK 201801 99.91 47.79)
I just can't quite figure out how to write a loop (or any other approach) that iterates within the groups of a groupby object, taking the first closing balance and subtracting it from each closing balance, to essentially create a difference score.
I solved it! Only two weeks later. Did it without the use of a groupby object. Here is how:
import pandas as pd

bpid = []
diffs = []

# These two lines were just a bit of cleaning needed to make the values numeric
data['Closing Balance'] = data['Closing Balance'].str.replace(",", "")
data['Closing Balance'] = pd.to_numeric(data['Closing Balance'])

# Create a new variable that shows the increase in closing balance for each month,
# setting the first month of each person to 0
for index, row in data.iterrows():
    bp = row[0]                  # the person's ID (first column)
    if bp not in bpid:           # first time we see this ID
        bpid.append(bp)
        first = row[3]           # their first closing balance
    bal = row[3]
    diff = round(bal - first, 2)
    diffs.append(diff)

# Just checking to make sure there are the right number of values. Same as data, so good to go
print(len(diffs))

# Convert my list of differences in closing balance to a Series object, and attach it to the data
se = pd.Series(diffs)
data['balance increase'] = se.values
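For reference, the same difference score can be computed directly on a groupby with transform('first'). A sketch with hypothetical column names taken from the printed groups above (adjust them to your real ones):
import pandas as pd

# hypothetical miniature of the data shown above
data = pd.DataFrame({'Primary BP': [1, 1, 1, 2, 2, 2],
                     'Closing Balance': [10.04, 11.10, 11.16, 52.42, 31.56, 99.91]})

# subtract each group's first closing balance from every balance in that group
data['balance increase'] = (data['Closing Balance']
                            - data.groupby('Primary BP')['Closing Balance'].transform('first'))
print(data)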
I want to create a percentage change column for each column that is a float in my dataframe, and store it each time in a new column named after the initial column with the suffix "_change".
I tried this, but it does not seem to work. Any idea?
for col in df.columns:
    if df[col].dtypes == "float":
        df[ col&'_change'] = (df.col - df.groupby(['New_ID']).col.shift(1))/ df.col
For example, if my column is df["Expenses"], I would like to save the percentage change in df["Expenses_change"].
Edit: adding an example dataframe and the expected output.
df initially
Index ID Reporting_Date Sales_Am Exp_Am
0 1 01/01/2016 1000 900
1 1 02/01/2016 1050 950
2 1 03/01/2016 1060 960
3 2 01/01/2016 2000 1850
4 2 02/01/2016 2500 2350
5 2 03/01/2016 3000 2850
after the loop
Index ID Reporting_Date Sales_Am Sales_Am_chge Exp_Am Exp_Am_chge
0 1 01/01/2016 1000 Null 900 Null
1 1 02/01/2016 1050 5% 950 6%
2 1 03/01/2016 1060 1% 960 1%
3 2 01/01/2016 2000 Null 1850 Null
4 2 02/01/2016 2500 25% 2350 27%
4 2 03/01/2016 3000 20% 2850 21%
Keep in mind that I have more than two float columns in my dataframe.
Why are you using '&' instead of '+' in
df[ col&'_change']
?
String concatenation is performed in Python via the + operator.
So changing to col+'_change' will fix this issue for you.
You might find it helpful to read the relevant Python documentation.
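For example, with one of the column names from the frame in the question:
col = 'Sales_Am'
print(col + '_change')   # -> Sales_Am_change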
As mentioned in other answers, just changing & to + should do it. I was also getting issues using attribute access (dots) instead of square brackets, so I changed those too.
This code has been tested in Python 3 and it works :)
for col in df.columns:
    if df[col].dtypes == "float":
        # group by your ID column ('New_ID' in the snippet from the question)
        df[col + '_change'] = (df[col] - df.groupby(['New_ID'])[col].shift(1)) / df[col]
Enjoy!
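One more note: the expected output above (5%, 25%, ...) divides by the previous row's value, which is exactly what groupby.pct_change computes, so an alternative sketch (assuming the ID column from the example frame) would be:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'Sales_Am': [1000.0, 1050.0, 1060.0, 2000.0, 2500.0, 3000.0],
                   'Exp_Am': [900.0, 950.0, 960.0, 1850.0, 2350.0, 2850.0]})

# add a *_change column for every float column, as a per-ID percentage change
for col in df.select_dtypes(include='float').columns:
    df[col + '_change'] = df.groupby('ID')[col].pct_change() * 100

print(df.round(1))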