Updating column value based on previous values - python

I am working with a financial instrument DataFrame. First I used the following code to find all the occurrences where the closing price was higher than the previous high of n periods, or lower than the previous low of n periods, and assigned the relevant values to a new column "position":
entry = [self.data['close'] > self.data['open_buy_high'], self.data['close'] < self.data['open_sell_low']]
self.data['position'] = np.select(entry, [1, -1], 0)
This worked well and returned values. Next, I need the "position" column to stay equal to 1 or -1 until the closing price crosses the high or low of a shorter period. So I tried the following code:
exit = [(self.data['position'] == 1) & (self.data['close'] > self.data['close_buy_low']),
        (self.data['position'] == -1) & (self.data['close'] < self.data['close_sell_high'])]
self.data['position'] = np.select(exit, [1, -1], 0)
After running this I got exactly the same DataFrame back, and I realized the conditions I used mean that wherever the position was equal to zero, it will stay zero, because the second block of code only returns true when the position is already 1 or -1. So obviously I get the same result as if I had just run the first block.
My issue now is that I have no idea how to update the 0 values of the position column so that the position stays 1 (instead of reverting to 0) until the closing price falls below 'close_buy_low', or stays -1 until the closing price rises above 'close_sell_high'.
Any suggestions on what I can do or use to achieve this?
[screenshot of the DataFrame]
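One possible approach (a sketch, not a definitive answer, assuming the column names from the question): record entries as 1/-1 and exits as 0, leave every other row as NaN, then forward-fill so a position persists until an exit signal. Note this simple version lets either exit signal flatten either position, which may or may not matter for your strategy.
import numpy as np
import pandas as pd

entry_long  = self.data['close'] > self.data['open_buy_high']
entry_short = self.data['close'] < self.data['open_sell_low']
exit_long   = self.data['close'] < self.data['close_buy_low']
exit_short  = self.data['close'] > self.data['close_sell_high']

# 1/-1 on entries, 0 on exits, NaN everywhere else, then forward-fill
raw = np.select([entry_long, entry_short, exit_long | exit_short],
                [1, -1, 0], default=np.nan)
self.data['position'] = (pd.Series(raw, index=self.data.index)
                         .ffill()
                         .fillna(0)
                         .astype(int))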

Related

Python dataframe check if column value is greater than previous column

In Python I am trying to check, day by day (column by column) and by ID, whether the values, if not all equal to zero, are correctly incremented by one; and if at some point a value goes back to 0, whether the next day it is either still zero or incremented by one.
I have a dataframe with multiple columns; the first column is named "ID". All the other columns hold integers, and their names represent consecutive days, like this:
I need to check, for each ID:
if all the columns (i.e. all of the days) are equal to 0, then create a new column named "CHECK" equal to 0, meaning there is no error;
if not, then look column after column and check that the value in the next column is greater than the previous column's value (i.e. the day before) and incremented by exactly 1 (e.g. from 14 to 15, not from 14 to 16); if so, "CHECK" equals 0;
if these conditions aren't satisfied, it means the next column is either equal to the previous column or lower (but not equal to zero); in both cases it is an error and "CHECK" equals 1;
but if the next column's value is lower and equal to 0, then the one after it must be either still 0 or incremented by 1. Each time the value comes back to zero, the following columns must either be zero or increment by 1 from there.
If I explained everything correctly, then in this example the first two IDs are correct and their "CHECK" must equal 0, but the next ID should have a "CHECK" value of 1.
I hope this is not confusing. Thanks.
I tried this, but I would like to use the column's index/position rather than its name. The code is not finished:
df['check'] = np.where((df['20110531']<=df['20110530']) & ([df.columns!="ID"] != 0),1,0)
You could write a simple function to go along each row and check for your condition. In the example code below, I first set the index to ID.
df = df.set_index('ID')

def func(r):
    start = r.iloc[0]
    for j, i in enumerate(r):
        if j == 0:  # nothing to compare the first day against
            continue
        if i == 0 or i == start + 1:  # allowed: reset to 0, or exactly previous + 1
            start = i
        else:  # anything else is an error
            return 1
    return 0

df['check'] = df.apply(func, axis=1)
If you want to keep the original index, then don't set it and instead use df['check'] = df.iloc[:, 1:].apply(func, axis=1), so the "ID" column is skipped.
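For larger frames, here is a vectorized sketch of the same rule (assuming, as above, that ID is the index and the day columns are already in chronological order): each value must be 0 or exactly the previous value plus 1.
import numpy as np

prev = df.shift(axis=1)                       # each day's previous-day value
ok = df.eq(0) | df.eq(prev + 1)               # allowed: 0, or exactly previous + 1
ok.iloc[:, 0] = True                          # the first day has no predecessor
df['check'] = np.where(ok.all(axis=1), 0, 1)  # 0 = no error, 1 = error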

How to do Pyspark window filtering and get only the first row by a condition?

I have the following dataframe:
I need to get this:
So:
Partition on zrep and pos
Sorting by day_num
For each row sequence where problem_stock == 0, I need to get the first rise from 0 to positive
The partitioning and sorting are not the problem; what I don't know is how to apply the problem_stock == 0 filter within the partition and keep only the first rise.
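A sketch of one possible approach (untested, assuming the columns zrep, pos, day_num and problem_stock described above): use lag to flag rows where the stock rises from 0 to positive, then keep the first flagged row per partition with row_number.
from pyspark.sql import Window, functions as F

w = Window.partitionBy('zrep', 'pos').orderBy('day_num')

# flag rows where problem_stock rises from 0 to a positive value
flagged = (df
    .withColumn('prev_stock', F.lag('problem_stock').over(w))
    .withColumn('is_rise',
                (F.col('prev_stock') == 0) & (F.col('problem_stock') > 0)))

# keep only the first rise per (zrep, pos) partition
first_rise = (flagged
    .where(F.col('is_rise'))
    .withColumn('rn', F.row_number().over(w))
    .where(F.col('rn') == 1)
    .drop('prev_stock', 'is_rise', 'rn'))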

calculate descriptive statistics in pandas dataframe based on a condition cyclically

I have a pandas DataFrame with more than 100 thousand rows. The index represents the time, and two columns represent the sensor data and the condition.
When the condition becomes 1, I want to start calculating a score card (average and standard deviation) until the next 1 comes. This needs to be calculated for the whole dataset.
Here is a picture of the DataFrame for a specific time span:
What I thought is to iterate through the index and items of the df and, when the condition is met, start calculating the descriptive statistics.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        # start = ?
        # end = ?
        # df_b.loc[start:end]
I am not sure how to calculate start and end for this DataFrame. The end will be the start of the next cycle. Additionally, I think this iteration is not optimal because it takes quite a long time. I appreciate any idea or solution for this problem.
Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is the beginning and the second element is the end, as sketched below:
former = 0
stamp_pairs = []
cond_1_df = cond_1_df.reset_index()  # make indexes pair with row numbers; keep df intact for later
for index, row in cond_1_df.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1,stamp2], [stamp2, stamp3]...]
For each of these pairs, you can then filter the original df down to the subset of rows where stamp_x < timestamp < stamp_x+1:
time_cond_df = df.loc[(df['timestamp'] > stamp_x) & (df['timestamp'] < stamp_x+1)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
Just make sure that your timestamps are comparable with the ">" and "<" operators! We can't tell, since you did not say how you produced the timestamps.
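A more idiomatic sketch of the same idea (assuming the condition column holds 0/1, and assuming the sensor column is named 'sensor', which is my guess): every 1 starts a new cycle, so a cumulative sum of the condition labels each cycle, and groupby can aggregate per cycle with no explicit loop.
cycle_id = df['condition'].cumsum()  # increments at every 1, labeling each cycle
stats = df.groupby(cycle_id)['sensor'].agg(['mean', 'std'])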

How to get values using pd.shift

I am trying to populate values in the column motooutstandingbalance by subtracting the previous row's actualmotordeductionfortheweek from the previous row's motooutstandingbalance. I am using the pandas shift command, but I am currently not getting the desired output, which should be a consistent week-by-week reduction in motooutstandingbalance.
The final result should look like this:
Here is my code
x['motooutstandingbalance'] = np.where(
    x.salesrepid == x.shift(1).salesrepid,
    x.shift(1).motooutstandingbalance - x.shift(1).actualmotordeductionfortheweek,
    x.motooutstandingbalance
)
Any ideas on how to achieve this?
This works:
start_value = 468300.0
# note: Series.append was removed in pandas 2.0, so pd.concat is used here
rev = -df['actualmotordeductionfortheweek'][::-1]
df['motooutstandingbalance'] = pd.concat([rev, pd.Series([start_value], index=[-1])])[::-1].cumsum().reset_index(drop=True)
Basically what I'm doing is:
Taking the actualmotordeductionfortheweek column, negating it (all the values become negative), and reversing it
Adding the start value (which is positive, as opposed to all the other values, which are negative) at index -1 (which here means before 0, not at the very end as is usual in Python)
Reversing it back, so that the new -1 entry goes to the very beginning
Using cumsum() to add up all the values of the column. This actually works to subtract all the values from the start value, because the first value is positive and the rest are negative (x + (-y) = x - y)
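An equivalent per-rep variant (a sketch on my part, assuming the frame is sorted by week within each salesrepid, which your np.where attempt suggests matters): subtract each rep's cumulative prior deductions from that rep's first balance.
first_bal = x.groupby('salesrepid')['motooutstandingbalance'].transform('first')
cum_deduct = x.groupby('salesrepid')['actualmotordeductionfortheweek'].cumsum()
prior_deduct = cum_deduct - x['actualmotordeductionfortheweek']  # exclude the current week
x['motooutstandingbalance'] = first_bal - prior_deduct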

How to create running total and restart it every time NaN appears?

I want to start a new running total every time it runs into NaN.
For example, from the attached picture it would sum the first 3 values [1242536, 379759, 1622295] and show the running total 3244590.0, then it would start a new running total from the 5th value to the 9th, show the sum for those values, and so on. I want to place these running totals in a new column, beside the NaN values.
I have tried to approach this issue the following way:
for i in df['Budget_Expenditure_2012_']:
    if np.isnan(i) == True:
        x = pd.Index(df['Budget_Expenditure_2012_']).get_loc(i)
        print(x)

for item in range(0, len(x) - 1, 2):
    second_list.append([x[item], x[item + 1]])
print(second_list)
The idea would be to find the sum of the values between each pair of rows, where each pair is the start position and last position of a range that needs to be summed.
At this point I got lost as to how I would execute this sum operation.
Use a combination of shift, isna and cumsum to groupby, then transform, and finally assign the resulting values where the column is NaN:
df.loc[df['Budget_Expenditure_2012_'].isna(), 'new_column'] = (
    df.groupby(
        df.Budget_Expenditure_2012_.shift()
          .isna()
          .cumsum()
    )['Budget_Expenditure_2012_'].transform('sum')
)
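A tiny worked example (made-up numbers, just to show how the grouping key behaves):
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 3, 4, np.nan])
print(s.shift().isna().cumsum().tolist())  # [1, 1, 1, 2, 2, 2]
# rows up to and including each NaN share a group, so each group's sum
# is the running total that ends at that NaN (sum skips the NaN itself)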
With this code you can get the running totals up to each NaN in a new column called 'Totals':
total = 0
df['Totals'] = 0  # assign 0 initially to all rows of the new column
for i in range(df.shape[0]):  # shape[0] returns the number of rows
    expenditure = df.loc[i + 1, 'Budget_Expenditure_2012_']  # i+1 because your indexing starts at 1
    if np.isnan(expenditure):
        df.loc[i + 1, 'Totals'] = total  # place the running total beside the NaN
        total = 0
    else:
        total += expenditure
