I want to start a new running total every time the column runs into a NaN.
For example, from the attached picture it would sum the first 3 values [1242536, 379759, 1622295] and show the running total 3244590.0, then it would start a new running total from the 5th value through the 9th, show the sum of those values, and so on. I want to place these running totals in a new column beside the NaN values.
I have tried to approach this issue the following way:
second_list = []
for i in df['Budget_Expenditure_2012_']:
    if np.isnan(i):
        x = pd.Index(df['Budget_Expenditure_2012_']).get_loc(i)
        print(x)

for item in range(0, len(x) - 1, 2):
    second_list.append([x[item], x[item + 1]])
print(second_list)
The idea would be to find the sum of the values between each pair of rows; each pair would be the start position and end position of a range that needs to be summed.
At this point I got lost as to how I would execute this sum operation.
Use a combination of shift, isna and cumsum to build group labels, then groupby and transform, and finally assign the resulting values where the column is NaN:
df.loc[df['Budget_Expenditure_2012_'].isna(), 'new_column'] = (
    df.groupby(
        df.Budget_Expenditure_2012_.shift()
        .isna()
        .cumsum()
    )['Budget_Expenditure_2012_'].transform('sum')
)
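For illustration, here is how this behaves on a small made-up frame (only the first three values are taken from the question; the rest are placeholders):

import numpy as np
import pandas as pd

# Toy data: two groups of values, each terminated by a NaN
df = pd.DataFrame({'Budget_Expenditure_2012_':
                   [1242536, 379759, 1622295, np.nan, 10, 20, 30, 40, np.nan]})

df.loc[df['Budget_Expenditure_2012_'].isna(), 'new_column'] = (
    df.groupby(df['Budget_Expenditure_2012_'].shift().isna().cumsum())
      ['Budget_Expenditure_2012_'].transform('sum')
)
print(df)
# new_column shows 3244590.0 next to the first NaN and 100.0 next to the second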
With this code you can get the 'running totals' up to each NaN in a new column called 'Totals'.
total = 0
df['Totals'] = 0  # initially assign 0 to every row of the new column
for i in range(df.shape[0]):  # shape[0] returns the number of rows
    expenditure = df.loc[i + 1, 'Budget_Expenditure_2012_']  # i + 1 because your index starts at 1
    if np.isnan(expenditure):
        df.loc[i + 1, 'Totals'] = total  # write the total on the NaN row itself
        total = 0
    else:
        total += expenditure
In Python I am trying to check, day by day (column by column) and for each ID, that the values, if they are not all equal to zero, are correctly incremented by one, or, if at some point the value goes back to 0, that on the next day it is either still equal to zero or incremented by one.
I have a dataframe with multiple columns; the first column is named "ID". All the other columns contain integers, and their names represent consecutive days, like this:
I need to check, for each ID:
if all the columns (i.e. all of the days) are equal to 0, then the new column named "CHECK" equals 0, meaning there is no error;
if not, then go column after column and check whether the value in the next column is exactly one greater than the value in the previous column (i.e. the day before), e.g. from 14 to 15 and not from 14 to 16; if so, "CHECK" equals 0;
if these conditions aren't satisfied, it means the next column is either equal to the previous column or lower (but not equal to zero); in both cases it is an error and "CHECK" equals 1;
but if the next column's value is lower and equal to 0, then the column after it must either still be 0 or be incremented by 1. Every time the count comes back to zero, the following columns must either stay at zero or increase by 1.
If I explained everything correctly, then in this example the first two IDs are correct and their "CHECK" variable must equal 0, but the next ID should have a "CHECK" value of 1.
I hope this is not confusing. Thanks.
I tried this, but I would like to use the column's index/position rather than its name. The code is not finished.
df['check'] = np.where((df['20110531']<=df['20110530']) & ([df.columns!="ID"] != 0),1,0)
You could write a simple function to go along each row and check for your condition. In the example code below, I first set the index to ID.
df = df.set_index('ID')

def func(r):
    start = r[0]
    for j, i in enumerate(r):
        if j == 0:
            continue
        if i == 0 or i == start + 1:
            start = i
        else:
            return 1
    return 0

df['check'] = df.apply(func, axis=1)
If you want to keep the original index, then don't set it and instead use df['check'] = df.iloc[:, 1:].apply(func, axis=1).
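For example, on a small made-up frame (the day columns and their values here are hypothetical, just to exercise the function):

import pandas as pd

# Hypothetical toy data: three IDs and three day columns
df = pd.DataFrame({
    'ID': [1, 2, 3],
    '20110530': [0, 14, 14],
    '20110531': [0, 15, 14],   # ID 3 repeats 14 instead of incrementing
    '20110601': [0, 16, 15],
})
df = df.set_index('ID')
df['check'] = df.apply(func, axis=1)
print(df)
# IDs 1 and 2 get check == 0, ID 3 gets check == 1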
I have a pandas DataFrame with more than 100 thousand rows. The index represents the time, and two columns represent the sensor data and the condition.
When the condition becomes 1, I want to start calculating a score card (average and standard deviation) until the next 1 comes. This needs to be calculated for the whole dataset.
Here is a picture of the DataFrame for a specific time span:
What I thought of was to iterate through the index and items of the df, and when the condition is met, start calculating the descriptive statistics.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        # start = ?
        # end = ?
        # df_b.loc[start:end]
I am not sure how to determine start and end for this DataFrame. The end will be the start of the next cycle. Additionally, I think this iteration is not optimal because it takes quite a long time. I appreciate any idea or solution for this problem.
Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is the beginning and the second element is the end, sketched below:
former = 0
stamp_pairs = []
df = cond_1_df.reset_index()  # make sure indexes pair with number of rows

for index, row in df.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1,stamp2], [stamp2, stamp3]...]
for each of these pairs, you can again create a df containing only the subset of rows where stamp_x < timestamp < stamp_x+1:
time_cond_df = df.loc[(df['timestamp'] > stamp_x) & (df['timestamp'] < stamp_x+1)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
Just make sure that your timestamps are comparable with the ">" and "<" operators! We can't tell, since you did not explain how you produced the timestamps.
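As a side note, if the condition column only ever contains 0 and 1, its cumulative sum directly labels each cycle, which avoids iterating entirely. A minimal sketch of that idea (the sensor column name 'sensor' is an assumption, since the original column name is not shown):

import pandas as pd

# Each 1 in 'condition' starts a new cycle, so the running sum of that
# column is a ready-made group label for every row.
cycles = df['condition'].cumsum()
scores = df.groupby(cycles)['sensor'].agg(['mean', 'std'])
print(scores)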
I need to perform the following steps on a data-frame:
Assign a starting value to the "balance" attribute of the first row.
Calculate the "balance" values for the subsequent rows based on the value of the previous row, using a formula such as (previous row balance + 1).
I have tried the following steps:
Created the data-frame:
df = pd.DataFrame(pd.date_range(start = '2019-01-01', end = '2019-12-31'),columns = ['dt_id'])
Created attribute called 'balance':
df["balance"] = 0
Tried to conditionally update the data-frame:
df["balance"] = np.where(df.index == 0, 100, df["balance"].shift(1) + 1)
Results:
From what I can observe, the value is being retrieved for the subsequent update before it has been updated in the original data-frame.
The desired output for "balance" attribute :
Row 0 : 100
Row 1: 101
Row 2 : 102
And so on
If I understand correctly, if you add this line of code after yours, you are done:
df["balance"].cumsum()
0 100.0
1 101.0
2 102.0
3 103.0
4 104.0
...
360 460.0
361 461.0
362 462.0
363 463.0
364 464.0
It is a cumulative sum: each element adds to the running total of the previous ones, and since you have the starting value followed by ones, it does exactly what you want.
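Putting the asker's np.where line and the cumulative sum together, a minimal end-to-end sketch looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(pd.date_range(start='2019-01-01', end='2019-12-31'), columns=['dt_id'])
df['balance'] = 0
df['balance'] = np.where(df.index == 0, 100, df['balance'].shift(1) + 1)  # 100, 1, 1, 1, ...
df['balance'] = df['balance'].cumsum()                                    # 100, 101, 102, ...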
The problem you have is that you want to calculate an array whose elements depend on each other. For example, element 2 depends on element 1, element 3 depends on element 2, and so on.
Whether there is a simple solution depends on the formula you use, i.e., whether you can vectorize it. Here is a good explanation of that topic: Is it possible to vectorize recursive calculation of a NumPy array where each element depends on the previous one?
In your case a simple loop should do it:
balance = np.empty(len(df.index))
balance[0] = 100
for i in range(1, len(df.index)):
    balance[i] = balance[i - 1] + 1  # or whatever formula you want to use
Please note that the above is the general solution. Your particular formula can be vectorized, so the array can also be generated using:
balance = 100 + np.arange(0, len(df.index))
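Either way, remember to assign the computed array back to the frame (using the column name from the question):

df['balance'] = balance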
Data:
day  cost  new_column
1    1     0
2    1     0
3    5     enter position
4    3     stay in position
5    10    stay in position
6    1     exit position
7    1     0
Hi, I'm wondering if there is a way to reference a previous row in a calculated column without looping and using loc/iloc. In the example above, I want to calculate 'new_column'. Once the cost hits 5, I want to enter the position. Once I am in a position, I want to check on the next line whether I am already in a position and whether the price is not 1; if so, I stay in the position. On the first row where the cost is 1 and the previous "new_column" is 'stay in position', I want to write 'exit position'. Then on the next row with 1, 'new_column' should go back to zero.
How I am solving this now is by looping over each row, looking at the cost on row i and the status of new_column on row i-1, then inserting the result in new_column on row i.
This takes a while on large data sets and I would like to find a more efficient way to do things. I looked into list(map()), but I don't see a way to reference a previous row, because I don't think that data will have been created yet to reference. Any ideas?
Thank you
Hey, as smj suggested, one option is using shift.
import pandas as pd
import numpy as np

day = list(range(1, 8))
cost = [1, 1, 5, 3, 10, 1, 1]
df = pd.DataFrame({'day': day, 'cost': cost}, columns=['day', 'cost'])
print(df)

df['new'] = np.where(
    df['cost'] > 1,
    np.where(df['cost'].shift(-1) >= 1, 'stay', 'a'),
    np.where(df['cost'].shift() > 1, 'exit', 0)
)
print(df)
let's say I have the following dataframe:
   Shots  Goals  StG
0      1      2  0.5
1      3      1  0.33
2      4      4  1
Now I want to multiply the variable Shots by a random value (multiplier in the code) and recalculate the StG variable, which is nothing but Shots/Goals. The code I used is:
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5 + 1)
    row['Shots'] *= multiplier
    row['StG'] = float(row['Shots']) / float(row['Goals'])
Then I saved the .csv and it was identical to the original one, so after the for loop I simply used print(df) and obtained:
   Shots  Goals  StG
0      1      2  0.5
1      3      1  0.33
2      4      4  1
If I print the values row by row during the for iteration, I can see them change, but it's as if they are not saved back to the df.
I think it is because I'm simply accessing the values, not the actual dataframe.
I thought I should use something like df.row[], but it says DataFrame has no row property.
Thanks for the help.
____EDIT____
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5 + 1)
    row['Impresions'] *= multiplier
    row['Clicks'] *= np.random.randint(1, multiplier + 1)
    row['Ctr'] = float(row['Clicks']) / float(row['Impresions'])
    row['Mult'] = multiplier
    # print(row['Clicks'], row['Impresions'], row['Ctr'], row['Mult'])
The main condition is that the number of Clicks can't ever be higher than the number of Impressions.
Then I recalculate the Clicks/Impressions ratio in Ctr.
I am not sure whether multiplying the entire column is the best choice to maintain the condition that, for each row, Impressions >= Clicks, hence I went row by row.
From the pandas docs about iterrows(): pandas.DataFrame.iterrows
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
The good news is you don't need to iterate over rows - you can perform the operations on columns:
# Generate an array of random integers of same length as your DataFrame
multipliers = np.random.randint(1, 5+1, size=len(df))
# Multiply corresponding elements from df['Shots'] and multipliers
df['Shots'] *= multipliers
# Recalculate df['StG']
df['StG'] = df['Shots']/df['Goals']
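For what it's worth, the same column-wise idea can be applied to the edited loop as well; this is only a sketch using the column names from your edit ('Impresions', 'Clicks', 'Ctr', 'Mult'), not tested against your data:

import numpy as np

# One random multiplier per row for Impresions, and a per-row random factor
# that is at most that multiplier for Clicks, so Clicks stays <= Impresions
mult = np.random.randint(1, 5 + 1, size=len(df))
df['Impresions'] *= mult
df['Clicks'] *= np.random.randint(1, mult + 1)   # high can be an array in NumPy >= 1.11
df['Ctr'] = df['Clicks'] / df['Impresions']
df['Mult'] = mult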
Define a function that returns a series:
def f(x):
    m = np.random.randint(1, 5 + 1)
    return pd.Series([x.Shots * m, x.Shots / x.Goals * m])
Apply the function to the data frame row-wise; it will return another data frame, which can be used to replace columns in the existing data frame or to create new columns:
df[['Shots', 'StG']] = df.apply(f, axis=1)
This approach is very flexible as long as the new column values depend only on other values in the same row.