I am fairly new to Python and I am trying to simulate the following logic in pandas.
I am currently looping through the rows and summing the values in the AMOUNT column over the prior rows, but only back to the last seen 'TRUE' value. This seems inefficient with the actual data (I have a dataframe of about 5 million rows). What would an efficient way of handling such logic in pandas entail?
Logic:
The logic is that if FLAG is TRUE, I want to sum the values in the AMOUNT column over the prior rows, but only back to the last seen 'TRUE' value. Basically, sum the values in 'AMOUNT' between the rows where FLAG is TRUE.
Check with cumsum and transform sum
df['SUM'] = df.groupby(df['FLAG'].cumsum()).AMOUNT.transform('sum').where(df.FLAG)
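A minimal demo of how the grouping key behaves (sample data assumed, not from the original post):

import pandas as pd

df = pd.DataFrame({'FLAG': [True, False, False, True, False, True],
                   'AMOUNT': [10, 20, 30, 40, 50, 60]})

# FLAG.cumsum() gives [1, 1, 1, 2, 2, 3]: each TRUE row starts a new group,
# so every group runs from one TRUE row up to (but not including) the next.
df['SUM'] = df.groupby(df['FLAG'].cumsum()).AMOUNT.transform('sum').where(df.FLAG)
print(df)
# SUM is 60 (10+20+30) on the first TRUE row, 90 (40+50) on the second,
# 60 on the third, and NaN on all non-TRUE rows.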
Maybe try something along the following lines:
import pandas as pd

df = pd.read_csv('name of file.csv')
df['AMOUNT'].sum()
When calculating a new column called "duration_minutes", some of the results are negative because the values were put in the original columns backwards.
time.started_at = pd.to_datetime(time.started_at)
time.ended_at = pd.to_datetime(time.ended_at)
time["duration_minutes"] = (time.ended_at - time.started_at).dt.total_seconds() / 60
time.head()
A quick check for negatives, time[time.duration_minutes < 0], shows many rows with negative values in the "duration_minutes" column because the start and stop times are in the wrong columns.
Is there a way to create and calculate the "duration_minutes" column to deal with this situation?
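One possible approach (a sketch, not from the original thread): since a negative duration just means the two timestamps were swapped, you can either take the absolute value of the difference, or swap the offending values before computing:

# Option 1: take the absolute value of the duration.
time["duration_minutes"] = (time.ended_at - time.started_at).dt.total_seconds().abs() / 60

# Option 2: swap started_at/ended_at on rows where they are reversed, then recompute.
swapped = time.started_at > time.ended_at
time.loc[swapped, ["started_at", "ended_at"]] = time.loc[swapped, ["ended_at", "started_at"]].values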
I have the following pandas df:
It is sorted by 'patient_id', 'StartTime', 'hour_counter'.
I'm looking to perform two conditional operations on the df:
Change the value of the Delta_Value column
Delete the entire row
Where the condition depends on the values of ParameterID or patient_id in the current row and the row before.
I managed to do that using classic programming (i.e. a simple loop in Python), but not using Pandas.
Specifically, I want to change the 'Delta_Value' to 0 or delete the entire row, if the ParameterID in the current row is different from the one at the row before.
I've tried to use .groupby().first(), but that won't work in some cases because the same patient_id can have multiple occurrences of the same ParameterID with a different ParameterID in between those occurrences. For example, record 10 in the df.
And I need the records to be sorted by the StartTime & hour_counter.
Any suggestions?
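A vectorized sketch using shift() to compare each row with the previous one (column names follow the question; resetting the comparison at patient boundaries is my assumption):

# Mark rows whose ParameterID differs from the row before (within the same
# patient_id, so the first row of each patient also counts as a change).
changed = df['ParameterID'].ne(df.groupby('patient_id')['ParameterID'].shift())

# Either zero out Delta_Value on those rows...
df.loc[changed, 'Delta_Value'] = 0

# ...or delete those rows entirely.
df = df[~changed]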
I am working on a dataframe with multiple columns, where one of the columns contains string values in many rows (approx. more than 1000 rows). Kindly check the below table for more details:
In the above image I want to change the string values in the column Group_Number to numbers, by picking the value from the first column (MasterGroup) and incrementing by one (01), so that the values look like below:
I also need to verify that if a string is duplicated, then instead of getting a new number it is replaced with the number already assigned. For example, in the above image ANAYSIM is duplicated, and instead of a new sequence number I want the already-given number repeated for the duplicate string.
I have checked different links, but they focus on values supplied by the user:
Pandas DataFrame: replace all values in a column, based on condition
Change one value based on another value in pandas
Conditional Replace Pandas
Any help with achieving the desired outcome is highly appreciated.
We could do cumcount with groupby:
s = (df.groupby('MasterGroup').cumcount() + 1).mul(10).astype(str)
t = pd.to_datetime(df.Group_Number, errors='coerce')
Then we assign:
df.loc[t.isnull(), 'Group_Number'] = df.MasterGroup.astype(str) + s
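To see why this works: pd.to_datetime with errors='coerce' returns NaT for entries it cannot parse as dates, so t.isnull() flags the string rows that need renumbering, while cumcount builds a running counter within each MasterGroup. A small demo of the counter (sample values assumed, since the original table is an image):

df = pd.DataFrame({'MasterGroup': [100, 100, 200, 100, 200]})
s = (df.groupby('MasterGroup').cumcount() + 1).mul(10).astype(str)
print(s.tolist())  # ['10', '20', '10', '30', '20']

If zero-padded suffixes like '01', '02' are wanted instead, (df.groupby('MasterGroup').cumcount() + 1).astype(str).str.zfill(2) would produce those.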
I have a long list of columns for which I want to calculate the non-zero median, mean & std in one go. I cannot just delete rows with 0 based on one column, because the value for another column in the same row may not be 0.
Below is the code I currently have, which calculates the median, mean, etc. including zeros.
agg_list_oper = {'ABC1': [max, np.std, np.mean, np.median],
                 'ABC2': [max, np.std, np.mean, np.median],
                 'ABC3': [max, np.std, np.mean, np.median],
                 'ABC4': [max, np.std, np.mean, np.median],
                 .....
                 .....
                 .....
                 }
df = df_tmp.groupby(['id']).agg(agg_list_oper).reset_index()
I know I can write long code with loops to process one column at a time.
Is there a way to do this in pandas groupby.agg() or some other functions elegantly?
You can temporarily replace 0's with NaNs. pandas will then ignore the NaNs when calculating the median, mean, and std.
df_tmp.replace(0, np.nan).groupby(['id']).agg(agg_list_oper).reset_index()
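A quick demonstration on made-up data (column names assumed):

import numpy as np
import pandas as pd

df_tmp = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                       'ABC1': [0, 2, 4, 0, 8]})
agg_list_oper = {'ABC1': [max, np.std, np.mean, np.median]}

# Zeros become NaN and are excluded from every statistic:
out = df_tmp.replace(0, np.nan).groupby(['id']).agg(agg_list_oper).reset_index()
print(out)  # id 1: mean 3.0, median 3.0 (zeros ignored); id 2: mean 8.0, median 8.0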
I have a pandas dataframe with multiple rows. Before I can produce plots, I must filter the time column. Typically the value for time will increase at a 1 Hz rate; however, there will be cases when the value for time goes backward. I need to drop any rows that have those "invalid" values for time.
This should work for you:
df = pd.concat([df[:1], df[df.shift(1)['time'] < df['time']]])
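For example, with time values [1, 2, 5, 3, 6] (made-up data), the row with time 3 is dropped because it is not greater than the 5 before it, while df[:1] re-adds the first row, whose shifted comparison is NaN and would otherwise be lost. Note that each row is only compared with the row immediately before it.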
If you have a DF with a time column (or some other numeric representation of time):
DF = pd.DataFrame({'Time': [1, 2, 3, 4, 3, 4, 5, 6]})
With pandas, use the diff method to find the negative rows (i.e. rows where the time value decreased relative to the previous row), which are then filtered out:
DF[DF.Time.diff().fillna(0) >= 0]
Rows where the condition is True are retained. The fillna(0) keeps the first row, whose diff is NaN and would otherwise be dropped.